python-unitex · commit 6acdc38d
Authored 9 years ago by Patrick Watrin

documentation reformatting

Parent: 159d839c
Showing 4 changed files with 74 additions and 41 deletions:

  unitex/__init__.py    +8   −4
  unitex/io.py          +16  −8
  unitex/resources.py   +18  −9
  unitex/tools.py       +32  −20
unitex/__init__.py  +8 −4

@@ -110,7 +110,8 @@ _LOGGER = logging.getLogger(__name__)
 def enable_stdout():
-    """This function enables Unitex standard output. This is the default but
+    """
+    This function enables Unitex standard output. This is the default but
     should be used for debug purposes only.
     Return [bool]:
@@ -124,7 +125,8 @@ def enable_stdout():
     return ret
 def disable_stdout():
-    """This function disables Unitex standard output to ensure multithread
+    """
+    This function disables Unitex standard output to ensure multithread
     output consistency (i.e. avoid output mixing between threads) and to
     improve performances.
@@ -139,7 +141,8 @@ def disable_stdout():
     return ret
 def enable_stderr():
-    """This function enables Unitex error output. This is the default but
+    """
+    This function enables Unitex error output. This is the default but
     should be used for debug purposes only.
     Return [bool]:
@@ -153,7 +156,8 @@ def enable_stderr():
     return ret
 def disable_stderr():
-    """This function disables Unitex error output to ensure multithread
+    """
+    This function disables Unitex error output to ensure multithread
     output consistency (i.e. avoid output mixing between threads) and to
     improve performances.
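These four toggles control the console output of the underlying Unitex C library. A minimal usage sketch, assuming the functions are re-exported at package level as this module suggests; run_worker is a hypothetical placeholder for the caller's own processing code.

from unitex import disable_stdout, disable_stderr, enable_stdout, enable_stderr

def run_worker():
    pass  # hypothetical placeholder for the caller's own Unitex calls

# Silence Unitex console output for multithreaded runs, then restore the
# default behaviour, which is meant for debugging only.
disable_stdout()  # avoid interleaved output between threads
disable_stderr()  # and skip the cost of console writes

try:
    run_worker()
finally:
    enable_stdout()
    enable_stderr()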
unitex/io.py  +16 −8

@@ -21,7 +21,8 @@ _LOGGER = logging.getLogger(__name__)
 def cp(source_path, target_path):
-    """This function copies a file. Both pathes can be on the virtual filesystem
+    """
+    This function copies a file. Both pathes can be on the virtual filesystem
     or the disk filesystem. Therefor, this function can be used to virtualize a
     file or to dump a virtual file.
@@ -40,7 +41,8 @@ def cp(source_path, target_path):
     return ret
 def rm(path):
-    """This function removes a file. The path can be on the virtual filesystem
+    """
+    This function removes a file. The path can be on the virtual filesystem
     or the disk filesystem.
     Argument:
@@ -57,7 +59,8 @@ def rm(path):
     return ret
 def mv(old_path, new_path):
-    """This function moves/renames a file. Both pathes can be on the virtual
+    """
+    This function moves/renames a file. Both pathes can be on the virtual
     filesystem or the disk filesystem.
     Arguments:
@@ -75,7 +78,8 @@ def mv(old_path, new_path):
     return ret
 def mkdir(path):
-    """This function creates a directory on the disk.
+    """
+    This function creates a directory on the disk.
     Argument:
         path [str] -- directory path
@@ -91,7 +95,8 @@ def mkdir(path):
     return ret
 def rmdir(path):
-    """This function removes a directory on the disk.
+    """
+    This function removes a directory on the disk.
     Argument:
         path [str] -- directory path
@@ -107,7 +112,8 @@ def rmdir(path):
     return ret
 def ls(path):
-    """This function lists (disk or virtual) directory contents.
+    """
+    This function lists (disk or virtual) directory contents.
     Argument:
         path [str] -- directory path
@@ -120,7 +126,8 @@ def ls(path):
     return unitex_ls(path)
 def exists(path):
-    """This function verify if a file exists (on disk or virtual filesystem).
+    """
+    This function verify if a file exists (on disk or virtual filesystem).
     Argument:
         path [str] -- directory path
@@ -135,7 +142,8 @@ def exists(path):
 class UnitexFile(object):
-    """The UnitexFile class provides the minimum functionality necessary to
+    """
+    The UnitexFile class provides the minimum functionality necessary to
     manipulate files on the disk and the virtual filesystems. It's mainly
     useful to read files from virtual filesystem whithout having to copy them
     to the disk.
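Because cp, rm, ls and exists accept both disk and virtual paths, virtualizing a file is a single call. A small sketch follows; the "$:" prefix for virtual paths is an assumption based on the usual Unitex virtual-filesystem convention, and the file names are illustrative.

from unitex.io import cp, exists, ls, rm

# Copy a disk file into the virtual filesystem ("virtualize" it).
cp("corpus/my_text.txt", "$:my_text.txt")
assert exists("$:my_text.txt")

# Inspect the virtual directory, then dump the file back to disk and clean up.
print(ls("$:"))
cp("$:my_text.txt", "corpus/my_text_copy.txt")
rm("$:my_text.txt")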
unitex/resources.py  +18 −9

@@ -20,7 +20,8 @@ _LOGGER = logging.getLogger(__name__)
 def load_persistent_dictionary(path):
-    """This function loads a dictionary in persistent space.
+    """
+    This function loads a dictionary in persistent space.
     Argument:
         path [str] -- the exisent file path in filespace (hard disk or virtual file system)
@@ -34,7 +35,8 @@ def load_persistent_dictionary(path):
     return unitex_load_persistent_dictionary(path)
 def is_persistent_dictionary(path):
-    """This function checks if a dictionary path points to the persistent space.
+    """
+    This function checks if a dictionary path points to the persistent space.
     Argument:
         path [str] -- the file path to check
@@ -45,7 +47,8 @@ def is_persistent_dictionary(path):
     return unitex_is_persistent_dictionary(path)
 def free_persistent_dictionary(path):
-    """This function unloads a dictionary from persistent space.
+    """
+    This function unloads a dictionary from persistent space.
     Argument:
         path [str] -- the persistent file path returned by the 'load_persistent_dictionary'
@@ -57,7 +60,8 @@ def free_persistent_dictionary(path):
 def load_persistent_fst2(path):
-    """This function loads a fst2 in persistent space.
+    """
+    This function loads a fst2 in persistent space.
     Argument:
         path [str] -- the exisent file path in filespace (hard disk or virtual file system)
@@ -71,7 +75,8 @@ def load_persistent_fst2(path):
     return unitex_load_persistent_fst2(path)
 def is_persistent_fst2(path):
-    """This function checks if a fst2 path points to the persistent space.
+    """
+    This function checks if a fst2 path points to the persistent space.
     Argument:
         path [str] -- the file path to check
@@ -82,7 +87,8 @@ def is_persistent_fst2(path):
     return unitex_is_persistent_fst2(path)
 def free_persistent_fst2(path):
-    """This function unloads a fst2 from persistent space.
+    """
+    This function unloads a fst2 from persistent space.
     Argument:
         path [str] -- the persistent file path returned by the 'load_persistent_fst2'
@@ -94,7 +100,8 @@ def free_persistent_fst2(path):
 def load_persistent_alphabet(path):
-    """This function loads a alphabet in persistent space.
+    """
+    This function loads a alphabet in persistent space.
     Argument:
         path [str] -- the exisent file path in filespace (hard disk or virtual file system)
@@ -108,7 +115,8 @@ def load_persistent_alphabet(path):
     return unitex_load_persistent_alphabet(path)
 def is_persistent_alphabet(path):
-    """This function checks if a alphabet path points to the persistent space.
+    """
+    This function checks if a alphabet path points to the persistent space.
     Argument:
         path [str] -- the file path to check
@@ -119,7 +127,8 @@ def is_persistent_alphabet(path):
     return unitex_is_persistent_alphabet(path)
 def free_persistent_alphabet(path):
-    """This function unloads a alphabet from persistent space.
+    """
+    This function unloads a alphabet from persistent space.
     Argument:
         path [str] -- the persistent file path returned by the 'load_persistent_alphabet'
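The load / check / free cycle is symmetric for dictionaries, fst2 grammars and alphabets. A sketch for the dictionary case, using only the signatures visible in this diff; the resource path is illustrative.

from unitex.resources import (
    load_persistent_dictionary,
    is_persistent_dictionary,
    free_persistent_dictionary,
)

# Load once, then reuse across calls or threads without paying the load cost again.
persistent_path = load_persistent_dictionary("resources/dela-en.bin")
assert is_persistent_dictionary(persistent_path)

# ... apply the dictionary as many times as needed ...

free_persistent_dictionary(persistent_path)  # unload when the resource is no longer needed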
unitex/tools.py  +32 −20

@@ -25,7 +25,8 @@ _LOGGER = logging.getLogger(__name__)
 def check_dic(dictionary, dtype, alphabet, **kwargs):
-    """This function checks the format of <dela> and produces a file named
+    """
+    This function checks the format of <dela> and produces a file named
     CHECK_DIC.TXT that contains check result informations. This file is
     stored in the <dela> directory.
@@ -76,7 +77,8 @@ def check_dic(dictionary, dtype, alphabet, **kwargs):
 def compress(dictionary, **kwargs):
-    """This function takes a DELAF dictionary as a parameter and compresses it. The
+    """
+    This function takes a DELAF dictionary as a parameter and compresses it. The
     compression of a dictionary dico.dic produces two files:
     - dico.bin: a binary file containing the minimum automaton of the inflected
@@ -138,7 +140,8 @@ def compress(dictionary, **kwargs):
 def concord(index, alphabet, **kwargs):
-    """This function takes a concordance index file produced by the function Locate and
+    """
+    This function takes a concordance index file produced by the function Locate and
     produces a concordance. It is also possible to produce a modified text version taking
     into account the transducer outputs associated to the occurrences.
@@ -319,7 +322,8 @@ def concord(index, alphabet, **kwargs):
 def dico(dictionaries, text, alphabet, **kwargs):
-    """This function applies dictionaries to a text. The text must have been cut up into
+    """
+    This function applies dictionaries to a text. The text must have been cut up into
     lexical units by the 'tokenize' function.
     The function 'dico' produces the following files, and saves them in the directory of
@@ -397,7 +401,8 @@ def dico(dictionaries, text, alphabet, **kwargs):
 def extract(text, output, index, **kwargs):
-    """This program extracts from the given text all sentences that contain at least one
+    """
+    This function extracts from the given text all sentences that contain at least one
     occurrence from the concordance. The parameter <text> represents the complete
     path of the text file, without omitting the extension .snt.
@@ -445,7 +450,8 @@ def extract(text, output, index, **kwargs):
 def fst2txt(grammar, text, alphabet, **kwargs):
-    """This function applies a transducer to a text in longest match mode at the preprocessing
+    """
+    This function applies a transducer to a text in longest match mode at the preprocessing
     stage, when the text has not been cut into lexical units yet. This function modifies the input
     text file.
@@ -513,11 +519,12 @@ def fst2txt(grammar, text, alphabet, **kwargs):
 def grf2fst2(grammar, alphabet, **kwargs):
-    """This program compiles a grammar into a .fst2 file (for more details see section
+    """
+    This function compiles a grammar into a .fst2 file (for more details see section
     6.2). The parameter <grf> denotes the complete path of the main graph of the
     grammar, without omitting the extension .grf.
-    The result is a file with the same name as the graph passed to the program as a
+    The result is a file with the same name as the graph passed to the function as a
     parameter, but with extension .fst2. This file is saved in the same directory as
     <grf>.
@@ -601,7 +608,8 @@ def grf2fst2(grammar, alphabet, **kwargs):
 def locate(grammar, text, alphabet, **kwargs):
-    """This function applies a grammar to a text and constructs an index of the occurrences
+    """
+    This function applies a grammar to a text and constructs an index of the occurrences
     found.
     This function saves the references to the found occurrences in a file called concord.ind.
@@ -765,7 +773,8 @@ def locate(grammar, text, alphabet, **kwargs):
 def normalize(text, **kwargs):
-    """This function carries out a normalization of text separators. The separators are
+    """
+    This function carries out a normalization of text separators. The separators are
     space, tab, and newline. Every sequence of separators that contains at least one
     newline is replaced by a unique newline. All other sequences of separators are replaced
     by a single space.
@@ -791,7 +800,7 @@ def normalize(text, **kwargs):
         output_offsets [str] -- offset file to be produced
         replacement_rules [str] -- specifies the normalization rule file
             to be used. See section 14.13.6 for details about the
-            format of this file. By default, the program only
+            format of this file. By default, the function only
             replaces { and } by [ and ]
         no_separator_normalization [bool] -- only applies replacement rules specified with -r
             (default: False)
@@ -835,7 +844,8 @@ def normalize(text, **kwargs):
 def sort_txt(text, **kwargs):
-    """This function carries out a lexicographical sorting of the lines of file <txt>. <txt>
+    """
+    This function carries out a lexicographical sorting of the lines of file <txt>. <txt>
     represents the complete path of the file to be sorted.
     The input text file is modified. By default, the sorting is performed in the order of
@@ -899,12 +909,13 @@ def sort_txt(text, **kwargs):
 def tokenize(text, alphabet, **kwargs):
-    """This function tokenizes a tet text into lexical units. <txt> the complete path of the
+    """
+    This function tokenizes a tet text into lexical units. <txt> the complete path of the
     text file, without omitting the .snt extension.
-    The program codes each unit as a whole. The list of units is saved in a text file called
+    The function codes each unit as a whole. The list of units is saved in a text file called
     tokens.txt. The sequence of codes representing the units now allows the coding
-    of the text. This sequence is saved in a binary file named text.cod. The program
+    of the text. This sequence is saved in a binary file named text.cod. The function
     also produces the following four files:
     - tok_by_freq.txt: text file containing the units sorted by frequency
     - tok_by_alph.txt: text file containing the units sorted alphabetically
@@ -915,9 +926,9 @@ def tokenize(text, alphabet, **kwargs):
     coded representation of the text does not contain newlines, but spaces.
     Since a newline counts as two characters and a space as a single one,
     it is necessary to know where newlines occur in the text when the
-    positions of occurrences located by the Locate program are to be
+    positions of occurrences located by the 'locate' function are to be
     synchronized with the text file. File enter.pos is used for this by
-    the Concord program. Thanks to this, when clicking on an occurrence in
+    the 'concord' function. Thanks to this, when clicking on an occurrence in
     a concordance, it is correctly selected in the text. File enter.pos is
     a binary file containing the list of the positions of newlines in the
     text.
@@ -930,7 +941,7 @@ def tokenize(text, alphabet, **kwargs):
     Keyword arguments:
     - Generic options:
-        char_by_char [bool] -- indicates whether the program is applied character by
+        char_by_char [bool] -- indicates whether the function is applied character by
             character, with the exceptions of the sentence delimiter
             {S}, the stop marker {STOP} and lexical tags like
            {today,.ADV} which are considered to be single units
@@ -984,10 +995,11 @@ def tokenize(text, alphabet, **kwargs):
 def txt2tfst(text, alphabet, **kwargs):
-    """This function constructs an automaton of a text.
+    """
+    This function constructs an automaton of a text.
     If the text is separated into sentences, the function constructs an automaton for each
-    sentence. If this is not the case, the program arbitrarily cuts the text into sequences
+    sentence. If this is not the case, the function arbitrarily cuts the text into sequences
     of 2000 tokens and produces an automaton for each of these sequences.
     The result is a file called text.tfst which is saved in the directory of the text.
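Chained together, these functions form the usual Unitex corpus pipeline: normalize the raw text, tokenize it, apply dictionaries, locate a pattern and build the concordance. The sketch below keeps all keyword options at their defaults and uses illustrative resource paths; the .snt and concord.ind names follow the conventions mentioned in the docstrings, while the assumption that normalize produces my_text.snt and that concord.ind lands in the text's _snt directory should be checked against your setup.

from unitex.tools import normalize, tokenize, dico, locate, concord

alphabet = "resources/Alphabet.txt"       # illustrative resource paths
dictionaries = ["resources/dela-en.bin"]
grammar = "grammars/pattern.fst2"

normalize("corpus/my_text.txt")           # normalizes separators; expected to produce my_text.snt
snt = "corpus/my_text.snt"

tokenize(snt, alphabet)                   # writes tokens.txt, text.cod, enter.pos, ...
dico(dictionaries, snt, alphabet)         # applies the dictionaries to the tokenized text
locate(grammar, snt, alphabet)            # saves occurrences to concord.ind
concord("corpus/my_text_snt/concord.ind", alphabet)  # builds the concordance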