Patrick Watrin / python-unitex / Commits / 3971d2ec

Commit 3971d2ec, authored 9 years ago by Patrick Watrin

    documentation reformatting for sphinx

Parent: 5e6f0e73

Showing 1 changed file: unitex/tools.py (+176 additions, −168 deletions)
@@ -753,86 +753,88 @@ def locate(grammar, text, alphabet, **kwargs):

    recognized units within the text are saved in a file called
    concord.n. These two files are stored in the directory of the text.

    *Arguments:*

    - **grammar [str]** -- the fst2 to apply on the text.
    - **text [str]** -- the text file, with extension .snt.
    - **alphabet [str]** -- the alphabet file of the language of the
      text.

    *Keyword arguments:*

    - *Generic options:*

      - **start_on_space [bool]** -- this parameter indicates that the
        search will start at any position in the text, even before a
        space. This parameter should only be used to carry out
        morphological searches (default: False).
      - **char_by_char [bool]** -- works in character by character
        tokenization mode. This is useful for languages like Thai
        (default: False).
      - **morpho [list(str)]** -- this optional argument indicates which
        morphological mode dictionaries are to be used, if needed by
        some .fst2 dictionaries. The argument is a list of dictionary
        path (bin format).
      - **korean [bool]** -- specify the dictionary is in korean
        (default: False).
      - **arabic_rules [str]** -- specifies the Arabic typographic rule
        configuration file path.
      - **sntdir [str]** -- puts produced files in 'sntdir' instead of
        the text directory. Note that 'sntdir' must end with a file
        separator (\ or /).
      - **negation_operator [str]** -- specifies the negation operator
        to be used in Locate patterns. The two legal values for X are
        minus and tilde (default). Using minus provides backward
        compatibility with previous versions of Unitex.

    - *Search limit options:*

      - **number_of_matches [int]** -- stops after the first N matches
        (default: all matches).

    - *Maximum iterations per token options:*

      - **stop_token_count [list(int_1, int_2)]** -- emits a warning
        after 'int_1' iterations on a token and stops after 'int_2'
        iterations.

    - *Matching mode options:*

      - **match_mode [str]** -- Possible values are:

        - UnitexConstants.MATCH_MODE_SHORTEST
        - UnitexConstants.MATCH_MODE_LONGEST (default)
        - UnitexConstants.MATCH_MODE_ALL

    - *Output options:*

      - **output_mode [str]** -- Possible values are:

        - UnitexConstants.OUTPUT_MODE_IGNORE (default)
        - UnitexConstants.OUTPUT_MODE_MERGE
        - UnitexConstants.OUTPUT_MODE_REPLACE

      - **protect_dic_chars [bool]** -- when 'merge' or 'replace' mode
        is used, this option protects some input characters with a
        backslash. This is useful when Locate is called by 'dico' in
        order to avoid producing bad lines like: 3,14,.PI.NUM
        (default: True).
      - **variable [list(str_1, str_2)]** -- sets an output variable
        named str_1 with content str_2. Note that str_2 must be ASCII.

    - *Ambiguous output options:*

      - **ambiguous_outputs [bool]** -- allows the production of several
        matches with same input but different outputs. If False, in case
        of ambiguous outputs, one will be arbitrarily chosen and kept,
        depending on the internal state of the function (default: True).
      - **variable_error [str]** -- Possible values are:

        - UnitexConstants.ON_ERROR_EXIT
        - UnitexConstants.ON_ERROR_IGNORE (default)
        - UnitexConstants.ON_ERROR_BACKTRACK

    *Return [bool]:*
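Based on the signature in the hunk header and the options documented in this docstring, a call might look like the following. This is a hedged sketch, not a tested recipe: all file paths are hypothetical, and it assumes python-unitex with its native Unitex bindings is installed and the .fst2 and .snt resources already exist.

```python
# Hedged usage sketch for unitex.tools.locate as documented above.
# All paths are hypothetical placeholders.
from unitex import UnitexConstants
from unitex.tools import locate

ok = locate(
    "grammar.fst2",    # grammar: compiled .fst2 to apply on the text
    "corpus.snt",      # text: normalized text file (.snt extension)
    "Alphabet.txt",    # alphabet: alphabet file of the text's language
    match_mode=UnitexConstants.MATCH_MODE_LONGEST,
    output_mode=UnitexConstants.OUTPUT_MODE_IGNORE,
)
# 'ok' is the boolean success flag documented under *Return [bool]:*
```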
@@ -943,32 +945,34 @@ def normalize(text, **kwargs):

    delimiter {S}, the stop marker {STOP}, or valid entries in the DELAF
    format ({aujourd’hui,.ADV}).

    **NOTE:** the function creates a modified version of the text that is
    saved in a file with extension .snt.

    **WARNING:** if you specify a normalization rule file, its rules
    will be applied prior to anything else. So, you have to be very
    careful if you manipulate separators in such rules.

    *Arguments:*

    - **text [str]** -- the text file to normalize.

    *Keyword arguments:*

    - **no_carriage_return [bool]** -- every separator sequence will be
      turned into a single space (default: False).
    - **input_offsets [str]** -- base offset file to be used.
    - **output_offsets [str]** -- offset file to be produced.
    - **replacement_rules [str]** -- specifies the normalization rule
      file to be used. See section 14.13.6 for details about the format
      of this file. By default, the function only replaces { and } by
      [ and ].
    - **no_separator_normalization [bool]** -- only applies replacement
      rules specified with the 'replacement_rules' option
      (default: False).

    *Return [bool]:*
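The two defaults called out in this docstring (brace replacement, and `no_carriage_return` collapsing every separator sequence into a single space) can be sketched in plain Python. This is a simplified illustration, not the Unitex implementation: `normalize_sketch` is a hypothetical helper, and the special handling of valid markers such as {S} or {STOP} is deliberately omitted.

```python
import re

def normalize_sketch(text, no_carriage_return=False):
    """Simplified illustration of two normalize() defaults documented
    above (hypothetical helper, not part of unitex.tools)."""
    # Default rule: replace { and } by [ and ].  The real function
    # preserves markers like {S} and {STOP}; that is skipped here.
    text = text.replace("{", "[").replace("}", "]")
    if no_carriage_return:
        # Every separator sequence becomes a single space.
        text = re.sub(r"[ \t\r\n]+", " ", text)
    return text
```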
@@ -1019,26 +1023,28 @@ def sort_txt(text, **kwargs):

    performed in the order of Unicode characters, removing duplicate
    lines.

    *Arguments:*

    - **text [str]** -- the text file to sort.

    *Keyword arguments:*

    - **duplicates [bool]** -- keep duplicate lines (default: False).
    - **reverse [bool]** -- sort in descending order (default: False).
    - **sort_order [str]** -- sorts using the alphabet order defined in
      this file. If this parameter is missing, the sorting is done
      according to the order of Unicode characters.
    - **line_info [str]** -- backup the number of lines of the result
      file in this file.
    - **thai [bool]** -- option for sorting Thai text (default: False).
    - **factorize_inflectional_codes [bool]** -- makes two entries
      X,Y.Z:A and X,Y.Z:B become a single entry X,Y.Z:A:B
      (default: False).

    *Return [bool]:*
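The `factorize_inflectional_codes` behaviour described above (entries differing only in their inflectional codes merge into one entry) can be illustrated with a short self-contained sketch. `factorize_sketch` is a hypothetical helper for illustration, not the Unitex code.

```python
def factorize_sketch(entries):
    """Merge DELAF-style entries that share the same lemma part, as
    described for 'factorize_inflectional_codes' above: X,Y.Z:A and
    X,Y.Z:B become X,Y.Z:A:B (hypothetical helper)."""
    merged = {}
    order = []
    for entry in entries:
        # An entry looks like 'X,Y.Z:A'; the inflectional codes follow
        # the first ':'.
        head, _, codes = entry.partition(":")
        if head not in merged:
            merged[head] = []
            order.append(head)
        if codes:
            merged[head].extend(codes.split(":"))
    return [head + "".join(":" + c for c in merged[head]) for head in order]
```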
@@ -1092,50 +1098,50 @@ def tokenize(text, alphabet, **kwargs):

    in a binary file named text.cod. The function also produces the
    following four files:

    - tok_by_freq.txt: text file containing the units sorted by
      frequency.
    - tok_by_alph.txt: text file containing the units sorted
      alphabetically.
    - stats.n: text file containing information on the number of
      sentence separators, the number of units, the number of simple
      words and the number of numbers.
    - enter.pos: binary file containing the list of newline positions in
      the text. The coded representation of the text does not contain
      newlines, but spaces. Since a newline counts as two characters and
      a space as a single one, it is necessary to know where newlines
      occur in the text when the positions of occurrences located by the
      'locate' function are to be synchronized with the text file. File
      enter.pos is used for this by the 'concord' function. Thanks to
      this, when clicking on an occurrence in a concordance, it is
      correctly selected in the text. File enter.pos is a binary file
      containing the list of the positions of newlines in the text.

    All produced files are saved in the text directory

    *Arguments:*

    - **text [str]** -- the text file to tokenize (.snt format).
    - **alphabet [str]** -- the alphabet file.

    *Keyword arguments:*

    - *Generic options:*

      - **char_by_char [bool]** -- indicates whether the function is
        applied character by character, with the exceptions of the
        sentence delimiter {S}, the stop marker {STOP} and lexical
        tags like {today,.ADV} which are considered to be single units
        (default: False).
      - **tokens [str]** -- specifies a tokens.txt file to load and
        modify, instead of creating a new one from scratch.

    - *Offsets options:*

      - **input_offsets [str]** -- base offset file to be used.
      - **output_offsets [str]** -- offset file to be produced.

    *Return [bool]:*
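The enter.pos mechanism described above can be sketched briefly: record where newlines occur, then shift an offset in the coded text by one for each preceding newline, under the assumption stated in the docstring that a newline counts as two characters in the file while the coded representation stores a single space. Both helpers below are hypothetical illustrations, not the Unitex implementation.

```python
def newline_positions(text):
    # What enter.pos records, per the description above: the list of
    # newline positions in the text.
    return [i for i, ch in enumerate(text) if ch == "\n"]

def sync_offset(coded_offset, positions):
    # Map an offset in the coded text (each newline stored as one space)
    # back to the file, where each newline counts as two characters:
    # shift by one per preceding newline.
    return coded_offset + sum(1 for p in positions if p <= coded_offset)
```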
@@ -1191,24 +1197,26 @@ def txt2tfst(text, alphabet, **kwargs):

    The result is a file called text.tfst which is saved in the
    directory of the text. Another file named text.tind is also produced.

    *Arguments:*

    - **text [str]** -- the path to the text file in .snt format.
    - **alphabet [str]** -- the alphabet file.

    *Keyword arguments:*

    - **clean [bool]** -- indicates whether the rule of conservation of
      the best paths (see section 7.2.4) should be applied
      (default: False).
    - **normalization_grammar [str]** -- name of a normalization grammar
      that is to be applied to the text automaton.
    - **tagset [str]** -- Elag tagset file to use to normalize
      dictionary entries.
    - **korean [bool]** -- tells the function that it works on Korean
      (default: False).

    *Return [bool]:*