Commit 3971d2ec authored by Patrick Watrin

documentation reformatting for sphinx

parent 5e6f0e73
@@ -753,86 +753,88 @@ def locate(grammar, text, alphabet, **kwargs):
recognized units within the text are saved in a file called
concord.n. These two files are stored in the directory of the text.

*Arguments:*

- **grammar [str]** -- the fst2 to apply on the text.

- **text [str]** -- the text file, with extension .snt.

- **alphabet [str]** -- the alphabet file of the language of the
  text.

*Keyword arguments:*

- *Generic options:*

  - **start_on_space [bool]** -- this parameter indicates that the
    search will start at any position in the text, even before a
    space. This parameter should only be used to carry out
    morphological searches (default: False).

  - **char_by_char [bool]** -- works in character by character
    tokenization mode. This is useful for languages like Thai
    (default: False).

  - **morpho [list(str)]** -- this optional argument indicates which
    morphological mode dictionaries are to be used, if needed by
    some .fst2 dictionaries. The argument is a list of dictionary
    paths (bin format).

  - **korean [bool]** -- specifies that the dictionary is in Korean
    (default: False).

  - **arabic_rules [str]** -- specifies the Arabic typographic rule
    configuration file path.

  - **sntdir [str]** -- puts produced files in 'sntdir' instead of
    the text directory. Note that 'sntdir' must end with a file
    separator (\ or /).

  - **negation_operator [str]** -- specifies the negation operator
    to be used in Locate patterns. The two legal values are minus
    and tilde (default). Using minus provides backward
    compatibility with previous versions of Unitex.

- *Search limit options:*

  - **number_of_matches [int]** -- stops after the first N matches
    (default: all matches).

- *Maximum iterations per token options:*

  - **stop_token_count [list(int_1, int_2)]** -- emits a warning
    after 'int_1' iterations on a token and stops after 'int_2'
    iterations.

- *Matching mode options:*

  - **match_mode [str]** -- possible values are:

    - UnitexConstants.MATCH_MODE_SHORTEST
    - UnitexConstants.MATCH_MODE_LONGEST (default)
    - UnitexConstants.MATCH_MODE_ALL

- *Output options:*

  - **output_mode [str]** -- possible values are:

    - UnitexConstants.OUTPUT_MODE_IGNORE (default)
    - UnitexConstants.OUTPUT_MODE_MERGE
    - UnitexConstants.OUTPUT_MODE_REPLACE

  - **protect_dic_chars [bool]** -- when 'merge' or 'replace' mode
    is used, this option protects some input characters with a
    backslash. This is useful when Locate is called by 'dico' in
    order to avoid producing bad lines like: 3,14,.PI.NUM
    (default: True).

  - **variable [list(str_1, str_2)]** -- sets an output variable
    named str_1 with content str_2. Note that str_2 must be ASCII.

- *Ambiguous output options:*

  - **ambiguous_outputs [bool]** -- allows the production of several
    matches with the same input but different outputs. If False, in
    case of ambiguous outputs, one will be arbitrarily chosen and
    kept, depending on the internal state of the function
    (default: True).

  - **variable_error [str]** -- possible values are:

    - UnitexConstants.ON_ERROR_EXIT
    - UnitexConstants.ON_ERROR_IGNORE (default)
    - UnitexConstants.ON_ERROR_BACKTRACK

*Return [bool]:*
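The way these keyword arguments combine can be sketched as follows. Only the option names and defaults are taken from the list above; the `locate_options` helper itself is hypothetical, added purely for illustration (the real call would be `locate(grammar, text, alphabet, **opts)`, assuming the bindings are installed).

```python
# Hypothetical helper: assembles keyword arguments for locate().
# Option names and defaults come from the documentation above.
_LOCATE_OPTIONS = {
    "start_on_space", "char_by_char", "morpho", "korean", "arabic_rules",
    "sntdir", "negation_operator", "number_of_matches", "stop_token_count",
    "match_mode", "output_mode", "protect_dic_chars", "variable",
    "ambiguous_outputs", "variable_error",
}

def locate_options(**overrides):
    unknown = set(overrides) - _LOCATE_OPTIONS
    if unknown:
        raise TypeError("unknown locate() options: %s" % sorted(unknown))
    # Documented defaults.
    opts = {
        "start_on_space": False,
        "char_by_char": False,
        "korean": False,
        "protect_dic_chars": True,
        "ambiguous_outputs": True,
    }
    opts.update(overrides)
    return opts

opts = locate_options(number_of_matches=100)
# the real call would then be:
# locate("grammar.fst2", "corpus.snt", "Alphabet.txt", **opts)
```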
@@ -943,32 +945,34 @@ def normalize(text, **kwargs):
delimiter {S}, the stop marker {STOP}, or valid entries in the DELAF
format ({aujourd’hui,.ADV}).

**NOTE:** the function creates a modified version of the text that is
saved in a file with extension .snt.

**WARNING:** if you specify a normalization rule file, its rules
will be applied prior to anything else. So, you have to be very
careful if you manipulate separators in such rules.

*Arguments:*

- **text [str]** -- the text file to normalize.

*Keyword arguments:*

- **no_carriage_return [bool]** -- every separator sequence will be
  turned into a single space (default: False).

- **input_offsets [str]** -- base offset file to be used.

- **output_offsets [str]** -- offset file to be produced.

- **replacement_rules [str]** -- specifies the normalization rule
  file to be used. See section 14.13.6 for details about the format
  of this file. By default, the function only replaces { and } by
  [ and ].

- **no_separator_normalization [bool]** -- only applies replacement
  rules specified with the 'replacement_rules' option
  (default: False).

*Return [bool]:*
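Two of the behaviours documented above can be sketched in plain Python. This is an illustration of the documented effect only, not the implementation of normalize().

```python
import re

def collapse_separators(text):
    # With no_carriage_return=True, every separator sequence
    # (spaces, tabs, newlines) is turned into a single space.
    return re.sub(r"[ \t\r\n]+", " ", text)

def default_replacements(text):
    # Without a replacement_rules file, only { and } are
    # replaced, by [ and ].
    return text.replace("{", "[").replace("}", "]")

print(collapse_separators("one\r\n\ttwo"))   # "one two"
print(default_replacements("{S}"))           # "[S]"
```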
@@ -1019,26 +1023,28 @@ def sort_txt(text, **kwargs):
performed in the order of Unicode characters, removing duplicate
lines.

*Arguments:*

- **text [str]** -- the text file to sort.

*Keyword arguments:*

- **duplicates [bool]** -- keep duplicate lines (default: False).

- **reverse [bool]** -- sort in descending order (default: False).

- **sort_order [str]** -- sorts using the alphabet order defined in
  this file. If this parameter is missing, the sorting is done
  according to the order of Unicode characters.

- **line_info [str]** -- backup the number of lines of the result
  file in this file.

- **thai [bool]** -- option for sorting Thai text (default: False).

- **factorize_inflectional_codes [bool]** -- makes two entries
  X,Y.Z:A and X,Y.Z:B become a single entry X,Y.Z:A:B
  (default: False).

*Return [bool]:*
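The default semantics described above (sort in Unicode code-point order, dropping duplicate lines) can be modeled in a few lines of Python. This is an illustration only, not the real sort_txt().

```python
def sort_lines(lines, duplicates=False, reverse=False):
    # Illustrates the documented defaults of sort_txt():
    # Unicode code-point order, duplicates removed unless kept.
    if not duplicates:
        lines = set(lines)          # drop duplicate lines (the default)
    return sorted(lines, reverse=reverse)

print(sort_lines(["b", "a", "b"]))                    # ['a', 'b']
print(sort_lines(["b", "a", "b"], duplicates=True))   # ['a', 'b', 'b']
```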
@@ -1092,50 +1098,50 @@ def tokenize(text, alphabet, **kwargs):
in a binary file named text.cod. The function also produces the
following four files:

- tok_by_freq.txt: text file containing the units sorted by
  frequency.

- tok_by_alph.txt: text file containing the units sorted
  alphabetically.

- stats.n: text file containing information on the number of
  sentence separators, the number of units, the number of simple
  words and the number of numbers.

- enter.pos: binary file containing the list of newline positions in
  the text. The coded representation of the text does not contain
  newlines, but spaces. Since a newline counts as two characters and
  a space as a single one, it is necessary to know where newlines
  occur in the text when the positions of occurrences located by the
  'locate' function are to be synchronized with the text file. File
  enter.pos is used for this by the 'concord' function. Thanks to
  this, when clicking on an occurrence in a concordance, it is
  correctly selected in the text.

All produced files are saved in the text directory.

*Arguments:*

- **text [str]** -- the text file to tokenize (.snt format).

- **alphabet [str]** -- the alphabet file.

*Keyword arguments:*

- *Generic options:*

  - **char_by_char [bool]** -- indicates whether the function is
    applied character by character, with the exceptions of the
    sentence delimiter {S}, the stop marker {STOP} and lexical
    tags like {today,.ADV} which are considered to be single units
    (default: False).

  - **tokens [str]** -- specifies a tokens.txt file to load and
    modify, instead of creating a new one from scratch.

- *Offsets options:*

  - **input_offsets [str]** -- base offset file to be used.

  - **output_offsets [str]** -- offset file to be produced.

*Return [bool]:*
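The bookkeeping behind enter.pos can be illustrated in plain Python. This is a hypothetical sketch of the idea (recording where newlines occur so positions in the space-normalized coded text can be synchronized with the file on disk), not the binary format itself.

```python
def newline_positions(text):
    # Analogous in spirit to enter.pos: the positions at which
    # newlines occur in the text. The coded text replaces each
    # newline (two characters on disk) with a single space, so
    # raw positions drift without this information.
    return [i for i, ch in enumerate(text) if ch == "\n"]

print(newline_positions("one\ntwo\nthree"))   # [3, 7]
```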
@@ -1191,24 +1197,26 @@ def txt2tfst(text, alphabet, **kwargs):
The result is a file called text.tfst which is saved in the
directory of the text. Another file named text.tind is also produced.

*Arguments:*

- **text [str]** -- the path to the text file in .snt format.

- **alphabet [str]** -- the alphabet file.

*Keyword arguments:*

- **clean [bool]** -- indicates whether the rule of conservation of
  the best paths (see section 7.2.4) should be applied
  (default: False).

- **normalization_grammar [str]** -- name of a normalization grammar
  that is to be applied to the text automaton.

- **tagset [str]** -- Elag tagset file to use to normalize
  dictionary entries.

- **korean [bool]** -- tells the function that it works on Korean
  (default: False).

*Return [bool]:*
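As with locate(), the keyword arguments above can be assembled and validated before the call. The `txt2tfst_options` helper is hypothetical, shown only to illustrate the documented option set; the real call would be `txt2tfst(text, alphabet, **opts)`.

```python
# Hypothetical helper: option names and defaults come from the
# documentation above; the real call would be
# txt2tfst("corpus.snt", "Alphabet.txt", **opts).

def txt2tfst_options(**overrides):
    allowed = {"clean", "normalization_grammar", "tagset", "korean"}
    unknown = set(overrides) - allowed
    if unknown:
        raise TypeError("unknown txt2tfst() options: %s" % sorted(unknown))
    opts = {"clean": False, "korean": False}   # documented defaults
    opts.update(overrides)
    return opts

opts = txt2tfst_options(clean=True)
```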