python-unitex
Commit 6af98b93, authored 9 years ago by Patrick Watrin
adding documentation to the parameters in the config file
parent 2da432cb
Branches: master
Showing 4 changed files with 151 additions and 11 deletions:
config/unitex-example.yaml (+1, -1)
config/unitex-template.yaml (+145, -5)
examples/build-config-file.py (+1, -0)
unitex/tools.py (+4, -5)
config/unitex-example.yaml (+1, -1)
@@ -11,7 +11,7 @@ resources:
         - /home/dev/projects/python-unitex/dependencies/Unitex-GramLab-3.1rc/French/Dela/dela-fr-public.bin
         - /home/dev/projects/python-unitex/dependencies/Unitex-GramLab-3.1rc/French/Dela/ajouts80jours.bin
         - /home/dev/projects/python-unitex/dependencies/Unitex-GramLab-3.1rc/French/Dela/motsGramf-.bin
-    language: null
+    language: fr
     replace: /home/dev/projects/python-unitex/dependencies/Unitex-GramLab-3.1rc/French/Graphs/Preprocessing/Replace/Replace.fst2
     sentence: /home/dev/projects/python-unitex/dependencies/Unitex-GramLab-3.1rc/French/Graphs/Preprocessing/Sentence/Sentence.fst2
 tools:
...
config/unitex-template.yaml (+145, -5)
+# Do not modify this file. Use the 'build-config-file.py' script to generate a
+# working version adapted to your local Unitex installation or copy this file
+# before editing.
+# The 'global' section contains the global configuration parameters.
 global:
+    # There are 3 'debug' levels:
+    #   0: the error output is disabled;
+    #   1: the error output is limited to the logging system implemented in the
+    #      bindings;
+    #   2: the error output is activated for both the bindings and the Unitex
+    #      processor.
+    # NOTE: if you activate the debug for level >= 1, the verbose level is
+    #       automatically activated at level 2.
     debug: 0
+    # There are 4 'verbose' levels:
+    #   0: the standard output is disabled;
+    #   1: the standard output shows 'warnings' emitted by the bindings logging
+    #      system;
+    #   2: the standard output shows 'warnings' and various other information
+    #      emitted by the bindings logging system;
+    #   3: the standard output is activated for both the bindings and the Unitex
+    #      processor.
     verbose: 0
+    # If not 'null', the error and standard outputs are redirected to the file
+    # specified by this parameter.
+    #log: /var/log/unitex.log
     log: null
+    # If you are using the high-level 'Processor' class, this parameter
+    # activates or deactivates resource persistence. If persistence is
+    # activated, dictionaries, grammars and alphabet are loaded during the
+    # object initialization and kept in memory in order to improve
+    # performance.
     persistence: True
+    # The Unitex library implements a virtual filesystem which avoids a lot
+    # of I/O and improves the performance. If this parameter is set to 'True',
+    # the high-level 'Processor' class will automatically activate this virtual
+    # filesystem.
     virtualization: True
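The NOTE on 'debug' above says that any debug level of 1 or more also switches the verbose level to 2. A minimal sketch of that rule follows. It assumes the rule acts as a lower bound (a configured verbose level of 3 is kept), which the comment does not state explicitly, and `effective_verbose` is a hypothetical helper, not a function of python-unitex.

```python
def effective_verbose(debug: int, verbose: int) -> int:
    """Return the verbose level implied by the 'debug' and 'verbose' settings."""
    if debug not in (0, 1, 2):
        raise ValueError("debug must be 0, 1 or 2")
    if verbose not in (0, 1, 2, 3):
        raise ValueError("verbose must be 0, 1, 2 or 3")
    # Assumption: debug >= 1 raises verbose to at least 2 and never lowers it.
    return max(verbose, 2) if debug >= 1 else verbose
```

With the template defaults (`debug: 0`, `verbose: 0`) the helper returns 0, so nothing is printed.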
+# The 'resources' section is automatically filled by the 'build-config-file.py'
+# script. If you want to do it manually, be sure to give the absolute path of
+# each resource as shown below:
+# resources:
+#     language: fr
+#     alphabet: /home/dev/projects/python-unitex/dependencies/Unitex-GramLab-3.1rc/French/Alphabet.txt
+#     alphabet-sorted: /home/dev/projects/python-unitex/dependencies/Unitex-GramLab-3.1rc/French/Alphabet_sort.txt
+#     sentence: /home/dev/projects/python-unitex/dependencies/Unitex-GramLab-3.1rc/French/Graphs/Preprocessing/Sentence/Sentence.fst2
+#     replace: /home/dev/projects/python-unitex/dependencies/Unitex-GramLab-3.1rc/French/Graphs/Preprocessing/Replace/Replace.fst2
+#     dictionaries:
+#         - /home/dev/projects/python-unitex/dependencies/Unitex-GramLab-3.1rc/French/Dela/dela-fr-public.bin
+#         - /home/dev/projects/python-unitex/dependencies/Unitex-GramLab-3.1rc/French/Dela/ajouts80jours.bin
+#         - /home/dev/projects/python-unitex/dependencies/Unitex-GramLab-3.1rc/French/Dela/motsGramf-.bin
 resources:
     language: null
...
@@ -16,35 +65,126 @@ resources:
     dictionaries: null
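The 'resources' comment asks for absolute paths when the section is filled by hand. The check below is one way to catch relative paths before handing the configuration to the bindings. It is only a sketch: `non_absolute_resources` is a hypothetical helper, and it assumes the section has already been loaded into a plain dict with the keys shown in the comment.

```python
import os.path


def non_absolute_resources(resources: dict) -> list:
    """List (key, path) pairs of 'resources' entries that are set but not absolute."""
    problems = []
    for key, value in resources.items():
        # 'language' holds a code such as 'fr', not a path; None means "unset".
        if key == "language" or value is None:
            continue
        # 'dictionaries' is a list of paths, the other entries are single paths.
        paths = value if isinstance(value, list) else [value]
        for path in paths:
            if not os.path.isabs(path):
                problems.append((key, path))
    return problems
```

An empty result means every configured resource is given as an absolute path.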
-# The 'tools' section can contain any of the argument used by the unitex tools
-# functions. Note that, if you use the 'Processor' high-level class some argument
-# could be overriden to fit the 'tag', 'extract' and 'search' functions
-# behaviour. For intance, there is no point to define a font or a context for
-# 'concord'.
+# The 'tools' section can contain any of the arguments used by the unitex tools.
+# Note that, if you use the 'Processor' high-level class, some parameters will
+# be overridden to fit the behaviour of the 'tag' function. For instance, there
+# is no point in defining a font or a context for 'concord'.
 tools:
+    # CheckDic command (Unitex manual, p.266)
     check_dic:
+        # If set to 'True', Unitex will use strict syntax checking against
+        # unprotected dots and commas.
         strict: False
+        # If set to 'True', 'no_space_warning' tells Unitex to tolerate spaces
+        # in grammatical, semantic and inflectional codes.
         no_space_warning: False
+    # Compress command (Unitex manual, p.266)
     compress:
+        # 'output' sets the output file. By default, a file xxx.dic will
+        # produce a file xxx.bin.
         output: null
+        # If set to 'True', 'flip' indicates that the inflected and canonical
+        # forms should be swapped in the compressed dictionary. This option is
+        # used to construct an inverse dictionary.
         flip: False
+        # If set to 'True', 'semitic' indicates that the semitic compression
+        # algorithm should be used. Setting this option with semitic languages
+        # like Arabic significantly reduces the size of the output dictionary.
         semitic: False
+        # 'version: v1' produces an old style .bin file
+        # 'version: v2' produces a new style .bin file, without the 16 Mb file
+        # size limit and with a smaller resulting size
         version: "v2"
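The 'output' comment for 'compress' describes a default naming rule: a file xxx.dic produces xxx.bin. A small sketch of that rule, useful for predicting where the compressed dictionary ends up; `compress_output_path` is a hypothetical helper and is not how the bindings compute the name internally.

```python
import os.path


def compress_output_path(dictionary: str, output=None) -> str:
    """Return the .bin path expected for a .dic file, honouring an explicit 'output'."""
    if output is not None:
        return output
    # Default rule from the template comment: xxx.dic -> xxx.bin
    root, _extension = os.path.splitext(dictionary)
    return root + ".bin"
```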
+    # Concord command (Unitex manual, p.267)
     concord:
+        # 'font' specifies the name of the font to use if the output is an
+        # HTML file.
+        #font: "Courier new"
         font: null
+        # 'fontsize' specifies the font size to use if the output is an HTML
+        # file.
+        #fontsize: 12
         fontsize: null
+        # If 'only_ambiguous' is set to 'True', Unitex will only display
+        # identical occurrences with ambiguous outputs, in text order.
         only_ambiguous: False
+        # If 'only_matches' is set to 'True', Unitex will force empty right
+        # and left contexts. Moreover, if used with -t/--text, Concord will
+        # not surround matches with tabulations.
         only_matches: False
+        # 'left' specifies the number of characters on the left of the
+        # occurrences. In Thai mode, this means the number of non-diacritic
+        # characters. For both 'left' and 'right' parameters, you can add the
+        # 's' character to stop at the first {S} tag. For instance, if you set
+        # 40s for the left value, the left context will end at 40 characters at
+        # most, less if the {S} tag is found before.
         left: "0"
+        # 'right' specifies the number of characters (non-diacritic ones in
+        # Thai mode) on the right of the occurrences (default=0). If the
+        # occurrence is shorter than this value, the concordance line is
+        # completed up to 'right'. If the occurrence is longer than the length
+        # defined by 'right', it is nevertheless saved as a whole.
         right: "0"
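The 'left' and 'right' values are strings because they may carry an 's' suffix, as in the "40s" example above. The sketch below splits such a value into its two parts. `parse_context` is a hypothetical helper written for illustration; the bindings pass the string on to Unitex and may not parse it at all.

```python
import re

# A context value is a non-negative integer, optionally followed by 's'.
_CONTEXT_VALUE = re.compile(r"([0-9]+)(s?)")


def parse_context(value: str):
    """Split a 'left'/'right' value such as '40s' into (length, stop_at_sentence_tag)."""
    match = _CONTEXT_VALUE.fullmatch(value.strip())
    if match is None:
        raise ValueError("invalid context value: %r" % value)
    return int(match.group(1)), match.group(2) == "s"
```

For "40s" this yields a length of 40 with the stop-at-{S} flag set; for the template default "0" it yields 0 with the flag unset.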
+        # 'sort' specifies the sort order. Possible values are:
+        #   - TO: text order
+        #   - LC: first left context then center
+        #   - LR: first left context then right
+        #   - CL: first center then left context
+        #   - CR: first center then right context
+        #   - RL: first right context then left context
+        #   - RC: first right context then center
         sort: "TO"
+        # 'format' specifies the output format. Possible values are:
+        #   - html: produces a concordance in HTML format
+        #   - text: produces a concordance in text format
+        #   - glossanet: produces a concordance for GlossaNet in HTML format
+        #                where occurrences are links described by the 'script'
+        #                parameter
+        #   - script: produces an HTML concordance file where occurrences are
+        #             links described by the 'script' parameter
+        #   - index: produces an index of the concordance, made of the content
+        #            of the occurrences (with the grammar outputs, if any),
+        #            preceded by the positions of the occurrences in the text
+        #            file given in characters
+        #   - uima: produces an index of the concordance relative to the
+        #           original text file, before any Unitex operation. The
+        #           'offsets' parameter must be provided
+        #   - prlg: produces a concordance for PRLG corpora where each line is
+        #           prefixed by information extracted with Unxmlize’s 'prlg'
+        #           option. You must provide both the 'offsets' and the
+        #           'unxmlize' parameters
+        #   - xml: produces an xml index of the concordance
+        #   - xml-with-header: produces an xml index of the concordance with
+        #                      full xml header
+        #   - axis: quite the same as 'index', but the numbers represent the
+        #           median character of each occurrence
+        #   - xalign: another index file, used by the text alignment module.
+        #             Each line is made of 3 integers X Y Z followed by the
+        #             content of the occurrence. X is the sentence number,
+        #             starting from 1. Y and Z are the starting and ending
+        #             positions of the occurrence in the sentence, given in
+        #             characters
+        #   - merge: indicates to the function that it is supposed to produce
+        #            a modified version of the text and save it in a file. The
+        #            filename must be provided with the 'output' parameter
         format: "text"
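Several 'format' values listed above only make sense when a companion parameter is set. The table-driven check below is a sketch of how a 'concord' section could be validated before use. `missing_concord_parameters` and `REQUIRED_BY_FORMAT` are hypothetical names; the 'script' requirement for 'glossanet' and 'script' is implied by the comments rather than stated, and an 'output' key is assumed to exist for 'merge'.

```python
# Companion parameters per format, taken from the comments in the template.
REQUIRED_BY_FORMAT = {
    "glossanet": ("script",),          # implied: links are described by 'script'
    "script": ("script",),             # implied: links are described by 'script'
    "uima": ("offsets",),              # stated: 'offsets' must be provided
    "prlg": ("offsets", "unxmlize"),   # stated: both must be provided
    "merge": ("output",),              # stated: filename given with 'output'
}


def missing_concord_parameters(concord: dict) -> list:
    """Return the companion parameters that the chosen 'format' needs but that are unset."""
    chosen = concord.get("format", "text")
    required = REQUIRED_BY_FORMAT.get(chosen, ())
    return [name for name in required if concord.get(name) is None]
```

With the template defaults (`format: "text"`) the list is empty, since the text format needs no companion parameter.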
+        # 'script' describes the links format for 'glossanet' and 'script'
+        # output. For instance, if you use 'http://www.google.com/search?q=',
+        # you will obtain an HTML concordance file where occurrences are
+        # hyperlinks to Google queries.
         script: null
+        # 'offsets' provides the file produced by Tokenize’s output_offsets
+        # option (needed by the 'uima' and the 'prlg' format).
         offsets: null
+        # 'unxmlize' provides the file produced by Unxmlize’s 'prlg' option
+        # (needed by the 'prlg' format).
         unxmlize: null
+        # 'directory' indicates to the function that it must not work in the
+        # same directory as <index> but in 'directory'
         directory: null
+        # 'thai' indicates that the input text is in the Thai language
         thai: False
     dico:
...
examples/build-config-file.py (+1, -0)
@@ -161,6 +161,7 @@ if __name__ == "__main__":
     sentence, replace = load_preprocessing_fsts(directory)
     alphabet, alphabet_sorted = load_alphabets(directory)
+    options["resources"]["language"] = language
     options["resources"]["dictionaries"] = dictionaries
     options["resources"]["sentence"] = sentence
     options["resources"]["replace"] = replace
...
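The line added to build-config-file.py stores the detected language next to the other resource entries. The sketch below shows the same filling step in isolation. `fill_resources` is a hypothetical function; the 'alphabet' and 'alphabet-sorted' keys are taken from the example in the template comment, and the script itself may assign them differently in the part of the file not shown in this diff.

```python
def fill_resources(options, language, alphabet, alphabet_sorted,
                   sentence, replace, dictionaries):
    """Fill the 'resources' section of a loaded configuration dict in place."""
    resources = options.setdefault("resources", {})
    resources["language"] = language            # the entry this commit starts filling
    resources["alphabet"] = alphabet
    resources["alphabet-sorted"] = alphabet_sorted
    resources["sentence"] = sentence
    resources["replace"] = replace
    resources["dictionaries"] = dictionaries    # list of .bin dictionary paths
    return options
```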
unitex/tools.py (+4, -5)
@@ -164,9 +164,8 @@ def concord(index, alphabet, **kwargs):
     - Generic options:
         font [str] -- the name of the font to use if the output is an HTML
-                      file
+                      file.
-        fontsize [int] -- the font size to use if the output is an HTML file. The
-                          font parameters are required if the output is an HTML file;
+        fontsize [int] -- the font size to use if the output is an HTML file.
         only_ambiguous [bool] -- Only displays identical occurrences with ambiguous
                                  outputs, in text order (default: False)
         only_matches [bool] -- this option will force empty right and left contexts. Moreover,
...
@@ -210,8 +209,8 @@ def concord(index, alphabet, **kwargs):
         UnitexConstants.FORMAT_PRLG: produces a concordance for PRLG corpora where each line is prefixed
                                      by information extracted with Unxmlize’s 'prlg' option. You must
                                      provide both the 'offsets' and the 'unxmlize' argument
-        UnitexConstants.FORMAT_XML: produces xml index of the concordance
+        UnitexConstants.FORMAT_XML: produces an xml index of the concordance
-        UnitexConstants.FORMAT_XML_WITH_HEADER: produces xml index of the concordance with full xml header
+        UnitexConstants.FORMAT_XML_WITH_HEADER: produces an xml index of the concordance with full xml header
         UnitexConstants.FORMAT_AXIS: quite the same as 'index', but the numbers represent the median
                                      character of each occurrence
         UnitexConstants.FORMAT_XALIGN: another index file, used by the text alignment module. Each line is
...