# Python bindings for [Unitex/GramLab](http://www-igm.univ-mlv.fr/~unitex/)
This package provides access to the Unitex C++ Library.
* Persistence of linguistic resources
The library has been tested on MacOSX (with Python from the [MacPorts](https://www.macports.org/) project) and Linux ([Debian](https://www.debian.org/)) for versions 2.7 and 3.5 of Python. The installation requires the Python header files and the `setuptools` module. If you plan to use the configuration system, you will also need the `yaml` module.
```bash
# Run as root
# On MacOSX (MacPorts)
# NOTE: the Python header files are installed by default
# (cf. https://trac.macports.org/wiki/Python)
# Python 2.7
port install py27-setuptools
port install py27-yaml
# Python 3.5
port install py35-setuptools
port install py35-yaml
# On Linux (distributions based on Debian)
# Python 2.7
apt-get install python-dev
apt-get install python-setuptools
apt-get install python-yaml
```
Once you have met these requirements and downloaded the [Unitex](http://igm.univ-mlv.fr/~unitex/index.php?page=3&html=download.html) source distribution, run (as root):
```bash
UNITEX_INC=/path/to/unitex/Src/C++ python setup.py install
```
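Before running the installation, it can help to confirm that the required modules are visible to the interpreter you plan to use. The following is a minimal sketch, not part of the package: the helper name `missing_modules` is hypothetical, and `importlib.util.find_spec` requires Python 3.4 or later, so it does not apply to Python 2.7.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that this interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# `setuptools` is needed for the installation itself,
# `yaml` only if you plan to use the configuration system.
missing = missing_modules(["setuptools", "yaml"])
if missing:
    print("Missing modules: %s" % ", ".join(missing))
else:
    print("All required modules are available.")
```

Run it with the same `python` executable you will use for `setup.py`, since each interpreter has its own set of installed modules.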
## Installation testing
To test your installation, open a terminal, move into the `tests` directory and run (as user):
```bash
pat@lucy /home/dev/projects/python-unitex/tests [2]$ python 01_test_tools.py
............
----------------------------------------------------------------------
Ran 12 tests in 0.023s
OK
```
```bash
pat@lucy /home/dev/projects/python-unitex/tests [3]$ python 02_test_resources.py
......
----------------------------------------------------------------------
Ran 6 tests in 0.003s
OK
```
```bash
pat@lucy /home/dev/projects/python-unitex/tests [4]$ python 03_test_io.py
....................
----------------------------------------------------------------------
Ran 20 tests in 0.004s
OK
```
```bash
pat@lucy /home/dev/projects/python-unitex/tests [5]$ python 04_test_processor.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.045s
OK
```
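The four scripts above can also be launched in one go. The sketch below is only a convenience wrapper and is not shipped with the package: the function name `run_test_scripts` and the `*_test_*.py` file pattern are assumptions based on the script names shown above. Each script runs in its own interpreter, started from the directory that contains it.

```python
import glob
import os
import subprocess
import sys


def run_test_scripts(directory):
    """Run every `*_test_*.py` script found in `directory`, in name order.

    Returns a list of (script name, exit code) pairs.
    """
    results = []
    for script in sorted(glob.glob(os.path.join(directory, "*_test_*.py"))):
        name = os.path.basename(script)
        # Use the current interpreter and run from the scripts' own directory
        code = subprocess.call([sys.executable, name], cwd=directory)
        results.append((name, code))
    return results
```

Calling `run_test_scripts("tests")` from the project root returns one pair per script; an exit code of `0` means that the script reported no failure.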
## Documentation
The documentation is accessible from any Python interactive shell using the command `help()`.
```python
In [1]: import _unitex

In [2]: help(_unitex.unitex_tool)
Help on built-in function unitex_tool in module _unitex:

unitex_tool(...)
    This function launches an Unitex command.

    - **0 [str]** -- the Unitex command.

    *Return [bool]:*

      **True** if the command succeeds, **False** otherwise.
```
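The same help text can be retrieved outside an interactive shell, for instance from a script. This is a small sketch built on the standard `pydoc` module; the helper name `describe` is hypothetical, and the `renderer` argument of `pydoc.render_doc` is only available in Python 3.

```python
import importlib
import pydoc


def describe(module_name, attribute=None):
    """Return the help text of a module, or of one of its attributes.

    Returns None when the module cannot be imported.
    """
    try:
        target = importlib.import_module(module_name)
    except ImportError:
        return None  # the module is not installed for this interpreter
    if attribute is not None:
        target = getattr(target, attribute)
    # pydoc.plaintext renders without terminal formatting
    return pydoc.render_doc(target, renderer=pydoc.plaintext)


text = describe("_unitex", "unitex_tool")
print(text if text is not None else "The _unitex extension is not installed.")
```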
If you want to generate HTML documentation from the source files, you can also use the [Sphinx Documentation Generator](http://www.sphinx-doc.org/).
```bash
# Run as root
# On MacOSX (MacPorts)
# Python 2.7
port install py27-sphinx
# Python 3.5
port install py35-sphinx
# On Linux (distributions based on Debian)
# Python 2.7
apt-get install python-sphinx
```
Once the tool is installed, open a terminal, move into the [documentation]() directory and build the HTML pages from there (as user).

**NOTE: The texts must be encoded in UTF-8. There is so far no support for UTF-16-(LE|BE) or any other encoding.**

There are three ways to use the bindings (from low to high-level):

1. The `_unitex` C++ extension.
2. The Unitex basic commands and features.
3. The high-level `Processor` class.

The following sections give some sample code for each of these methods.

### The `_unitex` C++ extension

It works like the JNI interface shipped with the Unitex distribution: it converts arguments into C++ types and returned values into Python objects.

```python
from _unitex import unitex_load_persistent_alphabet,\
                    unitex_free_persistent_alphabet,\
                    unitex_tool

grammar = "/path/to/grammar.grf"  # placeholder: the grammar to compile
alphabet = unitex_load_persistent_alphabet("/path/to/Alphabet.txt")
command = "UnitexTool Grf2Fst2 --no_loop_check --alphabet=%s %s -qutf8-no-bom" % (alphabet, grammar)

# unitex_tool() returns True if the command succeeds, False otherwise
ret = unitex_tool(command)
unitex_free_persistent_alphabet(alphabet)
```

### The Unitex basic commands and features

This part of the bindings is an abstraction layer on top of the C++ extension. It adds a basic logging system and a number of checks on the arguments. The resources and the tool options can also be stored in a [configuration file](https://github.com/patwat/python-unitex/blob/master/config/unitex-template.yaml), which offers more flexibility.

```python
import yaml

from unitex.tools import grf2fst2
from unitex.config import UnitexConfig
from unitex.resources import load_persistent_alphabet, free_persistent_alphabet

grammar = "/path/to/grammar.grf"  # placeholder

# Placeholder path; safe_load does not build arbitrary Python objects
with open("/path/to/unitex.yaml", "r") as f:
    config = yaml.safe_load(f)
options = UnitexConfig(config)

alphabet = options["resources"]["alphabet"]
if options["persistence"] is True:
    alphabet = load_persistent_alphabet(alphabet)

kwargs = options["tools"]["grf2fst2"]

# Assumed call: the grammar, the alphabet, then the tool options
ret = grf2fst2(grammar, alphabet, **kwargs)

if options["persistence"] is True:
    free_persistent_alphabet(alphabet)
```

### The high-level `Processor` class

This class hides most of the Unitex (pre-)processing procedures to make Unitex easier to use.

```python
import os

from unitex.resources import load_persistent_fst2, free_persistent_fst2
from unitex.processor import UnitexProcessor

# Placeholders: the input texts, the compiled grammar and the tagging options
files = ["/path/to/text.txt"]
grammar = load_persistent_fst2("/path/to/grammar.fst2")
kwargs = {}

# Persistence is achieved during object initialization
# Assumption: the constructor takes the path to the YAML configuration file
processor = UnitexProcessor("/path/to/unitex.yaml")

for f in files:
    path, _ = os.path.splitext(f)
    output = "%s.xml" % path

    # mode: 's': segment (apply Sentence.fst2)
    #       'r': replace (apply Replace.fst2)
    #       't': tokenize
    #       'l': lexicalize (apply dictionaries)
    processor.open(f, mode="srtl", tagged=False)
    processor.tag(grammar, output, **kwargs)

    # 'clean': suppress the files produced by Unitex
    # 'free': unload all the persisted resources
    processor.close(clean=True, free=True)

free_persistent_fst2(grammar)
```
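Because the bindings only accept UTF-8 texts, it may be worth checking the input files before handing them to Unitex. The sketch below relies on the standard library only; the helper names `is_utf8` and `convert_utf16_to_utf8` are hypothetical and are not part of the package. The conversion assumes that the source file starts with a byte order mark, which the `utf-16` codec uses to detect the byte order.

```python
import io


def is_utf8(path):
    """Return True if the whole file at `path` decodes as UTF-8."""
    try:
        with open(path, "rb") as f:
            f.read().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def convert_utf16_to_utf8(source, target):
    """Rewrite the UTF-16 file `source` (with BOM) as the UTF-8 file `target`."""
    # newline="" keeps the original line endings untouched
    with io.open(source, "r", encoding="utf-16", newline="") as src:
        text = src.read()
    with io.open(target, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

A file that fails the `is_utf8` check and is known to be UTF-16 can be converted once, and the converted copy is then the one to pass to the bindings.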
In the [`examples`](https://github.com/patwat/python-unitex/blob/master/examples/) directory, you will find two scripts you can use to achieve two simple tasks.
1. `build-config-file.py`: this script builds, for a given language, a default YAML configuration file adapted to your local Unitex installation. This configuration file allows you to define the different parameters required by Unitex and by the bindings.
2. `do-concord.py`: this script performs a real Unitex process: the extraction of concordances from a corpus on the basis of a grammar. It also illustrates, in more detail, how to use the package for other tasks.
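To give an idea of what the configuration built by the first script is used for, here is a hedged sketch of the keys that the samples in this README read from it. The dictionary is hypothetical and limited to those keys; the real template contains more settings, and the helper `missing_keys` is not part of the package.

```python
# Hypothetical in-memory configuration, restricted to the keys used in this README
config = {
    "persistence": True,
    "resources": {"alphabet": "/path/to/Alphabet.txt"},
    "tools": {"grf2fst2": {}},
}


def missing_keys(config, required):
    """Return the dotted key paths from `required` that are absent from `config`."""
    missing = []
    for path in required:
        node = config
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing


REQUIRED = ["persistence", "resources.alphabet", "tools.grf2fst2"]
print(missing_keys(config, REQUIRED))  # prints []
```

A check of this kind, run right after loading the YAML file, reports an incomplete configuration before any Unitex tool is launched.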
To see the options and arguments of these scripts, open a terminal, move into the `examples` directory and run (as user) each script with the `--help` option as shown below:
```bash
pat@lucy /home/dev/projects/python-unitex/examples [1]$ python build-config-file.py --help
Build Config File -- build the (default) config file for a given language
$ build-config-file [OPTIONS] <Unitex YAML config template>
Options:
[ -h, --help = this help message ]
-o, --output = the resulting config filename
-l, --language = the language name
-d, --directory = the original resources directory for the language
(i.e. the language directory from Unitex distribution)
Example:
$ build-config-file -l fr -d /path/to/French -o unitex-fr.yaml unitex.yaml
pat@lucy /home/dev/projects/python-unitex/examples [2]$ python do-concord.py --help
Do Concord -- A simple script to illustrate the Unitex Python bindings
$ do-concord [OPTIONS] <file1(, file2, ...)>
Options:
[ -h, --help = this help message ]
-c, --config = the Unitex config file
-g, --grammar = the fst2 grammar to use
Example:
$ do-concord -c unitex-fr.yaml *.txt
```
* The **Unitex/GramLab** corpus processor: [documentation](http://igm.univ-mlv.fr/~unitex/index.php?page=4)
* The **Unitex C++/Java library**: [documentation](http://unitex-library-fr.readthedocs.org/)