    Committed by Patrick Watrin
    # Python bindings for [Unitex/GramLab](http://www-igm.univ-mlv.fr/~unitex/)
    
    This package provides access to the Unitex C++ Library.
    
    
    ## Main features
    
    * Unitex as a library
    
    * Persistence of linguistic resources
    
    * File virtualization
    
    
    ## Installation
    
    
    The library has been tested on MacOSX (with Python from the [MacPorts](https://www.macports.org/) project) and Linux ([Debian](https://www.debian.org/)) for versions 2.7 and 3.5 of Python. The installation requires the Python header files and the `setuptools` module. If you plan to use the configuration system, you will also need the `yaml` module.
    
    ```bash
    # Run as root
    # On MacOSX (MacPorts)
    
    # NOTE: the Python header files are installed by default
    #       (cf. https://trac.macports.org/wiki/Python)
    
    # Python 2.7
    
    port install py27-setuptools
    port install py27-yaml
    
    # Python 3.5
    
    port install py35-setuptools
    port install py35-yaml
    
    
    # On Linux (distributions based on Debian)
    # Python 2.7
    
    apt-get install python-dev
    apt-get install python-setuptools
    apt-get install python-yaml
    ```

    Once you have met these requirements and downloaded the [Unitex](http://igm.univ-mlv.fr/~unitex/index.php?page=3&html=download.html) source distribution, run (as root):

    ```bash
    UNITEX_INC=/path/to/unitex/Src/C++ python setup.py install
    ```
    
    ## Installation testing
    
    To test your installation, open a terminal, move to the `tests` directory and run each test script (as user):
    
    ```bash
    pat@lucy /home/dev/projects/python-unitex/tests [2]$ python 01_test_tools.py
    ............
    ----------------------------------------------------------------------
    Ran 12 tests in 0.023s
    
    OK
    ```
    
    ```bash
    pat@lucy /home/dev/projects/python-unitex/tests [3]$ python 02_test_resources.py
    ......
    ----------------------------------------------------------------------
    Ran 6 tests in 0.003s
    
    OK
    ```
    
    ```bash
    pat@lucy /home/dev/projects/python-unitex/tests [4]$ python 03_test_io.py
    ....................
    ----------------------------------------------------------------------
    Ran 20 tests in 0.004s
    
    OK
    ```
    
    ```bash
    pat@lucy /home/dev/projects/python-unitex/tests [5]$ python 04_test_processor.py
    ..
    ----------------------------------------------------------------------
    Ran 2 tests in 0.045s
    
    OK
    ```
    
    
    ## Documentation
    
    The documentation is accessible from any Python interactive shell using the command `help()`.
    
    
    ```
    
    In [1]: import _unitex
    
    In [2]: help(_unitex.unitex_tool)
    Help on built-in function unitex_tool in module _unitex:
    
    unitex_tool(...)
        This function launches an Unitex command.
    
    
        *Positional arguments (length: 1):*
    
        - **0 [str]** -- the Unitex command.
    
        *Return [bool]:*
    
      **True** if the command succeeds, **False** otherwise.
    ```
    
    If you want to generate HTML documentation from the source files, you can also use the [Sphinx Documentation Generator](http://www.sphinx-doc.org/).
    
    ```bash
    # Run as root
    # On MacOSX (MacPorts)
    # Python 2.7
    port install py27-sphinx
    
    # Python 3.5
    port install py35-sphinx
    
    # On Linux (distributions based on Debian)
    # Python 2.7
    apt-get install python-sphinx
    ```
    
    Once the tool is installed, open a terminal, move to the [documentation]() directory and run (as user):
    
    ```bash
    make html
    ```

    ## Getting started
    
    
    **NOTE: The texts must be encoded in UTF-8. There is so far no support for UTF-16-(LE|BE) or any other encoding.**
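
If a corpus is stored in another encoding, it can be re-encoded before it is handed to the bindings. The following is a minimal sketch that uses only the Python standard library; the helper name `to_utf8` and its parameters are illustrative assumptions and are not part of this package.

```python
import io


def to_utf8(src, dst, src_encoding="utf-16"):
    """Re-encode the text file `src` as UTF-8 (no BOM) and write it to `dst`.

    Hypothetical helper, not provided by the bindings.
    """
    # With the "utf-16" codec, the BOM selects the byte order and is dropped.
    # newline="" keeps the original line endings untouched.
    with io.open(src, "r", encoding=src_encoding, newline="") as fin:
        text = fin.read()
    with io.open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(text)
    return dst
```

The converted file can then be used wherever the examples below expect a text path.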
    
    There are three ways to use the bindings (from low to high-level):
    
    1. The `_unitex` C++ extension.
    2. The Unitex basic commands and features.
    
    3. The high-level `Processor` class.
    
    The following sections give some sample code for each of these methods.
    
    ### The `_unitex` C++ extension
    
    
    
    This extension works like the JNI interface that ships with the Unitex distribution: it converts arguments into C++ types and returned values into Python objects.
    
    ```python
    
    from _unitex import (
        unitex_load_persistent_alphabet,
        unitex_free_persistent_alphabet,
        unitex_tool,
    )
    
    
    
    alphabet = unitex_load_persistent_alphabet("/path/to/Alphabet.txt")
    
    grammar = "/path/to/grammar.grf"
    
    command = "UnitexTool Grf2Fst2 --no_loop_check --alphabet=%s %s -qutf8-no-bom" % (alphabet, grammar)
    
    
    ret = unitex_tool(command)
    
    
    unitex_free_persistent_alphabet(alphabet)
    
    ```
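
In the example above, the persistent alphabet is not released if the command raises an exception before the final call. One way to guard against that is a small context manager, sketched below. The names `persistent` and `grf2fst2_command` are hypothetical helpers and are not part of `_unitex`; the load and free functions are passed in as arguments, so the sketch makes no assumption beyond the load/free pairing shown above.

```python
from contextlib import contextmanager


@contextmanager
def persistent(load, free, path):
    """Load a persistent resource and always free it afterwards.

    Hypothetical helper: `load` and `free` are a matching pair such as
    `unitex_load_persistent_alphabet` / `unitex_free_persistent_alphabet`.
    """
    handle = load(path)  # value returned by the load function
    try:
        yield handle
    finally:
        free(handle)  # runs even if the body raises


def grf2fst2_command(alphabet, grammar):
    """Build the same command line as in the example above."""
    return ("UnitexTool Grf2Fst2 --no_loop_check --alphabet=%s %s "
            "-qutf8-no-bom" % (alphabet, grammar))


# Intended usage, assuming the extension is installed:
#
#   with persistent(unitex_load_persistent_alphabet,
#                   unitex_free_persistent_alphabet,
#                   "/path/to/Alphabet.txt") as alphabet:
#       ret = unitex_tool(grf2fst2_command(alphabet, "/path/to/grammar.grf"))
```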
    
    ### The Unitex basic commands and features
    
    
    
    This part of the bindings is an abstraction layer on top of the C++ extension. It provides a basic logging system and a number of checks on arguments. The resources and tool options can also be stored in a [configuration file](https://github.com/patwat/python-unitex/blob/master/config/unitex-template.yaml), which offers more flexibility.
    
    ```python
    
    import yaml
    
    from unitex.tools import grf2fst2
    from unitex.config import UnitexConfig
    from unitex.resources import load_persistent_alphabet, free_persistent_alphabet
    
    
    
    grammar = "/path/to/grammar.grf"
    
    
    
    config = None
    with open("/path/to/unitex.yaml", "r") as f:
        # safe_load does not construct arbitrary Python objects and does not
        # need the explicit Loader argument that recent PyYAML versions require
        config = yaml.safe_load(f)
    
    options = UnitexConfig(config)
    
    
    alphabet = options["resources"]["alphabet"]
    if options["persistence"] is True:
        alphabet = load_persistent_alphabet(alphabet)
    
    kwargs = options["tools"]["grf2fst2"]
    
    
    ret = grf2fst2(grammar, alphabet, **kwargs)
    
    
    if options["persistence"] is True:
        free_persistent_alphabet(alphabet)
    
    ```
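
The example above reads three entries from the loaded configuration: `resources.alphabet`, `persistence` and `tools.grf2fst2`. A hand-written configuration file can be checked for those entries before it is used, as in the sketch below. `check_config` is a hypothetical helper; the `UnitexConfig` class may well perform its own, different validation.

```python
def check_config(config):
    """Return the dotted names of the entries missing from `config`.

    Hypothetical helper: only the keys read in the example above are checked.
    """
    required = [
        ("resources", "alphabet"),
        ("persistence",),
        ("tools", "grf2fst2"),
    ]
    missing = []
    for path in required:
        node = config
        for key in path:
            # Stop descending as soon as a level is absent or not a mapping.
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing
```

An empty list means that all three entries are present.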
    
    
    ### The high-level `Processor` class
    
    This class hides most of the Unitex (pre-)processing procedures to make Unitex easier to use.
    
    ```python
    import os
    
    
    from unitex.resources import load_persistent_fst2, free_persistent_fst2
    from unitex.processor import UnitexProcessor
    
    
    files = [ ... ]
    
    grammar = load_persistent_fst2("/path/to/grammar.fst2")
    
    # Persistence is achieved during object initialization
    
    processor = UnitexProcessor("/path/to/unitex.yaml")
    
    # Keyword arguments forwarded to processor.tag()
    kwargs = {}
    kwargs["xml"] = False
    
    
    for f in files:
        path, _ = os.path.splitext(f)
        output = "%s.xml" % path
    
        # mode: 's': segment (apply Sentence.fst2)
        #       'r': replace (apply Replace.fst2)
        #       't': tokenize
        #       'l': lexicalize (apply dictionaries)
        processor.open(f, mode="srtl", tagged=False)
        processor.tag(grammar, output, **kwargs)
    
    # 'clean': suppress the files produced by Unitex
    # 'free': unload all the persisted resources
    
    processor.close(clean=True, free=True)
    
    free_persistent_fst2(grammar)
    ```
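
The loop above derives each output path by swapping the extension of the input file. That step can be isolated in a small function, sketched here with the standard library only. `output_for` is a hypothetical name and is not part of the package.

```python
import os


def output_for(path, extension=".xml"):
    """Return `path` with its last extension replaced by `extension`."""
    root, _ = os.path.splitext(path)
    return root + extension
```

Only the last extension is replaced, so `corpus.tar.gz` would become `corpus.tar.xml`.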
    
    
    
    In the [`examples`](https://github.com/patwat/python-unitex/blob/master/examples/) directory, you will find two scripts you can use to achieve two simple tasks.
    
    1. `build-config-file.py`: this script builds, for a given language, a default YAML configuration file adapted to your local Unitex installation. This configuration file allows you to define the different parameters required by Unitex and by the bindings.
    2. `do-concord.py`: this script performs a real Unitex process: the extraction of concordances from a corpus on the basis of a grammar. It also illustrates, in more detail, how to use the package for other tasks.
    
    To see the options and arguments of these scripts, open a terminal, move to the `examples` directory and run (as user) each script with the `--help` option as shown below:
    
    
    ```bash
    pat@lucy /home/dev/projects/python-unitex/examples [1]$ python build-config-file.py --help
    Build Config File -- build the (default) config file for a given language
    
      $ build-config-file [OPTIONS] <Unitex YAML config template>
    
    Options:
      [ -h, --help      = this help message                                      ]
        -o, --output    = the resulting config filename
        -l, --language  = the language name
        -d, --directory = the original resources directory for the language
                          (i.e. the language directory from Unitex distribution)
    
    Example:
    
      $ build-config-file -l fr -d /path/to/French -o unitex-fr.yaml unitex.yaml
    
    pat@lucy /home/dev/projects/python-unitex/examples [2]$ python do-concord.py --help
    
    Do Concord -- A simple script to illustrate the Unitex Python bindings
    
    
      $ do-concord [OPTIONS] <file1(, file2, ...)>
    
    Options:
      [ -h, --help    = this help message       ]
        -c, --config  = the Unitex config file
        -g, --grammar = the fst2 grammar to use
    
    Example:
    
      $ do-concord -c unitex-fr.yaml *.txt
    ```
    
    ## Useful links
    
    
    * The **Unitex/GramLab** corpus processor: [documentation](http://igm.univ-mlv.fr/~unitex/index.php?page=4)
    
    * The **Unitex C++/Java library**: [documentation](http://unitex-library-fr.readthedocs.org/)