Introduction
------------

The tokenizer tokenizes a text into sentences and words.

### Confused by some terminology?

This software is part of a larger collection of natural language processing tools known as "the OpeNER project". You can find more information about the project at [the OpeNER portal](http://opener-project.github.io). There you can also find references to terms like KAF (an XML standard to represent linguistic annotations in texts), components, cores, scenarios and pipelines.

Quick Use Example
-----------------

Installing the tokenizer can be done by executing:

    gem install tokenizer

Please bear in mind that all components in OpeNER take KAF as input and output KAF by default.

### Command line interface

You should now be able to call the tokenizer as a regular shell command, by its name. Once installed, the gem normally sits in your path so you can call it directly from anywhere.

Tokenizing some text:

    echo "This is English text" | tokenizer -l en --no-kaf

will result in:

    This is English text

The available languages for tokenization are: English (en), German (de), Dutch (nl), French (fr), Spanish (es) and Italian (it).

#### KAF input format

The tokenizer is capable of taking KAF as input, and actually does so by default:

    echo "This is what I call, a test!" | tokenizer

This will result in a KAF document containing the tokenized text.

If the argument `-k` (`--kaf`) is passed, the argument `-l` (`--language`) is ignored, since the language is taken from the KAF input itself.
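With KAF input you would normally pipe in a full KAF document rather than plain text. The fragment below is a minimal sketch of what that might look like; the `raw` element carries the source text to be tokenized, and the header attributes shown here are illustrative rather than a definitive template (the exact element set is defined by the KAF standard, documented on the OpeNER portal):

    echo '<?xml version="1.0" encoding="UTF-8"?>
    <KAF xml:lang="en" version="v1.opener">
      <raw>This is what I call, a test!</raw>
    </KAF>' | tokenizer

Because the document declares `xml:lang="en"`, no `-l` option is needed.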
### Webservices

You can launch a tokenizer webservice by executing:

    tokenizer-server

This will launch a mini webserver with the webservice. It defaults to port 9292, so you can access it at http://localhost:9292.

To launch it on a different port, provide the `-p [port-number]` option like this:

    tokenizer-server -p 1234

It then launches at http://localhost:1234.

Documentation on the webservice is provided by surfing to the URLs provided above. For more information on how to launch a webservice, run the command with the `--help` option.

### Daemon

Last but not least, the tokenizer comes shipped with a daemon that can read jobs from, and write jobs to, Amazon SQS queues. For more information type:

    tokenizer-daemon --help

Description of dependencies
---------------------------

This component runs best if you run it in an environment suited for OpeNER components. You can find an installation guide and helper tools in the [OpeNER installer](https://github.com/opener-project/opener-installer) and [an installation guide on the OpeNER website](http://opener-project.github.io/getting-started/how-to/local-installation.html).

At least you need the following system setup:

### Dependencies for normal use:

* Perl 5
* MRI 1.9.3

### Dependencies if you want to modify the component:

* Maven (for building the Gem)

Language Extension
------------------

The tokenizer module is a wrapper around a Perl script, which performs the actual tokenization based on rules that determine when to break a character sequence. The tokenizer already supports a lot of languages. Have a look at the core script to figure out how to extend it to new languages.

The Core
--------

The component is a fat wrapper around the actual language technology core. The core is a rule based tokenizer implemented in Perl. You can find the core technologies in the following repositories:

* [tokenizer-base](http://github.com/opener-project/tokenizer-base)

Where to go from here
---------------------

* [Check the project website](http://opener-project.github.io)
* [Check out the webservice](http://opener.olery.com/tokenizer)

Report problem/Get help
-----------------------

If you encounter problems, please email us or leave an issue in the [issue tracker](https://github.com/opener-project/tokenizer/issues).

Contributing
------------

1. Fork it ( http://github.com/opener-project/tokenizer/fork )
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create new Pull Request