README.md in pragmatic_segmenter-0.0.3 vs README.md in pragmatic_segmenter-0.0.4

- old
+ new

@@ -1,8 +1,8 @@
#Pragmatic Segmenter
-[![Gem Version](https://badge.fury.io/rb/pragmatic_segmenter.svg)](http://badge.fury.io/rb/pragmatic_segmenter) [![Code Climate](https://codeclimate.com/github/diasks2/pragmatic_segmenter/badges/gpa.svg)](https://codeclimate.com/github/diasks2/pragmatic_segmenter) [![Build Status](https://travis-ci.org/diasks2/pragmatic_segmenter.png)](https://travis-ci.org/diasks2/pragmatic_segmenter) [![Test Coverage](https://codeclimate.com/github/diasks2/pragmatic_segmenter/badges/coverage.svg)](https://codeclimate.com/github/diasks2/pragmatic_segmenter)
+[![Gem Version](https://badge.fury.io/rb/pragmatic_segmenter.svg)](http://badge.fury.io/rb/pragmatic_segmenter) [![Code Climate](https://codeclimate.com/github/diasks2/pragmatic_segmenter/badges/gpa.svg)](https://codeclimate.com/github/diasks2/pragmatic_segmenter) [![Build Status](https://travis-ci.org/diasks2/pragmatic_segmenter.png)](https://travis-ci.org/diasks2/pragmatic_segmenter) [![Test Coverage](https://codeclimate.com/github/diasks2/pragmatic_segmenter/badges/coverage.svg)](https://codeclimate.com/github/diasks2/pragmatic_segmenter) [![License](https://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](https://github.com/diasks2/pragmatic_segmenter/blob/master/LICENSE.txt)
Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
##Install
@@ -95,10 +95,12 @@
Therefore, I created a set of distinct edge cases to compare segmentation tools on. As most segmentation tools have very high accuracy, in my opinion what is really important to test is how a segmenter handles the edge cases - not whether it can segment 20,000 sentences that end with a regular word followed by a period. These example tests I have named the “Golden Rules". This list is by no means complete and will evolve and expand over time. If you would like to contribute to (or complain about) the test set, please open an issue.
The Holy Grail of sentence segmentation appears to be **Golden Rule #18** as no segmenter I tested was able to correctly segment that text. The difficulty being that an abbreviation (in this case a.m./A.M./p.m./P.M.) followed by a capitalized abbreviation (such as Mr., Mrs., etc.) or followed by a proper noun such as a name can be both a sentence boundary and a non sentence boundary.
+Download the Golden Rules: [[txt](https://s3.amazonaws.com/tm-town-nlp-resources/golden_rules.txt) | [Ruby RSpec](https://s3.amazonaws.com/tm-town-nlp-resources/golden_rules_rspec.rb)]
+
####Golden Rules (English)
1.) **Simple period to end sentence** ``` Hello World. My name is Jonas.
@@ -652,11 +654,11 @@
* [segtok](https://pypi.python.org/pypi/segtok/1.1.0)
* [LingPipe](http://alias-i.com/lingpipe/demos/tutorial/sentences/read-me.html)
## Speed Performance Benchmarks
-To test the relative performance of different segmentation tools and libraries I created a simple benchmark test. The test takes the 50 English Golden Rules combined into one string and runs it 100 times through the segmenter. This speed benchmark is by no means the most scientific benchmark, but it should help to give some relative performance data. The tests were done on a Mac Pro 3.7 GHz Quad-Core Intel Xeon E5 running 10.9.5. For Punkt the tests were run using the [Ruby port](https://github.com/lfcipriani/punkt-segmenter). For Standford CoreNLP the tests were run using the [Ruby port](https://github.com/louismullie/stanford-core-nlp). For OpenNLP the tests were run using the [Ruby port](https://github.com/louismullie/open-nlp).
+To test the relative performance of different segmentation tools and libraries I created a simple benchmark test. The test takes the 50 English Golden Rules combined into one string and runs it 100 times through the segmenter. This speed benchmark is by no means the most scientific benchmark, but it should help to give some relative performance data. The tests were done on a Mac Pro 3.7 GHz Quad-Core Intel Xeon E5 running 10.9.5. For Punkt the tests were run using this [Ruby port](https://github.com/lfcipriani/punkt-segmenter), for Standford CoreNLP the tests were run using this [Ruby port](https://github.com/louismullie/stanford-core-nlp), and for OpenNLP the tests were run using this [Ruby port](https://github.com/louismullie/open-nlp).
## Languages with sentence boundary punctuation that is different than English
*If you know of any languages that are missing from the list below, please open an issue. Thank you.*
@@ -672,9 +674,10 @@
* Persian
* Urdu
##Segmentation Papers and Books
+* *Sentence Boundary Detection: A Long Solved Problem?* (Second Edition) - Jonathon Read, Rebecca Dridan, Stephan Oepen, Lars Jørgen Solberg (2012) [[pdf](http://www.aclweb.org/anthology/C12-2096) | [mirror](https://s3.amazonaws.com/tm-town-nlp-resources/C12-2096.pdf)]
* *Handbook of Natural Language Processing* (Second Edition) - Nitin Indurkhya and Fred J. Damerau (2010) [[amazon](http://www.amazon.com/Handbook-Language-Processing-Learning-Recognition/dp/1420085921)]
* *Sentence Boundary Detection and the Problem with the U.S.* - Dan Gillick (2009) [[pdf](http://dgillick.com/resource/sbd_naacl_2009.pdf) | [mirror](https://s3.amazonaws.com/tm-town-nlp-resources/sbd_naacl_2009.pdf)]
* *Thoughts on Word and Sentence Segmentation in Thai* - Wirote Aroonmanakun (2007) [[pdf](http://pioneer.chula.ac.th/~awirote/ling/snlp2007-wirote.pdf) | [mirror](https://s3.amazonaws.com/tm-town-nlp-resources/snlp2007-wirote.pdf)]
* *Unsupervised Multilingual Sentence Boundary Detection* - Tibor Kiss and Jan Strunk (2005) [[pdf](http://www.linguistics.ruhr-uni-bochum.de/~strunk/ks2005FINAL.pdf) | [mirror](https://s3.amazonaws.com/tm-town-nlp-resources/ks2005FINAL.pdf)]
* *An Analysis of Sentence Boundary Detection Systems for English and Portuguese Documents* - Carlos N. Silla Jr. and Celso A. A. Kaestner (2004) [[pdf](https://www.cs.kent.ac.uk/pubs/2004/2930/content.pdf) | [mirror](https://s3.amazonaws.com/tm-town-nlp-resources/An+Analysis+of+Sentence+Boundary+Detection+Systems+for+English+and+Portuguese+Documents.pdf)]
\ No newline at end of file
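
The Speed Performance Benchmarks hunk describes the benchmark only in prose: combine the Golden Rules into one string and run it 100 times through a segmenter. A minimal sketch of that loop follows, using Ruby's standard `Benchmark` module. This is not the author's benchmark script. `NAIVE_SEGMENTER`, `SAMPLE_TEXT`, and `benchmark_segmenter` are hypothetical names, the stand-in segmenter is a naive regex split so the sketch runs without any gems, and the sample text is a short illustration, not the real 50-rule set.

```ruby
require 'benchmark'

# Hypothetical stand-in segmenter so the sketch runs without any gems.
# It naively splits after ., ! or ? followed by whitespace. This is NOT
# how Pragmatic Segmenter works; swap in the segmenter under test.
NAIVE_SEGMENTER = ->(text) { text.split(/(?<=[.!?])\s+/) }

# Illustrative sample standing in for the 50 English Golden Rules
# combined into one string (assumption: the real text would come from
# the downloadable golden_rules.txt).
SAMPLE_TEXT = [
  'Hello World. My name is Jonas.',
  'This is a second sample. It has two sentences.'
].join(' ')

# Run the segmenter `runs` times over `text`; return elapsed seconds.
def benchmark_segmenter(segmenter, text, runs: 100)
  Benchmark.realtime do
    runs.times { segmenter.call(text) }
  end
end

elapsed = benchmark_segmenter(NAIVE_SEGMENTER, SAMPLE_TEXT)
puts format('100 runs: %.4f s', elapsed)
```

To time a real library, replace the lambda with any object that responds to `call(text)` and returns the segments. The absolute number depends on the machine, so only comparisons made on the same hardware are meaningful, as the README itself cautions.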