Sha256: 4f28e72a26b4c0f42315c9c0a45fa0977a1cd5a74506418476fe8e145407a1af

Contents?: true

Size: 568 Bytes

Versions: 4

Compression:

Stored size: 568 Bytes

Contents

The current language detector in `detector.yaml` is trained on bigrams from 5000 tweets. It automatically removes all non-ASCII characters, so non-Roman characters play no part in classification, though you could add a simple check for them. Such a check isn't really necessary: the only case where it seems to help is when a tweet consists solely of non-Roman characters, because stripping them leaves the detector trying to classify an empty string.
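
The stripping step and the optional check described above might look roughly like the sketch below. The helper names (`strip_non_ascii`, `char_bigrams`, `classifiable?`) are hypothetical and chosen for illustration; they are not taken from the gem, whose actual implementation may differ.

```ruby
# Hypothetical helpers for illustration only; not the gem's actual API.

# Drop every character outside the 7-bit ASCII range.
def strip_non_ascii(text)
  text.each_char.select(&:ascii_only?).join
end

# Split text into overlapping character bigrams, e.g. "abc" -> ["ab", "bc"].
def char_bigrams(text)
  text.each_char.each_cons(2).map(&:join)
end

# The "simple check": false when nothing but whitespace survives stripping,
# i.e. when the detector would be asked to classify an empty string.
def classifiable?(text)
  !strip_non_ascii(text).strip.empty?
end
```

Under these assumptions, a caller could skip classification (or report "unknown") whenever `classifiable?(tweet)` is false, and otherwise pass `char_bigrams(strip_non_ascii(tweet))` on to the detector.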

Version data entries

4 entries across 4 versions & 1 rubygems

Version Path
unsupervised-language-detection-0.0.4 website/README.md
unsupervised-language-detection-0.0.3 website/README.md
unsupervised-language-detection-0.0.2 website/README.md
unsupervised-language-detection-0.0.1 website/README.md