Sha256: 4f28e72a26b4c0f42315c9c0a45fa0977a1cd5a74506418476fe8e145407a1af
Contents?: true
Size: 568 Bytes
Versions: 4
Compression:
Stored size: 568 Bytes
Contents
The current language detector in `detector.yaml` is trained on bigrams from 5000 tweets. It strips all non-ASCII characters before classifying, so non-Roman characters never contribute to the result. You could add a simple check for them, but it is rarely necessary: the only case where a check seems to help is a tweet made up entirely of non-Roman characters, because once those are stripped the detector is effectively trying to classify an empty string.
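
The edge case above can be illustrated with a minimal sketch. It is not the detector's actual code: the helper names `strip_non_ascii`, `bigrams`, and `unclassifiable?` are hypothetical, and the sketch only assumes that preprocessing removes non-ASCII characters and that features are character bigrams, as described.

```ruby
# Hypothetical helpers for illustration; these names are assumptions
# and do not come from the detector itself.

# Drop every non-ASCII character, mirroring the stripping step described above.
def strip_non_ascii(text)
  text.each_char.select(&:ascii_only?).join
end

# Split a string into overlapping character bigrams.
def bigrams(text)
  text.chars.each_cons(2).map(&:join)
end

# The "simple check": true when nothing but whitespace survives stripping,
# so a bigram classifier would have no features to work with.
def unclassifiable?(text)
  strip_non_ascii(text).strip.empty?
end

p strip_non_ascii("¡Hola, señor!")  # => "Hola, seor!"
p bigrams("hola")                   # => ["ho", "ol", "la"]
p unclassifiable?("こんにちは")      # => true
p unclassifiable?("hello")          # => false
```

A caller could run `unclassifiable?` before classification and return an "unknown" result instead of passing an empty string to the detector.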
Version data entries
4 entries across 4 versions & 1 rubygems