The current language detector in `detector.yaml` is trained on bigrams from 5000 tweets. It strips all non-ASCII characters before classifying, so non-Roman characters play no part in the classification. You could add a simple check for them, but it isn't really necessary: the only case where it seems to help is a tweet consisting solely of non-Roman characters, because once those are stripped the detector is effectively trying to classify an empty string.
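
If you do want the check, something like the following might be enough. This is only a sketch: `strip_non_ascii`, `classify_with_check`, and the `classify` callable are hypothetical names, not part of the detector, and the stripping here is assumed to behave like the detector's own.

```python
from typing import Callable, Optional


def strip_non_ascii(text: str) -> str:
    """Drop every non-ASCII character (assumed to mirror the detector's stripping)."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def classify_with_check(text: str, classify: Callable[[str], str]) -> Optional[str]:
    """Return None instead of classifying when nothing is left after stripping.

    `classify` is a hypothetical stand-in for whatever function wraps the
    trained detector; it takes a string and returns a language label.
    """
    stripped = strip_non_ascii(text)
    if not stripped.strip():
        # Only non-ASCII characters (and maybe whitespace) in the tweet:
        # the detector would be classifying an empty string, so skip it.
        return None
    return classify(stripped)
```

Returning `None` lets the caller decide what to do with tweets the detector can't meaningfully handle, rather than accepting whatever label it assigns to an empty string.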