=Langa - A language analyzer

Langa was created in a few weeks after someone suggested to me that language recognition would be a fine extension to Lingo (see http://www.lex-lingo.de). The basic idea of how language recognition could be done took shape within a few minutes, so a proof of concept was needed. Langa is the proof that this concept works, with minor limitations.

==Concept of Language Recognition

Every language has its own characteristic usage of characters. This covers the set of characters used, the frequency of each character, the proportion of consonants, vowels and special characters, and the occurrence of language-specific high-frequency words. For now, Langa concentrates on the first two aspects: character set and frequency.

To do this, Langa processes a text file and extracts a language-specific fingerprint. Fingerprints are comparable: you can measure the distance between the file's fingerprint and each of several reference fingerprints and declare the one with the shortest distance to be the match.

==Langa.dna - The language fingerprints

To compare the fingerprint of a given file with the fingerprints of several languages, we need those reference fingerprints first. So how do we get them? The easiest way is to take a large file in a given language, compute the fingerprint for that file and use it as the reference for the language.

The first source for large language files was the 'Wortschatz' project from the University of Leipzig, Germany (see http://corpora.informatik.uni-leipzig.de/). It provides good-quality text files for 18 languages, large enough for our purposes. The second source is the Unbound Bible (see http://www.unboundbible.org/), where the Bible is translated into several languages (see examples/).
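The fingerprint-and-distance idea from the concept section can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions, not Langa's actual implementation: the method names (fingerprint, distance, closest_language) are hypothetical, the fingerprint is taken to be the relative frequency of each letter, and the distance is a plain sum of absolute differences, which may differ from the measure Langa really uses.

```ruby
# Hypothetical helpers for illustration; Langa's real code may differ.

# Count how often each letter occurs and return relative frequencies,
# keyed by Unicode codepoint. Non-letters are ignored.
def fingerprint(text)
  counts = Hash.new(0)
  text.downcase.each_char do |ch|
    counts[ch.ord] += 1 if ch =~ /\p{L}/
  end
  total = counts.values.sum.to_f
  return {} if total.zero?
  counts.transform_values { |n| n / total }
end

# Sum of absolute frequency differences over every character seen in
# either fingerprint. A character missing on one side counts as 0.
# (Assumed distance measure, chosen for simplicity.)
def distance(a, b)
  (a.keys | b.keys).sum { |cp| (a.fetch(cp, 0.0) - b.fetch(cp, 0.0)).abs }
end

# Return the key of the reference fingerprint closest to the sample.
def closest_language(sample, references)
  references.min_by { |_code, ref| distance(sample, ref) }.first
end
```

With a hash of reference fingerprints keyed by language code, closest_language(fingerprint(text), references) would return the code with the shortest distance. Meaningful results would need large reference texts, as described above.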
==Quick Start

===User's View

To see how Langa works from a user's point of view, call the following from the langa directory:

  % bin/langa examples/*
  examples/afrikaans_1953_utf8.txt............................Language is Afrikaans (afk)
  examples/albanian_utf8.txt..................................Language is Albanian (sqi)
  ...
  examples/wolof_utf8.txt.....................................Language is Wolof (wol)
  examples/xhosa_utf8.txt.....................................Language is Xhosa (xho)

===Developer's View

As a developer who wants to find out which language a file contains, call

  require 'langa'

  # => locate langa.dna
  this_path = File.dirname(__FILE__)
  langa_dna = File.join(this_path, '..', 'lib', 'langa', 'langa.dna')

  # => process
  la = LanguageAnalyzer.new(langa_dna)
  lang = la.analyze(file, codepage)
  puts 'Language is %s (%s)' % [la.config(lang)['name'], lang]

See documentation for details.

==Add a new language

If you want to add a new language, proceed as follows:

- Find a text file that contains lots of written sentences in the desired language. The bigger the file, the better the results. Let's call it language.txt, for example.
- Call langa from the command line with

    % bin/langa --dna language.txt
    please be patient, analyzing takes some time...
    :
    name:
    iso1:
    source: examples/asv_utf8.txt
    size: 142256
    utf8: eathondsirlmfuwbcygvpkjzxq
    fingerprint: 101-12616+97-10560+116-8889+104-8721+111-7544+110-7313+100-6081+115-5491+105-5355+114-4844+108-3465+109-2661+102-2396+117-2164+119-2016+98-1755+99-1664+121-1663+103-1587+118-1117+112-954+107-665+106-353+122-52+120-46+113-14
    %

Now paste the output (without the "please be patient" message) into the langa.dna file. Replace the '<...>' strings with the correct values from the ISO 639-3 standard (see http://www.sil.org/iso639-3/codes.asp?order=reference_name&letter=a).
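Judging from the sample output above, the fingerprint line appears to be a '+'-separated list of codepoint-count pairs: 101 is 'e', 97 is 'a', and so on, in the same order as the utf8 line. Assuming that reading is correct, such a string could be decoded for inspection as sketched below. The method name parse_fingerprint is hypothetical and not part of Langa.

```ruby
# Hypothetical helper; the format is inferred from the sample output,
# not taken from Langa's documentation.

# Parse "codepoint-count+codepoint-count+..." into an array of
# [character, count] pairs, keeping the order of the string.
def parse_fingerprint(str)
  str.split('+').map do |pair|
    codepoint, count = pair.split('-', 2)
    [Integer(codepoint).chr(Encoding::UTF_8), Integer(count)]
  end
end

# First four pairs of the sample fingerprint shown above.
sample = '101-12616+97-10560+116-8889+104-8721'
parse_fingerprint(sample).each { |ch, n| puts "#{ch}: #{n}" }
```

This can help check a new langa.dna entry by eye, for example to confirm that the most frequent characters look plausible for the language.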