=Langa - A language analyzer
Langa was created within a few weeks, after it was suggested to me that language
recognition would be a fine extension to Lingo (see http://www.lex-lingo.de).
The basic idea of how language recognition could be done was born within a few
minutes, so a proof of concept was needed. Langa is the proof that this
concept works, with minor limitations.
==Concept of Language Recognition
Every language has its own characteristic usage of characters. This shows in
the set of characters used, the frequency of each character, the proportion
of consonants, vowels and special characters, and the occurrence of
language-specific high-frequency words.
For now, Langa concentrates on the first two aspects, character set and
frequency. To do so, Langa processes a text file and extracts a
language-specific fingerprint. Fingerprints are comparable: you can measure
the distance from a file's fingerprint to several reference fingerprints and
declare the one with the shortest distance the match.
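
The idea can be illustrated with a minimal sketch. It is not Langa's actual implementation: the helper names (fingerprint, distance, closest) and the choice of a sum of absolute frequency differences as the distance are assumptions made for illustration only.

```ruby
# Hypothetical helpers for illustration; they are not part of Langa's API.

# Count how often each letter occurs and return relative frequencies.
def fingerprint(text)
  counts = Hash.new(0)
  text.downcase.each_char { |ch| counts[ch] += 1 if ch =~ /\p{L}/ }
  total = counts.values.sum.to_f
  return {} if total.zero?
  counts.transform_values { |n| n / total }
end

# Assumed metric: sum of absolute frequency differences over all
# characters that appear in either fingerprint.
def distance(a, b)
  (a.keys | b.keys).sum { |ch| (a.fetch(ch, 0.0) - b.fetch(ch, 0.0)).abs }
end

# Return the key of the reference fingerprint closest to the sample.
def closest(sample, references)
  references.min_by { |_, ref| distance(sample, ref) }.first
end

# Tiny reference texts; real references should be built from large files.
references = {
  'eng' => fingerprint('the quick brown fox jumps over the lazy dog and then rests'),
  'deu' => fingerprint('der schnelle braune fuchs springt ueber den faulen hund')
}
puts closest(fingerprint('the dog jumps over the fox'), references)
```

With reference texts this short the result is not reliable, which is why large reference files matter.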
==Langa.dna - The language fingerprints
To compare the fingerprint of a given file with the fingerprints of
several languages, we need these fingerprints first. So how do we get them?
The easiest way is to take a large file in a given language, compute the
fingerprint for that file and use this fingerprint as the reference for the
language. The first source for large language files was the 'Wortschatz' from
the University of Leipzig, Germany (see http://corpora.informatik.uni-leipzig.de/).
It provides 18 languages as good-quality text files that are large enough for our purposes.
The second source is the Unbound Bible (see http://www.unboundbible.org/), where
the Bible is available in several languages (see examples/).
==Quick Start
===User's View
To see how Langa works from a user's point of view, call the following from the langa directory:
  % bin/langa examples/*
  examples/afrikaans_1953_utf8.txt............................Language is Afrikaans (afk)
  examples/albanian_utf8.txt..................................Language is Albanian (sqi)
  ...
  examples/wolof_utf8.txt.....................................Language is Wolof (wol)
  examples/xhosa_utf8.txt.....................................Language is Xhosa (xho)
===Developer's View
As a developer who wants to find out what language a file contains, call

  require 'langa'
  
  # => locate langa.dna
  this_path = File.dirname(__FILE__)
  langa_dna = File.join(this_path, '..', 'lib', 'langa', 'langa.dna')

  # => process (file and codepage must be set by the caller beforehand)
  la = LanguageAnalyzer.new(langa_dna)
  lang = la.analyze(file, codepage)
  puts 'Language is %s (%s)' % [la.config(lang)['name'], lang]

See documentation for details.
==Add a new language
If you want to add a new language, proceed as follows:
  - Find a text file that contains lots of written sentences in the desired language.
    The bigger the file, the better the results. Let's call it, e.g., language.txt
  - Call langa from the command line with
    % bin/langa --dna language.txt
    please be patient, analyzing takes some time...
    <iso 639-3 code>:
        name:   <full language name>
        iso1:   <iso 639-1 code (optional)>
        source: examples/asv_utf8.txt
        size:   142256
        utf8:   eathondsirlmfuwbcygvpkjzxq
        fingerprint:    101-12616+97-10560+116-8889+104-8721+111-7544+110-7313+100-6081+115-5491+105-5355+114-4844+108-3465+109-2661+102-2396+117-2164+119-2016+98-1755+99-1664+121-1663+103-1587+118-1117+112-954+107-665+106-353+122-52+120-46+113-14
    % 
    
    Now paste the output (without the 'please be patient' message) into the langa.dna file.
    Replace the '<...>' strings with the correct values from the ISO 639-3 standard (see http://www.sil.org/iso639-3/codes.asp?order=reference_name&letter=a).
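
Judging from the sample output above, the fingerprint line appears to be a list of 'code point-count' pairs joined by '+': 101 is the code point of 'e', which is also the first character of the utf8 line. Under that assumption, the string can be decoded with a short sketch like the following. The helper parse_fingerprint is hypothetical and not part of Langa.

```ruby
# Hypothetical helper, not part of Langa. Assumes the fingerprint format
# "codepoint-count+codepoint-count+...", as inferred from the sample output.
def parse_fingerprint(str)
  str.split('+').map do |pair|
    codepoint, count = pair.split('-').map { |n| Integer(n) }
    # Turn the numeric code point back into a UTF-8 character.
    [[codepoint].pack('U'), count]
  end.to_h
end

# First three pairs of the fingerprint shown above.
counts = parse_fingerprint('101-12616+97-10560+116-8889')
counts.each { |char, count| puts "#{char}: #{count}" }
# prints:
#   e: 12616
#   a: 10560
#   t: 8889
```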