# What?
Given a set of strings in different languages, build a detector for the majority language (often, but not necessarily, English). The algorithm is described in [this blog post](http://blog.echen.me/2011/05/01/unsupervised-language-detection-algorithms/).
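To give a feel for what the `:ngram_size => 3` option in the example refers to, here is a minimal sketch of splitting text into overlapping character trigrams and counting them. It assumes the detector works on character n-grams; the helpers `char_ngrams` and `ngram_counts` are hypothetical names for illustration and are not part of this gem's API.

```ruby
# Hypothetical helper (not part of the gem): split a string into
# overlapping character n-grams after lowercasing and trimming it.
def char_ngrams(text, n = 3)
  chars = text.downcase.strip.chars
  return [] if chars.length < n
  chars.each_cons(n).map(&:join)
end

# Hypothetical helper (not part of the gem): count how often each
# n-gram occurs across a list of sentences.
def ngram_counts(sentences, n = 3)
  counts = Hash.new(0)
  sentences.each do |sentence|
    char_ngrams(sentence, n).each { |gram| counts[gram] += 1 }
  end
  counts
end

counts = ngram_counts(["the cat", "the hat"])
puts counts["the"]  # "the" occurs once in each sentence
```

Counts like these are the kind of raw statistic an n-gram language model can be built from; how the gem actually uses them is covered in the blog post linked above.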

# Example

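	# Train a detector with an n-gram size of 3 on the training sentences.
	training_sentences = File.readlines("datasets/gutenberg-training.txt")
	detector = LanguageDetector.new(:ngram_size => 3)
	detector.train(30, training_sentences)

	puts "Testing on English sentences..."
	true_english = 0   # English lines classified as the majority language
	false_spanish = 0  # English lines not classified as the majority language
	IO.foreach("datasets/gutenberg-test-en.txt") do |line|
	  next if line.strip.empty?
	  if detector.classify(line) == "majority"
	    true_english += 1
	  else
	    # Print each misclassified line for inspection.
	    puts line
	    false_spanish += 1
	  end
	end
	puts false_spanish
	puts true_english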
	
![Example](https://img.skitch.com/20110303-qfrnb8gstgheh4xech4iutfskd.jpg)

# Demo
See a demo [here](http://babel-fett.heroku.com).

Source: `README.md` from the `unsupervised-language-detection` gem, version 0.0.1.