# What?
Given a set of strings from different languages, build a detector for the majority language (often, but not necessarily, English). The algorithm is described in more detail [here](http://blog.echen.me/2011/05/01/unsupervised-language-detection-algorithms/).

# Example

	# Train on a mixed-language file; the detector learns which language is the majority.
	training_sentences = File.readlines("datasets/gutenberg-training.txt")
	detector = LanguageDetector.new(:ngram_size => 3)
	detector.train(30, training_sentences)

	puts "Testing on English sentences..."
	true_english = 0   # English lines classified as "majority"
	false_spanish = 0  # English lines classified as anything else
	IO.foreach("datasets/gutenberg-test-en.txt") do |line|
	  next if line.strip.empty?
	  if detector.classify(line) == "majority"
	    true_english += 1
	  else
	    puts line  # show each misclassified sentence
	    false_spanish += 1
	  end
	end
	puts false_spanish
	puts true_english
	
![Example](https://img.skitch.com/20110303-qfrnb8gstgheh4xech4iutfskd.jpg)
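
The `:ngram_size => 3` option above suggests that the detector works on short character sequences. As a rough illustration of what such features could look like (this is not the gem's own code), the sketch below splits a string into overlapping character trigrams and counts them. The helper names `char_ngrams` and `ngram_counts` are hypothetical, and lowercasing the input is an assumption made for this example.

```ruby
# Hypothetical helper: return every overlapping character n-gram of +text+.
# Lowercasing and stripping are assumptions, not documented gem behavior.
def char_ngrams(text, n = 3)
  text.downcase.strip.chars.each_cons(n).map(&:join)
end

# Hypothetical helper: count how often each n-gram occurs in +text+.
def ngram_counts(text, n = 3)
  char_ngrams(text, n).each_with_object(Hash.new(0)) do |gram, counts|
    counts[gram] += 1
  end
end

p char_ngrams("Hello")    # => ["hel", "ell", "llo"]
p ngram_counts("banana")  # => {"ban"=>1, "ana"=>2, "nan"=>1}
```

Strings shorter than `n` characters simply produce no n-grams, so very short inputs carry no signal in this scheme.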

# Using the Gem

Install the gem:

	gem install unsupervised-language-detection

Then require it and classify a string:

	require 'rubygems'
	require 'unsupervised-language-detection'

	UnsupervisedLanguageDetection.is_english_tweet?("I am an English sentence.") # => true
	UnsupervisedLanguageDetection.is_english_tweet?("Hola, me llamo Edwin.") # => false
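
A yes/no predicate like this is often used to filter a batch of strings. The sketch below shows one way that might look. `english_only` is a hypothetical helper, not part of the gem, and it takes the predicate as an argument so that it can be tried without the gem installed.

```ruby
# Hypothetical helper: keep only the non-blank strings accepted by +predicate+.
# +predicate+ is any object that responds to #call and returns true or false.
def english_only(texts, predicate)
  texts.reject { |text| text.strip.empty? }
       .select { |text| predicate.call(text) }
end

# With the gem loaded, the predicate could wrap the call shown above:
#   predicate = ->(text) { UnsupervisedLanguageDetection.is_english_tweet?(text) }
#   english_only(tweets, predicate)
```

Passing the predicate in also makes it easy to swap in a stub when testing code that depends on the filter.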
	
# Demo
See a demo [here](http://babel-fett.heroku.com).
