Sha256: 7a27ea88e967573cb81c35e6f67adcae1331f5c6c28e89dd32e646b2a567c774
Contents?: true
Size: 1.1 KB
Versions: 6
Compression:
Stored size: 1.1 KB
Contents
require_relative './lib/unsupervised-language-detection/language-detector'

# Train on a mix of English and Spanish sentences, pulled from Project Gutenberg text.
training_sentences = File.readlines("datasets/gutenberg-training.txt")
detector = LanguageDetector.new(:ngram_size => 3)
detector.train(30, training_sentences)

# See how well we can classify English text (sentences from a different
# Project Gutenberg text, not the one we trained on).
puts "Testing on English sentences..."
true_english = 0
false_spanish = 0
IO.foreach("datasets/gutenberg-test-en.txt") do |line|
  next if line.strip.empty?
  if detector.classify(line) == "majority"
    true_english += 1
  else
    puts line
    false_spanish += 1
  end
end
puts false_spanish
puts true_english

# See how well we can classify Spanish text.
puts
puts "Testing on Spanish sentences..."
true_spanish = 0
false_english = 0
IO.foreach("datasets/gutenberg-test-sp.txt") do |line|
  next if line.strip.empty?
  if detector.classify(line) == "majority"
    puts line
    false_english += 1
  else
    true_spanish += 1
  end
end
puts false_english
puts true_spanish
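
The two test loops above differ only in the input file and in which label counts as correct, and they print raw counts rather than a rate. A minimal sketch of how that tally could be factored out is shown below. The `evaluate` helper, the `"minority"` label, and the stub classifier are hypothetical and are not part of the gem; the stub stands in for `detector.classify` so the example runs without the training data. Like the script, it assumes English ends up as the "majority" cluster.

```ruby
# Hypothetical helper (not part of the gem): counts how often a classifier
# returns the expected label, skipping blank lines as the script above does.
# `classify` is any callable that takes a line and returns a label string.
def evaluate(lines, expected_label, classify)
  correct = 0
  misses = []
  lines.each do |line|
    next if line.strip.empty?
    if classify.call(line) == expected_label
      correct += 1
    else
      misses << line
    end
  end
  total = correct + misses.size
  accuracy = total.zero? ? 0.0 : correct.to_f / total
  { correct: correct, misses: misses, accuracy: accuracy }
end

# Stand-in classifier for illustration only; it is NOT LanguageDetector.
# The "minority" label is an assumed placeholder for the non-majority cluster.
stub = ->(line) { line.include?(" the ") ? "majority" : "minority" }

result = evaluate(["I saw the dog.\n", "\n", "Vi el perro.\n"], "majority", stub)
puts format("%d correct, %d missed, accuracy %.2f",
            result[:correct], result[:misses].size, result[:accuracy])
```

With a trained detector, the same helper could presumably be called as `evaluate(File.readlines("datasets/gutenberg-test-en.txt"), "majority", detector.method(:classify))`, assuming `classify` accepts a single line as it does in the script.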
Version data entries
6 entries across 6 versions & 1 rubygems