Sha256: 13927f2d1a5043ab967dcd0b05ac95867bb335a9a2734104e170feb0b97f5611
Contents?: true
Size: 515 Bytes
Versions: 2
Compression:
Stored size: 515 Bytes
Contents
require 'rubygems'
require 'open-uri'
require 'hpricot'

# Grab the first 2000 stories from twssstories.com (10 per page)

f = File.open(File.expand_path("../../data/twss.txt", __FILE__), "w")
domain = "http://twssstories.com"

200.times do |i|
  url = domain + "/node?page=#{i}"
  puts url
  doc = Hpricot(open(url).read)
  doc.search('div.content p') do |story|
    # now pull out the good stuff...
    if story.to_plain_text =~ /\"(.*)?\"/
      f.puts $1
    end
  end
  f.flush
  sleep rand * 3.0
end

f.close
Version data entries
2 entries across 2 versions & 1 rubygems
Version | Path |
---|---|
twss-0.0.5 | script/collect_twss.rb |
twss-0.0.4 | script/collect_twss.rb |