Sha256: 13927f2d1a5043ab967dcd0b05ac95867bb335a9a2734104e170feb0b97f5611

Contents?: true

Size: 515 Bytes

Versions: 2

Compression:

Stored size: 515 Bytes

Contents

require 'rubygems'
require 'open-uri'
require 'hpricot'

# Grab the first 2000 stories from twssstories.com (10 per page)

f = File.open(File.expand_path("../../data/twss.txt", __FILE__), "w")

domain = "http://twssstories.com"
200.times do |i|
  url = domain + "/node?page=#{i}"
  puts url
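  # URI.open (open-uri, Ruby 2.5+) is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs
  doc = Hpricot(URI.open(url).read)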
  doc.search('div.content p') do |story|
    # now pull out the good stuff: the text between the first and last double quote
    if story.to_plain_text =~ /"(.*)"/
      f.puts $1
    end
  end
  f.flush
  # pause for a random 0 to 3 seconds between pages to go easy on the server
  sleep rand * 3.0
end

f.close
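
The quote-extraction step in the script can be tried out without any network access. The following is a minimal sketch, assuming a greedy pattern that behaves like the one in the script; `QUOTE_PATTERN`, `extract_quote`, and the sample strings are hypothetical names and data introduced here for illustration and are not part of the twss gem.

```ruby
# Hypothetical helper, not part of the twss gem: isolates the matching step
# so it can be exercised on plain strings instead of scraped pages.
QUOTE_PATTERN = /"(.*)"/ # greedy: spans from the first to the last double quote

# Returns the captured text, or nil when the input has no quoted passage.
def extract_quote(text)
  match = QUOTE_PATTERN.match(text)
  match && match[1]
end

# Made-up sample inputs, for illustration only.
samples = [
  'She said "that is huge" and left.',
  'He said "one" and then "two".',
  'No quotation marks here.'
]

samples.each { |text| p extract_quote(text) }
```

Because the match is greedy, a paragraph containing several quoted passages yields everything from the first quotation mark to the last one, including the text in between. If only the first quoted passage is wanted, a non-greedy group such as `(.*?)` would be the usual alternative.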

Version data entries

2 entries across 2 versions and 1 rubygem

Version Path
twss-0.0.5 script/collect_twss.rb
twss-0.0.4 script/collect_twss.rb