Sha256: 2025d7ea2fa62a1f7cedfe8311dc6fedc27ee73a26d4588c4a06f2b43cbc164c
Contents?: true
Size: 737 Bytes
Versions: 1
Compression:
Stored size: 737 Bytes
Contents
require 'nokogiri'
require 'tjcrawler/page'

module Tjcrawler
  class Parser
    # The block receives a Nokogiri document and must return true/false
    # for successful/failed parsing.
    def initialize(&block)
      raise ArgumentError, 'Block required' unless block_given?
      @strategy = block
    end

    def parse(content)
      doc = Nokogiri::HTML(content)
      ret = @strategy[doc]
      print '.'
      ret
    end

    # Polls forever: waits for the next page, then marks it parsed on success.
    def start
      loop do
        sleep 1 until (page = find_next)
        page.touch(:parsed_at) if parse(page.content)
      end
    end

    private

    # Next crawled page that was never parsed or was last parsed over a day
    # ago; never-parsed pages come first, then the oldest parsed_at.
    def find_next
      Page.where('crawled_at IS NOT NULL AND (parsed_at IS NULL OR parsed_at < ?)', 1.day.ago)
          .order('parsed_at IS NOT NULL, parsed_at')
          .first
    end
  end
end
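The selection rule in `find_next` is easy to misread, so here is a minimal plain-Ruby sketch of the same filter and ordering. It is an illustration only, not part of the gem: `PageRow` and `next_page` are hypothetical names, the rows live in memory instead of a database, and it assumes the database sorts the boolean `parsed_at IS NOT NULL` with false before true.

```ruby
# Hypothetical in-memory stand-in for the Page model (not part of the gem).
PageRow = Struct.new(:id, :crawled_at, :parsed_at)

# Mirrors the WHERE and ORDER BY clauses of find_next in plain Ruby.
def next_page(rows, now: Time.now)
  cutoff = now - 24 * 60 * 60 # plain-Ruby equivalent of 1.day.ago
  candidates = rows.select do |row|
    row.crawled_at && (row.parsed_at.nil? || row.parsed_at < cutoff)
  end
  # Never-parsed rows sort first (0); parsed rows follow, oldest parsed_at first.
  candidates.min_by { |row| row.parsed_at ? [1, row.parsed_at] : [0, Time.at(0)] }
end

now = Time.utc(2024, 1, 10)
rows = [
  PageRow.new(1, now - 3600, now - 2 * 86_400), # parsed two days ago
  PageRow.new(2, now - 3600, nil),              # crawled, never parsed
  PageRow.new(3, nil, nil),                     # not crawled yet
  PageRow.new(4, now - 3600, now - 3600)        # parsed an hour ago
]
puts next_page(rows, now: now).id
```

With these sample rows, row 2 is chosen first. Row 1 becomes eligible only once no never-parsed rows remain, and rows 3 and 4 are skipped because they are uncrawled and recently parsed, respectively.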
Version data entries
1 entry across 1 version & 1 rubygem
Version | Path |
---|---|
tjcrawler-0.0.1 | lib/tjcrawler/parser.rb |