Sha256: 2025d7ea2fa62a1f7cedfe8311dc6fedc27ee73a26d4588c4a06f2b43cbc164c

Contents?: true

Size: 737 Bytes

Versions: 1

Compression:

Stored size: 737 Bytes

Contents

require 'nokogiri'
require 'tjcrawler/page'

module Tjcrawler
  class Parser
    # The block is called with a Nokogiri document and should return
    # true/false for successful/failed parsing.
    def initialize &block
      raise ArgumentError, 'Block required' unless block_given?
      @strategy = block
    end

    def parse content
      doc = Nokogiri::HTML(content)
      ret = @strategy[doc]
      print '.' # progress marker, one dot per parsed page
      ret
    end

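    # Runs forever: polls once a second until a page is due for parsing,
    # then marks the page as parsed only if the strategy block succeeded.
    def start
      loop do
        sleep 1 until page = find_next
        page.touch(:parsed_at) if parse(page.content)
      end
    end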

    private

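    # Next crawled page that was never parsed or was last parsed more than
    # a day ago; never-parsed pages come first, then the oldest parsed_at.
    def find_next
      Page.where('crawled_at IS NOT NULL AND (parsed_at IS NULL OR parsed_at < ?)', 1.day.ago)
          .order('parsed_at IS NOT NULL, parsed_at')
          .first
    end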
  end
end
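
The block-as-strategy shape used by Parser above can be tried in isolation with a minimal sketch. SketchParser is a hypothetical stand-in, not part of tjcrawler: it hands the raw string to the block instead of a Nokogiri document so that it runs without any gems, and it assumes a missing block should raise ArgumentError.

```ruby
# Hypothetical stand-in for Tjcrawler::Parser. It keeps the same idea
# (store the block, call it once per document) but passes the raw string
# to the block instead of a Nokogiri document, so no gems are needed.
class SketchParser
  def initialize(&block)
    # Assumption: a missing block is a programming error.
    raise ArgumentError, 'Block required' unless block
    @strategy = block
  end

  # Returns whatever the block returns (truthy means "parsed successfully").
  def parse(content)
    @strategy.call(content)
  end
end

# The strategy here only checks for a <title> tag.
title_parser = SketchParser.new { |html| html.include?('<title>') }

ok     = title_parser.parse('<html><title>Hi</title></html>')
failed = title_parser.parse('<html></html>')

# Constructing the parser without a block fails immediately.
missing_block_error =
  begin
    SketchParser.new
  rescue ArgumentError => e
    e.message
  end
```

With the real class, the block would receive the document built by Nokogiri::HTML(content), and start would touch parsed_at only when the block returns a truthy value.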

Version data entries

1 entries across 1 versions & 1 rubygems

Version Path
tjcrawler-0.0.1 lib/tjcrawler/parser.rb