RegexpCrawler
============

RegexpCrawler is a crawler that uses regular expressions to extract data from web pages.


Install
=======

gem sources -a http://gems.github.com
gem install flyerhzm-regexp_crawler


Usage
=====

>> crawler = RegexpCrawler::Crawler.new(
     :start_page => "http://www.tijee.com/tags/64-google-face-questions/posts",
     :continue_regexp => %r{"(/posts/\d+-[^#]*?)"},
     :capture_regexp => %r{<h2 class='title'><a.*?>(.*?)</a></h2>.*?<div class='body'>(.*?)</div>}m,
     :named_captures => ['title', 'body'],
     :model => Post)
>> crawler.start

=>[{:page=>"http://www.tijee.com/posts/327-google-face-questions-many-companies-will-ask-oh", :model=>#<Post id: nil, title: "Google面试题(很多公司都会问的哦)", body: "\n内容摘要:几星期前,一个朋友接受...", created_at: nil, updated_at: nil, verify: false>}, {:page=>"http://www.tijee.com/posts/328-java-surface-together-with-the-google-test", :model=>#<Post id: nil, title: "google的一道JAVA面试题", body: "\n内容摘要:有一个整数n,写一个函数f(n...", created_at: nil, updated_at: nil, verify: false>}]
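
To see what the two regular expressions in the example match, the following sketch applies them to a small HTML string using only plain Ruby (`String#scan` and `Regexp#match`). It does not use the gem and is not a description of its internals. The sample HTML is hypothetical, and the assumption that each name in :named_captures pairs with the capture group at the same position is inferred from the output above.

```ruby
# Hypothetical sample page; nothing is fetched from the network.
html = <<~HTML
  <a href="/posts/327-first-post">first</a>
  <a href="/posts/328-second-post">second</a>
  <h2 class='title'><a href="/posts/327-first-post">Hello</a></h2>
  <div class='body'>
  World</div>
HTML

# Same patterns as in the usage example above.
continue_regexp = %r{"(/posts/\d+-[^#]*?)"}
capture_regexp  = %r{<h2 class='title'><a.*?>(.*?)</a></h2>.*?<div class='body'>(.*?)</div>}m
named_captures  = ['title', 'body']

# Candidate links to follow: the first capture group of every match.
links = html.scan(continue_regexp).flatten.uniq

# Captured data: pair each name with the capture group at the same
# position (assumed mapping, not taken from the gem's source).
match  = capture_regexp.match(html)
result = match ? Hash[named_captures.zip(match.captures)] : {}

puts links.inspect
puts result.inspect
```

The `m` flag on the capture pattern lets `.` match newlines, which is why the body capture can span several lines and begins with "\n", as in the output above.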

Version: flyerhzm-regexp_crawler-0.2.0 (README)