Sha256: 70e7a145351b2ed7623df71e4f568fd5307829cdbf083603620e9d1b3e4ef59c
Contents?: true
Size: 1.37 KB
Versions: 2
Compression:
Stored size: 1.37 KB
Contents
= Anemone

Anemone is a web spider framework that can spider a domain and collect useful
information about the pages it visits. It is versatile, allowing you to write
your own specialized spider tasks quickly and easily.

See http://anemone.rubyforge.org for more information.

This branch of Anemone, sutch-anemone, has been enhanced for {wmonk}[https://github.com/sutch/wmonk].

== Features

* Multi-threaded design for high performance
* Tracks 301 HTTP redirects
* Built-in BFS algorithm for determining page depth
* Allows exclusion of URLs based on regular expressions
* Choose the links to follow on each page with focus_crawl()
* HTTPS support
* Records response time for each page
* CLI program can list all pages in a domain, calculate page depths, and more
* Obey robots.txt
* In-memory or persistent storage of pages during crawl, using TokyoCabinet, SQLite3, MongoDB, or Redis

== Examples

See the scripts under the <tt>lib/anemone/cli</tt> directory for examples of several useful Anemone tasks.

== Requirements

* nokogiri
* robots

== Development

To test and develop this gem, additional requirements are:

* rspec
* fakeweb
* tokyocabinet
* kyotocabinet-ruby
* mongo
* redis
* sqlite3

You will need to have KyotoCabinet, {Tokyo Cabinet}[http://fallabs.com/tokyocabinet/], {MongoDB}[http://www.mongodb.org/], and {Redis}[http://code.google.com/p/redis/] installed on your system and running.
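
Two of the features listed above, breadth-first page depth and regex-based URL exclusion, can be illustrated without any network access. The following is a minimal sketch in plain Ruby, not Anemone's own implementation: the <tt>LINKS</tt> graph, the <tt>page_depths</tt> method, and the example URLs are all hypothetical and stand in for pages a real crawl would fetch.

```ruby
# Hypothetical in-memory link graph standing in for fetched pages:
# each key is a URL, each value is the list of links found on that page.
LINKS = {
  "http://example.com/"      => ["http://example.com/a", "http://example.com/b", "http://example.com/login"],
  "http://example.com/a"     => ["http://example.com/c", "http://example.com/"],
  "http://example.com/b"     => ["http://example.com/c"],
  "http://example.com/c"     => ["http://example.com/d.pdf"],
  "http://example.com/login" => ["http://example.com/secret"]
}.freeze

# Breadth-first traversal from +start+. Returns a hash of URL => depth,
# where depth is the smallest number of links needed to reach the URL.
# Any link matching one of +skip_patterns+ is never queued or recorded.
def page_depths(links, start, skip_patterns = [])
  depths = { start => 0 }
  queue = [start]
  until queue.empty?
    url = queue.shift
    links.fetch(url, []).each do |link|
      next if depths.key?(link) # already reached at an equal or smaller depth
      next if skip_patterns.any? { |pattern| link =~ pattern }
      depths[link] = depths[url] + 1
      queue << link
    end
  end
  depths
end

# Exclude login pages and PDF files (illustrative patterns only).
DEPTHS = page_depths(LINKS, "http://example.com/", [/login/, /\.pdf\z/])
DEPTHS.each { |url, depth| puts "#{depth} #{url}" }
```

Because the queue is processed first-in, first-out, the first time a URL is reached is along a shortest link path, so the recorded depth is minimal. Pages that are only reachable through an excluded URL (here the hypothetical <tt>/secret</tt> page behind <tt>/login</tt>) are never visited.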
Version data entries
2 entries across 2 versions & 1 rubygems
Version | Path |
---|---|
sutch-anemone-0.7.2.2 | README.rdoc |
sutch-anemone-0.7.2.1 | README.rdoc |