= Anemone

Anemone is a web spider framework that can spider a domain and collect useful
information about the pages it visits. It is versatile, allowing you to
write your own specialized spider tasks quickly and easily.

See http://anemone.rubyforge.org for more information.

This branch of Anemone, sutch-anemone, has been enhanced for {wmonk}[https://github.com/sutch/wmonk].

== Features
* Multi-threaded design for high performance
* Tracks 301 HTTP redirects
* Built-in BFS algorithm for determining page depth
* Allows exclusion of URLs based on regular expressions
* Choose the links to follow on each page with focus_crawl()
* HTTPS support
* Records response time for each page
* CLI program can list all pages in a domain, calculate page depths, and more
* Obeys robots.txt
* In-memory or persistent storage of pages during crawl, using TokyoCabinet, SQLite3, MongoDB, or Redis
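As a rough illustration of two of the features above (BFS page depth and regular-expression exclusions), here is a minimal sketch in plain Ruby. It is not Anemone's implementation: the +LINKS+ graph and the +page_depths+ helper are hypothetical, and an in-memory hash stands in for pages that a real crawl would fetch over HTTP.

```ruby
# Hypothetical in-memory link graph standing in for fetched pages:
# each key is a URL, each value is the list of URLs that page links to.
LINKS = {
  'http://example.com/'       => ['http://example.com/about', 'http://example.com/blog'],
  'http://example.com/about'  => ['http://example.com/'],
  'http://example.com/blog'   => ['http://example.com/blog/1', 'http://example.com/login'],
  'http://example.com/blog/1' => ['http://example.com/about'],
  'http://example.com/login'  => ['http://example.com/admin']
}.freeze

# Breadth-first traversal: the first time a URL is reached is via a
# shortest path from the root, so that hop count is recorded as its depth.
# URLs matching any of skip_patterns are never queued or recorded.
def page_depths(root, links, skip_patterns = [])
  depths = { root => 0 }
  queue = [root]
  until queue.empty?
    url = queue.shift
    links.fetch(url, []).each do |link|
      next if depths.key?(link)                               # already seen
      next if skip_patterns.any? { |pattern| link =~ pattern } # excluded
      depths[link] = depths[url] + 1
      queue << link
    end
  end
  depths
end
```

Calling <tt>page_depths('http://example.com/', LINKS, [/login/])</tt> leaves out the login page and, because the admin page is only linked from it, the admin page as well.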

== Examples
See the scripts under the <tt>lib/anemone/cli</tt> directory for examples of several useful Anemone tasks.
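For orientation, a small crawl might look like the sketch below. It assumes this branch is installed and still exposes upstream Anemone's API under <tt>require 'anemone'</tt> (<tt>Anemone.crawl</tt>, +skip_links_like+, +focus_crawl+, +on_every_page+, and the <tt>:obey_robots_txt</tt> and <tt>:depth_limit</tt> options). The +print_site_report+ method name, the URL patterns, and the limits are made up for illustration.

```ruby
# Sketch only: assumes the gem is installed and keeps upstream Anemone's API.
# print_site_report is a hypothetical helper, not part of Anemone.
def print_site_report(root_url)
  require 'anemone' # loaded lazily, so defining this method needs no gem

  Anemone.crawl(root_url, :obey_robots_txt => true, :depth_limit => 3) do |anemone|
    # Exclude URLs matching these regular expressions.
    anemone.skip_links_like(/\/login/, /\.pdf$/)

    # Follow at most the first ten links found on each page.
    anemone.focus_crawl { |page| page.links.first(10) }

    # Print status code, depth, response time, and URL for every page.
    anemone.on_every_page do |page|
      puts "#{page.code} depth=#{page.depth} time=#{page.response_time} #{page.url}"
    end
  end
end
```

With the gem installed, <tt>print_site_report('http://www.example.com/')</tt> would crawl the site and print one line per page.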

== Requirements
* nokogiri
* robots

== Development
To test and develop this gem, you also need the following gems:
* rspec
* fakeweb
* tokyocabinet
* kyotocabinet-ruby
* mongo
* redis
* sqlite3

You will also need Kyoto Cabinet and {Tokyo Cabinet}[http://fallabs.com/tokyocabinet/] installed on your system, and {MongoDB}[http://www.mongodb.org/] and {Redis}[http://code.google.com/p/redis/] installed and running.
