README.md in spidr-0.6.1 vs README.md in spidr-0.7.0
- old
+ new
@@ -1,13 +1,13 @@
# Spidr
+[![CI](https://github.com/postmodern/spidr/actions/workflows/ruby.yml/badge.svg)](https://github.com/postmodern/spidr/actions/workflows/ruby.yml)
+
* [Homepage](https://github.com/postmodern/spidr#readme)
* [Source](https://github.com/postmodern/spidr)
* [Issues](https://github.com/postmodern/spidr/issues)
* [Mailing List](http://groups.google.com/group/spidr)
-* [IRC](http://webchat.freenode.net/?channels=spidr&uio=d4)
-* [![Build Status](https://travis-ci.org/postmodern/spidr.svg)](https://travis-ci.org/postmodern/spidr)
## Description
Spidr is a versatile Ruby web spidering library that can spider a site,
multiple domains, certain links, or crawl indefinitely. Spidr is designed to be fast
@@ -47,153 +47,212 @@
## Examples
Start spidering from a URL:
- Spidr.start_at('http://tenderlovemaking.com/')
+```ruby
+Spidr.start_at('http://tenderlovemaking.com/') do |agent|
+ # ...
+end
+```
Spider a host:
- Spidr.host('solnic.eu')
+```ruby
+Spidr.host('solnic.eu') do |agent|
+ # ...
+end
+```
+Spider a domain (and any sub-domains):
+
+```ruby
+Spidr.domain('ruby-lang.org') do |agent|
+ # ...
+end
+```
+
Spider a site:
- Spidr.site('http://www.rubyflow.com/')
+```ruby
+Spidr.site('http://www.rubyflow.com/') do |agent|
+ # ...
+end
+```
Spider multiple hosts:
- Spidr.start_at(
- 'http://company.com/',
- hosts: [
- 'company.com',
- /host[\d]+\.company\.com/
- ]
- )
+```ruby
+Spidr.start_at(
+  'http://company.com/',
+  hosts: ['company.com', /host[\d]+\.company\.com/]
+) do |agent|
+  # ...
+end
+```
Do not spider certain links:
- Spidr.site('http://company.com/', ignore_links: [%{^/blog/}])
+```ruby
+Spidr.site('http://company.com/', ignore_links: [%r{^/blog/}]) do |agent|
+ # ...
+end
+```
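+
+(`%r{^/blog/}` is a Regexp literal; a plain String pattern would have to match
+a link exactly, so regular expressions are the usual choice here.)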
Do not spider links on certain ports:
- Spidr.site('http://company.com/', ignore_ports: [8000, 8010, 8080])
+```ruby
+Spidr.site('http://company.com/', ignore_ports: [8000, 8010, 8080]) do |agent|
+ # ...
+end
+```
Do not spider links disallowed by robots.txt:
- Spidr.site(
- 'http://company.com/',
- robots: true
- )
+```ruby
+Spidr.site('http://company.com/', robots: true) do |agent|
+ # ...
+end
+```
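+
+Note: the `robots:` option relies on the optional
+[robots](https://rubygems.org/gems/robots) gem, which must be installed
+separately.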
Print out visited URLs:
- Spidr.site('http://www.rubyinside.com/') do |spider|
- spider.every_url { |url| puts url }
- end
+```ruby
+Spidr.site('http://www.rubyinside.com/') do |spider|
+ spider.every_url { |url| puts url }
+end
+```
Build a URL map of a site:
- url_map = Hash.new { |hash,key| hash[key] = [] }
+```ruby
+url_map = Hash.new { |hash,key| hash[key] = [] }
- Spidr.site('http://intranet.com/') do |spider|
- spider.every_link do |origin,dest|
- url_map[dest] << origin
- end
- end
+Spidr.site('http://intranet.com/') do |spider|
+ spider.every_link do |origin,dest|
+ url_map[dest] << origin
+ end
+end
+```
Print out the URLs that could not be requested:
- Spidr.site('http://company.com/') do |spider|
- spider.every_failed_url { |url| puts url }
- end
+```ruby
+Spidr.site('http://company.com/') do |spider|
+ spider.every_failed_url { |url| puts url }
+end
+```
Find all pages which have broken links:
- url_map = Hash.new { |hash,key| hash[key] = [] }
+```ruby
+url_map = Hash.new { |hash,key| hash[key] = [] }
- spider = Spidr.site('http://intranet.com/') do |spider|
- spider.every_link do |origin,dest|
- url_map[dest] << origin
- end
- end
+spider = Spidr.site('http://intranet.com/') do |spider|
+ spider.every_link do |origin,dest|
+ url_map[dest] << origin
+ end
+end
- spider.failures.each do |url|
- puts "Broken link #{url} found in:"
+spider.failures.each do |url|
+ puts "Broken link #{url} found in:"
- url_map[url].each { |page| puts " #{page}" }
- end
+ url_map[url].each { |page| puts " #{page}" }
+end
+```
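+
+(`spider.failures` is the set of URLs that could not be requested; looking
+each one up in the link map yields the pages that referenced it.)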
Search HTML and XML pages:
- Spidr.site('http://company.com/') do |spider|
- spider.every_page do |page|
- puts ">>> #{page.url}"
+```ruby
+Spidr.site('http://company.com/') do |spider|
+ spider.every_page do |page|
+ puts ">>> #{page.url}"
- page.search('//meta').each do |meta|
- name = (meta.attributes['name'] || meta.attributes['http-equiv'])
- value = meta.attributes['content']
+ page.search('//meta').each do |meta|
+ name = (meta.attributes['name'] || meta.attributes['http-equiv'])
+ value = meta.attributes['content']
- puts " #{name} = #{value}"
- end
- end
+ puts " #{name} = #{value}"
end
+ end
+end
+```
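+
+`page.search` accepts XPath and CSS queries; HTML and XML pages are parsed
+with [Nokogiri](https://nokogiri.org/).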
Print out the titles from every page:
- Spidr.site('https://www.ruby-lang.org/') do |spider|
- spider.every_html_page do |page|
- puts page.title
- end
- end
+```ruby
+Spidr.site('https://www.ruby-lang.org/') do |spider|
+ spider.every_html_page do |page|
+ puts page.title
+ end
+end
+```
+Print out every HTTP redirect:
+
+```ruby
+Spidr.host('company.com') do |spider|
+ spider.every_redirect_page do |page|
+ puts "#{page.url} -> #{page.headers['Location']}"
+ end
+end
+```
+
Find what kinds of web servers a host is using by accessing the headers:
- servers = Set[]
+```ruby
+require 'set'
+
+servers = Set[]
- Spidr.host('company.com') do |spider|
- spider.all_headers do |headers|
- servers << headers['server']
- end
- end
+Spidr.host('company.com') do |spider|
+ spider.all_headers do |headers|
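+    # headers come from Net::HTTP#to_hash: names are lower-cased,
+    # values are Arrays of strings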
+ servers << headers['server']
+ end
+end
+```
Pause the spider on a forbidden page:
- Spidr.host('company.com') do |spider|
- spider.every_forbidden_page do |page|
- spider.pause!
- end
- end
+```ruby
+Spidr.host('company.com') do |spider|
+ spider.every_forbidden_page do |page|
+ spider.pause!
+ end
+end
+```
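+
+A paused agent can be resumed with `continue!`; a minimal sketch, capturing
+the agent that `Spidr.host` returns:
+
+```ruby
+spider = Spidr.host('company.com') do |spider|
+  spider.every_forbidden_page do |page|
+    spider.pause!
+  end
+end
+
+# the crawl stopped at the forbidden page; pick it back up
+spider.continue!
+```
+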
Skip the processing of a page:
- Spidr.host('company.com') do |spider|
- spider.every_missing_page do |page|
- spider.skip_page!
- end
- end
+```ruby
+Spidr.host('company.com') do |spider|
+ spider.every_missing_page do |page|
+ spider.skip_page!
+ end
+end
+```
Skip the processing of links:
- Spidr.host('company.com') do |spider|
- spider.every_url do |url|
- if url.path.split('/').find { |dir| dir.to_i > 1000 }
- spider.skip_link!
- end
- end
+```ruby
+Spidr.host('company.com') do |spider|
+ spider.every_url do |url|
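+    # skip any link whose path contains a numeric segment greater than 1000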
+ if url.path.split('/').find { |dir| dir.to_i > 1000 }
+ spider.skip_link!
end
+ end
+end
+```
## Requirements
* [ruby] >= 2.0.0
* [nokogiri] ~> 1.3
## Install
- $ gem install spidr
+```shell
+$ gem install spidr
+```
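+
+Or, if you use [Bundler](https://bundler.io/), a minimal Gemfile entry (the
+`~> 0.7` constraint is only an example):
+
+```ruby
+# Gemfile
+gem 'spidr', '~> 0.7'
+```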
## License
-Copyright (c) 2008-2016 Hal Brodigan
+Copyright (c) 2008-2022 Hal Brodigan
See {file:LICENSE.txt} for license information.
[ruby]: https://www.ruby-lang.org/
[nokogiri]: http://www.nokogiri.org/