README.md in spidr-0.6.1 vs README.md in spidr-0.7.0

- old (spidr-0.6.1)
+ new (spidr-0.7.0)

@@ -1,13 +1,13 @@
 # Spidr

+[![CI](https://github.com/postmodern/spidr/actions/workflows/ruby.yml/badge.svg)](https://github.com/postmodern/spidr/actions/workflows/ruby.yml)
+
 * [Homepage](https://github.com/postmodern/spidr#readme)
 * [Source](https://github.com/postmodern/spidr)
 * [Issues](https://github.com/postmodern/spidr/issues)
 * [Mailing List](http://groups.google.com/group/spidr)
-* [IRC](http://webchat.freenode.net/?channels=spidr&uio=d4)
-* [![Build Status](https://travis-ci.org/postmodern/spidr.svg)](https://travis-ci.org/postmodern/spidr)

 ## Description

 Spidr is a versatile Ruby web spidering library that can spider a site,
 multiple domains, certain links or infinitely. Spidr is designed to be fast
@@ -47,153 +47,212 @@
 ## Examples

 Start spidering from a URL:

-    Spidr.start_at('http://tenderlovemaking.com/')
+```ruby
+Spidr.start_at('http://tenderlovemaking.com/') do |agent|
+  # ...
+end
+```

 Spider a host:

-    Spidr.host('solnic.eu')
+```ruby
+Spidr.host('solnic.eu') do |agent|
+  # ...
+end
+```

+Spider a domain (and any sub-domains):
+
+```ruby
+Spidr.domain('ruby-lang.org') do |agent|
+  # ...
+end
+```
+
 Spider a site:

-    Spidr.site('http://www.rubyflow.com/')
+```ruby
+Spidr.site('http://www.rubyflow.com/') do |agent|
+  # ...
+end
+```

 Spider multiple hosts:

-    Spidr.start_at(
-      'http://company.com/',
-      hosts: [
-        'company.com',
-        /host[\d]+\.company\.com/
-      ]
-    )
+```ruby
+Spidr.start_at('http://company.com/', hosts: ['company.com', /host[\d]+\.company\.com/]) do |agent|
+  # ...
+end
+```

 Do not spider certain links:

-    Spidr.site('http://company.com/', ignore_links: [%{^/blog/}])
+```ruby
+Spidr.site('http://company.com/', ignore_links: [%{^/blog/}]) do |agent|
+  # ...
+end
+```

 Do not spider links on certain ports:

-    Spidr.site('http://company.com/', ignore_ports: [8000, 8010, 8080])
+```ruby
+Spidr.site('http://company.com/', ignore_ports: [8000, 8010, 8080]) do |agent|
+  # ...
+end
+```

 Do not spider links blacklisted in robots.txt:

-    Spidr.site(
-      'http://company.com/',
-      robots: true
-    )
+```ruby
+Spidr.site('http://company.com/', robots: true) do |agent|
+  # ...
+end
+```

 Print out visited URLs:

-    Spidr.site('http://www.rubyinside.com/') do |spider|
-      spider.every_url { |url| puts url }
-    end
+```ruby
+Spidr.site('http://www.rubyinside.com/') do |spider|
+  spider.every_url { |url| puts url }
+end
+```

 Build a URL map of a site:

-    url_map = Hash.new { |hash,key| hash[key] = [] }
+```ruby
+url_map = Hash.new { |hash,key| hash[key] = [] }

-    Spidr.site('http://intranet.com/') do |spider|
-      spider.every_link do |origin,dest|
-        url_map[dest] << origin
-      end
-    end
+Spidr.site('http://intranet.com/') do |spider|
+  spider.every_link do |origin,dest|
+    url_map[dest] << origin
+  end
+end
+```

 Print out the URLs that could not be requested:

-    Spidr.site('http://company.com/') do |spider|
-      spider.every_failed_url { |url| puts url }
-    end
+```ruby
+Spidr.site('http://company.com/') do |spider|
+  spider.every_failed_url { |url| puts url }
+end
+```

 Finds all pages which have broken links:

-    url_map = Hash.new { |hash,key| hash[key] = [] }
+```ruby
+url_map = Hash.new { |hash,key| hash[key] = [] }

-    spider = Spidr.site('http://intranet.com/') do |spider|
-      spider.every_link do |origin,dest|
-        url_map[dest] << origin
-      end
-    end
+spider = Spidr.site('http://intranet.com/') do |spider|
+  spider.every_link do |origin,dest|
+    url_map[dest] << origin
+  end
+end

-    spider.failures.each do |url|
-      puts "Broken link #{url} found in:"
+spider.failures.each do |url|
+  puts "Broken link #{url} found in:"

-      url_map[url].each { |page| puts "  #{page}" }
-    end
+  url_map[url].each { |page| puts "  #{page}" }
+end
+```

 Search HTML and XML pages:

-    Spidr.site('http://company.com/') do |spider|
-      spider.every_page do |page|
-        puts ">>> #{page.url}"
+```ruby
+Spidr.site('http://company.com/') do |spider|
+  spider.every_page do |page|
+    puts ">>> #{page.url}"

-        page.search('//meta').each do |meta|
-          name = (meta.attributes['name'] || meta.attributes['http-equiv'])
-          value = meta.attributes['content']
+    page.search('//meta').each do |meta|
+      name = (meta.attributes['name'] || meta.attributes['http-equiv'])
+      value = meta.attributes['content']

-          puts "  #{name} = #{value}"
-        end
-      end
+      puts "  #{name} = #{value}"
     end
+  end
+end
+```

 Print out the titles from every page:

-    Spidr.site('https://www.ruby-lang.org/') do |spider|
-      spider.every_html_page do |page|
-        puts page.title
-      end
-    end
+```ruby
+Spidr.site('https://www.ruby-lang.org/') do |spider|
+  spider.every_html_page do |page|
+    puts page.title
+  end
+end
+```

+Print out every HTTP redirect:
+
+```ruby
+Spidr.host('company.com') do |spider|
+  spider.every_redirect_page do |page|
+    puts "#{page.url} -> #{page.headers['Location']}"
+  end
+end
+```
+
 Find what kinds of web servers a host is using, by accessing the headers:

-    servers = Set[]
+```ruby
+servers = Set[]

-    Spidr.host('company.com') do |spider|
-      spider.all_headers do |headers|
-        servers << headers['server']
-      end
-    end
+Spidr.host('company.com') do |spider|
+  spider.all_headers do |headers|
+    servers << headers['server']
+  end
+end
+```

 Pause the spider on a forbidden page:

-    Spidr.host('company.com') do |spider|
-      spider.every_forbidden_page do |page|
-        spider.pause!
-      end
-    end
+```ruby
+Spidr.host('company.com') do |spider|
+  spider.every_forbidden_page do |page|
+    spider.pause!
+  end
+end
+```

 Skip the processing of a page:

-    Spidr.host('company.com') do |spider|
-      spider.every_missing_page do |page|
-        spider.skip_page!
-      end
-    end
+```ruby
+Spidr.host('company.com') do |spider|
+  spider.every_missing_page do |page|
+    spider.skip_page!
+  end
+end
+```

 Skip the processing of links:

-    Spidr.host('company.com') do |spider|
-      spider.every_url do |url|
-        if url.path.split('/').find { |dir| dir.to_i > 1000 }
-          spider.skip_link!
-        end
-      end
+```ruby
+Spidr.host('company.com') do |spider|
+  spider.every_url do |url|
+    if url.path.split('/').find { |dir| dir.to_i > 1000 }
+      spider.skip_link!
     end
+  end
+end
+```

 ## Requirements

 * [ruby] >= 2.0.0
 * [nokogiri] ~> 1.3

 ## Install

-    $ gem install spidr
+```shell
+$ gem install spidr
+```

 ## License

-Copyright (c) 2008-2016 Hal Brodigan
+Copyright (c) 2008-2022 Hal Brodigan

 See {file:LICENSE.txt} for license information.

 [ruby]: https://www.ruby-lang.org/
 [nokogiri]: http://www.nokogiri.org/
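
As quick orientation to the API surface added in 0.7.0, here is a minimal sketch combining the new `Spidr.domain` entry point with the new `every_redirect_page` callback. It uses only methods that appear in the README above; `example.com` is a placeholder domain, not an endorsement of a real target.

```ruby
require 'spidr'

# Sketch: spider a domain and its sub-domains (Spidr.domain is new in
# the 0.7.0 README), reporting every HTTP redirect along the way.
# 'example.com' is a placeholder; substitute the domain you want to crawl.
Spidr.domain('example.com') do |spider|
  # Print each URL as it is visited.
  spider.every_url { |url| puts url }

  # For every redirecting page, show where it points.
  spider.every_redirect_page do |page|
    puts "#{page.url} -> #{page.headers['Location']}"
  end
end
```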