README.md in spidr-0.4.1 vs README.md in spidr-0.5.0

- old
+ new

@@ -1,10 +1,10 @@
 # Spidr

-* [Homepage](http://spidr.rubyforge.org/)
-* [Source](http://github.com/postmodern/spidr)
-* [Issues](http://github.com/postmodern/spidr/issues)
+* [Homepage](https://github.com/postmodern/spidr#readme)
+* [Source](https://github.com/postmodern/spidr)
+* [Issues](https://github.com/postmodern/spidr/issues)
 * [Mailing List](http://groups.google.com/group/spidr)
 * [IRC](http://webchat.freenode.net/?channels=spidr&uio=d4)

 ## Description
@@ -13,13 +13,13 @@
 and easy to use.

 ## Features

 * Follows:
-  * a tags.
-  * iframe tags.
-  * frame tags.
+  * `a` tags.
+  * `iframe` tags.
+  * `frame` tags.
   * Cookie protected links.
   * HTTP 300, 301, 302, 303 and 307 Redirects.
   * Meta-Refresh Redirects.
   * HTTP Basic Auth protected links.
 * Black-list or white-list URLs based upon:
@@ -49,40 +49,44 @@
     Spidr.start_at('http://tenderlovemaking.com/')

 Spider a host:

-    Spidr.host('coderrr.wordpress.com')
+    Spidr.host('solnic.eu')

 Spider a site:

-    Spidr.site('http://rubyflow.com/')
+    Spidr.site('http://www.rubyflow.com/')

 Spider multiple hosts:

     Spidr.start_at(
       'http://company.com/',
-      :hosts => [
+      hosts: [
         'company.com',
-        /host\d\.company\.com/
+        /host[\d]+\.company\.com/
       ]
     )

 Do not spider certain links:

-    Spidr.site('http://matasano.com/', :ignore_links => [/log/])
+    Spidr.site('http://company.com/', ignore_links: [%r{^/blog/}])

 Do not spider links on certain ports:

+    Spidr.site('http://company.com/', ignore_ports: [8000, 8010, 8080])
+
+Do not spider links blacklisted in robots.txt:
+
     Spidr.site(
-      'http://sketchy.content.com/',
-      :ignore_ports => [8000, 8010, 8080]
+      'http://company.com/',
+      robots: true
     )

 Print out visited URLs:

-    Spidr.site('http://rubyinside.org/') do |spider|
+    Spidr.site('http://www.rubyinside.com/') do |spider|
       spider.every_url { |url| puts url }
     end

 Build a URL map of a site:
@@ -94,11 +98,11 @@
       end
     end

 Print out the URLs that could not be requested:

-    Spidr.site('http://sketchy.content.com/') do |spider|
+    Spidr.site('http://company.com/') do |spider|
       spider.every_failed_url { |url| puts url }
     end

 Find all pages which have broken links:
@@ -116,75 +120,79 @@
       url_map[url].each { |page| puts "  #{page}" }
     end

 Search HTML and XML pages:

-    Spidr.site('http://company.withablog.com/') do |spider|
+    Spidr.site('http://company.com/') do |spider|
       spider.every_page do |page|
-        puts "[-] #{page.url}"
+        puts ">>> #{page.url}"

         page.search('//meta').each do |meta|
           name = (meta.attributes['name'] || meta.attributes['http-equiv'])
           value = meta.attributes['content']

-          puts "  #{name} = #{value}"
+          puts "    #{name} = #{value}"
         end
       end
     end

 Print out the titles from every page:

-    Spidr.site('http://www.rubypulse.com/') do |spider|
+    Spidr.site('https://www.ruby-lang.org/') do |spider|
       spider.every_html_page do |page|
         puts page.title
       end
     end

 Find what kinds of web servers a host is using, by accessing the headers:

     servers = Set[]

-    Spidr.host('generic.company.com') do |spider|
+    Spidr.host('company.com') do |spider|
       spider.all_headers do |headers|
         servers << headers['server']
       end
     end

 Pause the spider on a forbidden page:

-    spider = Spidr.host('overnight.startup.com') do |spider|
+    spider = Spidr.host('company.com') do |spider|
       spider.every_forbidden_page do |page|
         spider.pause!
       end
     end

 Skip the processing of a page:

-    Spidr.host('sketchy.content.com') do |spider|
+    Spidr.host('company.com') do |spider|
       spider.every_missing_page do |page|
         spider.skip_page!
       end
     end

 Skip the processing of links:

-    Spidr.host('sketchy.content.com') do |spider|
+    Spidr.host('company.com') do |spider|
       spider.every_url do |url|
         if url.path.split('/').find { |dir| dir.to_i > 1000 }
           spider.skip_link!
         end
       end
     end

 ## Requirements

-* [nokogiri](http://nokogiri.rubyforge.org/) ~> 1.3
+* [ruby] >= 1.9.1
+* [nokogiri] ~> 1.3

 ## Install

-    $ sudo gem install spidr
+    $ gem install spidr

 ## License

-Copyright (c) 2008-2011 Hal Brodigan
+Copyright (c) 2008-2016 Hal Brodigan

 See {file:LICENSE.txt} for license information.
+
+[ruby]: https://www.ruby-lang.org/
+[nokogiri]: http://www.nokogiri.org/
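
Beyond URL and hostname updates, the new (`+`) lines above record an API change: option hashes move from `:key => value` syntax to Ruby 1.9 keyword-style arguments, and the filtering options `ignore_ports:` and the new `robots:` flag gain examples. As a minimal sketch of how the 0.5.0-style options compose in a single call (the host and filter values are placeholders, not taken from either README):

    require 'spidr'

    # Hypothetical crawl combining the 0.5.0 options shown in the diff above:
    # keyword-style arguments, link/port filtering, and robots.txt support.
    Spidr.site(
      'http://company.com/',            # placeholder host
      ignore_links: [%r{^/blog/}],      # skip links matching these patterns
      ignore_ports: [8000, 8010, 8080], # skip links on these ports
      robots:       true                # honor robots.txt (new in 0.5.0)
    ) do |spider|
      spider.every_url { |url| puts url }
    end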