README.md in spidr-0.4.1 vs README.md in spidr-0.5.0
- old
+ new
@@ -1,10 +1,10 @@
# Spidr
-* [Homepage](http://spidr.rubyforge.org/)
-* [Source](http://github.com/postmodern/spidr)
-* [Issues](http://github.com/postmodern/spidr/issues)
+* [Homepage](https://github.com/postmodern/spidr#readme)
+* [Source](https://github.com/postmodern/spidr)
+* [Issues](https://github.com/postmodern/spidr/issues)
* [Mailing List](http://groups.google.com/group/spidr)
* [IRC](http://webchat.freenode.net/?channels=spidr&uio=d4)
## Description
@@ -13,13 +13,13 @@
and easy to use.
## Features
* Follows:
-  * a tags.
-  * iframe tags.
-  * frame tags.
+  * `a` tags.
+  * `iframe` tags.
+  * `frame` tags.
  * Cookie protected links.
  * HTTP 300, 301, 302, 303 and 307 Redirects.
  * Meta-Refresh Redirects.
  * HTTP Basic Auth protected links.
* Black-list or white-list URLs based upon:
@@ -49,40 +49,44 @@
    Spidr.start_at('http://tenderlovemaking.com/')
Spider a host:
-    Spidr.host('coderrr.wordpress.com')
+    Spidr.host('solnic.eu')
Spider a site:
-    Spidr.site('http://rubyflow.com/')
+    Spidr.site('http://www.rubyflow.com/')
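Loosely speaking, `start_at` follows links to any host it encounters, while `host` and `site` restrict the spider to the host of the starting URL.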
Spider multiple hosts:
    Spidr.start_at(
      'http://company.com/',
-      :hosts => [
+      hosts: [
        'company.com',
-        /host\d\.company\.com/
+        /host\d+\.company\.com/
      ]
    )
Do not spider certain links:
-    Spidr.site('http://matasano.com/', :ignore_links => [/log/])
+    Spidr.site('http://company.com/', ignore_links: [%r{^/blog/}])
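Filter rules are not limited to regexps; a proc can make the decision instead (a sketch, assuming proc rules as accepted by spidr's filter options; the `logout` check below is purely illustrative):

    Spidr.site('http://company.com/', ignore_links: [lambda { |link| link.include?('logout') }])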
Do not spider links on certain ports:
+    Spidr.site('http://company.com/', ignore_ports: [8000, 8010, 8080])
+
+Do not spider links blacklisted in robots.txt:
+
    Spidr.site(
-      'http://sketchy.content.com/',
-      :ignore_ports => [8000, 8010, 8080]
+      'http://company.com/',
+      robots: true
    )
Print out visited URLs:
-    Spidr.site('http://rubyinside.org/') do |spider|
+    Spidr.site('http://www.rubyinside.com/') do |spider|
      spider.every_url { |url| puts url }
    end
Build a URL map of a site:
@@ -94,11 +98,11 @@
      end
    end
Print out the URLs that could not be requested:
-    Spidr.site('http://sketchy.content.com/') do |spider|
+    Spidr.site('http://company.com/') do |spider|
      spider.every_failed_url { |url| puts url }
    end
Find all pages which have broken links:
@@ -116,75 +120,79 @@
      url_map[url].each { |page| puts "  #{page}" }
    end
Search HTML and XML pages:
-    Spidr.site('http://company.withablog.com/') do |spider|
+    Spidr.site('http://company.com/') do |spider|
      spider.every_page do |page|
-        puts "[-] #{page.url}"
+        puts ">>> #{page.url}"
        page.search('//meta').each do |meta|
          name = (meta.attributes['name'] || meta.attributes['http-equiv'])
          value = meta.attributes['content']
-          puts "  #{name} = #{value}"
+          puts "    #{name} = #{value}"
        end
      end
    end
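Pages are parsed with Nokogiri, so the underlying document is available when XPath alone is not enough (a sketch; `page.doc` returns the parsed Nokogiri document for HTML pages, and the `a.external` CSS selector is purely illustrative):

    Spidr.site('http://company.com/') do |spider|
      spider.every_html_page do |page|
        # doc is a Nokogiri::HTML::Document inside every_html_page
        page.doc.css('a.external').each { |a| puts a['href'] }
      end
    end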
Print out the titles from every page:
-    Spidr.site('http://www.rubypulse.com/') do |spider|
+    Spidr.site('https://www.ruby-lang.org/') do |spider|
      spider.every_html_page do |page|
        puts page.title
      end
    end
Find what kinds of web servers a host is using, by accessing the headers:
    require 'set' # Set[] needs the stdlib set library

    servers = Set[]

-    Spidr.host('generic.company.com') do |spider|
+    Spidr.host('company.com') do |spider|
      spider.all_headers do |headers|
        servers << headers['server']
      end
    end
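Once the crawl finishes, `servers` holds each distinct `Server` header value:

    puts servers.to_a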
Pause the spider on a forbidden page:
-    spider = Spidr.host('overnight.startup.com') do |spider|
+    spider = Spidr.host('company.com') do |spider|
      spider.every_forbidden_page do |page|
        spider.pause!
      end
    end
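A paused spider keeps its queue and history, so it can be resumed later from where it stopped:

    spider.continue!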
Skip the processing of a page:
-    Spidr.host('sketchy.content.com') do |spider|
+    Spidr.host('company.com') do |spider|
      spider.every_missing_page do |page|
        spider.skip_page!
      end
    end
Skip the processing of links:
-    Spidr.host('sketchy.content.com') do |spider|
+    Spidr.host('company.com') do |spider|
      spider.every_url do |url|
        if url.path.split('/').find { |dir| dir.to_i > 1000 }
          spider.skip_link!
        end
      end
    end
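Roughly speaking, `skip_page!` aborts processing of the page currently being visited, while `skip_link!` prevents the current URL from being enqueued at all.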
## Requirements
-* [nokogiri](http://nokogiri.rubyforge.org/) ~> 1.3
+* [ruby] >= 1.9.1
+* [nokogiri] ~> 1.3
## Install
-    $ sudo gem install spidr
+    $ gem install spidr
## License
-Copyright (c) 2008-2011 Hal Brodigan
+Copyright (c) 2008-2016 Hal Brodigan
See {file:LICENSE.txt} for license information.
+
+[ruby]: https://www.ruby-lang.org/
+[nokogiri]: http://www.nokogiri.org/