README.md in ronin-web-spider-0.1.0.beta2 vs README.md in ronin-web-spider-0.1.0
- old
+ new
@@ -1,9 +1,10 @@
# ronin-web-spider
[![CI](https://github.com/ronin-rb/ronin-web-spider/actions/workflows/ruby.yml/badge.svg)](https://github.com/ronin-rb/ronin-web-spider/actions/workflows/ruby.yml)
[![Code Climate](https://codeclimate.com/github/ronin-rb/ronin-web-spider.svg)](https://codeclimate.com/github/ronin-rb/ronin-web-spider)
+[![Gem Version](https://badge.fury.io/rb/ronin-web-spider.svg)](https://badge.fury.io/rb/ronin-web-spider)
* [Website](https://ronin-rb.dev/)
* [Source](https://github.com/ronin-rb/ronin-web-spider)
* [Issues](https://github.com/ronin-rb/ronin-web-spider/issues)
* [Documentation](https://ronin-rb.dev/docs/ronin-web-spider/frames)
@@ -18,72 +19,343 @@
## Features
* Built on top of the battle tested and versatile [spidr] gem.
* Provides additional callback methods:
- * `every_host` - yields every unique host name that's spidered.
- * `every_cert` - yields every unique SSL/TLS certificate encountered while
- spidering.
- * `every_favicon` - yields every favicon file that's encountered while
- spidering.
- * `every_html_comment` - yields every HTML comment.
- * `every_javascript` - yields all JavaScript source code from either inline
- `<script>` or `.js` files.
- * `every_javascript_string` - yields every single-quoted or double-quoted
- String literal from all JavaScript source code.
- * `every_javascript_comment` - yields every JavaScript comment.
- * `every_comment` - yields every HTML or JavaScript comment.
+ * [every_host][docs-every_host] - yields every unique host name that's
+ spidered.
+ * [every_cert][docs-every_cert] - yields every unique SSL/TLS certificate
+ encountered while spidering.
+ * [every_favicon][docs-every_favicon] - yields every favicon file that's
+ encountered while spidering.
+ * [every_html_comment][docs-every_html_comment] - yields every HTML comment.
+ * [every_javascript][docs-every_javascript] - yields all JavaScript source
+ code from either inline `<script>` or `.js` files.
+ * [every_javascript_string][docs-every_javascript_string] - yields every
+ single-quoted or double-quoted String literal from all JavaScript source
+ code.
+ * [every_javascript_comment][docs-every_javascript_comment] - yields every
+ JavaScript comment.
+ * [every_comment][docs-every_comment] - yields every HTML or JavaScript
+ comment.
* Supports archiving spidered pages to a directory or git repository.
* Has 94% documentation coverage.
* Has 94% test coverage.
+[docs-every_host]: https://ronin-rb.dev/docs/ronin-web-spider/Ronin/Web/Spider/Agent.html#every_host-instance_method
+[docs-every_cert]: https://ronin-rb.dev/docs/ronin-web-spider/Ronin/Web/Spider/Agent.html#every_cert-instance_method
+[docs-every_favicon]: https://ronin-rb.dev/docs/ronin-web-spider/Ronin/Web/Spider/Agent.html#every_favicon-instance_method
+[docs-every_html_comment]: https://ronin-rb.dev/docs/ronin-web-spider/Ronin/Web/Spider/Agent.html#every_html_comment-instance_method
+[docs-every_javascript]: https://ronin-rb.dev/docs/ronin-web-spider/Ronin/Web/Spider/Agent.html#every_javascript-instance_method
+[docs-every_javascript_string]: https://ronin-rb.dev/docs/ronin-web-spider/Ronin/Web/Spider/Agent.html#every_javascript_string-instance_method
+[docs-every_javascript_comment]: https://ronin-rb.dev/docs/ronin-web-spider/Ronin/Web/Spider/Agent.html#every_javascript_comment-instance_method
+[docs-every_comment]: https://ronin-rb.dev/docs/ronin-web-spider/Ronin/Web/Spider/Agent.html#every_comment-instance_method
+
## Examples
Start spidering from a URL:
```ruby
require 'ronin/web/spider'
-Ronin::Web::Spider.host('www.example.com') do |agent|
- agent.ever_url do |url|
- # ...
+Ronin::Web::Spider.start_at('http://tenderlovemaking.com/') do |agent|
+ # ...
+end
+```
+
+Spider a host:
+
+```ruby
+Ronin::Web::Spider.host('solnic.eu') do |agent|
+ # ...
+end
+```
+
+Spider a domain (and any sub-domains):
+
+```ruby
+Ronin::Web::Spider.domain('ruby-lang.org') do |agent|
+ # ...
+end
+```
+
+Spider a site:
+
+```ruby
+Ronin::Web::Spider.site('http://www.rubyflow.com/') do |agent|
+ # ...
+end
+```
+
+Spider multiple hosts:
+
+```ruby
+Ronin::Web::Spider.start_at('http://company.com/', hosts: ['company.com', /host[\d]+\.company\.com/]) do |agent|
+ # ...
+end
+```
+
+Do not spider certain links:
+
+```ruby
+Ronin::Web::Spider.site('http://company.com/', ignore_links: [%r{^/blog/}]) do |agent|
+ # ...
+end
+```
+
+Do not spider links on certain ports:
+
+```ruby
+Ronin::Web::Spider.site('http://company.com/', ignore_ports: [8000, 8010, 8080]) do |agent|
+ # ...
+end
+```
+
+Do not spider links blacklisted in robots.txt:
+
+```ruby
+Ronin::Web::Spider.site('http://company.com/', robots: true) do |agent|
+ # ...
+end
+```
+
+Print out visited URLs:
+
+```ruby
+Ronin::Web::Spider.site('http://www.rubyinside.com/') do |spider|
+ spider.every_url { |url| puts url }
+end
+```
+
+Build a URL map of a site:
+
+```ruby
+url_map = Hash.new { |hash,key| hash[key] = [] }
+
+Ronin::Web::Spider.site('http://intranet.com/') do |spider|
+ spider.every_link do |origin,dest|
+ url_map[dest] << origin
end
+end
+```
- agent.every_url_like(/.../) do |url|
- # ...
+Print out the URLs that could not be requested:
+
+```ruby
+Ronin::Web::Spider.site('http://company.com/') do |spider|
+ spider.every_failed_url { |url| puts url }
+end
+```
+
+Find all pages that have broken links:
+
+```ruby
+url_map = Hash.new { |hash,key| hash[key] = [] }
+
+spider = Ronin::Web::Spider.site('http://intranet.com/') do |spider|
+ spider.every_link do |origin,dest|
+ url_map[dest] << origin
end
+end
- agent.every_page do |page|
- # ...
+spider.failures.each do |url|
+ puts "Broken link #{url} found in:"
+
+ url_map[url].each { |page| puts " #{page}" }
+end
+```
+
+Search HTML and XML pages:
+
+```ruby
+Ronin::Web::Spider.site('http://company.com/') do |spider|
+ spider.every_page do |page|
+ puts ">>> #{page.url}"
+
+ page.search('//meta').each do |meta|
+ name = (meta.attributes['name'] || meta.attributes['http-equiv'])
+ value = meta.attributes['content']
+
+ puts " #{name} = #{value}"
+ end
end
end
```
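The URL map and broken-link examples above both rely on a `Hash` whose default block creates an empty `Array` per key. The following is a minimal offline sketch of that bookkeeping, assuming plain strings in place of the URI objects the spider yields; the `links` and `failures` sample data are hypothetical stand-ins for what `every_link` and `failures` would provide.

```ruby
# Hypothetical (origin, dest) pairs, standing in for what every_link yields.
links = [
  ['http://intranet.com/',      'http://intranet.com/about'],
  ['http://intranet.com/',      'http://intranet.com/missing'],
  ['http://intranet.com/about', 'http://intranet.com/missing']
]

# Hypothetical list of URLs that could not be requested.
failures = ['http://intranet.com/missing']

# Each destination URL maps to the list of pages that link to it.
url_map = Hash.new { |hash, key| hash[key] = [] }
links.each { |origin, dest| url_map[dest] << origin }

# Pair every failed URL with the pages that referenced it.
broken = failures.to_h { |url| [url, url_map[url]] }

broken.each do |url, pages|
  puts "Broken link #{url} found in:"
  pages.each { |page| puts "  #{page}" }
end
```

Keying the map by destination rather than origin is what makes the reverse lookup from a failed URL to its referring pages a single hash access.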
-See [Spidr::Agent] documentation for more agent methods.
+Print out the titles from every page:
-[Spidr::Agent]: https://rubydoc.info/gems/spidr/Spidr/Agent
+```ruby
+Ronin::Web::Spider.site('https://www.ruby-lang.org/') do |spider|
+ spider.every_html_page do |page|
+ puts page.title
+ end
+end
+```
-Spider a domain:
+Print out every HTTP redirect:
```ruby
-Ronin::Web::Spider.domain('example.com') do |agent|
- agent.every_page do |page|
- # ...
+Ronin::Web::Spider.host('company.com') do |spider|
+ spider.every_redirect_page do |page|
+ puts "#{page.url} -> #{page.headers['Location']}"
end
end
```
-Spider a website:
+Find what kinds of web servers a host is using, by accessing the headers:
```ruby
-Ronin::Web::Spider.site('https://www.example.com/index.html') do |agent|
- agent.every_page do |page|
- # ...
+require 'set'
+
+servers = Set[]
+
+Ronin::Web::Spider.host('company.com') do |spider|
+ spider.all_headers do |headers|
+ servers << headers['server']
end
end
```
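As a rough offline illustration of why a `Set` suits the example above, the sketch below collects `Server` values from a few hypothetical header hashes (the `responses` data is made up and stands in for what `all_headers` would yield). It also skips responses that carry no `Server` header, on the assumption that a missing header reads back as `nil`.

```ruby
require 'set'

# Hypothetical response headers, standing in for what all_headers yields.
responses = [
  { 'server' => 'nginx' },
  { 'server' => 'Apache' },
  { 'server' => 'nginx' },
  {} # a response without a Server header
]

servers = Set[]

responses.each do |headers|
  server = headers['server']

  # A Set ignores duplicates; the guard keeps nil out of the results.
  servers << server if server
end

puts servers.to_a.sort.join(', ')
```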
+Pause the spider on a forbidden page:
+
+```ruby
+Ronin::Web::Spider.host('company.com') do |spider|
+ spider.every_forbidden_page do |page|
+ spider.pause!
+ end
+end
+```
+
+Skip the processing of a page:
+
+```ruby
+Ronin::Web::Spider.host('company.com') do |spider|
+ spider.every_missing_page do |page|
+ spider.skip_page!
+ end
+end
+```
+
+Skip the processing of links:
+
+```ruby
+Ronin::Web::Spider.host('company.com') do |spider|
+ spider.every_url do |url|
+ if url.path.split('/').find { |dir| dir.to_i > 1000 }
+ spider.skip_link!
+ end
+ end
+end
+```
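The condition in the example above can be tried on its own, without running a spider. This sketch wraps it in a hypothetical helper, `numeric_dir_over?` (not part of the library), using only the standard `uri` library. Note that `String#to_i` parses leading digits, so a directory such as `2023-report` also counts as `2023`.

```ruby
require 'uri'

# Hypothetical helper: true if any path segment of the URL starts with a
# number greater than the given limit.
def numeric_dir_over?(url, limit = 1000)
  URI(url).path.split('/').any? { |dir| dir.to_i > limit }
end

puts numeric_dir_over?('http://company.com/archive/1001/post')
puts numeric_dir_over?('http://company.com/archive/999/post')
```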
+
+Detect when a new host name is spidered:
+
+```ruby
+Ronin::Web::Spider.domain('example.com') do |spider|
+ spider.every_host do |host|
+ puts "Spidering #{host} ..."
+ end
+end
+```
+
+Detect when a new SSL/TLS certificate is encountered:
+
+```ruby
+Ronin::Web::Spider.domain('example.com') do |spider|
+ spider.every_cert do |cert|
+    puts "Discovered new cert for #{cert.subject.common_name}, #{cert.subject_alt_name}"
+ end
+end
+```
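For readers unfamiliar with certificate subjects, the sketch below shows how a common name can be read from a subject using only Ruby's standard `openssl` library. It is an illustration under assumptions, not the library's own code: the subject string is made up, and the convenience methods used in the example above are presumably provided by the cert object that the spider yields.

```ruby
require 'openssl'

# Hypothetical subject, parsed from OpenSSL's slash-separated notation.
subject = OpenSSL::X509::Name.parse('/CN=www.example.com/O=Example Inc')

# Name#to_a returns [field, value, type] triples; pick the CN entry's value.
common_name = subject.to_a.find { |field, _value, _type| field == 'CN' }&.fetch(1)

puts common_name
```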
+
+Print the MD5 checksum of every `favicon.ico` file:
+
+```ruby
+Ronin::Web::Spider.domain('example.com') do |spider|
+ spider.every_favicon do |page|
+ puts "#{page.url}: #{page.body.md5}"
+ end
+end
+```
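The `md5` call on the page body above is not a core Ruby `String` method; it presumably comes from an extension loaded alongside the spider. A rough standard-library equivalent, assuming only `digest/md5`, looks like the sketch below. The `md5_hex` helper name is hypothetical.

```ruby
require 'digest/md5'

# Hypothetical helper: hex-encoded MD5 checksum of a response body.
def md5_hex(body)
  Digest::MD5.hexdigest(body)
end

puts md5_hex('hello')
```

Comparing favicon checksums against a list of known values is a common way to fingerprint the software behind a site.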
+
+Print every HTML comment:
+
+```ruby
+Ronin::Web::Spider.domain('example.com') do |spider|
+ spider.every_html_comment do |comment|
+ puts comment
+ end
+end
+```
+
+Print all JavaScript source code:
+
+```ruby
+Ronin::Web::Spider.domain('example.com') do |spider|
+ spider.every_javascript do |js|
+ puts js
+ end
+end
+```
+
+Print every JavaScript string literal:
+
+```ruby
+Ronin::Web::Spider.domain('example.com') do |spider|
+ spider.every_javascript_string do |str|
+ puts str
+ end
+end
+```
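To give a feel for what "every single-quoted or double-quoted String literal" means, here is a simplified sketch that pulls quoted literals out of a JavaScript snippet with a regular expression. This is an assumption-laden illustration, not necessarily how the library extracts them: the `JS_STRING` constant is hypothetical, the yielded strings may be unquoted or unescaped differently, and the sketch does not handle template literals or quotes inside comments and regex literals.

```ruby
# Matches "..." or '...' while allowing backslash-escaped characters inside.
JS_STRING = /"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'/

# Hypothetical JavaScript source (raw heredoc, so backslashes are kept as-is).
js = <<~'JS'
  var greeting = "hello";
  var name = 'it\'s me';
  log("a \"quoted\" word");
JS

strings = js.scan(JS_STRING)

strings.each { |str| puts str }
```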
+
+Print every JavaScript comment:
+
+```ruby
+Ronin::Web::Spider.domain('example.com') do |spider|
+ spider.every_javascript_comment do |comment|
+ puts comment
+ end
+end
+```
+
+Print every HTML and JavaScript comment:
+
+```ruby
+Ronin::Web::Spider.domain('example.com') do |spider|
+ spider.every_comment do |comment|
+ puts comment
+ end
+end
+```
+
+Spider a host and archive every web page:
+
+```ruby
+require 'ronin/web/spider'
+require 'ronin/web/spider/archive'
+
+Ronin::Web::Spider::Archive.open('path/to/root') do |archive|
+ Ronin::Web::Spider.every_page(host: 'example.com') do |page|
+ archive.write(page.url,page.body)
+ end
+end
+```
+
+Spider a host and archive every web page to a Git repository:
+
+```ruby
+require 'ronin/web/spider/git_archive'
+require 'ronin/web/spider'
+require 'date'
+
+Ronin::Web::Spider::GitArchive.open('path/to/root') do |archive|
+ archive.commit("Updated #{Date.today}") do
+ Ronin::Web::Spider.every_page(host: 'example.com') do |page|
+ archive.write(page.url,page.body)
+ end
+ end
+end
+```
+
## Requirements
* [Ruby] >= 3.0.0
* [spidr] ~> 0.7
* [ronin-support] ~> 1.0
@@ -117,10 +389,10 @@
7. `bundle exec rake spec`
8. `git push origin my_feature`
## License
-Copyright (c) 2006-2022 Hal Brodigan (postmodern.mod3 at gmail.com)
+Copyright (c) 2006-2023 Hal Brodigan (postmodern.mod3 at gmail.com)
ronin-web-spider is free software: you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published
by the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.