README.md in ronin-web-spider-0.1.0.beta2 vs README.md in ronin-web-spider-0.1.0

- old
+ new

@@ -1,9 +1,10 @@
 # ronin-web-spider
 
 [![CI](https://github.com/ronin-rb/ronin-web-spider/actions/workflows/ruby.yml/badge.svg)](https://github.com/ronin-rb/ronin-web-spider/actions/workflows/ruby.yml)
 [![Code Climate](https://codeclimate.com/github/ronin-rb/ronin-web-spider.svg)](https://codeclimate.com/github/ronin-rb/ronin-web-spider)
+[![Gem Version](https://badge.fury.io/rb/ronin-web-spider.svg)](https://badge.fury.io/rb/ronin-web-spider)
 
 * [Website](https://ronin-rb.dev/)
 * [Source](https://github.com/ronin-rb/ronin-web-spider)
 * [Issues](https://github.com/ronin-rb/ronin-web-spider/issues)
 * [Documentation](https://ronin-rb.dev/docs/ronin-web-spider/frames)
@@ -18,72 +19,343 @@
 ## Features
 
 * Built on top of the battle tested and versatile [spidr] gem.
 * Provides additional callback methods:
-  * `every_host` - yields every unique host name that's spidered.
-  * `every_cert` - yields every unique SSL/TLS certificate encountered while
-    spidering.
-  * `every_favicon` - yields every favicon file that's encountered while
-    spidering.
-  * `every_html_comment` - yields every HTML comment.
-  * `every_javascript` - yields all JavaScript source code from either inline
-    `<script>` or `.js` files.
-  * `every_javascript_string` - yields every single-quoted or double-quoted
-    String literal from all JavaScript source code.
-  * `every_javascript_comment` - yields every JavaScript comment.
-  * `every_comment` - yields every HTML or JavaScript comment.
+  * [every_host][docs-every_host] - yields every unique host name that's
+    spidered.
+  * [every_cert][docs-every_cert] - yields every unique SSL/TLS certificate
+    encountered while spidering.
+  * [every_favicon][docs-every_favicon] - yields every favicon file that's
+    encountered while spidering.
+  * [every_html_comment][docs-every_html_comment] - yields every HTML comment.
+  * [every_javascript][docs-every_javascript] - yields all JavaScript source
+    code from either inline `<script>` or `.js` files.
+  * [every_javascript_string][docs-every_javascript_string] - yields every
+    single-quoted or double-quoted String literal from all JavaScript source
+    code.
+  * [every_javascript_comment][docs-every_javascript_comment] - yields every
+    JavaScript comment.
+  * [every_comment][docs-every_comment] - yields every HTML or JavaScript
+    comment.
 * Supports archiving spidered pages to a directory or git repository.
 * Has 94% documentation coverage.
 * Has 94% test coverage.
+[docs-every_host]: https://ronin-rb.dev/docs/ronin-web-spider/Ronin/Web/Spider/Agent.html#every_host-instance_method
+[docs-every_cert]: https://ronin-rb.dev/docs/ronin-web-spider/Ronin/Web/Spider/Agent.html#every_cert-instance_method
+[docs-every_favicon]: https://ronin-rb.dev/docs/ronin-web-spider/Ronin/Web/Spider/Agent.html#every_favicon-instance_method
+[docs-every_html_comment]: https://ronin-rb.dev/docs/ronin-web-spider/Ronin/Web/Spider/Agent.html#every_html_comment-instance_method
+[docs-every_javascript]: https://ronin-rb.dev/docs/ronin-web-spider/Ronin/Web/Spider/Agent.html#every_javascript-instance_method
+[docs-every_javascript_string]: https://ronin-rb.dev/docs/ronin-web-spider/Ronin/Web/Spider/Agent.html#every_javascript_string-instance_method
+[docs-every_javascript_comment]: https://ronin-rb.dev/docs/ronin-web-spider/Ronin/Web/Spider/Agent.html#every_javascript_comment-instance_method
+[docs-every_comment]: https://ronin-rb.dev/docs/ronin-web-spider/Ronin/Web/Spider/Agent.html#every_comment-instance_method
+
 ## Examples
 
-Spider a host:
+Start spidering from a URL:
 
 ```ruby
 require 'ronin/web/spider'
 
-Ronin::Web::Spider.host('www.example.com') do |agent|
-  agent.ever_url do |url|
-    # ...
+Ronin::Web::Spider.start_at('http://tenderlovemaking.com/') do |agent|
+  # ...
+end
+```
+
+Spider a host:
+
+```ruby
+Ronin::Web::Spider.host('solnic.eu') do |agent|
+  # ...
+end
+```
+
+Spider a domain (and any sub-domains):
+
+```ruby
+Ronin::Web::Spider.domain('ruby-lang.org') do |agent|
+  # ...
+end
+```
+
+Spider a site:
+
+```ruby
+Ronin::Web::Spider.site('http://www.rubyflow.com/') do |agent|
+  # ...
+end
+```
+
+Spider multiple hosts:
+
+```ruby
+Ronin::Web::Spider.start_at('http://company.com/', hosts: ['company.com', /host[\d]+\.company\.com/]) do |agent|
+  # ...
+end
+```
+
+Do not spider certain links:
+
+```ruby
+Ronin::Web::Spider.site('http://company.com/', ignore_links: [%r{^/blog/}]) do |agent|
+  # ...
+end
+```
+
+Do not spider links on certain ports:
+
+```ruby
+Ronin::Web::Spider.site('http://company.com/', ignore_ports: [8000, 8010, 8080]) do |agent|
+  # ...
+end
+```
+
+Do not spider links blacklisted in robots.txt:
+
+```ruby
+Ronin::Web::Spider.site('http://company.com/', robots: true) do |agent|
+  # ...
+end
+```
+
+Print out visited URLs:
+
+```ruby
+Ronin::Web::Spider.site('http://www.rubyinside.com/') do |spider|
+  spider.every_url { |url| puts url }
+end
+```
+
+Build a URL map of a site:
+
+```ruby
+url_map = Hash.new { |hash,key| hash[key] = [] }
+
+Ronin::Web::Spider.site('http://intranet.com/') do |spider|
+  spider.every_link do |origin,dest|
+    url_map[dest] << origin
   end
+end
+```
 
-  agent.every_url_like(/.../) do |url|
-    # ...
+Print out the URLs that could not be requested:
+
+```ruby
+Ronin::Web::Spider.site('http://company.com/') do |spider|
+  spider.every_failed_url { |url| puts url }
+end
+```
+
+Find all pages which have broken links:
+
+```ruby
+url_map = Hash.new { |hash,key| hash[key] = [] }
+
+spider = Ronin::Web::Spider.site('http://intranet.com/') do |spider|
+  spider.every_link do |origin,dest|
+    url_map[dest] << origin
   end
+end
 
-  agent.every_page do |page|
-    # ...
+spider.failures.each do |url|
+  puts "Broken link #{url} found in:"
+
+  url_map[url].each { |page| puts "  #{page}" }
+end
+```
+
+Search HTML and XML pages:
+
+```ruby
+Ronin::Web::Spider.site('http://company.com/') do |spider|
+  spider.every_page do |page|
+    puts ">>> #{page.url}"
+
+    page.search('//meta').each do |meta|
+      name  = (meta.attributes['name'] || meta.attributes['http-equiv'])
+      value = meta.attributes['content']
+
+      puts "  #{name} = #{value}"
+    end
   end
 end
 ```
 
-See [Spidr::Agent] documentation for more agent methods.
+Print out the titles from every page:
 
-[Spidr::Agent]: https://rubydoc.info/gems/spidr/Spidr/Agent
+```ruby
+Ronin::Web::Spider.site('https://www.ruby-lang.org/') do |spider|
+  spider.every_html_page do |page|
+    puts page.title
+  end
+end
+```
 
-Spider a domain:
+Print out every HTTP redirect:
 
 ```ruby
-Ronin::Web::Spider.domain('example.com') do |agent|
-  agent.every_page do |page|
-    # ...
+Ronin::Web::Spider.host('company.com') do |spider|
+  spider.every_redirect_page do |page|
+    puts "#{page.url} -> #{page.headers['Location']}"
   end
 end
 ```
 
-Spider a website:
+Find what kinds of web servers a host is using, by accessing the headers:
 
 ```ruby
-Ronin::Web::Spider.site('https://www.example.com/index.html') do |agent|
-  agent.every_page do |page|
-    # ...
+servers = Set[]
+
+Ronin::Web::Spider.host('company.com') do |spider|
+  spider.all_headers do |headers|
+    servers << headers['server']
   end
 end
 ```
 
+Pause the spider on a forbidden page:
+
+```ruby
+Ronin::Web::Spider.host('company.com') do |spider|
+  spider.every_forbidden_page do |page|
+    spider.pause!
+  end
+end
+```
+
+Skip the processing of a page:
+
+```ruby
+Ronin::Web::Spider.host('company.com') do |spider|
+  spider.every_missing_page do |page|
+    spider.skip_page!
+  end
+end
+```
+
+Skip the processing of links:
+
+```ruby
+Ronin::Web::Spider.host('company.com') do |spider|
+  spider.every_url do |url|
+    if url.path.split('/').find { |dir| dir.to_i > 1000 }
+      spider.skip_link!
+    end
+  end
+end
+```
+
+Detect when a new host name is spidered:
+
+```ruby
+Ronin::Web::Spider.domain('example.com') do |spider|
+  spider.every_host do |host|
+    puts "Spidering #{host} ..."
+  end
+end
+```
+
+Detect when a new SSL/TLS certificate is encountered:
+
+```ruby
+Ronin::Web::Spider.domain('example.com') do |spider|
+  spider.every_cert do |cert|
+    puts "Discovered new cert for #{cert.subject.common_name}, #{cert.subject_alt_name}"
+  end
+end
+```
+
+Print the MD5 checksum of every `favicon.ico` file:
+
+```ruby
+Ronin::Web::Spider.domain('example.com') do |spider|
+  spider.every_favicon do |page|
+    puts "#{page.url}: #{page.body.md5}"
+  end
+end
+```
+
+Print every HTML comment:
+
+```ruby
+Ronin::Web::Spider.domain('example.com') do |spider|
+  spider.every_html_comment do |comment|
+    puts comment
+  end
+end
+```
+
+Print all JavaScript source code:
+
+```ruby
+Ronin::Web::Spider.domain('example.com') do |spider|
+  spider.every_javascript do |js|
+    puts js
+  end
+end
+```
+
+Print every JavaScript string literal:
+
+```ruby
+Ronin::Web::Spider.domain('example.com') do |spider|
+  spider.every_javascript_string do |str|
+    puts str
+  end
+end
+```
+
+Print every JavaScript comment:
+
+```ruby
+Ronin::Web::Spider.domain('example.com') do |spider|
+  spider.every_javascript_comment do |comment|
+    puts comment
+  end
+end
+```
+
+Print every HTML and JavaScript comment:
+
+```ruby
+Ronin::Web::Spider.domain('example.com') do |spider|
+  spider.every_comment do |comment|
+    puts comment
+  end
+end
+```
+
+Spider a host and archive every web page:
+
+```ruby
+require 'ronin/web/spider'
+require 'ronin/web/spider/archive'
+
+Ronin::Web::Spider::Archive.open('path/to/root') do |archive|
+  Ronin::Web::Spider.every_page(host: 'example.com') do |page|
+    archive.write(page.url,page.body)
+  end
+end
+```
+
+Spider a host and archive every web page to a Git repository:
+
+```ruby
+require 'ronin/web/spider/git_archive'
+require 'ronin/web/spider'
+require 'date'
+
+Ronin::Web::Spider::GitArchive.open('path/to/root') do |archive|
+  archive.commit("Updated #{Date.today}") do
+    Ronin::Web::Spider.every_page(host: 'example.com') do |page|
+      archive.write(page.url,page.body)
+    end
+  end
+end
+```
+
 ## Requirements
 
 * [Ruby] >= 3.0.0
 * [spidr] ~> 0.7
 * [ronin-support] ~> 1.0
@@ -117,10 +389,10 @@
 7. `bundle exec rake spec`
 8. `git push origin my_feature`
 
 ## License
 
-Copyright (c) 2006-2022 Hal Brodigan (postmodern.mod3 at gmail.com)
+Copyright (c) 2006-2023 Hal Brodigan (postmodern.mod3 at gmail.com)
 
 ronin-web-spider is free software: you can redistribute it and/or modify
 it under the terms of the GNU Lesser General Public License as published
 by the Free Software Foundation, either version 3 of the License, or
 (at your option) any later version.
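Editor's note (not part of either README above): the new `every_*` callbacks introduced in 0.1.0 are all registered on the same spider agent, so several of them can be combined in a single crawl. Below is a minimal sketch of that, assuming ronin-web-spider 0.1.0 and using only calls shown in the examples above; `example.com` is a placeholder target.

```ruby
# Illustrative sketch only - combines two of the 0.1.0 callbacks in one run.
require 'ronin/web/spider'
require 'set'

hosts    = Set[]  # unique host names seen while spidering
comments = []     # HTML and JavaScript comments found along the way

Ronin::Web::Spider.domain('example.com') do |spider|
  # both callbacks fire during the same crawl
  spider.every_host    { |host|    hosts    << host }
  spider.every_comment { |comment| comments << comment }
end

puts "Visited #{hosts.size} host(s), collected #{comments.size} comment(s)"
```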