# SiteDiff CLI

**Warning:** SiteDiff 1.2.0 requires at least Ruby 3.1.2.

**Warning:** SiteDiff 1.0.0 introduces some backwards-incompatible changes.

[![Build Status](https://travis-ci.org/evolvingweb/sitediff.svg?branch=master)](https://travis-ci.org/evolvingweb/sitediff)

## Table of contents

- [Introduction](#introduction)
- [Installation](#installation)
- [Demo](#demo)
- [Usage](#usage)
- [Getting Started](#getting-started)
- [Comparing 2 Sites](#comparing-2-sites)
- [Spurious Diffs](#spurious-diffs)
- [Command Line Options](#command-line-options)
- [Finding Configuration Files](#finding-configuration-files)
- [Specifying Paths](#specifying-paths)
- [Debugging Rules](#debugging-rules)
- [Including and Excluding URLs](#including-and-excluding-urls)
- [Paths and Paths-file](#paths--paths-file)
- [Report Export](#export)
- [Running inside containers](#running-inside-containers)
- [Configuration](#configuration)
- [before_url / after_url](#before_url--after_url)
- [selector](#selector)
- [sanitization](#sanitization)
- [ignore_whitespace](#ignore_whitespace)
- [before / after](#before--after)
- [includes](#includes)
- [dom_transform](#dom_transform)
- [remove](#remove)
- [strip](#strip)
- [unwrap](#unwrap)
- [remove_class](#remove_class)
- [unwrap_root](#unwrap_root)
- [Organizing configuration files](#organizing-configuration-files)
- [Named regions](#named-regions)
- [report](#report)
- [title](#title)
- [details](#details)
- [before_note](#before_note)
- [after_note](#after_note)
- [before_url_report / after_url_report](#before_url_report--after_url_report)
- [Miscellaneous](#miscellaneous)
- [preset](#preset)
- [Include / Exclude Paths](#includeexclude-paths)
- [Curl Options](#curl-options)
- [Throttling](#throttling)
- [Timeouts](#timeouts)
- [Handling security](#handling-security)
- [interval](#interval)
- [concurrency](#concurrency)
- [depth](#depth)
- [curl_opts](#curl_opts)
- [Tips and Tricks](#tips-and-tricks)
- [Removing empty elements](#removing-empty-elements)
- [HTML Tag Formatting](#html-tag-formatting)
- [Empty Attributes](#empty-attributes)
- [Acknowledgements](#acknowledgements)

## Introduction

SiteDiff makes it easy to see how a website changes. It can compare two similar sites or it can show how a single site changed over time. It helps identify undesirable changes to the site's HTML and it's a useful tool for conducting QA on re-deployments, site upgrades, and more!

When you run SiteDiff, it produces an HTML report showing whether pages on your site have changed or not. For pages that have changed, you can see a colorized diff of exactly what changed, or compare the visual differences side-by-side in a browser.

SiteDiff supports a range of normalization / sanitization rules. These allow you to eliminate spurious differences, narrowing down differences to the ones that materially affect the site.

## Installation

SiteDiff is fairly easy to install. Please refer to the [installation docs](INSTALLATION.md).

## Demo

After installing all dependencies including the `bundle` version 2 gem, you can quickly see what SiteDiff can do. Simply use the following commands:

```sh
git clone https://github.com/evolvingweb/sitediff
cd sitediff
bundle install
bundle exec thor fixture:serve
```

Then visit `http://localhost:13080/` to view the report.

SiteDiff shows you an overview of all the pages and clearly indicates which pages have changed and which have not.

![page report preview](misc/sitediff%20-%20overview%20report.png?raw=true)

When you click on a changed page, you see a colorized diff of the page's markup showing exactly what changed on the page.

![page report preview](misc/sitediff%20-%20page%20report.png?raw=true)

## Usage

Here are some instructions on getting started with SiteDiff.
To see a list of commands that SiteDiff offers, you can run:

```sitediff help```

To get help for a particular command, say, `diff`, you can run:

```sitediff help diff```

### Getting started

To use SiteDiff on your site, create a configuration for your site:

```sitediff init http://mysite.example.com```

SiteDiff will generate a configuration file named `sitediff.yaml` by default. You can open the configuration file ```sitediff/sitediff.yaml``` to see the default configuration generated by SiteDiff. The [configuration reference](#configuration) section explains the contents of this file and helps you customize it as per your requirements.

Then get SiteDiff to crawl your site by using:

```sitediff crawl```

SiteDiff will then crawl your site, finding pages and caching their contents. A list of discovered paths will be saved to a `paths.txt` file.

Now, you can make alterations to your site. For example, change a word on your site's front page. After you're done, you can check what actually changed:

```sitediff diff```

For each page, SiteDiff will report whether it did or did not change. For pages that changed, it will display a diff. You can also see an HTML version of the report using the following command:

```sitediff serve```

SiteDiff will start an internal web server and open a report page in your browser. For each page, you can see the diff and a side-by-side view of the old and new versions.

You can now see if the changes were as you expected, or if some things didn't quite work out as you hoped. If you noticed unexpected changes, congratulations: SiteDiff just helped you find an issue you would have otherwise missed!

As you fix any issues, you can continue to alter your site and run ```sitediff diff``` to check the changes against the old version.
Once you're satisfied with the state of your site, you can inform SiteDiff that it should re-cache your site:

```sitediff store```

This takes a snapshot of your website and the next time you run ```sitediff diff```, it will use this new version as the reference for comparison.

Happy diffing!

### Comparing 2 sites

Sometimes you have two sites that you want to compare, for example a production site hosted on a public server and a development site hosted on your computer. SiteDiff can handle this situation, too! Just inform SiteDiff that there are two sites to compare:

```sitediff init http://mysite.example.com http://localhost/mysite```

Then when you run `sitediff diff`, it will compare the cached version of the first site with the current version of the second site.

If both the first and second sites may be changing, you should tell SiteDiff not to cache either site:

```sitediff diff --cached=none```

### Spurious diffs

Sometimes sites have spurious differences that you don't want to show up in a comparison. For example, many sites protect against Cross-Site Request Forgery using a [semi-random token](http://en.wikipedia.org/wiki/Cross-site_request_forgery#Synchronizer_token_pattern). Since this token changes on each HTTP GET, you probably don't care about such a change.

To help with issues such as this, SiteDiff allows you to normalize the HTML it fetches as it compares pages. In the ```sitediff.yaml``` configuration file, you can add "sanitization rules", which specify either DOM transformations or regular expression substitutions.

Here's an example of a rule you might add to remove CSRF-protection tokens generated by Django:

```yaml
dom_transform:
  - title: Remove CSRF tokens
    type: remove
    selector: input[name=csrfmiddlewaretoken]
```

You can use one of the presets to apply framework-specific sanitization. Currently, SiteDiff only comes with Drupal-specific presets. See the [preset](#preset) section for more details.
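The `dom_transform` rule shown earlier in this section removes the token element entirely. As a rough sketch of the other kind of rule mentioned there, a regular expression substitution could mask only the token value instead. The pattern below is hypothetical: it assumes the `value` attribute directly follows the `name` attribute in the fetched markup, so check it against your own site's HTML before relying on it.

```yaml
sanitization:
  # Hypothetical pattern: assumes value="..." comes right after the name
  # attribute; adjust it to match the markup your site actually produces.
  - pattern: 'name="csrfmiddlewaretoken" value="[^"]*"'
    substitute: 'name="csrfmiddlewaretoken" value="__csrf__"'
```

Masking the value keeps the `input` element in the diff, which may be preferable if you still want to notice when the field itself appears or disappears.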
## Command Line Options

### Finding configuration files

By default, SiteDiff puts everything in the `sitediff` folder. You can use the `--directory` flag to specify a different directory.

```bash
sitediff init -C my_project_folder https://example.com
sitediff diff -C my_project_folder
sitediff serve -C my_project_folder
```

### Specifying paths

When you run ```sitediff diff```, you can specify which pages to look at in 2 ways:

1. The option ```--paths /foo /bar ...```. If you're trying to fix one page in particular, specifying just that one path will make ```sitediff diff``` run quickly!
2. The option ```--paths-file FILE``` with a newline-delimited text file. This is particularly useful when you're trying to eliminate all diffs. SiteDiff creates a file ```sitediff/failures.txt``` containing all paths which had differences, so as you try to fix differences, you can run:

   ```sitediff diff --paths-file sitediff/failures.txt```

### Debugging rules

When a sanitization rule isn't working quite right for you, you might run `sitediff diff` many times over. If fetching all the pages is taking too long, try adding the option ```--cached=all```. This tells SiteDiff not to re-fetch the content, but just compare previously cached versions, which is a lot faster!

### Including and Excluding URLs

By default, SiteDiff crawls pages that are indicated with an HTML anchor using the ...

We're not interested in comparing random content, so we could use the following rule to fix this:

```yaml
sanitization:
  # Remove form build IDs
  - pattern: ''
    selector: 'input'
    substitute: ''
```

Sanitization rules may also have a **path** attribute, whose value is a regular expression. If present, the rule will only apply to matching paths.

### ignore_whitespace

Ignore whitespace when doing the diff. This passes the `-w` option to the native OS `diff` command.

```yaml
ignore_whitespace: true
```

On the command line, use `-w` or `--ignore-whitespace`.
```bash
sitediff diff -w
```

### before / after

Applies rules to just one side of the comparison.

These blocks can contain any of the following sections: `selector`, `sanitization`, `dom_transform`. Such a section placed in `before` will be applied just to the `before` side of the comparison and similarly for `after`.

For example, if you wanted to keep different date formatting from causing diff failures, you might use the following:

```yaml
before:
  sanitization:
    - pattern: '[1-2][0-9]{3}/[0-1][0-9]/[0-9]{2}'
      substitute: '__date__'

after:
  sanitization:
    - pattern: '[A-Z][a-z]{2} [0-9]{1,2}(st|nd|rd|th) [1-2][0-9]{3}'
      substitute: '__date__'
```

The above rule will replace dates of the form `2004/12/05` in `before` and dates of the form `May 12th 2004` in `after` with `__date__`.

### includes

The names of other configuration YAML files to merge with this one.

```yaml
includes:
  - config/sanitize_domains.yaml
  - config/strip_css_js.yaml
```

### dom_transform

A list of transformations to apply to the HTML before comparing.

This is similar to _sanitization_, but it applies transformations to the structure of the HTML, instead of to the text. Each transformation has a **type**, and potentially other attributes. The following types are available:

#### remove

Given a **selector**, removes all elements that match it.

For example, say we have a block containing the current time, which is expected to change. To ignore that, we might choose to delete the block before comparison:

```yaml
dom_transform:
  # Remove current time block
  - type: remove
    selector: div#block-time
```

#### strip

Strip leading and trailing whitespace from the contents of a tag.

Uses the Ruby string `strip()` method. Whitespace is defined as any of the following characters: null, horizontal tab, line feed, vertical tab, form feed, carriage return, space.

To transform `
```html
This is some text
```

But on the other side, it might be wrapped in an `article` tag:

```html
This is some text
Lorem ipsum...