# Loofah

* https://github.com/flavorjones/loofah
* Docs: http://rubydoc.info/github/flavorjones/loofah/main/frames
* Mailing list: [loofah-talk@googlegroups.com](https://groups.google.com/forum/#!forum/loofah-talk)

## Status

[![ci](https://github.com/flavorjones/loofah/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/flavorjones/loofah/actions/workflows/ci.yml)
[![Tidelift dependencies](https://tidelift.com/badges/package/rubygems/loofah)](https://tidelift.com/subscription/pkg/rubygems-loofah?utm_source=rubygems-loofah&utm_medium=referral&utm_campaign=readme)

## Description

Loofah is a general library for manipulating and transforming HTML/XML documents and fragments, built on top of Nokogiri.

Loofah also includes some HTML sanitizers based on `html5lib`'s safelist, which are a specific application of the general transformation functionality.

Active Record extensions for HTML sanitization are available in the [`loofah-activerecord` gem](https://github.com/flavorjones/loofah-activerecord).

## Features

* Easily write custom transformations for HTML and XML
* Common HTML sanitizing transformations are built-in:
  * _Strip_ unsafe tags, leaving behind only the inner text.
  * _Prune_ unsafe tags and their subtrees, removing all traces that they ever existed.
  * _Escape_ unsafe tags and their subtrees, leaving behind lots of `&lt;` and `&gt;` entities.
  * _Whitewash_ the markup, removing all attributes and namespaced nodes.
* Other common HTML transformations are built-in:
  * Add the _nofollow_ attribute to all hyperlinks.
  * Add the _target=\_blank_ attribute to all hyperlinks.
  * Remove _unprintable_ characters from text nodes.
* Format markup as plain text, with (or without) sensible whitespace handling around block elements.
* Replace Rails's `strip_tags` and `sanitize` view helper methods.

## Compare and Contrast

Loofah is both:

- a general framework for transforming XML, XHTML, and HTML documents
- a specific toolkit for HTML sanitization

### General document transformation

Loofah tries to make it easy to write your own custom scrubbers for whatever document transformation you need. You don't like the built-in scrubbers? Build your own, like a boss.

### HTML sanitization

[`rgrove/sanitize`](https://github.com/rgrove/sanitize) is another Ruby library that provides HTML sanitization. It is also built on top of Nokogiri, and provides a bit more flexibility on the tags and attributes being scrubbed.

You may also want to look at [`rails/rails-html-sanitizer`](https://github.com/rails/rails-html-sanitizer), which is built on top of Loofah and provides some useful extensions and additional flexibility in the HTML sanitization.

## The Basics

Loofah wraps [Nokogiri](http://nokogiri.org) in a loving embrace. Nokogiri is a stable, well-maintained parser for XML, HTML4, and HTML5.

Loofah implements the following classes:

* `Loofah::HTML5::Document`
* `Loofah::HTML5::DocumentFragment`
* `Loofah::HTML4::Document` (aliased as `Loofah::HTML::Document` for now)
* `Loofah::HTML4::DocumentFragment` (aliased as `Loofah::HTML::DocumentFragment` for now)
* `Loofah::XML::Document`
* `Loofah::XML::DocumentFragment`

These document and fragment classes are subclasses of the similarly-named Nokogiri classes `Nokogiri::HTML5::Document` et al.

Loofah also implements `Loofah::Scrubber`, which represents the document transformation, either by wrapping a block,

``` ruby
span2div = Loofah::Scrubber.new do |node|
  node.name = "div" if node.name == "span"
end
```

or by implementing a method.

### Side Note: Fragments vs Documents

Generally speaking, unless you expect to have a DOCTYPE and a single root node, you don't have a *document*, you have a *fragment*. For HTML, another rule of thumb is that *documents* have `html` and `body` tags, and *fragments* usually do not.

**HTML fragments** should be parsed with `Loofah.html5_fragment` or `Loofah.html4_fragment`. The result won't be wrapped in `html` or `body` tags, won't have a DOCTYPE declaration, `head` elements will be silently ignored, and multiple root nodes are allowed.

**HTML documents** should be parsed with `Loofah.html5_document` or `Loofah.html4_document`. The result will have a DOCTYPE declaration, along with `html`, `head` and `body` tags.

**XML fragments** should be parsed with `Loofah.xml_fragment`. The result won't have a DOCTYPE declaration, and multiple root nodes are allowed.

**XML documents** should be parsed with `Loofah.xml_document`. The result will have a DOCTYPE declaration and a single root node.

### Side Note: HTML4 vs HTML5

⚠ _HTML5 functionality is not available on JRuby, or with versions of Nokogiri `< 1.14.0`._

Currently, Loofah's methods `Loofah.document` and `Loofah.fragment` are aliases to `.html4_document` and `.html4_fragment`, which use Nokogiri's HTML4 parser. (Similarly, `Loofah::HTML::Document` and `Loofah::HTML::DocumentFragment` are aliased to `Loofah::HTML4::Document` and `Loofah::HTML4::DocumentFragment`.)

**Please note** that in a future version of Loofah, these methods and classes may switch to using Nokogiri's HTML5 parser and classes on platforms that support it [1].

**We strongly recommend that you explicitly use `.html5_document` or `.html5_fragment`** unless you know of a compelling reason not to. If you are sure that you need to use the HTML4 parser, you should explicitly call `.html4_document` or `.html4_fragment` to avoid breakage in a future version.

[1]: [[feature request] HTML5 parser for JRuby implementation · Issue #2227 · sparklemotion/nokogiri](https://github.com/sparklemotion/nokogiri/issues/2227)

### `Loofah::HTML5::Document` and `Loofah::HTML5::DocumentFragment`

These classes are subclasses of `Nokogiri::HTML5::Document` and `Nokogiri::HTML5::DocumentFragment`.

The module methods `Loofah.html5_document` and `Loofah.html5_fragment` will parse an HTML document and an HTML fragment, respectively.

``` ruby
Loofah.html5_document(unsafe_html).is_a?(Nokogiri::HTML5::Document)         # => true
Loofah.html5_fragment(unsafe_html).is_a?(Nokogiri::HTML5::DocumentFragment) # => true
```

Loofah injects a `scrub!` method, which takes either a symbol (for built-in scrubbers) or a `Loofah::Scrubber` object (for custom scrubbers), and modifies the document in-place.

Loofah overrides `to_s` to return HTML:

``` ruby
unsafe_html = "ohai! <div>div is safe</div> <script>but script is not</script>"

doc = Loofah.html5_fragment(unsafe_html).scrub!(:prune)
doc.to_s    # => "ohai! <div>div is safe</div> "
```

and `text` to return plain text:

``` ruby
doc.text    # => "ohai! div is safe "
```

Also, `to_text` is available, which does the right thing with whitespace around block-level and line break elements.

``` ruby
doc = Loofah.html5_fragment("<h1>Title</h1><div>Content<br>Next line</div>")
doc.text    # => "TitleContentNext line"            # probably not what you want
doc.to_text # => "\nTitle\n\nContent\nNext line\n"  # better
```

### `Loofah::HTML4::Document` and `Loofah::HTML4::DocumentFragment`

These classes are subclasses of `Nokogiri::HTML4::Document` and `Nokogiri::HTML4::DocumentFragment`.

The module methods `Loofah.html4_document` and `Loofah.html4_fragment` will parse an HTML document and an HTML fragment, respectively.

``` ruby
Loofah.html4_document(unsafe_html).is_a?(Nokogiri::HTML4::Document)         # => true
Loofah.html4_fragment(unsafe_html).is_a?(Nokogiri::HTML4::DocumentFragment) # => true
```

### `Loofah::XML::Document` and `Loofah::XML::DocumentFragment`

These classes are subclasses of `Nokogiri::XML::Document` and `Nokogiri::XML::DocumentFragment`.

The module methods `Loofah.xml_document` and `Loofah.xml_fragment` will parse an XML document and an XML fragment, respectively.

``` ruby
Loofah.xml_document(bad_xml).is_a?(Nokogiri::XML::Document)         # => true
Loofah.xml_fragment(bad_xml).is_a?(Nokogiri::XML::DocumentFragment) # => true
```

### Nodes and Node Sets

Nokogiri's `Node` and `NodeSet` classes also get a `scrub!` method, which makes it easy to scrub subtrees.

The following code will apply the `employee_scrubber` only to the `employee` nodes (and their subtrees) in the document:

``` ruby
Loofah.xml_document(bad_xml).xpath("//employee").scrub!(employee_scrubber)
```

And this code will only scrub the first `employee` node and its subtree:

``` ruby
Loofah.xml_document(bad_xml).at_xpath("//employee").scrub!(employee_scrubber)
```

### `Loofah::Scrubber`

A Scrubber wraps up a block (or method) that is run on a document node:

``` ruby
# change all <span> tags to <div> tags
span2div = Loofah::Scrubber.new do |node|
  node.name = "div" if node.name == "span"
end
```

This can then be run on a document:

``` ruby
Loofah.html5_fragment("<span>foo</span><p>bar</p>").scrub!(span2div).to_s
# => "<div>foo</div><p>bar</p>"
```

Scrubbers can be run on a document in either a top-down traversal (the default) or bottom-up. Top-down scrubbers can optionally return `Scrubber::STOP` to terminate the traversal of a subtree. Read below and in the `Loofah::Scrubber` class for more detailed usage.

Here's an XML example:

``` ruby
# remove all <employee> tags that have a "deceased" attribute set to true
bring_out_your_dead = Loofah::Scrubber.new do |node|
  if node.name == "employee" && node["deceased"] == "true"
    node.remove
    Loofah::Scrubber::STOP # don't bother with the rest of the subtree
  end
end
Loofah.xml_document(File.read('plague.xml')).scrub!(bring_out_your_dead)
```

### Built-In HTML Scrubbers

Loofah comes with a set of sanitizing scrubbers that use `html5lib`'s safelist algorithm:

``` ruby
doc = Loofah.html5_document(input)
doc.scrub!(:strip)       # replaces unknown/unsafe tags with their inner text
doc.scrub!(:prune)       # removes unknown/unsafe tags and their children
doc.scrub!(:escape)      # escapes unknown/unsafe tags, like this: &lt;script&gt;
doc.scrub!(:whitewash)   # removes unknown/unsafe/namespaced tags and their children,
                         # and strips all node attributes
```

Loofah also comes with some common transformation tasks:

``` ruby
doc.scrub!(:nofollow)    # adds rel="nofollow" attribute to links
doc.scrub!(:noopener)    # adds rel="noopener" attribute to links
doc.scrub!(:noreferrer)  # adds rel="noreferrer" attribute to links
doc.scrub!(:unprintable) # removes unprintable characters from text nodes
doc.scrub!(:targetblank) # adds target="_blank" attribute to links
```

See `Loofah::Scrubbers` for more details and example usage.

### Chaining Scrubbers

You can chain scrubbers:

``` ruby
Loofah.html5_fragment("<span>hello</span> <script>alert('OHAI')</script>") \
      .scrub!(:prune) \
      .scrub!(span2div).to_s
# => "<div>hello</div> "
```

### Shorthand

The class methods `Loofah.scrub_html5_fragment` and `Loofah.scrub_html5_document` (and the corresponding HTML4 methods) are shorthand.

These methods:

``` ruby
Loofah.scrub_html5_fragment(unsafe_html, :prune)
Loofah.scrub_html5_document(unsafe_html, :prune)
Loofah.scrub_html4_fragment(unsafe_html, :prune)
Loofah.scrub_html4_document(unsafe_html, :prune)
Loofah.scrub_xml_fragment(bad_xml, custom_scrubber)
Loofah.scrub_xml_document(bad_xml, custom_scrubber)
```

do the same thing as (and are arguably semantically clearer than):

``` ruby
Loofah.html5_fragment(unsafe_html).scrub!(:prune)
Loofah.html5_document(unsafe_html).scrub!(:prune)
Loofah.html4_fragment(unsafe_html).scrub!(:prune)
Loofah.html4_document(unsafe_html).scrub!(:prune)
Loofah.xml_fragment(bad_xml).scrub!(custom_scrubber)
Loofah.xml_document(bad_xml).scrub!(custom_scrubber)
```

### View Helpers

Loofah has two "view helpers": `Loofah::Helpers.sanitize` and `Loofah::Helpers.strip_tags`, both of which are drop-in replacements for the Rails Action View helpers of the same name.

These are not required automatically. You must require `loofah/helpers` to use them.

## Requirements

* Nokogiri >= 1.5.9

## Installation

Unsurprisingly:

> gem install loofah

Requirements:

* Ruby >= 2.5

## Support

The bug tracker is available here:

* https://github.com/flavorjones/loofah/issues

And the mailing list is on Google Groups:

* Mail: loofah-talk@googlegroups.com
* Archive: https://groups.google.com/forum/#!forum/loofah-talk

Consider subscribing to [Tidelift][tidelift] which provides license assurances and timely security notifications for your open source dependencies, including Loofah. [Tidelift][tidelift] subscriptions also help the Loofah maintainers fund our [automated testing](https://ci.nokogiri.org) which in turn allows us to ship releases, bugfixes, and security updates more often.


[tidelift]: https://tidelift.com/subscription/pkg/rubygems-loofah?utm_source=undefined&utm_medium=referral&utm_campaign=enterprise

## Security

See [`SECURITY.md`](SECURITY.md) for vulnerability reporting details.

## Related Links

* loofah-activerecord: https://github.com/flavorjones/loofah-activerecord
* Nokogiri: http://nokogiri.org
* libxml2: http://xmlsoft.org
* html5lib: https://github.com/html5lib/

## Authors

* [Mike Dalessio](http://mike.daless.io) ([@flavorjones](https://twitter.com/flavorjones))
* Bryan Helmkamp

Featuring code contributed by:

* Aaron Patterson
* John Barnette
* Josh Owens
* Paul Dix
* Luke Melia

And a big shout-out to Corey Innis for the name, and feedback on the API.

## Thank You

The following people have generously funded Loofah:

* Bill Harding

## Historical Note

This library was once named "Dryopteris", which was a very bad name that nobody could spell properly.

## License

Distributed under the MIT License. See `MIT-LICENSE.txt` for details.