# Loofah

* https://github.com/flavorjones/loofah
* Docs: http://rubydoc.info/github/flavorjones/loofah/main/frames
* Mailing list: [loofah-talk@googlegroups.com](https://groups.google.com/forum/#!forum/loofah-talk)

## Status

[![ci](https://github.com/flavorjones/loofah/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/flavorjones/loofah/actions/workflows/ci.yml)
[![Tidelift dependencies](https://tidelift.com/badges/package/rubygems/loofah)](https://tidelift.com/subscription/pkg/rubygems-loofah?utm_source=rubygems-loofah&utm_medium=referral&utm_campaign=readme)

## Description

Loofah is a general library for manipulating and transforming HTML/XML documents and fragments, built on top of Nokogiri.

Loofah also includes some HTML sanitizers based on `html5lib`'s safelist, which are a specific application of the general transformation functionality.

Active Record extensions for HTML sanitization are available in the [`loofah-activerecord` gem](https://github.com/flavorjones/loofah-activerecord).

## Features

* Easily write custom transformations for HTML and XML
* Common HTML sanitizing transformations are built-in:
  * _Strip_ unsafe tags, leaving behind only the inner text.
  * _Prune_ unsafe tags and their subtrees, removing all traces that they ever existed.
  * _Escape_ unsafe tags and their subtrees, leaving behind lots of `&lt;` and `&gt;` entities.
  * _Whitewash_ the markup, removing all attributes and namespaced nodes.
* Other common HTML transformations are built-in:
  * Add the _nofollow_ attribute to all hyperlinks.
  * Add the _target=\_blank_ attribute to all hyperlinks.
  * Remove _unprintable_ characters from text nodes.
* Format markup as plain text, with (or without) sensible whitespace handling around block elements.
* Replace Rails's `strip_tags` and `sanitize` view helper methods.
## Compare and Contrast

Loofah is both:

- a general framework for transforming XML, XHTML, and HTML documents
- a specific toolkit for HTML sanitization

### General document transformation

Loofah tries to make it easy to write your own custom scrubbers for whatever document transformation you need. You don't like the built-in scrubbers? Build your own, like a boss.

### HTML sanitization

Another Ruby library that provides HTML sanitization is [`rgrove/sanitize`](https://github.com/rgrove/sanitize), also built on top of Nokogiri, which provides a bit more flexibility on the tags and attributes being scrubbed.

You may also want to look at [`rails/rails-html-sanitizer`](https://github.com/rails/rails-html-sanitizer), which is built on top of Loofah and provides some useful extensions and additional flexibility in the HTML sanitization.

## The Basics

Loofah wraps [Nokogiri](http://nokogiri.org) in a loving embrace. Nokogiri is a stable, well-maintained parser for XML, HTML4, and HTML5.

Loofah implements the following classes:

* `Loofah::HTML5::Document`
* `Loofah::HTML5::DocumentFragment`
* `Loofah::HTML4::Document` (aliased as `Loofah::HTML::Document` for now)
* `Loofah::HTML4::DocumentFragment` (aliased as `Loofah::HTML::DocumentFragment` for now)
* `Loofah::XML::Document`
* `Loofah::XML::DocumentFragment`

These document and fragment classes are subclasses of the similarly-named Nokogiri classes `Nokogiri::HTML5::Document` et al.

Loofah also implements `Loofah::Scrubber`, which represents the document transformation, either by wrapping a block,

``` ruby
span2div = Loofah::Scrubber.new do |node|
  node.name = "div" if node.name == "span"
end
```

or by implementing a method.

### Side Note: Fragments vs Documents

Generally speaking, unless you expect to have a DOCTYPE and a single root node, you don't have a *document*, you have a *fragment*. For HTML, another rule of thumb is that *documents* have `html` and `body` tags, and *fragments* usually do not.

**HTML fragments** should be parsed with `Loofah.html5_fragment` or `Loofah.html4_fragment`. The result won't be wrapped in `html` or `body` tags, won't have a DOCTYPE declaration, `head` elements will be silently ignored, and multiple root nodes are allowed.

**HTML documents** should be parsed with `Loofah.html5_document` or `Loofah.html4_document`. The result will have a DOCTYPE declaration, along with `html`, `head` and `body` tags.

**XML fragments** should be parsed with `Loofah.xml_fragment`. The result won't have a DOCTYPE declaration, and multiple root nodes are allowed.

**XML documents** should be parsed with `Loofah.xml_document`. The result will have a DOCTYPE declaration and a single root node.

### Side Note: HTML4 vs HTML5

⚠ _HTML5 functionality is not available on JRuby, or with versions of Nokogiri `< 1.14.0`._

Currently, Loofah's methods `Loofah.document` and `Loofah.fragment` are aliases to `.html4_document` and `.html4_fragment`, which use Nokogiri's HTML4 parser. (Similarly, `Loofah::HTML::Document` and `Loofah::HTML::DocumentFragment` are aliased to `Loofah::HTML4::Document` and `Loofah::HTML4::DocumentFragment`.)

**Please note** that in a future version of Loofah, these methods and classes may switch to using Nokogiri's HTML5 parser and classes on platforms that support it [1].

**We strongly recommend that you explicitly use `.html5_document` or `.html5_fragment`** unless you know of a compelling reason not to. If you are sure that you need to use the HTML4 parser, you should explicitly call `.html4_document` or `.html4_fragment` to avoid breakage in a future version.

[1]: [[feature request] HTML5 parser for JRuby implementation · Issue #2227 · sparklemotion/nokogiri](https://github.com/sparklemotion/nokogiri/issues/2227)

### `Loofah::HTML5::Document` and `Loofah::HTML5::DocumentFragment`

These classes are subclasses of `Nokogiri::HTML5::Document` and `Nokogiri::HTML5::DocumentFragment`.

The module methods `Loofah.html5_document` and `Loofah.html5_fragment` will parse an HTML document and an HTML fragment, respectively.

``` ruby
Loofah.html5_document(unsafe_html).is_a?(Nokogiri::HTML5::Document)         # => true
Loofah.html5_fragment(unsafe_html).is_a?(Nokogiri::HTML5::DocumentFragment) # => true
```

Loofah injects a `scrub!` method, which takes either a symbol (for built-in scrubbers) or a `Loofah::Scrubber` object (for custom scrubbers), and modifies the document in-place.

Loofah overrides `to_s` to return HTML:

``` ruby
# span2div is the block-based scrubber defined in "The Basics" above
unsafe_html = "ohai! <span>bar</span>"
Loofah.html5_fragment(unsafe_html).scrub!(span2div).to_s # => "ohai! <div>bar</div>"
```

Scrubbers can be run on a document in either a top-down traversal (the default) or bottom-up. Top-down scrubbers can optionally return `Scrubber::STOP` to terminate the traversal of a subtree. Read below and in the `Loofah::Scrubber` class for more detailed usage.

Here's an XML example:

``` ruby
# remove all