# HTMLPipeline
> **Note**
> This README refers to the behavior in the new 3.0.0.pre1 gem.
HTML processing filters and utilities. This module is a small
framework for defining CSS-based content filters and applying them to user
provided content.
[Although this project was started at GitHub](https://github.com/blog/1311-html-pipeline-chainable-content-filters), they no longer do. This gem must be considered standalone and independent from GitHub.
- [Installation](#installation)
- [Usage](#usage)
- [More Examples](#more-examples)
- [Filters](#filters)
- [Dependencies](#dependencies)
- [Documentation](#documentation)
- [Instrumenting](#instrumenting)
- [Third Party Extensions](#third-party-extensions)
- [FAQ](#faq)
## Installation
Add this line to your application's Gemfile:
```ruby
gem 'html-pipeline'
```
And then execute:
```sh
$ bundle
```
Or install it by yourself as:
```sh
$ gem install html-pipeline
```
## Usage
This library provides a handful of chainable HTML filters to transform user
content into HTML markup. Each filter does some work, and then hands off the
results tothe next filter. A pipeline has several kinds of filters available to use:
- Multiple `TextFilter`s, which operate a UTF-8 string
- A `ConvertFilter` filter, which turns text into HTML (eg., Commonmark/Asciidoc -> HTML)
- A `SanitizationFilter`, which remove dangerous/unwanted HTML elements and attributes
- Multiple `NodeFilter`s, which operate on a UTF-8 HTML document
You can assemble each sequence into a single pipeline, or choose to call each filter individually.
As an example, suppose we want to transform Commonmark source text into Markdown HTML. With the content, we also want to:
- change every instance of `$NAME` to "`Johnny"
- strip undesired HTML
- linkify @mention
We can construct a pipeline to do all that like this:
```ruby
require 'html_pipeline'
class HelloJohnnyFilter < HTMLPipelineFilter
def call
text.gsub("$NAME", "Johnny")
end
end
pipeline = HTMLPipeline.new(
text_filters: [HelloJohnnyFilter.new]
convert_filter: HTMLPipeline::ConvertFilter::MarkdownFilter.new),
# note: next line is not needed as sanitization occurs by default;
# see below for more info
sanitization_config: HTMLPipeline::SanitizationFilter::DEFAULT_CONFIG,
node_filters: [HTMLPipeline::NodeFilter::MentionFilter.new]
)
pipeline.call(user_supplied_text) # recommended: can call pipeline over and over
```
Filters can be custom ones you create (like `HelloJohnnyFilter`), and `HTMLPipeline` additionally provides several helpful ones (detailed below). If you only need a single filter, you can call one individually, too:
```ruby
filter = HTMLPipeline::ConvertFilter::MarkdownFilter.new(text)
filter.call
```
Filters combine into a sequential pipeline, and each filter hands its
output to the next filter's input. Text filters are
processed first, then the convert filter, sanitization filter, and finally, the node filters.
Some filters take optional `context` and/or `result` hash(es). These are
used to pass around arguments and metadata between filters in a pipeline. For
example, if you want to disable footnotes in the `MarkdownFilter`, you can pass an option in the context hash:
```ruby
context = { markdown: extensions: { footnotes: false } }
filter = HTMLPipeline::ConvertFilter::MarkdownFilter.new("Hi **world**!", context: context)
filter.call
```
Please refer to the documentation for each filter to understand what configuration options are available.
### More Examples
Different pipelines can be defined for different parts of an app. Here are a few
paraphrased snippets to get you started:
```ruby
# The context hash is how you pass options between different filters.
# See individual filter source for explanation of options.
context = {
asset_root: "http://your-domain.com/where/your/images/live/icons",
base_url: "http://your-domain.com"
}
# Pipeline used for user provided content on the web
MarkdownPipeline = HTMLPipeline.new (
text_filters: [HTMLPipeline::TextFilter::ImageMaxWidthFilter.new],
convert_filter: [HTMLPipeline::ConvertFilter::MarkdownFilter.new],
node_filters: [
HTMLPipeline::NodeFilter::HttpsFilter.new,HTMLPipeline::NodeFilter::MentionFilter.new,
], context: context)
# Pipelines aren't limited to the web. You can use them for email
# processing also.
HtmlEmailPipeline = HTMLPipeline.new(
text_filters: [
PlainTextInputFilter.new,
ImageMaxWidthFilter.new
], {})
```
## Filters
### TextFilters
`TextFilter`s must define a method named `call` which is called on the text. `@text`, `@config`, and `@result` are available to use, and any changes made to these ivars are passed on to the next filter.
- `ImageFilter` - converts image `url` into `` tag
- `PlainTextInputFilter` - html escape text and wrap the result in a `
`
### ConvertFilter
The `ConvertFilter` takes text and turns it into HTML. `@text`, `@config`, and `@result` are available to use. `ConvertFilter` must defined a method named `call`, taking one argument, `text`. `call` must return a string representing the new HTML document.
- `MarkdownFilter` - creates HTML from text using [Commonmarker](https://www.github.com/gjtorikian/commonmarker)
### Sanitization
Because the web can be a scary place, HTML is automatically sanitized after the `ConvertFilter` runs and before the `NodeFilter`s are processed. This is to prevent malicious or unexpected input from entering the pipeline.
The sanitization process takes a hash configuration of settings. See the [Selma](https://www.github.com/gjtorikian/selma) documentation for more information on how to configure these settings.
A default sanitization config is provided by this library (`HTMLPipeline::SanitizationFilter::DEFAULT_CONFIG`). A sample custom sanitization allowlist might look like this:
```ruby
ALLOWLIST = {
elements: ["p", "pre", "code"]
}
pipeline = HTMLPipeline.new \
text_filters: [
HTMLPipeline::MarkdownFilter,
],
convert_filter: [HTMLPipeline::ConvertFilter::MarkdownFilter.new],
sanitization_config: ALLOWLIST
result = pipeline.call <<-CODE
This is *great*:
some_code(:first)
CODE
result[:output].to_s
```
This would print:
```html
This is great:
some_code(:first)
```
Sanitization can be disabled if and only if `nil` is explicitly passed as
the config:
```ruby
pipeline = HTMLPipeline.new \
text_filters: [
HTMLPipeline::MarkdownFilter,
],
convert_filter: [HTMLPipeline::ConvertFilter::MarkdownFilter.new],
sanitization_config: nil
```
For more examples of customizing the sanitization process to include the tags you want, check out [the tests](test/sanitization_filter_test.rb) and [the FAQ](#faq).
### NodeFilters
`NodeFilters`s can operate either on HTML elements or text nodes using CSS selectors. Each `NodeFilter` must define a method named `selector` which provides an instance of `Selma::Selector`. If elements are being manipulated, `handle_element` must be defined, taking one argument, `element`; if text nodes are being manipulated, `handle_text_chunk` must be defined, taking one argument, `text_chunk`. `@config`, and `@result` are available to use, and any changes made to these ivars are passed on to the next filter.
`NodeFilter` also has an optional method, `after_initialize`, which is run after the filter initializes. This can be useful in setting up a custom state for `result` to take advantage of.
Here's an example `NodeFilter` that adds a base url to images that are root relative:
```ruby
require 'uri'
class RootRelativeFilter < HTMLPipeline::NodeFilter
SELECTOR = Selma::Selector.new(match_element: "img")
def selector
SELECTOR
end
def handle_element(img)
next if img['src'].nil?
src = img['src'].strip
if src.start_with? '/'
img["src"] = URI.join(context[:base_url], src).to_s
end
end
end
```
For more information on how to write effective `NodeFilter`s, refer to the provided filters, and see the underlying lib, [Selma](https://www.github.com/gjtorikian/selma) for more information.
- `AbsoluteSourceFilter`: replace relative image urls with fully qualified versions
- `EmojiFilter`: converts `:
:` to [emoji](http://www.emoji-cheat-sheet.com/)
- (Note: the included `MarkdownFilter` will already convert emoji)
- `HttpsFilter`: Replacing http urls with https versions
- `ImageMaxWidthFilter`: link to full size image for large images
- `MentionFilter`: replace `@user` mentions with links
- `SanitizationFilter`: allow sanitize user markup
- `SyntaxHighlightFilter`: applies syntax highlighting to `pre` blocks
- (Note: the included `MarkdownFilter` will already apply highlighting)
- `TableOfContentsFilter`: anchor headings with name attributes and generate Table of Contents html unordered list linking headings
- `TeamMentionFilter`: replace `@org/team` mentions with links
## Dependencies
Since filters can be customized to your heart's content, gem dependencies are _not_ bundled; this project doesn't know which of the default filters you might use, and as such, you must bundle each filter's gem dependencies yourself.
For example, `SyntaxHighlightFilter` uses [rouge](https://github.com/jneen/rouge)
to detect and highlight languages; to use the `SyntaxHighlightFilter`, you must add the following to your Gemfile:
```ruby
gem "rouge"
```
> **Note**
> See the [Gemfile](/Gemfile) `:test` group for any version requirements.
When developing a custom filter, call `HTMLPipeline.require_dependency` at the start to ensure that the local machine has the necessary dependency. You can also use `HTMLPipeline.require_dependencies` to provide a list of dependencies to check.
## Documentation
Full reference documentation can be [found here](http://rubydoc.info/gems/html-pipeline/frames).
## Instrumenting
Filters and Pipelines can be set up to be instrumented when called. The pipeline
must be setup with an
[ActiveSupport::Notifications](http://api.rubyonrails.org/classes/ActiveSupport/Notifications.html)
compatible service object and a name. New pipeline objects will default to the
`HTMLPipeline.default_instrumentation_service` object.
```ruby
# the AS::Notifications-compatible service object
service = ActiveSupport::Notifications
# instrument a specific pipeline
pipeline = HTMLPipeline.new [MarkdownFilter], context
pipeline.setup_instrumentation "MarkdownPipeline", service
# or set default instrumentation service for all new pipelines
HTMLPipeline.default_instrumentation_service = service
pipeline = HTMLPipeline.new [MarkdownFilter], context
pipeline.setup_instrumentation "MarkdownPipeline"
```
Filters are instrumented when they are run through the pipeline. A
`call_filter.html_pipeline` event is published once any filter finishes; `call_text_filters`
and `call_node_filters` is published when all of the text and node filters are finished, respectively.
The `payload` should include the `filter` name. Each filter will trigger its own
instrumentation call.
```ruby
service.subscribe "call_filter.html_pipeline" do |event, start, ending, transaction_id, payload|
payload[:pipeline] #=> "MarkdownPipeline", set with `setup_instrumentation`
payload[:filter] #=> "MarkdownFilter"
payload[:context] #=> context Hash
payload[:result] #=> instance of result class
payload[:result][:output] #=> output HTML String
end
```
The full pipeline is also instrumented:
```ruby
service.subscribe "call_text_filters.html_pipeline" do |event, start, ending, transaction_id, payload|
payload[:pipeline] #=> "MarkdownPipeline", set with `setup_instrumentation`
payload[:filters] #=> ["MarkdownFilter"]
payload[:doc] #=> HTML String
payload[:context] #=> context Hash
payload[:result] #=> instance of result class
payload[:result][:output] #=> output HTML String
end
```
## Third Party Extensions
If you have an idea for a filter, propose it as
[an issue](https://github.com/gjtorikian/html-pipeline/issues) first. This allows us to discuss
whether the filter is a common enough use case to belong in this gem, or should be
built as an external gem.
Here are some extensions people have built:
- [html-pipeline-asciidoc_filter](https://github.com/asciidoctor/html-pipeline-asciidoc_filter)
- [jekyll-html-pipeline](https://github.com/gjtorikian/jekyll-html-pipeline)
- [nanoc-html-pipeline](https://github.com/burnto/nanoc-html-pipeline)
- [html-pipeline-bitly](https://github.com/dewski/html-pipeline-bitly)
- [html-pipeline-cite](https://github.com/lifted-studios/html-pipeline-cite)
- [tilt-html-pipeline](https://github.com/bradgessler/tilt-html-pipeline)
- [html-pipeline-wiki-link'](https://github.com/lifted-studios/html-pipeline-wiki-link) - WikiMedia-style wiki links
- [task_list](https://github.com/github/task_list) - GitHub flavor Markdown Task List
- [html-pipeline-nico_link](https://github.com/rutan/html-pipeline-nico_link) - An HTMLPipeline filter for [niconico](http://www.nicovideo.jp) description links
- [html-pipeline-gitlab](https://gitlab.com/gitlab-org/html-pipeline-gitlab) - This gem implements various filters for html-pipeline used by GitLab
- [html-pipeline-youtube](https://github.com/st0012/html-pipeline-youtube) - An HTMLPipeline filter for YouTube links
- [html-pipeline-flickr](https://github.com/st0012/html-pipeline-flickr) - An HTMLPipeline filter for Flickr links
- [html-pipeline-vimeo](https://github.com/dlackty/html-pipeline-vimeo) - An HTMLPipeline filter for Vimeo links
- [html-pipeline-hashtag](https://github.com/mr-dxdy/html-pipeline-hashtag) - An HTMLPipeline filter for hashtags
- [html-pipeline-linkify_github](https://github.com/jollygoodcode/html-pipeline-linkify_github) - An HTMLPipeline filter to autolink GitHub urls
- [html-pipeline-redcarpet_filter](https://github.com/bmikol/html-pipeline-redcarpet_filter) - Render Markdown source text into Markdown HTML using Redcarpet
- [html-pipeline-typogruby_filter](https://github.com/bmikol/html-pipeline-typogruby_filter) - Add Typogruby text filters to your HTMLPipeline
- [korgi](https://github.com/jodeci/korgi) - HTMLPipeline filters for links to Rails resources
## FAQ
### 1. Why doesn't my pipeline work when there's no root element in the document?
To make a pipeline work on a plain text document, put the `PlainTextInputFilter`
at the end of your `text_filter`s config . This will wrap the content in a `div` so the filters have a root element to work with. If you're passing in an HTML fragment,
but it doesn't have a root element, you can wrap the content in a `div`
yourself.
### 2. How do I customize an allowlist for `SanitizationFilter`s?
`HTMLPipeline::SanitizationFilter::ALLOWLIST` is the default allowlist used if no `sanitization_config`
argument is given. The default is a good starting template for
you to add additional elements. You can either modify the constant's value, or
re-define your own config and pass that in, such as:
```ruby
config = HTMLPipeline::SanitizerFilter::DEFAULT_CONFIG.dup
config[:elements] << "iframe" # sure, whatever you want
```
### Contributors
Thanks to all of [these contributors](https://github.com/gjtorikian/html-pipeline/graphs/contributors).
This project is a member of the [OSS Manifesto](http://ossmanifesto.org/).