README.md in html2rss-0.7.0 vs README.md in html2rss-0.8.0
- old
+ new
@@ -1,135 +1,363 @@
![html2rss logo](https://github.com/gildesmarais/html2rss/raw/master/support/logo.png)
[![Build Status](https://travis-ci.org/gildesmarais/html2rss.svg?branch=master)](https://travis-ci.org/gildesmarais/html2rss)
[![Gem Version](https://badge.fury.io/rb/html2rss.svg)](http://rubygems.org/gems/html2rss/)
-[API docs on RubyDoc.info](https://www.rubydoc.info/gems/html2rss)
+[![Coverage Status](https://coveralls.io/repos/github/gildesmarais/html2rss/badge.svg?branch=master)](https://coveralls.io/github/gildesmarais/html2rss?branch=master)
+[![Yard Docs](http://img.shields.io/badge/yard-docs-blue.svg)](https://www.rubydoc.info/gems/html2rss)
+![Retro Badge: valid RSS](https://validator.w3.org/feed/images/valid-rss-rogers.png)
-Request HTML from an URL and transform it to a Ruby RSS 2.0 object.
+**Searching for a ready to use app which serves generated feeds via HTTP?**
+[Head over to `html2rss-web`!](https://github.com/gildesmarais/html2rss-web)
-**Are you searching for a ready to use "website to RSS" solution?**
-[Check out `html2rss-web`!](https://github.com/gildesmarais/html2rss-web)
+This Ruby gem builds RSS 2.0 feeds from a _feed config_.
-Each website needs a _feed config_ which contains the URL to scrape and
-CSS selectors to extract the required information (like title, URL, ...).
-This gem provides [extractors](https://github.com/gildesmarais/html2rss/blob/master/lib/html2rss/item_extractors) (e.g. extract the information from an HTML attribute)
-and chainable [post processors](https://github.com/gildesmarais/html2rss/tree/master/lib/html2rss/attribute_post_processors) to make information retrieval even easier.
+With the _feed config_ containing the URL to scrape and
+CSS selectors for information extraction (like title, URL, ...) your RSS builds.
+[Extractors](#using-extractors) and chain-able [post processors](#using-post-processors)
+make information extraction, processing and sanitizing a breeze.
+[Scraping JSON](#scraping-json) responses and
+[setting HTTP request headers](#set-any-http-header-in-the-request) is
+supported, too.
## Installation
-Add this line to your application's Gemfile: `gem 'html2rss'`
-Then execute: `bundle`
+| 🤩 Like it? | Star it! ⭐️ |
+| ---------------------------------------------: | -------------------- |
+| Add this line to your application's `Gemfile`: | `gem 'html2rss'` |
+| Then execute: | `bundle` |
+| In your code: | `require 'html2rss'` |
+## Building a feed config
+
+Here's a minimal working example:
+
```ruby
+require 'html2rss'
+
rss =
Html2rss.feed(
- channel: { title: 'StackOverflow: Hot Network Questions', url: 'https://stackoverflow.com/questions' },
+ channel: {
+ title: 'StackOverflow: Hot Network Questions',
+ url: 'https://stackoverflow.com/questions'
+ },
selectors: {
items: { selector: '#hot-network-questions > ul > li' },
title: { selector: 'a' },
link: { selector: 'a', extractor: 'href' }
}
)
-puts rss.to_s
+puts rss
```
-## Usage with a YAML config file
+A _feed config_ consists of a `channel` and a `selectors` Hash.
+The contents of both hashes are explained below.
-Create a YAML config file. Find an example at [`spec/config.test.yml`](https://github.com/gildesmarais/html2rss/blob/master/spec/config.test.yml).
+**Looks too complicated?** See [`html2rss-configs`](https://github.com/gildesmarais/html2rss-configs) for ready-made feed configs!
-`Html2rss.feed_from_yaml_config(File.join(['spec', 'config.test.yml']), 'nuxt-releases')`
-returns an `RSS:Rss` object.
+### The `channel`
-**Too complicated?** See [`html2rss-configs`](https://github.com/gildesmarais/html2rss-configs) for ready-made feed configs!
+| attribute | | type | remark |
+| ------------- | -------- | ------- | ----------------------- |
+| `title` | required | String | |
+| `url` | required | String | |
+| `ttl` | optional | Integer | time to live in minutes |
+| `description` | optional | String | |
+| `headers` | optional | Hash | See notes below. |
-## Assigning categories to an item
+### The `selectors`
+You must provide an `items` selector hash which contains the CSS selector.
+`items` needs to return a collection of HTML tags.
+The other selectors are scoped to the tags of the items' collection.
+
+To build a
+[valid RSS 2.0 item](http://www.rssboard.org/rss-profile#element-channel-item)
+each item has to have at least a `title` or a `description`.
+
+Your `selectors` can contain arbitrary selector names, but only these
+will make it into the RSS feed:
+
+| RSS 2.0 tag | name in html2rss | remark |
+| ------------- | ---------------- | --------------------------- |
+| `title` | `title` | |
+| `description` | `description` | Supports HTML. |
+| `link` | `link` | A URL. |
+| `author` | `author` | |
+| `category` | `categories` | See notes below. |
+| `enclosure` | `enclosure` | See notes below. |
+| `pubDate` | `update` | An instance of `Time`. |
+| `guid` | `guid` | Generated from the `title`. |
+| `comments` | `comments` | A URL. |
+| `source` | ~~source~~ | Not yet supported. |
+
+### The `selector` hash
+
+Your selector hash can have these attributes:
+
+| name | value |
+| -------------- | -------------------------------------------------------- |
+| `selector` | The CSS selector to select the tag with the information. |
+| `extractor` | Name of the extractor. See notes below. |
+| `post_process` | A hash or array of hashes. See notes below. |
+
+## Using extractors
+
+Extractors help with extracting the information from the selected HTML tag.
+
+- The default extractor is `text`, which returns the tag's inner text.
+- The `html` extractor returns the tag's outer HTML.
+- The `href` extractor returns a URL from the tag's `href` attribute and corrects relative ones to absolute ones.
+- The `attribute` extractor returns the value of that tag's attribute.
+- The `static` extractor returns the configured static value (it doesn't extract anything).
+- [See file list of extractors](https://github.com/gildesmarais/html2rss/tree/master/lib/html2rss/item_extractors).
+
+Extractors can require additional attributes on the selector hash.
+👉 [Read their docs for usage examples](https://www.rubydoc.info/gems/html2rss/Html2rss/ItemExtractors).
+
+<details>
+ <summary>See a Ruby example</summary>
+
+```ruby
+Html2rss.feed(
+ channel: {}, selectors: { link: { selector: 'a', extractor: 'href' } }
+)
+```
+
+</details>
+
+<details>
+ <summary>See a YAML feed config example</summary>
+
+```yml
+channel:
+ # ... omitted
+selectors:
+ # ... omitted
+ link:
+ selector: 'a'
+ extractor: 'href'
+```
+
+</details>
+
+## Using post processors
+
+Extracted information can be further manipulated with post processors.
+
+| name | |
+| ------------------ | ------------------------------------------------------------------------------------- |
+| `gsub` | Allows global substitution operations on Strings (Regexp or simple pattern). |
+| `html_to_markdown` | HTML to Markdown, using [reverse_markdown](https://github.com/xijo/reverse_markdown). |
+| `markdown_to_html` | converts Markdown to HTML, using [kramdown](https://github.com/gettalong/kramdown). |
+| `parse_time` | Parses a String containing a time in a time zone. |
+| `parse_uri` | Parses a String as URL. |
+| `sanitize_html` | Strips unsafe and uneeded HTML and adds security related attributes. |
+| `substring` | Cuts a part off of a String, starting at a position. |
+| `template` | Based on a template, it creates a new String filled with other selectors values. |
+
+⚠️ Always make use of the `sanitize_html` post processor for HTML content. _Never trust the internet!_ ⚠️
+
+- [See file list of post processors](https://github.com/gildesmarais/html2rss/tree/master/lib/html2rss/attribute_post_processors).
+
+👉 [Read their docs for usage examples.](https://www.rubydoc.info/gems/html2rss/Html2rss/AttributePostProcessors)
+
+<details>
+ <summary>See a Ruby example</summary>
+
+```ruby
+Html2rss.feed(
+ channel: {},
+ selectors: {
+ description: {
+ selector: '.content', post_process: { name: 'sanitize_html' }
+ }
+ }
+)
+```
+
+</details>
+
+<details>
+ <summary>See a YAML feed config example</summary>
+
+```yml
+channel:
+ # ... omitted
+selectors:
+ # ... omitted
+ description:
+ selector: '.content'
+ post_process:
+ - name: sanitize_html
+```
+
+</details>
+
+### Chaining post processors
+
+Pass an array to `post_process` to chain the post processors.
+
+<details>
+ <summary>YAML example: build the description from a template String (in Markdown) and convert that Markdown to HTML</summary>
+
+```yml
+channel:
+ # ... omitted
+selectors:
+ # ... omitted
+ price:
+ selector: '.price'
+ description:
+ selector: '.section'
+ post_process:
+ - name: template
+ string: |
+ # %{self}
+
+ Price: %{price}
+ - name: markdown_to_html
+```
+
+Note the use of `|` for a multi-line String in YAML.
+
+</details>
+
+## Adding `<category>` tags to an item
+
The `categories` selector takes an array of selector names. The value of those
-selectors will become a category on the item.
+selectors will become a `<category>` on the RSS item.
<details>
- <summary>See a YAML config example</summary>
+ <summary>See a Ruby example</summary>
+```ruby
+Html2rss.feed(
+ channel: {},
+ selectors: {
+ genre: {
+ # ... omitted
+ selector: '.genre'
+ },
+ branch: { selector: '.branch' },
+ categories: %i[genre branch]
+ }
+)
+```
+
+</details>
+
+<details>
+ <summary>See a YAML feed config example</summary>
+
```yml
channel:
-# ... omitted
+ # ... omitted
selectors:
- #... omitted
+ # ... omitted
genre:
- selector: '.genre'
+ selector: ".genre"
branch:
- selector: '.branch'
+ selector: ".branch"
categories:
- genre
- branch
```
</details>
-## Adding an enclosure to each item
+## Adding an `<enclosure>` tag to an item
An enclosure can be 'anything', e.g. a image, audio or video file.
-The config's `enclosure` selector needs to return a URL of the content to enclose. If the extracted URL is relative, it will be converted to an absolute one using the channel's url as a base.
+The `enclosure` selector needs to return a URL of the content to enclose. If the extracted URL is relative, it will be converted to an absolute one using the channel's URL as base.
-Since html2rss does no further inspection of the enclosure, the support of this tag comes with trade-offs:
+Since html2rss does no further inspection of the enclosure, its support comes with trade-offs:
1. The content-type is guessed from the file extension of the URL.
2. If the content-type guessing fails, it will default to `application/octet-stream`.
3. The content-length will always be undetermined and thus stated as `0` bytes.
Read the [RSS 2.0 spec](http://www.rssboard.org/rss-profile#element-channel-item-enclosure) for further information on enclosing content.
<details>
- <summary>See a YAML config example</summary>
+ <summary>See a Ruby example</summary>
+```ruby
+Html2rss.feed(
+ channel: {},
+ selectors: {
+ enclosure: { selector: 'img', extractor: 'attribute', attribute: 'src' }
+ }
+)
+```
+
+</details>
+
+<details>
+ <summary>See a YAML feed config example</summary>
+
```yml
channel:
-# ... omitted
+ # ... omitted
selectors:
- #... omitted
-enclosure:
- selector: 'img'
- extractor: 'attribute'
- attribute: 'src'
+ # ... omitted
+ enclosure:
+ selector: "img"
+ extractor: "attribute"
+ attribute: "src"
```
</details>
## Scraping JSON
-Since 0.5.0 it's possible to scrape and process JSON.
+Although this gem is called **html***2rss*, it's possible to scrape and process JSON.
Adding `json: true` to the channel config will convert the JSON response to XML.
<details>
+ <summary>See a Ruby example</summary>
+
+```ruby
+Html2rss.feed(
+ channel: {
+ url: 'https://example.com', title: 'Example with JSON', json: true
+ },
+ selectors: {} # ... omitted
+)
+```
+
+</details>
+
+<details>
<summary>See a YAML feed config example</summary>
```yaml
channel:
url: https://example.com
- title: 'Example with JSON'
+ title: "Example with JSON"
json: true
-# ...
+selectors:
+ # ... omitted
```
</details>
-Under the hood it uses ActiveSupport's [`Hash.to_xml`](https://apidock.com/rails/Hash/to_xml) core extension for the JSON to XML conversion.
+<details>
+ <summary>See example of a converted JSON object</summary>
-### Conversion of JSON objects
-
This JSON object:
```json
{
"data": [{ "title": "Headline", "url": "https://example.com" }]
}
```
-will be converted to:
+converts to:
```xml
<hash>
<data>
<datum>
@@ -140,19 +368,24 @@
</hash>
```
Your items selector would be `data > datum`, the item's `link` selector would be `url`.
-### Conversion of JSON arrays
+Find further information in [ActiveSupport's `Hash.to_xml` documentation](https://apidock.com/rails/Hash/to_xml).
+</details>
+
+<details>
+ <summary>See example of a converted JSON array</summary>
+
This JSON array:
```json
[{ "title": "Headline", "url": "https://example.com" }]
```
-will be converted to:
+converts to:
```xml
<objects>
<object>
<title>Headline</title>
@@ -161,43 +394,123 @@
</objects>
```
Your items selector would be `objects > object`, the item's `link` selector would be `url`.
+Find further information in [ActiveSupport's `Array.to_xml` documentation](https://apidock.com/rails/Array/to_xml).
+
+</details>
+
## Set any HTTP header in the request
You can add any HTTP headers to the request to the channel URL.
-You can use this to e.g. have Cookie or Authorization information being sent or to overwrite the User-Agent.
+Use this to e.g. have Cookie or Authorization information sent or to spoof the User-Agent.
+<details>
+ <summary>See a Ruby example</summary>
+
+ ```ruby
+ Html2rss.feed(
+ channel: {
+ url: 'https://example.com',
+ title: "Example with http headers",
+ headers: {
+ "User-Agent" => "html2rss-request",
+ "X-Something" => "Foobar",
+ "Authorization" => "Token deadbea7",
+ "Cookie" => "monster=MeWantCookie"
+ }
+ },
+ selectors: {}
+ )
+ ```
+
+</details>
+
+<details>
+ <summary>See a YAML feed config example</summary>
+
```yaml
channel:
url: https://example.com
title: "Example with http headers"
headers:
"User-Agent": "html2rss-request"
"X-Something": "Foobar"
"Authorization": "Token deadbea7"
"Cookie": "monster=MeWantCookie"
-# ...
+selectors:
+ # ...
```
-The headers provided by the channel will be merged into the global headers.
+</details>
-## Development
+The headers provided by the channel are merged into the global headers.
-After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+## Usage with a YAML config file
-## Contributing
+This step is not required to work with this gem. If you're using
+[`html2rss-web`](https://github.com/gildesmarais/html2rss-web)
+and want to create your private feed configs, keep on reading!
-Bug reports and pull requests are welcome on GitHub at https://github.com/gildesmarais/html2rss.
+First, create your YAML file, e.g. called `config.yml`.
+This file will contain your global config and feed configs.
-## Releasing a new version
+Example:
+```yml
+headers:
+ 'User-Agent': "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"
+feeds:
+ myfeed:
+ channel:
+ selectors:
+ myotherfeed:
+ channel:
+ selectors:
+```
+
+Your feed configs go below `feeds`. Everything else is part of the global config.
+
+Build your feeds like this:
+
+```ruby
+require 'html2rss'
+
+myfeed = Html2rss.feed_from_yaml_config('config.yml', 'myfeed')
+myotherfeed = Html2rss.feed_from_yaml_config('config.yml', 'myotherfeed')
+```
+
+Find a full example of a `config.yml` at [`spec/config.test.yml`](https://github.com/gildesmarais/html2rss/blob/master/spec/config.test.yml).
+
+## Gotchas and tips & tricks
+
+- Check that the channel URL does not redirect to a mobile page with a different markup structure.
+- Do not rely on your web browser's developer console. html2rss does not execute JavaScript.
+- Fiddling with [`curl`](https://github.com/curl/curl) and [`pup`](https://github.com/ericchiang/pup) to find the selectors seems efficient (`curl URL | pup`).
+- [CSS selectors are quite versatile, here's an overview.](https://www.w3.org/TR/selectors-4/#overview)
+
+## Development
+
+After checking out the repository, run `bin/setup` to install dependencies. Then, run `bundle exec rspec` to run the tests.
+You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+
+<details>
+ <summary>Releasing a new version</summary>
+
1. `git pull`
2. increase version in `lib/html2rss/version.rb`
3. `bundle`
-4. commit the changes
-5. `git tag v....`
-6. [`standard-changelog -f`](https://github.com/conventional-changelog/conventional-changelog/tree/master/packages/standard-changelog)
-7. `git add CHANGELOG.md && git commit --amend`
-8. `git tag v.... -f`
-9. `git push && git push --tags`
+4. `git add Gemfile.lock lib/html2rss/version.rb`
+5. `VERSION=$(ruby -e 'require "./lib/html2rss/version.rb"; puts Html2rss::VERSION')`
+6. `git commit -m "chore: release $VERSION"`
+7. `git tag v$VERSION`
+8. [`standard-changelog -f`](https://github.com/conventional-changelog/conventional-changelog/tree/master/packages/standard-changelog)
+9. `git add CHANGELOG.md && git commit --amend`
+10. `git tag v$VERSION -f`
+11. `git push && git push --tags`
+
+</details>
+
+## Contributing
+
+Bug reports and pull requests are welcome on GitHub at https://github.com/gildesmarais/html2rss.