README.md in html2rss-0.8.0 vs README.md in html2rss-0.8.1
- old
+ new
@@ -13,11 +13,11 @@
With the _feed config_ containing the URL to scrape and
CSS selectors for information extraction (like title, URL, ...) your RSS builds.
[Extractors](#using-extractors) and chain-able [post processors](#using-post-processors)
make information extraction, processing and sanitizing a breeze.
-[Scraping JSON](#scraping-json) responses and
+[Scraping JSON](#scraping-and-handling-json-responses) responses and
[setting HTTP request headers](#set-any-http-header-in-the-request) is
supported, too.
## Installation
@@ -34,14 +34,11 @@
```ruby
require 'html2rss'
rss =
Html2rss.feed(
- channel: {
- title: 'StackOverflow: Hot Network Questions',
- url: 'https://stackoverflow.com/questions'
- },
+ channel: { url: 'https://stackoverflow.com/questions' },
selectors: {
items: { selector: '#hot-network-questions > ul > li' },
title: { selector: 'a' },
link: { selector: 'a', extractor: 'href' }
}
@@ -55,17 +52,19 @@
**Looks too complicated?** See [`html2rss-configs`](https://github.com/gildesmarais/html2rss-configs) for ready-made feed configs!
### The `channel`
-| attribute | | type | remark |
-| ------------- | -------- | ------- | ----------------------- |
-| `title` | required | String | |
-| `url` | required | String | |
-| `ttl` | optional | Integer | time to live in minutes |
-| `description` | optional | String | |
-| `headers` | optional | Hash | See notes below. |
+| attribute | | type | default | remark |
+| ------------- | -------- | ------- | -------------: | ------------------------------------------ |
+| `url` | required | String | | |
+| `title` | optional | String | auto-generated | |
+| `description` | optional | String | auto-generated | |
+| `ttl` | optional | Integer | `360` | TTL in _minutes_ |
+| `time_zone` | optional | String | `'UTC'` | TimeZone name |
+| `headers` | optional | Hash | `{}` | Set HTTP request headers. See notes below. |
+| `json` | optional | Boolean | `false` | Handle JSON response. See notes below. |
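To make the new defaults concrete, here is a minimal sketch of a `channel` hash with every default from the 0.8.1 table written out. The header value in the override is hypothetical and only illustrates the shape of the `headers` hash. `title` and `description` are left out because the table says they are auto-generated.

```ruby
# A channel config with the documented 0.8.1 defaults spelled out.
# Only `url` is required; the other keys repeat the defaults from the table.
channel = {
  url: 'https://stackoverflow.com/questions',
  ttl: 360,          # time to live in minutes
  time_zone: 'UTC',  # time zone name
  headers: {},       # no extra HTTP request headers
  json: false        # treat the response as HTML
}

# Overriding selected defaults. The User-Agent value is a made-up example.
custom_channel = channel.merge(
  ttl: 60,
  headers: { 'User-Agent' => 'my-feed-reader/1.0' }
)
```

Since these values are already the defaults, passing only `url` should behave the same as passing the full `channel` hash above.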
### The `selectors`
You must provide an `items` selector hash which contains the CSS selector.
`items` needs to return a collection of HTML tags.
@@ -76,22 +75,22 @@
each item has to have at least a `title` or a `description`.
Your `selectors` can contain arbitrary selector names, but only these
will make it into the RSS feed:
-| RSS 2.0 tag | name in html2rss | remark |
-| ------------- | ---------------- | --------------------------- |
-| `title` | `title` | |
-| `description` | `description` | Supports HTML. |
-| `link` | `link` | A URL. |
-| `author` | `author` | |
-| `category` | `categories` | See notes below. |
-| `enclosure` | `enclosure` | See notes below. |
-| `pubDate` | `update` | An instance of `Time`. |
-| `guid` | `guid` | Generated from the `title`. |
-| `comments` | `comments` | A URL. |
-| `source` | ~~source~~ | Not yet supported. |
+| RSS 2.0 tag | name in `html2rss` | remark |
+| ------------- | ------------------ | --------------------------- |
+| `title` | `title` | |
+| `description` | `description` | Supports HTML. |
+| `link` | `link` | A URL. |
+| `author` | `author` | |
+| `category` | `categories` | See notes below. |
+| `enclosure` | `enclosure` | See notes below. |
+| `pubDate` | `update` | An instance of `Time`. |
+| `guid` | `guid` | Generated from the `title`. |
+| `comments` | `comments` | A URL. |
+| `source` | ~~source~~ | Not yet supported. |
### The `selector` hash
Your selector hash can have these attributes:
@@ -223,11 +222,11 @@
</details>
## Adding `<category>` tags to an item
-The `categories` selector takes an array of selector names. The value of those
+The `categories` selector takes an array of selector names. Each value of those
selectors will become a `<category>` on the RSS item.
<details>
<summary>See a Ruby example</summary>
@@ -266,15 +265,15 @@
</details>
## Adding an `<enclosure>` tag to an item
-An enclosure can be 'anything', e.g. a image, audio or video file.
+An enclosure can be any file, e.g. an image, audio or video.
The `enclosure` selector needs to return a URL of the content to enclose. If the extracted URL is relative, it will be converted to an absolute one using the channel's URL as base.
-Since html2rss does no further inspection of the enclosure, its support comes with trade-offs:
+Since `html2rss` does no further inspection of the enclosure, its support comes with trade-offs:
1. The content-type is guessed from the file extension of the URL.
2. If the content-type guessing fails, it will default to `application/octet-stream`.
3. The content-length will always be undetermined and thus stated as `0` bytes.
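The trade-offs above can be illustrated with a small standalone sketch. It is not the gem's implementation: `guess_enclosure` and the `CONTENT_TYPES` table are hypothetical names introduced here, and the mapping covers only a few extensions. It shows the documented behaviour: a relative URL is resolved against the channel URL, the type is guessed from the file extension with a fallback to `application/octet-stream`, and the length is always `0`.

```ruby
require 'uri'

# Hypothetical, deliberately tiny extension-to-type table (not the gem's own).
CONTENT_TYPES = {
  '.jpg' => 'image/jpeg',
  '.png' => 'image/png',
  '.mp3' => 'audio/mpeg',
  '.mp4' => 'video/mp4'
}.freeze

# Builds the attributes of an <enclosure> tag from an extracted URL.
# `channel_url` is the base used to turn a relative URL into an absolute one.
def guess_enclosure(url, channel_url)
  absolute = URI.join(channel_url, url)
  extension = File.extname(absolute.path).downcase

  {
    url: absolute.to_s,
    # Fall back to a generic type when the extension is unknown or missing.
    type: CONTENT_TYPES.fetch(extension, 'application/octet-stream'),
    # The file is never inspected, so the length stays undetermined.
    length: 0
  }
end

known = guess_enclosure('/media/episode.mp3', 'https://example.com/podcast/')
unknown = guess_enclosure('download', 'https://example.com/')
```

Here `known[:type]` is `audio/mpeg`, while `unknown[:type]` falls back to `application/octet-stream` because the URL carries no file extension.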
@@ -308,11 +307,11 @@
attribute: "src"
```
</details>
-## Scraping JSON
+## Scraping and handling JSON responses
Although this gem is called **html***2rss*, it's possible to scrape and process JSON.
Adding `json: true` to the channel config will convert the JSON response to XML.
@@ -483,10 +482,10 @@
Find a full example of a `config.yml` at [`spec/config.test.yml`](https://github.com/gildesmarais/html2rss/blob/master/spec/config.test.yml).
## Gotchas and tips & tricks
- Check that the channel URL does not redirect to a mobile page with a different markup structure.
-- Do not rely on your web browser's developer console. html2rss does not execute JavaScript.
+- Do not rely on your web browser's developer console. `html2rss` does not execute JavaScript.
- Fiddling with [`curl`](https://github.com/curl/curl) and [`pup`](https://github.com/ericchiang/pup) to find the selectors seems efficient (`curl URL | pup`).
- [CSS selectors are quite versatile, here's an overview.](https://www.w3.org/TR/selectors-4/#overview)
## Development