# ScrapKit

ScrapKit automates web scraping and converts the results into plain objects, using configuration objects called _recipes_.

Each recipe can be loaded as an object or as a JSON file, and has the following structure:

```json
{
  "url": "https://status.heroku.com/",
  "attributes": {
    "apps": ".subnav__inner .ember-view:nth-child(1) > .status-summary__description",
    "data": ".subnav__inner .ember-view:nth-child(2) > .status-summary__description",
    "tools": ".subnav__inner .ember-view:nth-child(3) > .status-summary__description"
  }
}
```

* `url`: Defines the web page to scrape.
* `attributes`: An object that maps each attribute name to its corresponding CSS selector.

`attributes` can have a more complex structure to handle collections. For example:

```json
{
  "url": "https://hpneo.dev/",
  "attributes": {
    "posts": {
      "selector": ".post-item",
      "children_attributes": {
        "title": "h2"
      }
    }
  }
}
```

In this case, `attributes` has a `posts` key, which stores the results of a collection defined by a CSS `selector` and an object of children attributes.

`children_attributes` is an object that maps each attribute name to its corresponding CSS selector (similar to how `attributes` works in its simpler version).
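
As a minimal sketch, the collection recipe above can also be written as a Ruby hash and saved as JSON with the standard library only. The file name `posts_recipe.json` is an assumption made for illustration; any path works.

```ruby
require "json"
require "tmpdir"

# The same collection recipe as above, expressed as a Ruby hash.
recipe = {
  url: "https://hpneo.dev/",
  attributes: {
    posts: {
      selector: ".post-item",
      children_attributes: { title: "h2" }
    }
  }
}

# Hypothetical file name, written to the system temp directory.
path = File.join(Dir.tmpdir, "posts_recipe.json")
File.write(path, JSON.pretty_generate(recipe))

# Reading the file back yields string keys, matching the JSON example.
loaded = JSON.parse(File.read(path))
puts loaded["attributes"]["posts"]["selector"]
#=> .post-item
```

The resulting file can then be passed to `ScrapKit::Recipe.load` by path.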

## Installation

Add this line to your application's Gemfile:

```ruby
gem 'scrap_kit'
```

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install scrap_kit

## Usage

`ScrapKit::Recipe.load` can take an object with the recipe, or load a JSON file.

```ruby
recipe = ScrapKit::Recipe.load(
  url: "https://status.heroku.com/",
  attributes: {
    apps: ".subnav__inner .status-summary:nth-child(1) > .status-summary__description",
    data: ".subnav__inner .status-summary:nth-child(2) > .status-summary__description",
    tools: ".subnav__inner .status-summary:nth-child(3) > .status-summary__description",
  }
)

output = recipe.run
#=> {:apps=>"ok", :data=>"ok", :tools=>"ok"}
```

For more complex structures, it's recommended to store the recipe in a JSON file:

```ruby
recipe = ScrapKit::Recipe.load("./spec/fixtures/file.json")

output = recipe.run
#=> {:posts=>[{:title=>"APIs de Internacionalización en JavaScript"}, {:title=>"Ejecutando comandos desde Ruby"}, {:title=>"Usando Higher-Order Components"}]}
```

### Working with selectors

Each attribute can be mapped to a selector, which can be any of the following types:

* A string, which represents a CSS selector.

```ruby
".subnav__inner .ember-view:nth-child(1) > .status-summary__description"
```

* A hash, which can have any of the following options:
  * `xpath: [String]`
  * `css: [String]`
  * `index: [Integer]`
  * `tag_name: [String]`
  * `text: [String]`

```ruby
{ text: "View Archive" }
```

* An array, which represents a path of selectors. Its last item must be a hash that maps a selector to an expected value.

```ruby
[".up-time-chart", { ".region-header .u-margin-Tm": "REGION" }]
```

Use whichever form suits you best.
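
The three selector forms can sit side by side in one `attributes` hash. The sketch below reuses the selectors shown above. The attribute names `apps`, `archive_link`, and `region`, and the helper `selector_kind`, are hypothetical and not part of ScrapKit; the helper only illustrates how the three forms differ by Ruby type.

```ruby
# Attribute names are hypothetical; the selectors are the ones shown above.
attributes = {
  apps: ".subnav__inner .ember-view:nth-child(1) > .status-summary__description",
  archive_link: { text: "View Archive" },
  region: [".up-time-chart", { ".region-header .u-margin-Tm": "REGION" }]
}

# Hypothetical helper (not a ScrapKit method): names the form of a selector.
def selector_kind(selector)
  case selector
  when String then :css_string
  when Hash   then :options_hash
  when Array  then :selector_path
  else raise ArgumentError, "unsupported selector: #{selector.inspect}"
  end
end

kinds = attributes.transform_values { |selector| selector_kind(selector) }
kinds.each { |name, kind| puts "#{name}: #{kind}" }
```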

### Writing steps

Recipes can have a `steps` entry. This entry defines the actions the scraper has to perform before extracting the attributes. The following steps are supported:

* **`goto`**: It instructs the scraper to go to a link inside the current page. Its value can be a hash or array selector, or a URL:

```ruby
{
  goto: { text: "View Archive" }
}
```

* **`click`**: It instructs the scraper to click on an element inside the current page. Its value can be a hash or array selector:

```ruby
{
  click: { css: "[type=submit]" }
}
```

* **`fill_form`**: It instructs the scraper to fill a form or any form field inside the current page. Its value is a hash where the keys are either an input's name or a CSS selector, and the values are the values to be entered into those fields:

```ruby
{
  fill_form: {
    gem_name: "ScrapKit",
    author: "hpneo",
  }
}
```
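
To show how the steps might fit into a full recipe, here is a minimal sketch. It assumes that `steps` is an array of single-key hashes run in order, which is not confirmed above. The URL and the `title` attribute are placeholders.

```ruby
# Assumption: `steps` is an ordered array of single-key hashes.
recipe = {
  url: "https://example.com/gems", # placeholder URL
  steps: [
    { fill_form: { gem_name: "ScrapKit", author: "hpneo" } },
    { click: { css: "[type=submit]" } },
    { goto: { text: "View Archive" } }
  ],
  attributes: {
    title: "h1" # hypothetical attribute and selector
  }
}

# List the steps in the order they are declared.
recipe[:steps].each_with_index do |step, index|
  action, target = step.first
  puts "#{index + 1}. #{action}: #{target.inspect}"
end
```

If that assumption holds, the hash could be passed to `ScrapKit::Recipe.load` in the same way as the recipes in the Usage section.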

## Development

After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).

## Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/hpneo/scrap_kit. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.

## License

The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).

## Code of Conduct

Everyone interacting in the ScrapKit project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/hpneo/scrap_kit/blob/master/CODE_OF_CONDUCT.md).