README.md in pupa-0.0.11 vs README.md in pupa-0.0.12

- old
+ new

@@ -5,10 +5,28 @@
[![Coverage Status](https://coveralls.io/repos/opennorth/pupa-ruby/badge.png?branch=master)](https://coveralls.io/r/opennorth/pupa-ruby)
[![Code Climate](https://codeclimate.com/github/opennorth/pupa-ruby.png)](https://codeclimate.com/github/opennorth/pupa-ruby)

Pupa.rb is a Ruby 2.0 fork of Sunlight Labs' [Pupa](https://github.com/opencivicdata/pupa). It implements an Extract, Transform and Load (ETL) process to scrape data from online sources, transform it, and write it to a database.

+## What it tries to solve
+
+Pupa.rb's goal is to make scraping less painful by solving common problems:
+
+* If you are updating a database by scraping a website, you can either delete and recreate records, or you can merge the scraped records with the saved records. Pupa.rb offers a simple way to merge records, by using an object's stable properties for identification.
+* If you are scraping a source that references other sources – for example, a committee that references its members – you may want to link the source to its references with foreign keys. Pupa.rb will use whatever identifying information you scrape – for example, the members' names – to fill in the foreign keys for you.
+* Data sources may use different formats in different contexts. Pupa.rb makes it easy to [select scraping methods](https://github.com/opennorth/pupa-ruby#scraping-method-selection) according to criteria, like the year of publication for example.
+* By splitting the scrape (extract) and import (load) steps, it's easier for you and volunteers to start a scraper without any interaction with a database.
+
+In short, Pupa.rb lets you spend more time on the tasks that are unique to your use case, and less time on common tasks like caching, merging and storing data. It also provides helpful features like:
+
+* Logging, to make debugging and monitoring a scraper easier
+* [Automatic response parsing](https://github.com/opennorth/pupa-ruby#automatic-response-parsing) of JSON, XML and HTML
+* Option parsing, to control your scraper from the command-line
+* Object validation, using [JSON Schema](http://json-schema.org/)
+
+Pupa.rb is extensible, so that you can add your own models, parsers, helpers, actions, etc. It also offers several ways to [improve your scraper's performance](https://github.com/opennorth/pupa-ruby#performance).
+
## Usage

You can use Pupa.rb to author scrapers that create people, organizations, memberships and posts according to the [Popolo](http://popoloproject.com/) open government data specification. If you need to scrape other types of data, you can also use your own models with Pupa.rb.

The [cat.rb](http://opennorth.github.io/pupa-ruby/docs/cat.html) example shows you how to:
@@ -47,10 +65,22 @@
### Automatic response parsing

JSON parsing is enabled by default. To enable automatic parsing of HTML and XML, require the `nokogiri` and `multi_xml` gems.

+### [OpenCivicData](http://opencivicdata.org/) compatibility
+
+Both Pupa.rb and Sunlight Labs' [Pupa](https://github.com/opencivicdata/pupa) implement models for people, organizations and memberships from the [Popolo](http://popoloproject.com/) open government data specification. Pupa.rb lets you use your own classes, but Pupa only supports a fixed set of classes. A consequence of Pupa.rb's flexibility is that the value of the `_type` property for `Person`, `Organization` and `Membership` objects differs between Pupa.rb and Pupa. Pupa.rb has namespaced types like `pupa/person` – to allow Ruby to load the `Person` class in the `Pupa` module – whereas Pupa has unnamespaced types like `person`.
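As an aside to the diff (this sketch is not part of either README version): the paragraph above says that namespaced types like `pupa/person` allow Ruby to load the `Person` class in the `Pupa` module. The idea can be sketched in plain Ruby as follows. The helper names `class_for_type` and `unnamespaced_type` are hypothetical, the `Pupa::Person` stub only stands in for the real class so that the example runs without the gem, and none of this is taken from Pupa.rb's source.

```ruby
# Hypothetical stand-in for the library's class, defined here so the
# sketch runs without Pupa.rb installed.
module Pupa
  class Person; end
end

# Resolve a `_type` value such as "pupa/person" to a class by camel-casing
# each path segment and walking the constant hierarchy from Object.
# (Assumed approach for illustration; not Pupa.rb's actual code.)
def class_for_type(type)
  type.split('/').reduce(Object) do |scope, segment|
    name = segment.split('_').map(&:capitalize).join
    # `false` restricts the lookup to constants defined directly on `scope`.
    scope.const_get(name, false)
  end
end

# Drop the namespace to get an unnamespaced type like "person".
def unnamespaced_type(type)
  type.split('/').last
end
```

With these definitions, `class_for_type('pupa/person')` returns the stub `Pupa::Person`. The unnamespaced `person` raises `NameError` here because no top-level `Person` constant exists, which is one way to see why a namespaced type is convenient on the Ruby side.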
+
+To save objects to MongoDB with unnamespaced types like Sunlight Labs' Pupa – in order to benefit from other tools in the [OpenCivicData](http://opencivicdata.org/) stack – add this line to the top of your script:
+
+```ruby
+require 'pupa/refinements/opencivicdata'
+```
+
+It is not currently possible to run the `scrape` action with one of Pupa.rb and Pupa, and to then run the `import` action with the other. Both actions must be run by the same library.
+
## Performance

Pupa.rb offers several ways to significantly improve performance. In an example case, reducing disk I/O and skipping validation as described below reduced the time to scrape 10,000 documents from 100 cached HTTP responses from 100 seconds down to 5 seconds. Like fast tests, fast scrapers make development smoother.
@@ -154,13 +184,9 @@
### Skipping validation

The `json-schema` gem is slow compared to, for example, [JSV](https://github.com/garycourt/JSV). Setting the `--no-validate` switch and running JSON Schema validations separately can further reduce a scraper's running time. The [pupa-validate](https://npmjs.org/package/pupa-validate) npm package can be used to validate JSON documents using the faster JSV. In an example case, using JSV instead of the `json-schema` gem reduced by half the time to validate 10,000 documents.
-
-### Parsing JSON
-
-If the rest of your scraper is fast, you may see an improvement by using the `oj` gem. Just `require 'oj'` and Pupa.rb will automatically pick it up, since it uses [MultiJson](https://github.com/intridea/multi_json).

### Profiling

You can profile your code using [perftools.rb](https://github.com/tmm1/perftools.rb). First, install the gem: