# data_miner

Download, pull out of a ZIP/TAR/GZ/BZ2 archive, parse, correct, and import XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models.

Tested in MRI 1.8.7+, MRI 1.9.2+, and JRuby 1.6.7+. Thread safe.

## Real-world usage

We use `data_miner` for [data science at Brighter Planet](http://brighterplanet.com/research) and in production at:

* [Brighter Planet's reference data web service](http://data.brighterplanet.com)
* [Brighter Planet's impact estimate web service](http://impact.brighterplanet.com)

The killer combination for us is:

1. [`active_record_inline_schema`](https://github.com/seamusabshere/active_record_inline_schema) - define table structure
2. [`remote_table`](https://github.com/seamusabshere/remote_table) - download data and parse it
3. [`errata`](https://github.com/seamusabshere/errata) - apply corrections in a transparent way
4. [`data_miner`](https://github.com/seamusabshere/data_miner) (this library!) - import data idempotently

## Documentation

Check out the [extensive documentation](http://rdoc.info/github/seamusabshere/data_miner).

## Quick start

You define `data_miner` blocks in your ActiveRecord models. For example, in `app/models/country.rb`:

    class Country < ActiveRecord::Base
      self.primary_key = 'iso_3166_code'

      data_miner do
        # The file has no header row, so columns are addressed by field number.
        import("OpenGeoCode.org's Country Codes to Country Names list",
               :url => 'http://opengeocode.org/download/countrynames.txt',
               :format => :delimited,
               :delimiter => '; ',
               :headers => false,
               :skip => 22) do
          # `key` identifies the record; `store` fills in its other attributes.
          key   :iso_3166_code, :field_number => 0
          store :iso_3166_alpha_3_code, :field_number => 1
          store :iso_3166_numeric_code, :field_number => 2
          store :name, :field_number => 5
        end
      end
    end

Now you can run:

    >> Country.run_data_miner!
    => nil

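
To make "import data idempotently" concrete, here is a conceptual sketch in plain Ruby. It is not how `data_miner` is implemented: `import_rows`, `KEY_FIELD`, and `STORED_FIELDS` are hypothetical names, the hash stands in for a database table, and the sample rows are made up rather than taken from the real file. The point is only that records are looked up by key, so a second run updates them instead of duplicating them.

```ruby
# Conceptual sketch only: plain Ruby, not data_miner's actual implementation.
# All names below are hypothetical and exist purely for illustration.

# 0-based field numbers, mirroring the `key` and `store` lines in the model.
KEY_FIELD = 0
STORED_FIELDS = {
  :iso_3166_alpha_3_code => 1,
  :iso_3166_numeric_code => 2,
  :name                  => 5
}

# Imports '; '-delimited lines into `table`, a Hash standing in for a DB table.
def import_rows(lines, table)
  lines.each do |line|
    fields = line.chomp.split('; ')
    key = fields[KEY_FIELD]
    # Find the existing record by key, or create it, then overwrite attributes.
    record = (table[key] ||= {})
    STORED_FIELDS.each { |attribute, index| record[attribute] = fields[index] }
  end
  table
end

# Made-up sample rows with six fields each (not the real file layout).
sample = [
  "US; USA; 840; US; United States; United States of America",
  "FR; FRA; 250; FR; France; French Republic"
]

countries = {}
import_rows(sample, countries)
import_rows(sample, countries) # running again leaves two records, not four
```

Keying on a natural identifier such as the ISO code is what makes re-running safe in this sketch: the second pass finds the same two records and rewrites the same values.
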
## More advanced usage
The [`earth` library](https://github.com/brighterplanet/earth) has dozens of real-life examples showing how to download, pull out of a ZIP/TAR/BZ2 archive, parse, correct, and import CSVs, fixed-width files, ODS, XLS, XLSX, even HTML and XML:

| Model | Highlights | Reference |
|---|---|---|
| Aircraft | parsing Microsoft FrontPage HTML (!) | `data_miner.rb` |
| Airports | forcing column names and use of `:select` block (`Proc`) | `data_miner.rb` |
| Automobile model variants | super advanced usage of "custom parser" and errata | `data_miner.rb` |
| Country | parsing CSV and a few other tricks | `data_miner.rb` |
| EGRID regions | parsing XLS | `data_miner.rb` |
| Flight segment (stage) | super advanced usage of POSTing form data | `data_miner.rb` |
| Zip codes | downloading a ZIP file and pulling an XLSX out of it | `data_miner.rb` |
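
As a rough illustration of the Airports row, "forcing column names" and filtering with a `Proc` can be pictured in plain Ruby as follows. This is a sketch under stated assumptions, not the code in `earth` or the `remote_table` API: `rows_with_headers` is a hypothetical helper, and the column names and sample rows are invented.

```ruby
# Conceptual sketch in plain Ruby; `rows_with_headers` is a hypothetical helper,
# not part of remote_table or data_miner.

# Attaches the given header names to headerless rows, then keeps only the rows
# for which the optional `select` Proc returns true.
def rows_with_headers(raw_rows, headers, select = nil)
  rows = raw_rows.map { |fields| Hash[headers.zip(fields)] }
  select ? rows.select(&select) : rows
end

# Invented sample data: the second row has no code and should be dropped.
raw = [
  ["AAA", "Example Field", "US"],
  ["",    "Unnamed Strip", "US"]
]

keep_coded = proc { |row| !row["iata_code"].empty? }
airports = rows_with_headers(raw, %w[iata_code name country], keep_coded)
```

Passing the filter as a `Proc` keeps the selection rule next to the import definition, so it is easy to see why a given source row was or was not imported.
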