# data_miner

Download, pull out of a ZIP/TAR/GZ/BZ2 archive, parse, correct, and import XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models.

Tested in MRI 1.8.7+, MRI 1.9.2+, and JRuby 1.6.7+. Thread safe.

## Real-world usage


We use `data_miner` for [data science at Brighter Planet](http://brighterplanet.com/research) and in production at

* [Brighter Planet's reference data web service](http://data.brighterplanet.com)
* [Brighter Planet's impact estimate web service](http://impact.brighterplanet.com)

The killer combination for us is:

1. [`active_record_inline_schema`](https://github.com/seamusabshere/active_record_inline_schema) - define table structure
2. [`remote_table`](https://github.com/seamusabshere/remote_table) - download data and parse it
3. [`errata`](https://github.com/seamusabshere/errata) - apply corrections in a transparent way
4. [`data_miner`](https://github.com/seamusabshere/data_miner) (this library!) - import data idempotently

## Documentation

Check out the [extensive documentation](http://rdoc.info/github/seamusabshere/data_miner).

## Quick start

You define `data_miner` blocks in your ActiveRecord models. For example, in `app/models/country.rb`:

    class Country < ActiveRecord::Base
      self.primary_key = 'iso_3166_code'

      data_miner do
        import("OpenGeoCode.org's Country Codes to Country Names list",
               :url => 'http://opengeocode.org/download/countrynames.txt',
               :format => :delimited,
               :delimiter => '; ',
               :headers => false,
               :skip => 22) do
          key   :iso_3166_code, :field_number => 0
          store :iso_3166_alpha_3_code, :field_number => 1
          store :iso_3166_numeric_code, :field_number => 2
          store :name, :field_number => 5
        end
      end
    end

Now you can run:

    >> Country.run_data_miner!
    => nil

## More advanced usage

The [`earth` library](https://github.com/brighterplanet/earth) has dozens of real-life examples showing how to download, pull out of a ZIP/TAR/BZ2 archive, parse, correct, and import CSVs, fixed-width files, ODS, XLS, XLSX, even HTML and XML:

| Model | Highlights | Reference |
| --- | --- | --- |
| Aircraft | parsing Microsoft Frontpage HTML (!) | `data_miner.rb` |
| Airports | forcing column names and use of `:select` block (Proc) | `data_miner.rb` |
| Automobile model variants | super advanced usage of "custom parser" and errata | `data_miner.rb` |
| Country | parsing CSV and a few other tricks | `data_miner.rb` |
| EGRID regions | parsing XLS | `data_miner.rb` |
| Flight segment (stage) | super advanced usage of POSTing form data | `data_miner.rb` |
| Zip codes | downloading a ZIP file and pulling an XLSX out of it | `data_miner.rb` |

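As a conceptual aid for the `key` and `store` declarations shown in the Quick start, here is a minimal plain-Ruby sketch of mapping numbered fields onto attributes and upserting by key so that repeated runs are idempotent. It does not use `data_miner`, `remote_table`, or ActiveRecord, and it is not how the library is implemented. `FieldMapping`, `import_rows`, and the sample rows are hypothetical names and made-up data used only for illustration.

```ruby
# Plain-Ruby illustration only: this does NOT use data_miner or ActiveRecord.
# FieldMapping, MAPPING, and import_rows are hypothetical names.

# Which zero-based field of each parsed row feeds the key and each attribute.
FieldMapping = Struct.new(:key_name, :key_field, :stores)

MAPPING = FieldMapping.new(
  :iso_3166_code, 0,
  { :iso_3166_alpha_3_code => 1, :iso_3166_numeric_code => 2, :name => 5 }
)

# Upsert each delimited line into `table`, a Hash standing in for a database
# table indexed by primary key. Running it again with the same input leaves
# the table unchanged, which is what "idempotent" means here.
def import_rows(table, lines, mapping, delimiter = '; ')
  lines.each do |line|
    fields = line.chomp.split(delimiter)
    key = fields[mapping.key_field]
    next if key.nil? || key.empty? # skip blank lines

    # Find the existing record for this key, or create it.
    record = (table[key] ||= { mapping.key_name => key })
    mapping.stores.each do |attribute, field_number|
      record[attribute] = fields[field_number]
    end
  end
  table
end

# Made-up sample rows in a "; "-delimited layout (not real OpenGeoCode data).
SAMPLE_LINES = [
  "AA; AAA; 001; x; y; Exampleland\n",
  "BB; BBB; 002; x; y; Samplestan\n"
].freeze

COUNTRIES = {}
import_rows(COUNTRIES, SAMPLE_LINES, MAPPING)
import_rows(COUNTRIES, SAMPLE_LINES, MAPPING) # second run adds no duplicates
```

The sketch keys every record on one field, so a second import updates rows in place instead of appending new ones.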
There are many more examples in `earth`: look for the `data_miner.rb` file that corresponds to each model.

Note that you would normally put the `data_miner` declaration right inside the ActiveRecord model file; it's kept separate in `earth` so that loading it is optional.

## Authors

* Seamus Abshere
* Andy Rossmeissl
* Derek Kastner
* Ian Hough

## Copyright

Copyright (c) 2012 Brighter Planet. See LICENSE for details.