README.rdoc in data_miner-0.3.13 vs README.rdoc in data_miner-0.4.0
- old
+ new
@@ -9,35 +9,41 @@
config.gem 'data_miner'
You need to define <tt>data_miner</tt> blocks in your ActiveRecord models. For example, in <tt>app/models/country.rb</tt>:
class Country < ActiveRecord::Base
- data_miner do |step|
- # import country names and country codes
- step.import :url => 'http://www.cs.princeton.edu/introcs/data/iso3166.csv' do |attr|
- attr.key :iso_3166, :field_name => 'country code'
- attr.store :iso_3166, :field_name => 'country code'
- attr.store :name, :field_name => 'country'
+ set_primary_key :iso_3166
+
+ data_miner do
+ import 'The official ISO country list', :url => 'http://www.iso.org/iso/list-en1-semic-3.txt', :skip => 2, :headers => false, :delimiter => ';' do
+ key 'iso_3166'
+ store 'iso_3166', :field_number => 1
+ store 'name', :field_number => 0
end
+
+ import 'A Princeton dataset with better capitalization for some countries', :url => 'http://www.cs.princeton.edu/introcs/data/iso3166.csv' do
+ key 'iso_3166'
+ store 'iso_3166', :field_name => 'country code'
+ store 'name', :field_name => 'country'
+ end
end
end
...and in <tt>app/models/airport.rb</tt>:
class Airport < ActiveRecord::Base
- belongs_to :country
+ set_primary_key :iata_code
- data_miner do |step|
- # import airport iata_code, name, etc.
- step.import(:url => 'http://openflights.svn.sourceforge.net/viewvc/openflights/openflights/data/airports.dat', :headers => false) do |attr|
- attr.key :iata_code, :field_number => 3
- attr.store :name, :field_number => 0
- attr.store :city, :field_number => 1
- attr.store :country, :field_number => 2, :foreign_key => :name # will use Country.find_by_name(X)
- attr.store :iata_code, :field_number => 3
- attr.store :latitude, :field_number => 5
- attr.store :longitude, :field_number => 6
+ data_miner do
+ import :url => 'http://openflights.svn.sourceforge.net/viewvc/openflights/openflights/data/airports.dat', :headers => false, :select => lambda { |row| row[4].present? } do
+ key 'iata_code'
+ store 'name', :field_number => 1
+ store 'city', :field_number => 2
+ store 'country_name', :field_number => 3
+ store 'iata_code', :field_number => 4
+ store 'latitude', :field_number => 6
+ store 'longitude', :field_number => 7
end
end
end
Put this in <tt>lib/tasks/data_miner_tasks.rake</tt> (unfortunately, I don't know a way to include gem tasks automatically, so you have to do this manually for now):
@@ -46,47 +52,39 @@
task :run => :environment do
DataMiner.run :resource_names => ENV['RESOURCES'].to_s.split(/\s*,\s*/).flatten.compact
end
end
-You need to specify what order to mine data. For example, in <tt>config/initializers/data_miner_config.rb</tt>:
-
- DataMiner.enqueue do |queue|
- queue << Country # class whose data should be mined 1st
- queue << Airport # class whose data should be mined 2nd
- # etc
- end
-
Once you have defined <tt>data_miner</tt> blocks in your classes, you can:
- $ rake data_miner:run
+ $ rake data_miner:run RESOURCES=Airport,Country
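The rake task above turns the <tt>RESOURCES</tt> environment variable into a list of resource names by splitting on commas and dropping the whitespace around them. The sketch below isolates that one expression in plain Ruby so its behavior is easy to check. The helper name <tt>parse_resource_names</tt> is hypothetical and is not part of data_miner.

```ruby
# Hypothetical helper (not part of data_miner) that wraps the same
# expression the rake task passes to DataMiner.run as :resource_names.
def parse_resource_names(raw)
  # to_s turns a missing variable (nil) into "", which splits to [].
  # The regex consumes optional whitespace on both sides of each comma.
  raw.to_s.split(/\s*,\s*/).flatten.compact
end

p parse_resource_names('Airport,Country')   # ["Airport", "Country"]
p parse_resource_names('Airport , Country') # ["Airport", "Country"]
p parse_resource_names(nil)                 # []
```

If <tt>RESOURCES</tt> is left unset, the task therefore passes an empty array; how <tt>DataMiner.run</tt> treats an empty list is not covered here.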
==Complete example
~ $ rails testapp
~ $ cd testapp/
- ~/testapp $ ./script/generate model Airport iata_code:string name:string city:string country_id:integer latitude:float longitude:float
+ ~/testapp $ ./script/generate model Airport iata_code:string name:string city:string country_name:string latitude:float longitude:float
+ [...edit migration to make iata_code the primary key...]
~/testapp $ ./script/generate model Country iso_3166:string name:string
+ [...edit migration to make iso_3166 the primary key...]
~/testapp $ rake db:migrate
~/testapp $ touch lib/tasks/data_miner_tasks.rake
[...edit per quick start...]
- ~/testapp $ touch config/initializers/data_miner_config.rake
- [...edit per quick start...]
- ~/testapp $ rake data_miner:run
+ ~/testapp $ rake data_miner:run RESOURCES=Airport,Country
Now you should have
~/testapp $ ./script/console
Loading development environment (Rails 2.3.3)
>> Airport.first.iata_code
=> "GKA"
- >> Airport.first.country.name
+ >> Airport.first.country_name
=> "Papua New Guinea"
==Authors
* Seamus Abshere <seamus@abshere.net>
* Andy Rossmeissl <andy@rossmeissl.net>
==Copyright
-Copyright (c) 2009 Brighter Planet. See LICENSE for details.
+Copyright (c) 2010 Brighter Planet. See LICENSE for details.