README.markdown in remote_table-1.4.0 vs README.markdown in remote_table-2.0.0

- old
+ new

@@ -1,27 +1,40 @@ # remote_table -Open local or remote XLSX, XLS, ODS, CSV and fixed-width files. +Open Google Docs spreadsheets, local or remote XLSX, XLS, ODS, CSV (comma separated), TSV (tab separated), other delimited, fixed-width files. -## Production usage +Tested on MRI 1.8, MRI 1.9, and JRuby 1.6.7+. Thread-safe. -Used by [the Brighter Planet Reference Data web service](http://data.brighterplanet.com), the [`data_miner` gem](https://github.com/seamusabshere/data_miner), and the [`earth` gem](https://github.com/brighterplanet/earth). +## Real-world usage +<p><a href="http://brighterplanet.com"><img src="https://s3.amazonaws.com/static.brighterplanet.com/assets/logos/flush-left/inline/green/rasterized/brighter_planet-160-transparent.png" alt="Brighter Planet logo"/></a></p> + +We use `remote_table` for [data science at Brighter Planet](http://brighterplanet.com/research) and in production at + +* [Brighter Planet's impact estimate web service](http://impact.brighterplanet.com) +* [Brighter Planet's reference data web service](http://data.brighterplanet.com) + +It's also a big part of + +* the [`data_miner`](https://github.com/seamusabshere/data_miner) library +* the [`earth`](https://github.com/brighterplanet/earth) library + ## Example - $ irb - 1.9.3-p0 :001 > require 'remote_table' + >> require 'remote_table' + remote_table.rb:8:in `<top (required)>': iconv will be deprecated in the future, use String#encode instead. + [remote_table] Apologies - using iconv because Ruby 1.9.x's String#encode doesn't have transliteration tables (yet) => true - 1.9.3-p0 :002 > t = RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/98guide6.zip', :filename => '98guide6.csv' - => #<RemoteTable:0x00000100851d98 @options={:filename=>"98guide6.csv"}, @url="http://www.fueleconomy.gov/FEG/epadata/98guide6.zip"> - 1.9.3-p0 :003 > t.rows.length + >> t = RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/98guide6.zip', :filename => '98guide6.csv' + => #<RemoteTable:0x00000101b87390 @download_count_mutex=#<Mutex:0x00000101b87228>, @iconv_mutex=#<Mutex:0x00000101b87200>, @extend_bang_mutex=#<Mutex:0x00000101b871d8>, @errata_mutex=#<Mutex:0x00000101b871b0>, @cache=[], @download_count=0, @url="http://www.fueleconomy.gov/FEG/epadata/98guide6.zip", @format=nil, @headers=:first_row, @compression=:zip, @packing=nil, @streaming=false, @warn_on_multiple_downloads=true, @delimiter=",", @sheet=nil, @keep_blank_rows=false, @form_data=nil, @skip=0, @internal_encoding="UTF-8", @row_xpath=nil, @column_xpath=nil, @row_css=nil, @column_css=nil, @glob=nil, @filename="98guide6.csv", @transform_settings=nil, @cut=nil, @crop=nil, @schema=nil, @schema_name=nil, @pre_select=nil, @pre_reject=nil, @errata_settings=nil, @other_options={}, @transformer=#<RemoteTable::Transformer:0x00000101b8c2f0 @t=#<RemoteTable:0x00000101b87390 ...>, @legacy_transformer_mutex=#<Mutex:0x00000101b8c2a0>>, @local_copy=#<RemoteTable::LocalCopy:0x00000101b8bf58 @t=#<RemoteTable:0x00000101b87390 ...>, @encoded_io_mutex=#<Mutex:0x00000101b8be18>, @generate_mutex=#<Mutex:0x00000101b8bdc8>>> + >> t.rows.length => 806 - 1.9.3-p0 :004 > t.rows.first.length + >> t.rows.first.length => 26 - 1.9.3-p0 :005 > require 'pp' + >> require 'pp' => true - 1.9.3-p0 :006 > pp t[23] + >> pp t[23] {"Class"=>"TWO SEATERS", "Manufacturer"=>"PORSCHE", "carline name"=>"BOXSTER", "displ"=>"2.5", "cyl"=>"6", @@ -45,23 +58,35 @@ "hlv"=>"", "fcost"=>"956", "eng dscr"=>"", "trans dscr"=>""} -You get an <code>Array</code> of <code>Hash</code>es with **string keys**. If you set <code>:headers => false</code>, then you get an <code>Array</code> of <code>Array</code>s. +## Columns and rows +* If there are headers, you get an <code>Array</code> of <code>Hash</code>es with **string keys**. +* If you set <code>:headers => false</code>, then you get an <code>Array</code> of <code>Array</code>s. + +## Row keys + +Row keys are **strings**. Row keys are NOT symbolized. + + row['foobar'] # correct + row[:foobar] # incorrect + +You can call <code>symbolize_keys</code> yourself, but we don't do it automatically to avoid creating loads of garbage symbols. + ## Supported formats <table> <tr> <th>Format</th> <th>Notes</th> <th>Library</th> </tr> <tr> <td>Delimited (CSV, TSV, etc.)</td> - <td>All <code>RemoteTable::Format::Delimited::FASTERCSV_OPTIONS</code>, for example <code>:col_sep</code>, are passed directly to fastercsv.</td> + <td>All <code>RemoteTable::Delimited::PASSTHROUGH_CSV_SETTINGS</code>, for example <code>:col_sep</code>, are passed directly to fastercsv.</td> <td> <a href="http://fastercsv.rubyforge.org/">fastercsv</a> (1.8); <a href="http://www.ruby-doc.org/stdlib-1.9.3/libdoc/csv/rdoc/index.html">stdlib</code></a> (1.9) </td> </tr> @@ -145,79 +170,79 @@ RemoteTable.new("http://www.faa.gov/air_traffic/publications/atpubs/CNT/5-2-B.htm", :encoding => 'windows-1252', :row_xpath => '//table/tr[2]/td/table/tr', :column_xpath => 'td', :errata => { RemoteTable.new('https://spreadsheets.google.com/spreadsheet/pub?key=0AoQJbWqPrREqdGVBRnhkRGhSaVptSDJ5bXJGbkpUSWc&output=csv', :responder => Aircraft::Guru.new }, - :select => lambda { |record| manufacturer_whitelist? record['Manufacturer'] }) + :select => proc { |record| manufacturer_whitelist? record['Manufacturer'] }) # OpenFlights.org airports database RemoteTable.new('https://openflights.svn.sourceforge.net/svnroot/openflights/openflights/data/airports.dat', :headers => %w{ id name city country_name iata_code icao_code latitude longitude altitude timezone daylight_savings }, - :select => lambda { |record| record['iata_code'].present? }, + :select => proc { |record| record['iata_code'].present? }, :errata => { RemoteTable.new('https://spreadsheets.google.com/pub?key=0AoQJbWqPrREqdFc2UzhQYU5PWEQ0N21yWFZGNmc2a3c&gid=0&output=csv', :responder => Airport::Guru.new }) # see https://github.com/brighterplanet/earth/blob/master/lib/earth/air/aircraft/data_miner.rb # T100 flight segment data for #{month.strftime('%B %Y')} # for definition of `form_data` and `FlightSegment::Guru` see https://github.com/brighterplanet/earth/blob/master/lib/earth/air/flight_segment/data_miner.rb RemoteTable.new('http://www.transtats.bts.gov/DownLoad_Table.asp', :form_data => form_data, :compression => :zip, :glob => '/*.csv', :errata => { RemoteTable.new('https://spreadsheets.google.com/spreadsheet/pub?key=0AoQJbWqPrREqdGxpYU1qWFR3d0syTVMyQVVOaDd0V3c&output=csv', :responder => FlightSegment::Guru.new }, - :select => lambda { |record| record['DEPARTURES_PERFORMED'].to_i > 0 }) + :select => proc { |record| record['DEPARTURES_PERFORMED'].to_i > 0 }) # 1995 Fuel Economy Guide # for definition of `:fuel_economy_guide_b` and `AutomobileMakeModelYearVariant::ParserB` see https://github.com/brighterplanet/earth/blob/master/lib/earth/automobile/automobile_make_model_year_variant/data_miner.rb RemoteTable.new("http://www.fueleconomy.gov/FEG/epadata/95mfgui.zip", :filename => '95MFGUI.DAT', :format => :fixed_width, :cut => '13-', :schema_name => :fuel_economy_guide_b, - :select => lambda { |row| row['model'].present? and (row['suppress_code'].blank? or row['suppress_code'].to_f == 0) and row['state_code'] == 'F' }, + :select => proc { |row| row['model'].present? and (row['suppress_code'].blank? or row['suppress_code'].to_f == 0) and row['state_code'] == 'F' }, :transform => { :class => AutomobileMakeModelYearVariant::ParserB, :year => 1995 }, :errata => { :url => "https://docs.google.com/spreadsheet/pub?key=0AoQJbWqPrREqdDkxTElWRVlvUXB3Uy04SDhSYWkzakE&output=csv", :responder => AutomobileMakeModelYearVariant::Guru.new }) # 1998 Fuel Economy Guide # for definition of `AutomobileMakeModelYearVariant::ParserC` see https://github.com/brighterplanet/earth/blob/master/lib/earth/automobile/automobile_make_model_year_variant/data_miner.rb RemoteTable.new('http://www.fueleconomy.gov/FEG/epadata/98guide6.zip', :filename => '98guide6.csv', :transform => { :class => AutomobileMakeModelYearVariant::ParserC, :year => 1998 }, :errata => { :url => "https://docs.google.com/spreadsheet/pub?key=0AoQJbWqPrREqdDkxTElWRVlvUXB3Uy04SDhSYWkzakE&output=csv", :responder => AutomobileMakeModelYearVariant::Guru.new }, - :select => lambda { |row| row['model'].present? }) + :select => proc { |row| row['model'].present? }) # annual corporate average fuel economy data for domestic and imported vehicle fleets from the NHTSA RemoteTable.new('https://spreadsheets.google.com/pub?key=0AoQJbWqPrREqdEdXWXB6dkVLWkowLXhYSFVUT01sS2c&hl=en&gid=0&output=csv', :errata => { 'url' => 'http://static.brighterplanet.com/science/data/transport/automobiles/make_fleet_years/errata.csv' }, - :select => lambda { |row| row['volume'].to_i > 0 }) + :select => proc { |row| row['volume'].to_i > 0 }) # total vehicle miles travelled by gasoline passenger cars from the 2010 EPA GHG Inventory RemoteTable.new('http://www.epa.gov/climatechange/emissions/downloads10/2010-Inventory-Annex-Tables.zip', :filename => 'Annex Tables/Annex 3/Table A-87.csv', :skip => 1, - :select => lambda { |row| row['Year'].to_i.to_s == row['Year'] }) + :select => proc { |row| row['Year'].to_i.to_s == row['Year'] }) # total vehicle miles travelled from the 2010 EPA GHG Inventory RemoteTable.new('http://www.epa.gov/climatechange/emissions/downloads10/2010-Inventory-Annex-Tables.zip', :filename => 'Annex Tables/Annex 3/Table A-87.csv', :skip => 1, - :select => lambda { |row| row['Year'].to_i.to_s == row['Year'] }) + :select => proc { |row| row['Year'].to_i.to_s == row['Year'] }) # total travel distribution from the 2010 EPA GHG Inventory RemoteTable.new('http://www.epa.gov/climatechange/emissions/downloads10/2010-Inventory-Annex-Tables.zip', :filename => 'Annex Tables/Annex 3/Table A-93.csv', :skip => 1, - :select => lambda { |row| row['Vehicle Age'].to_i.to_s == row['Vehicle Age'] }) + :select => proc { |row| row['Vehicle Age'].to_i.to_s == row['Vehicle Age'] }) # building characteristics from the 2003 EIA Commercial Building Energy Consumption Survey RemoteTable.new('http://www.eia.gov/emeu/cbecs/cbecs2003/public_use_2003/data/FILE02.csv', :skip => 1, :headers => ["PUBID8","REGION8","CENDIV8","SQFT8","SQFTC8","YRCONC8","PBA8","ELUSED8","NGUSED8","FKUSED8","PRUSED8","STUSED8","HWUSED8","ONEACT8","ACT18","ACT28","ACT38","ACT1PCT8","ACT2PCT8","ACT3PCT8","PBAPLUS8","VACANT8","RWSEAT8","PBSEAT8","EDSEAT8","FDSEAT8","HCBED8","NRSBED8","LODGRM8","FACIL8","FEDFAC8","FACACT8","MANIND8","PLANT8","FACDST8","FACDHW8","FACDCW8","FACELC8","BLDPLT8","ADJWT8","STRATUM8","PAIR8"]) # 2003 CBECS C17 - Electricity Consumption and Intensity - New England Division # for definition of `CbecsEnergyIntensity::NAICS_CODE_SYNTHESIZER` see https://github.com/brighterplanet/earth/blob/master/lib/earth/industry/cbecs_energy_intensity/data_miner.rb RemoteTable.new("http://www.eia.gov/emeu/cbecs/cbecs2003/detailed_tables_2003/2003set10/2003excel/C17.xls", :headers => false, - :select => ::Proc.new { |row| CbecsEnergyIntensity::NAICS_CODE_SYNTHESIZER.call(row) }, + :select => proc { |row| CbecsEnergyIntensity::NAICS_CODE_SYNTHESIZER.call(row) }, :crop => (21..37)) # U.S. Census 2002 NAICS code list RemoteTable.new('http://www.census.gov/epcd/naics02/naicod02.txt', :skip => 4, @@ -236,17 +261,17 @@ # U.S. Census Geographic Terms and Definitions RemoteTable.new('http://www.census.gov/popest/about/geo/state_geocodes_v2009.txt', :skip => 6, :headers => %w{ Region Division FIPS Name }, - :select => ::Proc.new { |row| row['Division'].to_i > 0 and row['FIPS'].to_i == 0 }) + :select => proc { |row| row['Division'].to_i > 0 and row['FIPS'].to_i == 0 }) # state census divisions from the U.S. Census RemoteTable.new('http://www.census.gov/popest/about/geo/state_geocodes_v2009.txt', :skip => 8, :headers => ['Region', 'Division', 'State FIPS', 'Name'], - :select => ::Proc.new { |row| row['State FIPS'].to_i > 0 }) + :select => proc { |row| row['State FIPS'].to_i > 0 }) # OpenGeoCode.org's Country Codes to Country Names list RemoteTable.new('http://opengeocode.org/download/countrynames.txt', :format => :delimited, :delimiter => ';', @@ -265,23 +290,23 @@ # eGRID 2010 regions and loss factors RemoteTable.new('http://www.epa.gov/cleanenergy/documents/egridzips/eGRID2010V1_1_STIE_USGC.xls', :sheet => 'STIE07', :skip => 4, - :select => lambda { |row| row['eGRID2010 year 2007 file state sequence number'].to_i.between?(1, 51) }) + :select => proc { |row| row['eGRID2010 year 2007 file state sequence number'].to_i.between?(1, 51) }) # eGRID 2010 subregions and electricity emission factors RemoteTable.new('http://www.epa.gov/cleanenergy/documents/egridzips/eGRID2010_Version1-1_xls_only.zip', :filename => 'eGRID2010V1_1_year07_AGGREGATION.xls', :sheet => 'SRL07', :skip => 4, - :select => lambda { |row| row['SEQSRL07'].to_i.between?(1, 26) }) + :select => proc { |row| row['SEQSRL07'].to_i.between?(1, 26) }) # U.S. Census State ANSI Code file RemoteTable.new('http://www.census.gov/geo/www/ansi/state.txt', :delimiter => '|', - :select => lambda { |record| record['STATE'].to_i < 60 }) + :select => proc { |record| record['STATE'].to_i < 60 }) # Mapping Hacks zipcode database RemoteTable.new('http://mappinghacks.com/data/zipcode.zip', :filename => 'zipcode.csv') @@ -293,22 +318,22 @@ RemoteTable.new('http://www.freebase.com/type/exporttypeinstances/base/horses/horse_breed?page=0&filter_mode=type&filter_view=table&show%01p%3D%2Ftype%2Fobject%2Fname%01index=0&show%01p%3D%2Fcommon%2Ftopic%2Fimage%01index=1&show%01p%3D%2Fcommon%2Ftopic%2Farticle%01index=2&sort%01p%3D%2Ftype%2Fobject%2Ftype%01p%3Dlink%01p%3D%2Ftype%2Flink%2Ftimestamp%01index=false&=&exporttype=csv-8') # Brighter Planet's list of cat and dog breeds, genders, and weights RemoteTable.new('http://static.brighterplanet.com/science/data/consumables/pets/breed_genders.csv', :encoding => 'ISO-8859-1', - :select => lambda { |row| row['gender'].present? }) + :select => proc { |row| row['gender'].present? }) # residential electricity prices from the EIA RemoteTable.new('http://www.eia.doe.gov/cneaf/electricity/page/sales_revenue.xls', - :select => lambda { |row| row['Year'].to_s.first(4).to_i > 1989 }) + :select => proc { |row| row['Year'].to_s.first(4).to_i > 1989 }) # residential natural gas prices from the EIA # for definition of `NaturalGasParser` see https://github.com/brighterplanet/earth/blob/master/lib/earth/residence/residence_fuel_price/data_miner.rb RemoteTable.new('http://tonto.eia.doe.gov/dnav/ng/xls/ng_pri_sum_a_EPG0_FWA_DMcf_a.xls', :sheet => 'Data 1', :skip => 2, - :select => lambda { |row| row['year'].to_i > 1989 }, + :select => proc { |row| row['year'].to_i > 1989 }, :transform => { :class => NaturalGasParser }) # 2005 EIA Residential Energy Consumption Survey microdata RemoteTable.new('http://www.eia.doe.gov/emeu/recs/recspubuse05/datafiles/RECS05alldata.csv', :headers => :upcase) @@ -373,11 +398,11 @@ RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/00data.zip', :filename => 'Gd6-dsc.txt', :format => :fixed_width, :crop => 21..26, # inclusive :cut => '2-', - :select => lambda { |row| /\A[A-Z]/.match row['code'] }, + :select => proc { |row| /\A[A-Z]/.match row['code'] }, :schema => [[ 'code', 2, { :type => :string } ], [ 'spacer', 2 ], [ 'name', 52, { :type => :string } ]] RemoteTable.new 'http://cloud.github.com/downloads/seamusabshere/remote_table/test2.fixed_width.txt', @@ -418,17 +443,14 @@ [ 'spacer', 1 ], [ 'header1', 10, { :type => :string } ]] ## Requirements -* MRI (not JRuby) -* Unix tools like curl, iconv, perl, cat, cut, tail, etc. accessible from `ENV['PATH']` +* Unix tools like curl, iconv, perl, cat, cut, tail, etc. accessible from your `$PATH` -As this library matures, that requirement should go away. - ## Wishlist -* JRuby and Win32 compat +* Win32 compat * The new "custom parser" syntax (aka transformer) hasn't been defined yet... only the old-style syntax is available ## Authors * Seamus Abshere <seamus@abshere.net>