data_miner

Download, unpack from a ZIP/TAR/GZ/BZ2 archive, parse, correct, convert units, and import Google Spreadsheets, XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models. Uses the remote_table gem internally.

MIT License

Tested on MRI 1.8.7+, MRI 1.9.2+, and JRuby 1.6.7+. Thread safe.

Real-world usage

We use data_miner for data science at Brighter Planet and in production at

The killer combination for us is:

  1. active_record_inline_schema - define table structure
  2. remote_table - download data and parse it
  3. errata - apply corrections in a transparent way
  4. data_miner (this library!) - import data idempotently

Documentation

Check out the extensive documentation.

Quick start

You define data_miner blocks in your ActiveRecord models. For example, in app/models/country.rb:

class Country < ActiveRecord::Base
  self.primary_key = 'iso_3166_code'

  # the "col" class method is provided by a different library - active_record_inline_schema
  col :iso_3166_code                            # alpha-2 2-letter like GB
  col :iso_3166_numeric_code, :type => :integer # numeric like 826; aka UN M49 code
  col :iso_3166_alpha_3_code                    # 3-letter like GBR
  col :name

  data_miner do
    # auto_upgrade! is provided by active_record_inline_schema
    process :auto_upgrade!

    import("OpenGeoCode.org's Country Codes to Country Names list",
           :url => 'http://opengeocode.org/download/countrynames.txt',
           :format => :delimited,
           :delimiter => '; ',
           :headers => false,
           :skip => 22) do
      key   :iso_3166_code, :field_number => 0
      store :iso_3166_alpha_3_code, :field_number => 1
      store :iso_3166_numeric_code, :field_number => 2
      store :name, :field_number => 5
    end
  end
end

Now you can run:

>> Country.run_data_miner!
=> nil
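
To make the key/store lines and the idea of an idempotent import more concrete, here is a minimal plain-Ruby sketch. It is not data_miner's implementation and uses neither ActiveRecord nor remote_table: SAMPLE_ROWS is made-up data shaped like a '; '-delimited, headerless file, a Hash stands in for the database table, and KEY_FIELD, STORE_FIELDS and import_rows are hypothetical names chosen for illustration.

```ruby
# Illustrative sketch only; this is NOT how data_miner is implemented.
# SAMPLE_ROWS is made-up data: six '; '-delimited fields per line, no header.
# Fields 3 and 4 are placeholders that the mapping below ignores.
SAMPLE_ROWS = [
  "GB; GBR; 826; x; y; United Kingdom",
  "FR; FRA; 250; x; y; France"
].freeze

# Hypothetical mappings mirroring the key/store lines: attribute => field number.
KEY_FIELD = { :iso_3166_code => 0 }
STORE_FIELDS = {
  :iso_3166_alpha_3_code => 1,
  :iso_3166_numeric_code => 2,
  :name                  => 5
}

# Upsert each row into `table` (a Hash standing in for the database table).
# Records are looked up by the key field, so re-running updates existing
# records instead of creating duplicates. Values stay strings in this sketch.
def import_rows(lines, table)
  key_name, key_index = KEY_FIELD.first
  lines.each do |line|
    fields = line.split('; ')
    key    = fields[key_index]
    record = (table[key] ||= { key_name => key })
    STORE_FIELDS.each do |attribute, index|
      record[attribute] = fields[index]
    end
  end
  table
end

countries = {}
import_rows(SAMPLE_ROWS, countries)
first_pass = Marshal.load(Marshal.dump(countries)) # deep copy for comparison
import_rows(SAMPLE_ROWS, countries)                # second run changes nothing
```

Keying every record on the primary key is what makes a second run harmless in this sketch: the same rows land on the same records, so the table after the second pass equals the table after the first.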

More advanced usage

The earth library has dozens of real-life examples showing how to download, pull out of a ZIP/TAR/BZ2 archive, parse, correct, and import CSVs, fixed-width files, ODS, XLS, XLSX, even HTML and XML:

And many more: look for the data_miner.rb file that corresponds to each model. Note that you would normally put the data_miner declaration right inside the ActiveRecord model file; earth keeps it separate so that loading it is optional.

Authors

Wishlist

  • Make the tests real unit tests
  • sql steps shouldn't shell out if binaries are missing

Copyright

Copyright (c) 2013 Seamus Abshere