README.md in upsert-0.1.0 vs README.md in upsert-0.1.1

- old
+ new

@@ -1,62 +1,92 @@ # Upsert -Finally, all those SQL MERGE tricks codified. +Finally, all those SQL MERGE tricks codified so that you can do "upsert" on MySQL, PostgreSQL, and Sqlite. ## Usage +Let's say you have... + + class Pet < ActiveRecord::Base + # col :name + # col :breed + end + ### One at a time upsert = Upsert.new Pet.connection, Pet.table_name - upsert.row({:name => 'Jerry'}, :breed => 'beagle') - upsert.row({:name => 'Pierre'}, :breed => 'tabby') + selector = {:name => 'Jerry'} + document = {:breed => 'beagle'} + upsert.row selector, document -### Multiple upserts at once +### Multiple upserts bundled together for speed - Upsert.new(Pet.connection, Pet.table_name).multi do |upsert| + Upsert.stream(Pet.connection, Pet.table_name) do |upsert| + # [...] upsert.row({:name => 'Jerry'}, :breed => 'beagle') + # [...] upsert.row({:name => 'Pierre'}, :breed => 'tabby') + # [...] end -## Wishlist +Rows are buffered in memory until it's efficient to send them to the database. -1. Make `c=c+1` stuff possible with `Upsert.sql('c=c+1')` or something +## Real-world usage -## Speed +<p><a href="http://brighterplanet.com"><img src="https://s3.amazonaws.com/static.brighterplanet.com/assets/logos/flush-left/inline/green/rasterized/brighter_planet-160-transparent.png" alt="Brighter Planet logo"/></a></p> +We use `upsert` for [big data processing at Brighter Planet](http://brighterplanet.com/research) and in production at + +* [Brighter Planet's impact estimate web service](http://impact.brighterplanet.com) +* [Brighter Planet's reference data web service](http://data.brighterplanet.com) + +Originally written to speed up the [`data_miner`](https://github.com/seamusabshere/data_miner) data mining library. + +## Supported databases + ### MySQL -(from the tests) +Using the [mysql2](https://rubygems.org/gems/mysql2) driver. - Upsert was 47% faster than faking upserts with activerecord-import + Upsert.new Mysql2::Connection.new([...]), :pets + +#### Speed + +From the tests: + Upsert was 77% faster than find + new/set/save Upsert was 84% faster than create + rescue/find/update Upsert was 82% faster than find_or_create + update_attributes + Upsert was 47% faster than faking upserts with activerecord-import -### PostgreSQL +#### SQL MERGE trick - Upsert was 73% faster than find + new/set/save - Upsert was 84% faster than find_or_create + update_attributes - Upsert was 87% faster than create + rescue/find/update +"ON DUPLICATE KEY UPDATE" where we just set everything to the value of the insert. -## Supported database drivers + # http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html + INSERT INTO table (a,b,c) VALUES (1,2,3), (4,5,6) + ON DUPLICATE KEY UPDATE a=VALUES(a),b=VALUES(b),c=VALUES(c); -1. [mysql2](https://rubygems.org/gems/mysql2) (e.g. `Upsert.new(Mysql2::Connection.new([...]), :pets)`) -2. [sqlite3](https://rubygems.org/gems/sqlite3) -3. [pg](https://rubygems.org/gems/pg) -4. Any of these wrapped in an ActiveRecord connection adapter (e.g. `Upsert.new(Pet.connection, Pet.table_name)`) +Since this is an upsert helper library, not a general-use ON DUPLICATE KEY UPDATE wrapper, you **can't** do things like `c=c+1`. -## SQL merge tricks in use +### PostgreSQL -### MySQL +Using the [pg](https://rubygems.org/gems/pg) driver. - # http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html - INSERT INTO table (a,b,c) VALUES (1,2,3) - ON DUPLICATE KEY UPDATE c=c+1; + Upsert.new PG.connect([...]), :pets -### PostgreSQL +#### Speed +From the tests: + + Upsert was 73% faster than find + new/set/save + Upsert was 84% faster than find_or_create + update_attributes + Upsert was 87% faster than create + rescue/find/update + # (can't compare to activerecord-import because you can't fake it on pg) + +#### SQL MERGE trick + # http://www.postgresql.org/docs/current/interactive/plpgsql-control-structures.html#PLPGSQL-ERROR-TRAPPING CREATE TABLE db (a INT PRIMARY KEY, b TEXT); CREATE FUNCTION merge_db(key INT, data TEXT) RETURNS VOID AS $$ BEGIN @@ -80,26 +110,20 @@ $$ LANGUAGE plpgsql; SELECT merge_db(1, 'david'); SELECT merge_db(1, 'dennis'); -### Sqlite +The decision was made **not** to use the following because it's not straight from the manual: - # http://stackoverflow.com/questions/2717590/sqlite-upsert-on-duplicate-key-update - INSERT OR IGNORE INTO visits VALUES ($ip, 0); - UPDATE visits SET hits = hits + 1 WHERE ip LIKE $ip; - -### Unused alternatives - -#### PostgreSQL - # http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql UPDATE table SET field='C', field2='Z' WHERE id=3; INSERT INTO table (id, field, field2) SELECT 3, 'C', 'Z' WHERE NOT EXISTS (SELECT 1 FROM table WHERE id=3); +This was also rejected because there's something we can use in the manual: + # http://stackoverflow.com/questions/5269590/why-doesnt-this-rule-prevent-duplicate-key-violations BEGIN; CREATE TEMP TABLE stage_data(key_column, data_columns...) ON COMMIT DROP; \copy stage_data from data.csv with csv header -- prevent any other updates while we are merging input (omit this if you don't need it) @@ -108,6 +132,79 @@ INSERT INTO target_data(key_column, data_columns...) SELECT key_column, data_columns... FROM stage_data WHERE NOT EXISTS (SELECT 1 FROM target_data WHERE target_data.key_column = stage_data.key_column) - END; \ No newline at end of file + END; + +### Sqlite + +Using the [sqlite3](https://rubygems.org/gems/sqlite3) driver. + + Upsert.new SQLite3::Database.open([...]), :pets + +#### Speed + +FIXME tests are segfaulting. Pull request would be lovely. + +#### SQL MERGE trick + + # http://stackoverflow.com/questions/2717590/sqlite-upsert-on-duplicate-key-update + # bad example because we're not doing on-duplicate-key update + INSERT OR IGNORE INTO visits VALUES (127.0.0.1, 1); + UPDATE visits SET visits = 1 WHERE ip LIKE 127.0.0.1; + +### Rails / ActiveRecord + +(assuming that one of the other three supported drivers is being used under the covers) + + Upsert.new Pet.connection, Pet.table_name + +#### Speed + +Depends on the driver being used! + +#### SQL MERGE trick + +Depends on the driver being used! + +## Features + +### Tested to be fast and portable + +In addition to correctness, the library's tests check that it is + +1. Faster than comparable upsert techniques +2. Compatible with supported databases + +### Not dependent on ActiveRecord + +As below, all you need is a raw database connection like a `Mysql2::Connection`, `PG::Connection` or a `SQLite3::Database`. These are equivalent: + + # with activerecord + Upsert.new ActiveRecord::Base.connection, :pets + # with activerecord, prettier + Upsert.new Pet.connection, Pet.table_name + # without activerecord + Upsert.new Mysql2::Connection.new([...]), :pets + +### For a specific use case, faster and more portable than `activerecord-import` + +You could also use [activerecord-import](https://github.com/zdennis/activerecord-import) to upsert: + + Pet.import columns, all_values, :timestamps => false, :on_duplicate_key_update => columns + +This, however, only works on MySQL and requires ActiveRecord&mdash;and if all you are doing is upserts, `upsert` is tested to be 40% faster. And you don't have to put all of the rows to be upserted into a single huge array - you can stream them using `Upsert.stream`. + +### Loosely based on mongo-ruby-driver's upsert functionality + +The `selector` and `document` arguments are inspired by the upsert functionality of the [mongo-ruby-driver's update method](http://api.mongodb.org/ruby/1.6.4/Mongo/Collection.html#update-instance_method). + +## Wishlist + +1. `Pet.upsert`... duh +2. Don't need a separate buffer class... just extend an instance of Upsert with the appropriate database driver module. + +## Copyright + +Copyright 2012 Brighter Planet, Inc. +