README.rdoc in postgresql_cursor-0.3.1 vs README.rdoc in postgresql_cursor-0.4.0
- old
+ new
@@ -1,77 +1,76 @@
-= PostgreSQLCursor
+= PostgreSQLCursor for handling large Result Sets
-PostgreSQL Cursor is an extension to the ActiveRecord PostgreSQLAdapter for very large result sets.
-It provides a cursor open/fetch/close interface to access data without loading all rows into memory,
-and instead loads the result rows in "chunks" (default of 10_000 rows), buffers them, and returns the
-rows one at a time.
+PostgreSQLCursor extends ActiveRecord to allow for efficient processing of queries
+returning a large number of rows.
-For web pages, an application would not want to process a large amount of data, usually employing a
-Pagination scheme to present it to the users. Background processes sometimes need to generate a large
-amount of data, and the ActiveRecord approach to load all data into memory is not the best fit here.
+== Why use this?
-Previous solutions employ pagination to fetch each block, then re-running the query for the next "page".
-This gem avoids re-executing the query by using PostgreSQL's cursor feature.
+ActiveRecord is designed and optimized for web performance. In a web transaction, only a "page" of
+around 20 rows is returned to the user. When you run a query like this
-The ActiveRecord methods for the cursor return instances of the model class by default, however you
-can pass in a block to override this strategy. For instance, returning Hashes instead of AR instances
-is faster. Julian's benchmarks showed returning instances was a factor of 4 slower than returning a hash.
+ Model.find(:all, :conditions=>["id>0"])
-ActiveRecord v3 has initial support in this release. It is deprecating using the conditions in the find()
-method in favor of Arel scopes and usage. I added a method to Arel for iterate over the results in a buffered
-way.
- Model.scope_methods.each_row { |hash| puts hash.inspect }
+The database returns every matching row of the result set to ActiveRecord, which instantiates each row
+with the data returned. The method then returns an array of all these rows to the caller.
-NOTE: This gem is intended to replace the 'postgresql-cursor' (with hyphen, not underscore) library.
+Asynchronous, background, or offline processing may require handling a large amount of data.
+When there is a very large number of rows, holding all of them requires a lot more memory. Ruby
+does not return that memory after processing the array, which causes your process to "bloat". If you
+don't have enough memory, your process will raise an exception.
-==Installation
- [sudo] gem install postgresql_cursor
+== Enter find_each
-This does not require Rails to work, just ActiveRecord < 3.0.0 and the 'pg' gem. Rails 3 support is forthcoming.
+To solve this problem, ActiveRecord gives us two alternative methods that work in "chunks" of your data:
-You can then configure your Rails application by adding to the config/environment.rb file
- config.gem 'postgresql_cursor'
-or require the gem on your non-rails application
- require 'rubygems'
- require 'postgresql_cursor'
+ Model.where("id>0").find_each { |model| model.process! }
-==Usage
+ Model.where("id>0").find_in_batches do |batch|
+ batch.each { |model| model.process! }
+ end
-This library is intended to be used via ActiveRecord. It provides two methods on ActiveRecord::Base
-like the find() method, that will return cursor objects instead of an array of model instances.
+Optionally, you can pass a :batch_size option to set the size of each "chunk"; it defaults to 1000.
-Calling each() the returned cursor will yield a record to the block for each row. It handles a transaction
-block, open, buffering, and closing the cursor. In this way, it operates like an Array object. Note that
-it does not implement Enumerable, so no other methods are available.
+There are drawbacks with these methods:
-* *find_with_cursor*( <i>find options</i>, :cursor=>{<i>cursor options</i>}, &block) returns a cursor for the data matching the find options. It takes an optional block accepting a column=>value hash which returns the object desired via each or next, instead of an instance of the model.
+* You cannot specify the order; results are always ordered by the primary key (usually id)
+* The primary key must be numeric
+* The query is re-run for each chunk (1000 rows), starting from the last id of the previous chunk
+* Overly complex queries become expensive, since the query is re-executed for every chunk
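+The re-query overhead above can be sketched in plain Ruby. This is a simulation, not the gem's code: there is no database, TABLE is a stand-in array, and the chunking mirrors how find_in_batches pages by primary key:

```ruby
# Simulates find_in_batches-style keyset pagination over an in-memory "table".
# Each chunk re-runs the whole "query" (filter + sort) from scratch,
# which is exactly the overhead a server-side cursor avoids.
TABLE = (1..25).map { |id| { id: id } }

def find_in_batches_sim(batch_size: 10)
  last_id = 0
  loop do
    # The full predicate is re-evaluated for every chunk.
    batch = TABLE.select { |row| row[:id] > last_id }
                 .sort_by { |row| row[:id] }
                 .first(batch_size)
    break if batch.empty?
    yield batch
    last_id = batch.last[:id]   # resume from the last id seen
  end
end

sizes = []
find_in_batches_sim(batch_size: 10) { |batch| sizes << batch.size }
# sizes => [10, 10, 5]
```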
-* *find_by_sql_with_cursor*(<i>select statement</i>, <i>cursor options</i>, &block) takes a custom SQL statement and returns each row as above.
+== PostgreSQLCursor FTW!
-==Examples
+PostgreSQLCursor was developed to take advantage of PostgreSQL's cursors. A cursor runs a given
+query once and returns "chunks" of rows to the application program on demand, while the database
+retains the position within the full result set. This overcomes all the disadvantages
+of using find_each and find_in_batches.
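+The open/fetch/close life cycle the gem wraps can be sketched in plain Ruby. FakeCursor is a hypothetical stand-in (rows come from an array, not PostgreSQL), but the shape matches DECLARE ... CURSOR / FETCH n / CLOSE:

```ruby
# Minimal sketch of a cursor: the query "runs" once, and rows are handed
# back in buffered chunks while the source retains the current position.
class FakeCursor
  def initialize(rows, buffer_size: 2)
    @rows, @buffer_size, @pos = rows, buffer_size, 0
  end

  # Like "FETCH <buffer_size> FROM cursor": returns the next chunk.
  def fetch
    chunk = @rows[@pos, @buffer_size] || []
    @pos += chunk.size
    chunk
  end

  # Buffers a chunk at a time, then yields rows one by one.
  def each
    loop do
      chunk = fetch
      break if chunk.empty?
      chunk.each { |row| yield row }
    end
  end
end

seen = []
FakeCursor.new([1, 2, 3, 4, 5], buffer_size: 2).each { |r| seen << r }
# seen => [1, 2, 3, 4, 5]
```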
- # Active Record v3 with Arel scopes. This is the new, preferred method of use
- Account.active.each_row { |hash| puts hash.inspect }
+Also, with PostgreSQL, you have the option of having raw hashes of the rows returned instead of
+instantiated models. An informal benchmark showed that returning instances is about four times
+slower than returning hashes. If you can work with the data in this form, you will get better
+performance.
- Account.find_with_cursor(:conditions=>["status = ?", 'good']).each do |row|
- puts row.to_json
- end
-
- Account.find_by_sql_with_cursor("select ...", :buffer_size=>1000).each do |row|
- row.process
- end
+With PostgreSQL, you can work with cursors as follows:
- Account.transaction do
- cursor = Account.find_with_cursor(...) { |record| record.symbolize_keys }
- while record = cursor.next do
- record.process # => {:column=>value, ...}
- cursor.close if cursor.count > 1000 # Halts loop after 1000 records
- end
- end
+ Model.where("id>0").each_row { |hash| Model.process(hash) }
- Account.find_with_cursor(...) { |record| record.symbolize_keys }.each do |row|
- row.process
- end
+ Model.where("id>0").each_instance { |model| model.process! }
+ Model.where("id>0").each_instance(buffer_size:100000) { |model| model.process! }
+
+ Model.each_row_by_sql("select * from models") { |hash| Model.process(hash) }
+
+ Model.each_instance_by_sql("select * from models") { |model| model.process }
+
+All these methods take an options hash to control things more:
+
+ buffer_size:n The number of rows to fetch from the database each time (default 1000)
+ while:value Continue looping as long as the block returns this value
+ until:value Continue looping until the block returns this value
+ connection:conn Use this connection instead of the current model connection
+ fraction:float A value to set for the cursor_tuple_fraction variable.
+ PostgreSQL uses 0.1 (optimize for 10% of result set)
+ This library uses 1.0 (Optimize for 100% of the result set)
+ Do not override this value unless you understand it.
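+The while:/until: semantics from the list above can be sketched in plain Ruby. This is a simulation of the described behavior (no database, and each_row_sim is an invented name), not the gem's implementation:

```ruby
# Emulates the while:/until: options: keep looping only while the block
# returns the given value (while:), or stop once it returns it (until:).
# Returns the number of rows processed.
def each_row_sim(rows, while_value: nil, until_value: nil)
  count = 0
  rows.each do |row|
    result = yield row
    count += 1
    break if !while_value.nil? && result != while_value
    break if !until_value.nil? && result == until_value
  end
  count
end

rows = [1, 2, 3, 4, 5]
# Loop continues while the block returns true; row 3 returns false,
# so processing stops after 3 rows.
processed = each_row_sim(rows, while_value: true) { |r| r < 3 }
# processed => 3
```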
== Authors
Allen Fair, allen.fair@gmail.com, http://github.com/afair
Thank you to: