README.rdoc in postgresql_cursor-0.3.1 vs README.rdoc in postgresql_cursor-0.4.0

- old
+ new

@@ -1,77 +1,76 @@
-= PostgreSQLCursor
+= PostgreSQLCursor for handling large Result Sets
-PostgreSQL Cursor is an extension to the ActiveRecord PostgreSQLAdapter for very large result sets.
-It provides a cursor open/fetch/close interface to access data without loading all rows into memory,
-and instead loads the result rows in "chunks" (default of 10_000 rows), buffers them, and returns the
-rows one at a time.
+PostgreSQLCursor extends ActiveRecord to allow for efficient processing of queries
+returning a large number of rows.
-For web pages, an application would not want to process a large amount of data, usually employing a
-Pagination scheme to present it to the users. Background processes sometimes need to generate a large
-amount of data, and the ActiveRecord approach to load all data into memory is not the best fit here.
+== Why use this?
-Previous solutions employ pagination to fetch each block, then re-running the query for the next "page".
-This gem avoids re-executing the query by using PostgreSQL's cursor feature.
+ActiveRecord is designed and optimized for web performance. In a web transaction, only a "page" of
+around 20 rows is returned to the user. When you do this
-The ActiveRecord methods for the cursor return instances of the model class by default, however you
-can pass in a block to override this strategy. For instance, returning Hashes instead of AR instances
-is faster. Julian's benchmarks showed returning instances was a factor of 4 slower than returning a hash.
+
+  Model.find(:all, :conditions=>["id>0"])
+
-ActiveRecord v3 has initial support in this release. It is deprecating using the conditions in the find()
-method in favor of Arel scopes and usage. I added a method to Arel for iterate over the results in a buffered
-way.
-  Model.scope_methods.each_row { |hash| puts hash.inspect }
+The database returns all matching result set rows to ActiveRecord, which instantiates each row with
+the data returned. This function returns an array of all these rows to the caller.
-NOTE: This gem is intended to replace the 'postgresql-cursor' (with hyphen, not underscore) library.
+Asynchronous, Background, or Offline processing may require processing a large amount of data.
+When there is a very large number of rows, this requires a lot more memory to hold the data. Ruby
+does not return that memory after processing the array, which causes your process to "bloat". If you
+don't have enough memory, it will cause an exception.
-==Installation
-  [sudo] gem install postgresql_cursor
+== Enter find_each
-This does not require Rails to work, just ActiveRecord < 3.0.0 and the 'pg' gem. Rails 3 support is forthcoming.
+To solve this problem, ActiveRecord gives us two alternative methods that work in "chunks" of your data:
-You can then configure your Rails application by adding to the config/environment.rb file
-  config.gem 'postgresql_cursor'
-or require the gem on your non-rails application
-  require 'rubygems'
-  require 'postgresql_cursor'
+
+  Model.where("id>0").find_each { |model| model.process! }
-==Usage
+
+  Model.where("id>0").find_in_batches do |batch|
+    batch.each { |model| model.process! }
+  end
-This library is intended to be used via ActiveRecord. It provides two methods on ActiveRecord::Base
-like the find() method, that will return cursor objects instead of an array of model instances.
+Optionally, you can specify a :batch_size option as the size of the "chunk", which defaults to 1000.
-Calling each() the returned cursor will yield a record to the block for each row. It handles a transaction
-block, open, buffering, and closing the cursor. In this way, it operates like an Array object. Note that
-it does not implement Enumerable, so no other methods are available.
+There are drawbacks with these methods:
-* *find_with_cursor*( <i>find options</i>, :cursor=>{<i>cursor options</i>}, &block) returns a cursor for the data matching the find options. It takes an optional block accepting a column=>value hash which returns the object desired via each or next, instead of an instance of the model.
+
+* You cannot specify the order; results are ordered by the primary key (usually id)
+* The primary key must be numeric
+* The query is rerun for each chunk (1000 rows), starting at the next id sequence.
+* You cannot use overly complex queries as that will be rerun and incur more overhead.
-* *find_by_sql_with_cursor*(<i>select statement</i>, <i>cursor options</i>, &block) takes a custom SQL statement and returns each row as above.
+== PostgreSQLCursor FTW!
-==Examples
+PostgreSQLCursor was developed to take advantage of PostgreSQL's cursors. Cursors allow the program
+to declare a cursor to run a given query returning "chunks" of rows to the application program while
+retaining the position of the full result set in the database. This overcomes all the disadvantages
+of using find_each and find_in_batches.
-  # Active Record v3 with Arel scopes. This is the new, preferred method of use
-  Account.active.each_row { |hash| puts hash.inspect }
+Also, with PostgreSQL, you have an option to have raw hashes of the row returned instead of the
+instantiated models. An informal benchmark showed that returning instances is a factor of 4 times
+slower than returning hashes. If you can work with the data in this form, you will find better
+performance.
-  Account.find_with_cursor(:conditions=>["status = ?", 'good']).each do |row|
-    puts row.to_json
-  end
-
-  Account.find_by_sql_with_cursor("select ...", :buffer_size=>1000).each do |row|
-    row.process
-  end
+With PostgreSQL, you can work with cursors as follows:
-  Account.transaction do
-    cursor = Account.find_with_cursor(...) { |record| record.symbolize_keys }
-    while record = cursor.next do
-      record.process # => {:column=>value, ...}
-      cursor.close if cursor.count > 1000 # Halts loop after 1000 records
-    end
-  end
+
+  Model.where("id>0").each_row { |hash| Model.process(hash) }
-  Account.find_with_cursor(...) { |record| record.symbolize_keys }.each do |row|
-    row.process
-  end
+
+  Model.where("id>0").each_instance { |model| model.process! }
+  Model.where("id>0").each_instance(buffer_size:100000) { |model| model.process! }
+
+  Model.each_row_by_sql("select * from models") { |hash| Model.process(hash) }
+
+  Model.each_instance_by_sql("select * from models") { |model| model.process }
+
+All these methods take an options hash to control things more:
+
+  buffer_size:n    The number of rows to fetch from the database each time (default 1000)
+  while:value      Continue looping as long as the block returns this value
+  until:value      Continue looping until the block returns this value
+  connection:conn  Use this connection instead of the current model connection
+  fraction:float   A value to set for the cursor_tuple_fraction variable.
+                   PostgreSQL uses 0.1 (optimize for 10% of result set)
+                   This library uses 1.0 (optimize for 100% of the result set)
+                   Do not override this value unless you understand it.
 
 ==Authors
 Allen Fair, allen.fair@gmail.com, http://github.com/afair
 
 Thank you to:
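The cursor behaviour the new README describes (fetch rows from the database in chunks of +buffer_size+, yield them to the caller one at a time) can be sketched in plain Ruby with no database. +FakeSource+ and +each_row+ below are illustrative stand-ins for this sketch only, not classes or methods from the gem:

```ruby
# A stand-in for the database side of a cursor: holds the full result set
# and hands back one chunk at a time, like FETCH <n> FROM cursor.
class FakeSource
  def initialize(rows)
    @rows = rows
  end

  # Return the next chunk of up to `limit` rows starting at `offset`.
  def fetch(offset, limit)
    @rows[offset, limit] || []
  end
end

# Pull rows from the source in chunks of `buffer_size`, but yield them
# to the block one at a time -- the caller never sees the buffering.
def each_row(source, buffer_size: 3)
  offset = 0
  loop do
    chunk = source.fetch(offset, buffer_size)
    break if chunk.empty?
    chunk.each { |row| yield row }
    offset += buffer_size
  end
end

source = FakeSource.new((1..7).map { |i| { id: i } })
ids = []
each_row(source, buffer_size: 3) { |row| ids << row[:id] }
puts ids.inspect  # => all 7 rows seen, fetched behind the scenes in chunks of 3
```

Because the cursor keeps its position in the full result set, no query is rerun between chunks, which is the advantage over find_in_batches described above.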
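The +while:+/+until:+ options listed above amount to a stop condition checked against each block invocation's return value. A minimal plain-Ruby sketch of that idea, assuming a hypothetical helper +each_row_with_stop+ (not the gem's implementation):

```ruby
# Iterate rows, stopping early based on the block's return value:
# - while_value: keep looping as long as the block returns this value
# - until_value: keep looping until the block returns this value
# Returns the number of rows processed.
def each_row_with_stop(rows, while_value: nil, until_value: nil)
  count = 0
  rows.each do |row|
    result = yield row
    count += 1
    break if !while_value.nil? && result != while_value
    break if !until_value.nil? && result == until_value
  end
  count
end

rows = (1..10).to_a
processed = each_row_with_stop(rows, until_value: :stop) do |n|
  n == 4 ? :stop : :ok
end
puts processed  # stops after the row whose block returned :stop
```

This lets a long-running job bail out of a cursor loop early (for example, on an error flag) without fetching the remainder of the result set.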