= External Indexing and array-like access to data stored on disk rather than in memory. == Description External provides a way to index and access array data directly from a file without loading it into memory. Indexes may be cached in memory or stored on disk with the data file, in essence giving you arbitrarily large arrays. Externals automatically chunk and buffer methods like each so that the memory footprint remains low even during enumeration. The main External classes are: * ExternalIndex -- for formatted binary data * ExternalArchive -- for string data * ExternalArray -- for objects (serialized as YAML) The array-like behavior of these classes is developed using modified versions of the RubySpec[http://rubyspec.org] specification for Array. The idea is to eventually duck-type all Array methods, including sort and collect, with acceptable performance. * Rubyforge[http://rubyforge.org/projects/external] * Lighthouse[http://bahuvrihi.lighthouseapp.com/projects/10590-external] * Github[http://github.com/bahuvrihi/external/tree/master] ==== Bugs/Known Issues * only a limited set of array methods are currently supported * currently only [] and []= are fully tested vs RubySpec * documentation is patchy Note also that YAML dump/load of some objects doesn't work or doesn't reproduce the object; such objects will not be properly stored in an ExternalArray. Problematic objects include: Proc and Class: block = lambda {} YAML.load(YAML.dump(block)) # !> TypeError: allocator undefined for Proc YAML.dump(Object) # !> TypeError: can't dump anonymous class Class Carriage returns ("\r"): YAML.load(YAML.dump("\r")) # => nil YAML.load(YAML.dump("\r\n")) # => "" YAML.load(YAML.dump("string with \r\n inside")) # => "string with \n inside" Chains of newlines ("\n"): YAML.load(YAML.dump("\n")) # => "" YAML.load(YAML.dump("\n\n")) # => "" DateTime is loaded as Time: YAML.load(YAML.dump(DateTime.now)).class # => Time == Usage === ExternalArray ExternalArray can be initialized from data using the [] operator and used like an array. a = ExternalArray['str', {'key' => 'value'}] a[0] # => 'str' a.last # => {'key' => 'value'} a << [1,2]; a.to_a # => ['str', {'key' => 'value'}, [1,2]] ExternalArray serializes and stores entries to an io while building an io_index that tracks the start and length of each entry. By default ExternalArray will serialize to a Tempfile and use an Array as the io_index: a.io.class # => Tempfile a.io.rewind; a.io.read # => "--- str\n--- \nkey: value\n--- \n- 1\n- 2\n" a.io_index.class # => Array a.io_index.to_a # => [[0, 8], [8, 16], [24, 13]] To save this data more permanently, provide a path to close; the tempfile is moved to the path and a binary index file will be created: a.close('example.yml') File.read('example.yml') # => "--- str\n--- \nkey: value\n--- \n- 1\n- 2\n" index = File.read('example.index') index.unpack('I*') # => [0, 8, 8, 16, 24, 13] ExternalArray provides open to create ExternalArrays from an existing file; the instance will use an index file if it exists and automatically reindex the data if it does not. Manual calls to reindex may be necessary when you initialize an ExternalArray with new instead of open: # use of an existing index file ExternalArray.open('example.yml') do |b| File.basename(b.io_index.io.path) # => 'example.index' b.to_a # => ['str', {'key' => 'value'}, [1,2]] end # automatic reindexing FileUtils.rm('example.index') ExternalArray.open('example.yml') do |b| b.to_a # => ['str', {'key' => 'value'}, [1,2]] end # manual reindexing file = File.open('example.yml') c = ExternalArray.new(file) c.to_a # => [] c.reindex c.to_a # => ['str', {'key' => 'value'}, [1,2]] === ExternalArchive ExternalArchive is exactly like ExternalArray except that it only stores strings (ExternalArray is actually a subclass of ExternalArchive which dumps/loads strings). arc = ExternalArchive["swift", "brown", "fox"] arc[2] # => "fox" arc.to_a # => ["swift", "brown", "fox"] arc.io.rewind; arc.io.read # => "swiftbrownfox" ExternalArchive is useful as a base for classes to access archival data. Here is a simple parser for FASTA[http://en.wikipedia.org/wiki/Fasta_format] data: # A sample FASTA entry # >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus] # LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV # EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG # LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL # GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX # IENY class FastaEntry attr_reader :header, :body def initialize(str) @body = str.split(/\r?\n/) @header = body.shift end end class FastaArchive < ExternalArchive def str_to_entry(str); FastaEntry.new(str); end def entry_to_str(entry); ([entry.header] + entry.body).join("\n"); end def reindex reindex_by_sep('>', :entry_follows_sep => true) end end require 'open-uri' fasta = FastaArchive.new open('http://external.rubyforge.org/doc/tiny_fasta.txt') fasta.reindex fasta.length # => 5 fasta[0].body # => ["MEVNILAFIATTLFVLVPTAFLLIIYVKTVSQSD"] The non-redundant {NCBI protein database}[ftp://ftp.ncbi.nih.gov/blast/db/FASTA/] contains greater than 7 million FASTA entries in a 3.56 GB file; ExternalArchive is targeted at files that size, where lazy loading of data and a small memory footprint are critical. === ExternalIndex ExternalIndex provides array-like access to formatted binary data. The index of an uncached ExternalArray is an ExternalIndex configured for binary data like 'II'; two integers corresponding to the start position and length an entry. index = ExternalIndex[1, 2, 3, 4, 5, 6, {:format => 'II'}] index.format # => 'I*' index.frame # => 2 index[1] # => [3,4] index.to_a # => [[1,2], [3,4], [5,6]] ExternalIndex handles arbitrary packing formats, opening many possibilities: Tempfile.new('sample.txt') do |file| file << [1,2,3].pack("IQS") file << [4,5,6].pack("IQS") file << [7,8,9].pack("IQS") file.flush index = ExternalIndex.new(file, :format => "IQS") index[1] # => [4,5,6] index.to_a # => [[1,2,3], [4,5,6], [7,8,9]] end == Installation External is available from RubyForge[http://rubyforge.org/projects/external]. Use: % gem install external == Info Copyright (c) 2006-2008, Regents of the University of Colorado. Developer:: {Simon Chiang}[http://bahuvrihi.wordpress.com], {Biomolecular Structure Program}[http://biomol.uchsc.edu/], {Hansen Lab}[http://hsc-proteomics.uchsc.edu/hansenlab/] Support:: CU Denver School of Medicine Deans Academic Enrichment Fund Licence:: {MIT-Style}[link:files/MIT-LICENSE.html]