README in external-0.1.0 vs README in external-0.3.0

- old
+ new

@@ -2,167 +2,202 @@
 Indexing and array-like access to data stored on disk rather than in memory.

 == Description

-External provides an easy way to index files such that array-like calls can store and
-retrieve entries directly from the file without loading it into memory. The indexes can
-be cached for performance or stored on disk alongside the data file, in essence giving you
-arbitrarily large arrays.
+External provides a way to index and access array data directly from a file
+without loading it into memory. Indexes may be cached in memory or stored
+on disk with the data file, in essence giving you arbitrarily large arrays.
+Externals automatically chunk and buffer methods like <tt>each</tt> so that
+the memory footprint remains low even during enumeration.

-The main classes of external provide array-like access to the following:
-* ExtInd (External Index) -- formatted binary data
-* ExtArr (External Array) -- externally stored ruby objects
-* ExtArc (External Archive) -- externally stored string data
+The main External classes are:

-ExtArc is a subclass of ExtArr specialized for string archival files, formats like FASTA
-where entries are strings delimited by '>':
+* ExternalIndex -- for formatted binary data
+* ExternalArchive -- for string data
+* ExternalArray -- for objects (serialized as YAML)

-  >Q9BXQ0|Q9BXQ0_HUMAN Tissue transglutaminase (Fragment) - Homo sapiens (Human).
-  LEPFSGKALCSWSIC
-  >P02452|CO1A1_HUMAN Collagen alpha-1(I) chain - Homo sapiens (Human).
-  MFSFVDLRLLLLLAATALLTHGQEEGQVEGQDEDIPPITCVQNGLRYHDRDVWKPEPCRI
-  CVCDNGKVLCDDVICDETKNCPGAEVPEGECCPVCPDGSESPTDQETTGVEGPKGDTGPR
-  GPRGPAGPPGRDGIPGQPGLPGPPGPPGPPGPPGLGGNFAPQLSYGYDEKSTGGISVPGP
-  ...
+The array-like behavior of these classes is developed using modified versions
+of the RubySpec[http://rubyspec.org] specification for Array. The idea is to
+eventually duck-type all Array methods, including sort and collect, with
+acceptable performance.
-The array-like behavior of these classes is developed against modified versions of the
-Array tests themselves, and often uses the exact same tests. The idea is to eventually
-duck-type all Array methods, including sort and collect, with acceptable performance.
+* Rubyforge[http://rubyforge.org/projects/external]
+* Lighthouse[http://bahuvrihi.lighthouseapp.com/projects/10590-external]
+* Github[http://github.com/bahuvrihi/external/tree/master]

-=== Bugs/Known Issues
+==== Bugs/Known Issues

 * only a limited set of array methods are currently supported
-* reindexing of ExtArr does not work for arrays containing yaml strings
-* yaml serialization/deserialization of some strings do not reproduce identical input
-  and so will not be faithfully store in ExtArr. Carriage return string are notable:
-  "\r", "\r\n", "string_with_\r\n_internal", as are chains of newlines: "\n", "\n\n"
-* documentation is poor at the moment
+* currently only [] and []= are fully tested vs RubySpec
+* documentation is patchy

---
-== Performance
-++
+Note also that YAML dump/load of some objects doesn't work or doesn't
+reproduce the object; such objects will not be properly stored in an
+ExternalArray. Problematic objects include:

-== Info
+Proc and Class:

-Copyright (c) 2006-2007, Regents of the University of Colorado.
-Developer:: {Simon Chiang}[http://bahuvrihi.wordpress.com], {Biomolecular Structure Program}[http://biomol.uchsc.edu/], {Hansen Lab}[http://hsc-proteomics.uchsc.edu/hansenlab/]
-Support:: CU Denver School of Medicine Deans Academic Enrichment Fund
-Licence:: MIT-Style
+  block = lambda {}
+  YAML.load(YAML.dump(block)) # !> TypeError: allocator undefined for Proc
+  YAML.dump(Object) # !> TypeError: can't dump anonymous class Class

-== Installation
+Carriage returns ("\r"):

-External is available from RubyForge[http://rubyforge.org/projects/external]. Use:
+  YAML.load(YAML.dump("\r")) # => nil
+  YAML.load(YAML.dump("\r\n")) # => ""
+  YAML.load(YAML.dump("string with \r\n inside")) # => "string with \n inside"

-  % gem install external
+Chains of newlines ("\n"):

+  YAML.load(YAML.dump("\n")) # => ""
+  YAML.load(YAML.dump("\n\n")) # => ""
+
+DateTime is loaded as Time:
+
+  YAML.load(YAML.dump(DateTime.now)).class # => Time
+
 == Usage

-=== ExtArr
+=== ExternalArray

-ExtArr can be initialized from data using the [] operator and used as an array.
+ExternalArray can be initialized from data using the [] operator and used like
+an array.

-  ea = ExtArr[1, 2.2, "cat", {:key => 'value'}]
-  ea[2] # => "cat"
-  ea.last # => {:key => 'value'}
-  ea << [:a, :b]
-  ea.to_a # => [1, 2.2, "cat", {:key => 'value'}, [:a, :b]]
+  a = ExternalArray['str', {'key' => 'value'}]
+  a[0] # => 'str'
+  a.last # => {'key' => 'value'}
+  a << [1,2]; a.to_a # => ['str', {'key' => 'value'}, [1,2]]

-Behind the scenes, ExtArr serializes and stores entries on a data source (io) and builds an
-ExtInd that tracks where each entry begins and ends.
+ExternalArray serializes and stores entries to an io while building an io_index
+that tracks the start and length of each entry. By default ExternalArray
+will serialize to a Tempfile and use an Array as the io_index:

-  ea.io.class # => Tempfile
-  ea.io.rewind
-  ea.io.read # => "--- 1\n--- 2.2\n--- cat\n--- \n:key: value\n--- \n- :a\n- :b\n"
+  a.io.class # => Tempfile
+  a.io.rewind; a.io.read # => "--- str\n--- \nkey: value\n--- \n- 1\n- 2\n"
+  a.io_index.class # => Array
+  a.io_index.to_a # => [[0, 8], [8, 16], [24, 13]]

-  ea.index.class # => ExtInd
-  ea.index.to_a # => [[0, 6], [6, 8], [14, 8], [22, 17], [39, 15]]
+To save this data more permanently, provide a path to <tt>close</tt>; the tempfile
+is moved to the path and a binary index file will be created:

-By default External supports File, Tempfile, and StringIO data sources. If no data source is
-given (as above), the external array is initialized to a Tempfile so that it will be cleaned
-up on exit.
+  a.close('example.yml')
+  File.read('example.yml') # => "--- str\n--- \nkey: value\n--- \n- 1\n- 2\n"
+
+  index = File.read('example.index')
+  index.unpack('I*') # => [0, 8, 8, 16, 24, 13]

-ExtArr can be initialized from existing data sources. In this case, ExtArr tries to find and
-load an existing index; if the index doesn't exist, then you have to reindex the data manually.
+ExternalArray provides <tt>open</tt> to create ExternalArrays from an existing
+file; the instance will use an index file if it exists and automatically
+reindex the data if it does not. Manual calls to reindex may be necessary when
+you initialize an ExternalArray with <tt>new</tt> instead of <tt>open</tt>:

-  File.open('path/to/file.txt', "w+") do |file|
-    file << "--- 1\n--- 2.2\n--- cat\n--- \n:key: value\n--- \n- :a\n- :b\n"
-    file.flush
-
-    index_filepath = ExtArr.default_index_filepath(file.path)
-    File.exists?(index_filepath) # => false
-
-    ea = ExtArr.new(file)
-    ea.to_a # => []
-    ea.reindex
-    ea.to_a # => [1, 2.2, "cat", {:key => 'value'}, [:a, :b]]
+  # use of an existing index file
+  ExternalArray.open('example.yml') do |b|
+    File.basename(b.io_index.io.path) # => 'example.index'
+    b.to_a # => ['str', {'key' => 'value'}, [1,2]]
   end

-ExtArr provides an open method for easy access to file data:
-
-  ExtArr.open('path/to/file.txt') do |ea|
-    # ...
+  # automatic reindexing
+  FileUtils.rm('example.index')
+  ExternalArray.open('example.yml') do |b|
+    b.to_a # => ['str', {'key' => 'value'}, [1,2]]
   end
+
+  # manual reindexing
+  file = File.open('example.yml')
+  c = ExternalArray.new(file)
+
+  c.to_a # => []
+  c.reindex
+  c.to_a # => ['str', {'key' => 'value'}, [1,2]]

-=== ExtArc
+=== ExternalArchive

-ExtArc is a subclass of ExtArr designed for string archival files. Rather than serialize and
-load ruby objects to and from the data file, ExtArc simply read and writes strings. In
-addition, ExtArc provides additional reindexing methods designed to make reindexing easy.
+ExternalArchive is exactly like ExternalArray except that it only stores
+strings (ExternalArray is actually a subclass of ExternalArchive which
+dumps/loads strings).

-  arc = ExtArc[">swift", ">brown", ">fox"]
-  arc[2] # => ">fox"
-  arc.to_a # => [">swift", ">brown", ">fox"]
+  arc = ExternalArchive["swift", "brown", "fox"]
+  arc[2] # => "fox"
+  arc.to_a # => ["swift", "brown", "fox"]
+  arc.io.rewind; arc.io.read # => "swiftbrownfox"

-  arc.io.class # => Tempfile
-  arc.io.rewind
-  arc.io.read # => ">swift>brown>fox"
+ExternalArchive is useful as a base for classes to access archival data.
+Here is a simple parser for FASTA[http://en.wikipedia.org/wiki/Fasta_format]
+data:

-  File.open('path/to/file.txt', "w+") do |file|
-    file << ">swift>brown>fox"
-    file.flush
-
-    # Reindex by a separation string
-    arc = ExtArc.new(file)
-    arc.to_a # => []
-    arc.reindex_by_sep(:sep_string => ">", :entry_follows_sep => true)
-    arc.to_a # => [">swift", ">brown", ">fox"]
-
-    # Reindex by scanning an entry
-    arc = ExtArc.new(file)
-    arc.to_a # => []
-    arc.reindex_by_scan(/>\w*/)
-    arc.to_a # => [">swift", ">brown", ">fox"]
+  # A sample FASTA entry
+  # >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
+  # LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
+  # EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
+  # LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
+  # GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
+  # IENY
+
+  class FastaEntry
+    attr_reader :header, :body
+
+    def initialize(str)
+      @body = str.split(/\r?\n/)
+      @header = body.shift
+    end
   end
-
-=== ExtInd
+
+  class FastaArchive < ExternalArchive
+    def str_to_entry(str); FastaEntry.new(str); end
+    def entry_to_str(entry); ([entry.header] + entry.body).join("\n"); end
+
+    def reindex
+      reindex_by_sep('>', :entry_follows_sep => true)
+    end
+  end
+
+  require 'open-uri'
+  fasta = FastaArchive.new open('http://external.rubyforge.org/doc/tiny_fasta.txt')
+  fasta.reindex
+
+  fasta.length # => 5
+  fasta[0].body # => ["MEVNILAFIATTLFVLVPTAFLLIIYVKTVSQSD"]

-ExtInd provides array-like access to formatted binary data. The index of ExtArr is an
-ExtInd constructed to access data formatted as 'II'; two integers corresponding to the
-start position and length of entries in the ExtArr data source. For simple, repetitive
-formats like 'II', processing is optimized to use a general format and frame.
+The non-redundant {NCBI protein database}[ftp://ftp.ncbi.nih.gov/blast/db/FASTA/]
+contains greater than 7 million FASTA entries in a 3.56 GB file; ExternalArchive
+is targeted at files that size, where lazy loading of data and a small memory
+footprint are critical.

-  ea = ExtArr.new
-  ea.index.class # => ExtInd
-  index = ea.index
+=== ExternalIndex

-  index.format # => 'I*'
-  index.frame # => 2
-  index << [1,2]
-  index << [3,4]
-  index.to_a # => [[1,2],[3,4]]
+ExternalIndex provides array-like access to formatted binary data. The index of an
+uncached ExternalArray is an ExternalIndex configured for binary data like 'II'; two
+integers corresponding to the start position and length an entry.
-ExtInd handles arbitrary packing formats, opening many possibilites:
+  index = ExternalIndex[1, 2, 3, 4, 5, 6, {:format => 'II'}]
+  index.format # => 'I*'
+  index.frame # => 2
+  index[1] # => [3,4]
+  index.to_a # => [[1,2], [3,4], [5,6]]

-  File.open('path/to/file', "w+") do |file|
+ExternalIndex handles arbitrary packing formats, opening many possibilities:
+
+  Tempfile.new('sample.txt') do |file|
     file << [1,2,3].pack("IQS")
     file << [4,5,6].pack("IQS")
     file << [7,8,9].pack("IQS")
     file.flush

-    index = ExtInd.new(file, :format => "IQS")
-    index[1] # => [4,5,6]
-    index.to_a # => [[1,2,3],[4,5,6],[7,8,9]]
+    index = ExternalIndex.new(file, :format => "IQS")
+    index[1] # => [4,5,6]
+    index.to_a # => [[1,2,3], [4,5,6], [7,8,9]]
   end

-Note: at the moment formats must be specified longhand, ie 'III' cannot be written as 'I3',
-and the native size directives for sSiIlL are not supported.
\ No newline at end of file
+== Installation
+
+External is available from RubyForge[http://rubyforge.org/projects/external]. Use:
+
+  % gem install external
+
+== Info
+
+Copyright (c) 2006-2008, Regents of the University of Colorado.
+Developer:: {Simon Chiang}[http://bahuvrihi.wordpress.com], {Biomolecular Structure Program}[http://biomol.uchsc.edu/], {Hansen Lab}[http://hsc-proteomics.uchsc.edu/hansenlab/]
+Support:: CU Denver School of Medicine Deans Academic Enrichment Fund
+Licence:: {MIT-Style}[link:files/MIT-LICENSE.html]
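
Note on the 0.3.0 index layout: the README describes the on-disk index as consecutive 'II' frames (start position, then length). The sketch below illustrates that idea using only the Ruby standard library, not External itself. The helper <tt>read_frame</tt> and the constants are hypothetical names introduced for illustration, and the sample bytes are copied from the ExternalArray example in the README; how External implements lookups internally may differ.

```ruby
require 'stringio'

# Hypothetical helper, not part of External: fetch the n-th fixed-width
# frame from an index io by seeking, without reading the rest of the io.
FORMAT     = 'II'                            # two native unsigned ints: start, length
FRAME_SIZE = [0, 0].pack(FORMAT).bytesize    # bytes per frame on this platform

def read_frame(io, n)
  io.seek(n * FRAME_SIZE)
  bytes = io.read(FRAME_SIZE)
  bytes && bytes.unpack(FORMAT)              # nil when n is past the last frame
end

# Data and index values taken from the README's ExternalArray example.
data  = StringIO.new("--- str\n--- \nkey: value\n--- \n- 1\n- 2\n")
index = StringIO.new([0, 8, 8, 16, 24, 13].pack('I*'))

# Look up entry 1: read its frame, then read only that slice of the data.
start, length = read_frame(index, 1)
data.seek(start)
entry = data.read(length)
```

Because every frame has the same size, locating entry n costs one seek into the index and one seek into the data, which is what keeps the memory footprint small for large files.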