README in external-0.1.0 vs README in external-0.3.0
- old
+ new
@@ -2,167 +2,202 @@
Indexing and array-like access to data stored on disk rather than in memory.
== Description
-External provides an easy way to index files such that array-like calls can store and
-retrieve entries directly from the file without loading it into memory. The indexes can
-be cached for performance or stored on disk alongside the data file, in essence giving you
-arbitrarily large arrays.
+External provides a way to index and access array data directly from a file
+without loading it into memory. Indexes may be cached in memory or stored
+on disk with the data file, in essence giving you arbitrarily large arrays.
+Externals automatically chunk and buffer methods like <tt>each</tt> so that
+the memory footprint remains low even during enumeration.
-The main classes of external provide array-like access to the following:
-* ExtInd (External Index) -- formatted binary data
-* ExtArr (External Array) -- externally stored ruby objects
-* ExtArc (External Archive) -- externally stored string data
+The main External classes are:
-ExtArc is a subclass of ExtArr specialized for string archival files, formats like FASTA
-where entries are strings delimited by '>':
+* ExternalIndex -- for formatted binary data
+* ExternalArchive -- for string data
+* ExternalArray -- for objects (serialized as YAML)
- >Q9BXQ0|Q9BXQ0_HUMAN Tissue transglutaminase (Fragment) - Homo sapiens (Human).
- LEPFSGKALCSWSIC
- >P02452|CO1A1_HUMAN Collagen alpha-1(I) chain - Homo sapiens (Human).
- MFSFVDLRLLLLLAATALLTHGQEEGQVEGQDEDIPPITCVQNGLRYHDRDVWKPEPCRI
- CVCDNGKVLCDDVICDETKNCPGAEVPEGECCPVCPDGSESPTDQETTGVEGPKGDTGPR
- GPRGPAGPPGRDGIPGQPGLPGPPGPPGPPGPPGLGGNFAPQLSYGYDEKSTGGISVPGP
- ...
+The array-like behavior of these classes is developed using modified versions
+of the RubySpec[http://rubyspec.org] specification for Array. The idea is to
+eventually duck-type all Array methods, including sort and collect, with
+acceptable performance.
-The array-like behavior of these classes is developed against modified versions of the
-Array tests themselves, and often uses the exact same tests. The idea is to eventually
-duck-type all Array methods, including sort and collect, with acceptable performance.
+* Rubyforge[http://rubyforge.org/projects/external]
+* Lighthouse[http://bahuvrihi.lighthouseapp.com/projects/10590-external]
+* Github[http://github.com/bahuvrihi/external/tree/master]
-=== Bugs/Known Issues
+==== Bugs/Known Issues
* only a limited set of array methods are currently supported
-* reindexing of ExtArr does not work for arrays containing yaml strings
-* yaml serialization/deserialization of some strings do not reproduce identical input
- and so will not be faithfully store in ExtArr. Carriage return string are notable:
- "\r", "\r\n", "string_with_\r\n_internal", as are chains of newlines: "\n", "\n\n"
-* documentation is poor at the moment
+* currently only [] and []= are fully tested vs RubySpec
+* documentation is patchy
---
-== Performance
-++
+Note also that YAML dump/load of some objects doesn't work or doesn't
+reproduce the object; such objects will not be properly stored in an
+ExternalArray. Problematic objects include:
-== Info
+Proc and Class:
-Copyright (c) 2006-2007, Regents of the University of Colorado.
-Developer:: {Simon Chiang}[http://bahuvrihi.wordpress.com], {Biomolecular Structure Program}[http://biomol.uchsc.edu/], {Hansen Lab}[http://hsc-proteomics.uchsc.edu/hansenlab/]
-Support:: CU Denver School of Medicine Deans Academic Enrichment Fund
-Licence:: MIT-Style
+ block = lambda {}
+ YAML.load(YAML.dump(block)) # !> TypeError: allocator undefined for Proc
+ YAML.dump(Object) # !> TypeError: can't dump anonymous class Class
-== Installation
+Carriage returns ("\r"):
-External is available from RubyForge[http://rubyforge.org/projects/external]. Use:
+ YAML.load(YAML.dump("\r")) # => nil
+ YAML.load(YAML.dump("\r\n")) # => ""
+ YAML.load(YAML.dump("string with \r\n inside")) # => "string with \n inside"
- % gem install external
+Chains of newlines ("\n"):
+ YAML.load(YAML.dump("\n")) # => ""
+ YAML.load(YAML.dump("\n\n")) # => ""
+
+DateTime is loaded as Time:
+
+ YAML.load(YAML.dump(DateTime.now)).class # => Time
+
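Whether a particular object is safe to store can be checked up front. The following is a minimal sketch using only the standard <tt>yaml</tt> library; the helper name <tt>yaml_roundtrip?</tt> is hypothetical and is not part of External. Results for the problematic values listed above vary with the Ruby and YAML versions in use, so no output is shown for them.

```ruby
require 'yaml'

# Hypothetical helper: true if obj survives a YAML dump/load cycle unchanged.
# Any error raised while dumping or loading counts as a failed round trip.
def yaml_roundtrip?(obj)
  YAML.load(YAML.dump(obj)) == obj
rescue StandardError
  false
end

# The entries used in the ExternalArray examples are plain strings, hashes,
# and arrays, which are expected to round trip.
safe = ['str', {'key' => 'value'}, [1, 2]].all? { |obj| yaml_roundtrip?(obj) }
```

A check like this could be run on each object before appending it to an ExternalArray.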
== Usage
-=== ExtArr
+=== ExternalArray
-ExtArr can be initialized from data using the [] operator and used as an array.
+ExternalArray can be initialized from data using the [] operator and used like
+an array.
- ea = ExtArr[1, 2.2, "cat", {:key => 'value'}]
- ea[2] # => "cat"
- ea.last # => {:key => 'value'}
- ea << [:a, :b]
- ea.to_a # => [1, 2.2, "cat", {:key => 'value'}, [:a, :b]]
+ a = ExternalArray['str', {'key' => 'value'}]
+ a[0] # => 'str'
+ a.last # => {'key' => 'value'}
+ a << [1,2]; a.to_a # => ['str', {'key' => 'value'}, [1,2]]
-Behind the scenes, ExtArr serializes and stores entries on a data source (io) and builds an
-ExtInd that tracks where each entry begins and ends.
+ExternalArray serializes and stores entries to an io while building an io_index
+that tracks the start and length of each entry. By default ExternalArray
+will serialize to a Tempfile and use an Array as the io_index:
- ea.io.class # => Tempfile
- ea.io.rewind
- ea.io.read # => "--- 1\n--- 2.2\n--- cat\n--- \n:key: value\n--- \n- :a\n- :b\n"
+ a.io.class # => Tempfile
+ a.io.rewind; a.io.read # => "--- str\n--- \nkey: value\n--- \n- 1\n- 2\n"
+ a.io_index.class # => Array
+ a.io_index.to_a # => [[0, 8], [8, 16], [24, 13]]
- ea.index.class # => ExtInd
- ea.index.to_a # => [[0, 6], [6, 8], [14, 8], [22, 17], [39, 15]]
+To save this data more permanently, provide a path to <tt>close</tt>; the tempfile
+is moved to that path and a binary index file is created:
-By default External supports File, Tempfile, and StringIO data sources. If no data source is
-given (as above), the external array is initialized to a Tempfile so that it will be cleaned
-up on exit.
+ a.close('example.yml')
+ File.read('example.yml') # => "--- str\n--- \nkey: value\n--- \n- 1\n- 2\n"
+
+ index = File.read('example.index')
+ index.unpack('I*') # => [0, 8, 8, 16, 24, 13]
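The layout of the index file can be illustrated without External itself. The following is a minimal sketch, using only the Ruby standard library, of how flat start/length pairs packed as 'I*' allow a single entry to be read without loading the rest of the data. The data and index values are copied from the example above; the names <tt>pairs</tt> and <tt>read_entry</tt> are hypothetical and are not part of External's API.

```ruby
require 'stringio'

# The serialized data and index values shown in the example above.
data   = StringIO.new("--- str\n--- \nkey: value\n--- \n- 1\n- 2\n")
packed = [0, 8, 8, 16, 24, 13].pack('I*')

# Group the flat integers into [start, length] pairs, one pair per entry.
pairs = packed.unpack('I*').each_slice(2).to_a

# Hypothetical helper: seek to the nth entry and read only its bytes.
def read_entry(io, pairs, n)
  start, length = pairs[n]
  io.seek(start)
  io.read(length)
end

first = read_entry(data, pairs, 0)
last  = read_entry(data, pairs, 2)
```

Here <tt>first</tt> holds the serialized form of 'str' and <tt>last</tt> the serialized form of [1,2]; each read touches only the bytes of the requested entry.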
-ExtArr can be initialized from existing data sources. In this case, ExtArr tries to find and
-load an existing index; if the index doesn't exist, then you have to reindex the data manually.
+ExternalArray provides <tt>open</tt> to create ExternalArrays from an existing
+file; the instance will use an index file if it exists and automatically
+reindex the data if it does not. Manual calls to reindex may be necessary when
+you initialize an ExternalArray with <tt>new</tt> instead of <tt>open</tt>:
- File.open('path/to/file.txt', "w+") do |file|
- file << "--- 1\n--- 2.2\n--- cat\n--- \n:key: value\n--- \n- :a\n- :b\n"
- file.flush
-
- index_filepath = ExtArr.default_index_filepath(file.path)
- File.exists?(index_filepath) # => false
-
- ea = ExtArr.new(file)
- ea.to_a # => []
- ea.reindex
- ea.to_a # => [1, 2.2, "cat", {:key => 'value'}, [:a, :b]]
+ # use of an existing index file
+ ExternalArray.open('example.yml') do |b|
+ File.basename(b.io_index.io.path) # => 'example.index'
+ b.to_a # => ['str', {'key' => 'value'}, [1,2]]
end
-ExtArr provides an open method for easy access to file data:
-
- ExtArr.open('path/to/file.txt') do |ea|
- # ...
+ # automatic reindexing
+ FileUtils.rm('example.index')
+ ExternalArray.open('example.yml') do |b|
+ b.to_a # => ['str', {'key' => 'value'}, [1,2]]
end
+
+ # manual reindexing
+ file = File.open('example.yml')
+ c = ExternalArray.new(file)
+
+ c.to_a # => []
+ c.reindex
+ c.to_a # => ['str', {'key' => 'value'}, [1,2]]
-=== ExtArc
+=== ExternalArchive
-ExtArc is a subclass of ExtArr designed for string archival files. Rather than serialize and
-load ruby objects to and from the data file, ExtArc simply read and writes strings. In
-addition, ExtArc provides additional reindexing methods designed to make reindexing easy.
+ExternalArchive is like ExternalArray except that it stores only strings.
+ExternalArray is in fact a subclass of ExternalArchive that dumps objects
+to, and loads them from, YAML strings.
- arc = ExtArc[">swift", ">brown", ">fox"]
- arc[2] # => ">fox"
- arc.to_a # => [">swift", ">brown", ">fox"]
+ arc = ExternalArchive["swift", "brown", "fox"]
+ arc[2] # => "fox"
+ arc.to_a # => ["swift", "brown", "fox"]
+ arc.io.rewind; arc.io.read # => "swiftbrownfox"
- arc.io.class # => Tempfile
- arc.io.rewind
- arc.io.read # => ">swift>brown>fox"
+ExternalArchive is useful as a base for classes to access archival data.
+Here is a simple parser for FASTA[http://en.wikipedia.org/wiki/Fasta_format]
+data:
- File.open('path/to/file.txt', "w+") do |file|
- file << ">swift>brown>fox"
- file.flush
-
- # Reindex by a separation string
- arc = ExtArc.new(file)
- arc.to_a # => []
- arc.reindex_by_sep(:sep_string => ">", :entry_follows_sep => true)
- arc.to_a # => [">swift", ">brown", ">fox"]
-
- # Reindex by scanning an entry
- arc = ExtArc.new(file)
- arc.to_a # => []
- arc.reindex_by_scan(/>\w*/)
- arc.to_a # => [">swift", ">brown", ">fox"]
+ # A sample FASTA entry
+ # >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
+ # LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
+ # EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
+ # LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
+ # GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
+ # IENY
+
+ class FastaEntry
+ attr_reader :header, :body
+
+ def initialize(str)
+ @body = str.split(/\r?\n/)
+ @header = body.shift
+ end
end
-
-=== ExtInd
+
+ class FastaArchive < ExternalArchive
+ def str_to_entry(str); FastaEntry.new(str); end
+ def entry_to_str(entry); ([entry.header] + entry.body).join("\n"); end
+
+ def reindex
+ reindex_by_sep('>', :entry_follows_sep => true)
+ end
+ end
+
+ require 'open-uri'
+ fasta = FastaArchive.new open('http://external.rubyforge.org/doc/tiny_fasta.txt')
+ fasta.reindex
+
+ fasta.length # => 5
+ fasta[0].body # => ["MEVNILAFIATTLFVLVPTAFLLIIYVKTVSQSD"]
-ExtInd provides array-like access to formatted binary data. The index of ExtArr is an
-ExtInd constructed to access data formatted as 'II'; two integers corresponding to the
-start position and length of entries in the ExtArr data source. For simple, repetitive
-formats like 'II', processing is optimized to use a general format and frame.
+The non-redundant {NCBI protein database}[ftp://ftp.ncbi.nih.gov/blast/db/FASTA/]
+contains more than 7 million FASTA entries in a 3.56 GB file; ExternalArchive
+is targeted at files that size, where lazy loading of data and a small memory
+footprint are critical.
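The idea behind reindexing by a separator can be sketched in plain Ruby. This is an illustration under stated assumptions, not External's implementation of <tt>reindex_by_sep</tt>: the helper <tt>index_by_sep</tt> is hypothetical, it works on a string already in memory (a real implementation would presumably scan the file in chunks), and it uses character offsets, which equal byte offsets only for single-byte data such as ASCII.

```ruby
# Hypothetical helper: returns [start, length] pairs for entries that each
# begin with the separator and run up to the next separator or end of data.
def index_by_sep(str, sep)
  starts = []
  pos = 0
  while (i = str.index(sep, pos))
    starts << i
    pos = i + sep.length
  end

  # Each entry ends where the next one starts; the last ends at end of data.
  ends = starts[1..-1].to_a + [str.length]
  starts.zip(ends).map { |s, e| [s, e - s] }
end

fasta   = ">one\nMEV\n>two\nLCL\n"
index   = index_by_sep(fasta, '>')
entries = index.map { |start, length| fasta[start, length] }
```

Once the pairs are known, any entry can be fetched by seeking to its start and reading its length, which is what keeps the memory footprint small.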
- ea = ExtArr.new
- ea.index.class # => ExtInd
- index = ea.index
+=== ExternalIndex
- index.format # => 'I*'
- index.frame # => 2
- index << [1,2]
- index << [3,4]
- index.to_a # => [[1,2],[3,4]]
+ExternalIndex provides array-like access to formatted binary data. The index of an
+uncached ExternalArray is an ExternalIndex configured for binary data like 'II'; two
+integers corresponding to the start position and length of an entry.
-ExtInd handles arbitrary packing formats, opening many possibilites:
+ index = ExternalIndex[1, 2, 3, 4, 5, 6, {:format => 'II'}]
+ index.format # => 'I*'
+ index.frame # => 2
+ index[1] # => [3,4]
+ index.to_a # => [[1,2], [3,4], [5,6]]
- File.open('path/to/file', "w+") do |file|
+ExternalIndex handles arbitrary packing formats, opening many possibilities:
+
+ Tempfile.open('sample.txt') do |file|
file << [1,2,3].pack("IQS")
file << [4,5,6].pack("IQS")
file << [7,8,9].pack("IQS")
file.flush
- index = ExtInd.new(file, :format => "IQS")
- index[1] # => [4,5,6]
- index.to_a # => [[1,2,3],[4,5,6],[7,8,9]]
+ index = ExternalIndex.new(file, :format => "IQS")
+ index[1] # => [4,5,6]
+ index.to_a # => [[1,2,3], [4,5,6], [7,8,9]]
end
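Array-like access to packed data relies on every frame having the same size in bytes. The following is a minimal sketch of that arithmetic using only the standard library; the helper <tt>read_frame</tt> is hypothetical and is not part of ExternalIndex. The frame size is computed rather than hard-coded because 'I' and 'S' are native-size directives.

```ruby
require 'stringio'

format = 'IQS'

# Size in bytes of one packed frame for this format on this platform.
frame_size = [0, 0, 0].pack(format).bytesize

# Build the same binary data as in the example above, in memory.
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]].map { |frame| frame.pack(format) }.join
io   = StringIO.new(data)

# Hypothetical helper: seek to the nth frame and unpack only that frame.
def read_frame(io, format, frame_size, n)
  io.seek(n * frame_size)
  io.read(frame_size).unpack(format)
end

second = read_frame(io, format, frame_size, 1)
```

Because the offset of frame n is simply <tt>n * frame_size</tt>, no separate index is needed for fixed-format binary data.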
-Note: at the moment formats must be specified longhand, ie 'III' cannot be written as 'I3',
-and the native size directives for sSiIlL are not supported.
\ No newline at end of file
+== Installation
+
+External is available from RubyForge[http://rubyforge.org/projects/external]. Use:
+
+ % gem install external
+
+== Info
+
+Copyright (c) 2006-2008, Regents of the University of Colorado.
+Developer:: {Simon Chiang}[http://bahuvrihi.wordpress.com], {Biomolecular Structure Program}[http://biomol.uchsc.edu/], {Hansen Lab}[http://hsc-proteomics.uchsc.edu/hansenlab/]
+Support:: CU Denver School of Medicine Deans Academic Enrichment Fund
+Licence:: {MIT-Style}[link:files/MIT-LICENSE.html]