README.rdoc in bio-blastxmlparser-0.6.0 vs README.rdoc in bio-blastxmlparser-0.6.1
- old
+ new
@@ -1,64 +1,71 @@
= bio-blastxmlparser
blastxmlparser is a fast big-data BLAST XML file parser. Rather than
loading everything in memory, XML is parsed by BLAST query
-(Iteration). Not only has this the advantage of low memory use, it may
-also be faster when IO continues in parallel (disks read ahead).
+(Iteration). This not only keeps memory use low, it also shows
+results early, and it may be faster when IO continues in parallel
+(disk read-ahead).
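The idea of parsing per Iteration can be sketched in plain Ruby. This is a minimal illustration, not the gem's implementation: the helper name each_iteration_chunk is hypothetical, and it assumes the opening and closing Iteration tags each sit on their own line.

```ruby
require 'stringio'

# Hypothetical helper (not part of bio-blastxmlparser): yields every
# <Iteration>...</Iteration> block of a BLAST XML stream as one String,
# so only a single query result is held in memory at a time.
# Assumption: the opening and closing tags are on separate lines.
def each_iteration_chunk(io)
  return enum_for(:each_iteration_chunk, io) unless block_given?

  buffer = nil
  io.each_line do |line|
    buffer = String.new if line.include?('<Iteration>')  # start a new chunk
    buffer << line if buffer                             # collect lines inside a chunk
    if buffer && line.include?('</Iteration>')
      yield buffer                                       # hand over one complete query
      buffer = nil                                       # drop it before reading on
    end
  end
end

xml = <<~XML
  <BlastOutput_iterations>
  <Iteration>
    <Iteration_iter-num>1</Iteration_iter-num>
  </Iteration>
  <Iteration>
    <Iteration_iter-num>2</Iteration_iter-num>
  </Iteration>
  </BlastOutput_iterations>
XML

each_iteration_chunk(StringIO.new(xml)).each_with_index do |chunk, i|
  puts "chunk #{i + 1}: #{chunk.lines.size} lines"
end
```

Each chunk could then be handed to a DOM parser on its own, which is why the first results can be shown before the whole file has been read.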
Next to the API, blastxmlparser comes as a command line utility, which
can be used to filter results and requires no understanding of Ruby.
== Performance
-XML parsing is expensive. blastxmlparser uses the Nokogiri C, or Java, XML
-parser, based on libxml2. Basically a DOM parser is used for subsections of a
-document, tests show this is faster than a SAX parser with Ruby callbacks. To
+XML parsing is expensive. blastxmlparser uses the fast Nokogiri C, or Java, XML
+parsers, based on libxml2. Basically, a DOM parser is used for subsections of a
+document. Tests show this is faster than a SAX parser with Ruby callbacks. To
see why libxml2 based Nokogiri is fast, see
http://www.rubyinside.com/ruby-xml-performance-benchmarks-1641.html and
http://www.xml.com/lpt/a/1703.
The parser is also designed with other optimizations, such as lazy evaluation,
-only creating objects when required, and (future) parallelization. When parsing
+only creating objects when required, and (in a future version) parallelization. When parsing
a full BLAST result, usually only a few fields are used. By using XPath queries,
only the relevant fields are read.
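Lazy evaluation of fields can be sketched as follows. This is an illustration under stated assumptions, not the gem's code: LazyRecord is a hypothetical class, and a Hash of Strings stands in for the XML subsection that the real parser would query with XPath.

```ruby
# Hypothetical stand-in for a parsed record (not the gem's actual class).
# A field is converted only when first requested, and the result is cached.
class LazyRecord
  attr_reader :lookups

  def initialize(raw)
    @raw = raw      # unparsed source; here a Hash of Strings instead of XML
    @cache = {}
    @lookups = 0    # counts how often a field was really fetched
  end

  def evalue
    fetch_once(:evalue) { Float(@raw.fetch('Hsp_evalue')) }
  end

  def bit_score
    fetch_once(:bit_score) { Float(@raw.fetch('Hsp_bit-score')) }
  end

  private

  # Runs the block on first access only; later calls return the cached value.
  def fetch_once(key)
    return @cache[key] if @cache.key?(key)

    @lookups += 1
    @cache[key] = yield
  end
end

hsp = LazyRecord.new('Hsp_evalue' => '5.82208e-34', 'Hsp_bit-score' => '145.205')
hsp.evalue
hsp.evalue
puts hsp.lookups  # bit_score was never touched, evalue was fetched once
```

Fields that a filter never asks for cost nothing, which matters when only a few fields of a large result are used.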
Timings for parsing test/data/nt_example_blastn.m7 (file size 3.4Mb)
-Nokogiri DOM (default)
+ bio-blastxmlparser + Nokogiri DOM (default)
-real 0m1.259s
-user 0m1.052s
-sys 0m0.144s
+ real 0m1.259s
+ user 0m1.052s
+ sys 0m0.144s
-Nokogiri split DOM
+ bio-blastxmlparser + Nokogiri split DOM
-real 0m1.713s
-user 0m1.444s
-sys 0m0.160s
+ real 0m1.713s
+ user 0m1.444s
+ sys 0m0.160s
-BioRuby ReXML DOM parser
+ BioRuby ReXML DOM parser
-real 1m14.548s
-user 1m13.065s
-sys 0m0.472s
+ real 1m14.548s
+ user 1m13.065s
+ sys 0m0.472s
== Install
+Quick install:
+
gem install bio-blastxmlparser
+Important: the parser is written for Ruby >= 1.9. You can check with
+
+ gem env
+
Nokogiri XML parser is required. To install it,
the libxml2 libraries and headers need to be installed first, for
example on Debian:
apt-get install libxslt-dev libxml2-dev
gem install bio-blastxmlparser
For installation instructions on other platforms, see
http://nokogiri.org/tutorials/installing_nokogiri.html.
-== API
+== API (Ruby library)
To loop through a BLAST result:
>> require 'bio-blastxmlparser'
>> fn = 'test/data/nt_example_blastn.m7'
@@ -70,16 +77,17 @@
>> print hit.hit_id, "\t", hsp.evalue, "\n" if hsp.evalue < 0.001
>> end
>> end
>> end
-The next example parses XML using less memory
+The next example parses XML using less memory by wrapping the
+iterator in a Ruby Enumerator
- >> blast = XmlSplitterIterator.new(fn).to_enum
+ >> blast = Bio::Blast::XmlSplitterIterator.new(fn).to_enum
>> iter = blast.next
>> iter.iter_num
- >> 1
+ => 1
>> iter.query_id
=> "lcl|1_0"
Get the first hit
@@ -130,18 +138,23 @@
>> hsp.hseq
=> "AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG"
>> hsp.midline
=> "|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||"
-It is possible to use the XML element names, over methods. E.g.
+Unlike BioRuby, this module uses the actual element names in the XML
+definition, to avoid confusion (if anyone wants a translation,
+feel free to contribute an adaptor).
+It is also possible to use the XML element names as Strings, rather
+than methods. E.g.
+
>> hsp.field("Hsp_bit-score")
=> "145.205"
>> hsp["Hsp_bit-score"]
=> "145.205"
-Note that these are always String values.
+Note that, when using the element names, the results are always String values.
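Because element-name lookups return Strings, they need converting before a numeric comparison. A minimal sketch in plain Ruby, where the helper name numeric_field is hypothetical and the String is the value shown above:

```ruby
# Hypothetical helper: converts a String field to a Float and raises
# ArgumentError on malformed input instead of silently returning 0.0.
def numeric_field(value)
  Float(value)
end

raw = '145.205'               # the kind of value hsp["Hsp_bit-score"] returns

puts raw.to_i                 # integer conversion drops the fraction
puts raw.to_f                 # float conversion keeps it
puts numeric_field(raw) > 145 # compare as a number, not as a String
```

Note that to_i truncates, so a filter such as "greater than 145" behaves differently with to_i than with to_f for this value.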
Fetch the next result (Iteration)
>> iter2 = blast.next
>> iter2.iter_num
@@ -151,15 +164,18 @@
etc. etc.
For more examples see the files in ./spec
-== Usage
+== Command line usage
+
+== Usage
blastxmlparser [options] file(s)
-p, --parser name Use full|split parser (default full)
+ --output-fasta Output FASTA
-n, --named fields Set named fields
-e, --exec filter Execute filter
--logger filename Log to file (default stderr)
--trace options Set log level (default INFO, see bio-logger)
@@ -180,25 +196,50 @@
Print fields where bit_score > 145
blastxmlparser -e 'hsp.bit_score>145' test/data/nt_example_blastn.m7
-It is also possible to use the XML element names directly
+prints tab-delimited output:
+ 1 1 lcl|1_0 lcl|I_74685 1 5.82208e-34
+ 2 1 lcl|1_0 lcl|I_1 1 5.82208e-34
+ 3 2 lcl|2_0 lcl|I_2 1 6.05436e-59
+ 4 3 lcl|3_0 lcl|I_3 1 2.03876e-56
+
+The second and third columns show the BLAST iteration, and the others
+relate to the hits.
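Tab-delimited output of this kind is easy to read back in Ruby. The sketch below is an assumption-laden illustration: parse_row and the Row field names are hypothetical, and the names given to the first and fifth columns in particular are guesses, since the text only says they relate to the hits.

```ruby
# Hypothetical field names; only iter_num, query_id and the E-value column
# follow from the description above, the rest are guesses.
Row = Struct.new(:row, :iter_num, :query_id, :hit_id, :num, :evalue)

# Splits one tab-delimited output line and converts the numeric columns.
def parse_row(line)
  f = line.chomp.split("\t")
  Row.new(Integer(f[0]), Integer(f[1]), f[2], f[3], Integer(f[4]), Float(f[5]))
end

# First output line from the example above, joined with tabs.
line = ['1', '1', 'lcl|1_0', 'lcl|I_74685', '1', '5.82208e-34'].join("\t")
row = parse_row(line)
puts row.query_id
puts row.evalue < 0.001
```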
+
+As this is evaluated Ruby, it is also possible to use the XML element
+names directly
+
blastxmlparser -e 'hsp["Hsp_bit-score"].to_f>145' test/data/nt_example_blastn.m7
-Print named fields where E-value < 0.001 and hit length > 100
+And it is possible to print (non default) named fields where E-value < 0.01
+and hit length > 100. E.g.
blastxmlparser -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
1 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
2 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
3 2.76378e-11 AATATGGTAGCTACAGAAACGGTAGTACACTCTTC
4 1.13373e-13 CTAAACACAGGAGCATATAGGTTGGCAGGCAGGCAAAAT
5 2.76378e-11 GAAGAGTGTACTACCGTTTCTGTAGCTACCATATT
etc. etc.
-To use the low-mem version use
+prints the evalue and qseq columns. To output FASTA use --output-fasta
+
+ blastxmlparser --output-fasta -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
+
+which prints matching sequences, where the first field is the accession, followed
+by query iteration id, and hit_id. E.g.
+
+ >I_74685 1|lcl|1_0 lcl|I_74685 [57809 - 57666] (REVERSE SENSE)
+ AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
+ >I_1 1|lcl|1_0 lcl|I_1 [477 - 884]
+ AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
+ etc. etc.
+
+To use the low-memory (iterated, slower) version of the parser use
blastxmlparser --parser split -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
== URL