README.rdoc in bio-blastxmlparser-1.1.0 vs README.rdoc in bio-blastxmlparser-1.1.1
- old
+ new
@@ -26,14 +26,15 @@
document. Tests show this is faster than a SAX parser with Ruby callbacks. To
see why libxml2 based Nokogiri is fast, see
http://www.rubyinside.com/ruby-xml-performance-benchmarks-1641.html and
http://www.xml.com/lpt/a/1703.
-The parser is also designed with other optimizations, such as lazy evaluation,
-only creating objects when required, and (in a future version) parallelization. When parsing
-a full BLAST result usually only a few fields are used. By using XPath queries
-only the relevant fields are queried.
+The parser is also designed with other optimizations, such as lazy
+evaluation, i.e. only creating objects when required, and (in a future
+version) parallelization. When parsing a full BLAST result usually
+only a few fields are used. By using XPath queries only the relevant
+fields are queried.
Timings for parsing test/data/nt_example_blastn.m7 (file size 3.4Mb)
bio-blastxmlparser + Nokogiri DOM (default)
@@ -45,11 +46,11 @@
real 0m1.713s
user 0m1.444s
sys 0m0.160s
- BioRuby ReXML DOM parser
+ BioRuby ReXML DOM parser (old style)
real 1m14.548s
user 1m13.065s
sys 0m0.472s
@@ -70,17 +71,92 @@
gem install bio-blastxmlparser
for more installation on other platforms see
http://nokogiri.org/tutorials/installing_nokogiri.html.
+== Command line usage
+
+=== Usage
+ blastxmlparser [options] file(s)
+
+ -p, --parser name Use full|split parser (default full)
+ --output-fasta Output FASTA
+ -n, --named fields Set named fields
+ -e, --exec filter Execute filter
+
+ --logger filename Log to file (default stderr)
+ --trace options Set log level (default INFO, see bio-logger)
+ -q, --quiet Run quietly
+ -v, --verbose Run verbosely
+ --debug Show debug messages
+ -h, --help Show help and examples
+
+ bioblastxmlparser filename(s)
+
+ Use --help switch for more information
+
+=== Examples
+
+Print result fields of iterations containing 'lcl', using a regex
+
+ blastxmlparser -e 'iter.query_id=~/lcl/' test/data/nt_example_blastn.m7
+
+Print fields where bit_score > 145
+
+ blastxmlparser -e 'hsp.bit_score>145' test/data/nt_example_blastn.m7
+
+prints a tab delimited
+
+ 1 1 lcl|1_0 lcl|I_74685 1 5.82208e-34
+ 2 1 lcl|1_0 lcl|I_1 1 5.82208e-34
+ 3 2 lcl|2_0 lcl|I_2 1 6.05436e-59
+ 4 3 lcl|3_0 lcl|I_3 1 2.03876e-56
+
+The second and third column show the BLAST iteration, and the others
+relate to the hits.
+
+As this is evaluated Ruby, it is also possible to use the XML element
+names directly
+
+ blastxmlparser -e 'hsp["Hsp_bit-score"].to_i>145' test/data/nt_example_blastn.m7
+
+And it is possible to print (non default) named fields where E-value < 0.001
+and hit length > 100. E.g.
+
+ blastxmlparser -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
+
+ 1 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
+ 2 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
+ 3 2.76378e-11 AATATGGTAGCTACAGAAACGGTAGTACACTCTTC
+ 4 1.13373e-13 CTAAACACAGGAGCATATAGGTTGGCAGGCAGGCAAAAT
+ 5 2.76378e-11 GAAGAGTGTACTACCGTTTCTGTAGCTACCATATT
+ etc. etc.
+
+prints the evalue and qseq columns. To output FASTA use --output-fasta
+
+ blastxmlparser --output-fasta -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
+
+which prints matching sequences, where the first field is the accession, followed
+by query iteration id, and hit_id. E.g.
+
+ >I_74685 1|lcl|1_0 lcl|I_74685 [57809 - 57666] (REVERSE SENSE)
+ AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
+ >I_1 1|lcl|1_0 lcl|I_1 [477 - 884]
+ AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
+ etc. etc.
+
+To use the low-mem (iterated slower) version of the parser use
+
+ blastxmlparser --parser split -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
+
== API (Ruby library)
To loop through a BLAST result:
>> require 'bio-blastxmlparser'
>> fn = 'test/data/nt_example_blastn.m7'
- >> n = Bio::Blast::XmlIterator.new(fn).to_enum
+ >> n = Bio::BlastXMLParser::XmlIterator.new(fn).to_enum
>> n.each do | iter |
>> puts "Hits for " + iter.query_id
>> iter.each do | hit |
>> hit.each do | hsp |
>> print hit.hit_id, "\t", hsp.evalue, "\n" if hsp.evalue < 0.001
@@ -89,11 +165,11 @@
>> end
The next example parses XML using less memory by using a Ruby
Iterator
- >> blast = Bio::Blast::XmlSplitterIterator.new(fn).to_enum
+ >> blast = Bio::BlastXMLParser::XmlSplitterIterator.new(fn).to_enum
>> iter = blast.next
>> iter.iter_num
=> 1
>> iter.query_id
=> "lcl|1_0"
@@ -173,89 +249,13 @@
etc. etc.
For more examples see the files in ./spec
-== Command line usage
-
-
-== Usage
- blastxmlparser [options] file(s)
-
- -p, --parser name Use full|split parser (default full)
- --output-fasta Output FASTA
- -n, --named fields Set named fields
- -e, --exec filter Execute filter
-
- --logger filename Log to file (default stderr)
- --trace options Set log level (default INFO, see bio-logger)
- -q, --quiet Run quietly
- -v, --verbose Run verbosely
- --debug Show debug messages
- -h, --help Show help and examples
-
- bioblastxmlparser filename(s)
-
- Use --help switch for more information
-
-== Examples
-
-Print result fields of iterations containing 'lcl', using a regex
-
- blastxmlparser -e 'iter.query_id=~/lcl/' test/data/nt_example_blastn.m7
-
-Print fields where bit_score > 145
-
- blastxmlparser -e 'hsp.bit_score>145' test/data/nt_example_blastn.m7
-
-prints a tab delimited
-
- 1 1 lcl|1_0 lcl|I_74685 1 5.82208e-34
- 2 1 lcl|1_0 lcl|I_1 1 5.82208e-34
- 3 2 lcl|2_0 lcl|I_2 1 6.05436e-59
- 4 3 lcl|3_0 lcl|I_3 1 2.03876e-56
-
-The second and third column show the BLAST iteration, and the others
-relate to the hits.
-
-As this is evaluated Ruby, it is also possible to use the XML element
-names directly
-
- blastxmlparser -e 'hsp["Hsp_bit-score"].to_i>145' test/data/nt_example_blastn.m7
-
-And it is possible to print (non default) named fields where E-value < 0.001
-and hit length > 100. E.g.
-
- blastxmlparser -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
-
- 1 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
- 2 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
- 3 2.76378e-11 AATATGGTAGCTACAGAAACGGTAGTACACTCTTC
- 4 1.13373e-13 CTAAACACAGGAGCATATAGGTTGGCAGGCAGGCAAAAT
- 5 2.76378e-11 GAAGAGTGTACTACCGTTTCTGTAGCTACCATATT
- etc. etc.
-
-prints the evalue and qseq columns. To output FASTA use --output-fasta
-
- blastxmlparser --output-fasta -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
-
-which prints matching sequences, where the first field is the accession, followed
-by query iteration id, and hit_id. E.g.
-
- >I_74685 1|lcl|1_0 lcl|I_74685 [57809 - 57666] (REVERSE SENSE)
- AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
- >I_1 1|lcl|1_0 lcl|I_1 [477 - 884]
- AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
- etc. etc.
-
-To use the low-mem (iterated slower) version of the parser use
-
- blastxmlparser --parser split -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
-
== URL
The project lives at http://github.com/pjotrp/blastxmlparser. If you use this software, please cite http://dx.doi.org/10.1093/bioinformatics/btq475
== Copyright
-Copyright (c) 2011 Pjotr Prins under the MIT licence. See LICENSE.txt and http://www.opensource.org/licenses/mit-license.html for further details.
+Copyright (c) 2011,2012 Pjotr Prins under the MIT licence. See LICENSE.txt and http://www.opensource.org/licenses/mit-license.html for further details.