README.rdoc in bio-blastxmlparser-1.1.0 vs README.rdoc in bio-blastxmlparser-1.1.1

- old
+ new

@@ -26,14 +26,15 @@
 document. Tests show this is faster than a SAX parser with Ruby callbacks. To see why libxml2 based Nokogiri is fast, see http://www.rubyinside.com/ruby-xml-performance-benchmarks-1641.html and http://www.xml.com/lpt/a/1703.
-The parser is also designed with other optimizations, such as lazy evaluation,
-only creating objects when required, and (in a future version) parallelization. When parsing
-a full BLAST result usually only a few fields are used. By using XPath queries
-only the relevant fields are queried.
+The parser is also designed with other optimizations, such as lazy
+evaluation, i.e. only creating objects when required, and (in a future
+version) parallelization. When parsing a full BLAST result usually
+only a few fields are used. By using XPath queries only the relevant
+fields are queried.
 Timings for parsing test/data/nt_example_blastn.m7 (file size 3.4Mb)
  bio-blastxmlparser + Nokogiri DOM (default)
@@ -45,11 +46,11 @@
  real 0m1.713s
  user 0m1.444s
  sys 0m0.160s
- BioRuby ReXML DOM parser
+ BioRuby ReXML DOM parser (old style)
  real 1m14.548s
  user 1m13.065s
  sys 0m0.472s
@@ -70,17 +71,92 @@
  gem install bio-blastxmlparser
 for more installation on other platforms see http://nokogiri.org/tutorials/installing_nokogiri.html.
+== Command line usage
+
+=== Usage
+ blastxmlparser [options] file(s)
+
+ -p, --parser name Use full|split parser (default full)
+ --output-fasta Output FASTA
+ -n, --named fields Set named fields
+ -e, --exec filter Execute filter
+
+ --logger filename Log to file (default stderr)
+ --trace options Set log level (default INFO, see bio-logger)
+ -q, --quiet Run quietly
+ -v, --verbose Run verbosely
+ --debug Show debug messages
+ -h, --help Show help and examples
+
+ bioblastxmlparser filename(s)
+
+ Use --help switch for more information
+
+=== Examples
+
+Print result fields of iterations containing 'lcl', using a regex
+
+ blastxmlparser -e 'iter.query_id=~/lcl/' test/data/nt_example_blastn.m7
+
+Print fields where bit_score > 145
+
+ blastxmlparser -e 'hsp.bit_score>145' test/data/nt_example_blastn.m7
+
+prints a tab delimited
+
+ 1 1 lcl|1_0 lcl|I_74685 1 5.82208e-34
+ 2 1 lcl|1_0 lcl|I_1 1 5.82208e-34
+ 3 2 lcl|2_0 lcl|I_2 1 6.05436e-59
+ 4 3 lcl|3_0 lcl|I_3 1 2.03876e-56
+
+The second and third column show the BLAST iteration, and the others
+relate to the hits.
+
+As this is evaluated Ruby, it is also possible to use the XML element
+names directly
+
+ blastxmlparser -e 'hsp["Hsp_bit-score"].to_i>145' test/data/nt_example_blastn.m7
+
+And it is possible to print (non default) named fields where E-value < 0.001
+and hit length > 100. E.g.
+
+ blastxmlparser -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
+
+ 1 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
+ 2 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
+ 3 2.76378e-11 AATATGGTAGCTACAGAAACGGTAGTACACTCTTC
+ 4 1.13373e-13 CTAAACACAGGAGCATATAGGTTGGCAGGCAGGCAAAAT
+ 5 2.76378e-11 GAAGAGTGTACTACCGTTTCTGTAGCTACCATATT
+ etc. etc.
+
+prints the evalue and qseq columns. To output FASTA use --output-fasta
+
+ blastxmlparser --output-fasta -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
+
+which prints matching sequences, where the first field is the accession, followed
+by query iteration id, and hit_id. E.g.
+
+ >I_74685 1|lcl|1_0 lcl|I_74685 [57809 - 57666] (REVERSE SENSE)
+ AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
+ >I_1 1|lcl|1_0 lcl|I_1 [477 - 884]
+ AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
+ etc. etc.
+
+To use the low-mem (iterated slower) version of the parser use
+
+ blastxmlparser --parser split -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
+
 == API (Ruby library)
 To loop through a BLAST result:
  >> require 'bio-blastxmlparser'
  >> fn = 'test/data/nt_example_blastn.m7'
- >> n = Bio::Blast::XmlIterator.new(fn).to_enum
+ >> n = Bio::BlastXMLParser::XmlIterator.new(fn).to_enum
  >> n.each do | iter |
  >> puts "Hits for " + iter.query_id
  >> iter.each do | hit |
  >> hit.each do | hsp |
  >> print hit.hit_id, "\t", hsp.evalue, "\n" if hsp.evalue < 0.001
@@ -89,11 +165,11 @@
  >> end
 The next example parses XML using less memory by using a Ruby Iterator
- >> blast = Bio::Blast::XmlSplitterIterator.new(fn).to_enum
+ >> blast = Bio::BlastXMLParser::XmlSplitterIterator.new(fn).to_enum
  >> iter = blast.next
  >> iter.iter_num
  => 1
  >> iter.query_id
  => "lcl|1_0"
@@ -173,89 +249,13 @@
  etc. etc.
 For more examples see the files in ./spec
-== Command line usage
-
-
-== Usage
- blastxmlparser [options] file(s)
-
- -p, --parser name Use full|split parser (default full)
- --output-fasta Output FASTA
- -n, --named fields Set named fields
- -e, --exec filter Execute filter
-
- --logger filename Log to file (default stderr)
- --trace options Set log level (default INFO, see bio-logger)
- -q, --quiet Run quietly
- -v, --verbose Run verbosely
- --debug Show debug messages
- -h, --help Show help and examples
-
- bioblastxmlparser filename(s)
-
- Use --help switch for more information
-
-== Examples
-
-Print result fields of iterations containing 'lcl', using a regex
-
- blastxmlparser -e 'iter.query_id=~/lcl/' test/data/nt_example_blastn.m7
-
-Print fields where bit_score > 145
-
- blastxmlparser -e 'hsp.bit_score>145' test/data/nt_example_blastn.m7
-
-prints a tab delimited
-
- 1 1 lcl|1_0 lcl|I_74685 1 5.82208e-34
- 2 1 lcl|1_0 lcl|I_1 1 5.82208e-34
- 3 2 lcl|2_0 lcl|I_2 1 6.05436e-59
- 4 3 lcl|3_0 lcl|I_3 1 2.03876e-56
-
-The second and third column show the BLAST iteration, and the others
-relate to the hits.
-
-As this is evaluated Ruby, it is also possible to use the XML element
-names directly
-
- blastxmlparser -e 'hsp["Hsp_bit-score"].to_i>145' test/data/nt_example_blastn.m7
-
-And it is possible to print (non default) named fields where E-value < 0.001
-and hit length > 100. E.g.
-
- blastxmlparser -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
-
- 1 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
- 2 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
- 3 2.76378e-11 AATATGGTAGCTACAGAAACGGTAGTACACTCTTC
- 4 1.13373e-13 CTAAACACAGGAGCATATAGGTTGGCAGGCAGGCAAAAT
- 5 2.76378e-11 GAAGAGTGTACTACCGTTTCTGTAGCTACCATATT
- etc. etc.
-
-prints the evalue and qseq columns. To output FASTA use --output-fasta
-
- blastxmlparser --output-fasta -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
-
-which prints matching sequences, where the first field is the accession, followed
-by query iteration id, and hit_id. E.g.
-
- >I_74685 1|lcl|1_0 lcl|I_74685 [57809 - 57666] (REVERSE SENSE)
- AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
- >I_1 1|lcl|1_0 lcl|I_1 [477 - 884]
- AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
- etc. etc.
-
-To use the low-mem (iterated slower) version of the parser use
-
- blastxmlparser --parser split -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
-
 == URL
 The project lives at http://github.com/pjotrp/blastxmlparser. If you use this software, please cite http://dx.doi.org/10.1093/bioinformatics/btq475
 == Copyright
-Copyright (c) 2011 Pjotr Prins under the MIT licence. See LICENSE.txt and http://www.opensource.org/licenses/mit-license.html for further details.
+Copyright (c) 2011,2012 Pjotr Prins under the MIT licence. See LICENSE.txt and http://www.opensource.org/licenses/mit-license.html for further details.