README.rdoc in bio-blastxmlparser-0.6.0 vs README.rdoc in bio-blastxmlparser-0.6.1

- old
+ new

@@ -1,64 +1,71 @@
 = bio-blastxmlparser
 blastxmlparser is a fast big-data BLAST XML file parser. Rather than loading everything in memory, XML is parsed by BLAST query
-(Iteration). Not only has this the advantage of low memory use, it may
-also be faster when IO continues in parallel (disks read ahead).
+(Iteration). Not only has this the advantage of low memory use, it
+also shows results early, and it may be faster when IO continues in
+parallel (disk read-ahead).
 Next to the API, blastxmlparser comes as a command line utility, which can be used to filter results and requires no understanding of Ruby.
 == Performance
-XML parsing is expensive. blastxmlparser uses the Nokogiri C, or Java, XML
-parser, based on libxml2. Basically a DOM parser is used for subsections of a
-document, tests show this is faster than a SAX parser with Ruby callbacks. To
+XML parsing is expensive. blastxmlparser uses the fast Nokogiri C, or Java, XML
+parsers, based on libxml2. Basically, a DOM parser is used for subsections of a
+document. Tests show this is faster than a SAX parser with Ruby callbacks. To
 see why libxml2 based Nokogiri is fast, see http://www.rubyinside.com/ruby-xml-performance-benchmarks-1641.html and http://www.xml.com/lpt/a/1703.
 The parser is also designed with other optimizations, such as lazy evaluation,
-only creating objects when required, and (future) parallelization. When parsing
+only creating objects when required, and (in a future version) parallelization. When parsing
 a full BLAST result usually only a few fields are used. By using XPath queries only the relevant fields are queried.
 Timings for parsing test/data/nt_example_blastn.m7 (file size 3.4Mb)
-Nokogiri DOM (default)
+ bio-blastxmlparser + Nokogiri DOM (default)
-real 0m1.259s
-user 0m1.052s
-sys 0m0.144s
+ real 0m1.259s
+ user 0m1.052s
+ sys 0m0.144s
-Nokogiri split DOM
+ bio-blastxmlparser + Nokogiri split DOM
-real 0m1.713s
-user 0m1.444s
-sys 0m0.160s
+ real 0m1.713s
+ user 0m1.444s
+ sys 0m0.160s
-BioRuby ReXML DOM parser
+ BioRuby ReXML DOM parser
-real 1m14.548s
-user 1m13.065s
-sys 0m0.472s
+ real 1m14.548s
+ user 1m13.065s
+ sys 0m0.472s
 == Install
+Quick install:
+
 gem install bio-blastxmlparser
+Important: the parser is written for Ruby >= 1.9. You can check with
+
+ gem env
+
 Nokogiri XML parser is required. To install it, the libxml2 libraries and headers need to be installed first, for example on Debian:
 apt-get install libxslt-dev libxml2-dev
 gem install bio-blastxmlparser
 for more installation on other platforms see http://nokogiri.org/tutorials/installing_nokogiri.html.
-== API
+== API (Ruby library)
 To loop through a BLAST result:
 >> require 'bio-blastxmlparser'
 >> fn = 'test/data/nt_example_blastn.m7'
@@ -70,16 +77,17 @@
 >> print hit.hit_id, "\t", hsp.evalue, "\n" if hsp.evalue < 0.001
 >> end
 >> end
 >> end
-The next example parses XML using less memory
+The next example parses XML using less memory by using a Ruby
+Iterator
- >> blast = XmlSplitterIterator.new(fn).to_enum
+ >> blast = Bio::Blast::XmlSplitterIterator.new(fn).to_enum
 >> iter = blast.next
 >> iter.iter_num
- >> 1
+ => 1
 >> iter.query_id
 => "lcl|1_0"
 Get the first hit
@@ -130,18 +138,23 @@
 >> hsp.hseq
 => "AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG"
 >> hsp.midline
 => "|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||"
-It is possible to use the XML element names, over methods. E.g.
+Unlike BioRuby, this module uses the actual element names in the XML
+definition, to avoid confusion (if anyone wants a translation,
+feel free to contribute an adaptor).
+It is also possible to use the XML element names as Strings, rather
+than methods. E.g.
+
 >> hsp.field("Hsp_bit-score")
 => "145.205"
 >> hsp["Hsp_bit-score"]
 => "145.205"
-Note that these are always String values.
+Note that, when using the element names, the results are always String values.
 Fetch the next result (Iteration)
 >> iter2 = blast.next
 >> iter2.iter_num
@@ -151,15 +164,18 @@
 etc. etc.
 For more examples see the files in ./spec
-== Usage
+== Command line usage
+
+== Usage
 blastxmlparser [options] file(s)
 -p, --parser name Use full|split parser (default full)
+ --output-fasta Output FASTA
 -n, --named fields Set named fields
 -e, --exec filter Execute filter
 --logger filename Log to file (default stderr)
 --trace options Set log level (default INFO, see bio-logger)
@@ -180,25 +196,50 @@
 Print fields where bit_score > 145
 blastxmlparser -e 'hsp.bit_score>145' test/data/nt_example_blastn.m7
-It is also possible to use the XML element names directly
+prints a tab delimited
+ 1 1 lcl|1_0 lcl|I_74685 1 5.82208e-34
+ 2 1 lcl|1_0 lcl|I_1 1 5.82208e-34
+ 3 2 lcl|2_0 lcl|I_2 1 6.05436e-59
+ 4 3 lcl|3_0 lcl|I_3 1 2.03876e-56
+
+The second and third column show the BLAST iteration, and the others
+relate to the hits.
+
+As this is evaluated Ruby, it is also possible to use the XML element
+names directly
+
 blastxmlparser -e 'hsp["Hsp_bit-score"].to_i>145' test/data/nt_example_blastn.m7
-Print named fields where E-value < 0.001 and hit length > 100
+And it is possible to print (non default) named fields where E-value < 0.001
+and hit length > 100. E.g.
 blastxmlparser -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
 1 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
 2 5.82208e-34 AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
 3 2.76378e-11 AATATGGTAGCTACAGAAACGGTAGTACACTCTTC
 4 1.13373e-13 CTAAACACAGGAGCATATAGGTTGGCAGGCAGGCAAAAT
 5 2.76378e-11 GAAGAGTGTACTACCGTTTCTGTAGCTACCATATT
 etc. etc.
-To use the low-mem version use
+prints the evalue and qseq columns. To output FASTA use --output-fasta
+
+ blastxmlparser --output-fasta -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
+
+which prints matching sequences, where the first field is the accession, followed
+by query iteration id, and hit_id. E.g.
+
+ >I_74685 1|lcl|1_0 lcl|I_74685 [57809 - 57666] (REVERSE SENSE)
+ AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
+ >I_1 1|lcl|1_0 lcl|I_1 [477 - 884]
+ AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG
+ etc. etc.
+
+To use the low-mem (iterated slower) version of the parser use
 blastxmlparser --parser split -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7
 == URL