= bio-blastxmlparser

blastxmlparser is a fast big-data BLAST XML file parser. Rather than
loading everything in memory, XML is parsed by BLAST query
(Iteration). This keeps memory use low, and it may also be faster
because IO can continue in parallel (disks read ahead).

Next to the API, blastxmlparser comes as a command line utility, which
can be used to filter results and requires no understanding of Ruby.

== Performance

XML parsing is expensive. blastxmlparser uses the Nokogiri XML parser
(C or Java), which is based on libxml2. A DOM parser is used for
subsections of a document; tests show this is faster than a SAX parser
with Ruby callbacks. To see why libxml2 based Nokogiri is fast, see
http://www.rubyinside.com/ruby-xml-performance-benchmarks-1641.html
and http://www.xml.com/lpt/a/1703.

The parser is also designed with other optimizations, such as lazy
evaluation (objects are only created when required) and (future)
parallelization. When parsing a full BLAST result, usually only a few
fields are used. XPath queries make sure that only the relevant fields
are queried.

Timings for parsing test/data/nt_example_blastn.m7 (file size 3.4 MB)

  Nokogiri DOM (default)

    real    0m1.259s
    user    0m1.052s
    sys     0m0.144s

  Nokogiri split DOM

    real    0m1.713s
    user    0m1.444s
    sys     0m0.160s

  BioRuby ReXML DOM parser

    real    1m14.548s
    user    1m13.065s
    sys     0m0.472s

== Install

  gem install bio-blastxmlparser

The Nokogiri XML parser is required. To install it, the libxml2
libraries and headers need to be installed first, for example on
Debian:

  apt-get install libxslt-dev libxml2-dev
  gem install bio-blastxmlparser

For installation on other platforms, see
http://nokogiri.org/tutorials/installing_nokogiri.html.

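
As a side note to the lazy evaluation mentioned under Performance, the
idea can be illustrated with a small sketch that does not depend on the
library. It is not the gem's own code: the class LazyRecord and its
methods are hypothetical names, and the field values are taken from the
example data in this README. A raw String is converted only on first
access and then cached.

```ruby
# Hypothetical illustration of lazy field evaluation; not the gem's own code.
class LazyRecord
  attr_reader :conversions

  def initialize(raw)
    @raw = raw          # field name => raw String, as found in the XML
    @cache = {}         # converted values, filled on first access
    @conversions = 0    # counts how often a conversion actually ran
  end

  # Convert a field to Float only when it is first requested.
  def float(name)
    @cache.fetch(name) do
      @conversions += 1
      @cache[name] = Float(@raw.fetch(name))
    end
  end
end

rec = LazyRecord.new("Hsp_bit-score" => "145.205", "Hsp_evalue" => "5.82208e-34")
rec.float("Hsp_evalue")   # converted now
rec.float("Hsp_evalue")   # served from the cache
puts rec.conversions      # prints 1; the bit score was never converted
```

When only a few fields per result are inspected, the untouched fields
cost nothing beyond holding the raw text.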
== API

To loop through a BLAST result:

  >> require 'bio-blastxmlparser'
  >> fn = 'test/data/nt_example_blastn.m7'
  >> n = Bio::Blast::XmlIterator.new(fn).to_enum
  >> n.each do | iter |
  >>   puts "Hits for " + iter.query_id
  >>   iter.each do | hit |
  >>     hit.each do | hsp |
  >>       print hit.hit_id, "\t", hsp.evalue, "\n" if hsp.evalue < 0.001
  >>     end
  >>   end
  >> end

The next example parses XML using less memory

  >> blast = XmlSplitterIterator.new(fn).to_enum
  >> iter = blast.next
  >> iter.iter_num
  => 1
  >> iter.query_id
  => "lcl|1_0"

Get the first hit

  >> hit = iter.hits.first
  >> hit.hit_num
  => 1
  >> hit.hit_id
  => "lcl|I_74685"
  >> hit.hit_def
  => "[57809 - 57666] (REVERSE SENSE) "
  >> hit.accession
  => "I_74685"
  >> hit.len
  => 144

Get the parent info

  >> hit.parent.query_id
  => "lcl|1_0"

Get the first Hsp

  >> hsp = hit.hsps.first
  >> hsp.hsp_num
  => 1
  >> hsp.bit_score
  => 145.205
  >> hsp.score
  => 73
  >> hsp.evalue
  => 5.82208e-34
  >> hsp.query_from
  => 28
  >> hsp.query_to
  => 100
  >> hsp.query_frame
  => 1
  >> hsp.hit_frame
  => 1
  >> hsp.identity
  => 73
  >> hsp.positive
  => 73
  >> hsp.align_len
  => 73
  >> hsp.qseq
  => "AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG"
  >> hsp.hseq
  => "AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG"
  >> hsp.midline
  => "|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||"

It is also possible to use the XML element names instead of the
methods, e.g.

  >> hsp.field("Hsp_bit-score")
  => "145.205"
  >> hsp["Hsp_bit-score"]
  => "145.205"

Note that these are always String values.

Fetch the next result (Iteration)

  >> iter2 = blast.next
  >> iter2.iter_num
  => 2
  >> iter2.query_id
  => "lcl|2_0"

etc. etc.

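
The low-memory approach above hands out one Iteration at a time. A
minimal sketch of that idea, using only the Ruby standard library, is
shown below. It is not the gem's implementation: the helper
each_iteration_chunk is a hypothetical name, the XML is a cut-down
stand-in for real BLAST output, and the sketch assumes that the opening
and closing Iteration tags each sit on their own line.

```ruby
require 'stringio'

# Hypothetical helper, not part of bio-blastxmlparser: returns an Enumerator
# that yields one <Iteration>...</Iteration> block at a time as a String.
# Assumes the opening and closing tags each sit on their own line.
def each_iteration_chunk(io)
  Enumerator.new do |out|
    chunk = nil
    io.each_line do |line|
      chunk = +"" if line.include?("<Iteration>")   # start a new block
      chunk << line if chunk                         # collect lines inside a block
      if chunk && line.include?("</Iteration>")
        out << chunk                                 # hand the finished block to the caller
        chunk = nil
      end
    end
  end
end

# Cut-down stand-in for BLAST XML output (illustration only).
xml = <<~XML
  <BlastOutput_iterations>
  <Iteration>
    <Iteration_iter-num>1</Iteration_iter-num>
    <Iteration_query-ID>lcl|1_0</Iteration_query-ID>
  </Iteration>
  <Iteration>
    <Iteration_iter-num>2</Iteration_iter-num>
    <Iteration_query-ID>lcl|2_0</Iteration_query-ID>
  </Iteration>
  </BlastOutput_iterations>
XML

chunks = each_iteration_chunk(StringIO.new(xml))
first = chunks.next   # only the first block has been read at this point
puts first[/<Iteration_query-ID>(.*?)<\/Iteration_query-ID>/, 1]   # prints lcl|1_0
```

Each block can then be passed to a DOM parser on its own, so that only
one Iteration needs to be held in memory at any time.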
For more examples see the files in ./spec

== Usage

  blastxmlparser [options] file(s)

      -p, --parser name                Use full|split parser (default full)
      -n, --named fields               Set named fields
      -e, --exec filter                Execute filter

          --logger filename            Log to file (default stderr)
          --trace options              Set log level (default INFO, see bio-logger)
      -q, --quiet                      Run quietly
      -v, --verbose                    Run verbosely
          --debug                      Show debug messages
      -h, --help                       Show help and examples

  bioblastxmlparser filename(s)

      Use --help switch for more information

== Examples

Print result fields of iterations containing 'lcl', using a regex

  blastxmlparser -e 'iter.query_id=~/lcl/' test/data/nt_example_blastn.m7

Print fields where bit_score > 145

  blastxmlparser -e 'hsp.bit_score>145' test/data/nt_example_blastn.m7

It is also possible to use the XML element names directly. These return
String values, so convert them before comparing

  blastxmlparser -e 'hsp["Hsp_bit-score"].to_f>145' test/data/nt_example_blastn.m7

Print named fields where E-value < 0.01 and hit length > 100

  blastxmlparser -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7

    1  5.82208e-34  AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
    2  5.82208e-34  AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
    3  2.76378e-11  AATATGGTAGCTACAGAAACGGTAGTACACTCTTC
    4  1.13373e-13  CTAAACACAGGAGCATATAGGTTGGCAGGCAGGCAAAAT
    5  2.76378e-11  GAAGAGTGTACTACCGTTTCTGTAGCTACCATATT
    etc. etc.

To use the low-memory version, use

  blastxmlparser --parser split -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7

== URL

The project lives at http://github.com/pjotrp/blastxmlparser. If you use
this software, please cite http://dx.doi.org/10.1093/bioinformatics/btq475

== Copyright

Copyright (c) 2011 Pjotr Prins under the MIT licence. See LICENSE.txt and
http://www.opensource.org/licenses/mit-license.html for further details.