= bio-blastxmlparser

blastxmlparser is a fast big-data BLAST XML file parser. Rather than
loading everything in memory, XML is parsed by BLAST query
(Iteration). This keeps memory use low, and it may also be faster
when IO continues in parallel (disks read ahead).
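
As an illustration of the per-Iteration idea (not the gem's actual implementation), the sketch below reads an IO line by line and yields one <Iteration> block at a time. The helper name each_iteration_chunk is hypothetical, and the code assumes the opening and closing Iteration tags each sit on their own line.

```ruby
require 'stringio'

# Yield the text of each <Iteration>...</Iteration> block from an IO,
# one at a time, so only a single query result is buffered in memory.
# Hypothetical helper; assumes the tags sit on their own lines.
def each_iteration_chunk(io)
  buffer = nil
  io.each_line do |line|
    buffer = +'' if line.include?('<Iteration>')   # start a new chunk
    buffer << line if buffer                       # collect while inside a chunk
    if buffer && line.include?('</Iteration>')
      yield buffer                                 # hand over one complete chunk
      buffer = nil
    end
  end
end

xml = <<~XML
  <BlastOutput_iterations>
  <Iteration>
    <Iteration_query-ID>lcl|1_0</Iteration_query-ID>
  </Iteration>
  <Iteration>
    <Iteration_query-ID>lcl|2_0</Iteration_query-ID>
  </Iteration>
  </BlastOutput_iterations>
XML

chunks = []
each_iteration_chunk(StringIO.new(xml)) { |chunk| chunks << chunk }

# Pull the query ID out of each chunk with a simple regex.
ids = chunks.map { |c| c[%r{<Iteration_query-ID>(.*?)</Iteration_query-ID>}, 1] }
puts ids.inspect
```

Each chunk is small enough to hand to a DOM parser on its own, which is the general approach described above.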

Next to the API, blastxmlparser comes as a command line utility, which
can be used to filter results and requires no understanding of Ruby.

== Performance

XML parsing is expensive. blastxmlparser uses the Nokogiri C, or Java, XML
parser, based on libxml2. A DOM parser is applied to subsections of the
document; tests show this is faster than a SAX parser with Ruby callbacks. To
see why libxml2-based Nokogiri is fast, see
http://www.rubyinside.com/ruby-xml-performance-benchmarks-1641.html and
http://www.xml.com/lpt/a/1703.

The parser is also designed with other optimizations, such as lazy evaluation
(objects are only created when required) and, in the future, parallelization.
When parsing a full BLAST result, usually only a few fields are used; XPath
queries fetch only those fields.
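
To make the lazy evaluation idea concrete, here is a minimal sketch in plain Ruby. The LazyHsp class is hypothetical and is not the gem's implementation: it keeps the raw element text and converts a field only on first access, caching the result.

```ruby
# Hypothetical stand-in for a parsed HSP: raw holds the XML element text,
# and each typed value is converted only on first access, then cached.
class LazyHsp
  attr_reader :conversions

  def initialize(raw)
    @raw = raw          # e.g. { 'Hsp_evalue' => '5.82208e-34' }
    @cache = {}
    @conversions = 0    # counts conversions, for illustration only
  end

  def evalue
    lazy('Hsp_evalue') { |s| Float(s) }
  end

  def bit_score
    lazy('Hsp_bit-score') { |s| Float(s) }
  end

  private

  # Convert the named field with the given block unless already cached.
  def lazy(name)
    @cache.fetch(name) do
      @conversions += 1
      @cache[name] = yield(@raw.fetch(name))
    end
  end
end

hsp = LazyHsp.new({ 'Hsp_evalue' => '5.82208e-34', 'Hsp_bit-score' => '145.205' })
hsp.evalue
hsp.evalue
puts hsp.conversions   # one conversion so far; bit_score was never touched
```

A filter that only looks at the E-value never pays for converting the other fields.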

Timings for parsing test/data/nt_example_blastn.m7 (file size 3.4 MB):

Nokogiri DOM (default)

real    0m1.259s
user    0m1.052s
sys     0m0.144s

Nokogiri split DOM

real    0m1.713s
user    0m1.444s
sys     0m0.160s

BioRuby REXML DOM parser

real    1m14.548s
user    1m13.065s
sys     0m0.472s

== Install

  gem install bio-blastxmlparser

The Nokogiri XML parser is required. To install it,
the libxml2 libraries and headers need to be installed first, for
example on Debian:

  apt-get install libxslt-dev libxml2-dev
  gem install bio-blastxmlparser

For installation on other platforms, see
http://nokogiri.org/tutorials/installing_nokogiri.html.

== API

To loop through a BLAST result:

    >> require 'bio-blastxmlparser'
    >> fn = 'test/data/nt_example_blastn.m7'
    >> n = Bio::Blast::XmlIterator.new(fn).to_enum
    >> n.each do | iter |
    >>   puts "Hits for " + iter.query_id
    >>   iter.each do | hit |
    >>     hit.each do | hsp |
    >>       print hit.hit_id, "\t", hsp.evalue, "\n" if hsp.evalue < 0.001
    >>     end
    >>   end
    >> end
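
The nested loop above can be wrapped in a small helper that returns rows instead of printing them. The following is a sketch only: significant_hits is a hypothetical name, and the Struct classes are stand-ins that mimic the methods used above (query_id, hit_id, evalue, each) so that the code runs without a BLAST file.

```ruby
# Collect [query_id, hit_id, evalue] rows for HSPs below an E-value cutoff.
# `iterations` is anything enumerable whose items behave like the objects
# shown above: iter.query_id, iter.each -> hits, hit.hit_id, hit.each -> hsps.
def significant_hits(iterations, max_evalue = 0.001)
  rows = []
  iterations.each do |iter|
    iter.each do |hit|
      hit.each do |hsp|
        rows << [iter.query_id, hit.hit_id, hsp.evalue] if hsp.evalue < max_evalue
      end
    end
  end
  rows
end

# Stand-in objects (hypothetical) so the sketch is self-contained.
Hsp = Struct.new(:evalue)
Hit = Struct.new(:hit_id, :hsps) do
  def each(&blk)
    hsps.each(&blk)
  end
end
Iter = Struct.new(:query_id, :hits) do
  def each(&blk)
    hits.each(&blk)
  end
end

hit  = Hit.new('lcl|I_74685', [Hsp.new(5.82208e-34), Hsp.new(0.5)])
iter = Iter.new('lcl|1_0', [hit])

rows = significant_hits([iter])
rows.each { |row| puts row.join("\t") }
```

With the real parser, the enumerator returned by Bio::Blast::XmlIterator.new(fn).to_enum would be passed in place of the stand-in array.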

The next example parses the XML using less memory, via the split parser:

    >> blast = XmlSplitterIterator.new(fn).to_enum
    >> iter = blast.next
    >> iter.iter_num
    => 1
    >> iter.query_id
    => "lcl|1_0"

Get the first hit

    >> hit = iter.hits.first
    >> hit.hit_num
    => 1
    >> hit.hit_id
    => "lcl|I_74685"
    >> hit.hit_def
    => "[57809 - 57666] (REVERSE SENSE) "
    >> hit.accession
    => "I_74685"
    >> hit.len
    => 144

Get the parent info

    >> hit.parent.query_id
    => "lcl|1_0"
 
Get the first Hsp

    >> hsp = hit.hsps.first
    >> hsp.hsp_num
    => 1
    >> hsp.bit_score
    => 145.205
    >> hsp.score
    => 73
    >> hsp.evalue
    => 5.82208e-34
    >> hsp.query_from
    => 28
    >> hsp.query_to
    => 100
    >> hsp.query_frame
    => 1
    >> hsp.hit_frame
    => 1
    >> hsp.identity
    => 73
    >> hsp.positive
    => 73
    >> hsp.align_len
    => 73
    >> hsp.qseq
    => "AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG"
    >> hsp.hseq
    => "AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCTGCCTGCCAACCTATATGCTCCTGTGTTTAG"
    >> hsp.midline
    => "|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||"

It is also possible to use the XML element names instead of methods, e.g.

    >> hsp.field("Hsp_bit-score")
    => "145.205"
    >> hsp["Hsp_bit-score"]
    => "145.205"

Note that these are always String values.
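
Because the values are Strings, convert them before comparing numerically. In the sketch below a plain Hash stands in for the hsp object (an assumption made so the example runs on its own); the keys are the XML element names used above.

```ruby
# A plain Hash stands in for the hsp object here; it is indexed with
# the XML element names in the same way as hsp[...] above.
hsp = { 'Hsp_bit-score' => '145.205', 'Hsp_evalue' => '5.82208e-34' }

bit_score = Float(hsp['Hsp_bit-score'])   # 145.205
truncated = hsp['Hsp_bit-score'].to_i     # 145, the fraction is dropped
evalue    = hsp['Hsp_evalue'].to_f        # 5.82208e-34

puts bit_score > 145     # true
puts truncated > 145     # false, because to_i truncated the value
puts '145.205' > '20'    # false: Strings compare character by character
```

Float() raises an ArgumentError on malformed input, whereas to_f silently returns 0.0, so Float() is the stricter choice when the field is expected to be numeric.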

Fetch the next result (Iteration)

    >> iter2 = blast.next
    >> iter2.iter_num
    => 2
    >> iter2.query_id
    => "lcl|2_0"

etc. etc.

For more examples, see the files in ./spec.

== Usage

  blastxmlparser [options] file(s)

    -p, --parser name                Use full|split parser (default full)
    -n, --named fields               Set named fields
    -e, --exec filter                Execute filter

        --logger filename            Log to file (default stderr)
        --trace options              Set log level (default INFO, see bio-logger)
    -q, --quiet                      Run quietly
    -v, --verbose                    Run verbosely
        --debug                      Show debug messages
    -h, --help                       Show help and examples

  blastxmlparser filename(s)

    Use --help switch for more information

== Examples

Print result fields of iterations containing 'lcl', using a regex

  blastxmlparser -e 'iter.query_id=~/lcl/' test/data/nt_example_blastn.m7

Print fields where bit_score > 145

  blastxmlparser -e 'hsp.bit_score>145' test/data/nt_example_blastn.m7

It is also possible to use the XML element names directly

  blastxmlparser -e 'hsp["Hsp_bit-score"].to_f>145' test/data/nt_example_blastn.m7

Print named fields where E-value < 0.01 and hit length > 100

  blastxmlparser -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7

  1       5.82208e-34     AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
  2       5.82208e-34     AGTGAAGCTTCTAGATATTTGGCGGGTACCTCTAATTTTGCCT...
  3       2.76378e-11     AATATGGTAGCTACAGAAACGGTAGTACACTCTTC     
  4       1.13373e-13     CTAAACACAGGAGCATATAGGTTGGCAGGCAGGCAAAAT 
  5       2.76378e-11     GAAGAGTGTACTACCGTTTCTGTAGCTACCATATT     
  etc. etc.

To use the low-memory (split) parser:

  blastxmlparser --parser split -n 'hsp.evalue,hsp.qseq' -e 'hsp.evalue<0.01 and hit.len>100' test/data/nt_example_blastn.m7

== URL

The project lives at http://github.com/pjotrp/blastxmlparser. If you use this software, please cite http://dx.doi.org/10.1093/bioinformatics/btq475

== Copyright

Copyright (c) 2011 Pjotr Prins under the MIT licence.  See LICENSE.txt and http://www.opensource.org/licenses/mit-license.html for further details.