README.md in bio-vcf-0.8.2 vs README.md in bio-vcf-0.9.0

- old
+ new

@@ -1,51 +1,71 @@
 # bio-vcf [![Build Status](https://secure.travis-ci.org/pjotrp/bioruby-vcf.png)](http://travis-ci.org/pjotrp/bioruby-vcf)
 
-A new generation VCF parser. Bio-vcf is not only fast for genome-wide
-(WGS) data, it also comes with a really nice filtering, evaluation and
-rewrite language and it can output any type of textual data, including
-VCF header and contents in RDF and JSON.
+## Updates
+* The outputter now writes (properly) in parallel with the parser
+* bio-vcf turns any VCF into JSON with header information, and
+ allows you to pipe that JSON directly into any JSON supporting
+ language, including Python and Javascript!
+
+## Bio-vcf
+
+Bio-vcf is a new generation VCF parser, filter and converter. Bio-vcf is not only
+very fast for genome-wide (WGS) data, it also comes with a really nice
+filtering, evaluation and rewrite language and it can output any type
+of textual data, including VCF header and contents in RDF and JSON.
+
 So, why would you use bio-vcf over other parsers? Because
 
 1. Bio-vcf is fast and scales on multi-core computers
 2. Bio-vcf has an expressive filtering and evaluation language
 3. Bio-vcf has great multi-sample support
 4. Bio-vcf has multiple global filters and sample filters
 5. Bio-vcf can access any VCF format
-6. Bio-vcf can do calculations on fields
-7. Bio-vcf allows for genotype processing
-8. Bio-vcf has support for set analysis
-9. Bio-vcf has sane error handling
-10. Bio-vcf can convert *any* VCF to *any* output, including tabular data, HTML, LaTeX, RDF, JSON and JSON-LD and even other VCFs by using (erb) templates
+6. Bio-vcf can parse and query the VCF header (META)
+7. Bio-vcf can do calculations on fields
+8. Bio-vcf allows for genotype processing
+9. Bio-vcf has support for set analysis
+10. Bio-vcf has sane error handling
+11. Bio-vcf can convert *any* VCF to *any* output, including tabular data, BED, HTML, LaTeX, RDF, JSON and JSON-LD and even other VCFs by using (erb) templates
 
 Bio-vcf has better performance than other tools because of lazy parsing, multi-threading, and useful combinations of
-(fancy) command line filtering. For example on an 2 core machine
-bio-vcf is typically 50% faster than JVM based SnpSift. Adding
+(fancy) command line filtering (who says Ruby is slow?). Adding
 cores, bio-vcf just does better. The more complicated the filters,
-the larger the gain.
 First the base line test to show IO performance
 
 ```sh
- time ./bin/bio-vcf -iv --num-threads 8 --filter 'r.info.cp>0.3' < ESP6500SI_V2_SSA137.vcf > test1.vcf
- real 0m21.095s
- user 1m41.101s
- sys 0m7.852s
+ time cat ESP6500SI-V2-SSA137.GRCh38-liftover.*.vcf|wc
+ 1987143 15897724 1003214613
+ real 0m7.823s
+ user 0m7.002s
+ sys 0m2.972s
 ```
 
-while parsing with SnpSift takes
+Next run the 1Gb data with bio-vcf effectively using 5 cores on AMD Opteron(tm) Processor 6174 using Linux
 
 ```sh
- time cat ESP6500SI_V2_SSA137.vcf |java -jar snpEff/SnpSift.jar filter "( CP>0.3 )" > test.vcf
- real 1m4.913s
- user 0m58.071s
- sys 0m7.982s
+ time cat ESP6500SI-V2-SSA137.GRCh38-liftover.*.vcf|./bin/bio-vcf -iv --num-threads 8 --filter 'r.info.cp.to_f>0.3' > /dev/null
+ real 0m32.491s
+ user 2m34.767s
+ sys 0m12.733s
 ```
 
-Bio-vcf is perfect for parsing large data files. Parsing a 650 Mb GATK
+The same with SnpSift v4.0 takes
+
+```sh
+time cat ESP6500SI-V2-SSA137.GRCh38-liftover.*.vcf|java -jar snpEff/SnpSift.jar filter "( CP>0.3 )" > /dev/null
+real 12m36.121s
+user 12m53.273s
+sys 0m9.913s
+```
+
+This means that on this machine bio-vcf is 24x faster than SnpSift even for a simple filter.
+In fact, bio-vcf is perfect for complex filters and parsing large data files on powerful machines. Parsing a 650 Mb GATK
 Illumina Hiseq VCF file and evaluating the results into a BED format on a 16 core machine takes
 
 ```sh
 time bio-vcf --num-threads 16 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50' --sfilter '!s.empty? and s.dp>20' --eval '[r.chrom,r.pos,r.pos+1]' < test.large2.vcf > test.out.3
@@ -70,17 +90,23 @@
 Ruby), an embedded Ragel parser for INFO and FORMAT header definitions, as well as primitives for set analysis. Few assumptions are made about the actual contents of the VCF file (field names are resolved on the fly), so bio-vcf should work with all VCF files.
 
-To fetch all entries where all samples have depth larger than 20 use
-a sample filter
+To fetch all entries where all samples have depth larger than 20 and
+filter set to PASS use a sample filter
 
 ```ruby
- bio-vcf --sfilter 'sample.dp>20' < file.vcf
+ bio-vcf --sfilter 'sample.dp>20 and rec.filter=="PASS"' < file.vcf
 ```
 
+or with a regex
+
+```ruby
+ bio-vcf --sfilter 'sample.dp>20 and rec.filter !~ /LowQD/' < file.vcf
+```
+
 To only filter on some samples number 0 and 3:
 
 ```ruby
 bio-vcf --sfilter-samples 0,3 --sfilter 's.dp>20' < file.vcf
 ```
@@ -263,10 +289,16 @@
 ```ruby
 bio-vcf -q --eval-once 'header.samples.join(",")' < file.vcf
 NORMAL,TUMOR
 ```
 
+Get information from the header (META)
+
+```ruby
+ bio-vcf -q --skip-header --eval-once 'header.meta["GATKCommandLine"]' < gatk_exome.vcf
+```
+
 The 'fields' array contains unprocessed data (strings). Print first five raw fields
 
 ```ruby
 bio-vcf --eval 'fields[0..4]' < file.vcf
@@ -302,12 +334,15 @@
 ```ruby
 bio-vcf --filter 'rec.info.dp>100 and rec.info.readposranksum<=0.815' < file.vcf
 ```
 
-With subfields defined by rec.format
+(alternatively you can use the indexed rec.info['DP'] and list INFO fields with
+rec.info.fields).
 
+Subfields defined by rec.format:
+
 ```ruby
 bio-vcf --filter 'rec.tumor.ss != 2' < file.vcf
 ```
 
 Output
@@ -693,11 +728,11 @@
 "seq:chr": "<%= rec.chrom %>" ,
 "seq:pos": <%= rec.pos %> ,
 "seq:ref": "<%= rec.ref %>" ,
 "seq:alt": "<%= rec.alt[0] %>" ,
 "seq:maf": <%= rec.info.maf[0] %> ,
- "dp": <%= rec.info.dp %> ,
+ "dp": <%= rec.info.dp %>
 };
 ```
 
 To get JSON, run with something like (combining with a filter)
@@ -713,11 +748,11 @@
 "seq:chr": "13" ,
 "seq:pos": 35745475 ,
 "seq:ref": "C" ,
 "seq:alt": "T" ,
 "seq:maf": 0.0151 ,
- "dp": 86 ,
+ "dp": 86
 };
 ```
 
 Likewise for RDF output:
@@ -765,28 +800,27 @@
 can be
 
 ```Javascript
 =HEADER
 <% require 'json' %>
-[
- { "HEADER": {
+{ "HEADER": {
 "options": <%= options.to_h.to_json %>,
 "files": <%= ARGV %>,
 "version": "<%= BIOVCF_VERSION %>"
 },
-
+ "BODY":[
 =BODY
-
-{
- "seq:chr": "<%= rec.chrom %>" ,
- "seq:pos": <%= rec.pos %> ,
- "seq:ref": "<%= rec.ref %>" ,
- "seq:alt": "<%= rec.alt[0] %>" ,
- "dp": <%= rec.info.dp %> ,
-},
+ {
+ "seq:chr": "<%= rec.chrom %>" ,
+ "seq:pos": <%= rec.pos %> ,
+ "seq:ref": "<%= rec.ref %>" ,
+ "seq:alt": "<%= rec.alt[0] %>" ,
+ "dp": <%= rec.info.dp %>
+ },
 =FOOTER
-]
+ ]
+}
 ```
 
 with
 
 ```sh
@@ -794,30 +828,31 @@
 ```
 
 may generate something like
 
 ```Javascript
-[
- { "HEADER": {
+{ "HEADER": {
 "options": {"show_help":false,"source":"https://github.com/CuppenResearch/bioruby-vcf","version":"0.8.1-pre3 (Pjotr Prins)","date":"2014-11-26 12:51:36 +0000","thread_lines":40000,"template":"template/vcf2json.erb","skip_header":true},
 "files": [],
 "version": "0.8.1-pre3"
 },
-{
- "seq:chr": "1" ,
- "seq:pos": 883516 ,
- "seq:ref": "G" ,
- "seq:alt": "A" ,
- "dp": ,
-},
-{
- "seq:chr": "1" ,
- "seq:pos": 891344 ,
- "seq:ref": "G" ,
- "seq:alt": "A" ,
- "dp": ,
-},
-]
+ "BODY":[
+ {
+ "seq:chr": "1" ,
+ "seq:pos": 883516 ,
+ "seq:ref": "G" ,
+ "seq:alt": "A" ,
+ "dp":
+ },
+ {
+ "seq:chr": "1" ,
+ "seq:pos": 891344 ,
+ "seq:ref": "G" ,
+ "seq:alt": "A" ,
+ "dp": ,
+ },
+ ]
+}
 ```
 
 Note that the template is not smart enough to remove the final comma from the last BODY element. To make it valid JSON that needs to be removed. A future version may add a parameter to the BODY element or a