README.md in bio-vcf-0.8.2 vs README.md in bio-vcf-0.9.0

- old
+ new

@@ -1,51 +1,71 @@
 # bio-vcf
 
 [![Build Status](https://secure.travis-ci.org/pjotrp/bioruby-vcf.png)](http://travis-ci.org/pjotrp/bioruby-vcf) 
 
-A new generation VCF parser. Bio-vcf is not only fast for genome-wide
-(WGS) data, it also comes with a really nice filtering, evaluation and
-rewrite language and it can output any type of textual data, including
-VCF header and contents in RDF and JSON.
+## Updates
 
+* The outputter now writes (properly) in parallel with the parser
+* bio-vcf turns any VCF into JSON with header information, and
+  allows you to pipe that JSON directly into any language that
+  supports JSON, including Python and JavaScript!
+
+## Bio-vcf
+
+Bio-vcf is a new generation VCF parser, filter and converter. Bio-vcf is not only
+very fast for genome-wide (WGS) data, it also comes with a really nice
+filtering, evaluation and rewrite language and it can output any type
+of textual data, including VCF header and contents in RDF and JSON.
+
 So, why would you use bio-vcf over other parsers? Because
 
 1. Bio-vcf is fast and scales on multi-core computers
 2. Bio-vcf has an expressive filtering and evaluation language
 3. Bio-vcf has great multi-sample support
 4. Bio-vcf has multiple global filters and sample filters
 5. Bio-vcf can access any VCF format
-6. Bio-vcf can do calculations on fields
-7. Bio-vcf allows for genotype processing
-8. Bio-vcf has support for set analysis
-9. Bio-vcf has sane error handling
-10. Bio-vcf can convert *any* VCF to *any* output, including tabular data, HTML, LaTeX, RDF, JSON and JSON-LD and even other VCFs by using (erb) templates
+6. Bio-vcf can parse and query the VCF header (META)
+7. Bio-vcf can do calculations on fields
+8. Bio-vcf allows for genotype processing
+9. Bio-vcf has support for set analysis
+10. Bio-vcf has sane error handling
+11. Bio-vcf can convert *any* VCF to *any* output, including tabular data, BED, HTML, LaTeX, RDF, JSON and JSON-LD and even other VCFs by using (erb) templates
 
 Bio-vcf has better performance than other tools
 because of lazy parsing, multi-threading, and useful combinations of
-(fancy) command line filtering. For example on an 2 core machine
-bio-vcf is typically 50% faster than JVM based SnpSift. Adding
+(fancy) command line filtering (who says Ruby is slow?). Adding
 cores, bio-vcf just does better. The more complicated the filters,
-the larger the gain.
+the larger the gain. First, a baseline test to show I/O performance:
 
 ```sh
-  time ./bin/bio-vcf -iv --num-threads 8 --filter 'r.info.cp>0.3' < ESP6500SI_V2_SSA137.vcf > test1.vcf
-  real    0m21.095s
-  user    1m41.101s
-  sys     0m7.852s
+  time cat ESP6500SI-V2-SSA137.GRCh38-liftover.*.vcf|wc
+  1987143 15897724 1003214613
+  real    0m7.823s
+  user    0m7.002s
+  sys     0m2.972s
 ```
 
-while parsing with SnpSift takes
+Next, run the 1 GB of data through bio-vcf, effectively using 5 cores on an AMD Opteron(tm) Processor 6174 running Linux:
 
 ```sh
-  time cat ESP6500SI_V2_SSA137.vcf |java -jar snpEff/SnpSift.jar filter "( CP>0.3 )" > test.vcf
-  real    1m4.913s
-  user    0m58.071s
-  sys     0m7.982s
+  time cat ESP6500SI-V2-SSA137.GRCh38-liftover.*.vcf|./bin/bio-vcf -iv --num-threads 8 --filter 'r.info.cp.to_f>0.3' > /dev/null
+  real    0m32.491s
+  user    2m34.767s
+  sys     0m12.733s
 ```
 
-Bio-vcf is perfect for parsing large data files. Parsing a 650 Mb GATK
+The same with SnpSift v4.0 takes
+
+```sh
+time cat ESP6500SI-V2-SSA137.GRCh38-liftover.*.vcf|java -jar snpEff/SnpSift.jar filter "( CP>0.3 )" > /dev/null
+real    12m36.121s
+user    12m53.273s
+sys     0m9.913s
+```
+
+This means that on this machine bio-vcf is about 23x faster than SnpSift (12m36s versus 32.5s wall-clock time), even for a simple filter.
+In fact, bio-vcf is perfect for complex filters and parsing large data files on powerful machines. Parsing a 650 MB GATK
 Illumina Hiseq VCF file and evaluating the results into a BED format on
 a 16 core machine takes
 
 ```sh
   time bio-vcf --num-threads 16 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50' --sfilter '!s.empty? and s.dp>20' --eval '[r.chrom,r.pos,r.pos+1]' < test.large2.vcf > test.out.3
@@ -70,17 +90,23 @@
 Ruby), an embedded Ragel parser for INFO and FORMAT header definitions, as well as primitives for set analysis. Few
 assumptions are made about the actual contents of the VCF file (field
 names are resolved on the fly), so bio-vcf should work with
 all VCF files.
 
-To fetch all entries where all samples have depth larger than 20 use
-a sample filter
+To fetch all entries where all samples have a depth larger than 20 and
+the filter is set to PASS, use a sample filter
 
 ```ruby
-  bio-vcf --sfilter 'sample.dp>20' < file.vcf
+  bio-vcf --sfilter 'sample.dp>20 and rec.filter=="PASS"' < file.vcf
 ```
 
+or with a regex
+
+```ruby
+  bio-vcf --sfilter 'sample.dp>20 and rec.filter !~ /LowQD/' < file.vcf
+```
+
 To filter on only some samples, here numbers 0 and 3:
 
 ```ruby
   bio-vcf --sfilter-samples 0,3 --sfilter 's.dp>20' < file.vcf
 ```
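The sample-filter semantics described above (an entry is kept only when every checked sample passes, optionally restricted to a list of sample indices) can be illustrated in plain Ruby. This is a minimal sketch and not bio-vcf's implementation: the `Sample` struct, the `record_passes?` helper and the depth values are hypothetical and exist only for illustration.

```ruby
# Hypothetical stand-in for one sample column of a VCF record.
Sample = Struct.new(:name, :dp)

# Returns true when every selected sample satisfies the given block.
# Passing nil for `indices` means "check all samples".
def record_passes?(samples, indices = nil)
  selected = indices ? samples.values_at(*indices) : samples
  selected.all? { |s| yield s }
end

# Made-up depths for four samples of a single record.
samples = [
  Sample.new("s0", 35),
  Sample.new("s1", 12),
  Sample.new("s2", 40),
  Sample.new("s3", 28)
]

# Comparable to --sfilter 's.dp>20': sample 1 fails, so the record is dropped.
all_ok = record_passes?(samples) { |s| s.dp > 20 }

# Comparable to --sfilter-samples 0,3: only samples 0 and 3 are checked.
some_ok = record_passes?(samples, [0, 3]) { |s| s.dp > 20 }

puts all_ok
puts some_ok
```

With these made-up depths the record is rejected when all samples are checked and kept when only samples 0 and 3 are checked.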
@@ -263,10 +289,16 @@
 ```ruby
   bio-vcf -q --eval-once 'header.samples.join(",")' < file.vcf
   NORMAL,TUMOR
 ```
 
+Get information from the header (META):
+
+```ruby
+  bio-vcf -q --skip-header --eval-once 'header.meta["GATKCommandLine"]' < gatk_exome.vcf
+```
+
 The 'fields' array contains unprocessed data (strings).  Print first
 five raw fields
 
 ```ruby
   bio-vcf --eval 'fields[0..4]' < file.vcf 
@@ -302,12 +334,15 @@
 
 ```ruby
   bio-vcf --filter 'rec.info.dp>100 and rec.info.readposranksum<=0.815' < file.vcf 
 ```
 
-With subfields defined by rec.format
+(Alternatively, you can use the indexed form rec.info['DP'] and list the INFO fields with
+rec.info.fields.)
 
+Subfields defined by rec.format:
+
 ```ruby
   bio-vcf --filter 'rec.tumor.ss != 2' < file.vcf 
 ```
 
 Output
@@ -693,11 +728,11 @@
   "seq:chr": "<%= rec.chrom %>" ,
   "seq:pos": <%= rec.pos %> ,
   "seq:ref": "<%= rec.ref %>" ,
   "seq:alt": "<%= rec.alt[0] %>" ,
   "seq:maf": <%= rec.info.maf[0] %> ,
-  "dp":      <%= rec.info.dp %> ,
+  "dp":      <%= rec.info.dp %>
 };
 ```
 
 To get JSON, run with something like (combining 
 with a filter)
@@ -713,11 +748,11 @@
   "seq:chr": "13" ,
   "seq:pos": 35745475 ,
   "seq:ref": "C" ,
   "seq:alt": "T" ,
   "seq:maf": 0.0151 ,
-  "dp":      86 ,
+  "dp":      86
 };
 ```
 
 Likewise for RDF output:
 
@@ -765,28 +800,27 @@
 can be
 
 ```Javascript
 =HEADER
 <% require 'json' %>
-[
-  { "HEADER": {
+{ "HEADER": {
     "options":  <%= options.to_h.to_json %>,
     "files":    <%= ARGV %>,
     "version":  "<%= BIOVCF_VERSION %>"
   },
-
+  "BODY":[
 =BODY
-
-{
-  "seq:chr": "<%= rec.chrom %>" ,
-  "seq:pos": <%= rec.pos %> ,
-  "seq:ref": "<%= rec.ref %>" ,
-  "seq:alt": "<%= rec.alt[0] %>" ,
-  "dp":      <%= rec.info.dp %> ,
-},
+    {
+      "seq:chr": "<%= rec.chrom %>" ,
+      "seq:pos": <%= rec.pos %> ,
+      "seq:ref": "<%= rec.ref %>" ,
+      "seq:alt": "<%= rec.alt[0] %>" ,
+      "dp":      <%= rec.info.dp %>
+    },
 =FOOTER
-]
+  ]
+}
 ```
 
 with
 
 ```sh
@@ -794,30 +828,31 @@
 ```
 
 may generate something like
 
 ```Javascript
-[
-  { "HEADER": {
+{ "HEADER": {
     "options":  {"show_help":false,"source":"https://github.com/CuppenResearch/bioruby-vcf","version":"0.8.1-pre3 (Pjotr Prins)","date":"2014-11-26 12:51:36 +0000","thread_lines":40000,"template":"template/vcf2json.erb","skip_header":true},
     "files":    [],
     "version":  "0.8.1-pre3"
   },
-{
-  "seq:chr": "1" ,
-  "seq:pos": 883516 ,
-  "seq:ref": "G" ,
-  "seq:alt": "A" ,
-  "dp":       ,
-},
-{
-  "seq:chr": "1" ,
-  "seq:pos": 891344 ,
-  "seq:ref": "G" ,
-  "seq:alt": "A" ,
-  "dp":       ,
-},
-]
+  "BODY":[
+    {
+      "seq:chr": "1" ,
+      "seq:pos": 883516 ,
+      "seq:ref": "G" ,
+      "seq:alt": "A" ,
+      "dp":
+    },
+    {
+      "seq:chr": "1" ,
+      "seq:pos": 891344 ,
+      "seq:ref": "G" ,
+      "seq:alt": "A" ,
+      "dp":
+    },
+  ]
+}
 ```
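
The generated output above leaves a comma after the last BODY element, which strict JSON parsers reject. One possible workaround is to post-process the text before parsing. The following is a minimal Ruby sketch, not part of bio-vcf: the `strip_trailing_commas` helper is hypothetical, and the sample document (including its version string and `dp` values) is made up for illustration.

```ruby
require 'json'

# Hypothetical helper: drop any comma that is followed, ignoring
# whitespace, by a closing ] or }.
# Caveat: a string value that itself contains ",]" or ",}" would
# also be altered by this simple regex.
def strip_trailing_commas(text)
  text.gsub(/,(\s*[\]}])/, '\1')
end

# Made-up output in the shape of the template above, with the
# trailing comma after the last BODY element.
raw = <<~JSON
  { "HEADER": { "version": "0.9.0" },
    "BODY":[
      { "seq:chr": "1", "seq:pos": 883516, "dp": 86 },
      { "seq:chr": "1", "seq:pos": 891344, "dp": 40 },
    ]
  }
JSON

doc = JSON.parse(strip_trailing_commas(raw))
puts doc["BODY"].size
puts doc["BODY"].last["seq:pos"]
```

This only addresses the trailing comma. Records where a field such as `dp` renders as empty, as in the output above, would still need a default value in the template before the result parses as JSON.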
 
 Note that the template is not smart enough to remove the final comma
 from the last BODY element. To make it valid JSON that needs to be
 removed. A future version may add a parameter to the BODY element or a