README.md in bio-vcf-0.7.0 vs README.md in bio-vcf-0.7.3

- old
+ new

@@ -1,26 +1,39 @@
 # bio-vcf
 
 [![Build Status](https://secure.travis-ci.org/pjotrp/bioruby-vcf.png)](http://travis-ci.org/pjotrp/bioruby-vcf) 
 
-Yet another VCF parser. Bio-vcf is not only fast for genome-wide data,
-it also comes with a really nice filtering, evaluation and rewrite
-language. Bio-vcf has better performance than other tools
+A new generation VCF parser. Bio-vcf is not only fast for genome-wide
+(WGS) data, it also comes with a really nice filtering, evaluation and
+rewrite language. Why would you use bio-vcf over other parsers?
+
+1. Bio-vcf is fast and scales on multi-core computers
+2. Bio-vcf has an expressive filtering and evaluation language
+3. Bio-vcf has great multi-sample support
+4. Bio-vcf has multiple global filters and sample filters
+5. Bio-vcf can access any VCF format
+6. Bio-vcf can do calculations on fields
+7. Bio-vcf allows for genotype processing
+8. Bio-vcf has support for set analysis
+9. Bio-vcf has sane error handling
+10. Bio-vcf can output tabular data, HTML, LaTeX, RDF and (soon) JSON
+
+Bio-vcf has better performance than other tools
 because of lazy parsing, multi-threading, and useful combinations of
-(fancy) command line filtering. For example on an 2 core machine 
-bio-vcf is 50% faster than SnpSift. On an 8 core machine bio-vcf is
-3x faster than SnpSift. Parsing a 1 Gb ESP VCF with 8 cores with
-bio-vcf takes
+(fancy) command line filtering. For example on an 2 core machine
+bio-vcf is typically 50% faster than JVM based SnpSift. On an 8 core machine
+bio-vcf is at least 3x faster than SnpSift. Parsing a 1 Gb ESP
+VCF with 8 cores with bio-vcf takes
 
 ```sh
   time ./bin/bio-vcf -iv --num-threads 8 --filter 'r.info.cp>0.3' < ESP6500SI_V2_SSA137.vcf > test1.vcf
   real    0m21.095s
   user    1m41.101s
   sys     0m7.852s
 ```
 
-and parsing with SnpSift takes
+while parsing with SnpSift takes
 
 ```sh
   time cat ESP6500SI_V2_SSA137.vcf |java -jar snpEff/SnpSift.jar filter "( CP>0.3 )" > test.vcf
   real    1m4.913s
   user    0m58.071s
@@ -30,26 +43,26 @@
 Bio-vcf is perfect for parsing large data files. Parsing a 650 Mb GATK
 Illumina Hiseq VCF file and evaluating the results into a BED format on
 a 16 core machine takes
 
 ```sh
-  time bio-vcf --num-threads 36 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50' --sfilter '!s.empty? and s.dp>20' --eval '[r.chrom,r.pos,r.pos+1]' < test.large2.vcf > test.out.3
+  time bio-vcf --num-threads 16 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50' --sfilter '!s.empty? and s.dp>20' --eval '[r.chrom,r.pos,r.pos+1]' < test.large2.vcf > test.out.3
   real    0m47.612s
   user    8m18.234s
   sys     0m5.039s
 ```
 
-which shows some pretty decent core utilisation (10x). 
+which shows pretty decent core utilisation (10x). We are running 
+gzip compressed VCF files of 30+ Gb with similar performance gains.
 
 Use zcat to
-pipe gzipped (vcf.gz) files into bio-vcf, e.g.
+pipe such gzipped (vcf.gz) files into bio-vcf, e.g.
 
 ```sh
   zcat huge_file.vcf.gz| bio-vcf --num-threads 36 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50'
     --sfilter '!s.empty? and s.dp>20' 
     --eval '[r.chrom,r.pos,r.pos+1]' > test.bed
-
 ```
 
 bio-vcf comes with a sensible parser definition language (it is 100%
 Ruby), as well as primitives for set analysis. Few
 assumptions are made about the actual contents of the VCF file (field
@@ -182,10 +195,17 @@
 Most filter and eval commands can be used at the same time. Special set
 commands exit for filtering and eval. When a set is defined, based on
 the sample name, you can apply filters on the samples inside the set,
 outside the set and over all samples. E.g.
 
+So, why would you use bio-vcf instead of rolling out your own
+Perl/Python/other ad-hoc script? I think the reason should be that
+there is less chance of mistakes because of Bio-vcf's clear filtering
+language and sensible built-in validation. The second reason would be
+speed. Bio-vcf's multi-threading capability gives it great and hard to
+replicate performance.
+
 Also note you can use
 [bio-table](https://github.com/pjotrp/bioruby-table) to
 filter/transform data further and convert to other formats, such as
 RDF.
 
@@ -200,11 +220,11 @@
 the source code. It is not hard to add features. Otherwise, send a short
 example of a VCF statement you need to work on.
 
 ## Installation
 
-Note that you need Ruby 1.9.3 or later. The 2.x Ruby series also give
+Note that you need Ruby 2.x or later. The 2.x Ruby series also give
 a performance improvement. Bio-vcf will show the Ruby version when
 typing the command 'bio-vcf -h'.
 
 To intall bio-vcf with gem:
 
@@ -369,20 +389,21 @@
 ```
 
 Note that only valid method names in lower case get picked up this
 way. Also by convention normal is sample 1 and tumor is sample 2.
 
-Even shorter r is an alias for rec (nyi) 
+Even shorter r is an alias for rec
 
 ```sh
   bio-vcf --eval "r.original.gt" < file.vcf
   bio-vcf --eval "r.s3t2.dp" < file.vcf
 ```
 
 ## Special functions
 
-Note: special functions are not yet implemented!
+Note: special functions are not yet implemented! Look below
+for genotype processing which has indexing in 'gti'.
 
 Sometime you want to use a special function in a filter. For 
 example percentage variant reads can be defined as [a,c,g,t] 
 with frequencies against sample read depth (dp) as 
 [0,0.03,0.47,0.50]. Filtering would with a special function, 
@@ -438,27 +459,32 @@
 bio-vcf allows for set analysis. With the complement filter, for
 example, samples are selected that evaluate to true, all others should
 evaluate to false. For this we create three filters, one for all 
 samples that are included (the --ifilter or -if), for all samples that
 are excluded (the --efilter or -ef) and for any sample (the --sfilter
-or -sf). So i=include, e=exclude and s=any sample. 
+or -sf). So i=include (OR filter), e=exclude and s=any sample (AND
+filter). 
 
 The equivalent of the union filter is by using the --sfilter, so
 
 ```sh
   bio-vcf --sfilter 's.dp>20' 
 ```
 
-Filters DP on all samples. To filter on a subset you can add a
+Filters DP on all samples and is true if all samples match the
+criterium (AND). To filter on a subset you can add a
 selector
 
 ```sh
   bio-vcf --sfilter-samples 0,1,4 --sfilter 's.dp>20' 
 ```
 
-For set analysis there are the additional ifilter (include) and efilter (exclude). To filter
-on samples 0,1,4 and output the gq values
+For set analysis there are the additional ifilter (include) and
+efilter (exclude).  Where sfilter represents an ALL match, the ifilter
+represents an ANY match, i.e., it is true if one of the samples
+matches the criterium (OR). To filter on samples 0,1,4 and output the gq
+values
 
 ```sh
   bio-vcf -i --ifilter-samples 0,1,4 --ifilter 's.gq<10 or s.gq==99' --seval s.gq
     1       14907   99      99      99      99      99      99      99
     1       14930   99      99      99      99      99      99      99
@@ -492,12 +518,14 @@
 ```sh
   bio-vcf -i --ifilter-samples 0,1,4 --ifilter 's.gt==rec.s1t1.gt and s.gq>10' --seval s.gq --efilter 's.gq==99' 
 ```
 
 Etc. etc. Any combination of sfilter, ifilter and efilter is possible.
+Currently the efilter is an ALL filter (AND), i.e. all excluded
+samples need to match the criterium.
 
-The following are not yet implemented:
+The following regular expression matches are not yet implemented:
 
 In the near future it is also possible to select samples on a regex (here
 select all samples where the name starts with s3)
 
 ```sh
@@ -558,35 +586,43 @@
     1       10694                   G       G
     1       12783   G       G       G       G       G       G       G
     1       15274   G       G       G       G       G       G       G
 ```
 
+where gts represents the indexed genotype on [ref] + [alt].
+
 These values can also be used in filters and output allele depth, for
 example
 
 ```ruby
   bio-vcf -vi --ifilter 'rec.original.gt!="0/1"' --efilter 'rec.original.gt=="0/0"' --seval 'rec.original.ad[s.gti[1]]'
     1       10257   151     151     151     151     151     8       151
     1       13302   26      10      10      10      10      10      10
     1       13757   47      47      4       47      47      4       47
 ```
 
-The following does not yet work (using the gti in a sample directly)
+You can use the genotype index gti to fetch values from, for example,
+allele depth:
 
 ```ruby
 bio-vcf -vi --ifilter 'rec.original.gt!="0/1"' --efilter 'rec.original.gti[0]==0' --seval 'rec.original.ad[s.gti[1]]'
+
+1       10257   151     151     151     151     151     8       151
+1       13302   26      10      10      10      10      10      10
+1       13757   47      47      4       47      47      4       47
 ```
 
+
 ## Modify VCF files
 
 Add or modify the sample file name in the INFO fields:
 
 ```sh
   bio-vcf --rewrite 'rec.info["sample"]="mytest"' < mytest.vcf
 ```
 
-To remove/select 3 samples and create a new file:
+To remove/select 3 samples:
 
 ```sh
   bio-vcf --samples 0,1,3 < mytest.vcf
 ```
 
@@ -612,15 +648,54 @@
 ```
 
 Similarly for GoNL
 
 ```ruby
-bio-vcf --id gonl --rdf --tags '{"db:evs" => true, "seq:freq" => rec.info.af }' < GoNL.vcf
+bio-vcf --id gonl --rdf --tags '{"db:gonl" => true, "seq:freq" => rec.info.af }' < GoNL.vcf
 ```
 
+or without AF
+
+
+```ruby
+bio-vcf --id gonl --rdf --tags '{"db:gonl" => true, "seq:freq" => (rec.info.ac.to_f/rec.info.an).round(2) }' < gonl_germline_overlap_r4.vcf
+```
+
+
+
 Also check out [bio-table](https://github.com/pjotrp/bioruby-table) to convert tabular data to RDF.
 
+## Statistics
+
+Simple statistics are available for REF>ALT changes:
+
+```sh
+./bin/bio-vcf -v --statistics < test/data/input/dbsnp.vcf
+```
+
+    ## ==== Statistics ==================================
+      G>A             59      45%
+      C>T             30      23%
+      A>G              5       4%
+      C>G              5       4% 
+      C>A              5       4%
+      G>T              4       3%
+      T>C              4       3%
+      G>C              4       3%
+      T>A              3       2%
+      A>C              3       2%
+      A>T              2       2%
+      GTCCGACCGCTCC>G  1       1%
+      CGACCGCTCC>C     1       1%
+      T>TGGAGC         1       1%
+      C>CGTCTTCA       1       1%
+      TG>T             1       1%
+      AC>A             1       1%
+
+      Total          130
+    ## ==================================================
+
 ## Other examples
 
 For more examples see the feature [section](https://github.com/pjotrp/bioruby-vcf/tree/master/features).
 
 ## API
@@ -651,9 +726,12 @@
     #
   end
 ```
 
 ## Trouble shooting
+
+Note that Ruby 2.x is required for Bio-vcf. JRuby works, but only
+in single threaded mode (for now).
 
 The multi-threading creates temporary files using the system TMPDIR.
 This behaviour can be overridden by setting the environment variable.
 Also, for genome-wide sequencing it may be useful to increase
 --thread-lines to a value larger than 1_000_000.