README.md in bio-vcf-0.9.0 vs README.md in bio-vcf-0.9.2

- old
+ new

@@ -2,10 +2,13 @@
 
 [![Build Status](https://secure.travis-ci.org/pjotrp/bioruby-vcf.png)](http://travis-ci.org/pjotrp/bioruby-vcf) 
 
 ## Updates
 
+* Getting ready for a 1.0 release
+* 0.9.1 removed a rare threading bug and cleanup on error
+* Added support for soft filters (request by Brad Chapman)
 * The outputter now writes (properly) in parallel with the parser
 * bio-vcf turns any VCF into JSON with header information, and
   allows you to pipe that JSON directly into any JSON supporting
   language, including Python and Javascript!
 
@@ -21,32 +24,33 @@
 1. Bio-vcf is fast and scales on multi-core computers
 2. Bio-vcf has an expressive filtering and evaluation language
 3. Bio-vcf has great multi-sample support
 4. Bio-vcf has multiple global filters and sample filters
 5. Bio-vcf can access any VCF format
-6. Bio-vcf can parse and query the VCF header (META)
+6. Bio-vcf can parse and query the VCF header (META data)
 7. Bio-vcf can do calculations on fields
 8. Bio-vcf allows for genotype processing
 9. Bio-vcf has support for set analysis
 10. Bio-vcf has sane error handling
 11. Bio-vcf can convert *any* VCF to *any* output, including tabular data, BED, HTML, LaTeX, RDF, JSON and JSON-LD and even other VCFs by using (erb) templates
+12. Bio-vcf has soft filters
 
-Bio-vcf has better performance than other tools
-because of lazy parsing, multi-threading, and useful combinations of
-(fancy) command line filtering (who says Ruby is slow?). Adding
-cores, bio-vcf just does better. The more complicated the filters,
-the larger the gain. First the base line test to show IO performance
+Bio-vcf has better performance than other tools because of lazy
+parsing, multi-threading, and useful combinations of (fancy) command
+line filtering (who says Ruby is slow?). Adding cores, bio-vcf just
+does better. The more complicated the filters, the larger the
+gain. First a base line test to show IO performance
 
 ```sh
   time cat ESP6500SI-V2-SSA137.GRCh38-liftover.*.vcf|wc
   1987143 15897724 1003214613
   real    0m7.823s
   user    0m7.002s
   sys     0m2.972s
 ```
 
-Next run the 1Gb data with bio-vcf effectively using 5 cores on AMD Opteron(tm) Processor 6174 using Linux
+Next run this 1Gb data with bio-vcf effectively using 5 cores on AMD Opteron(tm) Processor 6174 using Linux
 
 ```sh
   time cat ESP6500SI-V2-SSA137.GRCh38-liftover.*.vcf|./bin/bio-vcf -iv --num-threads 8 --filter 'r.info.cp.to_f>0.3' > /dev/null
   real    0m32.491s
   user    2m34.767s
@@ -60,14 +64,15 @@
 real    12m36.121s
 user    12m53.273s
 sys     0m9.913s
 ```
 
-This means that on this machine bio-vcf is 24x faster than SnpSift even for a simple filter.
-In fact, bio-vcf is perfect for complex filters and parsing large data files on powerful machines. Parsing a 650 Mb GATK
-Illumina Hiseq VCF file and evaluating the results into a BED format on
-a 16 core machine takes
+This means that on this machine bio-vcf is 24x faster than SnpSift
+even for a simple filter.  In fact, bio-vcf is perfect for complex
+filters and parsing large data files on powerful machines. Parsing a
+650 Mb GATK Illumina Hiseq VCF file and evaluating the results into a
+BED format on a 16 core machine takes
 
 ```sh
   time bio-vcf --num-threads 16 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50' --sfilter '!s.empty? and s.dp>20' --eval '[r.chrom,r.pos,r.pos+1]' < test.large2.vcf > test.out.3
   real    0m47.612s
   user    8m18.234s
@@ -75,24 +80,28 @@
 ```
 
 which shows decent core utilisation (10x). Running 
 gzip compressed VCF files of 30+ Gb has similar performance gains.
 
-Use zcat to
-pipe such gzipped (vcf.gz) files into bio-vcf, e.g.
+To view some complex filters on an 80Gb SNP file check out a
+[GTEx exercise](https://github.com/pjotrp/bioruby-vcf/blob/master/doc/GTEx_reduce.md).
 
+Use zcat (or even better pigz which is multi-core itself) to pipe such
+gzipped (vcf.gz) files into bio-vcf, e.g.
+
 ```sh
   zcat huge_file.vcf.gz| bio-vcf --num-threads 36 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50'
     --sfilter '!s.empty? and s.dp>20' 
     --eval '[r.chrom,r.pos,r.pos+1]' > test.bed
 ```
 
-bio-vcf comes with a sensible parser definition language (interestingly it is 100%
-Ruby), an embedded Ragel parser for INFO and FORMAT header definitions, as well as primitives for set analysis. Few
+bio-vcf comes with a sensible parser definition language
+(interestingly it is 100% Ruby), an embedded Ragel parser for INFO and
+FORMAT header definitions, as well as primitives for set analysis. Few
 assumptions are made about the actual contents of the VCF file (field
-names are resolved on the fly), so bio-vcf should work with
-all VCF files.
+names are resolved on the fly), so bio-vcf should work with all VCF
+files.
 
 To fetch all entries where all samples have depth larger than 20 and
 filter set to PASS use a sample filter
 
 ```ruby
@@ -166,11 +175,12 @@
 1       10291   16      26      30      32      36      27      31
 1       10297   18      23      26      30      20      27      27
 1       10303   25      31      28      32      17      23      22
 ```
 
-To calculate alt frequencies from s.ad which is sample (alt dp)/(ref dp + alt dp)
+To calculate percentage non-reference (PNR) alt frequencies from s.ad
+which is sample (alt dp)/(ref dp + alt dp)
 
 ```ruby
 bio-vcf -i --seval 's.ad[1].to_f/(s.ad[0]+s.ad[1])'
 1       10257   0.050314465408805034    0.0912863070539419      0.08835341365461848     0.088709677419354840.09782608695652174      0.12735849056603774     0.06944444444444445
 1       10291   0.09937888198757763     0.10655737704918032     0.12295081967213115     0.1306122448979592 0.22784810126582278      0.17088607594936708     0.1657754010695187
@@ -260,16 +270,10 @@
 ```sh
 gem install bio-vcf
 bio-vcf -h
 ```
 
-For multi-core also install the parallel gem
-
-```sh
-gem install parallel
-```
-
 ## Command line interface (CLI)
 
 Get the version of the VCF file
 
 ```ruby
@@ -395,10 +399,27 @@
 
 ```sh
   bio-vcf --filter "rec.missing_samples?" < file.vcf
 ```
 
+To set a soft filter, i.e. the filter column is updated
+
+```sh
+bio-vcf --add-filter LowQD --filter 'r.tumor.dp<5' < test/data/input/somaticsniper.vcf |bio-vcf --eval '[r.chr,r.pos,r.tumor.dp,r.filter]' --filter 'r.filter.index("LowQD")'
+```
+
+may render something like
+
+```
+1       46527674        4       LowQD
+1       108417572       4       LowQD
+1       155449089       4       LowQD
+1       169847826       4       LowQD
+1       203098164       3       LowQD
+2       39213209        4       LowQD
+```
+
 Likewise you can check for record validity
 
 ```sh
   bio-vcf --filter "not rec.valid?" < file.vcf
 ```
@@ -623,19 +644,41 @@
 ```
 
 and 'gts' as a nucleotide string array
 
 ```ruby
-  bio-vcf --seval 's.gts[0]' 
+  bio-vcf --seval 's.gts' 
     1       10665                   C       C               C       C
     1       10694                   G       G
     1       12783   G       G       G       G       G       G       G
     1       15274   G       G       G       G       G       G       G
 ```
 
 where gts represents the indexed genotype on [ref] + [alt].
 
+To convert combined genotypes into numbers, i.e., 0/0 -> 0, 0/1 -> 1,
+1/1 -> 2, is useful for indexed fields giving information on, for
+example signficance, use
+
+```ruby
+    bio-vcf --seval '!s.empty? and s.gtindex'
+    11      58949455        0       1
+    11      65481082        0       1
+    11      94180424        0       1
+    11      121036021       0       1
+```
+
+Now you can index other fields, e.g. GL
+
+```ruby
+    ./bin/bio-vcf --seval '[(!s.empty? ? s.gl[s.gtindex]:-1)]' 
+    1       900057  1.0     1.0     0.994   1.0     1.0     -1      0.999   1.0     0.997   -1  0.994    0.989   -1      0.991   -1      0.972   0.992   1.0
+    ```
+
+shows a number of SNPs have been scored with high significance and a
+number are missing, here marked as -1.
+
 These values can also be used in filters and output allele depth, for
 example
 
 ```ruby
   bio-vcf -vi --ifilter 'rec.original.gt!="0/1"' --efilter 'rec.original.gt=="0/0"' --seval 'rec.original.ad[s.gti[1]]'
@@ -653,11 +696,39 @@
 1       10257   151     151     151     151     151     8       151
 1       13302   26      10      10      10      10      10      10
 1       13757   47      47      4       47      47      4       47
 ```
 
+## Sample counting
 
+Note, the use of lambda allows for sophisticated queries. You may need
+some expert advice here.
+
+To count valid genotype field in samples you can do something like
+
+```ruby
+bio-vcf --eval 'r.samples.count {|s| s.gt!="./."}'
+```
+
+A similar complex count would be
+
+```ruby
+    bio-vcf --eval '[r.chr,r.pos,r.samples.count { |s| (!s.empty? && s.gl[s.gtindex]==1.0) }]'
+```
+
+which tests for perfect SNPs scored (for example).
+
+## Reorder filter with lambda
+
+Sometime it pay to reorder the filter using a lambda. This is one
+example where the greedy sample counts are done only for those
+samples that match the other criteria:
+
+```ruby
+./bin/bio-vcf --num-threads=1 --filter '(r.info.miss<0.05 and r.info.exp_freq_a1>0.05 and r.info.exp_freq_a1<0.95 and r.info.impinfo>0.7 and r.info.hw<1.0) ? lambda { found=r.samples.count { |s| (!s.empty? && s.gl[s.gtindex]==1.0) }.to_f; total=r.samples.count{|s| s.gt!="./."} ; found/total>0.7 and total-found<30 }.call : false)'
+```
+
 ## Modify VCF files
 
 Add or modify the sample file name in the INFO fields:
 
 ```sh
@@ -902,11 +973,14 @@
       Total          130
     ## ==================================================
 
 ## Other examples
 
-For more examples see the feature [section](https://github.com/pjotrp/bioruby-vcf/tree/master/features).
+For more exercises and examples see
+[doc](https://github.com/pjotrp/bioruby-vcf/tree/master/doc) directory
+and the the feature
+[section](https://github.com/pjotrp/bioruby-vcf/tree/master/features).
 
 ## API
 
 BioVcf can also be used as an API. The following code is basically
 what the command line interface uses (see ./bin/bio-vcf)
@@ -935,16 +1009,65 @@
   end
 ```
 
 ## Trouble shooting
 
+### MRI supports threading
+
 Note that Ruby 2.x is required for Bio-vcf. JRuby works, but only
 in single threaded mode (for now).
 
+### Set TMPDIR when running out of space
+
 The multi-threading creates temporary files using the system TMPDIR.
 This behaviour can be overridden by setting the environment variable.
-Also, for genome-wide sequencing it may be useful to increase
---thread-lines to a value larger than 1_000_000.
+
+### Reorder filter on time out
+
+Make sure to minimize expensive calculations by moving them
+backward. An 'and' statement is evaluated from left to right. With
+
+```ruby
+fast_check and slow_check
+```
+
+slow_check only gets executed if fast_check is true.
+
+For more complex filters use lambda inside a conditional
+
+```ruby
+    ( fast_check ? lambda { slow_check }.call : false )
+```
+    
+where slow_check is the slow section of your query. As is shown
+earlier in this document. Don't forget the .call!
+
+### Reduce thread lines on timeout
+
+Depending on your input data and the speed filters it may be useful to
+tweak the number of thread lines and/or to increase the timeout.
+
+On really fast file systems for genome-wide sequencing try increasing
+--thread-lines to a value larger than 100_000. On the other hand if
+the computations are intensive (per line) reduce the number of
+thread-lines (try 10_000 and 1_000).  If processes get killed that is
+the one to try.
+
+For larger files set the timeout to 600, or so. --timeout 600.
+
+Different values may show different core use on a machine.
+
+### Debugging
+
+To debug output use '-v --num-threads=1' for generating useful
+output. Also do not use the -i switch (ignore errors) when there
+are problems.
+
+### Tmpdir contains (old) bio-vcf directories
+
+Multi-threaded bio-vcf writes into a temporary directory during
+processing. When a process gets interrupted for some reason the
+temporary directory may remain.
 
 ## Project home page
 
 Information on the source tree, documentation, examples, issues and
 how to contribute, see