README.md in bio-vcf-0.9.0 vs README.md in bio-vcf-0.9.2
- old
+ new
@@ -2,10 +2,13 @@
[![Build Status](https://secure.travis-ci.org/pjotrp/bioruby-vcf.png)](http://travis-ci.org/pjotrp/bioruby-vcf)
## Updates
+* Getting ready for a 1.0 release
+* 0.9.1 removed a rare threading bug and cleanup on error
+* Added support for soft filters (request by Brad Chapman)
* The outputter now writes (properly) in parallel with the parser
* bio-vcf turns any VCF into JSON with header information, and
allows you to pipe that JSON directly into any JSON supporting
language, including Python and Javascript!
@@ -21,32 +24,33 @@
1. Bio-vcf is fast and scales on multi-core computers
2. Bio-vcf has an expressive filtering and evaluation language
3. Bio-vcf has great multi-sample support
4. Bio-vcf has multiple global filters and sample filters
5. Bio-vcf can access any VCF format
-6. Bio-vcf can parse and query the VCF header (META)
+6. Bio-vcf can parse and query the VCF header (META data)
7. Bio-vcf can do calculations on fields
8. Bio-vcf allows for genotype processing
9. Bio-vcf has support for set analysis
10. Bio-vcf has sane error handling
11. Bio-vcf can convert *any* VCF to *any* output, including tabular data, BED, HTML, LaTeX, RDF, JSON and JSON-LD and even other VCFs by using (erb) templates
+12. Bio-vcf has soft filters
-Bio-vcf has better performance than other tools
-because of lazy parsing, multi-threading, and useful combinations of
-(fancy) command line filtering (who says Ruby is slow?). Adding
-cores, bio-vcf just does better. The more complicated the filters,
-the larger the gain. First the base line test to show IO performance
+Bio-vcf has better performance than other tools because of lazy
+parsing, multi-threading, and useful combinations of (fancy) command
+line filtering (who says Ruby is slow?). Adding cores, bio-vcf just
+does better. The more complicated the filters, the larger the
+gain. First a base line test to show IO performance
```sh
time cat ESP6500SI-V2-SSA137.GRCh38-liftover.*.vcf|wc
1987143 15897724 1003214613
real 0m7.823s
user 0m7.002s
sys 0m2.972s
```
-Next run the 1Gb data with bio-vcf effectively using 5 cores on AMD Opteron(tm) Processor 6174 using Linux
+Next run this 1Gb data with bio-vcf effectively using 5 cores on AMD Opteron(tm) Processor 6174 using Linux
```sh
time cat ESP6500SI-V2-SSA137.GRCh38-liftover.*.vcf|./bin/bio-vcf -iv --num-threads 8 --filter 'r.info.cp.to_f>0.3' > /dev/null
real 0m32.491s
user 2m34.767s
@@ -60,14 +64,15 @@
real 12m36.121s
user 12m53.273s
sys 0m9.913s
```
-This means that on this machine bio-vcf is 24x faster than SnpSift even for a simple filter.
-In fact, bio-vcf is perfect for complex filters and parsing large data files on powerful machines. Parsing a 650 Mb GATK
-Illumina Hiseq VCF file and evaluating the results into a BED format on
-a 16 core machine takes
+This means that on this machine bio-vcf is 24x faster than SnpSift
+even for a simple filter. In fact, bio-vcf is perfect for complex
+filters and parsing large data files on powerful machines. Parsing a
+650 Mb GATK Illumina Hiseq VCF file and evaluating the results into a
+BED format on a 16 core machine takes
```sh
time bio-vcf --num-threads 16 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50' --sfilter '!s.empty? and s.dp>20' --eval '[r.chrom,r.pos,r.pos+1]' < test.large2.vcf > test.out.3
real 0m47.612s
user 8m18.234s
@@ -75,24 +80,28 @@
```
which shows decent core utilisation (10x). Running
gzip compressed VCF files of 30+ Gb has similar performance gains.
-Use zcat to
-pipe such gzipped (vcf.gz) files into bio-vcf, e.g.
+To view some complex filters on an 80Gb SNP file check out a
+[GTEx exercise](https://github.com/pjotrp/bioruby-vcf/blob/master/doc/GTEx_reduce.md).
+Use zcat (or even better pigz which is multi-core itself) to pipe such
+gzipped (vcf.gz) files into bio-vcf, e.g.
+
```sh
zcat huge_file.vcf.gz| bio-vcf --num-threads 36 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50'
--sfilter '!s.empty? and s.dp>20'
--eval '[r.chrom,r.pos,r.pos+1]' > test.bed
```
-bio-vcf comes with a sensible parser definition language (interestingly it is 100%
-Ruby), an embedded Ragel parser for INFO and FORMAT header definitions, as well as primitives for set analysis. Few
+bio-vcf comes with a sensible parser definition language
+(interestingly it is 100% Ruby), an embedded Ragel parser for INFO and
+FORMAT header definitions, as well as primitives for set analysis. Few
assumptions are made about the actual contents of the VCF file (field
-names are resolved on the fly), so bio-vcf should work with
-all VCF files.
+names are resolved on the fly), so bio-vcf should work with all VCF
+files.
To fetch all entries where all samples have depth larger than 20 and
filter set to PASS use a sample filter
```ruby
@@ -166,11 +175,12 @@
1 10291 16 26 30 32 36 27 31
1 10297 18 23 26 30 20 27 27
1 10303 25 31 28 32 17 23 22
```
-To calculate alt frequencies from s.ad which is sample (alt dp)/(ref dp + alt dp)
+To calculate percentage non-reference (PNR) alt frequencies from s.ad
+which is sample (alt dp)/(ref dp + alt dp)
```ruby
bio-vcf -i --seval 's.ad[1].to_f/(s.ad[0]+s.ad[1])'
1 10257 0.050314465408805034 0.0912863070539419 0.08835341365461848 0.088709677419354840.09782608695652174 0.12735849056603774 0.06944444444444445
1 10291 0.09937888198757763 0.10655737704918032 0.12295081967213115 0.1306122448979592 0.22784810126582278 0.17088607594936708 0.1657754010695187
@@ -260,16 +270,10 @@
```sh
gem install bio-vcf
bio-vcf -h
```
-For multi-core also install the parallel gem
-
-```sh
-gem install parallel
-```
-
## Command line interface (CLI)
Get the version of the VCF file
```ruby
@@ -395,10 +399,27 @@
```sh
bio-vcf --filter "rec.missing_samples?" < file.vcf
```
+To set a soft filter, i.e. the filter column is updated
+
+```sh
+bio-vcf --add-filter LowQD --filter 'r.tumor.dp<5' < test/data/input/somaticsniper.vcf |bio-vcf --eval '[r.chr,r.pos,r.tumor.dp,r.filter]' --filter 'r.filter.index("LowQD")'
+```
+
+may render something like
+
+```
+1 46527674 4 LowQD
+1 108417572 4 LowQD
+1 155449089 4 LowQD
+1 169847826 4 LowQD
+1 203098164 3 LowQD
+2 39213209 4 LowQD
+```
+
Likewise you can check for record validity
```sh
bio-vcf --filter "not rec.valid?" < file.vcf
```
@@ -623,19 +644,41 @@
```
and 'gts' as a nucleotide string array
```ruby
- bio-vcf --seval 's.gts[0]'
+ bio-vcf --seval 's.gts'
1 10665 C C C C
1 10694 G G
1 12783 G G G G G G G
1 15274 G G G G G G G
```
where gts represents the indexed genotype on [ref] + [alt].
+To convert combined genotypes into numbers, i.e., 0/0 -> 0, 0/1 -> 1,
+1/1 -> 2, is useful for indexed fields giving information on, for
+example signficance, use
+
+```ruby
+ bio-vcf --seval '!s.empty? and s.gtindex'
+ 11 58949455 0 1
+ 11 65481082 0 1
+ 11 94180424 0 1
+ 11 121036021 0 1
+```
+
+Now you can index other fields, e.g. GL
+
+```ruby
+ ./bin/bio-vcf --seval '[(!s.empty? ? s.gl[s.gtindex]:-1)]'
+ 1 900057 1.0 1.0 0.994 1.0 1.0 -1 0.999 1.0 0.997 -1 0.994 0.989 -1 0.991 -1 0.972 0.992 1.0
+ ```
+
+shows a number of SNPs have been scored with high significance and a
+number are missing, here marked as -1.
+
These values can also be used in filters and output allele depth, for
example
```ruby
bio-vcf -vi --ifilter 'rec.original.gt!="0/1"' --efilter 'rec.original.gt=="0/0"' --seval 'rec.original.ad[s.gti[1]]'
@@ -653,11 +696,39 @@
1 10257 151 151 151 151 151 8 151
1 13302 26 10 10 10 10 10 10
1 13757 47 47 4 47 47 4 47
```
+## Sample counting
+Note, the use of lambda allows for sophisticated queries. You may need
+some expert advice here.
+
+To count valid genotype field in samples you can do something like
+
+```ruby
+bio-vcf --eval 'r.samples.count {|s| s.gt!="./."}'
+```
+
+A similar complex count would be
+
+```ruby
+ bio-vcf --eval '[r.chr,r.pos,r.samples.count { |s| (!s.empty? && s.gl[s.gtindex]==1.0) }]'
+```
+
+which tests for perfect SNPs scored (for example).
+
+## Reorder filter with lambda
+
+Sometime it pay to reorder the filter using a lambda. This is one
+example where the greedy sample counts are done only for those
+samples that match the other criteria:
+
+```ruby
+./bin/bio-vcf --num-threads=1 --filter '(r.info.miss<0.05 and r.info.exp_freq_a1>0.05 and r.info.exp_freq_a1<0.95 and r.info.impinfo>0.7 and r.info.hw<1.0) ? lambda { found=r.samples.count { |s| (!s.empty? && s.gl[s.gtindex]==1.0) }.to_f; total=r.samples.count{|s| s.gt!="./."} ; found/total>0.7 and total-found<30 }.call : false)'
+```
+
## Modify VCF files
Add or modify the sample file name in the INFO fields:
```sh
@@ -902,11 +973,14 @@
Total 130
## ==================================================
## Other examples
-For more examples see the feature [section](https://github.com/pjotrp/bioruby-vcf/tree/master/features).
+For more exercises and examples see
+[doc](https://github.com/pjotrp/bioruby-vcf/tree/master/doc) directory
+and the the feature
+[section](https://github.com/pjotrp/bioruby-vcf/tree/master/features).
## API
BioVcf can also be used as an API. The following code is basically
what the command line interface uses (see ./bin/bio-vcf)
@@ -935,16 +1009,65 @@
end
```
## Trouble shooting
+### MRI supports threading
+
Note that Ruby 2.x is required for Bio-vcf. JRuby works, but only
in single threaded mode (for now).
+### Set TMPDIR when running out of space
+
The multi-threading creates temporary files using the system TMPDIR.
This behaviour can be overridden by setting the environment variable.
-Also, for genome-wide sequencing it may be useful to increase
---thread-lines to a value larger than 1_000_000.
+
+### Reorder filter on time out
+
+Make sure to minimize expensive calculations by moving them
+backward. An 'and' statement is evaluated from left to right. With
+
+```ruby
+fast_check and slow_check
+```
+
+slow_check only gets executed if fast_check is true.
+
+For more complex filters use lambda inside a conditional
+
+```ruby
+ ( fast_check ? lambda { slow_check }.call : false )
+```
+
+where slow_check is the slow section of your query. As is shown
+earlier in this document. Don't forget the .call!
+
+### Reduce thread lines on timeout
+
+Depending on your input data and the speed filters it may be useful to
+tweak the number of thread lines and/or to increase the timeout.
+
+On really fast file systems for genome-wide sequencing try increasing
+--thread-lines to a value larger than 100_000. On the other hand if
+the computations are intensive (per line) reduce the number of
+thread-lines (try 10_000 and 1_000). If processes get killed that is
+the one to try.
+
+For larger files set the timeout to 600, or so. --timeout 600.
+
+Different values may show different core use on a machine.
+
+### Debugging
+
+To debug output use '-v --num-threads=1' for generating useful
+output. Also do not use the -i switch (ignore errors) when there
+are problems.
+
+### Tmpdir contains (old) bio-vcf directories
+
+Multi-threaded bio-vcf writes into a temporary directory during
+processing. When a process gets interrupted for some reason the
+temporary directory may remain.
## Project home page
Information on the source tree, documentation, examples, issues and
how to contribute, see