README.md in bio-vcf-0.9.0 vs README.md in bio-vcf-0.9.2

- old
+ new

@@ -2,10 +2,13 @@ [![Build Status](https://secure.travis-ci.org/pjotrp/bioruby-vcf.png)](http://travis-ci.org/pjotrp/bioruby-vcf) ## Updates +* Getting ready for a 1.0 release +* 0.9.1 removed a rare threading bug and cleanup on error +* Added support for soft filters (request by Brad Chapman) * The outputter now writes (properly) in parallel with the parser * bio-vcf turns any VCF into JSON with header information, and allows you to pipe that JSON directly into any JSON supporting language, including Python and Javascript! @@ -21,32 +24,33 @@ 1. Bio-vcf is fast and scales on multi-core computers 2. Bio-vcf has an expressive filtering and evaluation language 3. Bio-vcf has great multi-sample support 4. Bio-vcf has multiple global filters and sample filters 5. Bio-vcf can access any VCF format -6. Bio-vcf can parse and query the VCF header (META) +6. Bio-vcf can parse and query the VCF header (META data) 7. Bio-vcf can do calculations on fields 8. Bio-vcf allows for genotype processing 9. Bio-vcf has support for set analysis 10. Bio-vcf has sane error handling 11. Bio-vcf can convert *any* VCF to *any* output, including tabular data, BED, HTML, LaTeX, RDF, JSON and JSON-LD and even other VCFs by using (erb) templates +12. Bio-vcf has soft filters -Bio-vcf has better performance than other tools -because of lazy parsing, multi-threading, and useful combinations of -(fancy) command line filtering (who says Ruby is slow?). Adding -cores, bio-vcf just does better. The more complicated the filters, -the larger the gain. First the base line test to show IO performance +Bio-vcf has better performance than other tools because of lazy +parsing, multi-threading, and useful combinations of (fancy) command +line filtering (who says Ruby is slow?). Adding cores, bio-vcf just +does better. The more complicated the filters, the larger the +gain. First a base line test to show IO performance ```sh time cat ESP6500SI-V2-SSA137.GRCh38-liftover.*.vcf|wc 1987143 15897724 1003214613 real 0m7.823s user 0m7.002s sys 0m2.972s ``` -Next run the 1Gb data with bio-vcf effectively using 5 cores on AMD Opteron(tm) Processor 6174 using Linux +Next run this 1Gb data with bio-vcf effectively using 5 cores on AMD Opteron(tm) Processor 6174 using Linux ```sh time cat ESP6500SI-V2-SSA137.GRCh38-liftover.*.vcf|./bin/bio-vcf -iv --num-threads 8 --filter 'r.info.cp.to_f>0.3' > /dev/null real 0m32.491s user 2m34.767s @@ -60,14 +64,15 @@ real 12m36.121s user 12m53.273s sys 0m9.913s ``` -This means that on this machine bio-vcf is 24x faster than SnpSift even for a simple filter. -In fact, bio-vcf is perfect for complex filters and parsing large data files on powerful machines. Parsing a 650 Mb GATK -Illumina Hiseq VCF file and evaluating the results into a BED format on -a 16 core machine takes +This means that on this machine bio-vcf is 24x faster than SnpSift +even for a simple filter. In fact, bio-vcf is perfect for complex +filters and parsing large data files on powerful machines. Parsing a +650 Mb GATK Illumina Hiseq VCF file and evaluating the results into a +BED format on a 16 core machine takes ```sh time bio-vcf --num-threads 16 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50' --sfilter '!s.empty? and s.dp>20' --eval '[r.chrom,r.pos,r.pos+1]' < test.large2.vcf > test.out.3 real 0m47.612s user 8m18.234s @@ -75,24 +80,28 @@ ``` which shows decent core utilisation (10x). Running gzip compressed VCF files of 30+ Gb has similar performance gains. -Use zcat to -pipe such gzipped (vcf.gz) files into bio-vcf, e.g. +To view some complex filters on an 80Gb SNP file check out a +[GTEx exercise](https://github.com/pjotrp/bioruby-vcf/blob/master/doc/GTEx_reduce.md). +Use zcat (or even better pigz which is multi-core itself) to pipe such +gzipped (vcf.gz) files into bio-vcf, e.g. + ```sh zcat huge_file.vcf.gz| bio-vcf --num-threads 36 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50' --sfilter '!s.empty? and s.dp>20' --eval '[r.chrom,r.pos,r.pos+1]' > test.bed ``` -bio-vcf comes with a sensible parser definition language (interestingly it is 100% -Ruby), an embedded Ragel parser for INFO and FORMAT header definitions, as well as primitives for set analysis. Few +bio-vcf comes with a sensible parser definition language +(interestingly it is 100% Ruby), an embedded Ragel parser for INFO and +FORMAT header definitions, as well as primitives for set analysis. Few assumptions are made about the actual contents of the VCF file (field -names are resolved on the fly), so bio-vcf should work with -all VCF files. +names are resolved on the fly), so bio-vcf should work with all VCF +files. To fetch all entries where all samples have depth larger than 20 and filter set to PASS use a sample filter ```ruby @@ -166,11 +175,12 @@ 1 10291 16 26 30 32 36 27 31 1 10297 18 23 26 30 20 27 27 1 10303 25 31 28 32 17 23 22 ``` -To calculate alt frequencies from s.ad which is sample (alt dp)/(ref dp + alt dp) +To calculate percentage non-reference (PNR) alt frequencies from s.ad +which is sample (alt dp)/(ref dp + alt dp) ```ruby bio-vcf -i --seval 's.ad[1].to_f/(s.ad[0]+s.ad[1])' 1 10257 0.050314465408805034 0.0912863070539419 0.08835341365461848 0.088709677419354840.09782608695652174 0.12735849056603774 0.06944444444444445 1 10291 0.09937888198757763 0.10655737704918032 0.12295081967213115 0.1306122448979592 0.22784810126582278 0.17088607594936708 0.1657754010695187 @@ -260,16 +270,10 @@ ```sh gem install bio-vcf bio-vcf -h ``` -For multi-core also install the parallel gem - -```sh -gem install parallel -``` - ## Command line interface (CLI) Get the version of the VCF file ```ruby @@ -395,10 +399,27 @@ ```sh bio-vcf --filter "rec.missing_samples?" < file.vcf ``` +To set a soft filter, i.e. the filter column is updated + +```sh +bio-vcf --add-filter LowQD --filter 'r.tumor.dp<5' < test/data/input/somaticsniper.vcf |bio-vcf --eval '[r.chr,r.pos,r.tumor.dp,r.filter]' --filter 'r.filter.index("LowQD")' +``` + +may render something like + +``` +1 46527674 4 LowQD +1 108417572 4 LowQD +1 155449089 4 LowQD +1 169847826 4 LowQD +1 203098164 3 LowQD +2 39213209 4 LowQD +``` + Likewise you can check for record validity ```sh bio-vcf --filter "not rec.valid?" < file.vcf ``` @@ -623,19 +644,41 @@ ``` and 'gts' as a nucleotide string array ```ruby - bio-vcf --seval 's.gts[0]' + bio-vcf --seval 's.gts' 1 10665 C C C C 1 10694 G G 1 12783 G G G G G G G 1 15274 G G G G G G G ``` where gts represents the indexed genotype on [ref] + [alt]. +To convert combined genotypes into numbers, i.e., 0/0 -> 0, 0/1 -> 1, +1/1 -> 2, is useful for indexed fields giving information on, for +example signficance, use + +```ruby + bio-vcf --seval '!s.empty? and s.gtindex' + 11 58949455 0 1 + 11 65481082 0 1 + 11 94180424 0 1 + 11 121036021 0 1 +``` + +Now you can index other fields, e.g. GL + +```ruby + ./bin/bio-vcf --seval '[(!s.empty? ? s.gl[s.gtindex]:-1)]' + 1 900057 1.0 1.0 0.994 1.0 1.0 -1 0.999 1.0 0.997 -1 0.994 0.989 -1 0.991 -1 0.972 0.992 1.0 + ``` + +shows a number of SNPs have been scored with high significance and a +number are missing, here marked as -1. + These values can also be used in filters and output allele depth, for example ```ruby bio-vcf -vi --ifilter 'rec.original.gt!="0/1"' --efilter 'rec.original.gt=="0/0"' --seval 'rec.original.ad[s.gti[1]]' @@ -653,11 +696,39 @@ 1 10257 151 151 151 151 151 8 151 1 13302 26 10 10 10 10 10 10 1 13757 47 47 4 47 47 4 47 ``` +## Sample counting +Note, the use of lambda allows for sophisticated queries. You may need +some expert advice here. + +To count valid genotype field in samples you can do something like + +```ruby +bio-vcf --eval 'r.samples.count {|s| s.gt!="./."}' +``` + +A similar complex count would be + +```ruby + bio-vcf --eval '[r.chr,r.pos,r.samples.count { |s| (!s.empty? && s.gl[s.gtindex]==1.0) }]' +``` + +which tests for perfect SNPs scored (for example). + +## Reorder filter with lambda + +Sometime it pay to reorder the filter using a lambda. This is one +example where the greedy sample counts are done only for those +samples that match the other criteria: + +```ruby +./bin/bio-vcf --num-threads=1 --filter '(r.info.miss<0.05 and r.info.exp_freq_a1>0.05 and r.info.exp_freq_a1<0.95 and r.info.impinfo>0.7 and r.info.hw<1.0) ? lambda { found=r.samples.count { |s| (!s.empty? && s.gl[s.gtindex]==1.0) }.to_f; total=r.samples.count{|s| s.gt!="./."} ; found/total>0.7 and total-found<30 }.call : false)' +``` + ## Modify VCF files Add or modify the sample file name in the INFO fields: ```sh @@ -902,11 +973,14 @@ Total 130 ## ================================================== ## Other examples -For more examples see the feature [section](https://github.com/pjotrp/bioruby-vcf/tree/master/features). +For more exercises and examples see +[doc](https://github.com/pjotrp/bioruby-vcf/tree/master/doc) directory +and the the feature +[section](https://github.com/pjotrp/bioruby-vcf/tree/master/features). ## API BioVcf can also be used as an API. The following code is basically what the command line interface uses (see ./bin/bio-vcf) @@ -935,16 +1009,65 @@ end ``` ## Trouble shooting +### MRI supports threading + Note that Ruby 2.x is required for Bio-vcf. JRuby works, but only in single threaded mode (for now). +### Set TMPDIR when running out of space + The multi-threading creates temporary files using the system TMPDIR. This behaviour can be overridden by setting the environment variable. -Also, for genome-wide sequencing it may be useful to increase ---thread-lines to a value larger than 1_000_000. + +### Reorder filter on time out + +Make sure to minimize expensive calculations by moving them +backward. An 'and' statement is evaluated from left to right. With + +```ruby +fast_check and slow_check +``` + +slow_check only gets executed if fast_check is true. + +For more complex filters use lambda inside a conditional + +```ruby + ( fast_check ? lambda { slow_check }.call : false ) +``` + +where slow_check is the slow section of your query. As is shown +earlier in this document. Don't forget the .call! + +### Reduce thread lines on timeout + +Depending on your input data and the speed filters it may be useful to +tweak the number of thread lines and/or to increase the timeout. + +On really fast file systems for genome-wide sequencing try increasing +--thread-lines to a value larger than 100_000. On the other hand if +the computations are intensive (per line) reduce the number of +thread-lines (try 10_000 and 1_000). If processes get killed that is +the one to try. + +For larger files set the timeout to 600, or so. --timeout 600. + +Different values may show different core use on a machine. + +### Debugging + +To debug output use '-v --num-threads=1' for generating useful +output. Also do not use the -i switch (ignore errors) when there +are problems. + +### Tmpdir contains (old) bio-vcf directories + +Multi-threaded bio-vcf writes into a temporary directory during +processing. When a process gets interrupted for some reason the +temporary directory may remain. ## Project home page Information on the source tree, documentation, examples, issues and how to contribute, see