README.md in bio-vcf-0.7.0 vs README.md in bio-vcf-0.7.3

- old
+ new

@@ -1,26 +1,39 @@ # bio-vcf [![Build Status](https://secure.travis-ci.org/pjotrp/bioruby-vcf.png)](http://travis-ci.org/pjotrp/bioruby-vcf) -Yet another VCF parser. Bio-vcf is not only fast for genome-wide data, -it also comes with a really nice filtering, evaluation and rewrite -language. Bio-vcf has better performance than other tools +A new generation VCF parser. Bio-vcf is not only fast for genome-wide +(WGS) data, it also comes with a really nice filtering, evaluation and +rewrite language. Why would you use bio-vcf over other parsers? + +1. Bio-vcf is fast and scales on multi-core computers +2. Bio-vcf has an expressive filtering and evaluation language +3. Bio-vcf has great multi-sample support +4. Bio-vcf has multiple global filters and sample filters +5. Bio-vcf can access any VCF format +6. Bio-vcf can do calculations on fields +7. Bio-vcf allows for genotype processing +8. Bio-vcf has support for set analysis +9. Bio-vcf has sane error handling +10. Bio-vcf can output tabular data, HTML, LaTeX, RDF and (soon) JSON + +Bio-vcf has better performance than other tools because of lazy parsing, multi-threading, and useful combinations of -(fancy) command line filtering. For example on an 2 core machine -bio-vcf is 50% faster than SnpSift. On an 8 core machine bio-vcf is -3x faster than SnpSift. Parsing a 1 Gb ESP VCF with 8 cores with -bio-vcf takes +(fancy) command line filtering. For example on an 2 core machine +bio-vcf is typically 50% faster than JVM based SnpSift. On an 8 core machine +bio-vcf is at least 3x faster than SnpSift. Parsing a 1 Gb ESP +VCF with 8 cores with bio-vcf takes ```sh time ./bin/bio-vcf -iv --num-threads 8 --filter 'r.info.cp>0.3' < ESP6500SI_V2_SSA137.vcf > test1.vcf real 0m21.095s user 1m41.101s sys 0m7.852s ``` -and parsing with SnpSift takes +while parsing with SnpSift takes ```sh time cat ESP6500SI_V2_SSA137.vcf |java -jar snpEff/SnpSift.jar filter "( CP>0.3 )" > test.vcf real 1m4.913s user 0m58.071s @@ -30,26 +43,26 @@ Bio-vcf is perfect for parsing large data files. Parsing a 650 Mb GATK Illumina Hiseq VCF file and evaluating the results into a BED format on a 16 core machine takes ```sh - time bio-vcf --num-threads 36 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50' --sfilter '!s.empty? and s.dp>20' --eval '[r.chrom,r.pos,r.pos+1]' < test.large2.vcf > test.out.3 + time bio-vcf --num-threads 16 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50' --sfilter '!s.empty? and s.dp>20' --eval '[r.chrom,r.pos,r.pos+1]' < test.large2.vcf > test.out.3 real 0m47.612s user 8m18.234s sys 0m5.039s ``` -which shows some pretty decent core utilisation (10x). +which shows pretty decent core utilisation (10x). We are running +gzip compressed VCF files of 30+ Gb with similar performance gains. Use zcat to -pipe gzipped (vcf.gz) files into bio-vcf, e.g. +pipe such gzipped (vcf.gz) files into bio-vcf, e.g. ```sh zcat huge_file.vcf.gz| bio-vcf --num-threads 36 --filter 'r.chrom.to_i>0 and r.chrom.to_i<21 and r.qual>50' --sfilter '!s.empty? and s.dp>20' --eval '[r.chrom,r.pos,r.pos+1]' > test.bed - ``` bio-vcf comes with a sensible parser definition language (it is 100% Ruby), as well as primitives for set analysis. Few assumptions are made about the actual contents of the VCF file (field @@ -182,10 +195,17 @@ Most filter and eval commands can be used at the same time. Special set commands exit for filtering and eval. When a set is defined, based on the sample name, you can apply filters on the samples inside the set, outside the set and over all samples. E.g. +So, why would you use bio-vcf instead of rolling out your own +Perl/Python/other ad-hoc script? I think the reason should be that +there is less chance of mistakes because of Bio-vcf's clear filtering +language and sensible built-in validation. The second reason would be +speed. Bio-vcf's multi-threading capability gives it great and hard to +replicate performance. + Also note you can use [bio-table](https://github.com/pjotrp/bioruby-table) to filter/transform data further and convert to other formats, such as RDF. @@ -200,11 +220,11 @@ the source code. It is not hard to add features. Otherwise, send a short example of a VCF statement you need to work on. ## Installation -Note that you need Ruby 1.9.3 or later. The 2.x Ruby series also give +Note that you need Ruby 2.x or later. The 2.x Ruby series also give a performance improvement. Bio-vcf will show the Ruby version when typing the command 'bio-vcf -h'. To intall bio-vcf with gem: @@ -369,20 +389,21 @@ ``` Note that only valid method names in lower case get picked up this way. Also by convention normal is sample 1 and tumor is sample 2. -Even shorter r is an alias for rec (nyi) +Even shorter r is an alias for rec ```sh bio-vcf --eval "r.original.gt" < file.vcf bio-vcf --eval "r.s3t2.dp" < file.vcf ``` ## Special functions -Note: special functions are not yet implemented! +Note: special functions are not yet implemented! Look below +for genotype processing which has indexing in 'gti'. Sometime you want to use a special function in a filter. For example percentage variant reads can be defined as [a,c,g,t] with frequencies against sample read depth (dp) as [0,0.03,0.47,0.50]. Filtering would with a special function, @@ -438,27 +459,32 @@ bio-vcf allows for set analysis. With the complement filter, for example, samples are selected that evaluate to true, all others should evaluate to false. For this we create three filters, one for all samples that are included (the --ifilter or -if), for all samples that are excluded (the --efilter or -ef) and for any sample (the --sfilter -or -sf). So i=include, e=exclude and s=any sample. +or -sf). So i=include (OR filter), e=exclude and s=any sample (AND +filter). The equivalent of the union filter is by using the --sfilter, so ```sh bio-vcf --sfilter 's.dp>20' ``` -Filters DP on all samples. To filter on a subset you can add a +Filters DP on all samples and is true if all samples match the +criterium (AND). To filter on a subset you can add a selector ```sh bio-vcf --sfilter-samples 0,1,4 --sfilter 's.dp>20' ``` -For set analysis there are the additional ifilter (include) and efilter (exclude). To filter -on samples 0,1,4 and output the gq values +For set analysis there are the additional ifilter (include) and +efilter (exclude). Where sfilter represents an ALL match, the ifilter +represents an ANY match, i.e., it is true if one of the samples +matches the criterium (OR). To filter on samples 0,1,4 and output the gq +values ```sh bio-vcf -i --ifilter-samples 0,1,4 --ifilter 's.gq<10 or s.gq==99' --seval s.gq 1 14907 99 99 99 99 99 99 99 1 14930 99 99 99 99 99 99 99 @@ -492,12 +518,14 @@ ```sh bio-vcf -i --ifilter-samples 0,1,4 --ifilter 's.gt==rec.s1t1.gt and s.gq>10' --seval s.gq --efilter 's.gq==99' ``` Etc. etc. Any combination of sfilter, ifilter and efilter is possible. +Currently the efilter is an ALL filter (AND), i.e. all excluded +samples need to match the criterium. -The following are not yet implemented: +The following regular expression matches are not yet implemented: In the near future it is also possible to select samples on a regex (here select all samples where the name starts with s3) ```sh @@ -558,35 +586,43 @@ 1 10694 G G 1 12783 G G G G G G G 1 15274 G G G G G G G ``` +where gts represents the indexed genotype on [ref] + [alt]. + These values can also be used in filters and output allele depth, for example ```ruby bio-vcf -vi --ifilter 'rec.original.gt!="0/1"' --efilter 'rec.original.gt=="0/0"' --seval 'rec.original.ad[s.gti[1]]' 1 10257 151 151 151 151 151 8 151 1 13302 26 10 10 10 10 10 10 1 13757 47 47 4 47 47 4 47 ``` -The following does not yet work (using the gti in a sample directly) +You can use the genotype index gti to fetch values from, for example, +allele depth: ```ruby bio-vcf -vi --ifilter 'rec.original.gt!="0/1"' --efilter 'rec.original.gti[0]==0' --seval 'rec.original.ad[s.gti[1]]' + +1 10257 151 151 151 151 151 8 151 +1 13302 26 10 10 10 10 10 10 +1 13757 47 47 4 47 47 4 47 ``` + ## Modify VCF files Add or modify the sample file name in the INFO fields: ```sh bio-vcf --rewrite 'rec.info["sample"]="mytest"' < mytest.vcf ``` -To remove/select 3 samples and create a new file: +To remove/select 3 samples: ```sh bio-vcf --samples 0,1,3 < mytest.vcf ``` @@ -612,15 +648,54 @@ ``` Similarly for GoNL ```ruby -bio-vcf --id gonl --rdf --tags '{"db:evs" => true, "seq:freq" => rec.info.af }' < GoNL.vcf +bio-vcf --id gonl --rdf --tags '{"db:gonl" => true, "seq:freq" => rec.info.af }' < GoNL.vcf ``` +or without AF + + +```ruby +bio-vcf --id gonl --rdf --tags '{"db:gonl" => true, "seq:freq" => (rec.info.ac.to_f/rec.info.an).round(2) }' < gonl_germline_overlap_r4.vcf +``` + + + Also check out [bio-table](https://github.com/pjotrp/bioruby-table) to convert tabular data to RDF. +## Statistics + +Simple statistics are available for REF>ALT changes: + +```sh +./bin/bio-vcf -v --statistics < test/data/input/dbsnp.vcf +``` + + ## ==== Statistics ================================== + G>A 59 45% + C>T 30 23% + A>G 5 4% + C>G 5 4% + C>A 5 4% + G>T 4 3% + T>C 4 3% + G>C 4 3% + T>A 3 2% + A>C 3 2% + A>T 2 2% + GTCCGACCGCTCC>G 1 1% + CGACCGCTCC>C 1 1% + T>TGGAGC 1 1% + C>CGTCTTCA 1 1% + TG>T 1 1% + AC>A 1 1% + + Total 130 + ## ================================================== + ## Other examples For more examples see the feature [section](https://github.com/pjotrp/bioruby-vcf/tree/master/features). ## API @@ -651,9 +726,12 @@ # end ``` ## Trouble shooting + +Note that Ruby 2.x is required for Bio-vcf. JRuby works, but only +in single threaded mode (for now). The multi-threading creates temporary files using the system TMPDIR. This behaviour can be overridden by setting the environment variable. Also, for genome-wide sequencing it may be useful to increase --thread-lines to a value larger than 1_000_000.