README.md in parse_fasta-1.5.2 vs README.md in parse_fasta-1.6.0

- old
+ new

@@ -1,8 +1,8 @@ # parse_fasta # -[![Gem Version](https://badge.fury.io/rb/parse_fasta.svg)](http://badge.fury.io/rb/parse_fasta) +[![Gem Version](https://badge.fury.io/rb/parse_fasta.svg)](http://badge.fury.io/rb/parse_fasta) [![Build Status](https://travis-ci.org/mooreryan/parse_fasta.svg?branch=master)](https://travis-ci.org/mooreryan/parse_fasta) [![Coverage Status](https://coveralls.io/repos/mooreryan/parse_fasta/badge.svg)](https://coveralls.io/r/mooreryan/parse_fasta) So you want to parse a fasta file... ## Installation ## @@ -18,60 +18,73 @@ $ gem install parse_fasta ## Overview ## -I wanted a simple, fast way to parse fasta and fastq files so I -wouldn't have to keep writing annoying boilerplate parsing code -everytime I go to do something with a fasta or fastq file. I will -probably add more, but likely only tasks that I find myself doing over -and over. +Provides nice, programmatic access to fasta and fastq files, as well +as providing Sequence and Quality helper classes. It's more +lightweight than BioRuby. And more fun! ;) ## Documentation ## Checkout -[parse_fasta docs](http://rubydoc.info/gems/parse_fasta/1.5.0/frames) -to see the full documentation. +[parse_fasta docs](http://rubydoc.info/gems/parse_fasta/1.6.0/frames) +for the full api documentation. ## Usage ## Some examples... A little script to print header and length of each record. require 'parse_fasta' - FastaFile.open(ARGV.first, 'r').each_record do |header, sequence| + FastaFile.open(ARGV[0]).each_record do |header, sequence| puts [header, sequence.length].join("\t") end And here, a script to calculate GC content: - FastaFile.open(ARGV.first, 'r').each_record do |header, sequence| + FastaFile.open(ARGV[0]).each_record do |header, sequence| puts [header, sequence.gc].join("\t") end Now we can parse fastq files as well! - FastqFile.open(ARGV.first, 'r').each_record do |head, seq, desc, qual| - puts [header, seq, desc, qual.qual_scores.join(',')].join("\t") + FastqFile.open(ARGV[0]).each_record do |head, seq, desc, qual| + puts [header, qual.qual_scores.join(',')].join("\t") end +What if you don't care if the input is a fastA or a fastQ? No problem! + + SeqFile.open(ARGV[0]).each_record do |head, seq| + puts [header, seq].join "\t" + end + ## Versions ## -### 1.5.0 ### +### 1.6 ### +Added `SeqFile` class, which accepts either fastA or fastQ files. It +uses FastaFile and FastqFile internally. You can use this class if you +want your scripts to accept either fastA or fastQ files. + +If you need the description and quality string, you should use +FastqFile instead. + +### 1.5 ### + Now accepts gzipped files. Huzzah! -### 1.4.0 ### +### 1.4 ### Added methods: Sequence.base_counts Sequence.base_frequencies -### 1.3.0 ### +### 1.3 ### Add additional functionality to `each_record` method. #### Info #### @@ -106,26 +119,26 @@ Then info will contain the following arrays ['fruits', ['pineapple', 'pear', 'peach']], ['veggies', ['peppers', 'parsnip', 'peas']] -### 1.2.0 ### +### 1.2 ### Added `mean_qual` method to the `Quality` class. ### 1.1.2 ### Dropped Ruby requirement to 1.9.3 (Note, if you want to build the docs with yard and you're using Ruby 1.9.3, you may have to install the redcarpet gem.) -### 1.1.0 ### +### 1.1 ### Added: Fastq and Quality classes -### 1.0.0 ### +### 1.0 ### Added: Fasta and Sequence classes Removed: File monkey patch @@ -133,30 +146,30 @@ Last version with File monkey patch. ## Benchmark ## -Take these with a grain of salt since `BioRuby` is a big module -module with lots of features and error checking, whereas `parse_fasta` -is meant to be lightweight and easy to use for my own research. +Perhaps this isn't exactly fair since `BioRuby` is a big module with +lots of features and error checking, whereas `parse_fasta` is meant to +be lightweight and easy to use for my own research. Oh well ;) ### FastaFile#each_record ### -Just for fun, I wanted to compare the execution time to that of -BioRuby. I calculated sequence length for each fasta record with both -the `each_record` method from this gem and using the `FastaFormat` -class from BioRuby. You can see the test script in `benchmark.rb`. +You're probably wondering...How does it compare to BioRuby in some +super accurate benchmarking tests? Lucky for you, I calculated +sequence length for each fasta record with both the `each_record` +method from this gem and using the `FastaFormat` class from +BioRuby. You can see the test script in `benchmark.rb`. The test file contained 2,009,897 illumina reads and the file size was 1.1 gigabytes. Here are the results from Ruby's `Benchmark` class: user system total real parse_fasta 64.530000 1.740000 66.270000 ( 67.081502) bioruby 116.250000 2.260000 118.510000 (120.223710) -I just wanted a nice, clean way to parse fasta files, but being nearly -twice as fasta as BioRuby doesn't hurt either! +Hot dog! It's faster :) ### FastqFile#each_record ### The same sequence length test as above, but this time with a fastq file containing 4,000,000 illumina reads. @@ -165,18 +178,15 @@ this_fastq 62.610000 1.660000 64.270000 ( 64.389408) bioruby_fastq 165.500000 2.100000 167.600000 (167.969636) ### Sequence#gc ### -I played around with a few different implementations for the `#gc` -method and found this one to be the fastest. - -The test is done on random strings mating `/[AaCcTtGgUu]/`. `this_gc` +The test is done on random strings matcing `/[AaCcTtGgUu]/`. `this_gc` is `Sequence.new(str).gc`, and `bioruby_gc` is `Bio::Sequence::NA.new(str).gc_content`. -To see how the methods scale, the test 1 string was 2,000,000 bases, +To see how the methods scales, the test 1 string was 2,000,000 bases, test 2 was 4,000,000 and test 3 was 8,000,000 bases. user system total real this_gc 1 0.030000 0.000000 0.030000 ( 0.029145) bioruby_gc 1 2.030000 0.010000 2.040000 ( 2.157512) @@ -187,9 +197,23 @@ this_gc 3 0.120000 0.000000 0.120000 ( 0.185434) bioruby_gc 3 8.060000 0.020000 8.080000 ( 8.659071) Nice! +Troll: "But Ryan, when will you find the GC of an 8,000,000 base +sequence?" + +Me: "Step off, troll!" + +## Test suite & docs ## + +For a good time, you could clone this repo and run the test suite with +rspec! Or if you just don't trust that it works like it should. The +specs probably need a little clean up...so fork it and clean it up ;) + +Same with the docs. Clone the repo and build them yourself with `yard` +if you are in need of some excitement. + ## Notes ## -Currently in doesn't check whether your file is actually a fasta file -or anything, so watch out. +Only the `SeqFile` class actually checks to make sure that you passed +in a "proper" fastA or fastQ file, so watch out.