README.md in parse_fasta-1.0.1 vs README.md in parse_fasta-1.1.0

- old
+ new

@@ -16,52 +16,64 @@
     $ gem install parse_fasta

 ## Overview ##

-I wanted a simple, fast way to parse fasta files so I wouldn't have to
-keep writing annoying boilerplate fasta parsing code everytime I go to
-do something with one. I will probably add more, but likely only tasks
-that I find myself doing over and over.
+I wanted a simple, fast way to parse fasta and fastq files so I
+wouldn't have to keep writing annoying boilerplate parsing code
+everytime I go to do something with a fasta or fastq file. I will
+probably add more, but likely only tasks that I find myself doing over
+and over.

-## Usage ##
+## Documentation ##

-### Version 1.0.0 (current) ###
+Checkout [parse_fasta docs](http://rubydoc.info/gems/parse_fasta/1.1.0/frames) to see
+the full documentation.

-The monkey patch of the `File` class is no more! Here is the new print
-length example:
+## Usage ##

+A little script to print header and length of each record.
+
     require 'parse_fasta'

     FastaFile.open(ARGV.first, 'r').each_record do |header, sequence|
       puts [header, sequence.length].join("\t")
     end

 And here, a script to calculate GC content:

-    require 'parse_fasta'
-
     FastaFile.open(ARGV.first, 'r').each_record do |header, sequence|
       puts [header, sequence.gc].join("\t")
     end

-### Version 0.0.5 (old) ###
+Now we can parse fastq files as well!

-An example that lists the length for each sequence. (Won't work in
-version 1.0.0)
+    FastqFile.open(ARGV.first, 'r').each_record do |head, seq, desc, qual|
+      puts [header, seq, desc, qual.qual_scores.join(',')].join("\t")
+    end

-    require 'parse_fasta'
+## Versions ##

-    File.open(ARGV.first, 'r').each_record do |header, sequence|
-      puts [header, sequence.length].join("\t")
-    end
+### 1.1.0 ###

+Added: Fastq and Quality classes
+
+### 1.0.0 ###
+
+Added: Fasta and Sequence classes
+
+Removed: File monkey patch
+
+### 0.0.5 ###
+
+Last version with File monkey patch.
+
 ## Benchmark ##

-Take these with a grain of salt since `BioRuby` is a heavy weight
+Take these with a grain of salt since `BioRuby` is a big module
 module with lots of features and error checking, whereas `parse_fasta`
-is meant to be lightweight and easy to use for my own coding.
+is meant to be lightweight and easy to use for my own research.

 ### FastaFile#each_record ###

 Just for fun, I wanted to compare the execution time to that of
 BioRuby. I calculated sequence length for each fasta record with both
@@ -76,15 +88,24 @@
     bioruby 116.250000   2.260000 118.510000 (120.223710)

 I just wanted a nice, clean way to parse fasta files, but being nearly
 twice as fasta as BioRuby doesn't hurt either!

+### FastqFile#each_record ###
+
+The same sequence length test as above, but this time with a fastq
+file containing 4,000,000 illumina reads.
+
+                       user     system      total        real
+    this_fastq     62.610000   1.660000  64.270000 ( 64.389408)
+    bioruby_fastq 165.500000   2.100000 167.600000 (167.969636)
+
 ### Sequence#gc ###

 I played around with a few different implementations for the `#gc`
 method and found this one to be the fastest.

-The test is done one random strings mating `/[AaCcTtGgUu]/`. `this_gc`
+The test is done on random strings mating `/[AaCcTtGgUu]/`. `this_gc`
 is `Sequence.new(str).gc`, and `bioruby_gc` is
 `Bio::Sequence::NA.new(str).gc_content`.

 To see how the methods scale, the test 1 string was 2,000,000 bases,
 test 2 was 4,000,000 and test 3 was 8,000,000 bases.