README.md in parse_fasta-0.0.5 vs README.md in parse_fasta-1.0.0
- old
+ new
@@ -16,29 +16,55 @@
$ gem install parse_fasta
## Overview ##
-Provides the method `#each_record` for the `File` class.
+I wanted a simple, fast way to parse fasta files so I wouldn't have to
+keep writing annoying boilerplate fasta parsing code every time I go to
+do something with one. I will probably add more, but likely only tasks
+that I find myself doing over and over.
- each_record { |header, sequence| block }
+## Usage ##
-The whole file is not loaded into memory, so have no fear of giant
-fasta files!
+### Version 1.0.0 (current) ###
-## Usage ##
+The monkey patch of the `File` class is no more! Here is the new print
+length example:
-An example that lists the length for each sequence.
+ require 'parse_fasta'
+ FastaFile.open(ARGV.first, 'r').each_record do |header, sequence|
+ puts [header, sequence.length].join("\t")
+ end
+
+And here, a script to calculate GC content:
+
+ require 'parse_fasta'
+
+ FastaFile.open(ARGV.first, 'r').each_record do |header, sequence|
+ puts [header, sequence.gc].join("\t")
+ end
+
+### Version 0.0.5 (old) ###
+
+An example that lists the length for each sequence. (Won't work in
+version 1.0.0)
+
require 'parse_fasta'
File.open(ARGV.first, 'r').each_record do |header, sequence|
puts [header, sequence.length].join("\t")
end
## Benchmark ##
+Take these with a grain of salt since `BioRuby` is a heavyweight
+module with lots of features and error checking, whereas `parse_fasta`
+is meant to be lightweight and easy to use for my own coding.
+
+### FastaFile#each_record ###
+
Just for fun, I wanted to compare the execution time to that of
BioRuby. I calculated sequence length for each fasta record with both
the `each_record` method from this gem and using the `FastaFormat`
class from BioRuby. You can see the test script in `benchmark.rb`.
@@ -49,9 +75,33 @@
parse_fasta 64.530000 1.740000 66.270000 ( 67.081502)
bioruby 116.250000 2.260000 118.510000 (120.223710)
I just wanted a nice, clean way to parse fasta files, but being nearly
twice as fasta as BioRuby doesn't hurt either!
+
+### Sequence#gc ###
+
+I played around with a few different implementations for the `#gc`
+method and found this one to be the fastest.
+
+The test is done on random strings matching `/[AaCcTtGgUu]/`. `this_gc`
+is `Sequence.new(str).gc`, and `bioruby_gc` is
+`Bio::Sequence::NA.new(str).gc_content`.
+
+To see how the methods scale, the test 1 string was 2,000,000 bases,
+test 2 was 4,000,000 and test 3 was 8,000,000 bases.
+
+ user system total real
+ this_gc 1 0.030000 0.000000 0.030000 ( 0.029145)
+ bioruby_gc 1 2.030000 0.010000 2.040000 ( 2.157512)
+
+ this_gc 2 0.060000 0.000000 0.060000 ( 0.059408)
+ bioruby_gc 2 4.060000 0.020000 4.080000 ( 4.334159)
+
+ this_gc 3 0.120000 0.000000 0.120000 ( 0.185434)
+ bioruby_gc 3 8.060000 0.020000 8.080000 ( 8.659071)
+
+Nice!
## Notes ##
Currently it doesn't check whether your file is actually a fasta file
or anything, so watch out.
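
Until such a check exists, a caller could guard against non-fasta input before parsing. The sketch below is only an illustration: `looks_like_fasta?` is a hypothetical helper, not part of `parse_fasta`, and it assumes that the first non-blank line of a fasta file is a header starting with `>`.

```ruby
require 'stringio'

# Hypothetical helper, NOT part of parse_fasta.
# Returns true when the first non-blank line of the IO starts with '>',
# the fasta header marker. It does not validate the sequence lines.
def looks_like_fasta?(io)
  io.each_line do |line|
    next if line.strip.empty?      # skip leading blank lines
    return line.start_with?('>')   # decide on the first real line
  end
  false                            # empty input or only blank lines
end

puts looks_like_fasta?(StringIO.new(">seq1\nACTG\n"))
puts looks_like_fasta?(StringIO.new("ACTG\n"))
```

Since the helper accepts any IO object, it should also work on a handle from `File.open`. It consumes the lines it reads, so rewind or reopen the file before parsing.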