# parse_fasta # So you want to parse a fasta file... ## Installation ## Add this line to your application's Gemfile: gem 'parse_fasta' And then execute: $ bundle Or install it yourself as: $ gem install parse_fasta ## Overview ## I wanted a simple, fast way to parse fasta files so I wouldn't have to keep writing annoying boilerplate fasta parsing code everytime I go to do something with one. I will probably add more, but likely only tasks that I find myself doing over and over. ## Usage ## ### Version 1.0.0 (current) ### The monkey patch of the `File` class is no more! Here is the new print length example: require 'parse_fasta' FastaFile.open(ARGV.first, 'r').each_record do |header, sequence| puts [header, sequence.length].join("\t") end And here, a script to calculate GC content: require 'parse_fasta' FastaFile.open(ARGV.first, 'r').each_record do |header, sequence| puts [header, sequence.gc].join("\t") end ### Version 0.0.5 (old) ### An example that lists the length for each sequence. (Won't work in version 1.0.0) require 'parse_fasta' File.open(ARGV.first, 'r').each_record do |header, sequence| puts [header, sequence.length].join("\t") end ## Benchmark ## Take these with a grain of salt since `BioRuby` is a heavy weight module with lots of features and error checking, whereas `parse_fasta` is meant to be lightweight and easy to use for my own coding. ### FastaFile#each_record ### Just for fun, I wanted to compare the execution time to that of BioRuby. I calculated sequence length for each fasta record with both the `each_record` method from this gem and using the `FastaFormat` class from BioRuby. You can see the test script in `benchmark.rb`. The test file contained 2,009,897 illumina reads and the file size was 1.1 gigabytes. Here are the results from Ruby's `Benchmark` class: user system total real parse_fasta 64.530000 1.740000 66.270000 ( 67.081502) bioruby 116.250000 2.260000 118.510000 (120.223710) I just wanted a nice, clean way to parse fasta files, but being nearly twice as fasta as BioRuby doesn't hurt either! ### Sequence#gc ### I played around with a few different implementations for the `#gc` method and found this one to be the fastest. The test is done one random strings mating `/[AaCcTtGgUu]/`. `this_gc` is `Sequence.new(str).gc`, and `bioruby_gc` is `Bio::Sequence::NA.new(str).gc_content`. To see how the methods scale, the test 1 string was 2,000,000 bases, test 2 was 4,000,000 and test 3 was 8,000,000 bases. user system total real this_gc 1 0.030000 0.000000 0.030000 ( 0.029145) bioruby_gc 1 2.030000 0.010000 2.040000 ( 2.157512) this_gc 2 0.060000 0.000000 0.060000 ( 0.059408) bioruby_gc 2 4.060000 0.020000 4.080000 ( 4.334159) this_gc 3 0.120000 0.000000 0.120000 ( 0.185434) bioruby_gc 3 8.060000 0.020000 8.080000 ( 8.659071) Nice! ## Notes ## Currently in doesn't check whether your file is actually a fasta file or anything, so watch out.