# parse_fasta #

So you want to parse a fasta file...

## Installation ##

Add this line to your application's Gemfile:

    gem 'parse_fasta'

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install parse_fasta

## Overview ##

I wanted a simple, fast way to parse fasta and fastq files so I
wouldn't have to keep writing annoying boilerplate parsing code
every time I go to do something with a fasta or fastq file. I will
probably add more, but likely only tasks that I find myself doing over
and over.

## Documentation ##

Check out
[parse_fasta docs](http://rubydoc.info/gems/parse_fasta/1.1.0/frames)
to see the full documentation.

## Usage ##

Some examples...

A little script to print header and length of each record.

    require 'parse_fasta'

    FastaFile.open(ARGV.first, 'r').each_record do |header, sequence|
      puts [header, sequence.length].join("\t")
    end

And here, a script to calculate GC content:

    require 'parse_fasta'

    FastaFile.open(ARGV.first, 'r').each_record do |header, sequence|
      puts [header, sequence.gc].join("\t")
    end

Now we can parse fastq files as well!

    require 'parse_fasta'

    FastqFile.open(ARGV.first, 'r').each_record do |head, seq, desc, qual|
      puts [head, seq, desc, qual.qual_scores.join(',')].join("\t")
    end

## Versions ##

### 1.2.0 ###

Added `mean_qual` method to the `Quality` class.

### 1.1.2 ###

Dropped Ruby requirement to 1.9.3.

(Note, if you want to build the docs with yard and you're using
Ruby 1.9.3, you may have to install the redcarpet gem.)

### 1.1.0 ###

Added: Fastq and Quality classes

### 1.0.0 ###

Added: Fasta and Sequence classes

Removed: File monkey patch

### 0.0.5 ###

Last version with File monkey patch.

## Benchmark ##

Take these with a grain of salt since `BioRuby` is a big module with
lots of features and error checking, whereas `parse_fasta` is meant to
be lightweight and easy to use for my own research.

### FastaFile#each_record ###

Just for fun, I wanted to compare the execution time to that of
BioRuby.
I calculated sequence length for each fasta record with both the
`each_record` method from this gem and using the `FastaFormat` class
from BioRuby. You can see the test script in `benchmark.rb`.

The test file contained 2,009,897 Illumina reads and the file size
was 1.1 gigabytes. Here are the results from Ruby's `Benchmark` class:

                      user     system      total        real
    parse_fasta  64.530000   1.740000  66.270000 ( 67.081502)
    bioruby     116.250000   2.260000 118.510000 (120.223710)

I just wanted a nice, clean way to parse fasta files, but being nearly
twice as fast as BioRuby doesn't hurt either!

### FastqFile#each_record ###

The same sequence length test as above, but this time with a fastq
file containing 4,000,000 Illumina reads.

                        user     system      total        real
    this_fastq     62.610000   1.660000  64.270000 ( 64.389408)
    bioruby_fastq 165.500000   2.100000 167.600000 (167.969636)

### Sequence#gc ###

I played around with a few different implementations for the `#gc`
method and found this one to be the fastest.

The test is done on random strings matching `/[AaCcTtGgUu]/`. `this_gc`
is `Sequence.new(str).gc`, and `bioruby_gc` is
`Bio::Sequence::NA.new(str).gc_content`.

To see how the methods scale, the test 1 string was 2,000,000 bases,
test 2 was 4,000,000 bases, and test 3 was 8,000,000 bases.

                       user     system      total        real
    this_gc 1       0.030000   0.000000   0.030000 (  0.029145)
    bioruby_gc 1    2.030000   0.010000   2.040000 (  2.157512)
    this_gc 2       0.060000   0.000000   0.060000 (  0.059408)
    bioruby_gc 2    4.060000   0.020000   4.080000 (  4.334159)
    this_gc 3       0.120000   0.000000   0.120000 (  0.185434)
    bioruby_gc 3    8.060000   0.020000   8.080000 (  8.659071)

Nice!

## Notes ##

Currently it doesn't check whether your file is actually a fasta file
or anything, so watch out.
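
Since the gem does no format checking, here is a rough sketch of a check you could run yourself before parsing. `looks_like_fasta?` is a hypothetical helper, not part of `parse_fasta`, and it only assumes that a fasta file's first non-blank line is a header starting with `>`. It won't catch a file that goes wrong further down.

```ruby
# Hypothetical helper, NOT part of parse_fasta.
# Takes anything enumerable that yields lines (an Array of strings, or
# the Enumerator returned by File.foreach(path)) and returns true when
# the first non-blank line starts with '>', the fasta header marker.
def looks_like_fasta?(lines)
  first = lines.find { |line| !line.strip.empty? }
  !first.nil? && first.start_with?('>')
end

# Example use before parsing (ARGV.first is the path from the command line):
#   abort 'not a fasta file' unless looks_like_fasta?(File.foreach(ARGV.first))
```

Passing `File.foreach(path)` means only the lines up to the first non-blank one are read, so the check stays cheap even on multi-gigabyte files.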