# parse_fasta # [![Gem Version](https://badge.fury.io/rb/parse_fasta.svg)](http://badge.fury.io/rb/parse_fasta) [![Build Status](https://travis-ci.org/mooreryan/parse_fasta.svg?branch=master)](https://travis-ci.org/mooreryan/parse_fasta) [![Coverage Status](https://coveralls.io/repos/mooreryan/parse_fasta/badge.svg)](https://coveralls.io/r/mooreryan/parse_fasta) So you want to parse a fasta file... ## Installation ## Add this line to your application's Gemfile: gem 'parse_fasta' And then execute: $ bundle Or install it yourself as: $ gem install parse_fasta ## Overview ## Provides nice, programmatic access to fasta and fastq files, as well as providing Sequence and Quality helper classes. It's more lightweight than BioRuby. And more fun! ;) ## Documentation ## Checkout [parse_fasta docs](http://rubydoc.info/gems/parse_fasta/1.6.0/frames) for the full api documentation. ## Usage ## Some examples... A little script to print header and length of each record. require 'parse_fasta' FastaFile.open(ARGV[0]).each_record do |header, sequence| puts [header, sequence.length].join("\t") end And here, a script to calculate GC content: FastaFile.open(ARGV[0]).each_record do |header, sequence| puts [header, sequence.gc].join("\t") end Now we can parse fastq files as well! FastqFile.open(ARGV[0]).each_record do |head, seq, desc, qual| puts [header, qual.qual_scores.join(',')].join("\t") end What if you don't care if the input is a fastA or a fastQ? No problem! SeqFile.open(ARGV[0]).each_record do |head, seq| puts [header, seq].join "\t" end ## Versions ## ### 1.6 ### Added `SeqFile` class, which accepts either fastA or fastQ files. It uses FastaFile and FastqFile internally. You can use this class if you want your scripts to accept either fastA or fastQ files. If you need the description and quality string, you should use FastqFile instead. ### 1.5 ### Now accepts gzipped files. Huzzah! ### 1.4 ### Added methods: Sequence.base_counts Sequence.base_frequencies ### 1.3 ### Add additional functionality to `each_record` method. #### Info #### I often like to use the fasta format for other things like so >fruits pineapple pear peach >veggies peppers parsnip peas rather than having this in a two column file like this fruit,pineapple fruit,pear fruit,peach veggie,peppers veggie,parsnip veggie,peas So I added functionality to `each_record` to keep each line a record separate in an array. Here's an example using the above file. info = [] FastaFile.open(f, 'r').each_record(1) do |header, lines| info << [header, lines] end Then info will contain the following arrays ['fruits', ['pineapple', 'pear', 'peach']], ['veggies', ['peppers', 'parsnip', 'peas']] ### 1.2 ### Added `mean_qual` method to the `Quality` class. ### 1.1.2 ### Dropped Ruby requirement to 1.9.3 (Note, if you want to build the docs with yard and you're using Ruby 1.9.3, you may have to install the redcarpet gem.) ### 1.1 ### Added: Fastq and Quality classes ### 1.0 ### Added: Fasta and Sequence classes Removed: File monkey patch ### 0.0.5 ### Last version with File monkey patch. ## Benchmark ## Perhaps this isn't exactly fair since `BioRuby` is a big module with lots of features and error checking, whereas `parse_fasta` is meant to be lightweight and easy to use for my own research. Oh well ;) ### FastaFile#each_record ### You're probably wondering...How does it compare to BioRuby in some super accurate benchmarking tests? Lucky for you, I calculated sequence length for each fasta record with both the `each_record` method from this gem and using the `FastaFormat` class from BioRuby. You can see the test script in `benchmark.rb`. The test file contained 2,009,897 illumina reads and the file size was 1.1 gigabytes. Here are the results from Ruby's `Benchmark` class: user system total real parse_fasta 64.530000 1.740000 66.270000 ( 67.081502) bioruby 116.250000 2.260000 118.510000 (120.223710) Hot dog! It's faster :) ### FastqFile#each_record ### The same sequence length test as above, but this time with a fastq file containing 4,000,000 illumina reads. user system total real this_fastq 62.610000 1.660000 64.270000 ( 64.389408) bioruby_fastq 165.500000 2.100000 167.600000 (167.969636) ### Sequence#gc ### The test is done on random strings matcing `/[AaCcTtGgUu]/`. `this_gc` is `Sequence.new(str).gc`, and `bioruby_gc` is `Bio::Sequence::NA.new(str).gc_content`. To see how the methods scales, the test 1 string was 2,000,000 bases, test 2 was 4,000,000 and test 3 was 8,000,000 bases. user system total real this_gc 1 0.030000 0.000000 0.030000 ( 0.029145) bioruby_gc 1 2.030000 0.010000 2.040000 ( 2.157512) this_gc 2 0.060000 0.000000 0.060000 ( 0.059408) bioruby_gc 2 4.060000 0.020000 4.080000 ( 4.334159) this_gc 3 0.120000 0.000000 0.120000 ( 0.185434) bioruby_gc 3 8.060000 0.020000 8.080000 ( 8.659071) Nice! Troll: "But Ryan, when will you find the GC of an 8,000,000 base sequence?" Me: "Step off, troll!" ## Test suite & docs ## For a good time, you could clone this repo and run the test suite with rspec! Or if you just don't trust that it works like it should. The specs probably need a little clean up...so fork it and clean it up ;) Same with the docs. Clone the repo and build them yourself with `yard` if you are in need of some excitement. ## Notes ## Only the `SeqFile` class actually checks to make sure that you passed in a "proper" fastA or fastQ file, so watch out.