# parse_fasta #

[![Gem Version](https://badge.fury.io/rb/parse_fasta.svg)](http://badge.fury.io/rb/parse_fasta) [![Build Status](https://travis-ci.org/mooreryan/parse_fasta.svg?branch=master)](https://travis-ci.org/mooreryan/parse_fasta) [![Coverage Status](https://coveralls.io/repos/mooreryan/parse_fasta/badge.svg)](https://coveralls.io/r/mooreryan/parse_fasta)

So you want to parse a fasta file...

## Installation ##

Add this line to your application's Gemfile:

    gem 'parse_fasta'

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install parse_fasta

## Overview ##

Provides nice, programmatic access to fasta and fastq files, as well
as providing Sequence and Quality helper classes. It's more
lightweight than BioRuby. And more fun! ;)

## Documentation ##

Checkout
[parse_fasta docs](http://rubydoc.info/gems/parse_fasta)
for the full api documentation.

## Usage ##

Some examples...

A little script to print header and length of each record.

	require 'parse_fasta'

	FastaFile.open(ARGV[0]).each_record do |header, sequence|
	  puts [header, sequence.length].join("\t")
	end

And here, a script to calculate GC content:

	FastaFile.open(ARGV[0]).each_record do |header, sequence|
	  puts [header, sequence.gc].join("\t")
	end

Now we can parse fastq files as well!

	FastqFile.open(ARGV[0]).each_record do |head, seq, desc, qual|
	  puts [header, qual.qual_scores.join(',')].join("\t")
	end

What if you don't care if the input is a fastA or a fastQ? No problem!

	SeqFile.open(ARGV[0]).each_record do |head, seq|
	  puts [header, seq].join "\t"
	end

Read fasta file into a hash.

    seqs = FastaFile.open(ARGV[0]).to_hash

## Versions ##

### 1.9.2 ###

Speed up fastA `each_record` and `each_record_fast`.

### 1.9.1 ###

Speed up fastQ `each_record` and `each_record_fast`. Courtesy of
[Matthew Ralston](https://github.com/MatthewRalston).

### 1.9.0 ###

Added "fast" versions of `each_record` methods
(`each_record_fast`). Basically, they return sequences and quality
strings as Ruby `Sring` objects instead of aa `Sequence` or `Quality`
objects. Also, if the sequence or quality string has spaces, they will
be retained. If this is a problem, use the original `each_record`
methods.

### 1.8.2 ###

Speed up `FastqFile#each_record`.

### 1.8.1 ###

An error will be raised if a fasta file has a `>` in the
sequence. Sometimes files are not terminated with a newline
character. If this is the case, then catting two fasta files will
smush the first header of the second file right in with the last
sequence of the first file. This is bad, raise an error! ;)

Example

    >seq1
    ACTG>seq2
    ACTG
    >seq3
    ACTG

This will raise `ParseFasta::SequenceFormatError`.

Also, headers with lots of `>` within are fine now.

### 1.8 ###

Add `Sequence#rev_comp`. It can handle IUPAC characters. Since
`parse_fasta` doesn't check whether the seq is AA or NA, if called on
an amino acid string, things will get weird as it will complement the
IUPAC characters in the AA string and leave others.

### 1.7.2 ###

Strip spaces (not all whitespace) from `Sequence` and `Quality` strings.

Some alignment fastas have spaces for easier reading. Strip these
out. For consistency, also strips spaces from `Quality` strings. If
there are spaces that don't match in the quality and sequence in a
fastQ file, then things will get messed up in the FastQ file. FastQ
shouldn't have spaces though.

### 1.7 ###

Add `SeqFile#to_hash`, `FastaFile#to_hash` and `FastqFile#to_hash`.

### 1.6.2 ###

`FastaFile::open` now raises a `ParseFasta::DataFormatError` when passed files
that don't begin with a `>`.

### 1.6.1 ###

Better internal handling of empty sequences -- instead of raising
errors, pass empty sequences.

### 1.6 ###

Added `SeqFile` class, which accepts either fastA or fastQ files. It
uses FastaFile and FastqFile internally. You can use this class if you
want your scripts to accept either fastA or fastQ files.

If you need the description and quality string, you should use
FastqFile instead.

### 1.5 ###

Now accepts gzipped files. Huzzah!

### 1.4 ###

Added methods:

    Sequence.base_counts
	Sequence.base_frequencies

### 1.3 ###

Add additional functionality to `each_record` method.

#### Info ####

I often like to use the fasta format for other things like so

	>fruits
	pineapple
	pear
	peach
	>veggies
	peppers
	parsnip
	peas

rather than having this in a two column file like this

	fruit,pineapple
	fruit,pear
	fruit,peach
	veggie,peppers
	veggie,parsnip
	veggie,peas

So I added functionality to `each_record` to keep each line a record
separate in an array. Here's an example using the above file.

    info = []
	FastaFile.open(f, 'r').each_record(1) do |header, lines|
	  info << [header, lines]
	end

Then info will contain the following arrays

	['fruits', ['pineapple', 'pear', 'peach']],
	['veggies', ['peppers', 'parsnip', 'peas']]

### 1.2 ###

Added `mean_qual` method to the `Quality` class.

### 1.1.2 ###

Dropped Ruby requirement to 1.9.3

(Note, if you want to build the docs with yard and you're using
Ruby 1.9.3, you may have to install the redcarpet gem.)

### 1.1 ###

Added: Fastq and Quality classes

### 1.0 ###

Added: Fasta and Sequence classes

Removed: File monkey patch

### 0.0.5 ###

Last version with File monkey patch.

## Benchmark ##

Some quick and dirty benchmarks against `BioRuby`.

### FastaFile#each_record ###

You can see the test script in `benchmark.rb`.

                           user     system      total        real
    parse_fasta        1.920000   0.160000   2.080000 (  2.145932)
    parse_fasta fast   1.210000   0.160000   1.370000 (  1.377770)
    bioruby            4.330000   0.290000   4.620000 (  4.655567)

Hot dog! It's faster :)

## Notes ##

Only the `SeqFile` class actually checks to make sure that you passed
in a "proper" fastA or fastQ file, so watch out.