= bio-gff3

GFF3 parser, aimed at parsing big data GFF3 to return sequences of any type,
including assembled mRNA, protein and CDS sequences.

Features:

# Take GFF3 (genome browser) information of any type, and assemble sequences, e.g. mRNA and CDS
# Options for low memory use and caching of records
# Support for external FASTA input files
# Use of multi-cores (NYI, not yet implemented)

Currently the output is a FASTA file.

You can use this plugin in two ways: as a standalone program, or as a plugin
library for BioRuby.

== Install and run gff3-fetch

After installing Ruby 1.9 or later, install the gem with RubyGems:

  gem install bio-gff3

Then, fetch mRNA and CDS information from GFF3 files and output to FASTA:

  gff3-fetch mrna test/data/gff/test.gff3
  gff3-fetch cds test/data/gff/test.gff3

== Development

To use the library

  require 'bio-gff3'

For coding examples see ./bin/gff3-fetch and the ./spec/*rb

You can run RSpecs with something like

  rspec -I ../bioruby/lib/ spec/*.rb

(supposing you are referring to a BioRuby source repository)

This implementation depends on BioRuby's basic GFF3 parser, with the possible
advantage that the plugin can assemble sequences, is faster and does not
consume all memory. The GFF3 specs are based on the output of the Wormbase
genome browser.

== See also

  gff3-fetch --help

For a write-up see http://thebird.nl/bioruby/BioRuby_GFF3.html

-------------------------------------------------------------------------------

== Copyright

Copyright (C) 2010,2011 Pjotr Prins

== Usage

Fetch and assemble GFF3 types (e.g. ORF, mRNA, CDS) + print in FASTA format.

  gff3-fetch [--low-mem] [--validate] type [filename.fa] filename.gff3

Where (NYI == Not Yet Implemented):

  --translate      : output as amino acid sequence
  --validate       : validate GFF3 file by translating
  --fix            : check 3-frame translation and fix, if possible
  --fix-wormbase   : fix 3-frame translation on ORFs named 'gene1'
  --no-assemble    : output each record as a sequence -- NYI
  --add-phase      : output records using phase (useful with --no-assemble
                     when translating CDS to AA) -- NYI

type is any valid type in the GFF3 definition. For example:

  mRNA             : assemble mRNA
  CDS              : assemble CDS
  exon             : list all exons
  gene|ORF         : list gene ORFs
  other            : use any type from GFF3 definition, e.g. 'Terminate' -- NYI

and the following performance options:

  --cache full     : load all in RAM (fast)
  --cache none     : do not load anything in memory (slow)
  --low-mem        : use LRU cache (limit RAM use, fast) -- NYI
  --max-cpus num   : use num threads -- NYI
  --emboss         : use EMBOSS translation (fast) -- NYI

Multiple GFF3 files can be used. With external FASTA files, always the last
one before the GFF3 filename is matched.

Note that the above switches are only partially implemented at this stage.
Full feature support is projected for Feb. 2011.

Examples:

Assemble mRNA and CDS information from test.gff3 (which includes sequence
information)

  gff3-fetch mRNA test/data/gff/test.gff3
  gff3-fetch CDS test/data/gff/test.gff3

Find CDS records from an external FASTA file, add phase and translate to
protein sequence

  gff3-fetch --no-assemble --add-phase --translate CDS test/data/gff/MhA1_Contig1133.fa test/data/gff/MhA1_Contig1133.gff3

Find mRNA from an external FASTA file, without loading everything in RAM

  gff3-fetch --cache none mRNA test/data/gff/test-ext-fasta.fa test/data/gff/test-ext-fasta.gff3

Validate GFF3 file using EMBOSS translation and validation

  gff3-fetch --cache none --validate --emboss mRNA test/data/gff/test-ext-fasta.fa test/data/gff/test-ext-fasta.gff3

Find GENEID predicted terminal exons

  gff3-fetch terminal chromosome1.fa geneid.gff3

== Performance

  time gff3-fetch cds m_hapla.WS217.dna.fa m_hapla.WS217.gff3 > test.fa

  Cache     real       user       sys
  ----------------------------------------------------
  full      12m41s     12m28s     0m09s    (0.8.0 Jan. 2011)
  none      504m39s    477m49s    26m50s   (0.8.0 Jan. 2011)
  ----------------------------------------------------

where

  52M  m_hapla.WS217.dna.fa
  456M m_hapla.WS217.gff3

  ruby 1.9.2p136 (2010-12-25 revision 30365) [x86_64-linux]

on an 8 CPU, 2.6 GHz (6MB cache), 16 GB RAM machine.

== Cite

If you use this software, please cite

  http://dx.doi.org/10.1093/bioinformatics/btq475
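
== Example: assembling a CDS by hand

The assembly step named in the feature list (joining the CDS records of one
transcript into a single sequence) can be pictured with a few lines of plain
Ruby. This is a minimal sketch and not the plugin's implementation: it makes no
BioRuby or bio-gff3 calls, and the data (GENOME, GFF3_TEXT) and helper names
(parse_gff3, assemble, to_fasta) are invented for illustration. It assumes that
records sharing a Parent attribute belong to one transcript, and it ignores the
phase column.

```ruby
# Illustrative sketch only: no BioRuby or bio-gff3 API is used here.
# GENOME, GFF3_TEXT and all helper names are hypothetical.

# Hypothetical reference sequences, keyed by GFF3 seqid
GENOME = {
  'ctg1' => 'CCATGGCCGTAAAATGAGGG',
  'ctg2' => 'GTTATCCTTCATG'
}

# Hypothetical GFF3 records; mrna1 is deliberately listed out of order
GFF3_TEXT = <<~GFF
  ##gff-version 3
  ctg1\t.\tCDS\t12\t17\t.\t+\t0\tID=cds2;Parent=mrna1
  ctg1\t.\tCDS\t3\t8\t.\t+\t0\tID=cds1;Parent=mrna1
  ctg2\t.\tCDS\t8\t12\t.\t-\t0\tID=cds3;Parent=mrna2
  ctg2\t.\tCDS\t2\t5\t.\t-\t1\tID=cds4;Parent=mrna2
GFF

Feature = Struct.new(:seqid, :type, :start, :stop, :strand, :parent)

# Parse the nine tab-separated GFF3 columns, skipping comments and directives
def parse_gff3(text)
  lines = text.each_line.map(&:chomp).reject { |l| l.empty? || l.start_with?('#') }
  lines.map do |line|
    seqid, _source, type, start, stop, _score, strand, _phase, attrs = line.split("\t")
    attributes = attrs.split(';').map { |kv| kv.split('=', 2) }.to_h
    Feature.new(seqid, type, start.to_i, stop.to_i, strand, attributes['Parent'])
  end
end

def reverse_complement(seq)
  seq.reverse.tr('ACGTacgt', 'TGCAtgca')
end

# Join all records of one type per parent, in genomic order;
# GFF3 coordinates are 1-based and inclusive
def assemble(features, genome, type)
  features.select { |f| f.type == type }.group_by(&:parent).map do |parent, parts|
    seq = parts.sort_by(&:start).map do |f|
      genome.fetch(f.seqid)[(f.start - 1)..(f.stop - 1)]
    end.join
    seq = reverse_complement(seq) if parts.first.strand == '-'
    [parent, seq]
  end.to_h
end

def to_fasta(seqs)
  seqs.map { |id, seq| ">#{id}\n#{seq}\n" }.join
end

puts to_fasta(assemble(parse_gff3(GFF3_TEXT), GENOME, 'CDS'))
```

Sorting by start position before joining means the order of records in the
file does not matter. For a minus-strand transcript the joined sequence is
reverse-complemented as a whole, which gives the same result as
reverse-complementing each piece and joining them in descending order.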