maf_extract(1) -- extract blocks from MAF files =============================================== ## SYNOPSIS `maf_extract` -m MAF [-i INDEX] --interval SEQ:START:END [OPTIONS] `maf_extract` -m MAF [-i INDEX] --bed BED [OPTIONS] `maf_extract` -d MAFDIR --interval SEQ:START:END [OPTIONS] `maf_extract` -d MAFDIR --bed BED [OPTIONS] ## DESCRIPTION **maf_extract** extracts alignment blocks from one or more indexed MAF files, according to either a genomic interval specified with `--interval` or multiple intervals given in a BED file specified with `--bed`. It can either match blocks intersecting the specified intervals with `--mode intersect`, the default, or extract slices of them which cover only the specified intervals, with `--mode slice`. Blocks and the sequences they contain can be filtered with a variety of options including `--only-species`, `--with-all-species`, `--min-sequences`, `--min-text-size`, and `--max-text-size`. With the `--join-blocks` option, adjacent parsed blocks can be joined if sequence filtering has removed a species causing them to be separated. The `--remove-gaps` option will remove columns containing only gaps (`-`). Blocks can be output in MAF format, with `--format maf` (the default), or FASTA format, with `--format fasta`. Output can be directed to a file with `--output`. This tool exposes almost all the random-access functionality of the Bio::MAF::Access class. The exception is MAF tiling, which is provided by maf_tile(1). ## FILES A single MAF file can be processed by specifying it with `--maf`. Its accompanying index, created by maf_index(1), is specified with `--index`. If `--maf` is given but no index is specified, the entire file will be parsed to build a temporary in-memory index. This facilitates processing small, transient MAF files. However, on a large file this will incur a great deal of overhead; files expected to be used more than once should be indexed with maf_index(1). Alternatively, a directory of indexed MAF files can be specified with `--maf-dir`; in this case, they will all be used to satisfy queries. ## OPTIONS MAF source options: * `-m`, `--maf MAF`: A single MAF file to process. * `-i`, `--index INDEX`: An index for the file specified with `--maf`, as created by maf_index(1). * `-d`, `--maf-dir DIR`: A directory of indexed MAF files. Extraction options: * `--mode (intersect | slice)`: The extraction mode to use. With `--mode intersect`, any alignment block intersecting the genomic intervals specified will be matched in its entirety. With `--mode slice`, intersecting blocks will be matched in the same way, but columns extending outside the specified interval will be removed. * `--bed BED`: The specified file will be parsed as a BED file, and each interval it contains will be matched in turn. * `--interval SEQ:START:END`: A single zero-based half-open genomic interval will be matched, with sequence identifier , (inclusive) start position , and (exclusive) end position . Output options: * `-f`, `--format (maf | fasta)`: Output will be written in the specified format, either MAF or FASTA. * `-o`, `--output OUT`: Output will be written to the file . Filtering options: * `--only-species (SP1,SP2,SP3 | @FILE)`: Alignment blocks will be filtered to contain only the specified species. These can be given as a comma-separated list or as a file, prefixed with `@`, from which a list of species will be read. * `--with-all-species (SP1,SP2,SP3 | @FILE)`: Only alignment blocks containing all the specified species will be matched. These can be given as a comma-separated list or as a file, prefixed with `@`, from which a list of species will be read. * `--min-sequences N`: Only alignment blocks containing at least sequences will be matched. * `--min-text-size N`: Only alignment blocks with a text size (including gaps) of at least will be matched. * `--max-text-size N`: Only alignment blocks with a text size (including gaps) of at most will be matched. Block processing options: * `--join-blocks`: If sequence filtering with `--only-species` removes a species which caused two adjacent blocks to be separate, this option will join them together into a single alignment block. The filtered blocks must contain the same sequences in contiguous positions and on the same strand. * `--remove-gaps`: If sequence filtering with `--only-species` leaves a block containing columns consisting only of gap characters (`-`), these will be removed. * `--parse-extended`: Parse `i` lines, giving information on the context of sequence lines, and `q` lines, giving quality scores. * `--parse-empty`: Parse `e` lines, indicating cases where a species does not align with the current block but does align with blocks before and after it. Logging options: * `-q`, `--quiet`: Run quietly, with warnings suppressed. * `-v`, `--verbose`: Run verbosely, with additional informational messages. * `--debug`: Log debugging information. ## EXAMPLES TODO ## ENVIRONMENT `maf_index` is a Ruby program and relies on ordinary Ruby environment variables. ## BUGS No provision exists for writing output to multiple files. FASTA description lines are always in the format `>source:start-end`. ## COPYRIGHT `maf_index` is copyright (C) 2012 Clayton Wheeler. ## SEE ALSO ruby(1), maf_index(1), maf_tile(1)