scaffolder-format(7) -- scaffolder authoring format using YAML ============================================================== ## SYNOPSIS The scaffolder software allows the construction of genome scaffolds using text files written in the YAML format. An example scaffold file with three entries is as follows: --- - sequence: source: 'sequence1' - unresolved: length: 20 - sequence: source: 'sequence2' start: 3 stop: 100 reverse: true inserts: - source: 'insert1' open: 5 close: 20 ## DESCRIPTION Scaffold files are written using the YAML format. The scaffold file along with the fasta file for the corresponding nucleotide sequences is used by the scaffolder software to generate the scaffold super sequence. The scaffold file should begin with three dashes `---`. This identifies the start of a YAML formatted document. The scaffold file is composed of a list of entries where the sequence for each entries is joined together to create the scaffold sequence. Each scaffold entry should should have a '-' dash line to indicate the start of the entry. The first line of the entry should define the type of entry. The entry types allowed are outlined in the section below. ## SCAFFOLD ENTRY TYPES ### SEQUENCE Sequence entries are the main component of the scaffold and insert the nucleotide sequences from the fasta file into the scaffold. The `sequence` tag is used to specify a sequence entry in the scaffold. The following tags can be used to modify how the sequence is inserted into the scaffold: * `source:` : **Required.** Specify the source nucleotide sequence to insert into the scaffold. The value should match the first space delimited word from the fasta header of the required sequence. * `reverse:` : **Optional.** Specify whether sequence should be reverse completed in the scaffold. Default value is false. * `start:` : **Optional.** Specify if the sequence should be trimmed from the start to the specified coordinate position. Default value is 1 where no sequence is removed. * `stop:` : **Optional.** Specify if the sequence should be trimmed from the end of the sequence. Sequence onward of specified trim position is removed. Default value is the length of the fasta sequence where no sequence is removed. * `inserts:` : **Optional.** Specify if additional insert sequences should be added to replace regions in this sequence. Should be specified as a list of YAML formatted [INSERT][] entries. Default value is an empty list. ### UNRESOLVED Unresolved regions add sequences of N nucleotide characters to the scaffold. They define gaps in the scaffold and join separate sequence sequence regions together. Unresolved regions are specified using the `unresolved` tag. A single attribute tag is used to specify the length of an unresolved region. * `length:` : **Required.** Specify the length of the unresolved region to add to the scaffold. ### INSERTS Insert sequences are used to replace regions in encoding [SEQUENCE][] entries. Each insert should correspond to a sequence in the fasta file. Inserts share the same attributes tags as [SEQUENCE][] entries but define two additional `open` and `close` attribute tags. These specify where the insert should be added to the encoding sequence. Multiple inserts can be added as a list to the encoding sequence entry using the `inserts` attribute tag. The following attribute tags can be to modify how an insert is added to the scaffold. * `source:` : **Required.** Specify the insert sequence to use. The value should match the first space delimited word from the fasta header of the required insert sequence. * `reverse:` : **Optional.** Specify whether the insert should be reverse complemented. Default value is false. * `start:` : **Optional.** Specify if the insert should be trimmed from the start to the specified trim position. Default value is 1 where no sequence is removed. * `stop:` : **Optional.** Specify if the insert should be trimmed from the end of the sequence. Sequence onward of trim position is removed. Default value is the length of the fasta sequence where no sequence is removed. * `open:` : **Optional if close coordinate is specified.** Specify the start coordinate where the insert is added to the encoding sequence. Default value is the close coordinate position minus the length of the insert sequence. * `close:` : **Optional if open coordinate is specified.** Specify the end coordinate position for adding the insert to the encoding sequence. Default value is the open coordinate position plus the length of the insert sequence. ## COPYRIGHT **Scaffolder** is Copyright (C) 2010 Michael Barton