Class: Blast
- Inherits:
-
Object
- Object
- Blast
- Defined in:
- lib/genevalidator/blast.rb
Instance Attribute Summary (collapse)
-
- (Object) fasta_file
readonly
Returns the value of attribute fasta_file.
-
- (Object) idx
readonly
current number of the querry processed.
-
- (Object) outfmt
readonly
Returns the value of attribute outfmt.
-
- (Object) query_offset_lst
readonly
array of indexes for the start offsets of each query in the fasta file.
-
- (Object) start_idx
readonly
Returns the value of attribute start_idx.
-
- (Object) type
readonly
Returns the value of attribute type.
Instance Method Summary (collapse)
-
- (Object) blast
Calls blast according to the type of the sequence.
-
- (Object) call_blast_from_file(command, filename, gapopen, gapextend)
Calls blast from file with specific parameters Param: command: blast command in String format (e.g 'blastx' or 'blastp') filename: name of the FAST file query: String containing the the query in fasta format gapopen: gapopen blast parameter gapextend: gapextend blast parameter Output: String with the blast xml output.
-
- (Object) call_blast_from_stdin(command, query, gapopen, gapextend)
Calls blast from standard input with specific parameters Params: command: blast command in String format (e.g 'blastx' or 'blastp') query: String containing the the query in fasta format gapopen: gapopen blast parameter gapextend: gapextend blast parameter Output: String with the blast xml output.
-
- (Object) composition(sequence_string)
Method copied from sequenceserver/sequencehelpers.rb Params: sequence_string: String of which we mfind the composition Output: a Hash.
-
- (Object) get_sequence_by_accession_no(accno, db)
Gets gene by accession number from a givem database Params: accno: accession number as String db: database as String Output: String with the nucleotide sequence corresponding to the accno.
-
- (Object) guess_sequence_type(sequence_string)
Method copied from sequenceserver/sequencehelpers.rb Strips all non-letter characters.
-
- (Blast) initialize(fasta_file, type, xml_file = nil, outfmt = nil, start_idx = 1)
constructor
Initilizes the object Params: fasta_file: query sequence fasta file with query sequences type: query sequence type; can be :nucleotide or :protein xml_file: name of the precalculated blast xml output (used in 'skip blast' case) outfmt: output format start_idx: number of the sequence from the file to start with.
-
- (Object) parse_next_query(iterator)
Parses the next query from the blast xml output query Params: iterator: blast xml iterator for hits Outputs: output1: an array of Sequence ojbects for hits output2: Sequence object for the predicted sequence.
-
- (Object) parse_xml_output(output)
Parses the xml blast output Param: output: String with the blast output in xml format.
-
- (Object) type_of_sequences(fasta_format_string)
Method copied from sequenceserver/sequencehelpers.rb Splits input at putative fasta definition lines (like ">adsfadsf"), guesses sequence type for each sequence.
Constructor Details
- (Blast) initialize(fasta_file, type, xml_file = nil, outfmt = nil, start_idx = 1)
Initilizes the object Params: fasta_file: query sequence fasta file with query sequences type: query sequence type; can be :nucleotide or :protein xml_file: name of the precalculated blast xml output (used in 'skip blast' case) outfmt: output format start_idx: number of the sequence from the file to start with
35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 |
# File 'lib/genevalidator/blast.rb', line 35 def initialize(fasta_file, type, xml_file = nil, outfmt = nil, start_idx = 1) begin puts "\nDepending on your input and your computational resources, this may take a while. Please wait...\n\n" if type == "protein" @type = :protein else @type = :nucleotide end @fasta_file = fasta_file @xml_file = xml_file @idx = 0 @start_idx = start_idx @outfmt = outfmt raise FileNotFoundException.new unless File.exists?(@fasta_file) fasta_content = IO.binread(@fasta_file); # type validation: the type of the sequence in the FASTA correspond to the one declared by the user if @type != type_of_sequences(fasta_content) raise SequenceTypeError.new end # create a list of index of the queries in the FASTA @query_offset_lst = fasta_content.enum_for(:scan, /(>[^>]+)/).map{ Regexp.last_match.begin(0)} @query_offset_lst.push(fasta_content.length) fasta_content = nil # free memory for variable fasta_content # redirect the cosole messages of R R.echo "enable = nil, stderr = nil, warn = nil" rescue SequenceTypeError => error $stderr.print "Sequence Type error at #{error.backtrace[0].scan(/\/([^\/]+:\d+):.*/)[0][0]}. Possible cause: input file is not FASTA or the --type parameter is incorrect.\n" exit rescue FileNotFoundException => error $stderr.print "File not found error at #{error.backtrace[0].scan(/\/([^\/]+:\d+):.*/)[0][0]}. Possible cause: input file does not exist.\n" exit end end |
Instance Attribute Details
- (Object) fasta_file (readonly)
Returns the value of attribute fasta_file
19 20 21 |
# File 'lib/genevalidator/blast.rb', line 19 def fasta_file @fasta_file end |
- (Object) idx (readonly)
current number of the querry processed
21 22 23 |
# File 'lib/genevalidator/blast.rb', line 21 def idx @idx end |
- (Object) outfmt (readonly)
Returns the value of attribute outfmt
23 24 25 |
# File 'lib/genevalidator/blast.rb', line 23 def outfmt @outfmt end |
- (Object) query_offset_lst (readonly)
array of indexes for the start offsets of each query in the fasta file
25 26 27 |
# File 'lib/genevalidator/blast.rb', line 25 def query_offset_lst @query_offset_lst end |
- (Object) start_idx (readonly)
Returns the value of attribute start_idx
22 23 24 |
# File 'lib/genevalidator/blast.rb', line 22 def start_idx @start_idx end |
- (Object) type (readonly)
Returns the value of attribute type
18 19 20 |
# File 'lib/genevalidator/blast.rb', line 18 def type @type end |
Instance Method Details
- (Object) blast
Calls blast according to the type of the sequence
79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 |
# File 'lib/genevalidator/blast.rb', line 79 def blast begin if @xml_file == nil #file seek for each query @query_offset_lst[0..@query_offset_lst.length-2].each_with_index do |pos, i| if (i+1) >= @start_idx query = IO.binread(@fasta_file, @query_offset_lst[i+1] - @query_offset_lst[i], @query_offset_lst[i]); #call blast with the default parameters if type == :protein output = call_blast_from_stdin("blastp", query, 11, 1) else output = call_blast_from_stdin("blastx", query, 11, 1) end #save output in a file xml_file = "#{@fasta_file}_#{i+1}.xml" File.open(xml_file , "w") do |f| f.write(output) end #parse output parse_xml_output(output) else @idx = @idx + 1 end end else file = File.open(@xml_file, "rb").read parse_xml_output(file) end rescue SystemCallError => error $stderr.print "Load error at #{error.backtrace[0].scan(/\/([^\/]+:\d+):.*/)[0][0]}. Possible cause: input file is not valid\n" exit end end |
- (Object) call_blast_from_file(command, filename, gapopen, gapextend)
Calls blast from file with specific parameters Param: command: blast command in String format (e.g 'blastx' or 'blastp') filename: name of the FAST file query: String containing the the query in fasta format gapopen: gapopen blast parameter gapextend: gapextend blast parameter Output: String with the blast xml output
164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 |
# File 'lib/genevalidator/blast.rb', line 164 def call_blast_from_file(command, filename, gapopen, gapextend) begin raise TypeError unless command.is_a? String and filename.is_a? String evalue = "1e-5" #output = 5 (XML Blast output) cmd = "#{command} -query #{filename} -db nr -remote -evalue #{evalue} -outfmt 5 -gapopen #{gapopen} -gapextend #{gapextend} " puts "Executing \"#{cmd}\"..." puts "This may take a while..." output = %x[#{cmd} if xml_file == nil file = File.open(xml_file, "rb").read b.parse_xml_output(file) end 2>/dev/null] if output == "" raise ClasspathError.new end return output rescue TypeError => error $stderr.print "Type error at #{error.backtrace[0].scan(/\/([^\/]+:\d+):.*/)[0][0]}. Possible cause: one of the arguments of 'call_blast_from_file' method has not the proper type\n" exit rescue ClasspathError =>error $stderr.print "BLAST error at #{error.backtrace[0].scan(/\/([^\/]+:\d+):.*/)[0][0]}. Did you add BLAST path to CLASSPATH?\n" exit end end |
- (Object) call_blast_from_stdin(command, query, gapopen, gapextend)
Calls blast from standard input with specific parameters Params: command: blast command in String format (e.g 'blastx' or 'blastp') query: String containing the the query in fasta format gapopen: gapopen blast parameter gapextend: gapextend blast parameter Output: String with the blast xml output
127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 |
# File 'lib/genevalidator/blast.rb', line 127 def call_blast_from_stdin(command, query, gapopen, gapextend) begin raise TypeError unless command.is_a? String and query.is_a? String evalue = "1e-5" #output format = 5 (XML Blast output) blast_cmd = "#{command} -db nr -remote -evalue #{evalue} -outfmt 5 -gapopen #{gapopen} -gapextend #{gapextend}" cmd = "echo \"#{query}\" | #{blast_cmd}" #puts "Executing \"#{blast_cmd}\"... This may take a while..." output = %x[#{cmd} 2>/dev/null] if output == "" raise ClasspathError.new end return output rescue TypeError => error $stderr.print "Type error at #{error.backtrace[0].scan(/\/([^\/]+:\d+):.*/)[0][0]}. Possible cause: one of the arguments of 'call_blast_from_file' method has not the proper type\n" exit rescue ClasspathError => error $stderr.print "BLAST error at #{error.backtrace[0].scan(/\/([^\/]+:\d+):.*/)[0][0]}. Possible cause: Did you add BLAST path to CLASSPATH?\n" exit end end |
- (Object) composition(sequence_string)
Method copied from sequenceserver/sequencehelpers.rb Params: sequence_string: String of which we mfind the composition Output: a Hash
386 387 388 389 390 391 392 |
# File 'lib/genevalidator/blast.rb', line 386 def composition(sequence_string) count = Hash.new(0) sequence_string.scan(/./) do |x| count[x] += 1 end count end |
- (Object) get_sequence_by_accession_no(accno, db)
Gets gene by accession number from a givem database Params: accno: accession number as String db: database as String Output: String with the nucleotide sequence corresponding to the accno
358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 |
# File 'lib/genevalidator/blast.rb', line 358 def get_sequence_by_accession_no(accno,db) uri = "http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=#{db}&retmax=1&usehistory=y&term=#{accno}/" #puts uri result = Net::HTTP.get(URI.parse(uri)) result2 = result queryKey = result2.scan(/<\bQueryKey\b>([\w\W\d]+)<\/\bQueryKey\b>/)[0][0] webEnv = result.scan(/<\bWebEnv\b>([\w\W\d]+)<\/\bWebEnv\b>/)[0][0] uri = "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?rettype=fasta&retmode=text&retstart=0&retmax=1&db=#{db}&query_key=#{queryKey}&WebEnv=#{webEnv}" result = Net::HTTP.get(URI.parse(uri)) #parse FASTA output rec=result nl = rec.index("\n") header = rec[0..nl-1] seq = rec[nl+1..-1] seq.gsub!(/\n/,'') end |
- (Object) guess_sequence_type(sequence_string)
Method copied from sequenceserver/sequencehelpers.rb Strips all non-letter characters. guestimates sequence based on that. If less than 10 useable characters... returns nil If more than 90% ACGTU returns :nucleotide. else returns :protein Params: sequence_string: String to validate Output: nil, :nucleotide or :protein
403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 |
# File 'lib/genevalidator/blast.rb', line 403 def guess_sequence_type(sequence_string) cleaned_sequence = sequence_string.gsub(/[^A-Z]/i, '') # removing non-letter characters cleaned_sequence.gsub!(/[NX]/i, '') # removing ambiguous characters return nil if cleaned_sequence.length < 10 # conservative composition = composition(cleaned_sequence) composition_NAs = composition.select { |character, count|character.match(/[ACGTU]/i) } # only putative NAs putative_NA_counts = composition_NAs.collect { |key_value_array| key_value_array[1] } # only count, not char putative_NA_sum = putative_NA_counts.inject { |sum, n| sum + n } # count of all putative NA putative_NA_sum = 0 if putative_NA_sum.nil? if putative_NA_sum > (0.9 * cleaned_sequence.length) return :nucleotide else return :protein end end |
- (Object) parse_next_query(iterator)
Parses the next query from the blast xml output query Params: iterator: blast xml iterator for hits Outputs: output1: an array of Sequence ojbects for hits output2: Sequence object for the predicted sequence
257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 |
# File 'lib/genevalidator/blast.rb', line 257 def parse_next_query(iterator) begin raise TypeError unless iterator.is_a? Enumerator hits = Array.new predicted_seq = Sequence.new iter = iterator.next #puts "#################################################" #puts "Parsing query #{iter.field('Iteration_iter-num')}" # get info about the query predicted_seq.xml_length = iter.field("Iteration_query-len").to_i if @type == :nucleotide predicted_seq.xml_length /= 3 end predicted_seq.definition = iter.field("Iteration_query-def") # parse blast the xml output and get the hits iter.each do | hit | seq = Sequence.new seq.xml_length = hit.len.to_i seq.seq_type = @type seq.database = iter.field("BlastOutput_db") seq.id = hit.hit_id seq.definition = hit.hit_def seq.accession_no = hit.accession species_regex = hit.hit_def.scan(/\[([^\]\[]+)\]$/) if species_regex[0] == nil seq.species = "Unknown" else seq.species = species_regex[0][0] end #get gene by accession number if @type == :protein seq.raw_sequence = ""#get_sequence_by_accession_no(seq.accession_no, "protein") else seq.raw_sequence = ""#get_sequence_by_accession_no(seq.accession_no, "nucleotide") end # get all high-scoring segment pairs (hsp) hsps = [] hit.hsps.each do |hsp| current_hsp = Hsp.new current_hsp.bit_score = hsp.bit_score.to_i current_hsp.hsp_score = hsp.score.to_i current_hsp.hsp_evalue = hsp.evalue.to_i current_hsp.hit_from = hsp.hit_from.to_i current_hsp.hit_to = hsp.hit_to.to_i current_hsp.match_query_from = hsp.query_from.to_i current_hsp.match_query_to = hsp.query_to.to_i if @type == :nucleotide current_hsp.match_query_from /= 3 current_hsp.match_query_to /= 3 end current_hsp.query_reading_frame = hsp.query_frame.to_i current_hsp.hit_alignment = hsp.hseq.to_s current_hsp.query_alignment = hsp.qseq.to_s current_hsp.middles = hsp.midline.to_s current_hsp.identity = hsp.identity.to_i current_hsp.positive = hsp.positive.to_i current_hsp.gaps = hsp.gaps.to_i current_hsp.align_len = hsp.align_len.to_i hsps.push(current_hsp) #regex = current_hsp.hseq.gsub(/[+ -]/, '+' => '.', ' ' => '.', '-' => '') #seq.alignment_start_offset = seq.raw_sequence.index(/#{regex}/) end seq.hsp_list = hsps hits.push(seq) #puts "getting sequence #{seq.accession_no}..." end return [hits, predicted_seq] rescue TypeError => error $stderr.print "Type error at #{error.backtrace[0].scan(/\/([^\/]+:\d+):.*/)[0][0]}. Possible cause: you didn't call parse method first!\n" exit rescue StopIteration nil end end |
- (Object) parse_xml_output(output)
Parses the xml blast output Param: output: String with the blast output in xml format
198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 |
# File 'lib/genevalidator/blast.rb', line 198 def parse_xml_output(output) iterator = Bio::BlastXMLParser::NokogiriBlastXml.new(output).to_enum begin @idx = @idx + 1 if @idx < @start_idx iter = iterator.next else sequences = parse_next_query(iterator) #returns [hits, predicted_seq] if sequences == nil @idx = @idx -1 break end hits = sequences[0] prediction = sequences[1] # get the @idx-th sequence from the fasta file i = @idx-1 ### add exception query = IO.binread(@fasta_file, @query_offset_lst[i+1] - @query_offset_lst[i], @query_offset_lst[i]) prediction.raw_sequence = query.scan(/[^\n]*\n([ATGCatgc\n]*)/)[0][0].gsub("\n","") #file seek for each query # do validations v = Validation.new(hits, prediction, @type, @fasta_file, @idx, @start_idx) if @outfmt == :html query_output = v.validate_all(true) query_output.generate_html else query_output = v.validate_all end query_output.print_output_console #if @outfmt == :yaml query_output.print_output_file_yaml #end end rescue NoMethodError => error $stderr.print "NoMethod error at #{error.backtrace[0].scan(/\/([^\/]+:\d+):.*/)[0][0]}. Possible cause: input file is not in blast xml format.\n" exit rescue StopIteration return end while 1 end |
- (Object) type_of_sequences(fasta_format_string)
Method copied from sequenceserver/sequencehelpers.rb Splits input at putative fasta definition lines (like ">adsfadsf"), guesses sequence type for each sequence. If not enough sequence to determine, returns nil. If 2 kinds of sequence mixed together, raises ArgumentError Otherwise, returns :nucleotide or :protein Params: sequence_string: String to validate Output: nil, :nucleotide or :protein
432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 |
# File 'lib/genevalidator/blast.rb', line 432 def type_of_sequences(fasta_format_string) # the first sequence does not need to have a fasta definition line sequences = fasta_format_string.split(/^>.*$/).delete_if { |seq| seq.empty? } # get all sequence types sequence_types = sequences.collect { |seq| guess_sequence_type(seq) }.uniq.compact return nil if sequence_types.empty? if sequence_types.length == 1 return sequence_types.first # there is only one (but yes its an array) else raise ArgumentError, "Insufficient info to determine sequence type. Cleaned queries are: #{ sequences.to_s }" end end |