Class: Blast

Inherits:
Object
  • Object
show all
Defined in:
lib/genevalidator/blast.rb

Instance Attribute Summary (collapse)

Instance Method Summary (collapse)

Constructor Details

- (Blast) initialize(fasta_file, type, xml_file = nil, outfmt = nil, start_idx = 1)

Initilizes the object Params: fasta_file: query sequence fasta file with query sequences type: query sequence type; can be :nucleotide or :protein xml_file: name of the precalculated blast xml output (used in 'skip blast' case) outfmt: output format start_idx: number of the sequence from the file to start with



35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
# File 'lib/genevalidator/blast.rb', line 35

def initialize(fasta_file, type, xml_file = nil, outfmt = nil, start_idx = 1)
  begin

    puts "\nDepending on your input and your computational resources, this may take a while. Please wait...\n\n"

    if type == "protein"
      @type = :protein
    else 
      @type = :nucleotide
    end

    @fasta_file = fasta_file
    @xml_file = xml_file
    @idx = 0
    @start_idx = start_idx
    @outfmt = outfmt

    raise FileNotFoundException.new unless File.exists?(@fasta_file)
    fasta_content = IO.binread(@fasta_file);

    # type validation: the type of the sequence in the FASTA correspond to the one declared by the user
    if @type != type_of_sequences(fasta_content)
      raise SequenceTypeError.new
    end

    # create a list of index of the queries in the FASTA
    @query_offset_lst = fasta_content.enum_for(:scan, /(>[^>]+)/).map{ Regexp.last_match.begin(0)}
    @query_offset_lst.push(fasta_content.length)
    fasta_content = nil # free memory for variable fasta_content

    # redirect the cosole messages of R
    R.echo "enable = nil, stderr = nil, warn = nil"

  rescue SequenceTypeError => error
    $stderr.print "Sequence Type error at #{error.backtrace[0].scan(/\/([^\/]+:\d+):.*/)[0][0]}. Possible cause: input file is not FASTA or the --type parameter is incorrect.\n"      
    exit
  rescue FileNotFoundException => error
    $stderr.print "File not found error at #{error.backtrace[0].scan(/\/([^\/]+:\d+):.*/)[0][0]}. Possible cause: input file does not exist.\n"
    exit 
  end
end

Instance Attribute Details

- (Object) fasta_file (readonly)

Returns the value of attribute fasta_file



19
20
21
# File 'lib/genevalidator/blast.rb', line 19

def fasta_file
  @fasta_file
end

- (Object) idx (readonly)

current number of the querry processed



21
22
23
# File 'lib/genevalidator/blast.rb', line 21

def idx
  @idx
end

- (Object) outfmt (readonly)

Returns the value of attribute outfmt



23
24
25
# File 'lib/genevalidator/blast.rb', line 23

def outfmt
  @outfmt
end

- (Object) query_offset_lst (readonly)

array of indexes for the start offsets of each query in the fasta file



25
26
27
# File 'lib/genevalidator/blast.rb', line 25

def query_offset_lst
  @query_offset_lst
end

- (Object) start_idx (readonly)

Returns the value of attribute start_idx



22
23
24
# File 'lib/genevalidator/blast.rb', line 22

def start_idx
  @start_idx
end

- (Object) type (readonly)

Returns the value of attribute type



18
19
20
# File 'lib/genevalidator/blast.rb', line 18

def type
  @type
end

Instance Method Details

- (Object) blast

Calls blast according to the type of the sequence



79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
# File 'lib/genevalidator/blast.rb', line 79

def blast
  begin

    if @xml_file == nil
 
      #file seek for each query
      @query_offset_lst[0..@query_offset_lst.length-2].each_with_index do |pos, i|
    
        if (i+1) >= @start_idx
          query = IO.binread(@fasta_file, @query_offset_lst[i+1] - @query_offset_lst[i], @query_offset_lst[i]);

          #call blast with the default parameters
          if type == :protein
            output = call_blast_from_stdin("blastp", query, 11, 1)
          else
            output = call_blast_from_stdin("blastx", query, 11, 1)
          end

          #save output in a file
          xml_file = "#{@fasta_file}_#{i+1}.xml"
          File.open(xml_file , "w") do |f| f.write(output) end

          #parse output
          parse_xml_output(output)   
        else
          @idx = @idx + 1
        end
      end
    else
      file = File.open(@xml_file, "rb").read
      parse_xml_output(file)      
    end

  rescue SystemCallError => error
    $stderr.print "Load error at #{error.backtrace[0].scan(/\/([^\/]+:\d+):.*/)[0][0]}. Possible cause: input file is not valid\n"      
    exit
  end
end

- (Object) call_blast_from_file(command, filename, gapopen, gapextend)

Calls blast from file with specific parameters Param: command: blast command in String format (e.g 'blastx' or 'blastp') filename: name of the FAST file query: String containing the the query in fasta format gapopen: gapopen blast parameter gapextend: gapextend blast parameter Output: String with the blast xml output



164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
# File 'lib/genevalidator/blast.rb', line 164

def call_blast_from_file(command, filename, gapopen, gapextend)
  begin  
    raise TypeError unless command.is_a? String and filename.is_a? String

    evalue = "1e-5"

    #output = 5 (XML Blast output)
    cmd = "#{command} -query #{filename} -db nr -remote -evalue #{evalue} -outfmt 5 -gapopen #{gapopen} -gapextend #{gapextend} "
    puts "Executing \"#{cmd}\"..."
    puts "This may take a while..."
    output = %x[#{cmd}          if xml_file == nil
          file = File.open(xml_file, "rb").read
          b.parse_xml_output(file)
        end 2>/dev/null]

    if output == ""
      raise ClasspathError.new      
    end

    return output

  rescue TypeError => error
    $stderr.print "Type error at #{error.backtrace[0].scan(/\/([^\/]+:\d+):.*/)[0][0]}. Possible cause: one of the arguments of 'call_blast_from_file' method has not the proper type\n"      
    exit
  rescue ClasspathError =>error
    $stderr.print "BLAST error at #{error.backtrace[0].scan(/\/([^\/]+:\d+):.*/)[0][0]}. Did you add BLAST path to CLASSPATH?\n"      
    exit
  end
end

- (Object) call_blast_from_stdin(command, query, gapopen, gapextend)

Calls blast from standard input with specific parameters Params: command: blast command in String format (e.g 'blastx' or 'blastp') query: String containing the the query in fasta format gapopen: gapopen blast parameter gapextend: gapextend blast parameter Output: String with the blast xml output



127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
# File 'lib/genevalidator/blast.rb', line 127

def call_blast_from_stdin(command, query, gapopen, gapextend)
  begin
    raise TypeError unless command.is_a? String and query.is_a? String

    evalue = "1e-5"

    #output format = 5 (XML Blast output)
    blast_cmd = "#{command} -db nr -remote -evalue #{evalue} -outfmt 5 -gapopen #{gapopen} -gapextend #{gapextend}"
    cmd = "echo \"#{query}\" | #{blast_cmd}"
    #puts "Executing \"#{blast_cmd}\"... This may take a while..."
    output = %x[#{cmd} 2>/dev/null]

    if output == ""
      raise ClasspathError.new
    end

    return output

  rescue TypeError => error
    $stderr.print "Type error at #{error.backtrace[0].scan(/\/([^\/]+:\d+):.*/)[0][0]}. Possible cause: one of the arguments of 'call_blast_from_file' method has not the proper type\n"
    exit
  rescue ClasspathError => error
    $stderr.print "BLAST error at #{error.backtrace[0].scan(/\/([^\/]+:\d+):.*/)[0][0]}. Possible cause: Did you add BLAST path to CLASSPATH?\n" 
    exit 
  end
end

- (Object) composition(sequence_string)

Method copied from sequenceserver/sequencehelpers.rb Params: sequence_string: String of which we mfind the composition Output: a Hash



386
387
388
389
390
391
392
# File 'lib/genevalidator/blast.rb', line 386

def composition(sequence_string)
  count = Hash.new(0)
  sequence_string.scan(/./) do |x|
    count[x] += 1
  end
  count
end

- (Object) get_sequence_by_accession_no(accno, db)

Gets gene by accession number from a givem database Params: accno: accession number as String db: database as String Output: String with the nucleotide sequence corresponding to the accno



358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
# File 'lib/genevalidator/blast.rb', line 358

def get_sequence_by_accession_no(accno,db)

  uri = "http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=#{db}&retmax=1&usehistory=y&term=#{accno}/"
  #puts uri
  result = Net::HTTP.get(URI.parse(uri))

  result2 = result
  queryKey = result2.scan(/<\bQueryKey\b>([\w\W\d]+)<\/\bQueryKey\b>/)[0][0]
  webEnv = result.scan(/<\bWebEnv\b>([\w\W\d]+)<\/\bWebEnv\b>/)[0][0]

  uri = "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?rettype=fasta&retmode=text&retstart=0&retmax=1&db=#{db}&query_key=#{queryKey}&WebEnv=#{webEnv}"

  result = Net::HTTP.get(URI.parse(uri))

  #parse FASTA output
  rec=result
  nl = rec.index("\n")
  header = rec[0..nl-1]
  seq = rec[nl+1..-1]
  seq.gsub!(/\n/,'')
end

- (Object) guess_sequence_type(sequence_string)

Method copied from sequenceserver/sequencehelpers.rb Strips all non-letter characters. guestimates sequence based on that. If less than 10 useable characters... returns nil If more than 90% ACGTU returns :nucleotide. else returns :protein Params: sequence_string: String to validate Output: nil, :nucleotide or :protein



403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
# File 'lib/genevalidator/blast.rb', line 403

def guess_sequence_type(sequence_string)
  cleaned_sequence = sequence_string.gsub(/[^A-Z]/i, '') # removing non-letter characters
  cleaned_sequence.gsub!(/[NX]/i, '') # removing ambiguous characters

  return nil if cleaned_sequence.length < 10 # conservative

  composition = composition(cleaned_sequence)
  composition_NAs = composition.select { |character, count|character.match(/[ACGTU]/i) } # only putative NAs
  putative_NA_counts = composition_NAs.collect { |key_value_array| key_value_array[1] } # only count, not char
  putative_NA_sum = putative_NA_counts.inject { |sum, n| sum + n } # count of all putative NA
  putative_NA_sum = 0 if putative_NA_sum.nil?

  if putative_NA_sum > (0.9 * cleaned_sequence.length)
    return :nucleotide
  else
    return :protein
  end
end

- (Object) parse_next_query(iterator)

Parses the next query from the blast xml output query Params: iterator: blast xml iterator for hits Outputs: output1: an array of Sequence ojbects for hits output2: Sequence object for the predicted sequence



257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
# File 'lib/genevalidator/blast.rb', line 257

def parse_next_query(iterator)
  begin
    raise TypeError unless iterator.is_a? Enumerator

    hits = Array.new
    predicted_seq = Sequence.new
    iter = iterator.next

    #puts "#################################################"
    #puts "Parsing query #{iter.field('Iteration_iter-num')}"

    # get info about the query
    predicted_seq.xml_length = iter.field("Iteration_query-len").to_i
    if @type == :nucleotide
      predicted_seq.xml_length /= 3
    end
    predicted_seq.definition = iter.field("Iteration_query-def")

    # parse blast the xml output and get the hits
    iter.each do | hit | 
      
      seq = Sequence.new

      seq.xml_length = hit.len.to_i        
      seq.seq_type = @type
      seq.database = iter.field("BlastOutput_db")
      seq.id = hit.hit_id
      seq.definition = hit.hit_def
      seq.accession_no = hit.accession

      species_regex = hit.hit_def.scan(/\[([^\]\[]+)\]$/)
      if species_regex[0] == nil
        seq.species = "Unknown"
      else
        seq.species = species_regex[0][0]
      end

      #get gene by accession number
      if @type == :protein
        seq.raw_sequence = ""#get_sequence_by_accession_no(seq.accession_no, "protein")
      else
        seq.raw_sequence = ""#get_sequence_by_accession_no(seq.accession_no, "nucleotide")
      end

      # get all high-scoring segment pairs (hsp)
      hsps = []
      hit.hsps.each do |hsp|
        current_hsp = Hsp.new
        current_hsp.bit_score = hsp.bit_score.to_i
        current_hsp.hsp_score = hsp.score.to_i
        current_hsp.hsp_evalue = hsp.evalue.to_i
        
        current_hsp.hit_from = hsp.hit_from.to_i
        current_hsp.hit_to = hsp.hit_to.to_i
        current_hsp.match_query_from = hsp.query_from.to_i
        current_hsp.match_query_to = hsp.query_to.to_i

        if @type == :nucleotide
          current_hsp.match_query_from /= 3 
          current_hsp.match_query_to /= 3             
        end


        current_hsp.query_reading_frame = hsp.query_frame.to_i

        current_hsp.hit_alignment = hsp.hseq.to_s
        current_hsp.query_alignment = hsp.qseq.to_s
        current_hsp.middles = hsp.midline.to_s

        current_hsp.identity = hsp.identity.to_i
        current_hsp.positive = hsp.positive.to_i
        current_hsp.gaps = hsp.gaps.to_i
        current_hsp.align_len = hsp.align_len.to_i

        hsps.push(current_hsp)
        #regex = current_hsp.hseq.gsub(/[+ -]/, '+' => '.', ' ' => '.', '-' => '')
        #seq.alignment_start_offset = seq.raw_sequence.index(/#{regex}/)
      end

      seq.hsp_list = hsps
      hits.push(seq)
      #puts "getting sequence #{seq.accession_no}..."
    end     
  
    return [hits, predicted_seq]

  rescue TypeError => error
    $stderr.print "Type error at #{error.backtrace[0].scan(/\/([^\/]+:\d+):.*/)[0][0]}. Possible cause: you didn't call parse method first!\n"       
    exit
  rescue StopIteration
    nil
  end
end

- (Object) parse_xml_output(output)

Parses the xml blast output Param: output: String with the blast output in xml format



198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
# File 'lib/genevalidator/blast.rb', line 198

def parse_xml_output(output)

  iterator = Bio::BlastXMLParser::NokogiriBlastXml.new(output).to_enum

  begin
    @idx = @idx + 1      
    if @idx < @start_idx  
      iter = iterator.next 
    else     
      sequences = parse_next_query(iterator) #returns [hits, predicted_seq]
      if sequences == nil          
        @idx = @idx -1
        break
      end

      hits = sequences[0]
      prediction = sequences[1]
      # get the @idx-th sequence  from the fasta file
      i = @idx-1
     
      ### add exception
      query = IO.binread(@fasta_file, @query_offset_lst[i+1] - @query_offset_lst[i], @query_offset_lst[i])
      prediction.raw_sequence = query.scan(/[^\n]*\n([ATGCatgc\n]*)/)[0][0].gsub("\n","")      
      #file seek for each query
      
      # do validations

      v = Validation.new(hits, prediction, @type, @fasta_file, @idx, @start_idx)

      if @outfmt == :html
        query_output = v.validate_all(true)
        query_output.generate_html
      else
        query_output = v.validate_all
      end
    
      query_output.print_output_console

      #if @outfmt == :yaml
        query_output.print_output_file_yaml
      #end
    end

    rescue NoMethodError => error
      $stderr.print "NoMethod error at #{error.backtrace[0].scan(/\/([^\/]+:\d+):.*/)[0][0]}. Possible cause: input file is not in blast xml format.\n"        
      exit
    rescue StopIteration
      return
  end while 1

end

- (Object) type_of_sequences(fasta_format_string)

Method copied from sequenceserver/sequencehelpers.rb Splits input at putative fasta definition lines (like ">adsfadsf"), guesses sequence type for each sequence. If not enough sequence to determine, returns nil. If 2 kinds of sequence mixed together, raises ArgumentError Otherwise, returns :nucleotide or :protein Params: sequence_string: String to validate Output: nil, :nucleotide or :protein



432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
# File 'lib/genevalidator/blast.rb', line 432

def type_of_sequences(fasta_format_string)
  # the first sequence does not need to have a fasta definition line
  sequences = fasta_format_string.split(/^>.*$/).delete_if { |seq| seq.empty? }

  # get all sequence types
  sequence_types = sequences.collect { |seq| guess_sequence_type(seq) }.uniq.compact

  return nil if sequence_types.empty?

  if sequence_types.length == 1
    return sequence_types.first # there is only one (but yes its an array)
  else
    raise ArgumentError, "Insufficient info to determine sequence type. Cleaned queries are: #{ sequences.to_s }"
  end
end