Class: CodeZauker::FileScanner
Inherits: Object
- Object
  - CodeZauker::FileScanner
Defined in: lib/code_zauker.rb
Overview
Scans a file and pushes its trigrams into Redis. It then provides handy methods to find the files containing a given trigram.
Instance Method Summary
- (Object) disconnect
- (FileScanner) initialize(redisConnection = nil) [constructor]
  A new instance of FileScanner.
- (Object) load(filename, noReload = false)
- (Object) remove(filePaths = nil)
  Remove the files from the index, updating trigrams.
- (Object) removeAll
  Remove every indexed file from the index.
- (Object) search(term)
  Find a list of candidate files for a search string. The search string is split into overlapping trigrams.
Constructor Details
- (FileScanner) initialize(redisConnection = nil)
A new instance of FileScanner. If no Redis connection is passed, a default one is created with Redis.new.
    # File 'lib/code_zauker.rb', line 16
    def initialize(redisConnection=nil)
      if redisConnection==nil
        @redis=Redis.new
      else
        @redis=redisConnection
      end
    end
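Because the constructor accepts an already-open connection, a stand-in object can be handed to the scanner in tests. The following is a minimal sketch of that pattern only. `FakeConnection` and `ScannerSketch` are hypothetical names introduced for illustration and are not part of code_zauker; they are used so the example runs without a Redis server.

```ruby
# Hypothetical stand-in for a Redis client; it only records that quit was called.
class FakeConnection
  attr_reader :closed

  def initialize
    @closed = false
  end

  def quit
    @closed = true
  end
end

# Illustrative class mirroring the constructor/disconnect pattern of the listing.
# It is NOT CodeZauker::FileScanner itself.
class ScannerSketch
  def initialize(connection = nil)
    # The real constructor falls back to Redis.new; this sketch falls back to the fake.
    @conn = connection.nil? ? FakeConnection.new : connection
  end

  def disconnect
    @conn.quit
  end
end

conn = FakeConnection.new
scanner = ScannerSketch.new(conn)
scanner.disconnect
puts conn.closed
```

Passing the connection in, rather than always creating it inside the class, also lets several scanners share one connection.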
Instance Method Details
- (Object) disconnect
    # File 'lib/code_zauker.rb', line 23
    def disconnect()
      @redis.quit
    end
- (Object) load(filename, noReload = false)
    # File 'lib/code_zauker.rb', line 52
    def load(filename, noReload=false)
      # Define my redis id...
      # Already exists?...
      fid=@redis.get "fscan:id:#{filename}"
      if fid==nil
        @redis.setnx "fscan:nextId",0
        fid=@redis.incr "fscan:nextId"
        # BUG: Consider storing it at the END of the processing
        @redis.set "fscan:id:#{filename}", fid
        @redis.set "fscan:id2filename:#{fid}",filename
      else
        if noReload
          puts "Already found #{filename} as id:#{fid} and NOT RELOADED"
          return nil
        end
      end
      # fid is the set key!...
      trigramScanned=0
      # TEST_LICENSE.txt: 3290 Total Scanned: 24628
      # The ratio is below 13% of total trigrams are unique for very big files
      # So we avoid a huge roundtrip to redis, and store the trigram on a memory-based set
      # before sending it to redis. This avoids
      # a lot of spurious work
      s=Set.new
      File.open(filename,"r") do |f|
        lines=f.readlines()
        adaptiveSize= 6000
        lines.each do |l|
          # Split each line into 3-char chunks, and store in a redis set
          i=0
          for istart in 0...(l.length-GRAM_SIZE)
            trigram = l[istart, GRAM_SIZE]
            # Avoid storing the 3space guy entirely
            if trigram==SPACE_GUY
              next
            end
            # push the trigram to redis (highly optimized)
            s.add(trigram)
            if s.length > adaptiveSize
              pushTrigramsSet(s,fid,filename)
              s=Set.new()
            end
            trigramScanned += 1
            #puts "#{istart} Trigram fscan:#{trigram}/ FileId: #{fid}"
          end
        end
      end

      if s.length > 0
        pushTrigramsSet(s,fid,filename)
        s=nil
        #puts "Final push of #{s.length}"
      end

      trigramsOnFile=@redis.scard "fscan:trigramsOnFile:#{fid}"
      @redis.sadd "fscan:processedFiles", "#{filename}"
      trigramRatio=( (trigramsOnFile*1.0) / trigramScanned )* 100.0
      if trigramRatio < 10 or trigramRatio >75
        puts "#{filename}\n\tRatio:#{trigramRatio.round}% Unique Trigrams:#{trigramsOnFile} Total Scanned: #{trigramScanned} "
      end
      return nil
    end
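The core of load is the per-line trigram scan, with unique trigrams buffered in an in-memory Set before anything is sent to Redis. The sketch below isolates that step without Redis. It assumes GRAM_SIZE is 3 and SPACE_GUY is three spaces; the listing uses both constants without showing their definitions, so these values are inferred from its comments. `collect_trigrams` is a hypothetical helper name, not a method of the library.

```ruby
require "set"

# Assumed values: not shown in the listing, inferred from its comments.
GRAM_SIZE = 3
SPACE_GUY = " " * GRAM_SIZE

# Collect the unique trigrams of each line, using the same loop bounds as load.
# Returns [unique_trigrams, scanned_count].
def collect_trigrams(lines)
  unique = Set.new
  scanned = 0
  lines.each do |line|
    (0...(line.length - GRAM_SIZE)).each do |start|
      trigram = line[start, GRAM_SIZE]
      next if trigram == SPACE_GUY # the all-spaces trigram is never stored
      unique.add(trigram)
      scanned += 1
    end
  end
  [unique, scanned]
end

unique, scanned = collect_trigrams(["abcabc\n", "   abc\n"])
ratio = (unique.size * 1.0 / scanned) * 100.0
puts "unique=#{unique.size} scanned=#{scanned} ratio=#{ratio.round}%"
# prints: unique=5 scanned=7 ratio=71%
```

With these loop bounds the final window of each line is never visited, so for a line that ends in a newline the trigram containing the newline is not indexed. Buffering in a Set pays off because repeated trigrams (here "abc" appears three times) are sent to the store only once.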
- (Object) remove(filePaths = nil)
Remove the files from the index, updating trigrams. When filePaths is nil, every file recorded under an fscan:id:* key is removed.
    # File 'lib/code_zauker.rb', line 163
    def remove(filePaths=nil)
      if filePaths==nil
        fileList=[]
        storedFiles=@redis.keys "fscan:id:*"
        storedFiles.each do |fileKey|
          filename=fileKey.split("fscan:id:")[1]
          fileList.push(filename)
        end
      else
        fileList=filePaths
      end
      puts "Files to remove from index...#{fileList.length}"
      fileList.each do |filename|
        fid=@redis.get "fscan:id:#{filename}"
        trigramsToExpurge=@redis.smembers "fscan:trigramsOnFile:#{fid}"
        if trigramsToExpurge.length==0
          puts "?Nothing to do on #{filename}"
        end
        puts "#{filename} id=#{fid} Trigrams: #{trigramsToExpurge.length}"
        trigramsToExpurge.each do | ts |
          @redis.srem "trigram:#{ts}", fid
          begin
            @redis.srem "trigram:ci:#{ts.downcase}",fid
          rescue ArgumentError
            # Ignore "ArgumentError: invalid byte sequence in UTF-8"
            # and proceed...
          end
        end
        @redis.del "fscan:id:#{filename}", "fscan:trigramsOnFile:#{fid}", "fscan:id2filename:#{fid}"
        @redis.srem "fscan:processedFiles", filename
      end
      return nil
    end
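Removal relies on a reverse index: for each file id, the set of trigrams it contributed is stored, so only those trigram sets need to be touched. The sketch below shows the idea with plain Ruby hashes standing in for the Redis sets. The hash names and the `remove_file` helper are hypothetical and introduced only for illustration.

```ruby
require "set"

# In-memory stand-ins for the Redis sets used by the listing (illustrative only).
trigram_to_files = {
  "abc" => Set["1", "2"],
  "bcd" => Set["1"]
}
trigrams_on_file = { "1" => Set["abc", "bcd"], "2" => Set["abc"] }

# Remove one file id from every trigram set it appears in, then drop its own entry.
def remove_file(fid, trigram_to_files, trigrams_on_file)
  trigrams_on_file.fetch(fid, Set.new).each do |trigram|
    files = trigram_to_files[trigram]
    next if files.nil?
    files.delete(fid)
    # Redis deletes a set key automatically once the set is empty;
    # with a plain Hash the empty entry has to be dropped by hand.
    trigram_to_files.delete(trigram) if files.empty?
  end
  trigrams_on_file.delete(fid)
end

remove_file("1", trigram_to_files, trigrams_on_file)
puts trigram_to_files.inspect
```

After the call only file "2" remains under "abc", and "bcd" has disappeared because no file references it any more.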
- (Object) removeAll
Remove every indexed file from the index; equivalent to calling remove with no arguments.
    # File 'lib/code_zauker.rb', line 158
    def removeAll()
      self.remove(nil)
    end
- (Object) search(term)
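Find a list of candidate files for a search string. The search string is split into overlapping trigrams, and only files containing all of them are returned. Raises an error if the term is shorter than GRAM_SIZE characters.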
    # File 'lib/code_zauker.rb', line 119
    def search(term)
      if term.length < GRAM_SIZE
        raise "FATAL: #{term} is shorter then the minimum size of #{GRAM_SIZE} character"
      end
      #puts " ** Searching: #{term}"
      # split the term in a padded trigram
      trigramInAnd=[]
      # Search=> Sea AND ear AND arc AND rch
      for j in 0...term.length
        currentTrigram=term[j,GRAM_SIZE]
        if currentTrigram.length <GRAM_SIZE
          # We are at the end...
          break
        end
        trigramInAnd.push("trigram:#{currentTrigram}")
      end
      #puts "Trigam conversion /#{term}/ into #{trigramInAnd}"
      if trigramInAnd.length==0
        return []
      end
      fileIds= @redis.sinter(*trigramInAnd)
      filenames=[]
      # fscan:id2filename:#{fid}....
      fileIds.each do | id |
        filenames.push(@redis.get("fscan:id2filename:#{id}"))
      end
      #puts " ** Files found:#{filenames} from ids #{fileIds}"
      return filenames
    end
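The search splits the term into overlapping trigrams and intersects the file-id sets of all of them. The sketch below reproduces that logic in memory, with a Hash of Sets standing in for the Redis sets and Set intersection standing in for SINTER. `term_trigrams`, `candidate_ids`, and the sample index are hypothetical, and a gram size of 3 is assumed. Unlike the real method, this sketch returns an empty list for too-short terms instead of raising.

```ruby
require "set"

# Split a term into overlapping windows of gram_size characters.
def term_trigrams(term, gram_size = 3)
  return [] if term.length < gram_size
  (0..(term.length - gram_size)).map { |i| term[i, gram_size] }
end

# Hypothetical in-memory index: trigram => set of file ids.
index = {
  "Sea" => Set[1, 2],
  "ear" => Set[1, 2],
  "arc" => Set[1],
  "rch" => Set[1, 3]
}

# Intersect the id sets of every trigram in the term (the SINTER step).
def candidate_ids(term, index)
  grams = term_trigrams(term)
  return [] if grams.empty?
  grams.map { |g| index.fetch(g, Set.new) }.reduce(:&).to_a
end

p term_trigrams("Search")        # ["Sea", "ear", "arc", "rch"]
p candidate_ids("Search", index) # [1]
```

The result is a list of candidates: a file that contains every trigram somewhere does not necessarily contain the whole term as one contiguous string, so callers usually confirm the match by scanning the candidate files.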