# Milemarker -- track (and probably log) progress in batch jobs Never again write code of the form `log.info "Finished 1_000 in #{secs} seconds at a rate of #{total.to_f / secs}"` . ## Usage ```ruby require 'milemarker' require 'logger' input_file = "records.ndj" # Create a new milemarker. Default batch_size is 1_000 milemarker = Milemarker.new(name: "Load #{input_file}", batch_size: 1_000_000) logger = Logger.new(STDERR) milemarker.logger = logger File.open(input_file).each do |line| do_whatever_needs_doing(line) milemarker.increment_and_log_batch_line end milemarker.log_final_line # if logging is set up # Identical to the above, but do the logging "by hand" File.open(input_file).each do |line| do_whatever_needs_doing(line) milemarker.increment_and_on_batch { logger.info milemarker.batch_line } end logger.info milemarker.final_line # Sample output # ... # I, [2021-11-02T01:51:06.959137 #11710] INFO -- : load records.ndj 8_000_000. This batch 2_000_000 in 26.2s (76_469 r/s). Overall 72_705 r/s. # I, [2021-11-02T01:51:36.992831 #11710] INFO -- : load records.ndj 10_000_000. This batch 2_000_000 in 30.0s (66_591 r/s). Overall 71_394 r/s. # ... # I, [2021-11-02T02:01:56.702196 #11710] INFO -- : load records.ndj FINISHED. 27_138_118 total records in 00h 12m 39s. Overall 35_718 r/s. ``` ## Basic usage Most programs will probably use `milemarker` is via `#increment_and_log_batch_line` (or its counterpart `#increment_and_on_batch {|milemarker| ... }` ). As the name suggests, this will: * increment the batch counter * If the batch counter >= the batch size: * run the provided block (or write the logline) * reset count/time/etc for the next batch Some examples: ```ruby # Logging, as above milemarker = Milemarker.new(batch_size: 1000, name: 'Load myfile') milemarker.increment_and_on_batch { logger.info milemarker.batch_line } # Alert when things seem to to take too long milemarker.increment_and_on_batch do |milemarker| secs = milemarker.last_batch_seconds if secs > way_too_long logger.error "Whoa: #{secs} is too long for a batch of #{milemarker.batch_size}" end end # #on_batch and #increment_and_on_batch can be used to do real (i.e., # non-logging) work after every `batch` calls, too queue = [] my_stuff.each do |doc| queue << do_something_to(doc) milemarker.increment_and_on_batch do |milemarker| write_to_datastore(queue) queue = [] logger.info milemarker.batch_line end end ``` `#incr` and `#on_batch(&blk)` are also available separately if you need to be more explicit and less atomic. All the components that make up a batch_line (e.g., the records/second as a nice string) are available to roll your own batch line. See the API documentation for details. ### Incorporating a logger into milemarker For standard logging cases, you can also pass in a logger, or let milemarker create one for its own use based on an IO-like object you provide ```ruby logger = Logger.new(STDERR) milemarker = Milemarker.new(name: 'my_process', batch_size: 10_000, logger: logger) # same thing milemarker = Milemarker.new(name: 'my_process', batch_size: 10_000) milemarker.logger = logger # same thing again milemarker = Milemarker.new(name: 'my_process', batch_size: 10_000) milemarker.create_logger!(STDERR) File.open(input_file).each do |line| do_whatever_needs_doing(line) milemarker.increment_and_log_batch_line end milemarker.log_final_line # All the logging methods take an optional :level argument milemarker.log_final_line(level: :debug) ``` ### Structured logging with Milemarker::Structured `Milemarker::Structured` will return hashes for `#batch_line` and `#final_line` (aliased to `#batch_data` and `#final_data`, respectively) and pass those hashes along to whatever logger you provide. `#create_logger!` for this subclass will create a logger that provides json lines instead of text, too. Presumably, if you pass in your own logger you'll use something like [semantic_logger](https://github.com/reidmorrison/semantic_logger) or [ougai](https://github.com/tilfin/ougai). ```ruby milemarker = Milemarker::Structured.new(name: 'my_process', batch_size: 10_000) milemarker.create_logger!(STDERR) File.open(input_file).each do |line| do_whatever_needs_doing(line) milemarker.increment_and_log_batch_line end # Usually one line; broken up for readability # {"name":"my_process","batch_count":10_000,"batch_seconds":97.502088, # "batch_rate":1.035875252230496,"total_count":100,"total_seconds":97.502094, # "total_rate":1.0358751884856956,"level":"INFO","time":"2021-11-06 17:32:21 -0400"} ``` ## Threadsafety A call to `milemaker.threadsafify!` will wrap `increment_and_on_batch` (and `increment_and_log_batch_line`) to be a threadsafe atomic operation at the cost of some performance. ``` milemarker.threadsafify! ``` ## Turning off logging If the logger is set to `nil`, no logging will occur. ```ruby # Turn off logging milemarker.logger = nil ``` You could also just configure your logger to ignore stuff ```ruby milemarker.logger.level = :error ``` ## Accuracy Note that `milemarker` isn't designed for real benchmarking. The assumption is that whatever work your code is actually doing will drown out any inefficiencies in the `milemarker` code, and milemarker numbers can be used to suss out where weird things are happening. ## Installation Add this line to your application's Gemfile: ```ruby gem 'milemarker' ``` And then execute: $ bundle install Or install it yourself as: $ gem install milemarker ## Contributing Bug reports and pull requests are welcome on GitHub at https://github.com/billdueber/milemarker. ## License The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).