## stackprof a sampling call-stack profiler for ruby 2.1+ inspired heavily by [gperftools](https://code.google.com/p/gperftools/), and written as a replacement for [perftools.rb](https://github.com/tmm1/perftools.rb) ### getting started in ruby: ``` ruby profile = StackProf.run(mode: :cpu) do ... end File.open('tmp/stackprof-cpu-myapp.dump', 'wb'){ |f| f.write Marshal.dump(profile) } ``` via rack: ``` ruby use StackProf::Middleware, enabled: true, mode: :cpu, interval: 1000, save_every: 5 ``` reporting: ``` $ stackprof tmp/stackprof-cpu-*.dump --text --limit 1 ================================== Mode: cpu(1000) Samples: 60395 (1.09% miss rate) GC: 2851 (4.72%) ================================== TOTAL (pct) SAMPLES (pct) FRAME 1660 (2.7%) 1595 (2.6%) String#blank? $ stackprof tmp/stackprof-cpu-*.dump --method 'String#blank?' String#blank? (gems/activesupport-2.3.14.github30/lib/active_support/core_ext/object/blank.rb:80) samples: 1595 self (2.6%) / 1660 total (2.7%) callers: 373 ( 41.0%) ApplicationHelper#current_user 192 ( 21.1%) ApplicationHelper#current_repository callers: 803 ( 48.4%) Object#present? code: | 80 | def blank? 1225 (2.0%) / 1225 (2.0%) | 81 | self !~ /[^[:space:]]/ | 82 | end $ stackprof tmp/stackprof-cpu-*.dump --method 'Object#present?' Object#present? (gems/activesupport-2.3.14.github30/lib/active_support/core_ext/object/blank.rb:20) samples: 59 self (0.1%) / 910 total (1.5%) callees (851 total): 803 ( 94.4%) String#blank? 32 ( 3.8%) Object#blank? 16 ( 1.9%) NilClass#blank? code: | 20 | def present? 910 (1.5%) / 59 (0.1%) | 21 | !blank? | 22 | end ``` ### sampling four sampling modes are supported: - :wall (using `ITIMER_REAL` and `SIGALRM`) [default mode] - :cpu (using `ITIMER_PROF` and `SIGPROF`) - :object (using `RUBY_INTERNAL_EVENT_NEWOBJ`) - :custom (user-defined via `StackProf.sample`) samplers have a tuneable interval which can be used to reduce overhead or increase granularity: - wall time: sample every _interval_ microseconds of wallclock time (default: 1000) - cpu time: sample every _interval_ microseconds of cpu activity (default: 1000 = 1 millisecond) - object allocation: sample every _interval_ allocations (default: 1) samples are taken using a combination of three new C-APIs in ruby 2.1: - signal handlers enqueue a sampling job using `rb_postponed_job_register_one`. this ensures callstack samples can be taken safely, in case the VM is garbage collecting or in some other inconsistent state during the interruption. - stack frames are collected via `rb_profile_frames`, which provides low-overhead C-API access to the VM's call stack. no object allocations occur in this path, allowing stackprof to collect callstacks in allocation mode. - in allocation mode, samples are taken via `rb_tracepoint_new(RUBY_INTERNAL_EVENT_NEWOBJ)`, which provides a notification every time the VM allocates a new object. ### aggregation each sample consists of N stack frames, where a frame looks something like `MyClass#method` or `block in MySingleton.method`. for each of these frames in the sample, the profiler collects a few pieces of metadata: - samples: number of samples where this was the topmost frame - total_samples: samples where this frame was in the stack - lines: samples per line number in this frame - edges: samples per callee frame (methods invoked by this frame) the aggregation algorithm is roughly equivalent to the following pseudo code: ``` ruby trap('PROF') do top, *rest = caller top.samples += 1 top.lines[top.lineno] += 1 top.total_samples += 1 prev = top rest.each do |frame| frame.edges[prev] += 1 frame.total_samples += 1 prev = frame end end ``` this technique builds up an incremental callgraph from the samples. on any given frame, the sum of the outbound edge weights is equal to total samples collected on that frame (`frame.total_samples == frame.edges.values.sum`). ### reporting multiple reporting modes are supported: - text - dotgraph - source annotation #### `StackProf::Report.new(data).print_text` ``` TOTAL (pct) SAMPLES (pct) FRAME 91 (48.4%) 91 (48.4%) A#pow 58 (30.9%) 58 (30.9%) A.newobj 34 (18.1%) 34 (18.1%) block in A#math 188 (100.0%) 3 (1.6%) block (2 levels) in
185 (98.4%) 1 (0.5%) A#initialize 35 (18.6%) 1 (0.5%) A#math 188 (100.0%) 0 (0.0%)
188 (100.0%) 0 (0.0%) block in
188 (100.0%) 0 (0.0%)
``` #### `StackProf::Report.new(data).print_graphviz` ![](http://cl.ly/image/2t3l2q0l0B0A/content) ``` digraph profile { 70346498324780 [size=23.5531914893617] [fontsize=23.5531914893617] [shape=box] [label="A#pow\n91 (48.4%)\r"]; 70346498324680 [size=18.638297872340424] [fontsize=18.638297872340424] [shape=box] [label="A.newobj\n58 (30.9%)\r"]; 70346498324480 [size=15.063829787234042] [fontsize=15.063829787234042] [shape=box] [label="block in A#math\n34 (18.1%)\r"]; 70346498324220 [size=10.446808510638299] [fontsize=10.446808510638299] [shape=box] [label="block (2 levels) in
\n3 (1.6%)\rof 188 (100.0%)\r"]; 70346498324220 -> 70346498324900 [label="185"]; 70346498324900 [size=10.148936170212766] [fontsize=10.148936170212766] [shape=box] [label="A#initialize\n1 (0.5%)\rof 185 (98.4%)\r"]; 70346498324900 -> 70346498324780 [label="91"]; 70346498324900 -> 70346498324680 [label="58"]; 70346498324900 -> 70346498324580 [label="35"]; 70346498324580 [size=10.148936170212766] [fontsize=10.148936170212766] [shape=box] [label="A#math\n1 (0.5%)\rof 35 (18.6%)\r"]; 70346498324580 -> 70346498324480 [label="34"]; 70346497983360 [size=10.0] [fontsize=10.0] [shape=box] [label="
\n0 (0.0%)\rof 188 (100.0%)\r"]; 70346497983360 -> 70346498325080 [label="188"]; 70346498324300 [size=10.0] [fontsize=10.0] [shape=box] [label="block in
\n0 (0.0%)\rof 188 (100.0%)\r"]; 70346498324300 -> 70346498324220 [label="188"]; 70346498325080 [size=10.0] [fontsize=10.0] [shape=box] [label="
\n0 (0.0%)\rof 188 (100.0%)\r"]; 70346498325080 -> 70346498324300 [label="188"]; } ``` #### `StackProf::Report.new(data).print_source(/pow|newobj|math/)` ``` A#pow (/Users/tmm1/code/stackprof/sample.rb:11) | 11 | def pow 91 (48.4% / 100.0%) | 12 | 2 ** 100 | 13 | end A.newobj (/Users/tmm1/code/stackprof/sample.rb:15) | 15 | def self.newobj 33 (17.6% / 56.9%) | 16 | Object.new 25 (13.3% / 43.1%) | 17 | Object.new | 18 | end A#math (/Users/tmm1/code/stackprof/sample.rb:20) | 20 | def math 1 (0.5% / 100.0%) | 21 | 2.times do | 22 | 2 + 3 * 4 ^ 5 / 6 block in A#math (/Users/tmm1/code/stackprof/sample.rb:21) | 21 | 2.times do 34 (18.1% / 100.0%) | 22 | 2 + 3 * 4 ^ 5 / 6 | 23 | end ``` ### usage the profiler is compiled as a C-extension and exposes a simple api: `StackProf.run(mode: [:cpu|:wall|:object])`. the `run` method takes a block of code and returns a profile as a simple hash. ``` ruby # sample after every 1ms of cpu activity profile = StackProf.run(mode: :cpu, interval: 1000) do MyCode.execute end ``` this profile data structure is part of the public API, and is intended to be saved (as json/marshal for example) for later processing. the reports above can be generated by passing this structure into `StackProf::Report.new`. the format itself is very simple. it contains a header and a list of frames. each frame has a unique id and identifying information such as its name, file and line. the frame also contains sampling data, including per-line samples, and a list of relationships to other frames represented as weighted edges. ``` ruby {:version=>1.0, :mode=>:cpu, :inteval=>1000, :samples=>188, :missed_samples=>0, :frames=> {70346498324780=> {:name=>"A#pow", :file=>"/Users/tmm1/code/stackprof/sample.rb", :line=>11, :total_samples=>91, :samples=>91, :lines=>{12=>91}}, 70346498324900=> {:name=>"A#initialize", :file=>"/Users/tmm1/code/stackprof/sample.rb", :line=>5, :total_samples=>185, :samples=>1, :edges=>{70346498324780=>91, 70346498324680=>58, 70346498324580=>35}, :lines=>{8=>1}}, ``` above, `A#pow` was involved in 91 samples, and in all cases it was at the top of the stack on line 12. `A#initialize` was in 185 samples, but it was at the top of the stack in only 1 sample. the rest of the samples are divided up between its callee edges. all 91 calls to `A#pow` came from `A#initialize`, as seen by the edge numbered `70346498324780`. ### advanced usage the profiler can be started and stopped manually. results are accumulated until retrieval, across multiple start/stop invocations. ``` ruby StackProf.running? StackProf.start StackProf.stop StackProf.results ``` ### todo * file/iseq blacklist * restore signal handlers on stop