README.md in d_heap-0.3.0 vs README.md in d_heap-0.4.0
- old
+ new
@@ -1,47 +1,130 @@
# DHeap
-A fast _d_-ary heap implementation for ruby, useful in priority queues and graph
-algorithms.
+A fast [_d_-ary heap][d-ary heap] [priority queue] for ruby, implemented as
+a C extension.
-The _d_-ary heap data structure is a generalization of the binary heap, in which
-the nodes have _d_ children instead of 2. This allows for "decrease priority"
-operations to be performed more quickly with the tradeoff of slower delete
-minimum. Additionally, _d_-ary heaps can have better memory cache behavior than
-binary heaps, allowing them to run more quickly in practice despite slower
-worst-case time complexity. In the worst case, a _d_-ary heap requires only
-`O(log n / log d)` to push, with the tradeoff that pop is `O(d log n / log d)`.
+With a regular queue, you expect "FIFO" behavior: first in, first out. With a
+stack you expect "LIFO": last in, first out. With a priority queue, you push
+elements along with a score, and the lowest-scored element is the first to be
+popped. Priority queues are often used in algorithms, e.g. for [scheduling]
+of timers or bandwidth management, for [Huffman coding], and for various
+graph search algorithms such as [Dijkstra's algorithm], [A* search], or
+[Prim's algorithm].
-Although you should probably just stick with the default _d_ value of `4`, it
-may be worthwhile to benchmark your specific scenario.
+The _d_-ary heap data structure is a generalization of the [binary heap], in
+which the nodes have _d_ children instead of 2. This allows for "decrease
+priority" operations to be performed more quickly with the tradeoff of slower
+delete minimum. Additionally, _d_-ary heaps can have better memory cache
+behavior than binary heaps, allowing them to run more quickly in practice
+despite slower worst-case time complexity. In the worst case, a _d_-ary heap
+requires only `O(log n / log d)` to push, with the tradeoff that pop is `O(d log
+n / log d)`.
+Although you should probably just use the default _d_ value of `4` (see the
+analysis below), it's always advisable to benchmark your specific use-case.
+
+[d-ary heap]: https://en.wikipedia.org/wiki/D-ary_heap
+[priority queue]: https://en.wikipedia.org/wiki/Priority_queue
+[binary heap]: https://en.wikipedia.org/wiki/Binary_heap
+[scheduling]: https://en.wikipedia.org/wiki/Scheduling_(computing)
+[Huffman coding]: https://en.wikipedia.org/wiki/Huffman_coding#Compression
+[Dijkstra's algorithm]: https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm#Using_a_priority_queue
+[A* search]: https://en.wikipedia.org/wiki/A*_search_algorithm#Description
+[Prim's algorithm]: https://en.wikipedia.org/wiki/Prim%27s_algorithm
+
## Usage
-The simplest way to use it is simply with `#push` and `#pop`. Push takes a
-score and a value, and pop returns the value with the current minimum score.
+The basic API is:
+* `heap << object` adds a value, using the value itself as its score.
+* `heap.push(score, value)` adds a value with an extrinsic score.
+* `heap.pop` removes and returns the value with the minimum score.
+* `heap.pop_lte(score)` pops if the minimum score is `<=` the provided score.
+* `heap.peek` returns the minimum value without popping it.
+The score must be an `Integer` or a `Float`, or convertible to a `Float` via
+`Float(score)` (i.e. it should implement `#to_f`). Constraining scores to
+numeric values gives a 40+% speedup under some benchmarks!
+
+_n.b._ `Integer` _scores must have an absolute value that fits into_ `unsigned
+long long`. _This is architecture-dependent, but on a 64-bit system such as
+IA-64 it is 64 bits, which gives a range of -18,446,744,073,709,551,615 to
++18,446,744,073,709,551,615._
+
+_Comparing arbitrary objects via_ `a <=> b` _was the original design and may
+be added back in a future version,_ if (and only if) _it can be done without
+impacting the speed of numeric comparisons._
+
```ruby
require "d_heap"
-heap = DHeap.new # defaults to a 4-ary heap
+Task = Struct.new(:id) # for demonstration
+heap = DHeap.new # defaults to a 4-heap
+
# storing [score, value] tuples
heap.push Time.now + 5*60, Task.new(1)
heap.push Time.now + 30, Task.new(2)
heap.push Time.now + 60, Task.new(3)
heap.push Time.now + 5, Task.new(4)
-# peeking and popping (using last to get the task and ignore the time)
+# peeking and popping
-heap.pop.last # => Task[4]
-heap.pop.last # => Task[2]
-heap.peak.last # => Task[3], but don't pop it
-heap.pop.last # => Task[3]
-heap.pop.last # => Task[1]
+heap.pop # => Task[4]
+heap.pop # => Task[2]
+heap.peek # => Task[3], but don't pop it from the heap
+heap.pop # => Task[3]
+heap.pop # => Task[1]
+heap.empty? # => true
+heap.pop # => nil
```
-Read the `rdoc` for more detailed documentation and examples.
+If your values behave as their own score, by being convertible via
+`Float(value)`, then you can use `#<<` for implicit scoring. The score should
+not change for as long as the value remains in the heap, since it will not be
+re-evaluated after being pushed.
+```ruby
+heap.clear
+
+# The score can be derived from the value by using to_f.
+# "a <=> b" is *much* slower than comparing numbers, so it isn't used.
+class Event
+ include Comparable
+ attr_reader :time, :payload
+ alias_method :to_time, :time
+
+ def initialize(time, payload)
+ @time = time.to_time
+ @payload = payload
+ freeze
+ end
+
+ def to_f
+ time.to_f
+ end
+
+ def <=>(other)
+ to_f <=> other.to_f
+ end
+end
+
+now = Time.now
+later  = Event.new(now + 30, "later")   # highest score, sorts last
+soon   = Event.new(now +  1, "soon")    # lowest score, sorts first
+middle = Event.new(now + 10, "middle")  # sorts in the middle
+heap << later
+heap << soon
+heap << middle
+heap.pop # => soon
+heap.pop # => middle
+heap.pop # => later
+heap.empty? # => true
+heap.pop # => nil
+```
+
+You can also pass a maximum score into `#pop(max)`, which will only pop if the
+minimum score is less than or equal to `max`.
+
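+For example, a scheduler might drain only the tasks that are already due,
+using the `pop_lte` form from the API list above. A minimal sketch (the
+`process` handler is hypothetical):
+
+```ruby
+# Pop only the tasks whose time has come, leaving future tasks in the heap.
+while (task = heap.pop_lte(Time.now.to_f))
+  process(task) # hypothetical handler for a due task
+end
+```
+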
+Read the [API documentation] for more detailed documentation and examples.
+
+[API documentation]: https://rubydoc.info/gems/d_heap/DHeap
+
## Installation
Add this line to your application's Gemfile:
```ruby
@@ -56,114 +139,231 @@
$ gem install d_heap
## Motivation
-Sometimes you just need a priority queue, right? With a regular queue, you
-expect "FIFO" behavior: first in, first out. With a priority queue, you push
-with a score (or your elements are comparable), and you want to be able to
-efficiently pop off the minimum (or maximum) element.
+One naive approach to a priority queue is to maintain an array in sorted order.
+This can be implemented very simply using `Array#bsearch_index` +
+`Array#insert`, and it can be very fast—`Array#pop` is `O(1)`—but the worst
+case for insert is `O(n)` because it may need to `memcpy` a significant portion
+of the array.
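+
+A minimal sketch of that approach (hypothetical code, not the gem's benchmark
+implementation), keeping the array sorted by _descending_ score so the minimum
+lives at the end, where `Array#pop` is `O(1)`:
+
+```ruby
+# A sorted-array priority queue: O(log n) search, O(n) insert, O(1) pop.
+class SortedArrayPQ
+  def initialize
+    @entries = [] # kept sorted by descending score
+  end
+
+  def push(score, value)
+    index = @entries.bsearch_index { |(s, _)| s <= score } || @entries.size
+    @entries.insert(index, [score, value]) # may memcpy the array's tail
+  end
+
+  def pop
+    entry = @entries.pop # O(1): the minimum is at the end
+    entry && entry.last
+  end
+end
+```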
-One obvious approach is to simply maintain an array in sorted order. And
-ruby's Array class makes it simple to maintain a sorted array by combining
-`#bsearch_index` with `#insert`. With certain insert/remove workloads that can
-perform very well, but in the worst-case an insert or delete can result in O(n),
-since `#insert` may need to `memcpy` or `memmove` a significant portion of the
-array.
+The standard way to implement a priority queue is with a binary heap. Although
+this increases the time for `pop`, it converts the amortized time per push + pop
+from `O(n)` to `O(d log n / log d)`.
-But the standard way to efficiently and simply solve this problem is using a
-binary heap. Although it increases the time for `pop`, it converts the
-amortized time per push + pop from `O(n)` to `O(d log n / log d)`.
+However, I was surprised to find that—at least for some benchmarks—my pure ruby
+heap implementation was much slower than inserting into and popping from a fully
+sorted array. The reason for this surprising result: Although it is `O(n)`,
+`memcpy` has a _very_ small constant factor, and calling `<=>` from ruby code
+has relatively _much_ larger constant factors. If your queue contains only a
+few thousand items, the overhead of those extra calls to `<=>` is _far_ more
+than occasionally calling `memcpy`. In the worst case, a _d_-heap will require
+`d + 1` times more comparisons for each push + pop than a `bsearch` + `insert`
+sorted array.
-I was surprised to find that, at least under certain benchmarks, my pure ruby
-heap implementation was usually slower than inserting into a fully sorted
-array. While this is a testament to ruby's fine-tuned Array implementationw, a
-heap implementated in C should easily peform faster than `Array#insert`.
+Moving the sift-up and sift-down code into C helps some. But much more helpful
+is optimizing the comparison of numeric scores, so `a <=> b` never needs to be
+called. I'm hopeful that MJIT will eventually obsolete this C-extension. JRuby
+or TruffleRuby may already run the pure ruby version at high speed. This can be
+hotspot code, and the basic ruby implementation would perform well if not for
+the high overhead of `<=>`.
-The biggest issue is that it just takes far too much time to call `<=>` from
-ruby code: A sorted array only requires `log n / log 2` comparisons to insert
-and no comparisons to pop. However a _d_-ary heap requires `log n / log d` to
-insert plus an additional `d log n / log d` to pop. If your queue contains only
-a few hundred items at once, the overhead of those extra calls to `<=>` is far
-more than occasionally calling `memcpy`.
+## Analysis
-It's likely that MJIT will eventually make the C-extension completely
-unnecessary. This is definitely hotspot code, and the basic ruby implementation
-would work fine, if not for that `<=>` overhead. Until then... this gem gets
-the job done.
+### Time complexity
-## TODOs...
+There are two fundamental heap operations: sift-up (used by push) and sift-down
+(used by pop).
-_TODO:_ In addition to a basic _d_-ary heap class (`DHeap`), this library
-~~includes~~ _will include_ extensions to `Array`, allowing an Array to be
-directly handled as a priority queue. These extension methods are meant to be
-used similarly to how `#bsearch` and `#bsearch_index` might be used.
+* Both sift operations can perform as many as `log n / log d` swaps, as the
+  element may sift from the bottom of the tree to the top, or vice versa.
+* Sift-up performs a single comparison per swap: `O(1)`.
+  So pushing a new element is `O(log n / log d)`.
+* Sift-down performs as many as `d` comparisons per swap: `O(d)`.
+  So popping the min element is `O(d log n / log d)`.
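+
+For illustration, here is a minimal pure-ruby sketch of the two sift
+operations over a flat array of `[score, value]` pairs (hypothetical helpers,
+not the gem's C implementation):
+
+```ruby
+# For a node at index i: parent = (i - 1) / d, children = d*i + 1 .. d*i + d.
+def sift_up(heap, d, i)
+  while i > 0
+    parent = (i - 1) / d
+    break if heap[parent][0] <= heap[i][0] # one comparison per layer
+    heap[parent], heap[i] = heap[i], heap[parent]
+    i = parent
+  end
+end
+
+def sift_down(heap, d, i)
+  loop do
+    first_child = d * i + 1
+    break if first_child >= heap.size # leaf nodes need no comparisons
+    last_child = [first_child + d - 1, heap.size - 1].min
+    min_child = (first_child..last_child).min_by { |c| heap[c][0] } # <= d comparisons
+    break if heap[i][0] <= heap[min_child][0]
+    heap[i], heap[min_child] = heap[min_child], heap[i]
+    i = min_child
+  end
+end
+```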
-_TODO:_ Also ~~included is~~ _will include_ `DHeap::Set`, which augments the
-basic heap with an internal `Hash`, which maps a set of values to scores.
-loosely inspired by go's timers. e.g: It lazily sifts its heap after deletion
-and adjustments, to achieve faster average runtime for *add* and *cancel*
-operations.
+Assuming every inserted element is eventually deleted from the root, each
+element is pushed once (one comparison per layer) and popped once (`d`
+comparisons per layer), across `log n / log d` layers. The comparison count is
+thus proportional to `(1 + d) / ln d`, which is minimized at d=4:
-_TODO:_ Also ~~included is~~ _will include_ `DHeap::Timers`, which contains some
-features that are loosely inspired by go's timers. e.g: It lazily sifts its
-heap after deletion and adjustments, to achieve faster average runtime for *add*
-and *cancel* operations.
+* d=2: (1 + 2) / ln 2 = 4.328085
+* d=3: (1 + 3) / ln 3 = 3.640957
+* d=4: (1 + 4) / ln 4 = 3.606738
+* d=5: (1 + 5) / ln 5 = 3.728010
+* d=6: (1 + 6) / ln 6 = 3.906774
+* etc...
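+
+That factor is easy to verify with a quick calculation in plain ruby:
+
+```ruby
+# Comparisons per push + pop grow as (1 + d) / ln(d); d=4 is the minimum.
+(2..8).each { |d| printf("d=%d: %.6f\n", d, (1.0 + d) / Math.log(d)) }
+```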
-Additionally, I was inspired by reading go's "timer.go" implementation to
-experiment with a 4-ary heap instead of the traditional binary heap. In the
-case of timers, new timers are usually scheduled to run after most of the
-existing timers. And timers are usually canceled before they have a chance to
-run. While a binary heap holds 50% of its elements in its last layer, 75% of a
-4-ary heap will have no children. That diminishes the extra comparison overhead
-during sift-down.
+Leaf nodes require no comparisons to shift down, and higher values for d have
+higher percentage of leaf nodes:
+* d=2 has ~50% leaves,
+* d=3 has ~67% leaves,
+* d=4 has ~75% leaves,
+* and so on...
+
+See https://en.wikipedia.org/wiki/D-ary_heap#Analysis for deeper analysis.
+
+### Space complexity
+
+Space usage is linear, regardless of d. However, higher d values may provide
+better cache locality. Because the heap is a complete _d_-ary tree, the
+elements can be stored in an array, without the need for tree or list pointers.
+
+Ruby can compare Numeric values _much_ faster than other ruby objects, even if
+those objects simply delegate comparison to internal Numeric values. And it is
+often useful to use external scores for otherwise incomparable values. So
+`DHeap` uses twice as many entries (one for score and one for value)
+as an array which only stores values.
+
## Benchmarks
-_TODO: put benchmarks here._
+_See `bin/benchmarks` and `docs/benchmarks.txt`, as well as `bin/profile` and
+`docs/profile.txt` for more details or updated results. These benchmarks were
+measured with v0.4.0 and ruby 2.7.2 without MJIT enabled._
-## Analysis
+These benchmarks use very simple implementations for a pure-ruby heap and an
+array that is kept sorted using `Array#bsearch_index` and `Array#insert`. For
+comparison, an alternate implementation using `Array#min` and `Array#delete_at`
+is also shown.
-### Time complexity
+Three different scenarios are measured:
+ * push N values but never pop (clearing between each set of pushes).
+ * push N values and then pop N values.
+ Although this could be used for heap sort, we're unlikely to choose heap sort
+ over Ruby's quick sort implementation. I'm using this scenario to represent
+ the amortized cost of creating a heap and (eventually) draining it.
+ * For a heap of size N, repeatedly push and pop while keeping a stable size.
+   This is a _very simple_ approximation for how most scheduler/timer heaps
+   would be used (see the sketch after this list). Usually when a timer fires
+   it will be quickly replaced by a new timer, and the overall count of timers
+   will remain roughly stable.
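+
+A rough sketch (hypothetical, not the gem's actual benchmark code) of what
+that third scenario measures:
+
+```ruby
+# Keep the queue near a stable size N: each iteration pushes one
+# randomly-scored element and immediately pops the minimum.
+def push_pop_stable_size(queue, iterations)
+  iterations.times do
+    queue.push(rand(1_000_000), :timer)
+    queue.pop
+  end
+end
+```
+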
-Both sift operations can perform (log[d] n = log n / log d) swaps.
-Swap up performs only a single comparison per swap: O(1).
-Swap down performs as many as d comparions per swap: O(d).
+In these benchmarks, `DHeap` runs faster than all other implementations for
+every scenario and every value of N, although the difference is much more
+noticeable at higher values of N. The pure ruby heap implementation is
+competitive for `push` alone at every value of N, but is significantly slower
+than bsearch + insert for push + pop until N is _very_ large (somewhere between
+10k and 100k)!
-Inserting an item is O(log n / log d).
-Deleting the root is O(d log n / log d).
+For very small N values, `DHeap` runs faster than the other implementations in
+each scenario, although the difference is still relatively small. The pure ruby
+binary heap is 2x or more slower than bsearch + insert for the common push/pop
+scenario:
-Assuming every inserted item is eventually deleted from the root, d=4 requires
-the fewest comparisons for combined insert and delete:
- * (1 + 2) lg 2 = 4.328085
- * (1 + 3) lg 3 = 3.640957
- * (1 + 4) lg 4 = 3.606738
- * (1 + 5) lg 5 = 3.728010
- * (1 + 6) lg 6 = 3.906774
- * etc...
+ == push N (N=5) ==========================================================
+ push N (c_dheap): 1701338.1 i/s
+ push N (rb_heap): 971614.1 i/s - 1.75x slower
+ push N (bsearch): 946363.7 i/s - 1.80x slower
-Leaf nodes require no comparisons to shift down, and higher values for d have
-higher percentage of leaf nodes:
- * d=2 has ~50% leaves,
- * d=3 has ~67% leaves,
- * d=4 has ~75% leaves,
- * and so on...
+ == push N then pop N (N=5) ===============================================
+ push N + pop N (c_dheap): 1087944.8 i/s
+ push N + pop N (findmin): 841708.1 i/s - 1.29x slower
+ push N + pop N (bsearch): 773252.7 i/s - 1.41x slower
+ push N + pop N (rb_heap): 471852.9 i/s - 2.31x slower
-See https://en.wikipedia.org/wiki/D-ary_heap#Analysis for deeper analysis.
+ == Push/pop with pre-filled queue of size=N (N=5) ========================
+ push + pop (c_dheap): 5525418.8 i/s
+ push + pop (findmin): 5003904.8 i/s - 1.10x slower
+ push + pop (bsearch): 4320581.8 i/s - 1.28x slower
+ push + pop (rb_heap): 2207042.0 i/s - 2.50x slower
-### Space complexity
+By N=21, `DHeap` has pulled significantly ahead of bsearch + insert for all
+scenarios, but the pure ruby heap is still slower than every other
+implementation—even resorting the array after every `#push`—in any scenario that
+uses `#pop`.
-Because the heap is a complete binary tree, space usage is linear, regardless
-of d. However higher d values may provide better cache locality.
+ == push N (N=21) =========================================================
+ push N (c_dheap): 408307.0 i/s
+ push N (rb_heap): 212275.2 i/s - 1.92x slower
+ push N (bsearch): 169583.2 i/s - 2.41x slower
-We can run comparisons much much faster for Numeric or String objects than for
-ruby objects which delegate comparison to internal Numeric or String objects.
-And it is often advantageous to use extrinsic scores for uncomparable items.
-For this, our internal array uses twice as many entries (one for score and one
-for value) as it would if it only supported intrinsic comparison or used an
-un-memoized "sort_by" proc.
+ == push N then pop N (N=21) ==============================================
+ push N + pop N (c_dheap): 199435.5 i/s
+ push N + pop N (findmin): 162024.5 i/s - 1.23x slower
+ push N + pop N (bsearch): 146284.3 i/s - 1.36x slower
+ push N + pop N (rb_heap): 72289.0 i/s - 2.76x slower
+ == Push/pop with pre-filled queue of size=N (N=21) =======================
+ push + pop (c_dheap): 4836860.0 i/s
+ push + pop (findmin): 4467453.9 i/s - 1.08x slower
+ push + pop (bsearch): 3345458.4 i/s - 1.45x slower
+ push + pop (rb_heap): 1560476.3 i/s - 3.10x slower
+
+At higher values of N, `DHeap`'s logarithmic growth leads to little slowdown
+of `DHeap#push`, while insert's linear growth causes it to run slower and
+slower. But because `#pop` is O(1) for a sorted array and O(d log n / log d)
+for a _d_-heap, scenarios involving `#pop` remain relatively close even as N
+increases to 5k:
+
+ == Push/pop with pre-filled queue of size=N (N=5461) ==============
+ push + pop (c_dheap): 2718225.1 i/s
+ push + pop (bsearch): 1793546.4 i/s - 1.52x slower
+ push + pop (rb_heap): 707139.9 i/s - 3.84x slower
+ push + pop (findmin): 122316.0 i/s - 22.22x slower
+
+Somewhat surprisingly, bsearch + insert still runs faster than a pure ruby heap
+for the repeated push/pop scenario, all the way up to N as high as 87k:
+
+ == push N (N=87381) ======================================================
+ push N (c_dheap): 92.8 i/s
+ push N (rb_heap): 43.5 i/s - 2.13x slower
+ push N (bsearch): 2.9 i/s - 31.70x slower
+
+ == push N then pop N (N=87381) ===========================================
+ push N + pop N (c_dheap): 22.6 i/s
+ push N + pop N (rb_heap): 5.5 i/s - 4.08x slower
+ push N + pop N (bsearch): 2.9 i/s - 7.90x slower
+
+ == Push/pop with pre-filled queue of size=N (N=87381) ====================
+ push + pop (c_dheap): 1815277.3 i/s
+ push + pop (bsearch): 762343.2 i/s - 2.38x slower
+ push + pop (rb_heap): 535913.6 i/s - 3.39x slower
+ push + pop (findmin): 2262.8 i/s - 802.24x slower
+
+## Profiling
+
+_n.b. `Array#fetch` is reading the input data, external to heap operations.
+These benchmarks use integers for all scores, which enables significantly faster
+comparisons. If `a <=> b` were used instead, then the difference between push
+and pop would be much larger. And ruby's `TracePoint` impacts these different
+implementations differently. So we can't use these profiler results for
+comparisons between implementations. A sampling profiler would be needed for
+more accurate relative measurements._
+
+It's informative to look at the `ruby-prof` results for a simple binary search +
+insert implementation, repeatedly pushing and popping to a large heap. In
+particular, even with 1000 members, the linear `Array#insert` is _still_ faster
+than the logarithmic `Array#bsearch_index`. At this scale, ruby comparisons are
+still (relatively) slow and `memcpy` is (relatively) quite fast!
+
+ %self total self wait child calls name location
+ 34.79 2.222 2.222 0.000 0.000 1000000 Array#insert
+ 32.59 2.081 2.081 0.000 0.000 1000000 Array#bsearch_index
+ 12.84 6.386 0.820 0.000 5.566 1 DHeap::Benchmarks::Scenarios#repeated_push_pop d_heap/benchmarks.rb:77
+ 10.38 4.966 0.663 0.000 4.303 1000000 DHeap::Benchmarks::BinarySearchAndInsert#<< d_heap/benchmarks/implementations.rb:61
+ 5.38 0.468 0.343 0.000 0.125 1000000 DHeap::Benchmarks::BinarySearchAndInsert#pop d_heap/benchmarks/implementations.rb:70
+ 2.06 0.132 0.132 0.000 0.000 1000000 Array#fetch
+ 1.95 0.125 0.125 0.000 0.000 1000000 Array#pop
+
+Contrast this with a simplistic pure-ruby implementation of a binary heap:
+
+ %self total self wait child calls name location
+ 48.52 8.487 8.118 0.000 0.369 1000000 DHeap::Benchmarks::NaiveBinaryHeap#pop d_heap/benchmarks/implementations.rb:96
+ 42.94 7.310 7.184 0.000 0.126 1000000 DHeap::Benchmarks::NaiveBinaryHeap#<< d_heap/benchmarks/implementations.rb:80
+ 4.80 16.732 0.803 0.000 15.929 1 DHeap::Benchmarks::Scenarios#repeated_push_pop d_heap/benchmarks.rb:77
+
+You can see that it spends more time in pop than it does in push. That is
+expected behavior for a heap: although both are O(log n), pop is
+significantly more complex, performing up to _d_ comparisons per layer.
+
+And `DHeap` shows a similar comparison between push and pop, although it spends
+half of its time in the benchmark code (which is written in ruby):
+
+ %self total self wait child calls name location
+ 43.09 1.685 0.726 0.000 0.959 1 DHeap::Benchmarks::Scenarios#repeated_push_pop d_heap/benchmarks.rb:77
+ 26.05 0.439 0.439 0.000 0.000 1000000 DHeap#<<
+ 23.57 0.397 0.397 0.000 0.000 1000000 DHeap#pop
+ 7.29 0.123 0.123 0.000 0.000 1000000 Array#fetch
+
### Timers
Additionally, when used to sort timers, we can reasonably assume that:
* New timers usually sort after most existing timers.
* Most timers will be canceled before executing.
@@ -176,25 +376,57 @@
rescheduled before we garbage collect, adjusting its position will usually be
faster than a delete and re-insert.
## Alternative data structures
+As always, you should run benchmarks with your expected scenarios to determine
+which is right.
+
Depending on what you're doing, maintaining a sorted `Array` using
-`#bsearch_index` and `#insert` might be faster! Although it is technically
-O(n) for insertions, the implementations for `memcpy` or `memmove` can be *very*
-fast on modern architectures. Also, it can be faster O(n) on average, if
-insertions are usually near the end of the array. You should run benchmarks
-with your expected scenarios to determine which is right.
+`#bsearch_index` and `#insert` might be just fine! As discussed above, although
+it is `O(n)` for insertions, `memcpy` is so fast on modern hardware that this
+may not matter. Also, if you can arrange for insertions to occur near the end
+of the array, that could significantly reduce the `memcpy` overhead even more.
+More complex heap variants, e.g. the [Fibonacci heap], allow heaps to be merged
+and can lower the amortized time for some operations.
+
+[Fibonacci heap]: https://en.wikipedia.org/wiki/Fibonacci_heap
+
If it is important to be able to quickly enumerate the set or find the ranking
-of values in it, then you probably want to use a self-balancing binary search
-tree (e.g. a red-black tree) or a skip-list.
+of values in it, then you may want to use a self-balancing binary search tree
+(e.g. a [red-black tree]) or a [skip-list].
-A Hashed Timing Wheel or Heirarchical Timing Wheels (or some variant in that
-family of data structures) can be constructed to have effectively O(1) running
-time in most cases. However, the implementation for that data structure is more
-complex than a heap. If a 4-ary heap is good enough for go's timers, it should
-be suitable for many use cases.
+[red-black tree]: https://en.wikipedia.org/wiki/Red%E2%80%93black_tree
+[skip-list]: https://en.wikipedia.org/wiki/Skip_list
+
+[Hashed and Hierarchical Timing Wheels][timing wheels] (or some variant in that
+family of data structures) can be constructed to have effectively `O(1)` running
+time in most cases. Although the implementation for that data structure is more
+complex than a heap, it may be necessary for enormous values of N.
+
+[timing wheels]: http://www.cs.columbia.edu/~nahum/w6998/papers/ton97-timing-wheels.pdf
+
+## TODOs...
+
+_TODO:_ Also ~~included is~~ _will include_ `DHeap::Set`, which augments the
+basic heap with an internal `Hash`, which maps a set of values to scores.
+
+_TODO:_ Also ~~included is~~ _will include_ `DHeap::Lazy`, which contains some
+features that are loosely inspired by go's timers, e.g. it lazily sifts its
+heap after deletions and adjustments, to achieve faster average runtime for
+*add* and *cancel* operations.
+
+Additionally, I was inspired by reading go's "timer.go" implementation to
+experiment with a 4-ary heap instead of the traditional binary heap. In the
+case of timers, new timers are usually scheduled to run after most of the
+existing timers. And timers are usually canceled before they have a chance to
+run. While a binary heap holds 50% of its elements in its last layer, 75% of a
+4-ary heap will have no children. That diminishes the extra comparison overhead
+during sift-down.
## Development
After checking out the repo, run `bin/setup` to install dependencies. Then, run
`rake spec` to run the tests. You can also run `bin/console` for an interactive