README.md in d_heap-0.3.0 vs README.md in d_heap-0.4.0

- old
+ new

@@ -1,47 +1,130 @@ # DHeap
-A fast _d_-ary heap implementation for ruby, useful in priority queues and graph
-algorithms.
+A fast [_d_-ary heap][d-ary heap] [priority queue] for ruby, implemented as a C
+extension.
+
+With a regular queue, you expect "FIFO" behavior: first in, first out. With a
+stack you expect "LIFO": last in, first out. With a priority queue, you push
+elements along with a score, and the lowest-scored element is the first to be
+popped. Priority queues are often used in algorithms for e.g. [scheduling] of
+timers or bandwidth management, [Huffman coding], and various graph search
+algorithms such as [Dijkstra's algorithm], [A* search], or [Prim's algorithm].

-The _d_-ary heap data structure is a generalization of the binary heap, in which
-the nodes have _d_ children instead of 2. This allows for "decrease priority"
-operations to be performed more quickly with the tradeoff of slower delete
-minimum. Additionally, _d_-ary heaps can have better memory cache behavior than
-binary heaps, allowing them to run more quickly in practice despite slower
-worst-case time complexity. In the worst case, a _d_-ary heap requires only
-`O(log n / log d)` to push, with the tradeoff that pop is `O(d log n / log d)`.
+The _d_-ary heap data structure is a generalization of the [binary heap], in
+which the nodes have _d_ children instead of 2. This allows "decrease priority"
+operations to be performed more quickly, with the tradeoff of slower delete
+minimum. Additionally, _d_-ary heaps can have better memory cache behavior than
+binary heaps, allowing them to run more quickly in practice despite slower
+worst-case time complexity. In the worst case, a _d_-ary heap requires only
+`O(log n / log d)` to push, with the tradeoff that pop is `O(d log n / log d)`.

-Although you should probably just stick with the default _d_ value of `4`, it
-may be worthwhile to benchmark your specific scenario.
+Although you should probably just use the default _d_ value of `4` (see the
+analysis below), it's always advisable to benchmark your specific use-case.
+
+[d-ary heap]: https://en.wikipedia.org/wiki/D-ary_heap
+[priority queue]: https://en.wikipedia.org/wiki/Priority_queue
+[binary heap]: https://en.wikipedia.org/wiki/Binary_heap
+[scheduling]: https://en.wikipedia.org/wiki/Scheduling_(computing)
+[Huffman coding]: https://en.wikipedia.org/wiki/Huffman_coding#Compression
+[Dijkstra's algorithm]: https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm#Using_a_priority_queue
+[A* search]: https://en.wikipedia.org/wiki/A*_search_algorithm#Description
+[Prim's algorithm]: https://en.wikipedia.org/wiki/Prim%27s_algorithm

## Usage

-The simplest way to use it is simply with `#push` and `#pop`. Push takes a
-score and a value, and pop returns the value with the current minimum score.
+The basic API is:
+
+* `heap << object` adds a value, which serves as its own score.
+* `heap.push(score, value)` adds a value with an extrinsic score.
+* `heap.pop` removes and returns the value with the minimum score.
+* `heap.pop_lte(score)` pops only if the minimum score is `<=` the provided score.
+* `heap.peek` views the minimum value without popping it.
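+
+For example, a scheduler might use `#pop_lte` to drain only the tasks that are
+already due. (A minimal sketch: the scores and tasks are hypothetical, and it
+assumes `#pop_lte` returns `nil` when the minimum score is above the limit.)
+
+```ruby
+require "d_heap"
+
+queue = DHeap.new
+queue.push 1.5, :task_a # due at t=1.5
+queue.push 2.0, :task_b # due at t=2.0
+queue.push 9.9, :task_c # not due until t=9.9
+
+now = 5.0
+due = []
+while (task = queue.pop_lte(now))
+  due << task
+end
+due        # => [:task_a, :task_b]
+queue.peek # => :task_c
+```
+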
+The score must be `Integer` or `Float` or convertible to a `Float` via
+`Float(score)` (i.e. it should implement `#to_f`). Constraining scores to
+numeric values gives a 40+% speedup under some benchmarks!
+
+_n.b._ `Integer` _scores must have an absolute value that fits into_ `unsigned
+long long`. _This is architecture dependent, but on an IA-64 system this is 64
+bits, which gives a range of -18,446,744,073,709,551,615 to
++18,446,744,073,709,551,615._
+
+_Comparing arbitrary objects via_ `a <=> b` _was the original design and may
+be added back in a future version,_ if (and only if) _it can be done without
+impacting the speed of numeric comparisons._
+
```ruby
require "d_heap"

-heap = DHeap.new # defaults to a 4-ary heap
+Task = Struct.new(:id) # for demonstration
+heap = DHeap.new # defaults to a 4-heap
+
# storing [score, value] tuples
heap.push Time.now + 5*60, Task.new(1)
heap.push Time.now + 30, Task.new(2)
heap.push Time.now + 60, Task.new(3)
heap.push Time.now + 5, Task.new(4)

-# peeking and popping (using last to get the task and ignore the time)
-heap.pop.last # => Task[4]
-heap.pop.last # => Task[2]
-heap.peak.last # => Task[3], but don't pop it
-heap.pop.last # => Task[3]
-heap.pop.last # => Task[1]
+# peeking and popping
+heap.pop    # => Task[4]
+heap.pop    # => Task[2]
+heap.peek   # => Task[3], but don't pop it from the heap
+heap.pop    # => Task[3]
+heap.pop    # => Task[1]
+heap.empty? # => true
+heap.pop    # => nil
```

-Read the `rdoc` for more detailed documentation and examples.
+If your values behave as their own score, by being convertible via
+`Float(value)`, then you can use `#<<` for implicit scoring. The score should
+not change for as long as the value remains in the heap, since it will not be
+re-evaluated after being pushed.
+
+```ruby
+heap.clear
+
+# The score can be derived from the value by using to_f.
+# "a <=> b" is *much* slower than comparing numbers, so it isn't used.
+class Event
+  include Comparable
+  attr_reader :time, :payload
+  alias_method :to_time, :time
+
+  def initialize(time, payload)
+    @time = time.to_time
+    @payload = payload
+    freeze
+  end
+
+  def to_f
+    time.to_f
+  end
+
+  def <=>(other)
+    to_f <=> other.to_f
+  end
+end
+
+# example events; any Float-convertible values work
+comparable_min = Event.new(Time.now + 1, "first")
+comparable_mid = Event.new(Time.now + 2, "second")
+comparable_max = Event.new(Time.now + 3, "third")
+
+heap << comparable_max # sorts last, by its to_f value
+heap << comparable_min # sorts first, by its to_f value
+heap << comparable_mid # sorts in the middle, by its to_f value
+heap.pop    # => comparable_min
+heap.pop    # => comparable_mid
+heap.pop    # => comparable_max
+heap.empty? # => true
+heap.pop    # => nil
+```
+
+You can also pass a value into `#pop(max)`, which will only pop if the minimum
+score is less than or equal to `max` (the same condition as `#pop_lte` above).
+
+Read the [API documentation] for more detailed documentation and examples.
+
+[API documentation]: https://rubydoc.info/gems/d_heap/DHeap
+
## Installation

Add this line to your application's Gemfile:

```ruby
@@ -56,114 +139,231 @@ $ gem install d_heap

## Motivation

-Sometimes you just need a priority queue, right? With a regular queue, you
-expect "FIFO" behavior: first in, first out. With a priority queue, you push
-with a score (or your elements are comparable), and you want to be able to
-efficiently pop off the minimum (or maximum) element.
+One naive approach to a priority queue is to maintain an array in sorted order.
+This can be implemented very simply using `Array#bsearch_index` +
+`Array#insert`. This can be very fast—`Array#pop` is `O(1)`—but the worst-case
+for insert is `O(n)`, because it may need to `memcpy` a significant portion of
+the array.

-One obvious approach is to simply maintain an array in sorted order. And
-ruby's Array class makes it simple to maintain a sorted array by combining
-`#bsearch_index` with `#insert`. With certain insert/remove workloads that can
-perform very well, but in the worst-case an insert or delete can result in O(n),
-since `#insert` may need to `memcpy` or `memmove` a significant portion of the
-array.
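+
+As a concrete illustration, the whole sorted-array approach fits in a few
+lines. (A hypothetical sketch, not part of this gem: values serve as their own
+scores, and the array is kept sorted in _descending_ order so that the minimum
+is at the end, where `Array#pop` is `O(1)`.)
+
+```ruby
+class SortedArrayPQ
+  def initialize
+    @entries = [] # kept sorted in descending order; minimum is last
+  end
+
+  def <<(value)
+    # O(log n) comparisons to find the insertion point...
+    index = @entries.bsearch_index { |other| other <= value }
+    # ...but O(n) worst-case to insert, due to memcpy/memmove.
+    @entries.insert(index || @entries.size, value)
+    self
+  end
+
+  def pop
+    @entries.pop # O(1): the minimum is the last element
+  end
+end
+```
+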
+The standard way to implement a priority queue is with a binary heap. Although
+this increases the time for `pop`, it converts the amortized time per push +
+pop from `O(n)` to `O(d log n / log d)`.

-But the standard way to efficiently and simply solve this problem is using a
-binary heap. Although it increases the time for `pop`, it converts the
-amortized time per push + pop from `O(n)` to `O(d log n / log d)`.

+However, I was surprised to find that—at least for some benchmarks—my pure ruby
+heap implementation was much slower than inserting into and popping from a
+fully sorted array. The reason for this surprising result: although it is
+`O(n)`, `memcpy` has a _very_ small constant factor, and calling `<=>` from
+ruby code has relatively _much_ larger constant factors. If your queue contains
+only a few thousand items, the overhead of those extra calls to `<=>` is _far_
+more than occasionally calling `memcpy`. In the worst case, a _d_-heap will
+require `d + 1` times more comparisons for each push + pop than a `bsearch` +
+`insert` sorted array.

-I was surprised to find that, at least under certain benchmarks, my pure ruby
-heap implementation was usually slower than inserting into a fully sorted
-array. While this is a testament to ruby's fine-tuned Array implementationw, a
-heap implementated in C should easily peform faster than `Array#insert`.

+Moving the sift-up and sift-down code into C helps some. But much more helpful
+is optimizing the comparison of numeric scores, so `a <=> b` never needs to be
+called. I'm hopeful that MJIT will eventually make this C extension obsolete.
+JRuby or TruffleRuby may already run the pure ruby version at high speed. This
+can be hotspot code, and the basic ruby implementation should perform well if
+not for the high overhead of `<=>`.

-The biggest issue is that it just takes far too much time to call `<=>` from
-ruby code: A sorted array only requires `log n / log 2` comparisons to insert
-and no comparisons to pop. However a _d_-ary heap requires `log n / log d` to
-insert plus an additional `d log n / log d` to pop. If your queue contains only
-a few hundred items at once, the overhead of those extra calls to `<=>` is far
-more than occasionally calling `memcpy`.

+## Analysis

-It's likely that MJIT will eventually make the C-extension completely
-unnecessary. This is definitely hotspot code, and the basic ruby implementation
-would work fine, if not for that `<=>` overhead. Until then... this gem gets
-the job done.

+### Time complexity

-## TODOs...

+There are two fundamental heap operations: sift-up (used by push) and sift-down
+(used by pop).

-_TODO:_ In addition to a basic _d_-ary heap class (`DHeap`), this library
-~~includes~~ _will include_ extensions to `Array`, allowing an Array to be
-directly handled as a priority queue. These extension methods are meant to be
-used similarly to how `#bsearch` and `#bsearch_index` might be used.

+* Both sift operations can perform as many as `log n / log d` swaps, as the
+  element may sift from the bottom of the tree to the top, or vice versa.
+* Sift-up performs a single comparison per swap: `O(1)`.
+  So pushing a new element is `O(log n / log d)`.
+* Sift-down performs as many as `d` comparisons per swap: `O(d)`.
+  So popping the min element is `O(d log n / log d)`.

-_TODO:_ Also ~~included is~~ _will include_ `DHeap::Set`, which augments the
-basic heap with an internal `Hash`, which maps a set of values to scores.
-loosely inspired by go's timers. e.g: It lazily sifts its heap after deletion
-and adjustments, to achieve faster average runtime for *add* and *cancel*
-operations.

+Assuming every inserted element is eventually deleted from the root, d=4
+requires the fewest comparisons per combined insert and delete, i.e. it
+minimizes `(1 + d) * log n / log d` (the common `log n` factor is dropped
+below):

-_TODO:_ Also ~~included is~~ _will include_ `DHeap::Timers`, which contains some
-features that are loosely inspired by go's timers. e.g: It lazily sifts its
-heap after deletion and adjustments, to achieve faster average runtime for *add*
-and *cancel* operations.

+* (1 + 2) / ln 2 = 4.328085
+* (1 + 3) / ln 3 = 3.640957
+* (1 + 4) / ln 4 = 3.606738
+* (1 + 5) / ln 5 = 3.728010
+* (1 + 6) / ln 6 = 3.906774
+* etc...

-Additionally, I was inspired by reading go's "timer.go" implementation to
-experiment with a 4-ary heap instead of the traditional binary heap. In the
-case of timers, new timers are usually scheduled to run after most of the
-existing timers. And timers are usually canceled before they have a chance to
-run. While a binary heap holds 50% of its elements in its last layer, 75% of a
-4-ary heap will have no children. That diminishes the extra comparison overhead
-during sift-down.

+Leaf nodes require no comparisons to sift down, and higher values of d have a
+higher percentage of leaf nodes:
+
+* d=2 has ~50% leaves,
+* d=3 has ~67% leaves,
+* d=4 has ~75% leaves,
+* and so on...
+
+See https://en.wikipedia.org/wiki/D-ary_heap#Analysis for deeper analysis.
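+
+The minimum at d=4 is easy to check numerically (a throwaway script, not part
+of the gem):
+
+```ruby
+# Comparisons per push + pop are proportional to (1 + d) / ln(d):
+# one comparison per level to sift up, d per level to sift down,
+# over log(n) / log(d) levels (the common log(n) factor is dropped).
+(2..8).each do |d|
+  printf "d=%d: %.6f\n", d, (1 + d) / Math.log(d)
+end
+# d=2: 4.328085
+# d=3: 3.640957
+# d=4: 3.606738  <- minimum
+# d=5: 3.728010
+# d=6: 3.906774
+# d=7: 4.111187
+# d=8: 4.328085
+```
+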
+### Space complexity
+
+Space usage is linear, regardless of d. However, higher d values may provide
+better cache locality. Because the heap is a complete _d_-ary tree, the
+elements can be stored in a flat array, without the need for tree or list
+pointers.
+
+Ruby can compare Numeric values _much_ faster than other ruby objects, even if
+those objects simply delegate comparison to internal Numeric values. And it is
+often useful to use external scores for otherwise uncomparable values. So
+`DHeap` stores twice as many entries (one for the score and one for the value)
+as an array which only stores values.

## Benchmarks

-_TODO: put benchmarks here._
+_See `bin/benchmarks` and `docs/benchmarks.txt`, as well as `bin/profile` and
+`docs/profile.txt`, for more details or updated results. These benchmarks were
+measured with v0.4.0 and ruby 2.7.2 without MJIT enabled._

-## Analysis
+These benchmarks use very simple implementations for a pure-ruby heap and for
+an array that is kept sorted using `Array#bsearch_index` and `Array#insert`
+(see the sketches above and below). For comparison, an alternate "findmin"
+implementation using `Array#min` and `Array#delete_at` is also shown.

-### Time complexity
+Three different scenarios are measured:
+
+* Push N values but never pop (clearing between each set of pushes).
+* Push N values and then pop N values. Although this could be used for heap
+  sort, we're unlikely to choose heap sort over Ruby's quick sort
+  implementation. I'm using this scenario to represent the amortized cost of
+  creating a heap and (eventually) draining it.
+* For a heap of size N, repeatedly push and pop while keeping a stable size.
+  This is a _very simple_ approximation of how most scheduler/timer heaps
+  would be used: usually, when a timer fires, it will be quickly replaced by a
+  new timer, and the overall count of timers will remain roughly stable.

-Both sift operations can perform (log[d] n = log n / log d) swaps.
-Swap up performs only a single comparison per swap: O(1).
-Swap down performs as many as d comparions per swap: O(d).
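+
+In spirit, the pure-ruby heap contender ("rb_heap") is a binary min-heap along
+these lines (a simplified sketch, not the exact benchmark code; values serve
+as their own scores):
+
+```ruby
+class NaiveBinaryHeap
+  def initialize
+    @heap = []
+  end
+
+  def <<(value)
+    @heap << value
+    index = @heap.size - 1
+    while index > 0 # sift-up: one comparison per level
+      parent = (index - 1) / 2
+      break unless @heap[index] < @heap[parent]
+      @heap[index], @heap[parent] = @heap[parent], @heap[index]
+      index = parent
+    end
+    self
+  end
+
+  def pop
+    return nil if @heap.empty?
+    min = @heap.first
+    last = @heap.pop
+    unless @heap.empty?
+      @heap[0] = last
+      index = 0
+      loop do # sift-down: up to two comparisons per level
+        child = 2 * index + 1
+        break if child >= @heap.size
+        sibling = child + 1
+        child = sibling if sibling < @heap.size && @heap[sibling] < @heap[child]
+        break unless @heap[child] < @heap[index]
+        @heap[index], @heap[child] = @heap[child], @heap[index]
+        index = child
+      end
+    end
+    min
+  end
+end
+```
+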
+In these benchmarks, `DHeap` runs faster than all other implementations in
+every scenario and at every value of N, although the difference is much more
+noticeable at higher values of N. The pure ruby heap implementation is
+competitive for `push` alone at every value of N, but is significantly slower
+than bsearch + insert for push + pop until N is _very_ large (somewhere between
+10k and 100k)!

-Inserting an item is O(log n / log d).
-Deleting the root is O(d log n / log d).

+Even for very small values of N, `DHeap` runs faster than the other
+implementations in each scenario, although the difference is still relatively
+small. The pure ruby binary heap is 2x or more slower than bsearch + insert for
+the common push/pop scenario.

-Assuming every inserted item is eventually deleted from the root, d=4 requires
-the fewest comparisons for combined insert and delete:
- * (1 + 2) lg 2 = 4.328085
- * (1 + 3) lg 3 = 3.640957
- * (1 + 4) lg 4 = 3.606738
- * (1 + 5) lg 5 = 3.728010
- * (1 + 6) lg 6 = 3.906774
- * etc...

+    == push N (N=5) ==========================================================
+    push N (c_dheap):  1701338.1 i/s
+    push N (rb_heap):   971614.1 i/s - 1.75x slower
+    push N (bsearch):   946363.7 i/s - 1.80x slower

-Leaf nodes require no comparisons to shift down, and higher values for d have
-higher percentage of leaf nodes:
- * d=2 has ~50% leaves,
- * d=3 has ~67% leaves,
- * d=4 has ~75% leaves,
- * and so on...

+    == push N then pop N (N=5) ===============================================
+    push N + pop N (c_dheap):  1087944.8 i/s
+    push N + pop N (findmin):   841708.1 i/s - 1.29x slower
+    push N + pop N (bsearch):   773252.7 i/s - 1.41x slower
+    push N + pop N (rb_heap):   471852.9 i/s - 2.31x slower

-See https://en.wikipedia.org/wiki/D-ary_heap#Analysis for deeper analysis.

+    == Push/pop with pre-filled queue of size=N (N=5) ========================
+    push + pop (c_dheap):  5525418.8 i/s
+    push + pop (findmin):  5003904.8 i/s - 1.10x slower
+    push + pop (bsearch):  4320581.8 i/s - 1.28x slower
+    push + pop (rb_heap):  2207042.0 i/s - 2.50x slower

-### Space complexity

+By N=21, `DHeap` has pulled significantly ahead of bsearch + insert in all
+scenarios, but the pure ruby heap is still slower than every other
+implementation—even resorting the array after every `#push`—in any scenario
+that uses `#pop`.

-Because the heap is a complete binary tree, space usage is linear, regardless
-of d. However higher d values may provide better cache locality.

+    == push N (N=21) =========================================================
+    push N (c_dheap):  408307.0 i/s
+    push N (rb_heap):  212275.2 i/s - 1.92x slower
+    push N (bsearch):  169583.2 i/s - 2.41x slower

-We can run comparisons much much faster for Numeric or String objects than for
-ruby objects which delegate comparison to internal Numeric or String objects.
-And it is often advantageous to use extrinsic scores for uncomparable items.
-For this, our internal array uses twice as many entries (one for score and one
-for value) as it would if it only supported intrinsic comparison or used an
-un-memoized "sort_by" proc.

+    == push N then pop N (N=21) ==============================================
+    push N + pop N (c_dheap):  199435.5 i/s
+    push N + pop N (findmin):  162024.5 i/s - 1.23x slower
+    push N + pop N (bsearch):  146284.3 i/s - 1.36x slower
+    push N + pop N (rb_heap):   72289.0 i/s - 2.76x slower
+
+    == Push/pop with pre-filled queue of size=N (N=21) =======================
+    push + pop (c_dheap):  4836860.0 i/s
+    push + pop (findmin):  4467453.9 i/s - 1.08x slower
+    push + pop (bsearch):  3345458.4 i/s - 1.45x slower
+    push + pop (rb_heap):  1560476.3 i/s - 3.10x slower
+
+At higher values of N, `DHeap`'s logarithmic growth leads to only a little
+slowdown of `DHeap#push`, while insert's linear growth causes it to run slower
+and slower. But because `#pop` is `O(1)` for a sorted array and
+`O(d log n / log d)` for a _d_-heap, scenarios involving `#pop` remain
+relatively close even as N increases to 5k:
+
+    == Push/pop with pre-filled queue of size=N (N=5461) =====================
+    push + pop (c_dheap):  2718225.1 i/s
+    push + pop (bsearch):  1793546.4 i/s - 1.52x slower
+    push + pop (rb_heap):   707139.9 i/s - 3.84x slower
+    push + pop (findmin):   122316.0 i/s - 22.22x slower
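+
+The "findmin" contender is about the simplest implementation possible: `O(1)`
+push, with an `O(n)` scan on every pop. That is why it falls so far behind as
+N grows. A simplified sketch (values serve as their own scores):
+
+```ruby
+class FindMinPQ
+  def initialize
+    @entries = []
+  end
+
+  def <<(value)
+    @entries << value # O(1) append, no sorting
+    self
+  end
+
+  def pop
+    return nil if @entries.empty?
+    @entries.delete_at(@entries.index(@entries.min)) # O(n) scan + O(n) memmove
+  end
+end
+```
+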
+Somewhat surprisingly, bsearch + insert still runs faster than a pure ruby heap
+for the repeated push/pop scenario, all the way up to N as high as 87k:
+
+    == push N (N=87381) ======================================================
+    push N (c_dheap):  92.8 i/s
+    push N (rb_heap):  43.5 i/s - 2.13x slower
+    push N (bsearch):   2.9 i/s - 31.70x slower
+
+    == push N then pop N (N=87381) ===========================================
+    push N + pop N (c_dheap):  22.6 i/s
+    push N + pop N (rb_heap):   5.5 i/s - 4.08x slower
+    push N + pop N (bsearch):   2.9 i/s - 7.90x slower
+
+    == Push/pop with pre-filled queue of size=N (N=87381) ====================
+    push + pop (c_dheap):  1815277.3 i/s
+    push + pop (bsearch):   762343.2 i/s - 2.38x slower
+    push + pop (rb_heap):   535913.6 i/s - 3.39x slower
+    push + pop (findmin):     2262.8 i/s - 802.24x slower
+
+## Profiling
+
+_n.b. `Array#fetch` is reading the input data, external to the heap operations.
+These benchmarks use integers for all scores, which enables significantly
+faster comparisons. If `a <=> b` were used instead, then the difference between
+push and pop would be much larger. Also, ruby's `TracePoint` impacts these
+different implementations differently, so we can't use these profiler results
+for comparisons between implementations. A sampling profiler would be needed
+for more accurate relative measurements._
+
+It's informative to look at the `ruby-prof` results for a simple binary search +
+insert implementation, repeatedly pushing to and popping from a large queue. In
+particular, even with 1000 members, the linear `Array#insert` is _still_ faster
+than the logarithmic `Array#bsearch_index`. At this scale, ruby comparisons are
+still (relatively) slow and `memcpy` is (relatively) quite fast!
+
+    %self    total    self    wait   child   calls  name                                            location
+    34.79    2.222   2.222   0.000   0.000 1000000  Array#insert
+    32.59    2.081   2.081   0.000   0.000 1000000  Array#bsearch_index
+    12.84    6.386   0.820   0.000   5.566       1  DHeap::Benchmarks::Scenarios#repeated_push_pop  d_heap/benchmarks.rb:77
+    10.38    4.966   0.663   0.000   4.303 1000000  DHeap::Benchmarks::BinarySearchAndInsert#<<     d_heap/benchmarks/implementations.rb:61
+     5.38    0.468   0.343   0.000   0.125 1000000  DHeap::Benchmarks::BinarySearchAndInsert#pop    d_heap/benchmarks/implementations.rb:70
+     2.06    0.132   0.132   0.000   0.000 1000000  Array#fetch
+     1.95    0.125   0.125   0.000   0.000 1000000  Array#pop
+
+Contrast this with a simplistic pure-ruby implementation of a binary heap:
+
+    %self    total    self    wait   child   calls  name                                            location
+    48.52    8.487   8.118   0.000   0.369 1000000  DHeap::Benchmarks::NaiveBinaryHeap#pop          d_heap/benchmarks/implementations.rb:96
+    42.94    7.310   7.184   0.000   0.126 1000000  DHeap::Benchmarks::NaiveBinaryHeap#<<           d_heap/benchmarks/implementations.rb:80
+     4.80   16.732   0.803   0.000  15.929       1  DHeap::Benchmarks::Scenarios#repeated_push_pop  d_heap/benchmarks.rb:77
+
+You can see that it spends more time in pop than in push. That is expected
+behavior for a heap: although both are O(log n), pop is significantly more
+complex, and has _d_ comparisons per layer.
+
+And `DHeap` shows a similar split between push and pop, although it spends
+half of its time in the benchmark code (which is written in ruby):
+
+    %self    total    self    wait   child   calls  name                                            location
+    43.09    1.685   0.726   0.000   0.959       1  DHeap::Benchmarks::Scenarios#repeated_push_pop  d_heap/benchmarks.rb:77
+    26.05    0.439   0.439   0.000   0.000 1000000  DHeap#<<
+    23.57    0.397   0.397   0.000   0.000 1000000  DHeap#pop
+     7.29    0.123   0.123   0.000   0.000 1000000  Array#fetch
+
### Timers

Additionally, when used to sort timers, we can reasonably assume that:

* New timers usually sort after most existing timers.
* Most timers will be canceled before executing.

@@ -176,25 +376,57 @@ rescheduled before we garbage collect, adjusting its position will usually
be faster than a delete and re-insert.

## Alternative data structures

+As always, you should run benchmarks with your expected scenarios to determine
+which is right for your application.
+
Depending on what you're doing, maintaining a sorted `Array` using
-`#bsearch_index` and `#insert` might be faster! Although it is technically
-O(n) for insertions, the implementations for `memcpy` or `memmove` can be *very*
-fast on modern architectures. Also, it can be faster O(n) on average, if
-insertions are usually near the end of the array. You should run benchmarks
-with your expected scenarios to determine which is right.
+`#bsearch_index` and `#insert` might be just fine! As discussed above, although
+it is `O(n)` for insertions, `memcpy` is so fast on modern hardware that this
+may not matter. Also, if you can arrange for insertions to occur near the end
+of the array, that can reduce the `memcpy` overhead even further.
+
+More complex heap variants, e.g. the [Fibonacci heap], can allow heaps to be
+merged as well as lower amortized time.
+
+[Fibonacci heap]: https://en.wikipedia.org/wiki/Fibonacci_heap
+
If it is important to be able to quickly enumerate the set or find the ranking
-of values in it, then you probably want to use a self-balancing binary search
-tree (e.g. a red-black tree) or a skip-list.
+of values in it, then you may want to use a self-balancing binary search tree
+(e.g. a [red-black tree]) or a [skip-list].
-A Hashed Timing Wheel or Heirarchical Timing Wheels (or some variant in that
-family of data structures) can be constructed to have effectively O(1) running
-time in most cases. However, the implementation for that data structure is more
-complex than a heap. If a 4-ary heap is good enough for go's timers, it should
-be suitable for many use cases.

+[red-black tree]: https://en.wikipedia.org/wiki/Red%E2%80%93black_tree
+[skip-list]: https://en.wikipedia.org/wiki/Skip_list
+
+[Hashed and Hierarchical Timing Wheels][timing wheels] (or some variant in that
+family of data structures) can be constructed to have effectively `O(1)` running
+time in most cases. Although the implementation for that data structure is more
+complex than a heap, it may be necessary for enormous values of N.
+
+[timing wheels]: http://www.cs.columbia.edu/~nahum/w6998/papers/ton97-timing-wheels.pdf
+
+## TODOs...
+
+_TODO:_ Also ~~included is~~ _will include_ `DHeap::Set`, which augments the
+basic heap with an internal `Hash` that maps a set of values to scores;
+loosely inspired by go's timers.
+
+_TODO:_ Also ~~included is~~ _will include_ `DHeap::Lazy`, which contains some
+features that are loosely inspired by go's timers. e.g.: It lazily sifts its
+heap after deletion and adjustments, to achieve faster average runtime for
+*add* and *cancel* operations.
+
+Additionally, I was inspired by reading go's "timer.go" implementation to
+experiment with a 4-ary heap instead of the traditional binary heap. In the
+case of timers, new timers are usually scheduled to run after most of the
+existing timers. And timers are usually canceled before they have a chance to
+run. While a binary heap holds 50% of its elements in its last layer, 75% of a
+4-ary heap's elements have no children. That diminishes the extra comparison
+overhead during sift-down.

## Development

After checking out the repo, run `bin/setup` to install dependencies. Then, run
`rake spec` to run the tests. You can also run `bin/console` for an interactive