README.md in d_heap-0.6.1 vs README.md in d_heap-0.7.0

- old
+ new

@@ -5,84 +5,80 @@

[![Maintainability](https://api.codeclimate.com/v1/badges/ff274acd0683c99c03e1/maintainability)](https://codeclimate.com/github/nevans/d_heap/maintainability)

A fast [_d_-ary heap][d-ary heap] [priority queue] implementation for ruby,
implemented as a C extension.

+A regular queue has "FIFO" behavior: first in, first out. A stack is "LIFO":
+last in, first out. A priority queue pushes each element with a score and pops
+them out in order by score. Priority queues are often used in algorithms for
+e.g. [scheduling] of timers or bandwidth management, for [Huffman coding], and
+for various graph search algorithms such as [Dijkstra's algorithm],
+[A* search], or [Prim's algorithm].
+
From [wikipedia](https://en.wikipedia.org/wiki/Heap_(data_structure)):

> A heap is a specialized tree-based data structure which is essentially an
> almost complete tree that satisfies the heap property: in a min heap, for any
> given node C, if P is a parent node of C, then the key (the value) of P is
> less than or equal to the key of C. The node at the "top" of the heap (with
> no parents) is called the root node.

![tree representation of a min heap](images/wikipedia-min-heap.png)

-With a regular queue, you expect "FIFO" behavior: first in, first out. With a
-stack you expect "LIFO": last in first out. A priority queue has a score for
-each element and elements are popped in order by score. Priority queues are
-often used in algorithms for e.g. [scheduling] of timers or bandwidth
-management, for [Huffman coding], and various graph search algorithms such as
-[Dijkstra's algorithm], [A* search], or [Prim's algorithm].

+The _d_-ary heap data structure is a generalization of a [binary heap] in
+which each node has _d_ children instead of 2. This speeds up "push" or
+"decrease priority" operations (`O(log n / log d)`), with the tradeoff of a
+slower "pop" or "increase priority" (`O(d log n / log d)`). Additionally,
+_d_-ary heaps can have better memory cache behavior than binary heaps,
+letting them run more quickly in practice.

-The _d_-ary heap data structure is a generalization of the [binary heap], in
-which the nodes have _d_ children instead of 2. This allows for "insert" and
-"decrease priority" operations to be performed more quickly with the tradeoff
-of slower delete minimum or "increase priority". Additionally, _d_-ary heaps
-can have better memory cache behavior than binary heaps, allowing them to run
-more quickly in practice despite slower worst-case time complexity. In the
-worst case, a _d_-ary heap requires only `O(log n / log d)` operations to
-push, with the tradeoff that pop requires `O(d log n / log d)`.

+Although the default _d_ value will usually perform best (see the time
+complexity analysis below), it's always advisable to benchmark your specific
+use-case. In particular, if you push items more than you pop, higher values
+for _d_ can give a faster total runtime.

-Although you should probably just use the default _d_ value of `4` (see the
-analysis below), it's always advisable to benchmark your specific use-case.
-In particular, if you push items more than you pop, higher values for _d_ can
-give a faster total runtime.
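To make the generalization concrete: because a _d_-ary heap is a complete
tree, it can live in a flat array, and the parent/child relationships are
plain index arithmetic. The following is illustration only (the conventional
layout, not code from the gem):

```ruby
# Illustration only: index arithmetic for a d-ary heap stored in a flat
# array, which is why a larger d makes the tree shallower.
d = 4
parent = ->(i)    { (i - 1) / d }    # integer division
child  = ->(i, k) { d * i + k + 1 }  # k-th child of i, for k in 0...d

parent.call(7)    # => 1
child.call(1, 2)  # => 7

# layers ≈ log n / log d: one million items is ~10 layers with d=4, ~20 with d=2
(Math.log(1_000_000) / Math.log(4)).ceil  # => 10
(Math.log(1_000_000) / Math.log(2)).ceil  # => 20
```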
[d-ary heap]: https://en.wikipedia.org/wiki/D-ary_heap
[priority queue]: https://en.wikipedia.org/wiki/Priority_queue
[binary heap]: https://en.wikipedia.org/wiki/Binary_heap
[scheduling]: https://en.wikipedia.org/wiki/Scheduling_(computing)
[Huffman coding]: https://en.wikipedia.org/wiki/Huffman_coding#Compression
[Dijkstra's algorithm]: https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm#Using_a_priority_queue
[A* search]: https://en.wikipedia.org/wiki/A*_search_algorithm#Description
[Prim's algorithm]: https://en.wikipedia.org/wiki/Prim%27s_algorithm

+## Installation
+
+Add this line to your application's Gemfile:
+
+```ruby
+gem 'd_heap'
+```
+
+And then execute:
+
+    $ bundle install
+
+Or install it yourself as:
+
+    $ gem install d_heap
+
## Usage

-The basic API is `#push(object, score)` and `#pop`. Please read the
-[gem documentation] for more details and other methods.
+The basic API is `#push(object, score)` and `#pop`. Please read the [full
+documentation] for more details. The score must be convertible to a `Float`
+via `Float(score)` (i.e. it should properly implement `#to_f`).

-Quick reference for some common methods:
+Quick reference for the most common methods:

-* `heap << object` adds a value, with `Float(object)` as its score.
+* `heap << object` adds a value, using `Float(object)` as its intrinsic score.
* `heap.push(object, score)` adds a value with an extrinsic score.
-* `heap.pop` removes and returns the value with the minimum score.
-* `heap.pop_lte(max_score)` pops only if the next score is `<=` the argument.
* `heap.peek` to view the minimum value without popping it.
+* `heap.pop` removes and returns the value with the minimum score.
+* `heap.pop_below(max_score)` pops only if the next score is `<` the argument.
* `heap.clear` to remove all items from the heap.
* `heap.empty?` returns true if the heap is empty.
* `heap.size` returns the number of items in the heap.

-If the score changes while the object is still in the heap, it will not be
-re-evaluated again.
-
-The score must either be `Integer` or `Float` or convertable to a `Float` via
-`Float(score)` (i.e. it should implement `#to_f`). Constraining scores to
-numeric values gives more than 50% speedup under some benchmarks! _n.b._
-`Integer` _scores must have an absolute value that fits into_ `unsigned long
-long`. This is compiler and architecture dependant but with gcc on an IA-64
-system it's 64 bits, which gives a range of -18,446,744,073,709,551,615 to
-+18,446,744,073,709,551,615, which is more than enough to store e.g. POSIX
-time in nanoseconds.
-
-_Comparing arbitary objects via_ `a <=> b` _was the original design and may be
-added back in a future version,_ if (and only if) _it can be done without
-impacting the speed of numeric comparisons. The speedup from this constraint
-is huge!_
-
-[gem documentation]: https://rubydoc.info/gems/d_heap/DHeap

### Examples

```ruby
# create some example objects to place in our heap
Task = Struct.new(:id, :time) do

@@ -126,255 +122,276 @@

# popping from an empty heap returns nil
heap.pop # => nil
```

-Please see the [gem documentation] for more methods and more examples.
+Please see the [full documentation] for more methods and more examples.

-## Installation
+[full documentation]: https://rubydoc.info/gems/d_heap/DHeap

-Add this line to your application's Gemfile:
+### DHeap::Map

-```ruby
-gem 'd_heap'
-```
+
+`DHeap::Map` augments the heap with an internal `Hash`, mapping objects to
+their index in the heap. For simple push/pop this is a bit slower than a
+normal `DHeap` heap, but it can enable huge speed-ups for algorithms that
+need to adjust scores after they've been added, e.g. [Dijkstra's algorithm].
+It adds the following:

-And then execute:
+* a uniqueness constraint, by `#hash` value
+* `#[obj] # => score` or `#score(obj)` in `O(1)`
+* `#[obj] = new_score` or `#rescore(obj, score)` in `O(d log n / log d)`
+* TODO:
+  * optionally unique by object identity
+  * `#delete(obj)` in `O(d log n / log d)` (TODO)
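For illustration, the methods listed above might be used like this (a hedged
sketch based only on that list; the exact return values are assumptions):

```ruby
require "d_heap"

heap = DHeap::Map.new
heap.push(:a, 3.0)
heap.push(:b, 1.0)

heap[:a]        # => 3.0, score lookup in O(1)
heap[:a] = 0.5  # rescore :a in O(d log n / log d)
heap.pop        # => :a, since it now has the minimum score
```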
-    $ bundle install
+## Scores

-Or install it yourself as:
+If a score changes while the object is still in the heap, it will not be
+re-evaluated again.

-    $ gem install d_heap
+Constraining scores to `Float` gives enormous performance benefits. _n.b._
+very large `Integer` values will lose precision when converted to `Float`.
+This is compiler and architecture dependent, but with gcc on an IA-64 system
+`Float` is 64 bits with a 53-bit mantissa, which gives a range of
+-9,007,199,254,740,991 to +9,007,199,254,740,991: _not_ enough to store the
+precise POSIX time since the epoch in nanoseconds. This can be worked around
+by adding a bias, but it's probably good enough for most usage.

-## Motivation
+_Comparing arbitrary objects via_ `a <=> b` _was the original design and may
+be added back in a future version,_ if (and only if) _it can be done without
+impacting the speed of numeric comparisons._

-One naive approach to a priority queue is to maintain an array in sorted
-order. This can be very simply implemented in ruby with
-`Array#bseach_index` + `Array#insert`. This can be very fast—`Array#pop` is
-`O(1)`—but the worst-case for insert is `O(n)` because it may need to
-`memcpy` a significant portion of the array.
+## Thread safety

-The standard way to implement a priority queue is with a binary heap.
-Although this increases the time complexity for `pop` alone, it reduces the
-combined time compexity for the combined `push` + `pop`. Using a d-ary heap
-with d > 2 makes the tree shorter but broader, which reduces to
-`O(log n / log d)` while increasing the comparisons needed by sift-down to
-`O(d log n / log d)`.
+`DHeap` is _not_ thread-safe, so concurrent access from multiple threads
+needs to take precautions such as locking access behind a mutex.
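For example, one way to guard a shared heap is to serialize every call
through a `Mutex` (a minimal sketch; `DHeap` itself provides no locking, and
the default no-argument constructor is assumed):

```ruby
require "d_heap"

# A minimal thread-safety wrapper: every operation takes the lock.
class LockedHeap
  def initialize
    @heap = DHeap.new
    @lock = Mutex.new
  end

  def push(value, score)
    @lock.synchronize { @heap.push(value, score) }
  end

  def pop
    @lock.synchronize { @heap.pop }
  end
end
```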
-However, I was disappointed when my best ruby heap implementation ran much
-more slowly than the naive approach—even for heaps containing ten thousand
-items. Although it _is_ `O(n)`, `memcpy` is _very_ fast, while calling `<=>`
-from ruby has _much_ higher overhead. And a _d_-heap needs `d + 1` times
-more comparisons for each push + pop than `bsearch` + `insert`.
-
-Additionally, when researching how other systems handle their scheduling, I
-was inspired by reading go's "timer.go" implementation to experiment with a
-4-ary heap instead of the traditional binary heap.

## Benchmarks

-_See `bin/benchmarks` and `docs/benchmarks.txt`, as well as `bin/profile` and
-`docs/profile.txt` for much more detail or updated results. These benchmarks
-were measured with v0.5.0 and ruby 2.7.2 without MJIT enabled._
+_See the full benchmark output in subdirs of `benchmarks` for more detail or
+updated results. These benchmarks were measured with an Intel Core i7-1065G7
+8x3.9GHz with d_heap v0.5.0 and ruby 2.7.2 without MJIT enabled._

-These benchmarks use very simple implementations for a pure-ruby heap and an
-array that is kept sorted using `Array#bsearch_index` and `Array#insert`. For
-comparison, I also compare to the [priority_queue_cxx gem] which uses the
-[C++ STL priority_queue], and another naive implementation that uses
-`Array#min` and `Array#delete_at` with an unsorted array.
+### Implementations

-In these benchmarks, `DHeap` runs faster than all other implementations for
-every scenario and every value of N, although the difference is usually more
-noticable at higher values of N. The pure ruby heap implementation is
-competitive for `push` alone at every value of N, but is significantly slower
-than bsearch + insert for push + pop, until N is _very_ large (somewhere
-between 10k and 100k)!
+* **findmin** -
+  A very fast `O(1)` push using `Array#push` onto an unsorted Array, but a
+  very slow `O(n)` pop using `Array#min`, `Array#rindex(min)` and
+  `Array#delete_at(min_index)`. Push + pop is still fast for `n < 100`, but
+  unusably slow for `n > 1000`.

-[priority_queue_cxx gem]: https://rubygems.org/gems/priority_queue_cxx
-[C++ STL priority_queue]: http://www.cplusplus.com/reference/queue/priority_queue/
+* **bsearch** -
+  A simple implementation with a slow `O(n)` push using `Array#bsearch` +
+  `Array#insert` to maintain a sorted Array, but a very fast `O(1)` pop with
+  `Array#pop`. It is still relatively fast for `n < 10000`, but its linear
+  time complexity really destroys it after that.

-Three different scenarios are measured:
+* **rb_heap** -
+  A pure ruby binary min-heap that has been tuned for performance by making
+  few method calls and allocating and assigning as few variables as
+  possible. It runs in `O(log n)` for both push and pop, although pop is
+  slower than push by a constant factor. Its much higher constant factors
+  make it lose to `bsearch` push + pop for `n < 10000`, but it holds steady
+  with very little slowdown even for `n > 10000000`.

-### push N items onto an empty heap
+* **c++ stl** -
+  A thin wrapper around the [priority_queue_cxx gem] which uses the [C++ STL
+  priority_queue]. The wrapper is simply to provide compatibility with the
+  other benchmarked implementations, but it should be possible to speed this
+  up a little bit by benchmarking the `priority_queue_cxx` API directly. It
+  has the same time complexity as rb_heap but its much lower constant
+  factors allow it to easily outperform `bsearch`.

-...but never pop (clearing between each set of pushes).
+* **c_dheap** -
+  A {DHeap} instance with the default `d` value of `4`. It has the same time
+  complexity as `rb_heap` and `c++ stl`, but is faster than both in every
+  benchmarked scenario.

-![bar graph for push_n_pop_n benchmarks](./images/push_n.png)
+[priority_queue_cxx gem]: https://rubygems.org/gems/priority_queue_cxx
+[C++ STL priority_queue]: http://www.cplusplus.com/reference/queue/priority_queue/
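For reference, the `bsearch` strategy above fits in a few lines of ruby (a
sketch, not the benchmark's exact code): keep the array sorted in
_descending_ score order so that `Array#pop` removes the minimum.

```ruby
# Sketch of the "bsearch" strategy: O(n) push, O(1) pop.
class BsearchQueue
  def initialize
    @items = [] # [score, value] pairs, sorted by descending score
  end

  def push(value, score)
    i = @items.bsearch_index { |(s, _)| s < score } || @items.size
    @items.insert(i, [score, value])
    self
  end

  def pop
    pair = @items.pop # minimum score is at the end
    pair && pair.last
  end
end
```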
-### push N items onto an empty heap then pop all N
+### Scenarios

-Although this could be used for heap sort, we're unlikely to choose heap sort
-over Ruby's quick sort implementation. I'm using this scenario to represent
-the amortized cost of creating a heap and (eventually) draining it.
+Each benchmark increases N exponentially, either by √10 or an approximation
+of it (alternating between x3 and x3.333), in order to simplify keeping loop
+counts evenly divisible by N.

-![bar graph for push_n_pop_n benchmarks](./images/push_n_pop_n.png)
+#### push N items

-### push and pop on a heap with N values
+This measures the _average time per insert_ to create a queue of size N
+(clearing the queue once it reaches that size). Use cases which push (or
+decrease) more values than they pop, e.g. [Dijkstra's algorithm] or [Prim's
+algorithm] when the graph has more edges than vertices, may want to pay more
+attention to this benchmark.

-Repeatedly push and pop while keeping a stable heap size. This is a _very
-simplistic_ approximation for how most scheduler/timer heaps might be used.
-Usually when a timer fires it will be quickly replaced by a new timer, and
-the overall count of timers will remain roughly stable.
+![bar graph for push_n benchmarks](./images/push_n.png)

-![bar graph for push_pop benchmarks](./images/push_pop.png)

+    == push N (N=100) ==========================================================
+    push N (c_dheap):  10522662.6 i/s
+    push N (findmin):   9980622.3 i/s - 1.05x slower
+    push N (c++ stl):   7991608.3 i/s - 1.32x slower
+    push N (rb_heap):   4607849.4 i/s - 2.28x slower
+    push N (bsearch):   2769106.2 i/s - 3.80x slower
+    == push N (N=10,000) =======================================================
+    push N (c_dheap):  10444588.3 i/s
+    push N (findmin):  10191797.4 i/s - 1.02x slower
+    push N (c++ stl):   8210895.4 i/s - 1.27x slower
+    push N (rb_heap):   4369252.9 i/s - 2.39x slower
+    push N (bsearch):   1213580.4 i/s - 8.61x slower
+    == push N (N=1,000,000) ====================================================
+    push N (c_dheap):  10342183.7 i/s
+    push N (findmin):   9963898.8 i/s - 1.04x slower
+    push N (c++ stl):   7891924.8 i/s - 1.31x slower
+    push N (rb_heap):   4350116.0 i/s - 2.38x slower

-### numbers
+All three heap implementations have little to no perceptible slowdown for
+`N > 100`. But `DHeap` runs faster than `Array#push` to an unsorted array
+(findmin)!

-Even for very small N values the benchmark implementations, `DHeap` runs
-faster than the other implementations for each scenario, although the
-difference is still relatively small. The pure ruby binary heap is 2x or more
-slower than bsearch + insert for common push/pop scenario.
+#### push then pop N items

-    == push N (N=5) ==========================================================
-    push N (c_dheap):   1969700.7 i/s
-    push N (c++ stl):   1049738.1 i/s - 1.88x slower
-    push N (rb_heap):    928435.2 i/s - 2.12x slower
-    push N (bsearch):    921060.0 i/s - 2.14x slower
+This measures the _average_ for a push **or** a pop, filling up a queue with
+N items and then draining that queue until empty. It represents the
+amortized cost of balanced pushes and pops to fill a heap and drain it.
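As a rough illustration of how such a scenario can be measured (a simplified
sketch with the `benchmark-ips` gem, not the repo's actual `bin/benchmarks`
harness, which also normalizes results to per-operation rates):

```ruby
require "benchmark/ips"
require "d_heap"

N = 10_000
Benchmark.ips do |x|
  x.report("push N then pop N (c_dheap)") do
    heap = DHeap.new  # assumes the default constructor (d=4)
    N.times { |i| heap.push(i, rand) }
    heap.pop until heap.empty?
  end
end
```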
-    == push N then pop N (N=5) ===============================================
-    push N + pop N (c_dheap):   1375805.0 i/s
-    push N + pop N (c++ stl):   1134997.5 i/s - 1.21x slower
-    push N + pop N (findmin):    862913.1 i/s - 1.59x slower
-    push N + pop N (bsearch):    762887.1 i/s - 1.80x slower
-    push N + pop N (rb_heap):    506890.4 i/s - 2.71x slower
+![bar graph for push_n_pop_n benchmarks](./images/push_n_pop_n.png)

-    == Push/pop with pre-filled queue of size=N (N=5) ========================
-    push + pop (c_dheap):  9044435.5 i/s
-    push + pop (c++ stl):  7534583.4 i/s - 1.20x slower
-    push + pop (findmin):  5026155.1 i/s - 1.80x slower
-    push + pop (bsearch):  4300260.0 i/s - 2.10x slower
-    push + pop (rb_heap):  2299499.7 i/s - 3.93x slower
+    == push N then pop N (N=100) ===============================================
+    push N + pop N (c_dheap): 10954469.2 i/s
+    push N + pop N (c++ stl):  9317140.2 i/s - 1.18x slower
+    push N + pop N (bsearch):  4808770.2 i/s - 2.28x slower
+    push N + pop N (findmin):  4321411.9 i/s - 2.53x slower
+    push N + pop N (rb_heap):  2467417.0 i/s - 4.44x slower
+    == push N then pop N (N=10,000) ============================================
+    push N + pop N (c_dheap):  8083962.7 i/s
+    push N + pop N (c++ stl):  7365661.8 i/s - 1.10x slower
+    push N + pop N (bsearch):  2257047.9 i/s - 3.58x slower
+    push N + pop N (rb_heap):  1439204.3 i/s - 5.62x slower
+    == push N then pop N (N=1,000,000) =========================================
+    push N + pop N (c++ stl):  5274657.5 i/s
+    push N + pop N (c_dheap):  4731117.9 i/s - 1.11x slower
+    push N + pop N (rb_heap):   976688.6 i/s - 5.40x slower

-By N=21, `DHeap` has pulled significantly ahead of bsearch + insert for all
-scenarios, but the pure ruby heap is still slower than every other
-implementation—even resorting the array after every `#push`—in any scenario
-that uses `#pop`.
+At N=100, findmin still beats a pure-ruby heap, but above that it slows down
+too much to be useful. At N=10k, bsearch still beats the pure ruby heap, but
+above 30k it also becomes too slow to be useful. `DHeap` consistently runs
+4.5-5.5x faster than the pure ruby heap.

-    == push N (N=21) =========================================================
-    push N (c_dheap):    464231.4 i/s
-    push N (c++ stl):    305546.7 i/s - 1.52x slower
-    push N (rb_heap):    202803.7 i/s - 2.29x slower
-    push N (bsearch):    168678.7 i/s - 2.75x slower
+#### push & pop on N-item heap

-    == push N then pop N (N=21) ==============================================
-    push N + pop N (c_dheap):    298350.3 i/s
-    push N + pop N (c++ stl):    252227.1 i/s - 1.18x slower
-    push N + pop N (findmin):    161998.7 i/s - 1.84x slower
-    push N + pop N (bsearch):    143432.3 i/s - 2.08x slower
-    push N + pop N (rb_heap):     79622.1 i/s - 3.75x slower
+This measures the combined time to push once and pop once, which is done
+repeatedly while keeping a stable heap size of N. It's an approximation for
+scenarios which reach a stable size and then plateau with balanced pushes
+and pops. E.g. timers and timeouts will often reschedule themselves or
+replace themselves with new timers or timeouts, maintaining a roughly stable
+total count of timers.
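The timer pattern described above might look roughly like this, using the
documented `#pop_below` (a sketch; the epoch-seconds scores, the re-arm
interval, and the default constructor are assumptions):

```ruby
require "d_heap"

# Each fired timer re-arms itself, so pushes and pops stay balanced and
# the heap size stays roughly stable.
timers = DHeap.new
100.times { |id| timers.push(id, Time.now.to_f + rand) }

10.times do
  sleep 0.1
  now = Time.now.to_f
  while (id = timers.pop_below(now)) # pops only entries scored < now
    # ... fire the timer's callback here ...
    timers.push(id, now + rand)      # re-arm with a new deadline
  end
end
```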
-    == Push/pop with pre-filled queue of size=N (N=21) =======================
-    push + pop (c_dheap):  8855093.4 i/s
-    push + pop (c++ stl):  7223079.5 i/s - 1.23x slower
-    push + pop (findmin):  4542913.7 i/s - 1.95x slower
-    push + pop (bsearch):  3461802.4 i/s - 2.56x slower
-    push + pop (rb_heap):  1845488.7 i/s - 4.80x slower
+![bar graph for push_pop benchmarks](./images/push_pop.png)

-At higher values of N, a heaps logarithmic growth leads to only a little
-slowdown of `#push`, while insert's linear growth causes it to run noticably
-slower and slower. But because `#pop` is `O(1)` for a sorted array and
-`O(d log n / log d)` for a heap, scenarios involving both `#push` and `#pop`
-remain relatively close, and bsearch + insert still runs faster than a pure
-ruby heap, even up to queues with 10k items. But as queue size increases
-beyond than that, the linear time compexity to keep a sorted array dominates.
+    push + pop (findmin)
+       N 10:       5480288.0 i/s
+       N 100:      2595178.8 i/s - 2.11x slower
+       N 1000:      224813.9 i/s - 24.38x slower
+       N 10000:      12630.7 i/s - 433.89x slower
+       N 100000:      1097.3 i/s - 4994.31x slower
+       N 1000000:      135.9 i/s - 40313.05x slower
+       N 10000000:      12.9 i/s - 425838.01x slower

-    == push + pop (rb_heap)
-    queue size =    10000:   736618.2 i/s
-    queue size =    25000:   670186.8 i/s - 1.10x slower
-    queue size =    50000:   618156.7 i/s - 1.19x slower
-    queue size =   100000:   579250.7 i/s - 1.27x slower
-    queue size =   250000:   572795.0 i/s - 1.29x slower
-    queue size =   500000:   543648.3 i/s - 1.35x slower
-    queue size =  1000000:   513523.4 i/s - 1.43x slower
-    queue size =  2500000:   460848.9 i/s - 1.60x slower
-    queue size =  5000000:   445234.5 i/s - 1.65x slower
-    queue size = 10000000:   423119.0 i/s - 1.74x slower
+    push + pop (bsearch)
+       N 10:       3931408.4 i/s
+       N 100:      2904181.8 i/s - 1.35x slower
+       N 1000:     2203157.1 i/s - 1.78x slower
+       N 10000:    1209584.9 i/s - 3.25x slower
+       N 100000:     81121.4 i/s - 48.46x slower
+       N 1000000:     5356.0 i/s - 734.02x slower
+       N 10000000:     281.9 i/s - 13946.33x slower

-    == push + pop (bsearch)
-    queue size =    10000:   786334.2 i/s
-    queue size =    25000:   364963.8 i/s - 2.15x slower
-    queue size =    50000:   200520.6 i/s - 3.92x slower
-    queue size =   100000:    88607.0 i/s - 8.87x slower
-    queue size =   250000:    34530.5 i/s - 22.77x slower
-    queue size =   500000:    17965.4 i/s - 43.77x slower
-    queue size =  1000000:     5638.7 i/s - 139.45x slower
-    queue size =  2500000:     1302.0 i/s - 603.93x slower
-    queue size =  5000000:      592.0 i/s - 1328.25x slower
-    queue size = 10000000:      288.8 i/s - 2722.66x slower
+    push + pop (rb_heap)
+       N 10:       2325816.5 i/s
+       N 100:      1603540.3 i/s - 1.45x slower
+       N 1000:     1262515.2 i/s - 1.84x slower
+       N 10000:     950389.3 i/s - 2.45x slower
+       N 100000:    732548.8 i/s - 3.17x slower
+       N 1000000:   673577.8 i/s - 3.45x slower
+       N 10000000:  467512.3 i/s - 4.97x slower

-    == push + pop (c_dheap)
-    queue size =    10000:  7311366.6 i/s
-    queue size =    50000:  6737824.5 i/s - 1.09x slower
-    queue size =    25000:  6407340.6 i/s - 1.14x slower
-    queue size =   100000:  6254396.3 i/s - 1.17x slower
-    queue size =   250000:  5917684.5 i/s - 1.24x slower
-    queue size =   500000:  5126307.6 i/s - 1.43x slower
-    queue size =  1000000:  4403494.1 i/s - 1.66x slower
-    queue size =  2500000:  3304088.2 i/s - 2.21x slower
-    queue size =  5000000:  2664897.7 i/s - 2.74x slower
-    queue size = 10000000:  2137927.6 i/s - 3.42x slower
+    push + pop (c++ stl)
+       N 10:       7706818.6 i/s - 1.01x slower
+       N 100:      7393127.3 i/s - 1.05x slower
+       N 1000:     6898781.3 i/s - 1.13x slower
+       N 10000:    5731130.5 i/s - 1.36x slower
+       N 100000:   4842393.2 i/s - 1.60x slower
+       N 1000000:  4170936.4 i/s - 1.86x slower
+       N 10000000: 2737146.6 i/s - 2.84x slower

-## Analysis
+    push + pop (c_dheap)
+       N 10:      10196454.1 i/s
+       N 100:      9668679.8 i/s - 1.05x slower
+       N 1000:     9339557.0 i/s - 1.09x slower
+       N 10000:    8045103.0 i/s - 1.27x slower
+       N 100000:   7150276.7 i/s - 1.43x slower
+       N 1000000:  6490261.6 i/s - 1.57x slower
+       N 10000000: 3734856.5 i/s - 2.73x slower

-### Time complexity
+## Time complexity analysis

-There are two fundamental heap operations: sift-up (used by push) and
-sift-down (used by pop).
+There are two fundamental heap operations: sift-up (used by push or decrease
+score) and sift-down (used by pop, delete, or increase score). Each sift
+bubbles an item to its correct location in the tree.

-* A _d_-ary heap will have `log n / log d` layers, so both sift operations
-  can perform as many as `log n / log d` writes, when a member sifts the
-  entire length of the tree.
-* Sift-up makes one comparison per layer, so push runs in `O(log n / log d)`.
-* Sift-down makes d comparions per layer, so pop runs in `O(d log n / log d)`.
+* A _d_-ary heap has `log n / log d` layers, so either sift performs as many
+  as `log n / log d` writes, when a member sifts the entire length of the
+  tree.
+* Sift-up needs one comparison per layer: `O(log n / log d)`.
+* Sift-down needs d comparisons per layer: `O(d log n / log d)`.

-So, in the simplest case of running balanced push/pop while maintaining the
-same heap size, `(1 + d) log n / log d` comparisons are made. In the worst
-case, when every sift traverses every layer of the tree, `d=4` requires the
-fewest comparisons for combined insert and delete:
+So, in the case of a balanced push then pop, as many as
+`(1 + d) log n / log d` comparisons are made. Looking only at this
+worst-case combination, `d=4` requires the fewest comparisons for a combined
+push and pop:

-* (1 + 2) lg n / lg d ≈ 4.328085 lg n
-* (1 + 3) lg n / lg d ≈ 3.640957 lg n
-* (1 + 4) lg n / lg d ≈ 3.606738 lg n
-* (1 + 5) lg n / lg d ≈ 3.728010 lg n
-* (1 + 6) lg n / lg d ≈ 3.906774 lg n
-* (1 + 7) lg n / lg d ≈ 4.111187 lg n
-* (1 + 8) lg n / lg d ≈ 4.328085 lg n
-* (1 + 9) lg n / lg d ≈ 4.551196 lg n
-* (1 + 10) lg n / lg d ≈ 4.777239 lg n
+* `(1 + 2) log n / log d ≈ 4.328085 log n`
+* `(1 + 3) log n / log d ≈ 3.640957 log n`
+* `(1 + 4) log n / log d ≈ 3.606738 log n`
+* `(1 + 5) log n / log d ≈ 3.728010 log n`
+* `(1 + 6) log n / log d ≈ 3.906774 log n`
+* `(1 + 7) log n / log d ≈ 4.111187 log n`
+* `(1 + 8) log n / log d ≈ 4.328085 log n`
+* `(1 + 9) log n / log d ≈ 4.551196 log n`
+* `(1 + 10) log n / log d ≈ 4.777239 log n`
* etc...

See https://en.wikipedia.org/wiki/D-ary_heap#Analysis for deeper analysis.
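Since `log n / log d = ln n / ln d`, the factor in the table above is just
`(1 + d) / ln d`, which is easy to recompute:

```ruby
# Recompute the comparison-count factors from the table above.
(2..10).each do |d|
  printf("(1 + %2d) log n / log d ≈ %.6f log n\n", d, (1 + d) / Math.log(d))
end
# minimum at d=4: ≈ 3.606738, matching the table
```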
-### Space complexity
+However, what this simple count of comparisons misses is the extent to which
+modern compilers can optimize code (e.g. by unrolling the comparison loop to
+execute on registers) and, more importantly, how good modern processors are
+at pipelined speculative execution using branch prediction, etc. Benchmarks
+should be run on the _exact same_ hardware platform that production code
+will use, as the sift-down operation is especially sensitive to good
+pipelining.

-Space usage is linear, regardless of d. However higher d values may provide
-better cache locality. Because the heap is a complete binary tree, the
-elements can be stored in an array, without the need for tree or list
-pointers.
+## Comparison performance

-Ruby can compare Numeric values _much_ faster than other ruby objects, even
-if those objects simply delegate comparison to internal Numeric values. And
-it is often useful to use external scores for otherwise uncomparable values.
-So `DHeap` uses twice as many entries (one for score and one for value) as
-an array which only stores values.
+It is often useful to use external scores for otherwise uncomparable values.
+And casting an item or score (via `to_f`) can also be time consuming. So
+`DHeap` evaluates and stores scores at the time of insertion, and they will
+be compared directly without needing any further lookup.

-## Thread safety
+Numeric values can be compared _much_ faster than other ruby objects, even
+if those objects simply delegate comparison to internal Numeric values.
+Additionally, native C integers or floats can be compared _much_ faster than
+ruby `Numeric` objects. So scores are converted to Float and stored as
+`double`, which is 64 bits on an [LP64 64-bit system].

-`DHeap` is _not_ thread-safe, so concurrent access from multiple threads
-need to take precautions such as locking access behind a mutex.
+[LP64 64-bit system]: https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models

## Alternative data structures

As always, you should run benchmarks with your expected scenarios to
determine which is best for your application.

-Depending on your use-case, maintaining a sorted `Array` using
-`#bsearch_index` and `#insert` might be just fine! Even `min` plus `delete`
-with an unsorted array can be very fast on small queues. Although insertions
-run with `O(n)`, `memcpy` is so fast on modern hardware that your dataset
-might not be large enough for it to matter.
+Depending on your use-case, maintaining a sorted `Array` with
+`#bsearch_index` and `#insert` might be just fine! It only takes a couple of
+lines of code and is probably "Fast Enough".

-More complex heap varients, e.g. [Fibonacci heap], allow heaps to be split
+More complex heap variants, e.g. [Fibonacci heap], allow heaps to be split
and merged which gives some graph algorithms a lower amortized time
complexity. But in practice, _d_-ary heaps have much lower overhead and
often run faster.

[Fibonacci heap]: https://en.wikipedia.org/wiki/Fibonacci_heap

@@ -383,28 +400,63 @@

(e.g. a [red-black tree]) or a [skip-list].

[red-black tree]: https://en.wikipedia.org/wiki/Red%E2%80%93black_tree
[skip-list]: https://en.wikipedia.org/wiki/Skip_list

-[Hashed and Heirarchical Timing Wheels][timing wheels] (or some variant in
-that family of data structures) can be constructed to have effectively
-`O(1)` running time in most cases. Although the implementation for that data
-structure is more complex than a heap, it may be necessary for enormous
-values of N.
+[Hashed and Hierarchical Timing Wheels][timing wheel] (or some variant in
+the timing wheel family of data structures) can have effectively `O(1)`
+running time in most cases. Although the implementation for that data
+structure is more complex than a heap, it may be necessary for enormous
+values of N.

-[timing wheels]: http://www.cs.columbia.edu/~nahum/w6998/papers/ton97-timing-wheels.pdf
+[timing wheel]: http://www.cs.columbia.edu/~nahum/w6998/papers/ton97-timing-wheels.pdf

-## TODOs...
+## Supported platforms

-_TODO:_ Also ~~included is~~ _will include_ `DHeap::Map`, which augments the
-basic heap with an internal `Hash`, which maps objects to their position in
-the heap. This enforces a uniqueness constraint on items on the heap, and
-also allows items to be more efficiently deleted or adjusted.
-However maintaining the hash does lead to a small drop in normal `#push` and
-`#pop` performance.
+See the [CI workflow] for all supported platforms.

-_TODO:_ Also ~~included is~~ _will include_ `DHeap::Lazy`, which contains
-some features that are loosely inspired by go's timers. e.g: It lazily sifts
-its heap after deletion and adjustments, to achieve faster average runtime
-for *add* and *cancel* operations.
+[CI workflow]: https://github.com/nevans/d_heap/actions?query=workflow%3ACI

+`d_heap` may contain bugs on 32-bit systems. Currently, `d_heap` is only
+tested on 64-bit x86 CRuby 2.4-3.0 under Linux and Mac OS.

+## Caveats and TODOs (PRs welcome!)

+A `DHeap`'s internal array grows but never shrinks. At the very least, there
+should be a `#compact` or `#shrink` method, and compaction should also
+happen during `#freeze`. It might make sense to automatically shrink (to no
+more than 2x the current size) during GC's compact phase.

+Benchmark sift-down min-child comparisons using SSE, AVX2, and AVX512F. This
+might lead to a different default `d` value (maybe 16 or 24?).

+Shrink scores to 64 bits: either store a type flag with each entry (this
+could be used to support non-numeric scores) or require users to choose
+between `Integer` or `Float` at construction time. Reducing memory usage
+should also improve speed for very large heaps.

+Patches to support JRuby, rubinius, 32-bit systems, or any other platforms
+are welcome! JRuby and Truffle Ruby ought to be able to use [Java's
+PriorityQueue]? Other platforms could fall back on the (slower) pure ruby
+implementation used by the benchmarks.

+[Java's PriorityQueue]: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/PriorityQueue.html

+Allow a max-heap (or other configurations of the compare function). This can
+be very easily implemented by just reversing the scores.

+_Maybe_ allow non-numeric scores to be compared with `<=>`, _only_ if the
+simplicity and speed of the basic numeric use case can be preserved.

+Consider `DHeap::Monotonic`, which could rely on `#pop_below` for "current
+time" and move all values below that time onto an Array.

+Consider adding `DHeap::Lazy` or `DHeap.new(lazy: true)` which could contain
+some features that are loosely inspired by go's timers. Go lazily sifts its
+heap after deletion or adjustments, to achieve faster amortized runtime.
+There's no need to actually remove a deleted item from the heap, if you
+re-add it back before it's next evaluated. A similar trick is to store "far
+away" values in an internal `Hash`, assuming many will be deleted before
+they rise to the top. This could naturally evolve into a [timing wheel]
+variant.

## Development

After checking out the repo, run `bin/setup` to install dependencies. Then,
run `rake spec` to run the tests. You can also run `bin/console` for an
interactive prompt that will allow you to experiment.