README.md in d_heap-0.6.1 vs README.md in d_heap-0.7.0
- old
+ new
@@ -5,84 +5,80 @@
[![Maintainability](https://api.codeclimate.com/v1/badges/ff274acd0683c99c03e1/maintainability)](https://codeclimate.com/github/nevans/d_heap/maintainability)
A fast [_d_-ary heap][d-ary heap] [priority queue] implementation for ruby,
implemented as a C extension.
+A regular queue has "FIFO" behavior: first in, first out. A stack is "LIFO":
+last in, first out. A priority queue pushes each element with a score and pops
+them out in order by score. Priority queues are often used in algorithms, e.g.
+for [scheduling] of timers or bandwidth management, for [Huffman coding], and
+for various graph search algorithms such as [Dijkstra's algorithm],
+[A* search], or [Prim's algorithm].
+
From [wikipedia](https://en.wikipedia.org/wiki/Heap_(data_structure)):
> A heap is a specialized tree-based data structure which is essentially an
> almost complete tree that satisfies the heap property: in a min heap, for any
> given node C, if P is a parent node of C, then the key (the value) of P is
> less than or equal to the key of C. The node at the "top" of the heap (with no
> parents) is called the root node.
![tree representation of a min heap](images/wikipedia-min-heap.png)
-With a regular queue, you expect "FIFO" behavior: first in, first out. With a
-stack you expect "LIFO": last in first out. A priority queue has a score for
-each element and elements are popped in order by score. Priority queues are
-often used in algorithms for e.g. [scheduling] of timers or bandwidth
-management, for [Huffman coding], and various graph search algorithms such as
-[Dijkstra's algorithm], [A* search], or [Prim's algorithm].
+The _d_-ary heap data structure is a generalization of a [binary heap] in which
+each node has _d_ children instead of 2. This speeds up "push" or "decrease
+priority" operations (`O(log n / log d)`) with the tradeoff of slower "pop" or
+"increase priority" (`O(d log n / log d)`). Additionally, _d_-ary heaps can
+have better memory cache behavior than binary heaps, letting them run more
+quickly in practice.
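+
+For illustration, the parent/child relationships of a _d_-ary heap stored in a
+flat array are simple index arithmetic (a sketch only; the gem implements this
+in C):
+
+```ruby
+d = 4 # each node has four children
+parent_index      = ->(i) { (i - 1) / d }
+first_child_index = ->(i) { (i * d) + 1 }
+
+parent_index.call(9)      # => 2
+first_child_index.call(9) # => 37 (children of node 9 are nodes 37..40)
+```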
-The _d_-ary heap data structure is a generalization of the [binary heap], in
-which the nodes have _d_ children instead of 2. This allows for "insert" and
-"decrease priority" operations to be performed more quickly with the tradeoff of
-slower delete minimum or "increase priority". Additionally, _d_-ary heaps can
-have better memory cache behavior than binary heaps, allowing them to run more
-quickly in practice despite slower worst-case time complexity. In the worst
-case, a _d_-ary heap requires only `O(log n / log d)` operations to push, with
-the tradeoff that pop requires `O(d log n / log d)`.
+Although the default _d_ value will usually perform best (see the time
+complexity analysis below), it's always advisable to benchmark your specific
+use-case. In particular, if you push items more often than you pop, higher
+values for _d_ can give a faster total runtime.
-Although you should probably just use the default _d_ value of `4` (see the
-analysis below), it's always advisable to benchmark your specific use-case. In
-particular, if you push items more than you pop, higher values for _d_ can give
-a faster total runtime.
-
[d-ary heap]: https://en.wikipedia.org/wiki/D-ary_heap
[priority queue]: https://en.wikipedia.org/wiki/Priority_queue
[binary heap]: https://en.wikipedia.org/wiki/Binary_heap
[scheduling]: https://en.wikipedia.org/wiki/Scheduling_(computing)
[Huffman coding]: https://en.wikipedia.org/wiki/Huffman_coding#Compression
[Dijkstra's algorithm]: https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm#Using_a_priority_queue
[A* search]: https://en.wikipedia.org/wiki/A*_search_algorithm#Description
[Prim's algorithm]: https://en.wikipedia.org/wiki/Prim%27s_algorithm
+## Installation
+
+Add this line to your application's Gemfile:
+
+```ruby
+gem 'd_heap'
+```
+
+And then execute:
+
+ $ bundle install
+
+Or install it yourself as:
+
+ $ gem install d_heap
+
## Usage
-The basic API is `#push(object, score)` and `#pop`. Please read the
-[gem documentation] for more details and other methods.
+The basic API is `#push(object, score)` and `#pop`. Please read the [full
+documentation] for more details. The score must be convertible to a `Float` via
+`Float(score)` (i.e. it should properly implement `#to_f`).
-Quick reference for some common methods:
+Quick reference for the most common methods:
-* `heap << object` adds a value, with `Float(object)` as its score.
+* `heap << object` adds a value, using `Float(object)` as its intrinsic score.
* `heap.push(object, score)` adds a value with an extrinsic score.
-* `heap.pop` removes and returns the value with the minimum score.
-* `heap.pop_lte(max_score)` pops only if the next score is `<=` the argument.
* `heap.peek` to view the minimum value without popping it.
+* `heap.pop` removes and returns the value with the minimum score.
+* `heap.pop_below(max_score)` pops only if the next score is `<` the argument
+  (see the example below).
* `heap.clear` to remove all items from the heap.
* `heap.empty?` returns true if the heap is empty.
* `heap.size` returns the number of items in the heap.
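+
+For instance, `#pop_below` makes it easy to drain only the items that are
+already "due" (a small sketch using the methods above):
+
+```ruby
+require "d_heap"
+
+heap = DHeap.new
+heap.push("red",    1.0)
+heap.push("orange", 2.0)
+
+heap.pop_below(2.0) # => "red", because 1.0 < 2.0
+heap.pop_below(2.0) # => nil,  because 2.0 is not < 2.0
+heap.peek           # => "orange", still in the heap
+heap.size           # => 1
+```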
-If the score changes while the object is still in the heap, it will not be
-re-evaluated again.
-
-The score must either be `Integer` or `Float` or convertable to a `Float` via
-`Float(score)` (i.e. it should implement `#to_f`). Constraining scores to
-numeric values gives more than 50% speedup under some benchmarks! _n.b._
-`Integer` _scores must have an absolute value that fits into_ `unsigned long
-long`. This is compiler and architecture dependant but with gcc on an IA-64
-system it's 64 bits, which gives a range of -18,446,744,073,709,551,615 to
-+18,446,744,073,709,551,615, which is more than enough to store e.g. POSIX time
-in nanoseconds.
-
-_Comparing arbitary objects via_ `a <=> b` _was the original design and may be
-added back in a future version,_ if (and only if) _it can be done without
-impacting the speed of numeric comparisons. The speedup from this constraint is
-huge!_
-
-[gem documentation]: https://rubydoc.info/gems/d_heap/DHeap
-
### Examples
```ruby
# create some example objects to place in our heap
Task = Struct.new(:id, :time) do
@@ -126,255 +122,276 @@
# popping from an empty heap returns nil
heap.pop # => nil
```
-Please see the [gem documentation] for more methods and more examples.
+Please see the [full documentation] for more methods and more examples.
-## Installation
+[full documentation]: https://rubydoc.info/gems/d_heap/DHeap
-Add this line to your application's Gemfile:
+### DHeap::Map
-```ruby
-gem 'd_heap'
-```
+`DHeap::Map` augments the heap with an internal `Hash`, mapping objects to their
+index in the heap. For simple push/pop this is a bit slower than a plain
+`DHeap`, but it can enable huge speed-ups for algorithms that need to adjust
+scores after they've been added, e.g. [Dijkstra's algorithm] (see the sketch
+below). It adds the following:
-And then execute:
+* a uniqueness constraint, by `#hash` value
+* `#[obj] # => score` or `#score(obj)` in `O(1)`
+* `#[obj] = new_score` or `#rescore(obj, score)` in `O(d log n / log d)`
+* TODO:
+  * optionally unique by object identity
+  * `#delete(obj)` in `O(d log n / log d)`
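+
+For example, a hypothetical shortest-path relaxation step could use
+`DHeap::Map#[]=` to decrease a vertex's score in place, instead of pushing a
+duplicate entry. A minimal sketch, assuming `DHeap::Map.new`, `#push`, and
+`#pop` mirror the basic `DHeap` API:
+
+```ruby
+require "d_heap"
+
+queue = DHeap::Map.new
+queue.push(:a, 10.0)
+queue.push(:b, 20.0)
+
+queue[:b]       # => 20.0; score lookup is O(1)
+queue[:b] = 5.0 # rescore in O(d log n / log d); sifts :b toward the root
+queue.pop       # => :b
+```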
- $ bundle install
+## Scores
-Or install it yourself as:
+If a score changes while the object is still in the heap, it will not be
+re-evaluated.
- $ gem install d_heap
+Constraining scores to `Float` gives enormous performance benefits. N.b. very
+large `Integer` values will lose precision when converted to `Float`. This is
+compiler- and architecture-dependent, but with gcc on an IA-64 system, `Float`
+is 64 bits with a 53-bit mantissa, which gives a range of
+-9,007,199,254,740,991 to +9,007,199,254,740,991. That is _not_ enough to store
+the precise POSIX time since the epoch in nanoseconds, but it can be worked
+around by adding a bias, and it's precise enough for most usage.
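+
+A quick illustration of that precision limit (the bias value here is
+arbitrary):
+
+```ruby
+now_ns = 1_600_000_000_123_456_789 # POSIX time in nanoseconds (> 2**53)
+Float(now_ns).to_i == now_ns       # => false; precision was lost
+
+bias = 1_600_000_000_000_000_000
+Float(now_ns - bias).to_i == now_ns - bias # => true; fits in the mantissa
+```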
-## Motivation
+_Comparing arbitrary objects via_ `a <=> b` _was the original design and may be
+added back in a future version,_ if (and only if) _it can be done without
+impacting the speed of numeric comparisons._
-One naive approach to a priority queue is to maintain an array in sorted order.
-This can be very simply implemented in ruby with `Array#bseach_index` +
-`Array#insert`. This can be very fast—`Array#pop` is `O(1)`—but the worst-case
-for insert is `O(n)` because it may need to `memcpy` a significant portion of
-the array.
+## Thread safety
-The standard way to implement a priority queue is with a binary heap. Although
-this increases the time complexity for `pop` alone, it reduces the combined time
-compexity for the combined `push` + `pop`. Using a d-ary heap with d > 2
-makes the tree shorter but broader, which reduces to `O(log n / log d)` while
-increasing the comparisons needed by sift-down to `O(d log n/ log d)`.
+`DHeap` is _not_ thread-safe, so concurrent access from multiple threads needs
+to take precautions such as locking access behind a mutex.
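+
+For example, one simple approach (a sketch, not part of the gem) serializes
+all access through a `Mutex`:
+
+```ruby
+require "d_heap"
+
+# Wraps a DHeap so only one thread at a time can push or pop.
+class SynchronizedHeap
+  def initialize(heap = DHeap.new)
+    @heap  = heap
+    @mutex = Mutex.new
+  end
+
+  def push(obj, score)
+    @mutex.synchronize { @heap.push(obj, score) }
+  end
+
+  def pop
+    @mutex.synchronize { @heap.pop }
+  end
+end
+```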
-However, I was disappointed when my best ruby heap implementation ran much more
-slowly than the naive approach—even for heaps containing ten thousand items.
-Although it _is_ `O(n)`, `memcpy` is _very_ fast, while calling `<=>` from ruby
-has _much_ higher overhead. And a _d_-heap needs `d + 1` times more comparisons
-for each push + pop than `bsearch` + `insert`.
-
-Additionally, when researching how other systems handle their scheduling, I was
-inspired by reading go's "timer.go" implementation to experiment with a 4-ary
-heap instead of the traditional binary heap.
-
## Benchmarks
-_See `bin/benchmarks` and `docs/benchmarks.txt`, as well as `bin/profile` and
-`docs/profile.txt` for much more detail or updated results. These benchmarks
-were measured with v0.5.0 and ruby 2.7.2 without MJIT enabled._
+_See the subdirs of `benchmarks` for full output and updated results. These
+benchmarks were measured with an Intel Core i7-1065G7 (8x 3.9GHz) with d_heap
+v0.5.0 and ruby 2.7.2, without MJIT enabled._
-These benchmarks use very simple implementations for a pure-ruby heap and an
-array that is kept sorted using `Array#bsearch_index` and `Array#insert`. For
-comparison, I also compare to the [priority_queue_cxx gem] which uses the [C++
-STL priority_queue], and another naive implementation that uses `Array#min` and
-`Array#delete_at` with an unsorted array.
+### Implementations
-In these benchmarks, `DHeap` runs faster than all other implementations for
-every scenario and every value of N, although the difference is usually more
-noticable at higher values of N. The pure ruby heap implementation is
-competitive for `push` alone at every value of N, but is significantly slower
-than bsearch + insert for push + pop, until N is _very_ large (somewhere between
-10k and 100k)!
+ * **findmin** -
+ A very fast `O(1)` push using `Array#push` onto an unsorted Array, but a
+ very slow `O(n)` pop using `Array#min`, `Array#rindex(min)` and
+ `Array#delete_at(min_index)`. Push + pop is still fast for `n < 100`, but
+ unusably slow for `n > 1000`.
-[priority_queue_cxx gem]: https://rubygems.org/gems/priority_queue_cxx
-[C++ STL priority_queue]: http://www.cplusplus.com/reference/queue/priority_queue/
+ * **bsearch** -
+   A simple implementation with a slow `O(n)` push using `Array#bsearch_index`
+   + `Array#insert` to maintain a sorted Array, but a very fast `O(1)` pop with
+   `Array#pop`. It is still relatively fast for `n < 10000`, but its linear
+   time complexity really destroys it after that.
-Three different scenarios are measured:
+ * **rb_heap** -
+   A pure ruby binary min-heap that has been tuned for performance by making
+   few method calls and allocating and assigning as few variables as possible.
+   It runs in `O(log n)` for both push and pop, although pop is slower than
+   push by a constant factor. Its much higher constant factors make it lose
+   to `bsearch` push + pop for `n < 10000`, but it holds steady with very
+   little slowdown even at `n > 10000000`.
-### push N items onto an empty heap
+ * **c++ stl** -
+ A thin wrapper around the [priority_queue_cxx gem] which uses the [C++ STL
+ priority_queue]. The wrapper is simply to provide compatibility with the
+ other benchmarked implementations, but it should be possible to speed this
+ up a little bit by benchmarking the `priority_queue_cxx` API directly. It
+ has the same time complexity as rb_heap but its much lower constant
+ factors allow it to easily outperform `bsearch`.
-...but never pop (clearing between each set of pushes).
+ * **c_dheap** -
+ A {DHeap} instance with the default `d` value of `4`. It has the same time
+ complexity as `rb_heap` and `c++ stl`, but is faster than both in every
+ benchmarked scenario.
-![bar graph for push_n_pop_n benchmarks](./images/push_n.png)
+[priority_queue_cxx gem]: https://rubygems.org/gems/priority_queue_cxx
+[C++ STL priority_queue]: http://www.cplusplus.com/reference/queue/priority_queue/
-### push N items onto an empty heap then pop all N
+### Scenarios
-Although this could be used for heap sort, we're unlikely to choose heap sort
-over Ruby's quick sort implementation. I'm using this scenario to represent
-the amortized cost of creating a heap and (eventually) draining it.
+Each benchmark increases N exponentially, either by √10 or by approximating it
+(alternating between x3 and x3.333), in order to keep loop counts evenly
+divisible by N.
-![bar graph for push_n_pop_n benchmarks](./images/push_n_pop_n.png)
+#### push N items
-### push and pop on a heap with N values
+This measures the _average time per insert_ to create a queue of size N
+(clearing the queue once it reaches that size). Use cases which push (or
+decrease) more values than they pop, e.g. [Dijkstra's algorithm] or [Prim's
+algorithm] when the graph has more edges than vertices, may want to pay more
+attention to this benchmark.
-Repeatedly push and pop while keeping a stable heap size. This is a _very
-simplistic_ approximation for how most scheduler/timer heaps might be used.
-Usually when a timer fires it will be quickly replaced by a new timer, and the
-overall count of timers will remain roughly stable.
+![bar graph for push_n benchmarks](./images/push_n.png)
-![bar graph for push_pop benchmarks](./images/push_pop.png)
+ == push N (N=100) ==========================================================
+ push N (c_dheap): 10522662.6 i/s
+ push N (findmin): 9980622.3 i/s - 1.05x slower
+ push N (c++ stl): 7991608.3 i/s - 1.32x slower
+ push N (rb_heap): 4607849.4 i/s - 2.28x slower
+ push N (bsearch): 2769106.2 i/s - 3.80x slower
+ == push N (N=10,000) =======================================================
+ push N (c_dheap): 10444588.3 i/s
+ push N (findmin): 10191797.4 i/s - 1.02x slower
+ push N (c++ stl): 8210895.4 i/s - 1.27x slower
+ push N (rb_heap): 4369252.9 i/s - 2.39x slower
+ push N (bsearch): 1213580.4 i/s - 8.61x slower
+ == push N (N=1,000,000) ====================================================
+ push N (c_dheap): 10342183.7 i/s
+ push N (findmin): 9963898.8 i/s - 1.04x slower
+ push N (c++ stl): 7891924.8 i/s - 1.31x slower
+ push N (rb_heap): 4350116.0 i/s - 2.38x slower
-### numbers
+All three heap implementations show little to no perceptible slowdown for `N >
+100`. And `DHeap` runs faster than even `Array#push` onto an unsorted array
+(findmin)!
-Even for very small N values the benchmark implementations, `DHeap` runs faster
-than the other implementations for each scenario, although the difference is
-still relatively small. The pure ruby binary heap is 2x or more slower than
-bsearch + insert for common push/pop scenario.
+#### push then pop N items
- == push N (N=5) ==========================================================
- push N (c_dheap): 1969700.7 i/s
- push N (c++ stl): 1049738.1 i/s - 1.88x slower
- push N (rb_heap): 928435.2 i/s - 2.12x slower
- push N (bsearch): 921060.0 i/s - 2.14x slower
+This measures the _average time_ for a push **or** a pop, filling up a queue
+with N items and then draining that queue until empty. It represents the
+amortized cost of balanced pushes and pops as a heap is filled and drained.
- == push N then pop N (N=5) ===============================================
- push N + pop N (c_dheap): 1375805.0 i/s
- push N + pop N (c++ stl): 1134997.5 i/s - 1.21x slower
- push N + pop N (findmin): 862913.1 i/s - 1.59x slower
- push N + pop N (bsearch): 762887.1 i/s - 1.80x slower
- push N + pop N (rb_heap): 506890.4 i/s - 2.71x slower
+![bar graph for push_n_pop_n benchmarks](./images/push_n_pop_n.png)
- == Push/pop with pre-filled queue of size=N (N=5) ========================
- push + pop (c_dheap): 9044435.5 i/s
- push + pop (c++ stl): 7534583.4 i/s - 1.20x slower
- push + pop (findmin): 5026155.1 i/s - 1.80x slower
- push + pop (bsearch): 4300260.0 i/s - 2.10x slower
- push + pop (rb_heap): 2299499.7 i/s - 3.93x slower
+ == push N then pop N (N=100) ===============================================
+ push N + pop N (c_dheap): 10954469.2 i/s
+ push N + pop N (c++ stl): 9317140.2 i/s - 1.18x slower
+ push N + pop N (bsearch): 4808770.2 i/s - 2.28x slower
+ push N + pop N (findmin): 4321411.9 i/s - 2.53x slower
+ push N + pop N (rb_heap): 2467417.0 i/s - 4.44x slower
+ == push N then pop N (N=10,000) ============================================
+ push N + pop N (c_dheap): 8083962.7 i/s
+ push N + pop N (c++ stl): 7365661.8 i/s - 1.10x slower
+ push N + pop N (bsearch): 2257047.9 i/s - 3.58x slower
+ push N + pop N (rb_heap): 1439204.3 i/s - 5.62x slower
+ == push N then pop N (N=1,000,000) =========================================
+ push N + pop N (c++ stl): 5274657.5 i/s
+ push N + pop N (c_dheap): 4731117.9 i/s - 1.11x slower
+ push N + pop N (rb_heap): 976688.6 i/s - 5.40x slower
-By N=21, `DHeap` has pulled significantly ahead of bsearch + insert for all
-scenarios, but the pure ruby heap is still slower than every other
-implementation—even resorting the array after every `#push`—in any scenario that
-uses `#pop`.
+At N=100, findmin still beats a pure ruby heap, but above that it slows down
+too much to be useful. At N=10k, bsearch still beats the pure ruby heap, but
+above 30k it also becomes too slow to be useful. `DHeap` consistently runs
+4.5-5.5x faster than the pure ruby heap.
- == push N (N=21) =========================================================
- push N (c_dheap): 464231.4 i/s
- push N (c++ stl): 305546.7 i/s - 1.52x slower
- push N (rb_heap): 202803.7 i/s - 2.29x slower
- push N (bsearch): 168678.7 i/s - 2.75x slower
+#### push & pop on N-item heap
- == push N then pop N (N=21) ==============================================
- push N + pop N (c_dheap): 298350.3 i/s
- push N + pop N (c++ stl): 252227.1 i/s - 1.18x slower
- push N + pop N (findmin): 161998.7 i/s - 1.84x slower
- push N + pop N (bsearch): 143432.3 i/s - 2.08x slower
- push N + pop N (rb_heap): 79622.1 i/s - 3.75x slower
+This measures the combined time to push once and pop once, which is done
+repeatedly while keeping a stable heap size of N. It's an approximation for
+scenarios which reach a stable size and then plateau with balanced pushes and
+pops. E.g. timers and timeouts will often reschedule themselves or replace
+themselves with new timers or timeouts, maintaining a roughly stable total
+count of timers.
- == Push/pop with pre-filled queue of size=N (N=21) =======================
- push + pop (c_dheap): 8855093.4 i/s
- push + pop (c++ stl): 7223079.5 i/s - 1.23x slower
- push + pop (findmin): 4542913.7 i/s - 1.95x slower
- push + pop (bsearch): 3461802.4 i/s - 2.56x slower
- push + pop (rb_heap): 1845488.7 i/s - 4.80x slower
+![bar graph for push_pop benchmarks](./images/push_pop.png)
-At higher values of N, a heaps logarithmic growth leads to only a little
-slowdown of `#push`, while insert's linear growth causes it to run noticably
-slower and slower. But because `#pop` is `O(1)` for a sorted array and `O(d log
-n / log d)` for a heap, scenarios involving both `#push` and `#pop` remain
-relatively close, and bsearch + insert still runs faster than a pure ruby heap,
-even up to queues with 10k items. But as queue size increases beyond than that,
-the linear time compexity to keep a sorted array dominates.
+ push + pop (findmin)
+ N 10: 5480288.0 i/s
+ N 100: 2595178.8 i/s - 2.11x slower
+ N 1000: 224813.9 i/s - 24.38x slower
+ N 10000: 12630.7 i/s - 433.89x slower
+ N 100000: 1097.3 i/s - 4994.31x slower
+ N 1000000: 135.9 i/s - 40313.05x slower
+ N 10000000: 12.9 i/s - 425838.01x slower
- == push + pop (rb_heap)
- queue size = 10000: 736618.2 i/s
- queue size = 25000: 670186.8 i/s - 1.10x slower
- queue size = 50000: 618156.7 i/s - 1.19x slower
- queue size = 100000: 579250.7 i/s - 1.27x slower
- queue size = 250000: 572795.0 i/s - 1.29x slower
- queue size = 500000: 543648.3 i/s - 1.35x slower
- queue size = 1000000: 513523.4 i/s - 1.43x slower
- queue size = 2500000: 460848.9 i/s - 1.60x slower
- queue size = 5000000: 445234.5 i/s - 1.65x slower
- queue size = 10000000: 423119.0 i/s - 1.74x slower
+ push + pop (bsearch)
+ N 10: 3931408.4 i/s
+ N 100: 2904181.8 i/s - 1.35x slower
+ N 1000: 2203157.1 i/s - 1.78x slower
+ N 10000: 1209584.9 i/s - 3.25x slower
+ N 100000: 81121.4 i/s - 48.46x slower
+ N 1000000: 5356.0 i/s - 734.02x slower
+ N 10000000: 281.9 i/s - 13946.33x slower
- == push + pop (bsearch)
- queue size = 10000: 786334.2 i/s
- queue size = 25000: 364963.8 i/s - 2.15x slower
- queue size = 50000: 200520.6 i/s - 3.92x slower
- queue size = 100000: 88607.0 i/s - 8.87x slower
- queue size = 250000: 34530.5 i/s - 22.77x slower
- queue size = 500000: 17965.4 i/s - 43.77x slower
- queue size = 1000000: 5638.7 i/s - 139.45x slower
- queue size = 2500000: 1302.0 i/s - 603.93x slower
- queue size = 5000000: 592.0 i/s - 1328.25x slower
- queue size = 10000000: 288.8 i/s - 2722.66x slower
+ push + pop (rb_heap)
+ N 10: 2325816.5 i/s
+ N 100: 1603540.3 i/s - 1.45x slower
+ N 1000: 1262515.2 i/s - 1.84x slower
+ N 10000: 950389.3 i/s - 2.45x slower
+ N 100000: 732548.8 i/s - 3.17x slower
+ N 1000000: 673577.8 i/s - 3.45x slower
+ N 10000000: 467512.3 i/s - 4.97x slower
- == push + pop (c_dheap)
- queue size = 10000: 7311366.6 i/s
- queue size = 50000: 6737824.5 i/s - 1.09x slower
- queue size = 25000: 6407340.6 i/s - 1.14x slower
- queue size = 100000: 6254396.3 i/s - 1.17x slower
- queue size = 250000: 5917684.5 i/s - 1.24x slower
- queue size = 500000: 5126307.6 i/s - 1.43x slower
- queue size = 1000000: 4403494.1 i/s - 1.66x slower
- queue size = 2500000: 3304088.2 i/s - 2.21x slower
- queue size = 5000000: 2664897.7 i/s - 2.74x slower
- queue size = 10000000: 2137927.6 i/s - 3.42x slower
+ push + pop (c++ stl)
+ N 10: 7706818.6 i/s - 1.01x slower
+ N 100: 7393127.3 i/s - 1.05x slower
+ N 1000: 6898781.3 i/s - 1.13x slower
+ N 10000: 5731130.5 i/s - 1.36x slower
+ N 100000: 4842393.2 i/s - 1.60x slower
+ N 1000000: 4170936.4 i/s - 1.86x slower
+ N 10000000: 2737146.6 i/s - 2.84x slower
-## Analysis
+ push + pop (c_dheap)
+ N 10: 10196454.1 i/s
+ N 100: 9668679.8 i/s - 1.05x slower
+ N 1000: 9339557.0 i/s - 1.09x slower
+ N 10000: 8045103.0 i/s - 1.27x slower
+ N 100000: 7150276.7 i/s - 1.43x slower
+ N 1000000: 6490261.6 i/s - 1.57x slower
+ N 10000000: 3734856.5 i/s - 2.73x slower
-### Time complexity
+## Time complexity analysis
-There are two fundamental heap operations: sift-up (used by push) and sift-down
-(used by pop).
+There are two fundamental heap operations: sift-up (used by push or decrease
+score) and sift-down (used by pop or delete or increase score). Each sift
+bubbles an item to its correct location in the tree.
-* A _d_-ary heap will have `log n / log d` layers, so both sift operations can
- perform as many as `log n / log d` writes, when a member sifts the entire
- length of the tree.
-* Sift-up makes one comparison per layer, so push runs in `O(log n / log d)`.
-* Sift-down makes d comparions per layer, so pop runs in `O(d log n / log d)`.
+* A _d_-ary heap has `log n / log d` layers, so either sift performs as many as
+  `log n / log d` writes, when a member sifts the entire length of the tree.
+* Sift-up needs one comparison per layer: `O(log n / log d)`.
+* Sift-down needs d comparisons per layer: `O(d log n / log d)`.
-So, in the simplest case of running balanced push/pop while maintaining the same
-heap size, `(1 + d) log n / log d` comparisons are made. In the worst case,
-when every sift traverses every layer of the tree, `d=4` requires the fewest
-comparisons for combined insert and delete:
+So, in the case of a balanced push then pop, as many as `(1 + d) log n / log d`
+comparisons are made. Looking only at this worst-case combination, `d=4`
+requires the fewest comparisons for a combined push and pop:
-* (1 + 2) lg n / lg d ≈ 4.328085 lg n
-* (1 + 3) lg n / lg d ≈ 3.640957 lg n
-* (1 + 4) lg n / lg d ≈ 3.606738 lg n
-* (1 + 5) lg n / lg d ≈ 3.728010 lg n
-* (1 + 6) lg n / lg d ≈ 3.906774 lg n
-* (1 + 7) lg n / lg d ≈ 4.111187 lg n
-* (1 + 8) lg n / lg d ≈ 4.328085 lg n
-* (1 + 9) lg n / lg d ≈ 4.551196 lg n
-* (1 + 10) lg n / lg d ≈ 4.777239 lg n
+* `(1 + 2) log n / log d ≈ 4.328085 log n`
+* `(1 + 3) log n / log d ≈ 3.640957 log n`
+* `(1 + 4) log n / log d ≈ 3.606738 log n`
+* `(1 + 5) log n / log d ≈ 3.728010 log n`
+* `(1 + 6) log n / log d ≈ 3.906774 log n`
+* `(1 + 7) log n / log d ≈ 4.111187 log n`
+* `(1 + 8) log n / log d ≈ 4.328085 log n`
+* `(1 + 9) log n / log d ≈ 4.551196 log n`
+* `(1 + 10) log n / log d ≈ 4.777239 log n`
* etc...
See https://en.wikipedia.org/wiki/D-ary_heap#Analysis for deeper analysis.
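+
+These coefficients are easy to reproduce (with natural logarithms, the base
+the list above uses):
+
+```ruby
+# comparisons per push + pop ≈ ((1 + d) / log(d)) * log(n)
+(2..10).each do |d|
+  puts format("d = %2d: %.6f log n", d, (1 + d) / Math.log(d))
+end
+# d = 4 minimizes the coefficient, at ≈ 3.606738
+```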
-### Space complexity
+However, what this simple count of comparisons misses is the extent to which
+modern compilers can optimize code (e.g. by unrolling the comparison loop to
+execute on registers) and, more importantly, how good modern processors are at
+pipelined speculative execution using branch prediction, etc. Benchmarks should
+be run on the _exact same_ hardware platform that production code will use,
+as the sift-down operation is especially sensitive to good pipelining.
-Space usage is linear, regardless of d. However higher d values may
-provide better cache locality. Because the heap is a complete binary tree, the
-elements can be stored in an array, without the need for tree or list pointers.
+## Comparison performance
-Ruby can compare Numeric values _much_ faster than other ruby objects, even if
-those objects simply delegate comparison to internal Numeric values. And it is
-often useful to use external scores for otherwise uncomparable values. So
-`DHeap` uses twice as many entries (one for score and one for value)
-as an array which only stores values.
+It is often useful to use external scores for otherwise incomparable values.
+And casting an item or score (via `to_f`) can also be time-consuming. So
+`DHeap` evaluates and stores scores at the time of insertion, and they are
+compared directly without needing any further lookup.
-## Thread safety
+Numeric values can be compared _much_ faster than other ruby objects, even if
+those objects simply delegate comparison to internal Numeric values.
+Additionally, native C integers or floats can be compared _much_ faster than
+ruby `Numeric` objects. So scores are converted to Float and stored as
+`double`, which is 64 bits on an [LP64 64-bit system].
-`DHeap` is _not_ thread-safe, so concurrent access from multiple threads need to
-take precautions such as locking access behind a mutex.
+[LP64 64-bit system]: https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models
## Alternative data structures
As always, you should run benchmarks with your expected scenarios to determine
which is best for your application.
-Depending on your use-case, maintaining a sorted `Array` using `#bsearch_index`
-and `#insert` might be just fine! Even `min` plus `delete` with an unsorted
-array can be very fast on small queues. Although insertions run with `O(n)`,
-`memcpy` is so fast on modern hardware that your dataset might not be large
-enough for it to matter.
+Depending on your use-case, maintaining a sorted `Array` with `#bsearch_index`
+and `#insert` might be just fine! It only takes a couple of lines of code (see
+the sketch below) and is probably "Fast Enough".
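+
+A minimal sketch of that approach (pop is `O(1)` from the end of a
+descending-sorted array; push is `O(n)` but backed by a fast `memmove`):
+
+```ruby
+# Keeps [score, value] pairs sorted by descending score, so the minimum
+# is always at the end of the array.
+class SortedArrayQueue
+  def initialize
+    @array = []
+  end
+
+  def push(obj, score)
+    index = @array.bsearch_index { |(s, _)| s <= score } || @array.length
+    @array.insert(index, [score, obj])
+  end
+
+  def pop
+    _score, obj = @array.pop
+    obj # nil when empty, like DHeap#pop
+  end
+end
+```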
-More complex heap varients, e.g. [Fibonacci heap], allow heaps to be split and
+More complex heap variants, e.g. [Fibonacci heap], allow heaps to be split and
merged which gives some graph algorithms a lower amortized time complexity. But
in practice, _d_-ary heaps have much lower overhead and often run faster.
[Fibonacci heap]: https://en.wikipedia.org/wiki/Fibonacci_heap
@@ -383,28 +400,63 @@
(e.g. a [red-black tree]) or a [skip-list].
[red-black tree]: https://en.wikipedia.org/wiki/Red%E2%80%93black_tree
[skip-list]: https://en.wikipedia.org/wiki/Skip_list
-[Hashed and Heirarchical Timing Wheels][timing wheels] (or some variant in that
-family of data structures) can be constructed to have effectively `O(1)` running
-time in most cases. Although the implementation for that data structure is more
+[Hashed and Hierarchical Timing Wheels][timing wheel] (or some variant in the
+timing wheel family of data structures) can have effectively `O(1)` running time
+in most cases. Although the implementation for that data structure is more
complex than a heap, it may be necessary for enormous values of N.
-[timing wheels]: http://www.cs.columbia.edu/~nahum/w6998/papers/ton97-timing-wheels.pdf
+[timing wheel]: http://www.cs.columbia.edu/~nahum/w6998/papers/ton97-timing-wheels.pdf
-## TODOs...
+## Supported platforms
-_TODO:_ Also ~~included is~~ _will include_ `DHeap::Map`, which augments the
-basic heap with an internal `Hash`, which maps objects to their position in the
-heap. This enforces a uniqueness constraint on items on the heap, and also
-allows items to be more efficiently deleted or adjusted. However maintaining
-the hash does lead to a small drop in normal `#push` and `#pop` performance.
+See the [CI workflow] for all supported platforms.
-_TODO:_ Also ~~included is~~ _will include_ `DHeap::Lazy`, which contains some
-features that are loosely inspired by go's timers. e.g: It lazily sifts its
-heap after deletion and adjustments, to achieve faster average runtime for *add*
-and *cancel* operations.
+[CI workflow]: https://github.com/nevans/d_heap/actions?query=workflow%3ACI
+
+`d_heap` may contain bugs on 32-bit systems. Currently, `d_heap` is only tested
+on 64-bit x86 CRuby 2.4-3.0 under Linux and Mac OS.
+
+## Caveats and TODOs (PRs welcome!)
+
+A `DHeap`'s internal array grows but never shrinks. At the very least, there
+should be a `#compact` or `#shrink` method, and the array should be compacted
+during `#freeze`. It might also make sense to automatically shrink (to no more
+than 2x the current size) during GC's compact phase.
+
+Benchmark sift-down min-child comparisons using SSE, AVX2, and AVX512F. This
+might lead to a different default `d` value (maybe 16 or 24?).
+
+Shrink scores to 64-bits: either store a type flag with each entry (this could
+be used to support non-numeric scores) or require users to choose between
+`Integer` or `Float` at construction time. Reducing memory usage should also
+improve speed for very large heaps.
+
+Patches to support JRuby, rubinius, 32-bit systems, or any other platforms are
+welcome! JRuby and TruffleRuby ought to be able to use [Java's PriorityQueue].
+Other platforms could fall back on the (slower) pure ruby implementation used
+by the benchmarks.
+
+[Java's PriorityQueue]: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/PriorityQueue.html
+
+Allow a max-heap (or other configurations of the compare function). This can be
+very easily implemented by just reversing the scores.
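+
+For example (not yet a built-in option), the min-heap `DHeap` can act as a
+max-heap by negating scores at push time:
+
+```ruby
+heap = DHeap.new
+heap.push(:low,  -1.0) # score 1, negated
+heap.push(:high, -9.0) # score 9, negated, so it pops first
+heap.pop               # => :high
+```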
+
+_Maybe_ allow non-numeric scores to be compared with `<=>`, _only_ if the basic
+numeric use case simplicity and speed can be preserved.
+
+Consider `DHeap::Monotonic`, which could rely on `#pop_below` for "current time"
+and move all values below that time onto an Array.
+
+Consider adding `DHeap::Lazy` or `DHeap.new(lazy: true)`, which could contain
+some features that are loosely inspired by go's timers. E.g. go lazily sifts
+its heap after deletion or adjustments, to achieve faster amortized runtime.
+There's no need to actually remove a deleted item from the heap if you re-add
+it before it's next evaluated. A similar trick is to store "far away" values
+in an internal `Hash`, assuming many will be deleted before they rise to the
+top. This could naturally evolve into a [timing wheel] variant.
## Development
After checking out the repo, run `bin/setup` to install dependencies. Then, run
`rake spec` to run the tests. You can also run `bin/console` for an interactive