README.md in d_heap-0.4.0 vs README.md in d_heap-0.5.0
- old
+ new
@@ -2,24 +2,24 @@
A fast [_d_-ary heap][d-ary heap] [priority queue] implementation for ruby,
implemented as a C extension.
With a regular queue, you expect "FIFO" behavior: first in, first out. With a
-stack you expect "LIFO": last in first out. With a priority queue, you push
-elements along with a score and the lowest scored element is the first to be
-popped. Priority queues are often used in algorithms for e.g. [scheduling] of
-timers or bandwidth management, [Huffman coding], and various graph search
-algorithms such as [Dijkstra's algorithm], [A* search], or [Prim's algorithm].
+stack you expect "LIFO": last in first out. A priority queue has a score for
+each element and elements are popped in order by score. Priority queues are
+often used in algorithms for e.g. [scheduling] of timers or bandwidth
+management, for [Huffman coding], and various graph search algorithms such as
+[Dijkstra's algorithm], [A* search], or [Prim's algorithm].
The _d_-ary heap data structure is a generalization of the [binary heap], in
-which the nodes have _d_ children instead of 2. This allows for "decrease
-priority" operations to be performed more quickly with the tradeoff of slower
-delete minimum. Additionally, _d_-ary heaps can have better memory cache
+which the nodes have _d_ children instead of 2. This allows for "insert" and
+"decrease priority" operations to be performed more quickly with the tradeoff of
+slower delete minimum. Additionally, _d_-ary heaps can have better memory cache
behavior than binary heaps, allowing them to run more quickly in practice
despite slower worst-case time complexity. In the worst case, a _d_-ary heap
-requires only `O(log n / log d)` to push, with the tradeoff that pop is `O(d log
-n / log d)`.
+requires only `O(log n / log d)` operations to push, with the tradeoff that pop
+requires `O(d log n / log d)`.
Although you should probably just use the default _d_ value of `4` (see the
analysis below), it's always advisable to benchmark your specific use-case.
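The index arithmetic behind a flat-array _d_-ary heap can be sketched in a few
lines of plain ruby (illustrative only, not this gem's C implementation):

```ruby
# In a d-ary heap stored in a flat array, node i's parent and children
# are found with index arithmetic alone -- no pointers needed.
D = 4 # the gem's default branching factor

def parent(i)
  (i - 1) / D
end

def children(i)
  first = D * i + 1
  (first..first + D - 1).to_a
end

children(0) # => [1, 2, 3, 4]
parent(4)   # => 0
parent(5)   # => 1
```

With `D = 4` the tree is half as deep as a binary heap's, which is where the
`O(log n / log d)` push cost comes from.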
[d-ary heap]: https://en.wikipedia.org/wiki/D-ary_heap
@@ -31,57 +31,62 @@
[A* search]: https://en.wikipedia.org/wiki/A*_search_algorithm#Description
[Prim's algorithm]: https://en.wikipedia.org/wiki/Prim%27s_algorithm
## Usage
-The basic API is:
-* `heap << object` adds a value as its own score.
-* `heap.push(score, value)` adds a value with an extrinsic score.
+Quick reference:
+
+* `heap << object` adds a value, with `Float(object)` as its score.
+* `heap.push(object, score)` adds a value with an extrinsic score.
* `heap.pop` removes and returns the value with the minimum score.
* `heap.pop_lte(score)` pops if the minimum score is `<=` the provided score.
* `heap.peek` to view the minimum value without popping it.
+* `heap.clear` to remove all items from the heap.
+* `heap.empty?` returns true if the heap is empty.
+* `heap.size` returns the number of items in the heap.
-The score must be `Integer` or `Float` or convertable to a `Float` via
-`Float(score)` (i.e. it should implement `#to_f`). Constraining scores to
-numeric values gives a 40+% speedup under some benchmarks!
+The basic API is `#push(object, score)` and `#pop`. If your values behave as
+their own score, then you can push with `#<<`. If the score changes while the
+object is still in the heap, it will not be re-evaluated. The score must be an
+`Integer` or a `Float`, or convertible to a `Float` via `Float(score)` (i.e. it
+should implement `#to_f`).
-_n.b._ `Integer` _scores must have an absolute value that fits into_ `unsigned
-long long`. _This is architecture dependant but on an IA-64 system this is 64
-bits, which gives a range of -18,446,744,073,709,551,615 to
-+18446744073709551615._
-
-_Comparing arbitary objects via_ `a <=> b` _was the original design and may
-be added back in a future version,_ if (and only if) _it can be done without
-impacting the speed of numeric comparisons._
-
```ruby
require "d_heap"
-Task = Struct.new(:id) # for demonstration
+Task = Struct.new(:id, :time) do
+ def to_f; time.to_f end
+end
+t1 = Task.new(1, Time.now + 5*60)
+t2 = Task.new(2, Time.now + 50)
+t3 = Task.new(3, Time.now + 60)
+t4 = Task.new(4, Time.now + 5)
heap = DHeap.new # defaults to a 4-heap
+# if the object returns its own score via #to_f, "<<" is the simplest API
+heap << t1 << t2
-# storing [score, value] tuples
-heap.push Time.now + 5*60, Task.new(1)
-heap.push Time.now + 30, Task.new(2)
-heap.push Time.now + 60, Task.new(3)
-heap.push Time.now + 5, Task.new(4)
+# or push with an explicit score
+heap.push t3, t3.to_f
+heap.push t4, t4 # score can be implicitly cast with Float
-# peeking and popping (using last to get the task and ignore the time)
-heap.pop # => Task[4]
-heap.pop # => Task[2]
-heap.peek # => Task[3], but don't pop it from the heap
-heap.pop # => Task[3]
-heap.pop # => Task[1]
+# peek and pop
+heap.pop # => #<struct Task id=4, time=2021-01-17 17:02:22.5574 -0500>
+heap.pop # => #<struct Task id=2, time=2021-01-17 17:03:07.5574 -0500>
+heap.peek # => #<struct Task id=3, time=2021-01-17 17:03:17.5574 -0500>
+heap.pop # => #<struct Task id=3, time=2021-01-17 17:03:17.5574 -0500>
+heap.pop # => #<struct Task id=1, time=2021-01-17 17:07:17.5574 -0500>
heap.empty? # => true
heap.pop # => nil
```
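`#pop_lte` is handy for timer-style workloads: pop everything that is due, and
stop at the first item that isn't. Its contract can be sketched in plain ruby (a
simplified sorted-array stand-in, not this gem's C implementation):

```ruby
# A minimal stand-in for the #pop_lte contract: pop only while the
# minimum score is <= the given bound, otherwise return nil.
class TinyQueue
  def initialize
    @items = [] # [score, value] pairs, kept sorted ascending by score
  end

  def push(value, score)
    i = @items.bsearch_index { |(s, _)| s > score } || @items.size
    @items.insert(i, [score, value])
    self
  end

  def pop_lte(max_score)
    return nil if @items.empty? || @items.first[0] > max_score
    @items.shift[1]
  end
end

q = TinyQueue.new
q.push(:due, 1.0).push(:later, 9.0)
q.pop_lte(5.0) # => :due
q.pop_lte(5.0) # => nil (:later scores above 5.0)
```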
-If your values behave as their own score, by being convertible via
-`Float(value)`, then you can use `#<<` for implicit scoring. The score should
-not change for as long as the value remains in the heap, since it will not be
-re-evaluated after being pushed.
+Constraining scores to numeric values gives more than 50% speedup under some
+benchmarks! _n.b._ `Integer` _scores must have an absolute value that fits
+into_ `unsigned long long`. _This is architecture dependent, but on an IA-64
+system this is 64 bits, which gives a range of -18,446,744,073,709,551,615 to
++18,446,744,073,709,551,615. Comparing arbitrary objects via_ `a <=> b` _was the
+original design and may be added back in a future version,_ if (and only if) _it
+can be done without impacting the speed of numeric comparisons._
```ruby
heap.clear
# The score can be derived from the value by using to_f.
@@ -140,34 +145,34 @@
$ gem install d_heap
## Motivation
One naive approach to a priority queue is to maintain an array in sorted order.
-This can be very simply implemented using `Array#bseach_index` + `Array#insert`.
-This can be very fast—`Array#pop` is `O(1)`—but the worst-case for insert is
-`O(n)` because it may need to `memcpy` a significant portion of the array.
+This can be very simply implemented in ruby with `Array#bsearch_index` +
+`Array#insert`. This can be very fast—`Array#pop` is `O(1)`—but the worst-case
+for insert is `O(n)` because it may need to `memcpy` a significant portion of
+the array.
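The sorted-array approach described above can be sketched in a few lines (a
simplified illustration, not the actual benchmark code):

```ruby
# Keep scores in *descending* order so the minimum sits at the end of
# the array, where Array#pop is O(1). Array#insert is the O(n) step.
sorted = []

push = lambda do |score, value|
  i = sorted.bsearch_index { |(s, _)| s < score } || sorted.size
  sorted.insert(i, [score, value])
end

push.call(3.0, :c)
push.call(1.0, :a)
push.call(2.0, :b)

sorted.pop # => [1.0, :a] (minimum score pops first)
sorted.pop # => [2.0, :b]
```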
The standard way to implement a priority queue is with a binary heap. Although
this increases the time for `pop`, it converts the amortized time per push + pop
from `O(n)` to `O(d log n / log d)`.
However, I was surprised to find that—at least for some benchmarks—my pure ruby
heap implementation was much slower than inserting into and popping from a fully
-sorted array. The reason for this surprising result: Although it is `O(n)`,
+sorted array. The reasons for this surprising result: Although it is `O(n)`,
`memcpy` has a _very_ small constant factor, and calling `<=>` from ruby code
has relatively _much_ larger constant factors. If your queue contains only a
few thousand items, the overhead of those extra calls to `<=>` is _far_ more
than occasionally calling `memcpy`. In the worst case, a _d_-heap will require
`d + 1` times more comparisons for each push + pop than a `bsearch` + `insert`
sorted array.
Moving the sift-up and sift-down code into C helps some. But much more helpful
is optimizing the comparison of numeric scores, so `a <=> b` never needs to be
-called. I'm hopeful that MJIT will eventually obsolete this C-extension. JRuby
-or TruffleRuby may already run the pure ruby version at high speed. This can be
-hotspot code, and the basic ruby implementation should perform well if not for
-the high overhead of `<=>`.
+called. I'm hopeful that MJIT will eventually obsolete this C-extension. This
+can be hotspot code, and a basic ruby implementation could perform well if `<=>`
+had much lower overhead.
## Analysis
### Time complexity
@@ -215,11 +220,11 @@
## Benchmarks
_See `bin/benchmarks` and `docs/benchmarks.txt`, as well as `bin/profile` and
`docs/profile.txt` for more details or updated results. These benchmarks were
-measured with v0.4.0 and ruby 2.7.2 without MJIT enabled._
+measured with v0.5.0 and ruby 2.7.2 without MJIT enabled._
These benchmarks use very simple implementations for a pure-ruby heap and an
array that is kept sorted using `Array#bsearch_index` and `Array#insert`. For
comparison, an alternate implementation using `Array#min` and `Array#delete_at` is
also shown.
@@ -246,77 +251,96 @@
the other implementations for each scenario, although the difference is still
relatively small. The pure ruby binary heap is 2x or more slower than bsearch +
insert for the common push/pop scenario.
== push N (N=5) ==========================================================
- push N (c_dheap): 1701338.1 i/s
- push N (rb_heap): 971614.1 i/s - 1.75x slower
- push N (bsearch): 946363.7 i/s - 1.80x slower
+ push N (c_dheap): 1969700.7 i/s
+ push N (c++ stl): 1049738.1 i/s - 1.88x slower
+ push N (rb_heap): 928435.2 i/s - 2.12x slower
+ push N (bsearch): 921060.0 i/s - 2.14x slower
== push N then pop N (N=5) ===============================================
- push N + pop N (c_dheap): 1087944.8 i/s
- push N + pop N (findmin): 841708.1 i/s - 1.29x slower
- push N + pop N (bsearch): 773252.7 i/s - 1.41x slower
- push N + pop N (rb_heap): 471852.9 i/s - 2.31x slower
+ push N + pop N (c_dheap): 1375805.0 i/s
+ push N + pop N (c++ stl): 1134997.5 i/s - 1.21x slower
+ push N + pop N (findmin): 862913.1 i/s - 1.59x slower
+ push N + pop N (bsearch): 762887.1 i/s - 1.80x slower
+ push N + pop N (rb_heap): 506890.4 i/s - 2.71x slower
== Push/pop with pre-filled queue of size=N (N=5) ========================
- push + pop (c_dheap): 5525418.8 i/s
- push + pop (findmin): 5003904.8 i/s - 1.10x slower
- push + pop (bsearch): 4320581.8 i/s - 1.28x slower
- push + pop (rb_heap): 2207042.0 i/s - 2.50x slower
+ push + pop (c_dheap): 9044435.5 i/s
+ push + pop (c++ stl): 7534583.4 i/s - 1.20x slower
+ push + pop (findmin): 5026155.1 i/s - 1.80x slower
+ push + pop (bsearch): 4300260.0 i/s - 2.10x slower
+ push + pop (rb_heap): 2299499.7 i/s - 3.93x slower
By N=21, `DHeap` has pulled significantly ahead of bsearch + insert for all
scenarios, but the pure ruby heap is still slower than every other
implementation—even resorting the array after every `#push`—in any scenario that
uses `#pop`.
== push N (N=21) =========================================================
- push N (c_dheap): 408307.0 i/s
- push N (rb_heap): 212275.2 i/s - 1.92x slower
- push N (bsearch): 169583.2 i/s - 2.41x slower
+ push N (c_dheap): 464231.4 i/s
+ push N (c++ stl): 305546.7 i/s - 1.52x slower
+ push N (rb_heap): 202803.7 i/s - 2.29x slower
+ push N (bsearch): 168678.7 i/s - 2.75x slower
== push N then pop N (N=21) ==============================================
- push N + pop N (c_dheap): 199435.5 i/s
- push N + pop N (findmin): 162024.5 i/s - 1.23x slower
- push N + pop N (bsearch): 146284.3 i/s - 1.36x slower
- push N + pop N (rb_heap): 72289.0 i/s - 2.76x slower
+ push N + pop N (c_dheap): 298350.3 i/s
+ push N + pop N (c++ stl): 252227.1 i/s - 1.18x slower
+ push N + pop N (findmin): 161998.7 i/s - 1.84x slower
+ push N + pop N (bsearch): 143432.3 i/s - 2.08x slower
+ push N + pop N (rb_heap): 79622.1 i/s - 3.75x slower
== Push/pop with pre-filled queue of size=N (N=21) =======================
- push + pop (c_dheap): 4836860.0 i/s
- push + pop (findmin): 4467453.9 i/s - 1.08x slower
- push + pop (bsearch): 3345458.4 i/s - 1.45x slower
- push + pop (rb_heap): 1560476.3 i/s - 3.10x slower
+ push + pop (c_dheap): 8855093.4 i/s
+ push + pop (c++ stl): 7223079.5 i/s - 1.23x slower
+ push + pop (findmin): 4542913.7 i/s - 1.95x slower
+ push + pop (bsearch): 3461802.4 i/s - 2.56x slower
+ push + pop (rb_heap): 1845488.7 i/s - 4.80x slower
-At higher values of N, `DHeap`'s logarithmic growth leads to little slowdown
-of `DHeap#push`, while insert's linear growth causes it to run slower and
-slower. But because `#pop` is O(1) for a sorted array and O(d log n / log d)
-for a _d_-heap, scenarios involving `#pop` remain relatively close even as N
-increases to 5k:
+At higher values of N, a heap's logarithmic growth leads to only a little
+slowdown of `#push`, while insert's linear growth causes it to run noticeably
+slower and slower. But because `#pop` is `O(1)` for a sorted array and `O(d log
+n / log d)` for a heap, scenarios involving both `#push` and `#pop` remain
+relatively close, and bsearch + insert still runs faster than a pure ruby heap,
+even up to queues with 10k items. But as queue size increases beyond that, the
+linear time complexity of keeping a sorted array dominates.
- == Push/pop with pre-filled queue of size=N (N=5461) ==============
- push + pop (c_dheap): 2718225.1 i/s
- push + pop (bsearch): 1793546.4 i/s - 1.52x slower
- push + pop (rb_heap): 707139.9 i/s - 3.84x slower
- push + pop (findmin): 122316.0 i/s - 22.22x slower
+ == push + pop (rb_heap)
+ queue size = 10000: 736618.2 i/s
+ queue size = 25000: 670186.8 i/s - 1.10x slower
+ queue size = 50000: 618156.7 i/s - 1.19x slower
+ queue size = 100000: 579250.7 i/s - 1.27x slower
+ queue size = 250000: 572795.0 i/s - 1.29x slower
+ queue size = 500000: 543648.3 i/s - 1.35x slower
+ queue size = 1000000: 513523.4 i/s - 1.43x slower
+ queue size = 2500000: 460848.9 i/s - 1.60x slower
+ queue size = 5000000: 445234.5 i/s - 1.65x slower
+ queue size = 10000000: 423119.0 i/s - 1.74x slower
-Somewhat surprisingly, bsearch + insert still runs faster than a pure ruby heap
-for the repeated push/pop scenario, all the way up to N as high as 87k:
+ == push + pop (bsearch)
+ queue size = 10000: 786334.2 i/s
+ queue size = 25000: 364963.8 i/s - 2.15x slower
+ queue size = 50000: 200520.6 i/s - 3.92x slower
+ queue size = 100000: 88607.0 i/s - 8.87x slower
+ queue size = 250000: 34530.5 i/s - 22.77x slower
+ queue size = 500000: 17965.4 i/s - 43.77x slower
+ queue size = 1000000: 5638.7 i/s - 139.45x slower
+ queue size = 2500000: 1302.0 i/s - 603.93x slower
+ queue size = 5000000: 592.0 i/s - 1328.25x slower
+ queue size = 10000000: 288.8 i/s - 2722.66x slower
- == push N (N=87381) ======================================================
- push N (c_dheap): 92.8 i/s
- push N (rb_heap): 43.5 i/s - 2.13x slower
- push N (bsearch): 2.9 i/s - 31.70x slower
-
- == push N then pop N (N=87381) ===========================================
- push N + pop N (c_dheap): 22.6 i/s
- push N + pop N (rb_heap): 5.5 i/s - 4.08x slower
- push N + pop N (bsearch): 2.9 i/s - 7.90x slower
-
- == Push/pop with pre-filled queue of size=N (N=87381) ====================
- push + pop (c_dheap): 1815277.3 i/s
- push + pop (bsearch): 762343.2 i/s - 2.38x slower
- push + pop (rb_heap): 535913.6 i/s - 3.39x slower
- push + pop (findmin): 2262.8 i/s - 802.24x slower
+ == push + pop (c_dheap)
+ queue size = 10000: 7311366.6 i/s
+ queue size = 50000: 6737824.5 i/s - 1.09x slower
+ queue size = 25000: 6407340.6 i/s - 1.14x slower
+ queue size = 100000: 6254396.3 i/s - 1.17x slower
+ queue size = 250000: 5917684.5 i/s - 1.24x slower
+ queue size = 500000: 5126307.6 i/s - 1.43x slower
+ queue size = 1000000: 4403494.1 i/s - 1.66x slower
+ queue size = 2500000: 3304088.2 i/s - 2.21x slower
+ queue size = 5000000: 2664897.7 i/s - 2.74x slower
+ queue size = 10000000: 2137927.6 i/s - 3.42x slower
## Profiling
_n.b. `Array#fetch` is reading the input data, external to heap operations.
These benchmarks use integers for all scores, which enables significantly faster