README.md in d_heap-0.3.0 vs README.md in d_heap-0.4.0

- old
+ new

@@ -1,47 +1,130 @@ # DHeap
-A fast _d_-ary heap implementation for ruby, useful in priority queues and graph
-algorithms.
+A fast [_d_-ary heap][d-ary heap] [priority queue] for ruby, implemented as a C
+extension.
+
+With a regular queue, you expect "FIFO" behavior: first in, first out. With a
+stack you expect "LIFO": last in, first out. With a priority queue, you push
+elements along with a score, and the lowest-scored element is the first to be
+popped. Priority queues are often used in algorithms for e.g. [scheduling] of
+timers or bandwidth management, [Huffman coding], and various graph search
+algorithms such as [Dijkstra's algorithm], [A* search], or [Prim's algorithm].

-The _d_-ary heap data structure is a generalization of the binary heap, in which
-the nodes have _d_ children instead of 2. This allows for "decrease priority"
-operations to be performed more quickly with the tradeoff of slower delete
-minimum. Additionally, _d_-ary heaps can have better memory cache behavior than
-binary heaps, allowing them to run more quickly in practice despite slower
-worst-case time complexity. In the worst case, a _d_-ary heap requires only
-`O(log n / log d)` to push, with the tradeoff that pop is `O(d log n / log d)`.
+The _d_-ary heap data structure is a generalization of the [binary heap], in
+which the nodes have _d_ children instead of 2. This allows "decrease priority"
+operations to be performed more quickly, with the tradeoff of slower delete
+minimum. Additionally, _d_-ary heaps can have better memory cache behavior than
+binary heaps, allowing them to run more quickly in practice despite slower
+worst-case time complexity. In the worst case, a _d_-ary heap requires only
+`O(log n / log d)` to push, with the tradeoff that pop is `O(d log n / log d)`.

-Although you should probably just stick with the default _d_ value of `4`, it
-may be worthwhile to benchmark your specific scenario.
+Although you should probably just use the default _d_ value of `4` (see the
+analysis below), it's always advisable to benchmark your specific use-case.
+
+[d-ary heap]: https://en.wikipedia.org/wiki/D-ary_heap
+[priority queue]: https://en.wikipedia.org/wiki/Priority_queue
+[binary heap]: https://en.wikipedia.org/wiki/Binary_heap
+[scheduling]: https://en.wikipedia.org/wiki/Scheduling_(computing)
+[Huffman coding]: https://en.wikipedia.org/wiki/Huffman_coding#Compression
+[Dijkstra's algorithm]: https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm#Using_a_priority_queue
+[A* search]: https://en.wikipedia.org/wiki/A*_search_algorithm#Description
+[Prim's algorithm]: https://en.wikipedia.org/wiki/Prim%27s_algorithm

## Usage

-The simplest way to use it is simply with `#push` and `#pop`. Push takes a
-score and a value, and pop returns the value with the current minimum score.
+The basic API is:
+
+* `heap << object` adds a value, which serves as its own score.
+* `heap.push(score, value)` adds a value with an extrinsic score.
+* `heap.pop` removes and returns the value with the minimum score.
+* `heap.pop_lte(score)` pops only if the minimum score is `<=` the provided score.
+* `heap.peek` views the minimum value without popping it.
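+
+For example, a scheduler might use `#pop_lte` to drain only the tasks that are
+already due. (A minimal sketch: the scores and tasks are hypothetical, and it
+assumes `#pop_lte` returns `nil` when the minimum score is above the limit.)
+
+```ruby
+require "d_heap"
+
+queue = DHeap.new
+queue.push 1.5, :task_a # due at t=1.5
+queue.push 2.0, :task_b # due at t=2.0
+queue.push 9.9, :task_c # not due until t=9.9
+
+now = 5.0
+due = []
+while (task = queue.pop_lte(now))
+  due << task
+end
+due        # => [:task_a, :task_b]
+queue.peek # => :task_c
+```
+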
+The score must be `Integer` or `Float` or convertible to a `Float` via
+`Float(score)` (i.e. it should implement `#to_f`). Constraining scores to
+numeric values gives a 40+% speedup under some benchmarks!
+
+_n.b._ `Integer` _scores must have an absolute value that fits into_ `unsigned
+long long`. _This is architecture dependent, but on an IA-64 system this is 64
+bits, which gives a range of -18,446,744,073,709,551,615 to
++18,446,744,073,709,551,615._
+
+_Comparing arbitrary objects via_ `a <=> b` _was the original design and may
+be added back in a future version,_ if (and only if) _it can be done without
+impacting the speed of numeric comparisons._
+
```ruby
require "d_heap"

-heap = DHeap.new # defaults to a 4-ary heap
+Task = Struct.new(:id) # for demonstration
+heap = DHeap.new # defaults to a 4-heap
+
# storing [score, value] tuples
heap.push Time.now + 5*60, Task.new(1)
heap.push Time.now + 30, Task.new(2)
heap.push Time.now + 60, Task.new(3)
heap.push Time.now + 5, Task.new(4)

-# peeking and popping (using last to get the task and ignore the time)
-heap.pop.last # => Task[4]
-heap.pop.last # => Task[2]
-heap.peak.last # => Task[3], but don't pop it
-heap.pop.last # => Task[3]
-heap.pop.last # => Task[1]
+# peeking and popping
+heap.pop    # => Task[4]
+heap.pop    # => Task[2]
+heap.peek   # => Task[3], but don't pop it from the heap
+heap.pop    # => Task[3]
+heap.pop    # => Task[1]
+heap.empty? # => true
+heap.pop    # => nil
```

-Read the `rdoc` for more detailed documentation and examples.
+If your values behave as their own score, by being convertible via
+`Float(value)`, then you can use `#<<` for implicit scoring. The score should
+not change for as long as the value remains in the heap, since it will not be
+re-evaluated after being pushed.
+
+```ruby
+heap.clear
+
+# The score can be derived from the value by using to_f.
+# "a <=> b" is *much* slower than comparing numbers, so it isn't used.
+class Event
+  include Comparable
+  attr_reader :time, :payload
+  alias_method :to_time, :time
+
+  def initialize(time, payload)
+    @time = time.to_time
+    @payload = payload
+    freeze
+  end
+
+  def to_f
+    time.to_f
+  end
+
+  def <=>(other)
+    to_f <=> other.to_f
+  end
+end
+
+# example events; any Float-convertible values work
+comparable_min = Event.new(Time.now + 1, "first")
+comparable_mid = Event.new(Time.now + 2, "second")
+comparable_max = Event.new(Time.now + 3, "third")
+
+heap << comparable_max # sorts last, by its to_f value
+heap << comparable_min # sorts first, by its to_f value
+heap << comparable_mid # sorts in the middle, by its to_f value
+heap.pop    # => comparable_min
+heap.pop    # => comparable_mid
+heap.pop    # => comparable_max
+heap.empty? # => true
+heap.pop    # => nil
+```
+
+You can also pass a value into `#pop(max)`, which will only pop if the minimum
+score is less than or equal to `max` (the same condition as `#pop_lte` above).
+
+Read the [API documentation] for more detailed documentation and examples.
+
+[API documentation]: https://rubydoc.info/gems/d_heap/DHeap
+
## Installation

Add this line to your application's Gemfile:

```ruby
@@ -56,114 +139,231 @@ $ gem install d_heap

## Motivation

-Sometimes you just need a priority queue, right? With a regular queue, you
-expect "FIFO" behavior: first in, first out. With a priority queue, you push
-with a score (or your elements are comparable), and you want to be able to
-efficiently pop off the minimum (or maximum) element.
+One naive approach to a priority queue is to maintain an array in sorted order.
+This can be implemented very simply using `Array#bsearch_index` +
+`Array#insert`. This can be very fast—`Array#pop` is `O(1)`—but the worst-case
+for insert is `O(n)`, because it may need to `memcpy` a significant portion of
+the array.

-One obvious approach is to simply maintain an array in sorted order. And
-ruby's Array class makes it simple to maintain a sorted array by combining
-`#bsearch_index` with `#insert`. With certain insert/remove workloads that can
-perform very well, but in the worst-case an insert or delete can result in O(n),
-since `#insert` may need to `memcpy` or `memmove` a significant portion of the
-array.
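+
+As a concrete illustration, the whole sorted-array approach fits in a few
+lines. (A hypothetical sketch, not part of this gem: values serve as their own
+scores, and the array is kept sorted in _descending_ order so that the minimum
+is at the end, where `Array#pop` is `O(1)`.)
+
+```ruby
+class SortedArrayPQ
+  def initialize
+    @entries = [] # kept sorted in descending order; minimum is last
+  end
+
+  def <<(value)
+    # O(log n) comparisons to find the insertion point...
+    index = @entries.bsearch_index { |other| other <= value }
+    # ...but O(n) worst-case to insert, due to memcpy/memmove.
+    @entries.insert(index || @entries.size, value)
+    self
+  end
+
+  def pop
+    @entries.pop # O(1): the minimum is the last element
+  end
+end
+```
+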
+The standard way to implement a priority queue is with a binary heap. Although
+this increases the time for `pop`, it converts the amortized time per push +
+pop from `O(n)` to `O(d log n / log d)`.

-But the standard way to efficiently and simply solve this problem is using a
-binary heap. Although it increases the time for `pop`, it converts the
-amortized time per push + pop from `O(n)` to `O(d log n / log d)`.

+However, I was surprised to find that—at least for some benchmarks—my pure ruby
+heap implementation was much slower than inserting into and popping from a
+fully sorted array. The reason for this surprising result: although it is
+`O(n)`, `memcpy` has a _very_ small constant factor, and calling `<=>` from
+ruby code has relatively _much_ larger constant factors. If your queue contains
+only a few thousand items, the overhead of those extra calls to `<=>` is _far_
+more than occasionally calling `memcpy`. In the worst case, a _d_-heap will
+require `d + 1` times more comparisons for each push + pop than a `bsearch` +
+`insert` sorted array.

-I was surprised to find that, at least under certain benchmarks, my pure ruby
-heap implementation was usually slower than inserting into a fully sorted
-array. While this is a testament to ruby's fine-tuned Array implementationw, a
-heap implementated in C should easily peform faster than `Array#insert`.

+Moving the sift-up and sift-down code into C helps some. But much more helpful
+is optimizing the comparison of numeric scores, so `a <=> b` never needs to be
+called. I'm hopeful that MJIT will eventually make this C extension obsolete.
+JRuby or TruffleRuby may already run the pure ruby version at high speed. This
+can be hotspot code, and the basic ruby implementation should perform well if
+not for the high overhead of `<=>`.

-The biggest issue is that it just takes far too much time to call `<=>` from
-ruby code: A sorted array only requires `log n / log 2` comparisons to insert
-and no comparisons to pop. However a _d_-ary heap requires `log n / log d` to
-insert plus an additional `d log n / log d` to pop. If your queue contains only
-a few hundred items at once, the overhead of those extra calls to `<=>` is far
-more than occasionally calling `memcpy`.

+## Analysis

-It's likely that MJIT will eventually make the C-extension completely
-unnecessary. This is definitely hotspot code, and the basic ruby implementation
-would work fine, if not for that `<=>` overhead. Until then... this gem gets
-the job done.

+### Time complexity

-## TODOs...

+There are two fundamental heap operations: sift-up (used by push) and sift-down
+(used by pop).

-_TODO:_ In addition to a basic _d_-ary heap class (`DHeap`), this library
-~~includes~~ _will include_ extensions to `Array`, allowing an Array to be
-directly handled as a priority queue. These extension methods are meant to be
-used similarly to how `#bsearch` and `#bsearch_index` might be used.

+* Both sift operations can perform as many as `log n / log d` swaps, as the
+  element may sift from the bottom of the tree to the top, or vice versa.
+* Sift-up performs a single comparison per swap: `O(1)`.
+  So pushing a new element is `O(log n / log d)`.
+* Sift-down performs as many as `d` comparisons per swap: `O(d)`.
+  So popping the min element is `O(d log n / log d)`.

-_TODO:_ Also ~~included is~~ _will include_ `DHeap::Set`, which augments the
-basic heap with an internal `Hash`, which maps a set of values to scores.
-loosely inspired by go's timers. e.g: It lazily sifts its heap after deletion
-and adjustments, to achieve faster average runtime for *add* and *cancel*
-operations.

+Assuming every inserted element is eventually deleted from the root, d=4
+requires the fewest comparisons per combined insert and delete, i.e. it
+minimizes `(1 + d) * log n / log d` (the common `log n` factor is dropped
+below):

-_TODO:_ Also ~~included is~~ _will include_ `DHeap::Timers`, which contains some
-features that are loosely inspired by go's timers. e.g: It lazily sifts its
-heap after deletion and adjustments, to achieve faster average runtime for *add*
-and *cancel* operations.

+* (1 + 2) / ln 2 = 4.328085
+* (1 + 3) / ln 3 = 3.640957
+* (1 + 4) / ln 4 = 3.606738
+* (1 + 5) / ln 5 = 3.728010
+* (1 + 6) / ln 6 = 3.906774
+* etc...

-Additionally, I was inspired by reading go's "timer.go" implementation to
-experiment with a 4-ary heap instead of the traditional binary heap. In the
-case of timers, new timers are usually scheduled to run after most of the
-existing timers. And timers are usually canceled before they have a chance to
-run. While a binary heap holds 50% of its elements in its last layer, 75% of a
-4-ary heap will have no children. That diminishes the extra comparison overhead
-during sift-down.

+Leaf nodes require no comparisons to sift down, and higher values of d have a
+higher percentage of leaf nodes:
+
+* d=2 has ~50% leaves,
+* d=3 has ~67% leaves,
+* d=4 has ~75% leaves,
+* and so on...
+
+See https://en.wikipedia.org/wiki/D-ary_heap#Analysis for deeper analysis.
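+
+The minimum at d=4 is easy to check numerically (a throwaway script, not part
+of the gem):
+
+```ruby
+# Comparisons per push + pop are proportional to (1 + d) / ln(d):
+# one comparison per level to sift up, d per level to sift down,
+# over log(n) / log(d) levels (the common log(n) factor is dropped).
+(2..8).each do |d|
+  printf "d=%d: %.6f\n", d, (1 + d) / Math.log(d)
+end
+# d=2: 4.328085
+# d=3: 3.640957
+# d=4: 3.606738  <- minimum
+# d=5: 3.728010
+# d=6: 3.906774
+# d=7: 4.111187
+# d=8: 4.328085
+```
+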
+### Space complexity
+
+Space usage is linear, regardless of d. However, higher d values may provide
+better cache locality. Because the heap is a complete _d_-ary tree, the
+elements can be stored in a flat array, without the need for tree or list
+pointers.
+
+Ruby can compare Numeric values _much_ faster than other ruby objects, even if
+those objects simply delegate comparison to internal Numeric values. And it is
+often useful to use external scores for otherwise uncomparable values. So
+`DHeap` stores twice as many entries (one for the score and one for the value)
+as an array which only stores values.

## Benchmarks

-_TODO: put benchmarks here._
+_See `bin/benchmarks` and `docs/benchmarks.txt`, as well as `bin/profile` and
+`docs/profile.txt`, for more details or updated results. These benchmarks were
+measured with v0.4.0 and ruby 2.7.2 without MJIT enabled._

-## Analysis
+These benchmarks use very simple implementations for a pure-ruby heap and for
+an array that is kept sorted using `Array#bsearch_index` and `Array#insert`
+(see the sketches above and below). For comparison, an alternate "findmin"
+implementation using `Array#min` and `Array#delete_at` is also shown.

-### Time complexity
+Three different scenarios are measured:
+
+* Push N values but never pop (clearing between each set of pushes).
+* Push N values and then pop N values. Although this could be used for heap
+  sort, we're unlikely to choose heap sort over Ruby's quick sort
+  implementation. I'm using this scenario to represent the amortized cost of
+  creating a heap and (eventually) draining it.
+* For a heap of size N, repeatedly push and pop while keeping a stable size.
+  This is a _very simple_ approximation of how most scheduler/timer heaps
+  would be used: usually, when a timer fires, it will be quickly replaced by a
+  new timer, and the overall count of timers will remain roughly stable.

-Both sift operations can perform (log[d] n = log n / log d) swaps.
-Swap up performs only a single comparison per swap: O(1).
-Swap down performs as many as d comparions per swap: O(d).
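+
+In spirit, the pure-ruby heap contender ("rb_heap") is a binary min-heap along
+these lines (a simplified sketch, not the exact benchmark code; values serve
+as their own scores):
+
+```ruby
+class NaiveBinaryHeap
+  def initialize
+    @heap = []
+  end
+
+  def <<(value)
+    @heap << value
+    index = @heap.size - 1
+    while index > 0 # sift-up: one comparison per level
+      parent = (index - 1) / 2
+      break unless @heap[index] < @heap[parent]
+      @heap[index], @heap[parent] = @heap[parent], @heap[index]
+      index = parent
+    end
+    self
+  end
+
+  def pop
+    return nil if @heap.empty?
+    min = @heap.first
+    last = @heap.pop
+    unless @heap.empty?
+      @heap[0] = last
+      index = 0
+      loop do # sift-down: up to two comparisons per level
+        child = 2 * index + 1
+        break if child >= @heap.size
+        sibling = child + 1
+        child = sibling if sibling < @heap.size && @heap[sibling] < @heap[child]
+        break unless @heap[child] < @heap[index]
+        @heap[index], @heap[child] = @heap[child], @heap[index]
+        index = child
+      end
+    end
+    min
+  end
+end
+```
+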
+In these benchmarks, `DHeap` runs faster than all other implementations in
+every scenario and at every value of N, although the difference is much more
+noticeable at higher values of N. The pure ruby heap implementation is
+competitive for `push` alone at every value of N, but is significantly slower
+than bsearch + insert for push + pop until N is _very_ large (somewhere between
+10k and 100k)!

-Inserting an item is O(log n / log d).
-Deleting the root is O(d log n / log d).

+Even for very small values of N, `DHeap` runs faster than the other
+implementations in each scenario, although the difference is still relatively
+small. The pure ruby binary heap is 2x or more slower than bsearch + insert for
+the common push/pop scenario.

-Assuming every inserted item is eventually deleted from the root, d=4 requires
-the fewest comparisons for combined insert and delete:
- * (1 + 2) lg 2 = 4.328085
- * (1 + 3) lg 3 = 3.640957
- * (1 + 4) lg 4 = 3.606738
- * (1 + 5) lg 5 = 3.728010
- * (1 + 6) lg 6 = 3.906774
- * etc...

+    == push N (N=5) ==========================================================
+    push N (c_dheap):  1701338.1 i/s
+    push N (rb_heap):   971614.1 i/s - 1.75x slower
+    push N (bsearch):   946363.7 i/s - 1.80x slower

-Leaf nodes require no comparisons to shift down, and higher values for d have
-higher percentage of leaf nodes:
- * d=2 has ~50% leaves,
- * d=3 has ~67% leaves,
- * d=4 has ~75% leaves,
- * and so on...

+    == push N then pop N (N=5) ===============================================
+    push N + pop N (c_dheap):  1087944.8 i/s
+    push N + pop N (findmin):   841708.1 i/s - 1.29x slower
+    push N + pop N (bsearch):   773252.7 i/s - 1.41x slower
+    push N + pop N (rb_heap):   471852.9 i/s - 2.31x slower

-See https://en.wikipedia.org/wiki/D-ary_heap#Analysis for deeper analysis.

+    == Push/pop with pre-filled queue of size=N (N=5) ========================
+    push + pop (c_dheap):  5525418.8 i/s
+    push + pop (findmin):  5003904.8 i/s - 1.10x slower
+    push + pop (bsearch):  4320581.8 i/s - 1.28x slower
+    push + pop (rb_heap):  2207042.0 i/s - 2.50x slower

-### Space complexity

+By N=21, `DHeap` has pulled significantly ahead of bsearch + insert in all
+scenarios, but the pure ruby heap is still slower than every other
+implementation—even resorting the array after every `#push`—in any scenario
+that uses `#pop`.

-Because the heap is a complete binary tree, space usage is linear, regardless
-of d. However higher d values may provide better cache locality.

+    == push N (N=21) =========================================================
+    push N (c_dheap):  408307.0 i/s
+    push N (rb_heap):  212275.2 i/s - 1.92x slower
+    push N (bsearch):  169583.2 i/s - 2.41x slower

-We can run comparisons much much faster for Numeric or String objects than for
-ruby objects which delegate comparison to internal Numeric or String objects.
-And it is often advantageous to use extrinsic scores for uncomparable items.
-For this, our internal array uses twice as many entries (one for score and one
-for value) as it would if it only supported intrinsic comparison or used an
-un-memoized "sort_by" proc.

+    == push N then pop N (N=21) ==============================================
+    push N + pop N (c_dheap):  199435.5 i/s
+    push N + pop N (findmin):  162024.5 i/s - 1.23x slower
+    push N + pop N (bsearch):  146284.3 i/s - 1.36x slower
+    push N + pop N (rb_heap):   72289.0 i/s - 2.76x slower
+
+    == Push/pop with pre-filled queue of size=N (N=21) =======================
+    push + pop (c_dheap):  4836860.0 i/s
+    push + pop (findmin):  4467453.9 i/s - 1.08x slower
+    push + pop (bsearch):  3345458.4 i/s - 1.45x slower
+    push + pop (rb_heap):  1560476.3 i/s - 3.10x slower
+
+At higher values of N, `DHeap`'s logarithmic growth leads to only a little
+slowdown of `DHeap#push`, while insert's linear growth causes it to run slower
+and slower. But because `#pop` is `O(1)` for a sorted array and
+`O(d log n / log d)` for a _d_-heap, scenarios involving `#pop` remain
+relatively close even as N increases to 5k:
+
+    == Push/pop with pre-filled queue of size=N (N=5461) =====================
+    push + pop (c_dheap):  2718225.1 i/s
+    push + pop (bsearch):  1793546.4 i/s - 1.52x slower
+    push + pop (rb_heap):   707139.9 i/s - 3.84x slower
+    push + pop (findmin):   122316.0 i/s - 22.22x slower
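+
+The "findmin" contender is about the simplest implementation possible: `O(1)`
+push, with an `O(n)` scan on every pop. That is why it falls so far behind as
+N grows. A simplified sketch (values serve as their own scores):
+
+```ruby
+class FindMinPQ
+  def initialize
+    @entries = []
+  end
+
+  def <<(value)
+    @entries << value # O(1) append, no sorting
+    self
+  end
+
+  def pop
+    return nil if @entries.empty?
+    @entries.delete_at(@entries.index(@entries.min)) # O(n) scan + O(n) memmove
+  end
+end
+```
+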
+Somewhat surprisingly, bsearch + insert still runs faster than a pure ruby heap
+for the repeated push/pop scenario, all the way up to N as high as 87k:
+
+    == push N (N=87381) ======================================================
+    push N (c_dheap):  92.8 i/s
+    push N (rb_heap):  43.5 i/s - 2.13x slower
+    push N (bsearch):   2.9 i/s - 31.70x slower
+
+    == push N then pop N (N=87381) ===========================================
+    push N + pop N (c_dheap):  22.6 i/s
+    push N + pop N (rb_heap):   5.5 i/s - 4.08x slower
+    push N + pop N (bsearch):   2.9 i/s - 7.90x slower
+
+    == Push/pop with pre-filled queue of size=N (N=87381) ====================
+    push + pop (c_dheap):  1815277.3 i/s
+    push + pop (bsearch):   762343.2 i/s - 2.38x slower
+    push + pop (rb_heap):   535913.6 i/s - 3.39x slower
+    push + pop (findmin):     2262.8 i/s - 802.24x slower
+
+## Profiling
+
+_n.b. `Array#fetch` is reading the input data, external to the heap operations.
+These benchmarks use integers for all scores, which enables significantly
+faster comparisons. If `a <=> b` were used instead, then the difference between
+push and pop would be much larger. Also, ruby's `TracePoint` impacts these
+different implementations differently, so we can't use these profiler results
+for comparisons between implementations. A sampling profiler would be needed
+for more accurate relative measurements._
+
+It's informative to look at the `ruby-prof` results for a simple binary search +
+insert implementation, repeatedly pushing to and popping from a large queue. In
+particular, even with 1000 members, the linear `Array#insert` is _still_ faster
+than the logarithmic `Array#bsearch_index`. At this scale, ruby comparisons are
+still (relatively) slow and `memcpy` is (relatively) quite fast!
+
+    %self    total    self    wait   child   calls  name                                            location
+    34.79    2.222   2.222   0.000   0.000 1000000  Array#insert
+    32.59    2.081   2.081   0.000   0.000 1000000  Array#bsearch_index
+    12.84    6.386   0.820   0.000   5.566       1  DHeap::Benchmarks::Scenarios#repeated_push_pop  d_heap/benchmarks.rb:77
+    10.38    4.966   0.663   0.000   4.303 1000000  DHeap::Benchmarks::BinarySearchAndInsert#<<     d_heap/benchmarks/implementations.rb:61
+     5.38    0.468   0.343   0.000   0.125 1000000  DHeap::Benchmarks::BinarySearchAndInsert#pop    d_heap/benchmarks/implementations.rb:70
+     2.06    0.132   0.132   0.000   0.000 1000000  Array#fetch
+     1.95    0.125   0.125   0.000   0.000 1000000  Array#pop
+
+Contrast this with a simplistic pure-ruby implementation of a binary heap:
+
+    %self    total    self    wait   child   calls  name                                            location
+    48.52    8.487   8.118   0.000   0.369 1000000  DHeap::Benchmarks::NaiveBinaryHeap#pop          d_heap/benchmarks/implementations.rb:96
+    42.94    7.310   7.184   0.000   0.126 1000000  DHeap::Benchmarks::NaiveBinaryHeap#<<           d_heap/benchmarks/implementations.rb:80
+     4.80   16.732   0.803   0.000  15.929       1  DHeap::Benchmarks::Scenarios#repeated_push_pop  d_heap/benchmarks.rb:77
+
+You can see that it spends more time in pop than in push. That is expected
+behavior for a heap: although both are O(log n), pop is significantly more
+complex, and has _d_ comparisons per layer.
+
+And `DHeap` shows a similar split between push and pop, although it spends
+half of its time in the benchmark code (which is written in ruby):
+
+    %self    total    self    wait   child   calls  name                                            location
+    43.09    1.685   0.726   0.000   0.959       1  DHeap::Benchmarks::Scenarios#repeated_push_pop  d_heap/benchmarks.rb:77
+    26.05    0.439   0.439   0.000   0.000 1000000  DHeap#<<
+    23.57    0.397   0.397   0.000   0.000 1000000  DHeap#pop
+     7.29    0.123   0.123   0.000   0.000 1000000  Array#fetch
+
### Timers

Additionally, when used to sort timers, we can reasonably assume that:

* New timers usually sort after most existing timers.
* Most timers will be canceled before executing.

@@ -176,25 +376,57 @@ rescheduled before we garbage collect, adjusting its position will usually
be faster than a delete and re-insert.

## Alternative data structures

+As always, you should run benchmarks with your expected scenarios to determine
+which is right for your application.
+
Depending on what you're doing, maintaining a sorted `Array` using
-`#bsearch_index` and `#insert` might be faster! Although it is technically
-O(n) for insertions, the implementations for `memcpy` or `memmove` can be *very*
-fast on modern architectures. Also, it can be faster O(n) on average, if
-insertions are usually near the end of the array. You should run benchmarks
-with your expected scenarios to determine which is right.
+`#bsearch_index` and `#insert` might be just fine! As discussed above, although
+it is `O(n)` for insertions, `memcpy` is so fast on modern hardware that this
+may not matter. Also, if you can arrange for insertions to occur near the end
+of the array, that can reduce the `memcpy` overhead even further.
+
+More complex heap variants, e.g. the [Fibonacci heap], can allow heaps to be
+merged as well as lower amortized time.
+
+[Fibonacci heap]: https://en.wikipedia.org/wiki/Fibonacci_heap
+
If it is important to be able to quickly enumerate the set or find the ranking
-of values in it, then you probably want to use a self-balancing binary search
-tree (e.g. a red-black tree) or a skip-list.
+of values in it, then you may want to use a self-balancing binary search tree
+(e.g. a [red-black tree]) or a [skip-list].
-A Hashed Timing Wheel or Heirarchical Timing Wheels (or some variant in that
-family of data structures) can be constructed to have effectively O(1) running
-time in most cases. However, the implementation for that data structure is more
-complex than a heap. If a 4-ary heap is good enough for go's timers, it should
-be suitable for many use cases.

+[red-black tree]: https://en.wikipedia.org/wiki/Red%E2%80%93black_tree
+[skip-list]: https://en.wikipedia.org/wiki/Skip_list
+
+[Hashed and Hierarchical Timing Wheels][timing wheels] (or some variant in that
+family of data structures) can be constructed to have effectively `O(1)` running
+time in most cases. Although the implementation for that data structure is more
+complex than a heap, it may be necessary for enormous values of N.
+
+[timing wheels]: http://www.cs.columbia.edu/~nahum/w6998/papers/ton97-timing-wheels.pdf
+
+## TODOs...
+
+_TODO:_ Also ~~included is~~ _will include_ `DHeap::Set`, which augments the
+basic heap with an internal `Hash` that maps a set of values to scores;
+loosely inspired by go's timers.
+
+_TODO:_ Also ~~included is~~ _will include_ `DHeap::Lazy`, which contains some
+features that are loosely inspired by go's timers. e.g.: It lazily sifts its
+heap after deletion and adjustments, to achieve faster average runtime for
+*add* and *cancel* operations.
+
+Additionally, I was inspired by reading go's "timer.go" implementation to
+experiment with a 4-ary heap instead of the traditional binary heap. In the
+case of timers, new timers are usually scheduled to run after most of the
+existing timers. And timers are usually canceled before they have a chance to
+run. While a binary heap holds 50% of its elements in its last layer, 75% of a
+4-ary heap's elements have no children. That diminishes the extra comparison
+overhead during sift-down.

## Development

After checking out the repo, run `bin/setup` to install dependencies. Then, run
`rake spec` to run the tests. You can also run `bin/console` for an interactive