README.md in d_heap-0.5.0 vs README.md in d_heap-0.6.0

- old
+ new

@@ -1,28 +1,43 @@ -# DHeap +# DHeap - Fast d-ary heap for ruby
+[![Gem Version](https://badge.fury.io/rb/d_heap.svg)](https://badge.fury.io/rb/d_heap)
+[![Build Status](https://github.com/nevans/d_heap/workflows/CI/badge.svg)](https://github.com/nevans/d_heap/actions?query=workflow%3ACI)
+[![Maintainability](https://api.codeclimate.com/v1/badges/ff274acd0683c99c03e1/maintainability)](https://codeclimate.com/github/nevans/d_heap/maintainability)
+ A fast [_d_-ary heap][d-ary heap] [priority queue] implementation for ruby,
implemented as a C extension.
+From [wikipedia](https://en.wikipedia.org/wiki/Heap_(data_structure)):
+> A heap is a specialized tree-based data structure which is essentially an
+> almost complete tree that satisfies the heap property: in a min heap, for any
+> given node C, if P is a parent node of C, then the key (the value) of P is
+> less than or equal to the key of C. The node at the "top" of the heap (with no
+> parents) is called the root node.
+
+![tree representation of a min heap](images/wikipedia-min-heap.png)
+ With a regular queue, you expect "FIFO" behavior: first in, first out. With a
stack you expect "LIFO": last in, first out. A priority queue has a score for
each element and elements are popped in order by score. Priority queues are
often used in algorithms for e.g. [scheduling] of timers or bandwidth
management, for [Huffman coding], and various graph search algorithms such as
[Dijkstra's algorithm], [A* search], or [Prim's algorithm]. The _d_-ary heap
data structure is a generalization of the [binary heap], in which the nodes
have _d_ children instead of 2. This allows for "insert" and "decrease
priority" operations to be performed more quickly with the tradeoff of
-slower delete minimum. Additionally, _d_-ary heaps can have better memory cache
-behavior than binary heaps, allowing them to run more quickly in practice
-despite slower worst-case time complexity. In the worst case, a _d_-ary heap
-requires only `O(log n / log d)` operations to push, with the tradeoff that pop
-requires `O(d log n / log d)`.
+slower delete minimum or "increase priority". Additionally, _d_-ary heaps can
+have better memory cache behavior than binary heaps, allowing them to run more
+quickly in practice despite slower worst-case time complexity. In the worst
+case, a _d_-ary heap requires only `O(log n / log d)` operations to push, with
+the tradeoff that pop requires `O(d log n / log d)`.
Although you should probably just use the default _d_ value of `4` (see the
-analysis below), it's always advisable to benchmark your specific use-case.
+analysis below), it's always advisable to benchmark your specific use-case. In
+particular, if you push items more often than you pop, higher values for _d_
+can give a faster total runtime.
[d-ary heap]: https://en.wikipedia.org/wiki/D-ary_heap
[priority queue]: https://en.wikipedia.org/wiki/Priority_queue
[binary heap]: https://en.wikipedia.org/wiki/Binary_heap
[scheduling]: https://en.wikipedia.org/wiki/Scheduling_(computing)
@@ -31,105 +46,92 @@
[A* search]: https://en.wikipedia.org/wiki/A*_search_algorithm#Description
[Prim's algorithm]: https://en.wikipedia.org/wiki/Prim%27s_algorithm

## Usage

-Quick reference:
+The basic API is `#push(object, score)` and `#pop`. Please read the
+[gem documentation] for more details and other methods.

+Quick reference for some common methods:
+
* `heap << object` adds a value, with `Float(object)` as its score.
* `heap.push(object, score)` adds a value with an extrinsic score.
* `heap.pop` removes and returns the value with the minimum score.
-* `heap.pop_lte(score)` pops if the minimum score is `<=` the provided score.
+* `heap.pop_lte(max_score)` pops only if the next score is `<=` the argument.
* `heap.peek` to view the minimum value without popping it.
* `heap.clear` to remove all items from the heap.
* `heap.empty?` returns true if the heap is empty.
* `heap.size` returns the number of items in the heap.
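For example, a minimal sketch of the methods above (the symbol value and the
integer scores here are arbitrary placeholders):

```ruby
require "d_heap"

heap = DHeap.new
heap.push :work, 2 # push with an extrinsic score
heap << 3          # a numeric value can act as its own score
heap.peek          # => :work, the value with the minimum score (2)
heap.pop           # => :work, removed from the heap
heap.pop_lte(1)    # => nil, because the next score (3.0) is not <= 1
heap.size          # => 1
heap.clear
heap.empty?        # => true
heap.pop           # => nil
```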
-The basic API is `#push(object, score)` and `pop`. If your values behave as
-their own score, then you can push with `#<<`. If the score changes while the
-object is still in the heap, it will not be re-evaluated again. The score must
-either be `Integer` or `Float` or convertible to a `Float` via `Float(score)`
-(i.e. it should implement `#to_f`).
+If the score changes while the object is still in the heap, it will not be
+re-evaluated again.

-```ruby
-require "d_heap"
+The score must either be `Integer` or `Float` or convertible to a `Float` via
+`Float(score)` (i.e. it should implement `#to_f`). Constraining scores to
+numeric values gives more than 50% speedup under some benchmarks! _n.b._
+`Integer` _scores must have an absolute value that fits into_ `unsigned long
+long`. This is compiler and architecture dependent but with gcc on an IA-64
+system it's 64 bits, which gives a range of -18,446,744,073,709,551,615 to
++18,446,744,073,709,551,615, which is more than enough to store e.g. POSIX time
+in nanoseconds.
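For example, a small, hypothetical illustration of that caveat: the score is
evaluated once, at push time, so later mutation of the object has no effect on
its position in the heap.

```ruby
require "d_heap"

Job = Struct.new(:name, :priority) # a hypothetical value type

heap = DHeap.new
job  = Job.new("compact", 1)
heap.push job, job.priority # the score 1 is copied into the heap here

job.priority = 100 # too late: the heap keeps using the score from push time
heap.pop # => #<struct Job name="compact", priority=100>
```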
+_Comparing arbitrary objects via_ `a <=> b` _was the original design and may be
+added back in a future version,_ if (and only if) _it can be done without
+impacting the speed of numeric comparisons. The speedup from this constraint is
+huge!_
+
+[gem documentation]: https://rubydoc.info/gems/d_heap/DHeap
+
+### Examples
+
+```ruby
+# create some example objects to place in our heap
Task = Struct.new(:id, :time) do
  def to_f; time.to_f end
end
t1 = Task.new(1, Time.now + 5*60)
t2 = Task.new(2, Time.now + 50)
t3 = Task.new(3, Time.now + 60)
t4 = Task.new(4, Time.now + 5)

-# if the object returns its own score via #to_f, "<<" is the simplest API
-heap << t1 << t2
+# create the heap
+require "d_heap"
+heap = DHeap.new

-# or push with an explicit score
-heap.push t3, t4.to_f
-heap.push t4, t4 # score can be implicitly cast with Float
-# peek and pop
+# push with an explicit score (which might be extrinsic to the value)
+heap.push t1, t1.to_f

+# the score will be implicitly cast with Float, so any object with #to_f
+heap.push t2, t2
+
+# if the object has an intrinsic score via #to_f, "<<" is the simplest API
+heap << t3 << t4
+
+# pop returns the lowest scored item, and removes it from the heap
heap.pop # => #<struct Task id=4, time=2021-01-17 17:02:22.5574 -0500>
heap.pop # => #<struct Task id=2, time=2021-01-17 17:03:07.5574 -0500>
+
+# peek returns the lowest scored item, without removing it from the heap
heap.peek # => #<struct Task id=3, time=2021-01-17 17:03:17.5574 -0500>
heap.pop # => #<struct Task id=3, time=2021-01-17 17:03:17.5574 -0500>
-heap.pop # => #<struct Task id=1, time=2021-01-17 17:07:17.5574 -0500>
-heap.empty? # => true
-heap.pop # => nil
-```
-Constraining scores to numeric values gives more than 50% speedup under some
-benchmarks! _n.b._ `Integer` _scores must have an absolute value that fits
-into_ `unsigned long long`. _This is architecture dependent but on an IA-64
-system this is 64 bits, which gives a range of -18,446,744,073,709,551,615 to
-+18446744073709551615. Comparing arbitrary objects via_ `a <=> b` _was the
-original design and may be added back in a future version,_ if (and only if) _it
-can be done without impacting the speed of numeric comparisons._
+# pop_lte handles the common "h.pop if h.peek_score < max" pattern
heap.pop_lte(Time.now + 65) # => nil

-```ruby
-heap.clear
-
-# The score can be derived from the value by using to_f.
-# "a <=> b" is *much* slower than comparing numbers, so it isn't used.
-class Event
-  include Comparable
-  attr_reader :time, :payload
-  alias_method :to_time, :time
-
-  def initialize(time, payload)
-    @time = time.to_time
-    @payload = payload
-    freeze
-  end
-
-  def to_f
-    time.to_f
-  end
-
-  def <=>(other)
-    to_f <=> other.to_f
-  end
-end
-
-heap << comparable_max # sorts last, using <=>
-heap << comparable_min # sorts first, using <=>
-heap << comparable_mid # sorts in the middle, using <=>
-heap.pop # => comparable_min
-heap.pop # => comparable_mid
-heap.pop # => comparable_max
+# the heap size can be inspected with size and empty?
heap.empty? # => false
heap.size # => 1
heap.pop # => #<struct Task id=1, time=2021-01-17 17:07:17.5574 -0500>
heap.empty? # => true
+heap.size # => 0
+
+# popping from an empty heap returns nil
heap.pop # => nil
```
-You can also pass a value into `#pop(max)` which will only pop if the minimum
-score is less than or equal to `max`.
+Please see the [gem documentation] for more methods and more examples.

-Read the [API documentation] for more detailed documentation and examples.
-
-[API documentation]: https://rubydoc.info/gems/d_heap/DHeap
-
## Installation

Add this line to your application's Gemfile:

```ruby
@@ -151,109 +153,79 @@
`Array#insert`. This can be very fast—`Array#pop` is `O(1)`—but the worst-case
for insert is `O(n)` because it may need to `memcpy` a significant portion of
the array.
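By way of illustration, a naive sorted-array priority queue along these lines
might look like the following sketch (hypothetical code, only loosely modeled
on the gem's benchmark helpers in `d_heap/benchmarks/implementations.rb`):

```ruby
# A naive priority queue around a sorted Array, kept in *descending* score
# order so that the minimum is at the end and #pop is Array#pop, i.e. O(1).
class SortedArrayPQ
  def initialize
    @items = [] # [value, score] pairs, descending by score
  end

  def push(value, score)
    index = @items.bsearch_index { |(_, s)| s <= score } || @items.size
    @items.insert(index, [value, score]) # O(n) memcpy in the worst case
    self
  end

  def pop
    pair = @items.pop # the minimum score lives at the end
    pair && pair.first
  end
end

queue = SortedArrayPQ.new
queue.push(:b, 2).push(:a, 1).push(:c, 3)
queue.pop # => :a
```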
The standard way to implement a priority queue is with a binary heap. Although
-this increases the time for `pop`, it converts the amortized time per push + pop
-from `O(n)` to `O(d log n / log d)`.
+this increases the time complexity for `pop` alone, it reduces the combined time
+complexity for `push` + `pop`. Using a d-ary heap with d > 2
+makes the tree shorter but broader, which reduces sift-up to `O(log n / log d)`
+while increasing the comparisons needed by sift-down to `O(d log n / log d)`.

-However, I was surprised to find that—at least for some benchmarks—my pure ruby
-heap implementation was much slower than inserting into and popping from a fully
-sorted array. The reasons for this surprising result: Although it is `O(n)`,
-`memcpy` has a _very_ small constant factor, and calling `<=>` from ruby code
-has relatively _much_ larger constant factors. If your queue contains only a
-few thousand items, the overhead of those extra calls to `<=>` is _far_ more
-than occasionally calling `memcpy`. In the worst case, a _d_-heap will require
-`d + 1` times more comparisons for each push + pop than a `bsearch` + `insert`
-sorted array.
+However, I was disappointed when my best ruby heap implementation ran much more
+slowly than the naive approach—even for heaps containing ten thousand items.
+Although it _is_ `O(n)`, `memcpy` is _very_ fast, while calling `<=>` from ruby
+has _much_ higher overhead. And a _d_-heap needs `d + 1` times more comparisons
+for each push + pop than `bsearch` + `insert`.

-Moving the sift-up and sift-down code into C helps some. But much more helpful
-is optimizing the comparison of numeric scores, so `a <=> b` never needs to be
-called. I'm hopeful that MJIT will eventually obsolete this C-extension. This
-can be hotspot code, and a basic ruby implementation could perform well if `<=>`
-had much lower overhead.
+Additionally, when researching how other systems handle their scheduling, I was
+inspired by reading go's "timer.go" implementation to experiment with a 4-ary
+heap instead of the traditional binary heap.

-## Analysis
+## Benchmarks

-### Time complexity
+_See `bin/benchmarks` and `docs/benchmarks.txt`, as well as `bin/profile` and
+`docs/profile.txt` for much more detail or updated results. These benchmarks
+were measured with v0.5.0 and ruby 2.7.2 without MJIT enabled._

-There are two fundamental heap operations: sift-up (used by push) and sift-down
-(used by pop).
+These benchmarks use very simple implementations for a pure-ruby heap and an
+array that is kept sorted using `Array#bsearch_index` and `Array#insert`. For
+comparison, I also compare to the [priority_queue_cxx gem] which uses the [C++
+STL priority_queue], and another naive implementation that uses `Array#min` and
+`Array#delete_at` with an unsorted array.

-* Both sift operations can perform as many as `log n / log d` swaps, as the
- element may sift from the bottom of the tree to the top, or vice versa.
-* Sift-up performs a single comparison per swap: `O(1)`.
- So pushing a new element is `O(log n / log d)`.
-* Swap down performs as many as d comparisons per swap: `O(d)`.
- So popping the min element is `O(d log n / log d)`.
+In these benchmarks, `DHeap` runs faster than all other implementations for
+every scenario and every value of N, although the difference is usually more
+noticeable at higher values of N. The pure ruby heap implementation is
+competitive for `push` alone at every value of N, but is significantly slower
+than bsearch + insert for push + pop, until N is _very_ large (somewhere between
+10k and 100k)!

-Assuming every inserted element is eventually deleted from the root, d=4
-requires the fewest comparisons for combined insert and delete:
+[priority_queue_cxx gem]: https://rubygems.org/gems/priority_queue_cxx
+[C++ STL priority_queue]: http://www.cplusplus.com/reference/queue/priority_queue/

-* (1 + 2) lg 2 = 4.328085
-* (1 + 3) lg 3 = 3.640957
-* (1 + 4) lg 4 = 3.606738
-* (1 + 5) lg 5 = 3.728010
-* (1 + 6) lg 6 = 3.906774
-* etc...
+Three different scenarios are measured:

-Leaf nodes require no comparisons to shift down, and higher values for d have
-higher percentage of leaf nodes:
+### push N items onto an empty heap

-* d=2 has ~50% leaves,
-* d=3 has ~67% leaves,
-* d=4 has ~75% leaves,
-* and so on...
+...but never pop (clearing between each set of pushes).

-See https://en.wikipedia.org/wiki/D-ary_heap#Analysis for deeper analysis.
+![bar graph for push_n benchmarks](./images/push_n.png)

-### Space complexity
+### push N items onto an empty heap then pop all N

-Space usage is linear, regardless of d. However higher d values may
-provide better cache locality. Because the heap is a complete binary tree, the
-elements can be stored in an array, without the need for tree or list pointers.
+Although this could be used for heap sort, we're unlikely to choose heap sort
+over Ruby's quick sort implementation. I'm using this scenario to represent
+the amortized cost of creating a heap and (eventually) draining it.

-Ruby can compare Numeric values _much_ faster than other ruby objects, even if
-those objects simply delegate comparison to internal Numeric values. And it is
-often useful to use external scores for otherwise uncomparable values. So
-`DHeap` uses twice as many entries (one for score and one for value)
-as an array which only stores values.
+![bar graph for push_n_pop_n benchmarks](./images/push_n_pop_n.png)

-## Benchmarks
+### push and pop on a heap with N values

-_See `bin/benchmarks` and `docs/benchmarks.txt`, as well as `bin/profile` and
-`docs/profile.txt` for more details or updated results. These benchmarks were
-measured with v0.5.0 and ruby 2.7.2 without MJIT enabled._
+Repeatedly push and pop while keeping a stable heap size. This is a _very
+simplistic_ approximation for how most scheduler/timer heaps might be used.
+Usually when a timer fires it will be quickly replaced by a new timer, and the
+overall count of timers will remain roughly stable.

-These benchmarks use very simple implementations for a pure-ruby heap and an
-array that is kept sorted using `Array#bsearch_index` and `Array#insert`. For
-comparison, an alternate implementation `Array#min` and `Array#delete_at` is
-also shown.
+![bar graph for push_pop benchmarks](./images/push_pop.png)

-Three different scenarios are measured:
- * push N values but never pop (clearing between each set of pushes).
- * push N values and then pop N values.
- Although this could be used for heap sort, we're unlikely to choose heap sort
- over Ruby's quick sort implementation. I'm using this scenario to represent
- the amortized cost of creating a heap and (eventually) draining it.
- * For a heap of size N, repeatedly push and pop while keeping a stable size.
- This is a _very simple_ approximation for how most scheduler/timer heaps
- would be used. Usually when a timer fires it will be quickly replaced by a
- new timer, and the overall count of timers will remain roughly stable.
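In pseudo-benchmark form, the stable-size scenario is roughly the following
loop (a simplification; the real code is in
`DHeap::Benchmarks::Scenarios#repeated_push_pop`):

```ruby
require "d_heap"

# fill the heap to a stable size N...
N = 10_000
heap = DHeap.new
N.times { heap << rand(1_000_000) }

# ...then repeatedly pop one item and push a replacement, the way a
# scheduler/timer heap replaces each fired timer with a new one
100_000.times do
  heap.pop
  heap << rand(1_000_000)
end
```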
+### numbers

-In these benchmarks, `DHeap` runs faster than all other implementations for
-every scenario and every value of N, although the difference is much more
-noticeable at higher values of N. The pure ruby heap implementation is
-competitive for `push` alone at every value of N, but is significantly slower
-than bsearch + insert for push + pop until N is _very_ large (somewhere between
-10k and 100k)!
+Even for very small N values, `DHeap` runs faster than the other
+implementations for each scenario, although the difference is
+still relatively small. The pure ruby binary heap is 2x or more slower than
+bsearch + insert for the common push/pop scenario.

-For very small N values, `DHeap` runs faster than
-the other implementations for each scenario, although the difference is still
-relatively small. The pure ruby binary heap is 2x or more slower than bsearch +
-insert for the common push/pop scenario.

== push N (N=5) ==========================================================
push N (c_dheap): 1969700.7 i/s
push N (c++ stl): 1049738.1 i/s - 1.88x slower
push N (rb_heap): 928435.2 i/s - 2.12x slower
push N (bsearch): 921060.0 i/s - 2.14x slower
@@ -339,82 +311,72 @@
queue size = 1000000: 4403494.1 i/s - 1.66x slower
queue size = 2500000: 3304088.2 i/s - 2.21x slower
queue size = 5000000: 2664897.7 i/s - 2.74x slower
queue size = 10000000: 2137927.6 i/s - 3.42x slower

-## Profiling
+## Analysis

-_n.b. `Array#fetch` is reading the input data, external to heap operations.
-These benchmarks use integers for all scores, which enables significantly faster
-comparisons. If `a <=> b` were used instead, then the difference between push
-and pop would be much larger. And ruby's `Tracepoint` impacts these different
-implementations differently. So we can't use these profiler results for
-comparisons between implementations. A sampling profiler would be needed for
-more accurate relative measurements._
+### Time complexity

-It's informative to look at the `ruby-prof` results for a simple binary search +
-insert implementation, repeatedly pushing and popping to a large heap. In
-particular, even with 1000 members, the linear `Array#insert` is _still_ faster
-than the logarithmic `Array#bsearch_index`. At this scale, ruby comparisons are
-still (relatively) slow and `memcpy` is (relatively) quite fast!
+There are two fundamental heap operations: sift-up (used by push) and sift-down
+(used by pop).

- %self total self wait child calls name location
- 34.79 2.222 2.222 0.000 0.000 1000000 Array#insert
- 32.59 2.081 2.081 0.000 0.000 1000000 Array#bsearch_index
- 12.84 6.386 0.820 0.000 5.566 1 DHeap::Benchmarks::Scenarios#repeated_push_pop d_heap/benchmarks.rb:77
- 10.38 4.966 0.663 0.000 4.303 1000000 DHeap::Benchmarks::BinarySearchAndInsert#<< d_heap/benchmarks/implementations.rb:61
- 5.38 0.468 0.343 0.000 0.125 1000000 DHeap::Benchmarks::BinarySearchAndInsert#pop d_heap/benchmarks/implementations.rb:70
- 2.06 0.132 0.132 0.000 0.000 1000000 Array#fetch
- 1.95 0.125 0.125 0.000 0.000 1000000 Array#pop

+* A _d_-ary heap will have `log n / log d` layers, so both sift operations can
+ perform as many as `log n / log d` writes, when a member sifts the entire
+ length of the tree.
+* Sift-up makes one comparison per layer, so push runs in `O(log n / log d)`.
+* Sift-down makes d comparisons per layer, so pop runs in `O(d log n / log d)`.
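To make those two operations concrete, here is a simplified pure-ruby sketch of
a _d_-ary min-heap (values act as their own scores here; the gem's real
implementation is written in C and is not structured like this):

```ruby
# A simplified d-ary min-heap; elements act as their own scores.
class DaryHeap
  def initialize(d = 4)
    @d = d
    @array = []
  end

  def push(value)
    @array << value
    sift_up(@array.size - 1)
    self
  end
  alias << push

  def pop
    return nil if @array.empty?
    min  = @array.first
    last = @array.pop
    unless @array.empty?
      @array[0] = last
      sift_down(0)
    end
    min
  end

  private

  # sift-up: one comparison per layer => O(log n / log d)
  def sift_up(i)
    while i.positive?
      parent = (i - 1) / @d
      break if @array[parent] <= @array[i]
      @array[parent], @array[i] = @array[i], @array[parent]
      i = parent
    end
  end

  # sift-down: up to d comparisons per layer => O(d log n / log d)
  def sift_down(i)
    loop do
      first = @d * i + 1
      return if first >= @array.size # no children; i is a leaf
      last = [first + @d - 1, @array.size - 1].min
      min_child = (first..last).min_by { |c| @array[c] }
      return if @array[i] <= @array[min_child]
      @array[i], @array[min_child] = @array[min_child], @array[i]
      i = min_child
    end
  end
end

heap = DaryHeap.new(4)
[5, 3, 8, 1].each { |n| heap << n }
heap.pop # => 1
```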
-Contrast this with a simplistic pure-ruby implementation of a binary heap:
+So, in the simplest case of running balanced push/pop while maintaining the same
+heap size, `(1 + d) log n / log d` comparisons are made. In the worst case,
+when every sift traverses every layer of the tree, `d=4` requires the fewest
+comparisons for combined insert and delete:

- %self total self wait child calls name location
- 48.52 8.487 8.118 0.000 0.369 1000000 DHeap::Benchmarks::NaiveBinaryHeap#pop d_heap/benchmarks/implementations.rb:96
- 42.94 7.310 7.184 0.000 0.126 1000000 DHeap::Benchmarks::NaiveBinaryHeap#<< d_heap/benchmarks/implementations.rb:80
- 4.80 16.732 0.803 0.000 15.929 1 DHeap::Benchmarks::Scenarios#repeated_push_pop d_heap/benchmarks.rb:77

+* (1 + 2) lg n / lg 2 ≈ 4.328085 lg n
+* (1 + 3) lg n / lg 3 ≈ 3.640957 lg n
+* (1 + 4) lg n / lg 4 ≈ 3.606738 lg n
+* (1 + 5) lg n / lg 5 ≈ 3.728010 lg n
+* (1 + 6) lg n / lg 6 ≈ 3.906774 lg n
+* (1 + 7) lg n / lg 7 ≈ 4.111187 lg n
+* (1 + 8) lg n / lg 8 ≈ 4.328085 lg n
+* (1 + 9) lg n / lg 9 ≈ 4.551196 lg n
+* (1 + 10) lg n / lg 10 ≈ 4.777239 lg n
+* etc...

-You can see that it spends more time in pop than it does in push. That
-is expected behavior for a heap: although both are O(log n), pop is
-significantly more complex, and has _d_ comparisons per layer.

See https://en.wikipedia.org/wiki/D-ary_heap#Analysis for deeper analysis.
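Those constants are easy to recompute (a quick sanity-check script; `lg` above
is the natural log, and only the ratios between rows matter):

```ruby
# comparisons per push + pop, relative to lg n: (1 + d) / ln(d)
(2..10).each do |d|
  puts format("(1 + %d) lg n / lg %d ≈ %.6f lg n", d, d, (1 + d) / Math.log(d))
end
# the ratio is minimized at d=4, with ≈ 3.606738 lg n
```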
-And `DHeap` shows a similar comparison between push and pop, although it spends -half of its time in the benchmark code (which is written in ruby): +### Space complexity - %self total self wait child calls name location - 43.09 1.685 0.726 0.000 0.959 1 DHeap::Benchmarks::Scenarios#repeated_push_pop d_heap/benchmarks.rb:77 - 26.05 0.439 0.439 0.000 0.000 1000000 DHeap#<< - 23.57 0.397 0.397 0.000 0.000 1000000 DHeap#pop - 7.29 0.123 0.123 0.000 0.000 1000000 Array#fetch +Space usage is linear, regardless of d. However higher d values may +provide better cache locality. Because the heap is a complete binary tree, the +elements can be stored in an array, without the need for tree or list pointers. -### Timers +Ruby can compare Numeric values _much_ faster than other ruby objects, even if +those objects simply delegate comparison to internal Numeric values. And it is +often useful to use external scores for otherwise uncomparable values. So +`DHeap` uses twice as many entries (one for score and one for value) +as an array which only stores values. -Additionally, when used to sort timers, we can reasonably assume that: - * New timers usually sort after most existing timers. - * Most timers will be canceled before executing. - * Canceled timers usually sort after most existing timers. +## Thread safety -So, if we are able to delete an item without searching for it, by keeping a map -of positions within the heap, most timers can be inserted and deleted in O(1) -time. Canceling a non-leaf timer can be further optimized by marking it as -canceled without immediately removing it from the heap. If the timer is -rescheduled before we garbage collect, adjusting its position will usually be -faster than a delete and re-insert. +`DHeap` is _not_ thread-safe, so concurrent access from multiple threads need to +take precautions such as locking access behind a mutex. ## Alternative data structures As always, you should run benchmarks with your expected scenarios to determine -which is right. +which is best for your application. -Depending on what you're doing, maintaining a sorted `Array` using -`#bsearch_index` and `#insert` might be just fine! As discussed above, although -it is `O(n)` for insertions, `memcpy` is so fast on modern hardware that this -may not matter. Also, if you can arrange for insertions to occur near the end -of the array, that could significantly reduce the `memcpy` overhead even more. +Depending on your use-case, maintaining a sorted `Array` using `#bsearch_index` +and `#insert` might be just fine! Even `min` plus `delete` with an unsorted +array can be very fast on small queues. Although insertions run with `O(n)`, +`memcpy` is so fast on modern hardware that your dataset might not be large +enough for it to matter. -More complex heap varients, e.g. [Fibonacci heap], can allow heaps to be merged -as well as lower amortized time. +More complex heap varients, e.g. [Fibonacci heap], allow heaps to be split and +merged which gives some graph algorithms a lower amortized time complexity. But +in practice, _d_-ary heaps have much lower overhead and often run faster. [Fibonacci heap]: https://en.wikipedia.org/wiki/Fibonacci_heap If it is important to be able to quickly enumerate the set or find the ranking of values in it, then you may want to use a self-balancing binary search tree @@ -430,27 +392,19 @@ [timing wheels]: http://www.cs.columbia.edu/~nahum/w6998/papers/ton97-timing-wheels.pdf ## TODOs... 
## Alternative data structures

As always, you should run benchmarks with your expected scenarios to determine
-which is right.
+which is best for your application.

-Depending on what you're doing, maintaining a sorted `Array` using
-`#bsearch_index` and `#insert` might be just fine! As discussed above, although
-it is `O(n)` for insertions, `memcpy` is so fast on modern hardware that this
-may not matter. Also, if you can arrange for insertions to occur near the end
-of the array, that could significantly reduce the `memcpy` overhead even more.
+Depending on your use-case, maintaining a sorted `Array` using `#bsearch_index`
+and `#insert` might be just fine! Even `min` plus `delete` with an unsorted
+array can be very fast on small queues. Although insertions run with `O(n)`,
+`memcpy` is so fast on modern hardware that your dataset might not be large
+enough for it to matter.

-More complex heap variants, e.g. [Fibonacci heap], can allow heaps to be merged
-as well as lower amortized time.
+More complex heap variants, e.g. [Fibonacci heap], allow heaps to be split and
+merged, which gives some graph algorithms a lower amortized time complexity. But
+in practice, _d_-ary heaps have much lower overhead and often run faster.

[Fibonacci heap]: https://en.wikipedia.org/wiki/Fibonacci_heap

If it is important to be able to quickly enumerate the set or find the ranking
of values in it, then you may want to use a self-balancing binary search tree
@@ -430,27 +392,19 @@
[timing wheels]: http://www.cs.columbia.edu/~nahum/w6998/papers/ton97-timing-wheels.pdf

## TODOs...

-_TODO:_ Also ~~included is~~ _will include_ `DHeap::Set`, which augments the
-basic heap with an internal `Hash`, which maps a set of values to scores.
-loosely inspired by go's timers. e.g: It lazily sifts its heap after deletion
-and adjustments, to achieve faster average runtime for *add* and *cancel*
-operations.
+_TODO:_ Also ~~included is~~ _will include_ `DHeap::Map`, which augments the
+basic heap with an internal `Hash` that maps objects to their position in the
+heap. This enforces a uniqueness constraint on items on the heap, and also
+allows items to be more efficiently deleted or adjusted. However, maintaining
+the hash does lead to a small drop in normal `#push` and `#pop` performance.

_TODO:_ Also ~~included is~~ _will include_ `DHeap::Lazy`, which contains some
features that are loosely inspired by go's timers. e.g. it lazily sifts its
heap after deletion and adjustments, to achieve faster average runtime for *add*
and *cancel* operations.

-Additionally, I was inspired by reading go's "timer.go" implementation to
-experiment with a 4-ary heap instead of the traditional binary heap. In the
-case of timers, new timers are usually scheduled to run after most of the
-existing timers. And timers are usually canceled before they have a chance to
-run. While a binary heap holds 50% of its elements in its last layer, 75% of a
-4-ary heap will have no children. That diminishes the extra comparison overhead
-during sift-down.

## Development

After checking out the repo, run `bin/setup` to install dependencies. Then, run
`rake spec` to run the tests. You can also run `bin/console` for an interactive