README.md in d_heap-0.4.0 vs README.md in d_heap-0.5.0

- old
+ new

@@ -2,24 +2,24 @@
 A fast [_d_-ary heap][d-ary heap] [priority queue] implementation for ruby,
 implemented as a C extension.

 With a regular queue, you expect "FIFO" behavior: first in, first out.  With a
-stack you expect "LIFO": last in first out.  With a priority queue, you push
-elements along with a score and the lowest scored element is the first to be
-popped.  Priority queues are often used in algorithms for e.g. [scheduling] of
-timers or bandwidth management, [Huffman coding], and various graph search
-algorithms such as [Dijkstra's algorithm], [A* search], or [Prim's algorithm].
+stack you expect "LIFO": last in first out.  A priority queue has a score for
+each element and elements are popped in order by score.  Priority queues are
+often used in algorithms for e.g. [scheduling] of timers or bandwidth
+management, for [Huffman coding], and various graph search algorithms such as
+[Dijkstra's algorithm], [A* search], or [Prim's algorithm].

 The _d_-ary heap data structure is a generalization of the [binary heap], in
-which the nodes have _d_ children instead of 2.  This allows for "decrease
-priority" operations to be performed more quickly with the tradeoff of slower
-delete minimum.  Additionally, _d_-ary heaps can have better memory cache
+which the nodes have _d_ children instead of 2.  This allows for "insert" and
+"decrease priority" operations to be performed more quickly with the tradeoff of
+slower delete minimum.  Additionally, _d_-ary heaps can have better memory cache
 behavior than binary heaps, allowing them to run more quickly in practice
 despite slower worst-case time complexity.  In the worst case, a _d_-ary heap
-requires only `O(log n / log d)` to push, with the tradeoff that pop is `O(d log
-n / log d)`.
+requires only `O(log n / log d)` operations to push, with the tradeoff that pop
+requires `O(d log n / log d)`.
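To make the structure described above concrete, here is a minimal pure-Ruby sketch of a _d_-ary min-heap; the class name `DAryMinHeap` and its API are illustrative only, not the gem's implementation:

```ruby
# Illustrative d-ary min-heap: nodes live in a flat array, where the children
# of index i are at indexes (d*i + 1) through (d*i + d).
class DAryMinHeap
  def initialize(d = 4)
    @d = d
    @heap = [] # [score, value] pairs
  end

  def push(value, score)
    @heap << [score, value]
    sift_up(@heap.size - 1) # O(log n / log d): one comparison per level
    self
  end

  def peek
    @heap.first && @heap.first[1]
  end

  def pop
    return nil if @heap.empty?
    min = @heap.first
    last = @heap.pop
    unless @heap.empty?
      @heap[0] = last
      sift_down(0) # O(d log n / log d): up to d comparisons per level
    end
    min[1]
  end

  private

  # Bubble the item at index up while it is smaller than its parent.
  def sift_up(index)
    while index > 0
      parent = (index - 1) / @d
      break unless @heap[index][0] < @heap[parent][0]
      @heap[index], @heap[parent] = @heap[parent], @heap[index]
      index = parent
    end
  end

  # Push the item at index down, swapping with its smallest child.
  def sift_down(index)
    loop do
      first_child = @d * index + 1
      break if first_child >= @heap.size
      last_child = [first_child + @d - 1, @heap.size - 1].min
      smallest = (first_child..last_child).min_by { |i| @heap[i][0] }
      break unless @heap[smallest][0] < @heap[index][0]
      @heap[index], @heap[smallest] = @heap[smallest], @heap[index]
      index = smallest
    end
  end
end

heap = DAryMinHeap.new(4)
heap.push(:b, 2).push(:a, 1).push(:c, 3)
heap.pop # => :a
```

Raising _d_ makes the sift-up path shallower (faster push) while sift-down must examine more children per level (slower pop), which is exactly the tradeoff described above.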
 Although you should probably just use the default _d_ value of `4` (see the
 analysis below), it's always advisable to benchmark your specific use-case.

 [d-ary heap]: https://en.wikipedia.org/wiki/D-ary_heap
@@ -31,57 +31,62 @@
 [A* search]: https://en.wikipedia.org/wiki/A*_search_algorithm#Description
 [Prim's algorithm]: https://en.wikipedia.org/wiki/Prim%27s_algorithm

 ## Usage

-The basic API is:
-* `heap << object` adds a value as its own score.
-* `heap.push(score, value)` adds a value with an extrinsic score.
+Quick reference:
+
+* `heap << object` adds a value, with `Float(object)` as its score.
+* `heap.push(object, score)` adds a value with an extrinsic score.
 * `heap.pop` removes and returns the value with the minimum score.
 * `heap.pop_lte(score)` pops if the minimum score is `<=` the provided score.
 * `heap.peek` to view the minimum value without popping it.
+* `heap.clear` to remove all items from the heap.
+* `heap.empty?` returns true if the heap is empty.
+* `heap.size` returns the number of items in the heap.

-The score must be `Integer` or `Float` or convertable to a `Float` via
-`Float(score)` (i.e. it should implement `#to_f`).  Constraining scores to
-numeric values gives a 40+% speedup under some benchmarks!
+The basic API is `#push(object, score)` and `#pop`.  If your values behave as
+their own score, then you can push with `#<<`.  If the score changes while the
+object is still in the heap, it will not be re-evaluated again.  The score must
+either be `Integer` or `Float` or convertible to a `Float` via `Float(score)`
+(i.e. it should implement `#to_f`).

-_n.b._ `Integer` _scores must have an absolute value that fits into_ `unsigned
-long long`.  _This is architecture dependant but on an IA-64 system this is 64
-bits, which gives a range of -18,446,744,073,709,551,615 to
-+18446744073709551615._
-
-_Comparing arbitary objects via_ `a <=> b` _was the original design and may
-be added back in a future version,_ if (and only if) _it can be done without
-impacting the speed of numeric comparisons._
-
 ```ruby
 require "d_heap"

-Task = Struct.new(:id) # for demonstration
+Task = Struct.new(:id, :time) do
+  def to_f; time.to_f end
+end
+t1 = Task.new(1, Time.now + 5*60)
+t2 = Task.new(2, Time.now + 50)
+t3 = Task.new(3, Time.now + 60)
+t4 = Task.new(4, Time.now + 5)

 heap = DHeap.new # defaults to a 4-heap

-# storing [score, value] tuples
-heap.push Time.now + 5*60, Task.new(1)
-heap.push Time.now + 30, Task.new(2)
-heap.push Time.now + 60, Task.new(3)
-heap.push Time.now + 5, Task.new(4)
+# if the object returns its own score via #to_f, "<<" is the simplest API
+heap << t1 << t2
+
+# or push with an explicit score
+heap.push t3, t3.to_f
+heap.push t4, t4 # score can be implicitly cast with Float

-# peeking and popping (using last to get the task and ignore the time)
-heap.pop # => Task[4]
-heap.pop # => Task[2]
-heap.peek # => Task[3], but don't pop it from the heap
-heap.pop # => Task[3]
-heap.pop # => Task[1]
+# peek and pop
+heap.pop    # => #<struct Task id=4, time=2021-01-17 17:02:22.5574 -0500>
+heap.pop    # => #<struct Task id=2, time=2021-01-17 17:03:07.5574 -0500>
+heap.peek   # => #<struct Task id=3, time=2021-01-17 17:03:17.5574 -0500>
+heap.pop    # => #<struct Task id=3, time=2021-01-17 17:03:17.5574 -0500>
+heap.pop    # => #<struct Task id=1, time=2021-01-17 17:07:17.5574 -0500>
 heap.empty? # => true
 heap.pop # => nil
 ```

-If your values behave as their own score, by being convertible via
-`Float(value)`, then you can use `#<<` for implicit scoring.  The score should
-not change for as long as the value remains in the heap, since it will not be
-re-evaluated after being pushed.
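The score contract described here can be sketched independently of the gem; `normalize_score` below is an illustrative helper, not part of the `DHeap` API:

```ruby
# Integer and Float scores are used directly; anything else must be
# convertible via Float(score), i.e. it should implement #to_f.
def normalize_score(score)
  return score if score.is_a?(Integer) || score.is_a?(Float)
  Float(score) # raises TypeError if score cannot convert to a Float
end

normalize_score(42)             # => 42
normalize_score(Rational(1, 2)) # => 0.5
```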
+Constraining scores to numeric values gives more than 50% speedup under some
+benchmarks!  _n.b._ `Integer` _scores must have an absolute value that fits
+into_ `unsigned long long`.  _This is architecture dependent but on an IA-64
+system this is 64 bits, which gives a range of -18,446,744,073,709,551,615 to
++18,446,744,073,709,551,615.  Comparing arbitrary objects via_ `a <=> b` _was
+the original design and may be added back in a future version,_ if (and only if)
+_it can be done without impacting the speed of numeric comparisons._

 ```ruby
 heap.clear

 # The score can be derived from the value by using to_f.
@@ -140,34 +145,34 @@
 $ gem install d_heap

 ## Motivation

 One naive approach to a priority queue is to maintain an array in sorted order.
-This can be very simply implemented using `Array#bseach_index` + `Array#insert`.
-This can be very fast—`Array#pop` is `O(1)`—but the worst-case for insert is
-`O(n)` because it may need to `memcpy` a significant portion of the array.
+This can be very simply implemented in ruby with `Array#bsearch_index` +
+`Array#insert`.  This can be very fast—`Array#pop` is `O(1)`—but the worst-case
+for insert is `O(n)` because it may need to `memcpy` a significant portion of
+the array.

 The standard way to implement a priority queue is with a binary heap.  Although
 this increases the time for `pop`, it converts the amortized time per push + pop
 from `O(n)` to `O(d log n / log d)`.

 However, I was surprised to find that—at least for some benchmarks—my pure ruby
 heap implementation was much slower than inserting into and popping from a fully
-sorted array.  The reason for this surprising result: Although it is `O(n)`,
+sorted array.  The reasons for this surprising result: Although it is `O(n)`,
 `memcpy` has a _very_ small constant factor, and calling `<=>` from ruby code
 has relatively _much_ larger constant factors.
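The naive sorted-array approach can be sketched in a few lines of plain ruby; the `SortedArrayPQ` name is illustrative only (the gem's benchmarks use their own implementation):

```ruby
# A minimal min-priority queue kept as an array sorted by descending score,
# so the minimum is always last and #pop maps to Array#pop, which is O(1).
class SortedArrayPQ
  def initialize
    @array = [] # [score, value] pairs, sorted by descending score
  end

  # Array#bsearch_index finds the slot in O(log n), but Array#insert is
  # O(n) worst case because it may memmove a large portion of the array.
  def push(value, score)
    index = @array.bsearch_index { |(s, _)| s <= score } || @array.size
    @array.insert(index, [score, value])
    self
  end

  # O(1): the minimum-scored pair is always the last element.
  def pop
    pair = @array.pop
    pair && pair.last
  end
end

pq = SortedArrayPQ.new
pq.push(:b, 2.0).push(:a, 1.0).push(:c, 3.0)
pq.pop # => :a
```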
 If your queue contains only a few thousand items, the overhead of those extra
 calls to `<=>` is _far_ more than occasionally calling `memcpy`.  In the worst
 case, a _d_-heap will require `d + 1` times more comparisons for each push + pop
 than a `bsearch` + `insert` sorted array.

 Moving the sift-up and sift-down code into C helps some.  But much more helpful
 is optimizing the comparison of numeric scores, so `a <=> b` never needs to be
-called.  I'm hopeful that MJIT will eventually obsolete this C-extension.  JRuby
-or TruffleRuby may already run the pure ruby version at high speed.  This can be
-hotspot code, and the basic ruby implementation should perform well if not for
-the high overhead of `<=>`.
+called.  I'm hopeful that MJIT will eventually obsolete this C-extension.  This
+can be hotspot code, and a basic ruby implementation could perform well if `<=>`
+had much lower overhead.

 ## Analysis

 ### Time complexity
@@ -215,11 +220,11 @@
 ## Benchmarks

 _See `bin/benchmarks` and `docs/benchmarks.txt`, as well as `bin/profile` and
 `docs/profile.txt` for more details or updated results.  These benchmarks were
-measured with v0.4.0 and ruby 2.7.2 without MJIT enabled._
+measured with v0.5.0 and ruby 2.7.2 without MJIT enabled._

 These benchmarks use very simple implementations for a pure-ruby heap and an
 array that is kept sorted using `Array#bsearch_index` and `Array#insert`.  For
 comparison, an alternate implementation using `Array#min` and `Array#delete_at`
 is also shown.
@@ -246,77 +251,96 @@
 the other implementations for each scenario, although the difference is still
 relatively small.  The pure ruby binary heap is 2x or more slower than bsearch +
 insert for the common push/pop scenario.
     == push N (N=5) ==========================================================
-    push N (c_dheap):  1701338.1 i/s
-    push N (rb_heap):   971614.1 i/s - 1.75x slower
-    push N (bsearch):   946363.7 i/s - 1.80x slower
+    push N (c_dheap):  1969700.7 i/s
+    push N (c++ stl):  1049738.1 i/s - 1.88x slower
+    push N (rb_heap):   928435.2 i/s - 2.12x slower
+    push N (bsearch):   921060.0 i/s - 2.14x slower

     == push N then pop N (N=5) ===============================================
-    push N + pop N (c_dheap):  1087944.8 i/s
-    push N + pop N (findmin):   841708.1 i/s - 1.29x slower
-    push N + pop N (bsearch):   773252.7 i/s - 1.41x slower
-    push N + pop N (rb_heap):   471852.9 i/s - 2.31x slower
+    push N + pop N (c_dheap):  1375805.0 i/s
+    push N + pop N (c++ stl):  1134997.5 i/s - 1.21x slower
+    push N + pop N (findmin):   862913.1 i/s - 1.59x slower
+    push N + pop N (bsearch):   762887.1 i/s - 1.80x slower
+    push N + pop N (rb_heap):   506890.4 i/s - 2.71x slower

     == Push/pop with pre-filled queue of size=N (N=5) ========================
-    push + pop (c_dheap):  5525418.8 i/s
-    push + pop (findmin):  5003904.8 i/s - 1.10x slower
-    push + pop (bsearch):  4320581.8 i/s - 1.28x slower
-    push + pop (rb_heap):  2207042.0 i/s - 2.50x slower
+    push + pop (c_dheap):  9044435.5 i/s
+    push + pop (c++ stl):  7534583.4 i/s - 1.20x slower
+    push + pop (findmin):  5026155.1 i/s - 1.80x slower
+    push + pop (bsearch):  4300260.0 i/s - 2.10x slower
+    push + pop (rb_heap):  2299499.7 i/s - 3.93x slower

 By N=21, `DHeap` has pulled significantly ahead of bsearch + insert for all
 scenarios, but the pure ruby heap is still slower than every other
 implementation—even resorting the array after every `#push`—in any scenario that
 uses `#pop`.
     == push N (N=21) =========================================================
-    push N (c_dheap):   408307.0 i/s
-    push N (rb_heap):   212275.2 i/s - 1.92x slower
-    push N (bsearch):   169583.2 i/s - 2.41x slower
+    push N (c_dheap):   464231.4 i/s
+    push N (c++ stl):   305546.7 i/s - 1.52x slower
+    push N (rb_heap):   202803.7 i/s - 2.29x slower
+    push N (bsearch):   168678.7 i/s - 2.75x slower

     == push N then pop N (N=21) ==============================================
-    push N + pop N (c_dheap):   199435.5 i/s
-    push N + pop N (findmin):   162024.5 i/s - 1.23x slower
-    push N + pop N (bsearch):   146284.3 i/s - 1.36x slower
-    push N + pop N (rb_heap):    72289.0 i/s - 2.76x slower
+    push N + pop N (c_dheap):   298350.3 i/s
+    push N + pop N (c++ stl):   252227.1 i/s - 1.18x slower
+    push N + pop N (findmin):   161998.7 i/s - 1.84x slower
+    push N + pop N (bsearch):   143432.3 i/s - 2.08x slower
+    push N + pop N (rb_heap):    79622.1 i/s - 3.75x slower

     == Push/pop with pre-filled queue of size=N (N=21) =======================
-    push + pop (c_dheap):  4836860.0 i/s
-    push + pop (findmin):  4467453.9 i/s - 1.08x slower
-    push + pop (bsearch):  3345458.4 i/s - 1.45x slower
-    push + pop (rb_heap):  1560476.3 i/s - 3.10x slower
+    push + pop (c_dheap):  8855093.4 i/s
+    push + pop (c++ stl):  7223079.5 i/s - 1.23x slower
+    push + pop (findmin):  4542913.7 i/s - 1.95x slower
+    push + pop (bsearch):  3461802.4 i/s - 2.56x slower
+    push + pop (rb_heap):  1845488.7 i/s - 4.80x slower

-At higher values of N, `DHeap`'s logarithmic growth leads to little slowdown
-of `DHeap#push`, while insert's linear growth causes it to run slower and
-slower.  But because `#pop` is O(1) for a sorted array and O(d log n / log d)
-for a _d_-heap, scenarios involving `#pop` remain relatively close even as N
-increases to 5k:
+At higher values of N, a heap's logarithmic growth leads to only a little
+slowdown of `#push`, while insert's linear growth causes it to run noticeably
+slower and slower.  But because `#pop` is `O(1)` for a sorted array and `O(d log
+n / log d)` for a heap, scenarios involving both `#push` and `#pop` remain
+relatively close, and bsearch + insert still runs faster than a pure ruby heap,
+even up to queues with 10k items.  But as queue size increases beyond that, the
+linear time complexity to keep a sorted array dominates.

-    == Push/pop with pre-filled queue of size=N (N=5461) ==============
-    push + pop (c_dheap):  2718225.1 i/s
-    push + pop (bsearch):  1793546.4 i/s - 1.52x slower
-    push + pop (rb_heap):   707139.9 i/s - 3.84x slower
-    push + pop (findmin):   122316.0 i/s - 22.22x slower
+    == push + pop (rb_heap)
+    queue size =    10000:   736618.2 i/s
+    queue size =    25000:   670186.8 i/s - 1.10x slower
+    queue size =    50000:   618156.7 i/s - 1.19x slower
+    queue size =   100000:   579250.7 i/s - 1.27x slower
+    queue size =   250000:   572795.0 i/s - 1.29x slower
+    queue size =   500000:   543648.3 i/s - 1.35x slower
+    queue size =  1000000:   513523.4 i/s - 1.43x slower
+    queue size =  2500000:   460848.9 i/s - 1.60x slower
+    queue size =  5000000:   445234.5 i/s - 1.65x slower
+    queue size = 10000000:   423119.0 i/s - 1.74x slower

-Somewhat surprisingly, bsearch + insert still runs faster than a pure ruby heap
-for the repeated push/pop scenario, all the way up to N as high as 87k:
+    == push + pop (bsearch)
+    queue size =    10000:   786334.2 i/s
+    queue size =    25000:   364963.8 i/s - 2.15x slower
+    queue size =    50000:   200520.6 i/s - 3.92x slower
+    queue size =   100000:    88607.0 i/s - 8.87x slower
+    queue size =   250000:    34530.5 i/s - 22.77x slower
+    queue size =   500000:    17965.4 i/s - 43.77x slower
+    queue size =  1000000:     5638.7 i/s - 139.45x slower
+    queue size =  2500000:     1302.0 i/s - 603.93x slower
+    queue size =  5000000:      592.0 i/s - 1328.25x slower
+    queue size = 10000000:      288.8 i/s - 2722.66x slower

-    == push N (N=87381) ======================================================
-    push N (c_dheap):      92.8 i/s
-    push N (rb_heap):      43.5 i/s - 2.13x slower
-    push N (bsearch):       2.9 i/s - 31.70x slower
-
-    == push N then pop N (N=87381) ===========================================
-    push N + pop N (c_dheap):    22.6 i/s
-    push N + pop N (rb_heap):     5.5 i/s - 4.08x slower
-    push N + pop N (bsearch):     2.9 i/s - 7.90x slower
-
-    == Push/pop with pre-filled queue of size=N (N=87381) ====================
-    push + pop (c_dheap):  1815277.3 i/s
-    push + pop (bsearch):   762343.2 i/s - 2.38x slower
-    push + pop (rb_heap):   535913.6 i/s - 3.39x slower
-    push + pop (findmin):      2262.8 i/s - 802.24x slower
+    == push + pop (c_dheap)
+    queue size =    10000:  7311366.6 i/s
+    queue size =    50000:  6737824.5 i/s - 1.09x slower
+    queue size =    25000:  6407340.6 i/s - 1.14x slower
+    queue size =   100000:  6254396.3 i/s - 1.17x slower
+    queue size =   250000:  5917684.5 i/s - 1.24x slower
+    queue size =   500000:  5126307.6 i/s - 1.43x slower
+    queue size =  1000000:  4403494.1 i/s - 1.66x slower
+    queue size =  2500000:  3304088.2 i/s - 2.21x slower
+    queue size =  5000000:  2664897.7 i/s - 2.74x slower
+    queue size = 10000000:  2137927.6 i/s - 3.42x slower

 ## Profiling

 _n.b. `Array#fetch` is reading the input data, external to heap operations.
 These benchmarks use integers for all scores, which enables significantly faster