= Hamster

Hamster started out as an implementation of Hash Array Mapped Tries (HAMT) for Ruby (see http://lamp.epfl.ch/papers/idealhashtrees.pdf) and has since expanded to include implementations of other Persistent Data Structures (see http://en.wikipedia.org/wiki/Persistent_data_structure) such as Sets, Lists, Stacks, etc.

== What are persistent data structures?

Persistent data structures have a really neat property: very efficient copy-on-write operations. That allows you to create immutable data structures that only need copying when something changes. For example:

  hash = Hamster.hash
  hash.put("Name", "Simon")
  hash.has_key?("Name")  # => false
  hash.get("Name")       # => nil

== Double Huh? That's not much use!

Whoops! Unlike Ruby's standard library, each call to Hamster::Hash#put creates an efficient copy containing the modifications, leaving the original unmodified. Thus, all Hamster classes follow Command-Query Separation (see http://martinfowler.com/bliki/CommandQuerySeparation.html) and return the modified copy of themselves after any mutating operation. Let's try that again:

  original = Hamster.hash
  copy = original.put("Name", "Simon")
  original.get("Name")  # => nil
  copy.get("Name")      # => "Simon"

The same goes for #remove:

  original = Hamster.hash
  original = original.put("Name", "Simon")
  copy = original.remove("Name")
  original.get("Name")  # => "Simon"
  copy.get("Name")      # => nil

== Oh, I get it. Cool. But I still don't understand why I should care?

As mentioned earlier, persistent data structures perform a copy whenever they are modified, meaning there is never any chance that two threads could be modifying the same instance at any one time. And, because the copies are so efficient, you don't need to worry about using up gobs of memory in the process.

Moreover, because they're immutable, you can pass them around between objects, methods, and functions and never worry about data corruption; no more defensive calls to collection.dup! For an interesting read on why immutability is a good thing, take a look at Matthias Felleisen's Function Objects presentation (http://www.ccs.neu.edu/home/matthias/Presentations/ecoop2004.pdf).

== OK, that sounds mildly interesting. What's the downside--there's always a downside?

There's a potential performance hit when compared with MRI's built-in, native, hand-crafted C-code implementation of Hash. For example:

  hash = Hamster.hash
  (1..10000).each { |i| hash = hash.put(i, i) }  # => 0.05s
  (1..10000).each { |i| hash.get(i) }            # => 0.008s

versus

  hash = {}
  (1..10000).each { |i| hash[i] = i }  # => 0.004s
  (1..10000).each { |i| hash[i] }      # => 0.001s

== That seems pretty bad?

Well, yes and no. The previous comparison wasn't really fair. Sure, if all you want to do is replace your existing uses of Hash in single-threaded environments, then don't even bother. However, if you need something that can be used efficiently in concurrent environments where multiple threads are accessing--reading AND writing--the contents, things get much better.

== Do you have a better example?

A more realistic comparison might look like:

  hash = Hamster.hash
  (1..10000).each { |i| hash = hash.put(i, i) }  # => 0.05s
  (1..10000).each { |i| hash.get(i) }            # => 0.008s

versus

  hash = {}
  (1..10000).each { |i|
    hash = hash.dup
    hash[i] = i
  }                                # => 19.8s
  (1..10000).each { |i| hash[i] }  # => 0.001s

Impressive, huh?
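If you want to reproduce a comparison like this yourself, a minimal sketch using Ruby's standard Benchmark module might look like the following. Only Hamster.hash, #put, and #get from the examples above are assumed; the absolute timings will of course vary with your machine and Ruby version:

  require 'benchmark'
  require 'hamster'

  # Copy-on-write writes and reads with Hamster::Hash.
  hamster_time = Benchmark.realtime do
    hash = Hamster.hash
    (1..10000).each { |i| hash = hash.put(i, i) }
    (1..10000).each { |i| hash.get(i) }
  end

  # The "safe" native equivalent: duplicate the Hash before every write.
  native_copy_time = Benchmark.realtime do
    hash = {}
    (1..10000).each { |i| hash = hash.dup; hash[i] = i }
    (1..10000).each { |i| hash[i] }
  end

  puts "Hamster::Hash: #{hamster_time}s, Hash + #dup: #{native_copy_time}s"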
What's even better--or worse, depending on your perspective--is that after all that, the native Hash version still isn't thread-safe, and still requires some synchronisation around it, slowing it down even further. The Hamster::Hash version, on the other hand, was unchanged from the original, remains inherently thread-safe, and is around 400 times faster.

== Sure, but as you say, you still need synchronisation, so why bother with the copying?

Well, I could show you an example, but I'd have to re-write--or at least wrap--most Hash methods to make it generic, or write some application-specific code that synchronised using a Mutex, and ... well ... it's hard, I always make mistakes, I always end up with weird edge cases and race conditions, so I'll leave that as an exercise for you :) And don't forget that even if threading isn't a concern for you, the safety provided by immutability is worth it.

== So, you mentioned Sets, Lists, and Stacks?

Indeed I did.

=== Sets

=== Lists

Lists have a head--the value of the item at the head of the list--and a tail--containing the remaining items. For example:

  list = Hamster.list(1, 2, 3)
  list.head  # => 1
  list.tail  # => Hamster.list(2, 3)

To add to a list, you use #cons:

  original = Hamster.list(1, 2, 3)
  copy = original.cons(0)  # => Hamster.list(0, 1, 2, 3)

Notice how modifying a list actually returns a new list. That's because Hamster lists are immutable. Thankfully, just like Hamster Set and Hash, they're also very efficient at making copies!

Lists are, where possible, lazy. That is, they try to defer processing items until absolutely necessary. For example, given a crude function to detect prime numbers:

  def prime?(n)
    2.upto(Math.sqrt(n).round) { |i| return false if n % i == 0 }
    true
  end

The following code will only call prime? as many times as necessary to generate the first 3 prime numbers between 10000 and 1000000:

  Hamster.interval(10000, 1000000).filter { |i| prime?(i) }.take(3)  # => 0.0009s

Compare that to the conventional equivalent, which needs to calculate all possible values in the range before taking the first 3:

  (10000..1000000).select { |i| prime?(i) }[0, 3]  # => 10s

Besides Hamster.list, there are other ways to construct lists:

* Hamster.interval(from, to) (aliased as .range) creates a lazy list equivalent to a list containing all the values between from and to, without actually creating a list that big.

* Hamster.stream { ... } allows you to create infinite lists. Each time a new value is required, the supplied block is called.

* Hamster.repeat(x) creates an infinite list with x as the value of every element.

* Hamster.replicate(n, x) creates a list of size n with x as the value of every element.

* Hamster.iterate(x) { ... } creates an infinite list where the first item is calculated by applying the block to the initial argument, the second item by applying the block to the previous result, and so on.

You also get Enumerable#to_list so you can slowly transition from the built-in collection classes to Hamster.
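To make a few of those constructors a little more concrete, here's a small sketch. The particular values are only for illustration; because lists are lazy, only the elements actually taken are ever produced:

  # Hamster.repeat: an infinite list of the same value.
  Hamster.repeat("la").take(3)          # => Hamster.list("la", "la", "la")

  # Hamster.replicate: a fixed-size list of the same value.
  Hamster.replicate(3, 0)               # => Hamster.list(0, 0, 0)

  # Hamster.interval: a lazy range; no million-element list is ever built.
  Hamster.interval(1, 1000000).take(2)  # => Hamster.list(1, 2)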
And finally, you get IO#to_list, allowing you to lazily process huge files. For example, imagine the following code to process a 100MB file:

  File.open("my_100_mb_file.txt") do |io|
    lines = []
    io.each_line do |line|
      break if lines.size == 10
      lines << line.chomp.downcase.reverse
    end
    lines
  end

How many times (or how long) did you have to read that code before it became apparent what it actually does? Now compare it with the following:

  File.open("my_100_mb_file.txt") do |io|
    io.map(&:chomp).map(&:downcase).map(&:reverse).take(10)
  end

Unfortunately, though the second example reads nicely, it takes around 13 seconds to run (compared with 0.033 seconds for the first), even though we're only interested in the first 10 lines! However, using a little #to_list magic, we can get the running time back down to 0.033 seconds:

  File.open("my_100_mb_file.txt") do |io|
    io.to_list.map(&:chomp).map(&:downcase).map(&:reverse).take(10)
  end

How is this even possible? It's possible because IO#to_list creates a lazy list whereby each line is only ever read and processed as needed, in effect converting it to the first example without all the syntactic, imperative noise.

=== Stacks

== Disclaimer

Hamster started out as a spike to prove a point and has since morphed into something I actually use. My primary concern has been to round out the functionality with good test coverage and clean, readable code.

Performance is pretty good--especially with lazy lists--but some things unfortunately had to be converted from recursive to iterative implementations due to the lack of Tail-Call Optimisation in Ruby, making them a little less readable, and a little more memory-hungry, than I would otherwise have preferred.

Documentation is sparse, but I've tried as best I can to write specs that read as documentation. I've also tried to alias methods as their Enumerable equivalents where possible, to make it easier for people to migrate code.
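As a small illustration of that last point, and of the Enumerable#to_list transition mentioned earlier, migrating a snippet of existing code can be as simple as converting the collection first. The array contents below are invented, and only methods already shown in this README (#to_list, #map, #filter, #take) are used:

  require 'hamster'

  # An ordinary Array, as it might appear in existing code...
  words = ["Banana", "Apple", "Bilberry", "Cherry"]

  # ...converted to a Hamster list and processed with the same methods
  # used throughout this README.
  words.to_list.map(&:downcase).filter { |w| w.start_with?("b") }.take(2)
  # => Hamster.list("banana", "bilberry")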