# oak design desiderata

Some design goals with which I started this project.

- P1 means "top priority"
- P2 means "very important"
- P3 means "nice to have"
- P4 means "not harmful if cheap"
- `+` means "accomplished"
- `-` means "not accomplished"
- `?` means "accomplished, but only for some combinations of arguments"

Desiderata for the structure layer:

- P1 + losslessly handle nil, true, false, Integer, and String
- P1 + losslessly handle List with arbitrary values and deep nesting
- P1 + losslessly handle Hash with string keys and deep nesting in values
- P1 + detect cycles and DAGs in input structures, fail or handle
- P1 + handle all Integer types without loss
- P1 - handle Floats with no more than a small quantified loss
- P2 + Hash key ordering is preserved in Ruby-Ruby transcoding
- P3 - convenient: vaguely human-readable representations available
- P3 + encode cycles and DAGs
- P3 + handle Hash with non-string keys and deep nesting in keys
- P3 + losslessly handle Symbol distinct from String
- P3 - handle Times and Dates

Desiderata for the byte layer:

- P1 + reversible: original string can be reconstructed from only OAK string
- P1 + unambiguous: no OAK string is the prefix of any other OAK string
- P1 + extensible: OAK strings contain ids for ver, format, compression, etc
- P1 + robust: error detection in OAK strings
- P2 + flexible: multiple compression modes available
- P3 + convenient: available representation without `{}`, comma, whitespace
- P3 + convenient: 7-bit clean representations available
- P3 + convenient: representations which are selectable with double-click
- P3 + convenient: vaguely human-readable representations available
- P3 - streamable: reversing can be accomplished with definite-size buffers
- P4 - embeddable: reversing can be accomplished with fixed-size buffers
- P4 - defensive: error correction available (no good libs found)

Techniques used in the byte layer to accomplish these goals:

- manifest type id for self-identification
- manifest version id in case format changes in future
- salient encoding algorithm choices stored in output stream
- error detection algorithm aka redundancy
- compression
- formatting
- microchoices made to confine metadata characters to [_0-9a-z]
- algorithm menu constructed to offer data characters in [-_0-9a-z]

## Serialization Choices

A survey of alternatives considered for the serialization layer.

### Considering Marshal

The Marshal format has some major drawbacks which I believe make it a
nonstarter.

- strictly Ruby-specific
- readability across major versions not guaranteed
- too powerful: can be used to execute arbitrary code
- binary and non-human-readable
  - many option combos for OAK make OAK strings also non-human-readable
  - still, it is nice to have a layer which is at least potentially clear

Marshal does offer some major advantages:

- transcodes all Ruby value types and user-defined value-like classes
- reported to be much faster than JSON or YAML for serializing
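To make the value-class advantage concrete, here is a minimal
irb-style sketch; `Point` is an illustrative class of mine, not
anything from OAK:

```
# Marshal round-trips a user-defined value class losslessly; plain
# JSON would need custom serialization hooks to do the same.
Point = Struct.new(:x, :y)
point = Point.new(1, 2)
Marshal.load(Marshal.dump(point)) == point  # => true
```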
### Considering JSON

JSON is awesome most of the time, especially in highly constrained
environments such as API specifications and simple ad-hoc caching
situations.

JSON offers advantages:

- a portable object model
- easy to read
- widely deployed
- the go-to choice for interchange in recent years

But it has some shortcomings which lead me to reject it for the
structural level in OAK.

- floating point precision is implementation-dependent
- always decodes as a tree
  - fails to transcode DAGiness
  - cannot represent cycles: encoders reject them, overflow the
    stack, or loop forever
- no native date or time handling
- hash keys may only be strings
  - e.g. `{'123'=>'x'} == JSON.parse(JSON.dump({123=>'x'}))`
- type information symbol-vs-string lost, symbols transcode to strings
  - e.g. `'foo' == JSON.parse(JSON.dump(:foo))`
  - e.g. `{'foo'=>'x'} == JSON.parse(JSON.dump({:foo=>'x'}))`
- official grammar only allows {} or [] as top-level object
  - e.g. some parsers return `123` for `JSON.parse('123')`, but
    strict ones raise `ParserError`
  - many parsers in the wild support only this strict official grammar
- JSON is suitable only for document encoding, not streams
  - allows only one object per file
  - multiple objects must be members of a list
  - lists must be fully scanned and parsed before being processed
  - no possibility of streamy processing

Biggest limitation of JSON IMO is that Hash keys can only be strings:

```
2.1.6 :008 > obj = {'str'=>'bar',[1,2,3]=>'baz'}
 => {"str"=>"bar", [1, 2, 3]=>"baz"}
2.1.6 :009 > JSON.dump(obj)
 => "{\"str\":\"bar\",\"[1, 2, 3]\":\"baz\"}"
2.1.6 :010 > JSON.parse(JSON.dump(obj))
 => {"str"=>"bar", "[1, 2, 3]"=>"baz"}
2.1.6 :011 > JSON.parse(JSON.dump(obj)) == obj
 => false
```

### Considering YAML

YAML is strong where JSON is strong, and also strong in many places
where JSON is weak. In fact, YAML includes JSON as a subformat: JSON
strings *are* YAML strings!

Some of the advantages of YAML over JSON are:

- handles any directed graph, including DAGy bits and cycles
- arguably more human-readable than JSON
- YAML spec subsumes JSON spec: JSON files are YAML files
- supports non-string keys
  - e.g. `{123=>'x'} == YAML.load(YAML.dump({123=>'x'}))`
- supports symbols
  - e.g. `:foo == YAML.load(YAML.dump(:foo))`
  - e.g. `{:foo=>'x'} == YAML.load(YAML.dump({:foo=>'x'}))`
- allows integer or string as top-level object

YAML overcomes the biggest limitation of JSON by supporting non-string
hash keys:

```
2.1.6 :008 > obj = {'str'=>'bar',[1,2,3]=>'baz'}
 => {"str"=>"bar", [1, 2, 3]=>"baz"}
2.1.6 :012 > YAML.dump(obj)
 => "---\nstr: bar\n? - 1\n  - 2\n  - 3\n: baz\n"
2.1.6 :013 > YAML.load(YAML.dump(obj))
 => {"str"=>"bar", [1, 2, 3]=>"baz"}
2.1.6 :014 > YAML.load(YAML.dump(obj)) == obj
 => true
```

Note: YAML's support for Symbols is due to Psych, not strictly the
YAML format itself. I've taken both `YAML.dump(:foo)` and
`YAML.dump(':foo')` into Python and done `yaml.load()` on them. Both
result in `':foo'`. So this nicety is not portable.

But YAML still has some shortcomings:

- floating point precision is implementation-dependent
- no native date or time handling
- unclear whether available parsers support stream processing
- DAGs and cycles of Arrays and Hashes are handled, but Strings are not

### Considering FRIZZY

FRIZZY is a home-grown serialization format which I ended up
committing to for OAK. The name FRIZZY means nothing, and survives
only as the rogue `F` character at the start of a serialized object:

```
2.1.6 :006 > OAK.encode('Hello, World!',redundancy: :none,format: :none)
 => "oak_3NNN_0_20_F1SU13_Hello, World!_ok"
```

Advantages:

- Recognizes when Strings are identical, not just equivalent (see the
  sketch below).
- It is much more compact than YAML.
- Has built-in folding of String and Symbol representation.
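The identity-vs-equivalence distinction is easy to see in plain Ruby;
this minimal sketch is mine, not OAK code:

```
# Two references to one String object form a DAG; two
# equal-but-distinct Strings form a tree. Per the notes above, JSON
# and YAML decode both as trees, while FRIZZY preserves the difference.
a    = 'x'
dag  = [a, a]
tree = ['x', 'x']
dag[0].equal?(dag[1])    # => true
tree[0].equal?(tree[1])  # => false
```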
Disadvantages:

- Home grown.
- Very much not human readable.
- Floating point precision is incompletely specified.
- Current implementation crudely uses `Float#to_s` and `String#to_f`.

I decided to reinvent the wheel and go with FRIZZY. We have discovered
Summaries which are DAGs on strings. It might be acceptable to lose
that information, but I did not want to have to *prove* that it was
acceptable to lose. It may have been an ego-driven sin to go custom
here, but I did not want to pessimize future use cases on fidelity or
control.

## Compression Choices

A survey of alternatives considered for the compression layer.

### Considering LZO, LZF, and LZ4

These compression formats are similar in performance and algorithm.
All are in the Lempel-Ziv family of dictionary-based redundancy-eaters.
They will all be cheap to compress, cheap to uncompress, but will
deliver only modest compression ratios.

This family of algorithms is unfamiliar to those accustomed to archive
formats, but these algorithms are used widely in low-latency
applications (such as server caches ;) ).

To keep things simple, I settled on supporting only LZ4 because its
gem, `lz4-ruby`, seems to have more mindshare and momentum. It is
weaker but faster than the other weak+fast options - which seems like
the way to be.

Based on previous experience, I expect this to be a clear win for use
in Redis caches vs no compression at all.

### Considering ZLIB

Including ZLIB felt like a no-brainer. ZLIB is familiar,
widely-deployed, and standardized in RFCs 1950 and 1951. It uses the
L-Z process with an additional Huffman encoding phase. It will deliver
intermediate cost for intermediate compression.

Based on previous experience, I expect this option will usually be
dominated by either LZ4 for low-latency applications or BZIP2 for
archival applications, but I'm including it for comparisons and
because it would feel strange not to.

### Considering BZIP2

BZIP2 is an aggressive compression which uses the Burrows–Wheeler,
move-to-front, and run-length-encoding transforms with Huffman coding.
It will be several times slower than ZLIB, but several tens of percent
stronger.

I chose the gem bzip2-ffi over the more flexible rbzip2 to make
absolutely certain that we use the native libbz2 implementation and do
not fall back silently to a Ruby version which is 100x slower if/when
Heroku does not offer FFI.

Based on previous experience, I expect this option will dominate where
data is generally cold or where storage is very expensive compared to
CPU.

### Considering LZMA

LZMA is the Lempel-Ziv-Markov chains algorithm. It will be an order of
magnitude more expensive to compress than BZIP2, but will decompress
slightly faster and will yield better compression ratios by a few
percent.

This will be useful only for cases where read-write ratios are over 10
and storage:cpu cost ratios are high. When read-write ratios are close
to unity, LZO will dominate where storage:cpu is low and BZIP2 will
dominate where storage:cpu is high.

Nonetheless, I have a soft spot in my heart for this algorithm so I am
including it - if only so we can rule it out by demonstration rather
than hypothesis.
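These trade-offs are easy to compare empirically. Below is a
hypothetical irb-style harness, assuming a `compression:` option that
takes the mode names above in the same way the `redundancy:` and
`format:` options appeared earlier; the sizes it prints are
illustrative, not measured:

```
require 'oak'

# Encode one highly redundant object under each surveyed mode and
# report the resulting OAK string sizes.
obj = { 'body' => 'Lorem ipsum dolor sit amet. ' * 1_000 }
[:none, :lz4, :zlib, :bzip2, :lzma].each do |mode|
  oak = OAK.encode(obj, compression: mode)
  puts format('%-6s %8d bytes', mode, oak.bytesize)
end
```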
## Encryption Choices

OAK has been live in `ALI` (copper.com's primary web service) in our
Redis cache layer since 2016-06-02 and for archiving correspondence
bodies in S3 since 2016-07-06. There have been only Rubocop updates
and nary a bugfix since 2016-07-01. Encryption is the first extension
since OAK first went live.

### Encryption-in-OAK Design Decisions

(see arch doc for discussion)

- Encryption is the only change in OAK4.
- OAK4 will only support AES-256-GCM with random IVs chosen for each
  encryption event.
- OAK4 will use no salt other than the random IV.
- Encrypted OAK strings will be nondeterministic.
  - This crushes the desideratum of making OAK.encode a pure function.
  - This is necessary to avoid a blatant security hole.
- OAK4 dramatically changes how headers are managed from OAK3.
  - Encrypts all headers which are not required for decryption.
  - Authenticates all headers and the encrypted stream.
- Key rotation is supported.
  - Via an ENV-specified key chain.
  - Can hold multiple master keys.

### Encryption-in-OAK Backward Compatibility

Before encryption was added, the format identifier for OAK strings was
`'oak_3'`. To indicate we are making a non-backward compatible change,
I am bumping that up to `'oak_4'` for encrypted strings.

Legacy OAK3 strings are still supported, both on read and on write. By
default, OAK4 is used only when encryption is requested.

### Encryption-in-OAK Regarding Compression vs Encryption

Note that compression of encrypted strings is next to useless. By
design, encryption algorithms obscure exploitable redundancy in
plaintext and produce incompressible ciphertext. On the other hand, in
the wild there have been a handful of successful chosen-plaintext
attacks on compress-then-encrypt encodings. See:

- https://blog.appcanary.com/2016/encrypt-or-compress.html
- https://en.wikipedia.org/wiki/CRIME

OAK4 supports compression and does compression-then-encryption. The
extremely paranoid are encouraged to use `compression: :none`. Note,
however, that the source data may already be compressed. Furthermore,
for larger objects FRIZZY itself is, in part, a compression algorithm.
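To make the OAK4 recipe concrete, here is a minimal sketch of
compress-then-encrypt using Ruby's OpenSSL bindings. This is not OAK's
actual implementation; the variable names, the 12-byte IV length, and
the header placeholder are illustrative assumptions:

```
require 'openssl'
require 'securerandom'
require 'zlib'

# Compress first, then encrypt with AES-256-GCM under a fresh random
# IV, authenticating the cleartext headers as additional
# authenticated data (AAD).
key        = SecureRandom.random_bytes(32)  # one master key from the chain
iv         = SecureRandom.random_bytes(12)  # random IV, the only "salt"
headers    = 'oak_4...'                     # whatever must stay cleartext
compressed = Zlib::Deflate.deflate('serialized plaintext')

cipher           = OpenSSL::Cipher.new('aes-256-gcm').encrypt
cipher.key       = key
cipher.iv        = iv
cipher.auth_data = headers                  # authenticated, not encrypted
ciphertext       = cipher.update(compressed) + cipher.final
auth_tag         = cipher.auth_tag          # tampering fails decryption
```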