h3. Don't Drop ACID while exploring Big Data The traditional "ACID quartet":http://en.wikipedia.org/wiki/ACID for relational databases can be re-interpreted in a Big Data context: * A -- Associative * C -- Commutative * I -- Idempotent * D -- Distributed * (*) -- (and where possible, left in sort order) Finally, where possible leave things in sort order by some appropriate index. Clearly I'm not talking about introducing extra unnecessary sorts on ephemeral data. For things that will be read (and experimented with) much more often than they're written, though, it's worth running a final sort. Now you can * Efficiently index into a massive dataset with binary search * Do a direct merge sort on two files with the same sort order * Run a reducer directly across the data * Assign a synthetic key by just serially numbering lines (either distribute a unique prefix to each mapper Note: for files that will live on the DFS, you should usually *not* do a total sort, h3. If it's not broken, it's wrong Something that goes wrong one in a five million times will crop up hundreds of times in a billion-record collection. h3. Error is not normally distributed What's more, errors introduced will not in general be normally distributed and their impact may not decrease with increasing data size. h3. Encode once, and carefully. Encoding violates idempotence. Data brought in from elsewhere *must* be considered unparsable, ill-formatted and rife with illegal characters. * Immediately fix a copy of the original data with as minimal encoding as possible. * Follow this with a separate parse stage to emit perfectly well-formed, tab-separated / newline delimited data * In this parse stage, encode the data to 7-bits, free of internal tabs, backslashes, carriage return/line feed or control characters. You want your encoding scheme to be ** perfectly reversible ** widely implemented ** easily parseable ** recognizable: incoming data that is mostly inoffensive (a json record, or each line of a document such as this one) should be minimally altered from its original. This lets you do rough exploration with sort/cut/grep and friends. ** !! Involve **NO QUOTING**, only escaping. I can write a simple regexp to decode entities such as %10, \n or . This regexp will behave harmlessly with ill-formed data (eg %%10 or &&; or \ at end of line) and is robust against data being split or interpolated. Schemes such as "quoting: it's bad", %Q{quoting: "just say no"} or tagged markup require a recursive parser. An extra or missing quote mark is almost impossible to backtrack. And av In the absence of some lightweight, mostly-transparent, ASCII-compatible *AND* idempotent encoding scheme lurking in a back closet of some algorithms book -- how to handle the initial lousy payload coming off the wire? * For data that is *mostly* text in a western language, you'll do well wiht XML encoding (with [\n\r\t\\] forced to encode as entities) * URL encoding isn't as recognizable, but is also safe. Use this for things like URIs and filenames, or if you want to be /really/ paranoid about escaping. * For binary data, Binhex is efficient enough and every toolkit can handle it. There are more data-efficient ascii-compatible encoding schemes but it's not worth the hassle for the 10% or whatever gain in size. * If your payload itself is XML data, consider using \0 (nul) between records, with a fixed number of tab-separated metadata fields leading the XML data, which can then include tabs, newlines, or whatever the hell it wants. No changes are made to the data apart from a quick gsub to remove any (highly illegal) \0 in the XML data itself. A later parse round will convert it to structured hadoop-able data. Ex: {% highlight html %} feed_request 20090809101112 200 OK infochimps.org — Find Any Dataset in the World {% endhighlight %} p. Many of the command line utilities (@cat@, @grep@, etc.) will accept nul-delimited files. You may be tempted to use XML around your XML so you can XML while you XML, but this is ultimately only done right by parsing or scrubbing the inputm and at that point you should just translate directly to a reasonable tab/newline format. (Even if that format is tsv-compatible JSON). h3. Do your exception handling in-band A large, heavily-used cluster will want to have ganglia or "scribe":http://www.cloudera.com/blog/2008/11/02/configuring-and-using-scribe-for-hadoop-log-collection/ or the like collecting and managing log data. "Splunk":http://www.splunk.com/ is a compelling option I haven't myself used, but it is "broadly endorsed.":http://www.igvita.com/2008/10/22/distributed-logging-syslog-ng-splunk/ However, it's worth considering another extremely efficient, simple and powerful distributed system for routing massive quantities of data in a structured way, namely wukong|hadoop itself. Wukong gives you a BadRecord class -- just rescue errors, pass the full or partial contents of the offending input. and emit the BadRecord instance in-band. They'll be serialized out along with the rest, and at your preference can be made to reduce to a single instance. Do analysis on them at your leisure; by default, any StructStreamer will silently discard *inbound* BadRecords -- they won't survive past the current generation. h3. Don't be afraid to use the command line as an IDE %{ highlight sh %} cat /data/foo.tsv | ruby -ne 'puts $_.chomp.scan(/text="([^"]+)"/).join("\t")' {% endhighlight %} h3. Keys * Artificial key: assigned externally, key is not a function of the object's intrinsic values. A social security number is an artificial key. Artificial * Natural Key: minimal subset of fields with _intrinsic semantic value_ that _uniquely identify_ the record. My name isn't unique, but my fingerprint is both uniqe and intrinsic. Given the object (me) you can generate the key, and given the key there's exactly one object (me) that matches. h4. other fields * Mutable: ** A user's 'bio' section. * Immutable: ** A user's created_at date is immutable: it doesn't help identify the person but it will never change. h4. Natural keys are right for big data Synthetic keys suck. They demand locality or a central keymaster. * Use the natural key * Hash the natural key. This has some drawbacks OK, fine. you need a synthetic key * Do a total sort, and use nl * Generate * Use a single reducer to reduce locality. YUCK. * have each mapper generate a unique prefix; number each line as "prefix#{line_number}" or whatever. How do you get a unique prefix? * Distribute a unique prefix to each mapper out-of-band. People using Streaming are out of luck. * Use a UUID -- that's what they're for. Drawback: ridiculously long * Hash the machine name, PID and timestamp to something short. Check after the fact that uniqueness was achieved. Use the birthday party formula to find out how often this will happen. (In practice, almost never.) Working with records that change over time,