---
layout: default
title: mrflip.github.com/wukong - NFS on Hadoop FTW
collapse: false
---

h2. Hadoop Config Tips

h3(#hadoopnfs). Setup NFS within the cluster

If you're lazy, I recommend setting up NFS -- it makes dispatching simple config and script files much easier. (And if you're not lazy, what the hell are you doing using Wukong?) Be careful, though -- used unwisely, a swarm of NFS requests will mount a devastatingly effective denial-of-service attack on your poor old master node.

Installing NFS to share files across the cluster gives you the following conveniences:

* You don't have to bundle everything up with each run: any path in ~coder/ will refer back via NFS to the filesystem on master.
* Users can now ssh among the nodes without a password, since there's only one shared home directory and since we included the user's own public key in the authorized_keys2 file. This lets you easily rsync files among the nodes.

First, take note of the _internal_ name of your master, something like @domU-xx-xx-xx-xx-xx-xx.compute-1.internal@. As root, on the master (change @compute-1.internal@ to match your setup):

<pre>
apt-get install nfs-kernel-server
echo "/home *.compute-1.internal(rw)" >> /etc/exports
/etc/init.d/nfs-kernel-server restart
</pre>

(The @*.compute-1.internal@ part limits host access, but you should take a look at the security settings of both EC2 and the built-in portmapper as well.)

Next, set up a regular user account on the *master only*. In this case our user will be named 'chimpy':

<pre>
visudo                          # uncomment the last line, to let the admin group sudo
groupadd admin
adduser chimpy
usermod -a -G sudo,admin chimpy
su chimpy                       # now you are the new user
ssh-keygen -t rsa               # accept all the defaults
cat ~/.ssh/id_rsa.pub           # you can paste this public key into your github, etc.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys2
</pre>

Then, on each slave (replacing domU-xx-... with the internal name of the master node):

<pre>
apt-get install nfs-common
echo "domU-xx-xx-xx-xx-xx-xx.compute-1.internal:/home /mnt/home nfs rw 0 0" >> /etc/fstab
/etc/init.d/nfs-common restart
mkdir /mnt/home
mount /mnt/home
ln -s /mnt/home/chimpy /home/chimpy
</pre>
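To sanity-check the setup -- a minimal sketch; the hostname and the rsync paths are just placeholders carried over from the examples above:

<pre>
exportfs -v                     # on the master: confirm /home is exported
df -h /mnt/home                 # on a slave: confirm the NFS mount is live
# as chimpy, passwordless ssh (and hence rsync) should now work node-to-node:
ssh domU-xx-xx-xx-xx-xx-xx.compute-1.internal hostname
rsync -az /mnt/somedir/ domU-xx-xx-xx-xx-xx-xx.compute-1.internal:/mnt/somedir/
</pre>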
You should now be in business. The performance hit should be small as long as you only send code files and gems around. *Don't* write log entries or data to NFS partitions, or you'll effectively perform a denial-of-service attack on the master node.

* http://nfs.sourceforge.net/nfs-howto/ar01s03.html
* The "Setting up an NFS Server HOWTO":http://nfs.sourceforge.net/nfs-howto/index.html was an immense help, and I recommend reading it carefully.

h3(#awstools). Tools for EC2 and S3 Management

* http://s3sync.net/wiki
* http://jets3t.s3.amazonaws.com/applications/applications.html#uploader
* ElasticFox
* S3Fox (S3 Organizer)
* FoxyProxy

h3. Random EC2 notes

* "How to Mount EBS volume at launch":http://clouddevelopertips.blogspot.com/2009/08/mount-ebs-volume-created-from-snapshot.html
* The Cloudera AMIs and distribution include BZip2 support. This means that if you have input files with a .bz2 extension, they will be transparently un-bzipped and streamed. (Note that there is a non-trivial penalty for doing so: each bzip'ed file must go, in whole, to a single mapper, and the CPU load for un-bzipping is sizeable.)
* To _produce_ bzip2 files, specify the @--compress_output=@ flag. If you have the BZip2 patches installed, you can give @--compress_output=bz2@; everyone should be able to use @--compress_output=gz@. (See the sketch after this list.)
* For excellent performance, you can patch your install for "Parallel LZO Splitting":http://www.cloudera.com/blog/2009/06/24/parallel-lzo-splittable-compression-for-hadoop/
* If you're using XFS, consider setting the nobarrier option: @/dev/sdf /mnt/data2 xfs noatime,nodiratime,nobarrier 0 0@
* The first write to any disk location is about 5x slower than later writes. An explanation, and how to pre-soften a volume, is here: http://docs.amazonwebservices.com/AWSEC2/latest/DeveloperGuide/index.html?instance-storage.html (see the pre-softening sketch below)
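To illustrate the @--compress_output=@ flag -- a hedged sketch, assuming a Wukong script named @word_count.rb@ and input/output paths of your choosing:

<pre>
# gzip output works on any stock install:
./word_count.rb --run --compress_output=gz input_dir output_dir
# bzip2 output, if you have the BZip2 patches mentioned above:
./word_count.rb --run --compress_output=bz2 input_dir output_dir
</pre>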
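And to pre-soften a volume -- a minimal sketch along the lines of the AWS guide linked above; @/dev/sdf@ is just the example device from the XFS note, and the write variant destroys data, so use it only on a brand-new, empty volume:

<pre>
# read every block once (safe; e.g. for a volume restored from a snapshot):
dd if=/dev/sdf of=/dev/null bs=1M
# write every block once (DESTROYS ALL DATA -- empty volumes only):
# dd if=/dev/zero of=/dev/sdf bs=1M
</pre>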