A number of people have contacted me and expressed interest in a guide that starts a little earlier in the process: that is, how do you first install Hadoop and myHadoop on a cluster, and how do users then interact with that installation to submit jobs that spawn Hadoop clusters? Dana Brunson was kind enough to remind me that I had promised some people such a guide and never delivered it, so I set out to write one up this weekend.
It turned out that writing a sufficiently simple guide required a substantial rewrite of the myHadoop framework itself, so I wound up doing that as well. I made a lot of changes (and broke backwards compatibility; sorry!), but the resulting guide on how to deploy Hadoop on a traditional high-performance computing cluster is now pretty simple. Here it is:
In addition, I released a new version of myHadoop to accompany this guide:
I'm pretty excited about this, as it represents a very easy way to get people playing with the technology without having to invest in new infrastructure. I was able to deploy it on a number of XSEDE resources entirely from userland, and there's no reason it wouldn't be just as simple on any machine with local disks and a batch system.
3-Step Install and 3-Step Spin-up

Although the details are in the guide I referenced above, deploying Hadoop on a traditional cluster using myHadoop takes only three easy steps to install and three more to spin up (a shell sketch follows each list below):
- Download a stock Apache Hadoop 1.x binary distribution and myHadoop 0.30b
- Unpack both Hadoop and myHadoop into an installation directory (no actual installation required)
- Apply a single patch shipped with myHadoop to Apache Hadoop
- There is no step #4
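In concrete terms, the install might look roughly like the following. The download URLs, the Hadoop version, and the patch filename are placeholders on my part, so substitute whatever the myHadoop README actually points you to:

```bash
# Step 1: download a stock Apache Hadoop 1.x tarball and myHadoop 0.30b
#         (URLs and versions below are illustrative, not canonical)
wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1-bin.tar.gz
wget https://github.com/glennklockwood/myhadoop/archive/v0.30b.tar.gz -O myhadoop-0.30b.tar.gz

# Step 2: unpack both into an installation directory; nothing actually gets "installed"
mkdir -p $HOME/hadoop-stack
tar -xzf hadoop-1.2.1-bin.tar.gz -C $HOME/hadoop-stack
tar -xzf myhadoop-0.30b.tar.gz   -C $HOME/hadoop-stack

# Step 3: apply the patch shipped with myHadoop to the Hadoop tree
#         (the patch filename and -p level are placeholders; see the myHadoop README)
cd $HOME/hadoop-stack/hadoop-1.2.1
patch -p0 < $HOME/hadoop-stack/myhadoop-0.30b/myhadoop-1.2.1.patch
```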
- Export a few environment variables ($PATH, $HADOOP_HOME, $JAVA_HOME, and $HADOOP_CONF_DIR)
- Run the myhadoop-configure.sh command and pass it $HADOOP_CONF_DIR and the location of each compute node's local disk filesystem (e.g., /scratch/$USER)
- Run Hadoop's "start-all.sh" script (no input needed)
- Like I said, no step #4
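Inside a batch job, the spin-up might look roughly like this. The -c and -s flags for myhadoop-configure.sh are how I remember the interface; double-check them against the usage message printed by the copy you downloaded:

```bash
# Point at the Hadoop installation and a per-job configuration directory
# (all paths below are illustrative)
export HADOOP_HOME=$HOME/hadoop-stack/hadoop-1.2.1
export JAVA_HOME=/usr/lib/jvm/java                 # wherever your site's JVM lives
export HADOOP_CONF_DIR=$HOME/hadoop-conf.$PBS_JOBID
export PATH=$HADOOP_HOME/bin:$PATH

# Generate the per-job Hadoop configuration; pass it the config directory and the
# location of each node's local scratch disk (flags assumed, verify for your version)
myhadoop-configure.sh -c $HADOOP_CONF_DIR -s /scratch/$USER

# Start HDFS and the MapReduce daemons across the nodes in this job
start-all.sh
```

Once start-all.sh returns, the usual hadoop dfs and hadoop jar commands behave as they would on a dedicated Hadoop cluster for the remainder of the batch job.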
Requirements

Hadoop clusters have fundamentally different architectures than traditional supercomputers due to their workloads being data-intensive rather than compute-intensive. However, any compute cluster that has
- a local scratch disk in each node†,
- some form of TCP-capable interconnect (1gig, 10gig, or InfiniBand), and
- a supported resource manager (currently Torque, SLURM, or SGE)

can spin up Hadoop clusters using myHadoop.
† This is a lie. I reimplemented myHadoop's "persistent mode" support so that you can also run HDFS off of a durable networked parallel filesystem like Lustre and keep your HDFS data around even when you haven't got a Hadoop cluster spun up. This undercuts the benefits of HDFS, though, so it's not the recommended approach; a rough sketch of what that invocation looks like follows.
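If you do want to experiment with persistent mode, the configure step gains an extra option pointing at the shared filesystem. The -p flag and the Lustre path below are my assumptions, so verify the exact option name against myhadoop-configure.sh's usage output or the myHadoop documentation:

```bash
# Persistent mode: keep HDFS state on a shared parallel filesystem (e.g., Lustre)
# so it survives after this job's Hadoop cluster is torn down.
# The -p flag is an assumption; check the script's usage message for the real option.
export HADOOP_CONF_DIR=$HOME/hadoop-conf.persistent
myhadoop-configure.sh -c $HADOOP_CONF_DIR -s /scratch/$USER -p /lustre/$USER/hdfs_state
start-all.sh
```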