Friday, August 3, 2012

August Houston Hadoop Meetup - Hands-on with Hortonworks

The slides for this presentation are found here. My goal was to play with the Hortonworks distribution of Hadoop, called HDP, which stands for Hortonworks Data Platform. I also wanted to see what would be involved in porting my eDiscovery (legal) application, called FreeEed, which uses generic Hadoop and runs on EC2, to HDP.

(I had to learn to spell HDP, because my fingers tend to type HPD, which stands for Houston Police Department. I searched for HPD and fixed every occurrence.)

There were many things on the Hortonworks web site that I liked, in particular, the short videos. I also understood the positioning of Hortonworks, "We are new as a company, but we are the birthplace of Hadoop, and these same Yahoo people are now working at Hortonworks. We are also all completely open source." I liked the Talend Open Studio, featured on the site.

Then I started the real work. HDP works only on RedHat or CentOS. That was not a problem, since on EC2 I could get any flavor of it. I chose the basic one. Note that it comes with only 6 GB of hard drive. For a quick test I kept this, but watched the hard drive with df throughout the installs. For more serious use, I would recommend resizing the drive. Here are instructions on how to do it: http://labs.thredup.com/resizing-the-root-disk-on-a-running-ebs-boot. I've used them successfully many times before. A little setup of your dev machine is required, but that is only done once.
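If you would rather not keep re-running df by hand, the same check can be scripted. Here is a minimal sketch in Python, using only the standard library. The 1 GiB warning threshold and the function names are my own assumptions for illustration, not anything that HDP requires.

```python
import shutil


def free_gb(path="/"):
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 1024**3


def has_room(path="/", min_free_gb=1.0):
    """Return True if at least `min_free_gb` GiB are free.

    The default threshold is an arbitrary assumption; pick one that
    fits the size of the packages you are about to install.
    """
    return free_gb(path) >= min_free_gb


if __name__ == "__main__":
    print(f"free on /: {free_gb('/'):.2f} GiB")
    if not has_room("/", min_free_gb=1.0):
        print("warning: the root volume is nearly full")
```

Run it between install steps, or from a loop, on the machine you are installing to.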

Here is an important note: if you are installing remotely, like I did on the EC2 machine, you don't need GUI access to that machine. Just start installing HMC, the configuration manager, and it will start a web server. Then you can point your browser to this remote web server and continue the install from there.
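If the browser cannot reach the installer, it helps to first check whether the port is reachable at all from your machine (on EC2, the security group has to allow it). Below is a small sketch of such a check. The host name and port in it are placeholders that I made up; substitute your instance's public DNS name and whatever port the installer reports.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


if __name__ == "__main__":
    # Placeholder values, not the real installer address.
    print(port_open("your-ec2-host.invalid", 80))
```

A result of False only tells you that the TCP connection failed; it does not say whether the cause is the firewall, the security group, or the web server not running yet.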

You get a choice of many Hadoop services that you can install. As you can see from my slides, something did not work when installing HBase. I did not bother fixing that (I am sure it was possible), but instead chose to re-install without HBase. It worked. This was more important for me - I wanted to know if the install can be repeated, and yes, it can.

See the rest of the setup in the screenshots on the slides.

My next two questions were: are there instructions for doing all of this automated setup manually, and how do I re-create my own Java-based software that installs Hadoop? Rohit of Hortonworks answered both: they are writing instructions for manual setup, for those people who have their own Puppet, Chef, and other automated environments. With my eDiscovery setup, I will have to wait until those instructions are out, and then repeat the Cloudera steps, but with HDP.

The meetup itself was great as always: some very deeply technical people who work with Hadoop or related technology, a very tech savvy investment manager, and some new people who want to jump into the sea of Big Data. Thank you, all!
