SHMsoft blog: 2014

Tuesday, December 23, 2014

Hadoop goes to Harvard

There is a community of Big Data experts called Experfy, and it is "Made in Boston" and backed by Harvard Innovation Lab. They do Hadoop there, and have pretty interesting projects. This would make Hadoop quite happy.

Wednesday, December 17, 2014

Packt’s $5 eBonanza returns

Just got this news from Packt (of course, this includes our recently published book):

Following the success of last year’s festive offer, Packt Publishing will be celebrating the Holiday season with an even bigger $5 offer. From Thursday 18th December, every eBook and video will be available on the publisher’s website for just $5. Customers are invited to purchase as many as they like before the offer ends on Tuesday January 6th, making it the perfect opportunity to try something new or to take your skills to the next level as 2015 begins.

With all $5 products available in a range of formats and DRM-free, customers will find great value content delivered exactly how they want it across Packt’s website this Xmas and New Year.

Find out more at www.packtpub.com/packt5dollar

#packt5dollar

Media Contact: sam@packtpub.com

About Packt Publishing

Founded in 2004 in Birmingham, UK, Packt’s mission is to help the world put software to work in new ways, through the delivery of effective learning and information services to IT professionals. Working towards that vision, we have published over 2000 books and videos so far, providing IT professionals with the actionable knowledge they need to get the job done - whether that’s specific learning on an emerging technology or optimizing key skills in more established tools.

As part of our mission, we have also awarded over $1,000,000 through our Open Source Project Royalty scheme, helping numerous projects become household names along the way.

Wednesday, November 26, 2014

Big Data Cartoon - What is text analytics?

Analytics may be the next big thing in Big Data, but it is very hard to define what it really is. Firstly, this word shows as misspelled in the browser and in Word or OpenOffice. Secondly, it's too vague and nebulous. As always, when in doubt, we turn to our illustrator, RK, who can enlighten us with a simple-to-understand cartoon that even data scientists can get.

Thursday, November 20, 2014

Big Data Cartoon - What's the latest and greatest in Hadoop?

What's the latest and greatest in Hadoop? Ask this question, and many people will say "Real-time" and point to Spark. Look at the two-day seminar at Berkeley's AMPLab going on right now, for example.

But what is Spark, really? What are those RDDs? They stand for Resilient Distributed Datasets, but does that make it any clearer? We asked our illustrator to clarify this, and hopefully we got it explained.
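For readers who prefer code to cartoons, here is a toy model in plain Python. It is not Spark's API: the class name ToyRDD and all of its methods are made up for illustration. It only tries to show the two ideas behind the name. The data is split into partitions (the "distributed" part), and each dataset remembers how it was derived from its parent, so a lost partition can be recomputed rather than restored from a backup (the "resilient" part).

```python
class ToyRDD:
    """A toy stand-in for an RDD: partitions plus the recipe (lineage) that made them."""

    def __init__(self, partitions=None, parent=None, transform=None):
        self._source = partitions      # only set for the root dataset
        self._parent = parent          # lineage: the dataset this one came from
        self._transform = transform    # lineage: function applied to a parent partition
        self._cache = {}               # computed partitions, which may be "lost"

    def num_partitions(self):
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions()

    def map(self, fn):
        # Lazy: nothing is computed here, only the recipe is recorded.
        return ToyRDD(parent=self, transform=lambda part: [fn(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda part: [x for x in part if pred(x)])

    def compute(self, index):
        """Return one partition, recomputing it from lineage if it is not cached."""
        if index not in self._cache:
            if self._parent is None:
                self._cache[index] = list(self._source[index])
            else:
                self._cache[index] = self._transform(self._parent.compute(index))
        return self._cache[index]

    def lose_partition(self, index):
        """Simulate a node failure that wipes out one computed partition."""
        self._cache.pop(index, None)

    def collect(self):
        result = []
        for i in range(self.num_partitions()):
            result.extend(self.compute(i))
        return result


numbers = ToyRDD(partitions=[[1, 2, 3], [4, 5, 6]])
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

first = even_squares.collect()       # this is where the work actually happens
even_squares.lose_partition(1)       # pretend a worker died
second = even_squares.collect()      # partition 1 is rebuilt from its lineage
print(first, second)
```

Both calls to collect() give the same answer, which is the whole point: losing a partition costs some recomputation, not the data.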

Wednesday, November 12, 2014

Announcing HBase Design Patterns Book

Happy to announce the "HBase Design Patterns" book, by Mark Kerzner and Sujee Maniyam. The book just went into production and can be pre-ordered using this link: https://www.packtpub.com/big-data-and-business-intelligence/hbase-design-patterns.

The book offers HBase and NoSQL developers practical guidance in designing and implementing real-world applications. Subjects covered include:

Various HBase install options
Single entity tables
Key generation
Storing large files
Dealing with time series data
Advanced modeling
Performance optimization
A number of labs and exercises

The book is based on the authors' own work and research, and on the experience gained in writing the open source book "Hadoop Illuminated." Oh, and did we forget to mention cartoons by RK? Each chapter has at least one.

Cheers,

Mark & Sujee

Tuesday, November 11, 2014

An excellent presentation by Rohit Jain about the exciting new open source product Trafodion

Rohit Jain drove from Austin and presented Trafodion (Welsh for "Transaction"), pronounced "Travodion" - for those in the know. Rohit is an HP Database Distinguished and Chief Technologist. The breadth and depth of his knowledge are amazing.

In turn, the audience did not disappoint. Houston is getting its Big Data people by importing them: Cloudera, Hortonworks, and DataStax were all represented.

Pizza was sponsored by HP - thank you - and Rohit has already uploaded the slides to the Meetup. Here are the main slides and the architecture, with this note from Rohit: "There was interest in the Trafodion Distributed Transaction Management (DTM) architecture. However, it is a bit dated. Since this presentation, DTM has now been implemented as HBase co-processor code & THLOG has been integrated with the HBase HLOG."

My comment: I started Houston Hadoop Meetup in 2010, with the expectation of an imminent Big Data Boom in Houston. I am still expecting. This was, though, the first meetup where we had active Big Data professionals, but they were all imported, as I said, from Big Data companies. We have yet to see native Houstonians and Houston companies doing Big Data. Again, it's coming, and our meetup is one of the focal points.

Monday, October 27, 2014

Big Data Cartoon: Big Data needs big muscle

Inspired possibly by this cartoon in The New Yorker, our illustrator has set out to tell us that being in Big Data, you travel a lot, and of course avail yourself of the exercise facilities found in each and every hotel. My latest was a packed gym in downtown San Francisco.

Lately, I've been noticing that trainers at Elephant Scale have been gaining muscle weight.

Tuesday, September 30, 2014

Got an Ubuntu laptop!

Quite powerful and good-looking, from System76. (It is the one in the middle.) Now I have a chance to be productive while traveling or working at a friend's place.

I am planning to add Windows in a VM, stay tuned...

Sunday, September 7, 2014

Big Data Cartoon: NY is new Silicon Valley

Silicon Valley may be the leader in Big Data, but when you compare it to New York, its lead is underwhelming: Indeed.com gives 994 Hadoop jobs in NY, and 1719 in Silicon Valley.

What's more, if you are a financial startup, then you simply must be in New York. You might have an office in TechSpaces in SF, but that's about it. This is fully supported by our illustrator and cartoon author, whose new residence is now appropriately in Manhattan.

Silicon Valley, pay attention!

Thursday, July 24, 2014

FreeEed does Concordance (R)

The latest release of FreeEed (V4.4) allows import into Concordance (R) eDiscovery management software. Here are the instructions.

It also contains a number of fixes. You can use FreeEed in so many ways:

Start a FreeEed server on Amazon, no hardware needed;
Download a virtual machine to your workstations;
Install in Windows, Linux, or Mac.

Download page: here. And all of the popcorn advantages still apply.

PS. Sneak preview: we are working on a document processing engine for today's 3V's - volume, velocity, variety. It is 10-100 times faster, and allows dynamic data sources.

Wednesday, July 2, 2014

Run FreeEed in the cloud, no downloads or hardware needed

Hi, all,

now we have another option to run FreeEed: on Amazon AWS cloud. There are three steps: (1) start the server, (2) connect to it with X2GO, (3) download and unzip the latest software. The rest of the environment is already prepared for you. Check it out here.

We are planning regular webinars teaching this setup; please write to let us know of your interest.

Cheers,
FreeEed team

Sunday, June 22, 2014

I am a reviewer on Apache Solr High Performance book

As always, I acknowledge my colleagues, my friend and partner Sujee, and my multi-talented family.

Next time, more of my friends who always help.

Friday, June 13, 2014

Houston Hadoop Meetup - Marco Vasquez presents Apache Spark

Invited speaker Marco Vasquez told the group about his work as a Data Scientist at MapR, and his use of Spark for this purpose. Thanks to YARN in Hadoop 2, Spark has become a part of every major distribution, either as a release or as an early preview.

The group was quite technical and asked a lot of detailed questions. Thanks to everyone, and to MapR for sponsoring the pizza and drinks.

And here are the slides: http://www.slideshare.net/MapRTechnologies/spark-v1

Thursday, June 12, 2014

SHMsoft, Inc. Offers FreeEed as Pre-packaged Open Source Software for eDiscovery

For immediate release

SHMsoft, a leader in open source software for eDiscovery, is pleased to announce its latest offering - a complete eDiscovery application, with all components pre-installed and integrated together. This is often explained using a popcorn metaphor: a corn is a lawsuit, FreeEed is a popcorn maker, and processing is adding the lawsuit (corn) to the popcorn maker (FreeEed).

Read on...

Monday, June 9, 2014

Cartoon: Hadoop-based search

Hadoop is a perfect platform for search: it is big and strong, and attentive to details. However, don't just take my word for it. Here is the "follow the money" hint: Elasticsearch, the company behind Elasticsearch, Logstash, and Kibana, announced $70 million in Series C financing.

Our Hadoop-based legal search, FreeEed, is also witnessing increased adoption, by law firms looking for eDiscovery alternatives, and by IT departments and government agencies who "search for open source legal search" :)

Friday, June 6, 2014

I win another bet

As my friends and students know, I like to make a bet with them, at any time, that there will be some new Big Data development within 30 days of the bet.

I think this one qualifies quite well: ElasticSearch just announced that they got funded to the tune of 70 million US dollars: http://www.elasticsearch.com/blog/press/elasticsearch-raises-70-million-series-c-financing/

So why is this big? It shows that it is not only Big Data infrastructure companies that matter, like Cloudera, which got about 1 billion dollars a month ago; more vertically oriented startups are just as important.

Another bet can start today - anyone?

How to build a Hadoop cluster on AWS

Below are some excerpts from a book I am writing. Since this seems to be a matter of general interest, I decided to put this in a blog.

Very often people need to build a Hadoop cluster for work or for fun. There is nothing better than borrowed powerful hardware for this (provided that you don't forget to shut the cluster down when you are done), so head directly to the Amazon AWS console.

Real time Hadoop

Real time Hadoop is all the rage, with Storm, Spark, Shark, and a plethora of other products, initiatives and events. However, it may be hard to visualize, for example, when you do a "bring your child to work" day. That is where our cartoonist comes into the picture and explains it very clearly: it is an elephant surfing in the clouds. See for yourself. In fact, don't forget our complete Hadoop coloring book for kids.

Sunday, May 11, 2014

Hadoop bootcamp in Dallas this past Friday was a big success

Just look at all the wonderful students. We got to teach one day of the Global Big Data Conference. By now, with our experience of doing the same in Austin and in Santa Clara, it came out really impressive. We covered theory and practice, HDFS and MapReduce, did the labs, and even constructed about sixty individual clusters: each student did his or her own.

From the early morning flight, to the sessions, lunch, snacks, and work - it was a ball!

At the OTC (Offshore Technology Conference)

The OTC is huge; last year it was visited by some 110,000 people. It is hard to get to: parking, walk, shuttle, expo. But then you are between buzzing exhibits; each country is given its own island. I liked quite a few, some had interesting Big Data sets - these got my "Big Data in Oil and Gas" brochure. At the Chinese company I got a Texas hat. Will surely do more next year.

All became aware of our "Oil and Gas training workshop," as described in our PR, specifically connected to OTC.

Monday, April 14, 2014

We will be presenting FreeEed at a SNIA conference here, so we just updated our slide presentation. We thought you would enjoy the more generic form of it, without too much technical detail, so here is the link.

Tuesday, April 8, 2014

Elephant Scale is Building on the Success of its First Houston Hadoop Bootcamp

Elephant Scale, a provider of Big Data training, implementations, and vertical Hadoop product applications, is pleased to announce that it has successfully completed the first Houston Hadoop Bootcamp.

Lessons learned, future plans and student feedback? - Just go here.

Sunday, April 6, 2014

Hadoop is cool and rich, more than ever

How so? Cloudera gets an "insane amount" of 900M investment, and HortonWorks - 100M. What does this mean? - Depends on who you ask. Reuters review just talks about big valuations and hints at a need for caution. However, my favorite reviewer Matt Asay has a more radical point of view: it is "a declaration of war. War on incumbent data infrastructure providers, but it's also war on an increasingly outdated way of thinking about data."

Who is right? We asked our artist, and the opinion came back: Hadoop is cool, way cool.

Monday, March 31, 2014

Houston Hadoop Bootcamp was a real success

What was so amazing about our March 28-30 bootcamp? A number of things:

We collected more than twenty students altogether (with some remotes and some rescheduling). In the place where nobody could do it (some large companies tried) - there we were able to do it! Houston is just beginning its Big Data journey, and we at Elephant Scale may well be the catalyst taking it forward.
We managed to go through the complete training agenda: HDFS, MapReduce, Pig, Hive, HBase, theory and practice. We packaged more real experience here than regular training programs accomplish in ten days. At the end we ran a real mini-hackathon, and each of the three teams was able to complete the SmartMeters project: find smart meter data, download it, collect with Flume into HDFS, and analyze it with Hive.
We have formed real friendships, and our team will surely continue to maintain close ties through our LinkedIn Group, Houston Big Data: share news, share leads, and perhaps even work together. The fact that we took the whole group to lunch every day, at nearby Pappasito's or Pappadeaux, may also have played some role :)

Of course, we have lessons to learn and things to improve, but overall it was an unbelievable experience for all, students and trainers alike. We are already thinking of the next bootcamp on May 2, and of taking Houston Hadoop Bootcamp to Chicago and to Washington, DC.

Our students are proud of their "Excellence in Hadoop" diploma by Elephant Scale, which they can substantiate with real knowledge.

One student, Guadalupe Hernandez, had this to say, "The bootcamp delivered on all promises and more. The experience was challenging and invigorating!"

Another student, Lila Ghemri, commented: "This was a great experience and great people to learn from and work with. Thanks to all."

Student Buu Vinh said, "It was an incredible experience. Everyone was highly motivated, hard working and helpful to each other in a 3-day super charged weekend. Mark Kerzner and Manish Mehndiratta were knowledgeable and willing to deep dive into questions, or lab work and still were able to complete the entire agenda for the bootcamp. AWESOME!"

Tuesday, March 18, 2014

Cartoon: Is Hadoop ready for the enterprise? - Yes and no

On the one hand, you have this view: "Big Data Adoption A Big Headache For Some Companies" (thanks to my friend Jeff for pointing this out). The reservations are summarized in the cartoon: it takes highly valuable and rare skills, the overall expenses are significant, and management's expectations may be too high.

However, this quote from Mike Tuchen, chief executive of Talend, made my tweet about it quite popular.

"Once a developer learns how to work with Hadoop and (other Big Data software) then their salary doubles, so it gets to be relatively unaffordable for a smaller company," he said. "Bigger companies not only can afford that, but they also can take their time to train their existing developers, and they are willing to make that investment."

In fact, every negative on the list may be viewed as a positive. For example, there is nothing wrong with high expectations, if you can deliver.

On the other hand, Forbes leaves no doubt that Hadoop is ready for the enterprise and lists five reasons why it is so: the latest version is very fast, data is growing, best practices are here, many companies support it (of course Elephant Scale is one - just kidding), and there are hosts of open source developers. So stopping Hadoop is like stopping Linux.

Who is right? Both, of course, have their points.

Monday, March 10, 2014

Big Data Cartoon - Data is the New Oil

People often say that "Data is the New Oil." You have Kaggle predicting where and how you will find oil, but you also hear that data is the new gold. This is confusing enough, so we decided to just let our illustrator explain it to us - and here is the result.

Friday, March 7, 2014

Working on the first-ever Hadoop Bootcamp in Houston

Why is Houston special? There is very little Big Data going on in Houston now, and many tried but failed to have a course here. We are succeeding because we are local, and because of our ties with the Houston community.

http://elephantscale.com/bootcamp

Tuesday, March 4, 2014

Cartoon - Linus Torvalds and the invention of Git

Linus Torvalds is an amazing man, and without him much of Big Data development would not be where it is now. This is our artist's homage to Linus.

Friday, February 28, 2014

Announcing FreeEed VM eDiscovery appliance

Hi, friends,

we have packaged all the goodies of FreeEed into a VirtualBox machine, so no more install hassles. This includes the new release 4.2.0 with all the bug fixes and enhancements. Future plans? Adding data collection and advanced analytics tools.

Cheers, all, and write back!

Oh, and the download is here, http://freeeed.org/index.php/download

Tuesday, February 25, 2014

Removing Hadoop Limitations By Using Persistent Memory Kove® XPD® Device

Mark Kerzner (mark@shmsoft.com), Greg Keller (greg@r-hpc.com), Ivan Lazarov (ivan.lazarov@shmsoft.com)

Abstract

A Hadoop cluster stores its most vital information in the RAM of the NameNode server. Although this architecture is vital to fast operation, it represents a single point of failure. To mitigate this, the NameNode’s memory is regularly flushed to hard drives. Consequently, in case of a failure, it takes many hours to restore the cluster to operation. Another limitation is imposed on Hadoop by the size of the RAM on the NameNode server: Hadoop can store only as much data (files and blocks) as the number of descriptors that can fit in the memory of the NameNode.

The new Hadoop architecture described in this paper removes the size limitation and greatly improves the uptime by running the Hadoop NameNode on a persistent memory device, Kove (www.kove.com) XPD. Its advantages are: no limit on the number of files, no limit on the size of the cluster storage, and faster restore times. The performance of the ruggedized Hadoop cluster is on par with the standard Hadoop configuration.

The Hadoop XPD driver software that achieves this operation is open source, and is freely available for download.
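To get a feel for the RAM limitation mentioned in the abstract, here is a back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object (a file, a directory, or a block). That figure does not come from this paper, and real numbers vary with the Hadoop version and the length of file names. The function names are made up for the illustration.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not a measured value


def namenode_heap_bytes(num_files, blocks_per_file=1, num_dirs=0):
    """Rough NameNode heap needed to hold the given namespace."""
    objects = num_files + num_files * blocks_per_file + num_dirs
    return objects * BYTES_PER_OBJECT


def max_files_for_heap(heap_bytes, blocks_per_file=1):
    """Rough number of files that fit into a NameNode heap of the given size."""
    objects_per_file = 1 + blocks_per_file   # the file itself plus its blocks
    return heap_bytes // (objects_per_file * BYTES_PER_OBJECT)


GIB = 1024 ** 3
print(namenode_heap_bytes(10_000_000))       # heap for ten million small files
print(max_files_for_heap(64 * GIB))          # files that fit into a 64 GiB heap
```

Under these assumptions, ten million single-block files already need about 3 GB of heap, which is why the number of files, and not just their total size, is what runs into the NameNode's memory ceiling.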

Cartoon: EDW vs Hadoop

Which is better, Enterprise Data Warehouse or Hadoop? Here is our artist's simple answer.

Saturday, February 15, 2014

Cartoon - the rise of the data scientist

According to many - and here is an article on this that I liked - data scientist is one of the most in-demand professions. For each three or four open positions, there will be only one data scientist in the coming ten years. But this is not all. Being a data scientist also makes you popular at parties, as evidenced by this cartoon.

Incidentally, we at Elephant Scale teach this vital skill in our course, see here.

Houston Hadoop Meetup - Nutch on Hadoop + crawling protected web sites

We had a wonderful turnout and a great crowd. Thirty-one RSVPs, and close to 300 members. We also discussed our plans for the upcoming Houston Hadoop Bootcamp.

The slides are here. See you all next time.

Monday, February 10, 2014

Big Data cartoons - Hadoop (™) bootcamp

Everybody knows, there is no Big Data in Houston as yet. That's why the Hadoop (™) bootcamp here is especially Big News. As every self-respecting Texan will tell you, our Big Data is way bigger than everybody else's.

We will talk for the first time about it at Houston Hadoop Meetup in two days, on Wednesday. Look here for yet more.

Sunday, February 9, 2014

A review on a new book, about Flume

The full title of the book is "Apache Flume: Distributed Log Collection for Hadoop," and indeed it covers "what you need to know," just as it promises. I left my review on Amazon here, and generally find it useful.

I am a reviewer on a new book about Nutch

The title of the book is "Web Crawling and Data Mining with Apache Nutch," and I am the reviewer on it. I have also written a review for Amazon. The gist? - Treat the book as the first step, read through the installation guides, decide what you want to continue with, and then you are on your own - and report back your achievements :)

Friday, February 7, 2014

FreeEed survey results

Hi, all friends of the open source eDiscovery project FreeEed! We got great feedback from our users, and here is what they want:

Easier to use search features
Email threading
Maintained archive of processed files (especially PST) for repeated searches.
Social media analytics
iCONECT integration - export to iCONECT XML to simplify loading into iCONECTnXT and XERA

This is an awesome list, and we will be working on it.

Cheers,

FreeEed Team

Monday, February 3, 2014

Big Data Cartoon - Big Data can be overwhelming

Big Data has become Big Business in 2013 - you read about it everywhere. I read it in SD Times. But sometimes it can become so overwhelming that I just leave it to the artist to explain. Please enjoy the cartoon.

Sunday, February 2, 2014

Big Data, Hadoop, and NoSQL Testing

By Mark Kerzner and Sujee Maniyam, Elephant Scale LLC

Abstract

In this paper we discuss best practices and real-world testing strategies for Big Data, Hadoop, and NoSQL. The subjects of testing and software correctness take on an even more important role in the world of Big Data, and that is why taking them into account throughout the project lifetime, from design to implementation to maintenance, is paramount. We discuss the Maven project organization, the test modules, the use of mock frameworks, and the TestSuite design pattern. All these serve to factor out extensive copy/pasting into the framework, and in this way to make the projects less error-prone and to improve code quality.
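The point about factoring copy/paste out of the tests can be illustrated with a small sketch. The paper itself works with Maven and JUnit; the sketch below uses Python's standard unittest and unittest.mock only to stay short and self-contained. All names in it (count_rows, StoreTestBase, the store's scan method) are hypothetical. Shared setup lives in one base class, a mock stands in for the real data store, and a suite function collects the test cases in one place.

```python
import unittest
from unittest import mock


def count_rows(store, table):
    """Code under test: counts rows returned by a (hypothetical) store.scan()."""
    return sum(1 for _ in store.scan(table))


class StoreTestBase(unittest.TestCase):
    """Shared setup, written once instead of pasted into every test class."""

    def setUp(self):
        self.store = mock.Mock()   # stands in for a real HBase/NoSQL client
        self.store.scan.return_value = iter(["row1", "row2", "row3"])


class CountRowsTest(StoreTestBase):
    def test_counts_all_rows(self):
        self.assertEqual(count_rows(self.store, "users"), 3)


class ScanCallTest(StoreTestBase):
    def test_scans_requested_table(self):
        count_rows(self.store, "users")
        self.store.scan.assert_called_once_with("users")


def suite():
    """Collect the test cases in one place, in the spirit of the TestSuite pattern."""
    loader = unittest.TestLoader()
    tests = unittest.TestSuite()
    tests.addTests(loader.loadTestsFromTestCase(CountRowsTest))
    tests.addTests(loader.loadTestsFromTestCase(ScanCallTest))
    return tests


result = unittest.TextTestRunner(verbosity=0).run(suite())
```

Because the mock replaces the cluster, tests like these run on a laptop in milliseconds; tests against a real cluster belong to the system integration stage.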

Table of contents

Introduction
Project organization for test-ability
JUnit single unit tests
Test modules
A word on Scala, Scalding and Kiji
System integration testing
Conclusion: lessons and further direction

Introduction

Software testing is one of the most important yet often neglected parts of software development. For this reason, developers have created a list of the 20 most popular responses to give when their software fails the tests. Here they are:

20. "That's weird..."
19. "It's never done that before."
18. "It worked yesterday."

Review on "Cassandra Design Patterns" book

A new book by Packt, on which I am a reviewer. Also, see my Amazon review for it here.

Sunday, January 26, 2014

Big Data Cartoon - Announcing Hadoop (TM) Coin

With the Bitcoin market cap at a reported 12B, and with the Hadoop yearly market targeting 4B by 2017, this is an apples-to-oranges comparison, and it is as hard to decide between the two as to answer the old children's question, "if an elephant steps on a whale, who will win?"

For this reason we decided to create a completely unauthorized Hadoop coin and offer it to the world. (Please keep in mind that Apache Hadoop is a trademark which we are only using here, not suggesting that we have the authority to speak on its behalf).

How to generate Hadoop coins? - much simpler than Bitcoins: all you need to do is forward this link or letter to a friend, and you have sent a Hadoop coin. It is eco-friendly, wasting no paper, and perhaps increasing the world's internet traffic by a paltry 1%. Thus, the possession of these coins provokes no envy and perhaps negates the ancient wisdom that "one who wants money will never have enough money." I, for one, can stare at this coin for a long time.

Tuesday, January 21, 2014

Big Data cartoon - a day in the life of a Big Data startup

Have you ever been a part of a startup? The normal worries: the team, the investors, the payroll, the plans?

Now, in a Big Data startup it's magnified: Big Data is unwieldy, it may be costly, your Amazon clusters have been running idle for a week, your monthly bill is in the thousands, and your board is questioning you on the slow progress. Add to this that your HBase code works on a single node but not on a cluster, and you've got a perfect storm.

Wednesday, January 15, 2014

Hadoop cartoons - one day in the life of a Big Data developer

What is one to do when his HBase application hangs, regardless of how he connects to the ZooKeeper? - Banging your head on the keyboard helps.

The possible reasons for this are:

ZooKeeper/HBase configuration problems (check them out independently with 'hbase shell' and zkCli.sh)
HADOOP_CLASSPATH not configured correctly
Running tests from Maven may also give you problems.

When the output of 'hadoop classpath' is a few pages long and you run outside of Maven using this classpath, it will work.
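The first two checks on the list can also be scripted. Here is a minimal sketch under a few assumptions: ZooKeeper answers its four-letter 'ruok' command with 'imok' on the client port (2181 by default; newer ZooKeeper releases may require this command to be enabled explicitly), and HADOOP_CLASSPATH is a list of paths joined by the operating system's path separator. The function names are made up, and wildcard entries are checked only by their directory.

```python
import os
import socket


def zookeeper_is_ok(host="localhost", port=2181, timeout=3.0):
    """Send ZooKeeper's 'ruok' command; a healthy server answers 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"ruok")
            reply = conn.recv(16)
    except OSError:
        # Connection refused, timeout, unknown host: all count as "not ok".
        return False
    return reply == b"imok"


def missing_classpath_entries(classpath=None):
    """List HADOOP_CLASSPATH entries that do not exist on disk."""
    if classpath is None:
        classpath = os.environ.get("HADOOP_CLASSPATH", "")
    missing = []
    for entry in classpath.split(os.pathsep):
        if not entry:
            continue
        # For wildcard entries such as /opt/hbase/lib/*, check the directory.
        path = entry[:-1] if entry.endswith("*") else entry
        if not os.path.exists(path):
            missing.append(entry)
    return missing
```

Calling zookeeper_is_ok() and missing_classpath_entries() from the machine where the application hangs tells you quickly whether to look at ZooKeeper or at the classpath, before you reach for the keyboard with your head.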

Thursday, January 9, 2014

Hadoop Operations and Cluster Management Cookbook from Packt

I am a reviewer on this book, and here is what I say on Amazon about it:

The book talks about every aspect of Hadoop administration: choice of hardware/software, installation of Hadoop and all the tools, Pig, Hive, Mahout, etc. There are chapters on maintenance and monitoring. Lots of screen shots and command-line instructions.

I wish the book showed the latest developments in these areas, which are Cloudera Manager, HortonWorks Ambari, etc., which make it all ridiculously simple. However, when those managers fail or are not supported, you are still back on the command line, so this approach definitely has its place.

I especially liked the monitoring chapters: Nagios, Ganglia, and Ambari.