SHMsoft blog: February 2014

Friday, February 28, 2014

Announcing FreeEed VM eDiscovery appliance

Hi, friends,

We have packaged all the goodies of FreeEed into a VirtualBox machine, so no more install hassles. This includes the new release 4.2.0 with all the bug fixes and enhancements. Future plans? Adding data collection and advanced analytics tools.

Cheers, all, and write back!

Oh, and the download is here: http://freeeed.org/index.php/download

Tuesday, February 25, 2014

Removing Hadoop Limitations By Using Persistent Memory Kove® XPD® Device

Mark Kerzner (mark@shmsoft.com), Greg Keller (greg@r-hpc.com), Ivan Lazarov (ivan.lazarov@shmsoft.com)

Abstract

A Hadoop cluster stores its most vital information in the RAM of the NameNode server. Although this architecture is essential for fast operation, it represents a single point of failure. To mitigate this, the NameNode's memory is regularly flushed to hard drives. Even so, in case of a failure, it takes many hours to restore the cluster to operation. Another limitation is imposed on Hadoop by the size of the RAM on the NameNode server: Hadoop can store only as many files and blocks as there are descriptors that fit in the memory of the NameNode.
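To get a feel for the RAM limitation, here is a rough back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file, directory, or block); that figure, the function names, and the one-block-per-file default are illustrative assumptions, not numbers from this paper.

```python
# Assumption (rule of thumb, not from the paper): each namespace object
# (file, directory, or block) costs roughly 150 bytes of NameNode heap.
BYTES_PER_OBJECT = 150


def namenode_heap_bytes(num_files, blocks_per_file=1,
                        bytes_per_object=BYTES_PER_OBJECT):
    """Estimate the NameNode heap needed to track num_files files."""
    # One descriptor for the file itself plus one per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


def max_files_for_heap(heap_bytes, blocks_per_file=1,
                       bytes_per_object=BYTES_PER_OBJECT):
    """Estimate how many files fit into a given NameNode heap."""
    return heap_bytes // ((1 + blocks_per_file) * bytes_per_object)


if __name__ == "__main__":
    gib = 1024 ** 3
    print("Heap for 100 million single-block files (GiB):",
          round(namenode_heap_bytes(100_000_000) / gib, 1))
    print("Files that fit in a 64 GiB heap:",
          max_files_for_heap(64 * gib))
```

Under these assumptions the number of files is capped by the heap size no matter how much disk the DataNodes have, which is the limitation described above.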

The new Hadoop architecture described in this paper removes the size limitation and greatly improves the uptime by running the Hadoop NameNode on a persistent memory device, Kove (www.kove.com) XPD. Its advantages are: no limit on the number of files, no limit on the size of the cluster storage, and faster restore times. The performance of the ruggedized Hadoop cluster is on par with the standard Hadoop configuration.

The Hadoop XPD driver software that achieves this operation is open source, and is freely available for download.

Cartoon: EDW vs Hadoop

Which is better, Enterprise Data Warehouse or Hadoop? Here is our artist's simple answer.

Saturday, February 15, 2014

Cartoon - the rise of the data scientist

According to many - and here is an article on this that I liked - data scientist is one of the most in-demand professions. Over the coming ten years, there will be only one data scientist for every three or four open positions. But this is not all. Being a data scientist also makes you popular at parties, as evidenced by this cartoon.

Incidentally, we at Elephant Scale teach this vital skill in our course, see here.

Houston Hadoop Meetup - Nutch on Hadoop + crawling protected web sites

We had a wonderful turnout and a great crowd. Thirty-one RSVPs, and close to 300 members. We also discussed our plans for the upcoming Houston Hadoop Bootcamp.

The slides are here. See you all next time.

Monday, February 10, 2014

Big Data cartoons - Hadoop (™) bootcamp

Everybody knows there is no Big Data in Houston yet. That's why the Hadoop (™) bootcamp here is especially Big News. As every self-respecting Texan will tell you, our Big Data is way bigger than everybody else's.

We will talk about it for the first time at the Houston Hadoop Meetup in two days, on Wednesday. Look here for more.

Sunday, February 9, 2014

A review of a new book about Flume

The full title of the book is "Apache Flume: Distributed Log Collection for Hadoop," and indeed it covers "what you need to know," just as it promises. I left my review on Amazon here, and I generally find the book useful.

I am a reviewer on a new book about Nutch

The title of the book is "Web Crawling and Data Mining with Apache Nutch," and I am a reviewer on it. I have also written a review for Amazon. The gist? Treat the book as the first step, read through the installation guides, decide what you want to continue with, and then you are on your own. And report back your achievements :)

Friday, February 7, 2014

FreeEed survey results

Hi, all friends of the open source eDiscovery project FreeEed! We got great feedback from our users, and here is what they want:

Easier to use search features
Email threading
Maintained archive of processed files (especially PST) for repeated searches
Social media analytics
iCONECT integration - export to iCONECT XML to simplify loading into iCONECTnXT and XERA

This is an awesome list, and we will be working on it.

Cheers,

FreeEed Team

Monday, February 3, 2014

Big Data Cartoon - Big Data can be overwhelming

Big Data has become Big Business in 2013 - you read about it everywhere. I read it in SD Times. But sometimes it can become so overwhelming that I just leave it to the artist to explain. Please enjoy the cartoon.

Sunday, February 2, 2014

Big Data, Hadoop, and NoSQL Testing

By Mark Kerzner and Sujee Maniyam, Elephant Scale LLC

Abstract

In this paper we discuss best practices and real-world testing strategies for Big Data, Hadoop, and NoSQL. Testing and software correctness take on an even more important role in the world of Big Data, which is why taking them into account throughout the project lifetime, from design to implementation to maintenance, is paramount. We discuss the Maven project organization, the test modules, the use of mock frameworks, and the TestSuite design pattern. All of these serve to factor extensive copy/pasting out into the framework, and in this way make the projects less error-prone and improve code quality.

Table of contents

Introduction
Project organization for testability
JUnit single unit tests
Test modules
A word on Scala, Scalding and Kiji
System integration testing
Conclusion: lessons and further direction

Introduction

Software testing is one of the most important yet often neglected parts of software development. For this reason, developers have created a list of the 20 most popular responses to give when their software fails the tests. Here they are:

20. "That's weird..."
19. "It's never done that before."
18. "It worked yesterday."
