Hi, friends,
we have packaged all the goodies of FreeEed into a VirtualBox machine, so no more install hassles. This includes the new release 4.2.0 with all the bug fixes and enhancements. Future plans? Adding data collection and advanced analytics tools.
Cheers, all, and write back!
Oh, and the download is here: http://freeeed.org/index.php/download
Friday, February 28, 2014
Tuesday, February 25, 2014
Removing Hadoop Limitations By Using Persistent Memory Kove® XPD® Device
Mark Kerzner (mark@shmsoft.com), Greg Keller (greg@r-hpc.com), Ivan Lazarov (ivan.lazarov@shmsoft.com)
Abstract
A Hadoop cluster stores its most vital information in the RAM of the NameNode server. Although this architecture is essential for fast operation, it represents a single point of failure. To mitigate this, the NameNode's memory is regularly flushed to hard drives. Consequently, in case of a failure, it can take many hours to restore the cluster to operation. Another limitation is imposed on Hadoop by the size of the RAM on the NameNode server: Hadoop can store only as many files and blocks as the number of descriptors that fit in the NameNode's memory.
The new Hadoop architecture described in this paper removes the size limitation and greatly improves uptime by running the Hadoop NameNode on a persistent memory device, the Kove (www.kove.com) XPD. Its advantages are: no limit on the number of files, no limit on the size of the cluster storage, and faster restore times. The performance of the ruggedized Hadoop cluster is on par with the standard Hadoop configuration.
The Hadoop XPD driver software that achieves this operation is open source, and is freely available for download.
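The descriptor limit described above can be made concrete with a rough back-of-the-envelope estimate. A commonly cited rule of thumb (our assumption here, not a figure from the paper) is roughly 150 bytes of NameNode heap per namespace object, that is, per file and per block:

```java
// Rough NameNode heap estimate: the NameNode tracks each file and each block
// as an in-memory object. The ~150-byte figure per object is a common rule
// of thumb, not an exact measurement.
public class NameNodeHeapEstimate {
    static final long BYTES_PER_OBJECT = 150; // assumed average per file/block

    // Returns the estimated NameNode heap in bytes for a namespace with the
    // given number of files, each split into the given number of blocks.
    static long estimateHeapBytes(long files, long blocksPerFile) {
        long objects = files + files * blocksPerFile;
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long files = 100_000_000L; // 100 million files
        long blocksPerFile = 2;    // e.g. ~250 MB files with 128 MB blocks
        long gb = estimateHeapBytes(files, blocksPerFile) / (1024L * 1024 * 1024);
        System.out.println("Estimated NameNode heap: ~" + gb + " GB");
    }
}
```

At 100 million files averaging two blocks each, the estimate already lands around 40 GB of NameNode heap, which illustrates why the namespace size is bounded by a single server's RAM.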
Sunday, February 23, 2014
Saturday, February 15, 2014
Cartoon - the rise of the data scientist
Incidentally, we at Elephant Scale teach this vital skill in our course; see here.
Houston Hadoop Meetup - Nutch on Hadoop + crawling protected web sites
We had a wonderful turnout and a great crowd: thirty-one RSVPs, and the group is now close to 300 members. We also discussed our plans for the upcoming Houston Hadoop Bootcamp.
The slides are here. See you all next time.
Monday, February 10, 2014
Big Data cartoons - Hadoop (™) bootcamp
Everybody knows there is no Big Data in Houston yet. That's why the Hadoop (™) bootcamp here is especially Big News. As every self-respecting Texan will tell you, our Big Data is way bigger than everybody else's.
We will talk about it for the first time at the Houston Hadoop Meetup in two days, on Wednesday. Look here for more.
Sunday, February 9, 2014
A review of a new book about Flume
The full title of the book is "Apache Flume: Distributed Log Collection for Hadoop," and indeed it covers "what you need to know," just as it promises. I left my review on Amazon here; in general, I found the book useful.
I am a reviewer on a new book about Nutch
The title of the book is "Web Crawling and Data Mining with Apache Nutch," and I was a technical reviewer on it. I have also written a review for Amazon. The gist? Treat the book as a first step, read through the installation guides, decide what you want to continue with, and then you are on your own. And report back your achievements :)
Friday, February 7, 2014
FreeEed survey results
Hi, all friends of the open-source eDiscovery project FreeEed! We got great feedback from our users, and here is what they want:
- Easier-to-use search features
- Email threading
- Maintained archive of processed files (especially PST) for repeated searches.
- Social media analytics
- iCONECT integration - export to iCONECT XML to simplify loading into iCONECTnXT and XERA
This is an awesome list, and we will be working on it.
Cheers,
FreeEed Team
Monday, February 3, 2014
Big Data Cartoon - Big Data can be overwhelming
Big Data became Big Business in 2013 - you read about it everywhere; I read about it in SD Times. But sometimes it can be so overwhelming that I just leave it to the artist to explain. Please enjoy the cartoon.
Sunday, February 2, 2014
Big Data, Hadoop, and NoSQL Testing
By Mark Kerzner and Sujee Maniyam, Elephant Scale LLC
Abstract
In this paper we discuss best practices and real-world testing strategies for Big Data, Hadoop, and NoSQL. Testing and software correctness take on an even more important role in the world of Big Data, which is why accounting for them throughout the project's lifetime, from design through implementation to maintenance, is paramount. We discuss Maven project organization, test modules, the use of mock frameworks, and the TestSuite design pattern. All of these serve to factor extensive copy/pasting out into the framework, making projects less error-prone and improving code quality.
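As a minimal sketch of the factoring idea behind the TestSuite pattern, the hypothetical example below pulls shared fixture setup into a base class so that concrete test classes reuse it instead of copy/pasting it. In a real project the base class would plug into JUnit; plain Java is used here so the sketch is self-contained:

```java
import java.util.ArrayList;
import java.util.List;

// Shared fixture setup lives in one base suite; concrete suites inherit it
// rather than duplicating it. All names here are hypothetical.
abstract class BaseTestSuite {
    protected List<String> records;

    // Builds the common test fixture that every concrete suite reuses.
    protected void setUp() {
        records = new ArrayList<>();
        records.add("alice,42");
        records.add("bob,17");
    }
}

class ParserTestSuite extends BaseTestSuite {
    // One concrete "test": parses the shared fixture and checks a field.
    boolean testParsesFirstField() {
        setUp();
        String name = records.get(0).split(",")[0];
        return "alice".equals(name);
    }
}

public class TestSuiteSketch {
    public static void main(String[] args) {
        boolean ok = new ParserTestSuite().testParsesFirstField();
        System.out.println(ok ? "PASS" : "FAIL");
    }
}
```

The design choice is simply inheritance over duplication: when the fixture changes, only BaseTestSuite is edited, and every suite picks up the change.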
Table of contents
- Introduction
- Project organization for testability
- JUnit single unit tests
- Test modules
- A word on Scala, Scalding and Kiji
- System integration testing
- Conclusion: lessons and further direction
Software testing is one of the most important yet often neglected parts of software development. For this reason, developers have created a list of the 20 most popular responses to give when their software fails the tests. Here they are:
20. "That's weird..."
19. "It's never done that before."
18. "It worked yesterday."