Sunday, December 9, 2012

SHMcloud (TM) - Open Source Big Data Solution for eDiscovery

Summary

SHMcloud (TM) is an open source solution for eDiscovery.  It is based on the Hadoop framework and other modern Big Data tools.  It has been extensively tested on large volumes of data.  Its current capabilities include metadata and text extraction, culling, exception handling and deduplication.  It also allows searching from within the processed results.  Its output consists of archives with native files, text files, and exceptions, as well as a separate load file which can be loaded into a review platform such as Concordance or Summation.

The SHMcloud project initially started as FreeEed. After a year of constant development, during which new functionality went into the closed-source enterprise version of the software, we decided to open source all of it, both to give users more benefits and to simplify development. The most recently added capabilities include OCR, imaging (PDF), and search with Lucene and Solr, as well as speed enhancements and load balancing.  SHMsoft provides rigorous testing and quality assurance, and offers responsive commercial support.

History

The FreeEed software was created by developer Mark Kerzner and published on GitHub in March of 2011.  This was Mark’s third eDiscovery project, with the first two being early attempts at distributed computing.  Thus, FreeEed was the result of years of experience and deep knowledge of eDiscovery software.  It was built for Big Data from the start, using such technologies as Hadoop, Lucene, Tika, and Hive.

The project received its initial publicity through an article by LTN reporter Evan Koblentz, “Open Source Could Change the Future of E-Discovery”.  Since that time Mark has presented the project at meetings in Houston, including Women in eDiscovery and the Houston Association Of Litigation Support Managers (HALSM).  As Mark continued developing the project, he brought it to the Amazon AWS Cloud as the quickest route to adoption.  He lined up his software consulting company, SHMsoft, to offer support.

Having started as a software consulting company, SHMsoft soon evolved into a developer and promoter of FreeEed, offering commercial support and adding open source and closed source enhancements.  It became necessary to have a separate enterprise version of FreeEed, which is now offered as SHMcloud.  The company was accepted as a client of the Houston Technology Center, a technology incubator.  SHMsoft received initial angel funding, and became noticed as one of the first Big Data companies in Houston, TX.  It was selected as a finalist in the prestigious Goradia Startup Competition.  SHMsoft moved forward by hiring technical and marketing personnel and forming an advisory board, and it has a current headcount of around twenty people.

Then we open sourced the additional capabilities under the name of SHMcloud, which is also found on GitHub. SHMsoft plans to stand behind the SHMcloud project.  In fact, it is in the process of forming a separate non-profit foundation to promote FreeEed and other open source software for eDiscovery. The name of the foundation is “EddFoundation”.  SHMsoft is currently working with eDiscovery processing bureaus as well as with enterprises, not only in Houston but also nationwide, to offer support and facilitate the use and acceptance of the SHMcloud software.

Architecture and software development processing

Processing is organized on the Hadoop framework.  The input data is combined (“staged”) into zip archives for processing and chain-of-custody purposes.  During processing, each file is read from the archive and assigned a unique ID.  The data is then processed with Tika, which extracts text and metadata.  Metadata, text, and the file itself are delivered as processed results.
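The staging and per-file steps described above can be sketched in a few lines of Python. This is only an illustration, not the SHMcloud code: the function names, the five-digit ID format, and the `extract_text_and_metadata` helper (a stand-in for Tika that simply decodes bytes) are all assumptions made for the example.

```python
import hashlib
import io
import zipfile


def stage(files):
    """Combine input files (name -> bytes) into one in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


def extract_text_and_metadata(name, content):
    """Hypothetical stand-in for Tika: decode the bytes and record basic facts."""
    text = content.decode("utf-8", errors="replace")
    metadata = {
        "file_name": name,
        "size_bytes": len(content),
        "sha1": hashlib.sha1(content).hexdigest(),
    }
    return text, metadata


def process(staged_zip):
    """Read each file from the archive, assign an ID, extract text and metadata."""
    results = []
    with zipfile.ZipFile(io.BytesIO(staged_zip)) as archive:
        for number, info in enumerate(archive.infolist(), start=1):
            content = archive.read(info)
            text, metadata = extract_text_and_metadata(info.filename, content)
            # Assumed ID scheme: a zero-padded sequence number within the archive.
            metadata["unique_id"] = f"{number:05d}"
            results.append({"metadata": metadata, "text": text, "native": content})
    return results
```

Each result carries the three deliverables named above: the metadata, the extracted text, and the native file itself.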

The current and future building blocks of the system are HDFS, Hadoop, HBase, Tika, Lucene, Solr, Mahout, Hive, and Pig.  A proprietary enhancement used for quick searches and review will include DataStax technologies.

Indexing and searching

Culling is accomplished through the use of an open source search engine called Lucene. An efficient in-memory index is created for each document, and all of the project’s keywords are run against this index. If the index contains any of the keyword combinations, the document is considered responsive and is sent for further processing.
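To make the culling logic concrete, here is a minimal sketch in Python. It does not use Lucene; a plain set of terms stands in for the in-memory index, and a "keyword combination" is assumed to mean a list of terms that must all appear in the document. The function names are made up for the example.

```python
import re


def build_index(text):
    """Build a tiny in-memory index for one document: the set of its lowercase terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_responsive(text, keyword_combinations):
    """Return True if the document contains every term of at least one combination."""
    index = build_index(text)
    return any(
        all(term.lower() in index for term in combination)
        for combination in keyword_combinations
    )


def cull(documents, keyword_combinations):
    """Keep only the IDs of documents (id -> text) that match the project's keywords."""
    return [
        doc_id
        for doc_id, text in documents.items()
        if is_responsive(text, keyword_combinations)
    ]
```

For example, with the combinations `[["energy", "trading"], ["merger"]]`, a document reading "Quarterly energy trading report" is kept, while "Lunch on Friday?" is culled.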

A feature that is currently being tested is the capability to store each search index for each document in a complete Lucene index. This allows for additional searching and culling to be performed once the project processing is completed.

This process is made even more efficient and flexible because each node on a Hadoop cluster is creating its own Lucene index. The indexes can then be used for searches, where the software queries all of them in a combined query. For the sake of efficiency, the indexes get merged into the project’s search index during the next step of processing.
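The idea of per-node indexes that are first queried together and later merged can be illustrated with a toy inverted index in Python. This is a sketch under simplifying assumptions: a dictionary from term to document IDs stands in for a Lucene index, and all names are hypothetical.

```python
from collections import defaultdict


def build_node_index(documents):
    """Inverted index for one node: term -> set of document IDs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(node_indexes, term):
    """Combined query: ask every node index and union the hits."""
    hits = set()
    for index in node_indexes:
        hits |= index.get(term.lower(), set())
    return hits


def merge_indexes(node_indexes):
    """Merge the per-node indexes into a single project-wide index."""
    merged = defaultdict(set)
    for index in node_indexes:
        for term, doc_ids in index.items():
            merged[term] |= doc_ids
    return merged
```

A search over the merged index returns the same hits as the combined query over the separate node indexes; merging simply saves the cost of querying every node index each time.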

Output

Metadata results are output as a CSV file, while the native files and the extracted text are stored in one or more zip files. The end results can be used for culling and producing native files for legal review.
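As a rough picture of this output, the Python sketch below writes a CSV load file and a zip archive from a list of processed records. The record layout, the column names, and the `native/` and `text/` folder names are assumptions for the example and not the actual SHMcloud output layout.

```python
import csv
import io
import zipfile


def write_load_file(records, field_names):
    """Write one CSV row of metadata per processed document."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=field_names, lineterminator="\n")
    writer.writeheader()
    for record in records:
        row = {name: record["metadata"].get(name, "") for name in field_names}
        writer.writerow(row)
    return out.getvalue()


def write_archive(records):
    """Store each native file and its extracted text in one zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for record in records:
            doc_id = record["metadata"]["unique_id"]
            file_name = record["metadata"]["file_name"]
            # Assumed layout: natives and text side by side, keyed by document ID.
            archive.writestr(f"native/{doc_id}_{file_name}", record["native"])
            archive.writestr(f"text/{doc_id}.txt", record["text"])
    return buffer.getvalue()
```

Keying both the CSV row and the archive entries by the same document ID is what lets a review platform tie a load-file row back to its native file and text.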

Supported file formats

MS Office formats
PST processing
PDF
Images

Speed of processing

On regular commodity servers, SHMcloud processes about 2 GB of data per hour. The speed linearly increases with the number of servers in the Hadoop cluster. Thus, at a recent demo for HALSM using 50 computers on the Amazon EC2 cloud, SHMcloud processed 100 GB of Enron data in 1 hour.
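The arithmetic behind these figures is simple enough to write down. The helper below is a back-of-the-envelope estimate in Python that assumes perfectly linear scaling at 2 GB per server per hour; the function name and parameters are ours, and a real cluster will have some startup and coordination overhead.

```python
def estimated_hours(data_gb, servers, gb_per_server_hour=2.0):
    """Estimate processing time, assuming throughput scales linearly with servers."""
    if servers < 1:
        raise ValueError("at least one server is required")
    return data_gb / (gb_per_server_hour * servers)


# The HALSM demo: 100 GB on 50 servers at 2 GB per server per hour.
demo_hours = estimated_hours(100, 50)      # 1.0 hour
single_server = estimated_hours(100, 1)    # 50.0 hours on one machine
```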

Testing

SHMsoft has a full-time tester dedicated to testing the stand-alone and cloud-based versions of FreeEed/SHMcloud. The testing is done using standard data sets, in particular the Enron set. The results of the complete Enron data processing can be found at FreeEed.org, or by navigating to http://freeeed.org/index.php/documentation/testing-with-enron-data.

Controlling the software

The SHMcloud software is controlled through a desktop application called a “Player”. The Player allows the user to set and organize projects, add data to the project, set and update processing parameters, stage the data (copy it to archive files for deployment on the Hadoop cluster or on the Amazon AWS cloud), and then to start and control processing.

A web browser-based GUI is under development, first for search and culling, and later to replace the Player.

The back-end processing, residing on an internal Hadoop cluster or in a private Amazon AWS cloud, is referred to as SHMcloud.  It consists of the same SHMcloud software deployed to every cluster node. The Player organizes the cluster processing. This is illustrated in the diagram below.





In the near future, SHMcloud processing will have the following enhancements:

  1. Browser interface, instead of a desktop application
  2. Optimized data harvesting
  3. Added proprietary data sources and databases
  4. Allow searches and first-pass review directly on the cluster
  5. Allow additional culling, based on previous results

The enhanced near-term processing is illustrated below:




The next enhancements for SHMcloud will include:

  1. Advanced analytics
  2. Review built on Big Data


Comparison of FreeEed / SHMcloud editions


Edition/Feature | FreeEed | SHMcloud for Amazon AWS | SHMcloud for Hadoop cluster - support
--- | --- | --- | ---
License | Apache | Proprietary | Proprietary
Player (desktop application) for local one-workstation processing | Free | Free | Free
Player app to control cluster processing | It works, but you do it yourself | Enterprise support | Enterprise support
Levels of support | Email, community | Training, implementation, and support, 8 through 5, or 24x7 | Training, implementation, and support, 8 through 5, or 24x7
Pricing | Free | $0/month + $1 / server instance hour + AWS charges | Yearly: $2,500 per node on Hadoop cluster
Text and metadata extraction, culling, load file | Yes | Yes | Yes
OCR | No | Yes | Yes
Imaging | No | Yes | Yes
Deduplication | No on Windows, yes on Linux | Yes | Yes
Speed | 2 GB / hour | 2 GB / hour * number of machines in the cluster (which is limited only by your AWS account) | 2 GB / hour * number of machines in the cluster
Formats: MS Office, PST, PDF, images | Yes | Yes | Yes
Custom formats | No | Optional | Optional
Databases | No | Optional | Optional
Integration support | No | Available | Available
Training | No | Available | Available
Scalability | Limited by one workstation, or by your own support on the cluster | Basically, unlimited - you only need to increase your maximum number of instances assigned by Amazon | Like any Hadoop cluster, depends on your hardware



Third-party validations

1. Beta testing.

FreeEed / SHMcloud has been tested by a number of parties, including PricewaterhouseCoopers and various eDiscovery service bureaus.

2. Publications.

LTN regularly writes about FreeEed and SHMcloud. See, for example, http://www.law.com/jsp/lawtechnologynews/PubArticleLTN.jsp?id=1202556672056&SHMsoft_Tests_Open_Source_EDiscovery_App_in_Cloud&slreturn=20120813164425

3. Comparisons to other eDiscovery software.

"Capital Toomey," an e-discovery blogger from Albany, New York, recently posted about his tests of the core FreeEed and said he's optimistic about the program's future.  He noted that as with many open-source applications, FreeEed requires some technical know-how and has room for improvement in its user interfaces.  After making his data available from FreeEed and LexisNexis Concordance tests, Toomey writes: “[...] most any e-discovery tool -- in the proper hands -- can be employed successfully." See here, http://capitaltoomey.blogspot.com/2012/04/ediscovery-lower-in-stack-ptiv-wrap.html

4. University research.

a) Marcel Miersebach, a student of computer security in Vienna, Austria, at Fachhochschule St. Pölten, wrote the following paper, “eDiscovery with Hadoop: Is open-source an option?” In this 100+ page work, Marcel compares FreeEed to NUIX, against the general background of eDiscovery.  The draft version of the paper (in German, with a summary in English) can be found here: http://shmsoft.com/images/stories/Diplomarbeit_Marcel_Miersebach-v3.0.pdf.

b) A group of MBA students at the University of Houston chose SHMsoft as their graduation project. Their work includes an analysis of the company strategy and suggestions for improvement. Here are the final presentation, http://shmsoft.com/images/stories/SHMsoft%20Project%20Presentation.pptx, and the executive summary, http://shmsoft.com/images/stories/Exec-Summary.pdf.
