Thursday, December 20, 2012

Teaching Hadoop in high schools!

I've just read a great book, "Automate This." Its topics range from automated trading on Wall Street to a dark fiber line from Chicago to New York cutting through mountains, to Facebook tracking what its billion users do, to Cloudera and the story of Jeff Hammerbacher, to automated legal review, and then to the US educational system.

On the last subject, the author suggests giving a programming course to every high school student, in order to find the 5% who are born developers (that is because, in the mind of the author, Christopher Steiner, everything can be automated: doctors, lawyers, and traders, but not the automators themselves).

I say more: we should teach Hadoop in high school. Going through basic Java or C++ is booooring. Rather, tell the kids that you will teach them to program just like Facebook and Google people do, then by the way teach them Java. Don't start from int and float, but teach them to read the code, then to understand it, then to write it. Just like learning a human language the right way.

Do it in the IDE (NetBeans, of course:) with all the modern conveniences. And watch for the amazing results.

Sunday, December 9, 2012

Cloud security for lawyers


With the proliferation and easy availability of cloud computing resources, it is inevitable that lawyers and legal IT departments will eventually start using clouds. It is also inevitable that they will ask many questions about cloud security. This paper deals with general cloud security, as well as with the specific security requirements that lawyers and corporate legal and IT departments usually deal with.

We specifically discuss Amazon Web Services (AWS) and the SHMcloud (open source solution for eDiscovery) from SHMsoft. At the end we provide the Q&A section.

What is a “cloud?”

We will define cloud computing as elastic computing resources. A perfect example of this definition is the Elastic Compute Cloud (EC2) provided by the Amazon Web Services (AWS). Its characteristics are:
  • Easy availability of computing resources: you can provision any amount of storage and any number of computing instances within a matter of minutes;
  • Self-service: you can manipulate your computing resources through the browser in your AWS account, or programmatically, without the need for a formal provisioning process, phone calls, or human involvement;
  • Pay-as-you-go: you pay only for the computing resources you use, while you use them.

Since AWS is one of the most prevalent providers, and since SHMcloud is primarily implemented on EC2, we will limit our discussion to these two services. However, the general approach, conclusions, and recommendations are applicable to other cloud services.

Cloud security today

The IT departments of corporations and law firms used to be in complete control of their web-accessible applications, or at least the responsibility lay with them. Now the situation is changing. With the availability of cloud resources, and with multiple reasons driving cloud adoption, these same people often have to rely on the security furnished by the cloud provider.

A recent Forbes article, Data Privacy And The Cloud: Fact Versus Fiction, highlights the growing understanding and adoption of the cloud in general.

Amazon’s EC2 - the one we concentrate upon here - sets very high standards for security protection. It also recommends, and in some cases mandates, the best security practices. However, it is in large measure up to the users to make sure that they use the EC2-provided security correctly. In the case of applications built on top of EC2, the users also need to verify that the application designers have implemented Amazon’s guidelines. If any additional security measures have been used, they need to be documented, and in some cases subjected to an audit.

Components of an EC2 application

There are three components in an EC2 application: the Amazon Machine Image (AMI) that you run, in one or multiple copies; security key pairs; and security groups.


The AMI is what you run, and what matters here is the source of your AMI. If you use a community-provided AMI, it falls on you to check for backdoors that may have been left by the creators. You would need to delete credentials and remove certificate and key material. About 30-40% of all community AMIs have some form of backdoor left behind, most likely by oversight. However it happened, the AMI that you run must be clean.

If you are using a Marketplace AMI, then you can be sure that the Amazon Marketplace team has taken these steps before placing the AMI into general availability through the Marketplace. Thus, for example, all of the AMIs used by SHMcloud are guaranteed by Amazon to be secure in this sense, and free of backdoors.

Key Pair

A key pair is a pair of private and public keys. The public part is stored on Amazon, and it allows you to access your AMIs, provided that it matches the private key that is stored on your machine. The recommended practice here is for every person to generate and use his/her own key pair. Generating a key pair is easy with Amazon, and doing it right has multiple advantages: better security, better accountability, and easier reassignment of rights when a person leaves the company.

Security group

A security group can be viewed as a firewall for the application. It lists the user-defined (that is, your own) access rules for ingress and egress. A security group includes its name, the protocol (TCP, UDP), the to and from ports, and the source (where the traffic is coming from).

There can be many security groups (up to 500), and there are some known problems with them, such as those involving a memcache server. The best way to deal with this is to have a person responsible for security group setup, as well as an auditor, and to use automation, such as the Scout tool, to check the security groups for common problems. Whatever the method, you need to highlight potentially dangerous security groups, and compare what each group should be to what is really there.
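Comparing what a security group should be to what is really there is easy to automate. The following is a minimal sketch in Python, not the Scout tool and not an AWS API call: rules are modeled as plain tuples, and the `Rule` type, the `audit_group` function, and all the addresses and ports are hypothetical values chosen for illustration.

```python
from typing import Dict, List, NamedTuple, Set


class Rule(NamedTuple):
    protocol: str   # "tcp" or "udp"
    from_port: int
    to_port: int
    source: str     # CIDR block the traffic comes from


def audit_group(expected: Set[Rule], actual: Set[Rule]) -> Dict[str, List[Rule]]:
    """Compare the rules a group should have with the rules it really has."""
    return {
        # Rules that are present but were never approved.
        "unexpected": sorted(actual - expected),
        # Rules that were approved but are not present.
        "missing": sorted(expected - actual),
        # Rules that accept traffic from anywhere; worth a human look.
        "open_to_world": sorted(r for r in actual if r.source == "0.0.0.0/0"),
    }


# Hypothetical data: the approved rules and the rules found on the group.
expected = {
    Rule("tcp", 22, 22, "203.0.113.0/24"),
    Rule("tcp", 443, 443, "0.0.0.0/0"),
}
actual = {
    Rule("tcp", 22, 22, "203.0.113.0/24"),
    Rule("tcp", 443, 443, "0.0.0.0/0"),
    Rule("tcp", 11211, 11211, "0.0.0.0/0"),  # memcached port left open
}

report = audit_group(expected, actual)
for finding, rules in report.items():
    print(finding, rules)
```

In practice the `actual` set would be filled from your AWS account and the `expected` set from a reviewed document kept by the person responsible for security groups.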

Simple Storage Service - S3

S3 is the storage part of the AWS in general, and of SHMcloud in particular. It has the following three security mechanisms:

  • ACL, or Access Control List;
  • Bucket policy; and
  • IAM (Identity and Access Management) policy.

Let’s look at each one separately.


ACLs work together with bucket and IAM security.

Bucket security

With bucket-level permissions, one can set fine-grained permissions, down to the level of specific objects. One can also use more granular bucket policies that name specific actions and conditions. For example, one can enforce permissions based on object size.


IAM, or Identity and Access Management, is the way to have multiple users for the same Amazon AWS account. The best practice for using it is outlined below:

  • Create IAM principal, attach IAM policies to it;
  • Create departments and buckets, set policies;
  • Attach users to departments.

Since IAM includes identity federation, it is a convenient and powerful way to manage permissions and controls for users and groups.

Access Logs allow you to verify how the data is being accessed. They can be used for security auditing.

IAM allows for coarse-grained permissions: read, write, etc. The grantee can be a human user or a system user (software agent).


Encryption is another layer of security protection for your data. The Amazon AMI images are already encrypted, to guarantee that Amazon’s employees do not have access to your data. However, a targeted hacking attempt or a human error could still lead to data exposure. To mitigate this risk, sensitive data should additionally be encrypted.

There are two ways of data encryption, client side and server side.

With server-side encryption, Amazon manages the keys and encrypts the data with AES-256. This is more convenient to implement. In this scheme, objects are encrypted, not buckets. Furthermore, there is no need to manage keys, and the risk is transferred to AWS.

With client-side encryption, you manage the keys, using the AWS SDK. There is an additional implementation load, but you gain even greater flexibility. Also, the chain of custody includes only known elements and excludes AWS as a third party.

With encryption, as with all other levels of security, it is a recommended practice to use automated tools (such as Scout) to verify and enforce encryption.

Questions often asked by lawyers

As we have seen, the AWS platform has all the necessary elements to implement the best security practices in web-based applications. SHMcloud, an eDiscovery system based on AWS, implements all of these best practices.

Of course, SHMcloud can be deployed internally, in a hosted or internal computing center, and then it will carry over all of its security practices to this implementation.

In addition, below is a list of questions that law practitioners usually ask about cloud-based eDiscovery, as it relates to specific areas of legal responsibility and various geographical jurisdictions. We have also included some questions related to price/benefit analysis.


How well is my data protected against accidental loss?


S3 stores all its data with a replication factor of 3. Currently S3 stores one billion new objects daily. Yet, since the beginning of its operation in 2002, AWS has not had a documented case of customer data loss. Given the public nature of all outages, this is a remarkable record.


Some lawsuits require storing data for years. Can S3 accommodate this?


S3 has no time limit on data storage. In addition, you can implement selective backup for the important information.


The price of 20 cents/month/GB of data can be quite high. Is there a way to mitigate this cost?


Amazon Glacier stores the data at 1/20 of this cost, at 1 cent/month/GB. You can think of it as inexpensive long-term backup.
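To see what the two quoted prices mean over a multi-year retention period, here is a small arithmetic sketch in Python. The rates are the ones quoted in this section; the 500 GB volume and 5-year period are hypothetical, and the sketch ignores any retrieval or transfer charges.

```python
S3_CENTS_PER_GB_MONTH = 20       # price quoted in the question
GLACIER_CENTS_PER_GB_MONTH = 1   # price quoted in the answer


def storage_cost_dollars(gigabytes: int, months: int, cents_per_gb_month: int) -> float:
    """Total storage cost in dollars for a fixed volume kept for a number of months."""
    return gigabytes * months * cents_per_gb_month / 100


# Hypothetical case: 500 GB retained for 5 years (60 months).
s3_cost = storage_cost_dollars(500, 60, S3_CENTS_PER_GB_MONTH)
glacier_cost = storage_cost_dollars(500, 60, GLACIER_CENTS_PER_GB_MONTH)

print(f"S3: ${s3_cost:,.2f}  Glacier: ${glacier_cost:,.2f}")
```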


What about human errors, such as deletions?


There is no complete protection against human errors, but best practices help mitigate this risk. These include storing multiple copies of the important information, some of them with read-only permissions for all but the project administrator.


How can I make sure that my data is indeed deleted after the necessary retention period is finished?


There are multiple measures that you can take:

  • Delete the data from S3 and shut down your EC2 instances. This has the effect of erasing the data from your hard drive, only more so. In the case of a local hard drive, there is undelete and forensic restore. By contrast, in the case of S3 and EC2, the data is encrypted by Amazon in the first place, so now it is essentially gone;
  • If on top of that you used your own encryption, the data is unrecoverable;
  • You can overwrite your data with other, bogus data. This is not needed, but some people like to run a PC Eraser type of program a number of times, for their comfort, and this has about the same effect here.


Some jurisdictions, such as the European Union, impose data locality requirements, for example that the data should never leave the European region. Can this be accommodated by AWS, and by extension, by SHMcloud?


Amazon AWS provides “Regions” for this exact purpose. Data deployed in one region (such as Ireland for Europe) is guaranteed to never leave the particular computing center. In addition to satisfying the legal requirements, regions provide for better application latency and responsiveness.

SHMcloud takes full advantage of Amazon Regions, and it has AMI instances that can be deployed in any Amazon region. Below is a screenshot showing SHMcloud instances and their regions.

SHMcloud (TM) - Open Source Big Data Solution for eDiscovery


SHMcloud (TM) is an open source solution for eDiscovery.  It is based on the Hadoop framework and other modern Big Data tools.  It has been extensively tested on large volumes of data.  Its current capabilities include metadata and text extraction, culling, exception handling and deduplication.  It also allows searching from within the processed results.  Its output consists of archives with native files, text files, and exceptions, as well as a separate load file which can be loaded into a review platform such as Concordance or Summation.

The SHMcloud project initially started as FreeEed. After being under constant development for a year, with new functionality going into the closed-source, enterprise version of the software, we decided to open source all of it, to provide more benefits to the users, and to simplify the development. The most recently added capabilities include OCR, imaging (PDF), and search with Lucene and Solr, as well as speed enhancements and load-balancing.  SHMsoft provides rigorous testing and quality assurance, and offers responsive commercial support.


The FreeEed software was created by developer Mark Kerzner, and published on GitHub in March of 2011.  This was Mark’s third eDiscovery project, with the first two being early attempts at distributed computing.  Thus, FreeEed was the result of years of experience and the deep knowledge of eDiscovery software.  It was built for Big Data from the start, using such technologies as Hadoop, Lucene, Tika, and Hive.

The project received its initial publicity through an article by LTN reporter Evan Koblentz, “Open Source Could Change the Future of E-Discovery”.  Since that time Mark has presented the project at meetings such as Women in eDiscovery in Houston, and a meeting of the Houston Association Of Litigation Support Managers (HALSM), which took place in Houston.  As Mark continued developing the project, he brought it to the Amazon AWS Cloud as the quickest route to adoption.  He lined up his software consulting company, SHMsoft, to offer support.

Having started as a software consulting company, SHMsoft soon evolved into a developer and promoter of FreeEed, offering commercial support and adding open source and closed source enhancements.  It became necessary to have a separate enterprise version of FreeEed, which is now offered as SHMcloud.  The company was accepted as a client of the Houston Technology Center, a technology incubator.  SHMsoft received initial angel funding, and became noticed as one of the first Big Data companies in Houston, TX.  It was selected as a finalist in the prestigious Goradia Startup Competition.  SHMsoft moved forward by hiring technical and marketing personnel and forming an advisory board, and it has a current headcount of around twenty people.

Then we open sourced the additional capabilities under the name of SHMcloud, which is also found on GitHub. SHMsoft plans to stand behind the SHMcloud project.  In fact, it is in the process of forming a separate non-profit foundation to promote FreeEed and other open source software for eDiscovery. The name of the foundation is “EddFoundation”.  SHMsoft is currently working with eDiscovery processing bureaus as well as with enterprises, not only in Houston but also nationwide, to offer support and facilitate the use and acceptance of the SHMcloud software.

Architecture and software development processing

Processing is organized on the Hadoop framework.  The input data is combined (“staged”) into zip archives for processing and chain-of-custody purposes.  During processing, each file is read from the archive and assigned a unique ID.  The data is then processed with Tika, which extracts text and metadata.  Metadata, text, and the file itself are delivered as processed results.
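The flow described above (stage into a zip archive, read each file, assign a unique ID, extract text and metadata) can be sketched in a few lines of Python. This is only an illustration and not the SHMcloud code: the function names are hypothetical, the archive is kept in memory, and `extract_text_and_metadata` is a stand-in for the Tika step that simply decodes the bytes as text.

```python
import io
import uuid
import zipfile


def stage(files: dict) -> bytes:
    """Combine input files (name -> bytes) into one in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


def extract_text_and_metadata(name: str, content: bytes) -> tuple:
    """Stand-in for the Tika step: decode as text and report basic metadata."""
    text = content.decode("utf-8", errors="replace")
    return text, {"file_name": name, "size_bytes": len(content)}


def process(staged: bytes) -> list:
    """Read each file from the archive, assign a unique ID, extract text and metadata."""
    results = []
    with zipfile.ZipFile(io.BytesIO(staged)) as archive:
        for name in archive.namelist():
            content = archive.read(name)
            text, metadata = extract_text_and_metadata(name, content)
            results.append({
                "id": str(uuid.uuid4()),   # unique ID for this file
                "metadata": metadata,
                "text": text,
                "native": content,         # the file itself is delivered too
            })
    return results


# Hypothetical input files.
staged = stage({"memo.txt": b"Quarterly results attached.", "note.txt": b"Call me."})
results = process(staged)
for item in results:
    print(item["id"], item["metadata"])
```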

The current and future building blocks of the system are HDFS, Hadoop, HBase, Tika, Lucene, Solr, Mahout, Hive, and Pig.  A proprietary enhancement used for quick searches and review will include DataStax technologies.

Indexing and searching

Culling is accomplished through the use of an open source search engine called Lucene. An efficient in-memory index is created for each document, and all of the project’s keywords are run against this index. If the index contains any of the keyword combinations, the document is considered responsive and is sent for further processing.
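The culling logic can be illustrated without Lucene. In the Python sketch below, the per-document "index" is just the set of lowercase terms in the document, and a keyword combination is treated as a group of terms that must all be present. Both of these are simplifying assumptions, and the function names and sample data are hypothetical.

```python
import re


def build_index(text: str) -> set:
    """In-memory index for one document: the set of its lowercase terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_responsive(text: str, keyword_combinations: list) -> bool:
    """A document is responsive if any one combination has all its terms in the index."""
    index = build_index(text)
    return any(set(combo) <= index for combo in keyword_combinations)


# Hypothetical project keywords (already lowercase) and documents.
combinations = [("energy", "contract"), ("merger",)]
documents = {
    "a.txt": "The energy contract was signed.",
    "b.txt": "Lunch on Friday?",
    "c.txt": "Merger talks resume.",
}

responsive = [name for name, text in documents.items()
              if is_responsive(text, combinations)]
print(responsive)
```

Only the responsive documents would be sent on for further processing; the rest are culled.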

A feature that is currently being tested is the capability to store each search index for each document in a complete Lucene index. This allows for additional searching and culling to be performed once the project processing is completed.

This process is made even more efficient and flexible because each node on a Hadoop cluster is creating its own Lucene index. The indexes can then be used for searches, where the software queries all of them in a combined query. For the sake of efficiency, the indexes get merged into the project’s search index during the next step of processing.
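The idea of querying several per-node indexes together, and later merging them into one project index, can be sketched as follows. This Python example uses a toy inverted index (term to set of document IDs) in place of Lucene; the function names and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_node_index(documents: dict) -> dict:
    """Inverted index built by one node: term -> set of document IDs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def combined_query(node_indexes: list, term: str) -> set:
    """Run the same query against every node's index and union the hits."""
    hits = set()
    for index in node_indexes:
        hits |= index.get(term.lower(), set())
    return hits


def merge_indexes(node_indexes: list) -> dict:
    """Merge the per-node indexes into a single project index."""
    merged = defaultdict(set)
    for index in node_indexes:
        for term, doc_ids in index.items():
            merged[term] |= doc_ids
    return dict(merged)


# Hypothetical documents processed on two different nodes.
node_a = build_node_index({"doc1": "gas pipeline contract", "doc2": "holiday party"})
node_b = build_node_index({"doc3": "pipeline maintenance schedule"})

hits = combined_query([node_a, node_b], "pipeline")
project_index = merge_indexes([node_a, node_b])
print(sorted(hits), sorted(project_index["pipeline"]))
```

The combined query and the merged index return the same hits; the merge simply avoids visiting every node's index on each later search.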


Metadata results are output as a CSV file, while the native files and the extracted text are stored in one or more zip files. The end results can be used for culling and producing native files for legal review.

Supported file formats

MS Office formats
PST processing

Speed of processing

On regular commodity servers, SHMcloud processes about 2 GB of data per hour. The speed increases linearly with the number of servers in the Hadoop cluster. Thus, at a recent demo for HALSM using 50 computers on the Amazon EC2 cloud, SHMcloud processed 100 GB of Enron data in 1 hour.
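The arithmetic behind these figures is simple enough to check. The Python sketch below assumes ideal linear scaling at the quoted 2 GB per hour per server; real clusters will have some overhead, and the function names are only illustrative.

```python
GB_PER_HOUR_PER_SERVER = 2  # per-server figure quoted above


def cluster_throughput(servers: int) -> int:
    """GB processed per hour, assuming ideal linear scaling."""
    return GB_PER_HOUR_PER_SERVER * servers


def hours_to_process(data_gb: float, servers: int) -> float:
    """Estimated processing time in hours for a data set of the given size."""
    return data_gb / cluster_throughput(servers)


# The HALSM demo: 50 servers, 100 GB of Enron data.
print(cluster_throughput(50), "GB/hour")
print(hours_to_process(100, 50), "hour(s)")
```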


SHMsoft has a full-time tester dedicated to testing the stand-alone and cloud-based versions of FreeEed/SHMcloud. The testing is done using standard data sets, in particular the Enron set. The results of the complete Enron data processing can be found at, or by navigating to

Controlling the software

The SHMcloud software is controlled through a desktop application called a “Player”. The Player allows the user to set and organize projects, add data to the project, set and update processing parameters, stage the data (copy it to archive files for deployment on the Hadoop cluster or on the Amazon AWS cloud), and then to start and control processing.

The web browser-based GUI is under development, first for the search and culling, and later to replace the Player.

The back-end processing, residing on an internal Hadoop cluster or in the private AWS Amazon Cloud, is referred to as SHMcloud.  It consists of the same SHMcloud software deployed to every cluster node. The Player organizes the cluster processing. This is illustrated in the diagram below.

In the near future, SHMcloud processing will have the following enhancements:

  1. Browser interface, instead of a desktop application
  2. Optimized data harvesting
  3. Added proprietary data sources and databases
  4. Allow searches and first-pass review directly on the cluster
  5. Allow additional culling, based on previous results

The enhanced near-term processing is illustrated below:

The next enhancements for SHMcloud will include:

  1. Advanced analytics
  2. Review built on Big Data

Comparison of FreeEed / SHMcloud editions

Edition/Feature | FreeEed | SHMcloud for Amazon AWS | SHMcloud for Hadoop cluster - support
Player (desktop application) for local one-workstation processing | Free | Free | Free
Player app to control cluster processing | It works, but you do it yourself | Enterprise support | Enterprise support
Levels of support | Email, community | Training, implementation, and support, 8 through 5, or 24x7 | Training, implementation, and support, 8 through 5, or 24x7
Pricing | Free | $0/month + $1 / server instance hour + AWS charges | Yearly: $2,500 per node on Hadoop cluster
Text and metadata extraction, culling, load file | Yes | Yes | Yes
Deduplication | No on Windows, yes on Linux | Yes | Yes
Speed | 2 GB / hour | 2 GB / hour * number of machines in the cluster (which is limited only by your AWS account) | 2 GB / hour * number of machines in the cluster
Formats: MS Office, PST, PDF, images | Yes | Yes | Yes
Custom formats | No | Optional | Optional
Integration support | No | Available | Available
Scalability | Limited by one workstation, or by your own support on the cluster | Basically, unlimited - you only need to increase your maximum number of instances assigned by Amazon | Like any Hadoop cluster, depends on your hardware

Third-party validations

1. Beta testing.

FreeEed / SHMcloud has been tested by a number of parties, including PricewaterhouseCoopers and various eDiscovery service bureaus.

2. Publications.

LTN regularly writes about FreeEed and SHMcloud, see for example, here,

3. Comparisons to other eDiscovery software.

"Capital Toomey," an e-discovery blogger from Albany, New York, recently posted about his tests of the core FreeEed and said he's optimistic about the program's future.  He noted that as with many open-source applications, FreeEed requires some technical know-how and has room for improvement in its user interfaces.  After making his data available from FreeEed and LexisNexis Concordance tests, Toomey writes: “[...] most any e-discovery tool -- in the proper hands -- can be employed successfully." See here,

4. University research.

a) Marcel Miersebach, a student of computer security in Vienna, Austria, at Fachhochschule St. Pölten, wrote the following paper, “eDiscovery with Hadoop: Is open-source an option?” In this 100+ page work, Marcel compares FreeEed to NUIX, against the general background of eDiscovery.  The draft version of the paper (in German, with a summary in English) can be found here:

b) A group of MBA students at the University of Houston chose SHMsoft as their graduation project. Their work includes the analysis of the company strategy and improvement suggestions. Here are the final presentation:, and the executive summary:

Monday, December 3, 2012

SHMsoft News - eDiscovery GUI

Let's take a quick look at what happened at SHMsoft last week:

  1. We started receiving feedback from our beta testers and advisors on the user interface for eDiscovery processing and culling, and we have already started implementing the changes. It will be in the browser, simple, and powerful, and it will run on a single workstation, internal Hadoop cluster, or EC2 - take your pick.
  2. We are deepening our Hadoop training expertise with administrators’ and operations needs, and with advanced MapReduce and HBase programming techniques.
  3. We are expecting a great week to come working with law firms, hosting providers etc. - stay tuned.

Thank you and best regards from the SHMsoft team.