Thursday, December 20, 2012

Teaching Hadoop in high schools!

I've just read a great book, "Automate This." Its topics range from automated trading on Wall Street to a dark fiber line from Chicago to New York cutting through mountains, to Facebook tracking what its billion users do, to Cloudera and the story of Jeff Hammerbacher, to automated legal review, and then to the US educational system.

On the last subject, the author suggests giving a programming course to every high school student, in order to find the 5% who are born developers (that is because in the mind of the author, Christopher Steiner, everything can be automated: doctors, lawyers, and traders, except the automators themselves).

I say more: we should teach Hadoop in high school. Going through basic Java or C++ is booooring. Rather, tell the kids that you will teach them to program just like Facebook and Google people do, then by the way teach them Java. Don't start from int and float, but teach them to read the code, then to understand it, then to write it. Just like learning a human language the right way.

Do it in the IDE (NetBeans, of course:) with all the modern conveniences. And watch for the amazing results.

Sunday, December 9, 2012

Cloud security for lawyers

Introduction

With the proliferation and easy availability of cloud computing resources it is inevitable that lawyers and legal IT departments will eventually start using clouds. It is also inevitable that they will ask many questions about cloud security. This paper deals with general cloud security, as well as with specific security requirements that lawyers and legal/IT departments of the corporation usually deal with.

We specifically discuss Amazon Web Services (AWS) and the SHMcloud (open source solution for eDiscovery) from SHMsoft. At the end we provide the Q&A section.

What is a “cloud?”

We will define cloud computing as elastic computing resources. A perfect example of this definition is the Elastic Compute Cloud (EC2) provided by the Amazon Web Services (AWS). Its characteristics are:
  • Easy availability of computing resources: you can provision any amount of storage and any number of computing instances within a matter of minutes;
  • Self-service: you can manipulate your computing resources through the browser in your AWS account, or programmatically, without a formal provisioning process, phone calls, or human involvement;
  • Pay-as-you-go: you pay only for the computing resources you use, while you use them.

Since AWS is one of the most prevalent providers, and since SHMcloud is primarily implemented on EC2, we will limit our discussion to these two services. However, the general approach, conclusions, and recommendations are applicable to other cloud services.

Cloud security today

The IT departments of corporations and law firms used to be in complete control of their web-accessible applications, or at least the responsibility lay with them. Now the situation is changing. With the availability of cloud resources, and with multiple reasons driving cloud adoption, these same people often have to rely on the security furnished by the cloud provider.

A recent Forbes article, Data Privacy And The Cloud: Fact Versus Fiction, highlights the growing understanding and adoption of the cloud in general.

Amazon’s EC2 - the one we concentrate upon here - sets very high standards for security protection. It also recommends, and in some cases mandates, the best security practices. However, it is in large measure up to the users to make sure that they use the EC2-provided security correctly. In the case of applications built on top of EC2, the users also need to verify that the application designers have implemented Amazon’s guidelines. If any additional security measures have been used, they need to be documented, and in some cases be subject to an audit.

Components of an EC2 application

There are three components in an EC2 application: the Amazon Machine Image (AMI) that you run, in one or multiple copies; security key pairs; and security groups.

AMI

The AMI is what you run, and what matters here is where you get it. If you use a community-provided AMI, it falls on you to check for backdoors that may have been left by its creators: you would need to delete leftover credentials and remove certificate and key material. About 30-40% of all community AMIs have some form of backdoor left behind, most likely by oversight. However it happened, the AMI that you run must be clean.

If you are using a Marketplace AMI, then you can be sure that the Amazon Marketplace team has taken these steps before placing the AMI into general availability through the Marketplace. Thus, for example, all of the AMIs used by SHMcloud are guaranteed by Amazon to be secure in this sense, and free of backdoors.

Key Pair

A key pair is a pair of private and public keys. The public part is stored on Amazon, and it allows you to access your AMIs, provided that it matches the private key stored on your machine. The recommended practice here is for every person to generate and use his or her own key pair. Generating a key pair is easy with Amazon, and doing this right has multiple advantages: better security, better accountability, and easier reassignment of rights when a person leaves the company.

Security group

A security group can be viewed as a firewall for the application. It lists the access rules for ingress and egress that you, the user, define. A security group includes its name, the protocol (TCP, UDP), the "to" and "from" ports, and the source (where the traffic is coming from).

There can be many security groups (up to 500), and there are some known problems with them, such as those involving the use of a memcache server. The best way to deal with this is to have one person responsible for security group setup and another acting as auditor, and to use automation to check for common security group problems, for example with the Scout tool, https://github.com/iSECPartners/scout. Whatever the method, you need to highlight potentially dangerous security groups and compare what each one should be to what is really there.
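The comparison of intended and deployed rules can be sketched in a few lines of Python. This is only an illustration and is not how the Scout tool works: the rule format (a dictionary with protocol, from_port, to_port, and source) is a hypothetical one that mirrors the fields listed above, and the choice of ports 80 and 443 as acceptable public ports is an assumption.

```python
# Hypothetical rule format: these field names are illustrative and are
# not the AWS API's own representation of a security group rule.
OPEN_TO_WORLD = "0.0.0.0/0"


def find_risky_rules(rules, allowed_public_ports=(80, 443)):
    """Return ingress rules that are open to the whole internet on any
    port other than the ones explicitly allowed to be public."""
    risky = []
    for rule in rules:
        if rule["source"] != OPEN_TO_WORLD:
            continue
        ports = range(rule["from_port"], rule["to_port"] + 1)
        if any(port not in allowed_public_ports for port in ports):
            risky.append(rule)
    return risky


def rule_key(rule):
    """Reduce a rule to a hashable tuple so rule sets can be compared."""
    return (rule["protocol"], rule["from_port"], rule["to_port"], rule["source"])


def diff_rules(expected, actual):
    """Compare what the security group should be with what is really there."""
    expected_keys = {rule_key(r) for r in expected}
    actual_keys = {rule_key(r) for r in actual}
    return {
        "unexpected": sorted(actual_keys - expected_keys),
        "missing": sorted(expected_keys - actual_keys),
    }
```

An auditor would feed the documented rules in as `expected` and the rules exported from the account as `actual`, then review everything reported as unexpected or risky.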

Simple Storage Service - S3

S3 is the storage part of the AWS in general, and of SHMcloud in particular. It has the following three security mechanisms:

  • ACL, or Access Control List;
  • Bucket policy; and
  • IAM (Identity and Access Management) policy.

Let’s look at each one separately.

ACL

ACLs work together with bucket and IAM security.

Bucket security

With bucket-level permissions, one can set fine-grained permissions, down to the level of specific objects. One can also write more granular bucket policies that name specific actions and conditions. For example, one can enforce permissions based on object size.

IAM

IAM, or Identity and Access Management, is the way to have multiple users for the same AWS account. The best practice for using it is outlined below:

  • Create IAM principal, attach IAM policies to it;
  • Create departments and buckets, set policies;
  • Attach users to departments.

Since IAM includes identity federation, it is a convenient and powerful mechanism for user and group permissions and controls.

Access logs allow you to verify how the data is being accessed. They can be used for security auditing.

IAM allows for coarse-grained permission: read, write, etc. The grantee can be a human user, or a system user (software agent).

Encryption

Encryption is another layer of protection for your data. The Amazon AMI images are already encrypted, to guarantee that Amazon’s employees do not have access to your data. However, a targeted hacking attempt or a human error could still lead to data exposure. To mitigate this risk, sensitive data should additionally be encrypted.

There are two ways of data encryption, client side and server side.

With server-side encryption, Amazon manages the keys and encrypts the data with AES-256. This is more convenient to implement. In this scheme, objects are encrypted, not buckets. Furthermore, there is no need to manage keys, and the risk is transferred to AWS services.

With client-side encryption, you manage the keys, using the AWS SDK. There is an additional implementation load, but one has even greater flexibility. Also, the chain of custody includes only known elements and excludes AWS as a third party.

With encryption, as with all other levels of security, it is a recommended practice to use automated tools (such as Scout) to verify and enforce encryption.
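For server-side encryption, one way to enforce the rule is a bucket policy that rejects any upload that does not request AES-256 encryption. The sketch below only builds the policy document; it assumes the `s3:x-amz-server-side-encryption` condition key, and the bucket name is a made-up placeholder. Attaching the policy to a bucket is left out.

```python
import json


def require_sse_policy(bucket_name):
    """Build a bucket policy that denies uploads which do not ask for
    AES-256 server-side encryption. The bucket name is supplied by the caller."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "AES256"
                    }
                },
            }
        ],
    }


# "example-ediscovery-bucket" is a hypothetical name used for illustration.
policy_json = json.dumps(require_sse_policy("example-ediscovery-bucket"), indent=2)
```

Because the statement is a Deny, it takes effect regardless of what other policies allow, which is why this pattern suits enforcement rather than mere recommendation.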

Questions often asked by lawyers

As we have seen, the AWS platform has all the necessary elements to implement the best practices of security in web-based applications. SHMcloud, an eDiscovery system based on AWS, implements all of these best practices.

Of course, SHMcloud can be deployed internally, in a hosted or internal computing center, and then it will carry over all of its security practices to this implementation.

In addition, below is a list of questions that law practitioners usually ask about cloud-based eDiscovery, as it relates to specific areas of legal responsibility and various geographical jurisdictions. We have also included some questions related to price/benefit analysis.

Q.

How well is my data protected against accidental loss?

A.

S3 stores all its data with a replication factor of 3. Currently S3 stores one billion new objects daily. Yet, since the beginning of its operation in 2002, AWS has not had a documented case of customer data loss. Given the public nature of all outages, this is a remarkable record.

Q.

Some lawsuits require storing data for years. Can S3 accommodate this?

A.

S3 has no time limit on data storage. In addition, you can implement selective backup for the important information.

Q.

The price of 20 cents/month/GB of data can be quite high. Is there a way to mitigate this cost?

A.

Amazon Glacier stores the data at 1/20 of this cost, at 1 cent/month/GB. You can think of it as inexpensive long-term backup.

Q.

What about human errors, such as deletions?

A.

There is no complete protection against human errors, but best practices help mitigate this risk. These include storing multiple copies of the important information, some of them with read-only permissions for all but the project administrator.

Q. 

How can I make sure that my data is indeed deleted after the necessary retention period is finished?

A. 

There are multiple measures that you can take:

  • Delete the data from S3 and shut down your EC2 instances. This has the effect of erasing the data from your hard drive, only more so. With a local hard drive, undelete and forensic restoration are possible. By contrast, in the case of S3 and EC2, the data is encrypted by Amazon in the first place, so now it is essentially gone;
  • If on top of that you used your own encryption, the data is unrecoverable;
  • You can overwrite your data with other, bogus data. This is not needed, but some people like to run a PC Eraser type of program a number of times, for their comfort, and this has about the same effect here.

Q.

Some jurisdictions, such as the European Union, impose a data locality requirement, for example that the data should never leave the European region. Can this be accommodated by AWS, and by extension, by SHMcloud?

A.

Amazon AWS provides “Regions” for this exact purpose. Data deployed in one region (such as Ireland for Europe) is guaranteed to never leave the particular computing center. In addition to satisfying the legal requirements, regions provide for better application latency and responsiveness.

SHMcloud takes full advantage of Amazon Regions, and it has AMI instances that can be deployed in any Amazon region. Below is a screenshot showing SHMcloud instances and their regions.






SHMcloud (TM) - Open Source Big Data Solution for eDiscovery

Summary

SHMcloud (TM) is an open source solution for eDiscovery.  It is based on the Hadoop framework and other modern Big Data tools.  It has been extensively tested on large volumes of data.  Its current capabilities include metadata and text extraction, culling, exception handling and deduplication.  It also allows searching from within the processed results.  Its output consists of archives with native files, text files, and exceptions, as well as a separate load file which can be loaded into a review platform such as Concordance or Summation.

The SHMcloud project initially started as FreeEed. After being under constant development for a year, with new functionality going into the closed-source, enterprise version of the software, we decided to open source all of it, to provide more benefits to the users, and to simplify the development. The most recently added capabilities include OCR, imaging (PDF), and search with Lucene and Solr, as well as speed enhancements and load-balancing.  SHMsoft provides rigorous testing and quality assurance, and offers responsive commercial support.

History

The FreeEed software was created by developer Mark Kerzner, and published on GitHub in March of 2011.  This was Mark’s third eDiscovery project, with the first two being early attempts at distributed computing.  Thus, FreeEed was the result of years of experience and the deep knowledge of eDiscovery software.  It was built for Big Data from the start, using such technologies as Hadoop, Lucene, Tika, and Hive.

The project received its initial publicity through an article by LTN reporter Evan Koblentz, “Open Source Could Change the Future of E-Discovery”.  Since that time Mark has presented the project at meetings such as Women in eDiscovery in Houston, and a meeting of the Houston Association Of Litigation Support Managers (HALSM), which took place in Houston.  As Mark continued developing the project, he brought it to the Amazon AWS Cloud as the quickest route to adoption.  He lined up his software consulting company, SHMsoft, to offer support.

Having started as a software consulting company, SHMsoft soon evolved into a developer and promoter of FreeEed, offering commercial support and adding open source and closed source enhancements.  It became necessary to have a separate enterprise version of FreeEed, which is now offered as SHMcloud.  The company was accepted as a client of the Houston Technology Center, a technology incubator.  SHMsoft received initial angel funding, and became noticed as one of the first Big Data companies in Houston, TX.  It was selected as a finalist in the prestigious Goradia Startup Competition.  SHMsoft moved forward by hiring technical and marketing personnel and forming an advisory board, and has a current headcount of around twenty people.

Then we open sourced the additional capabilities under the name of SHMcloud, which is also found on GitHub. SHMsoft plans to stand behind the SHMcloud project.  In fact, it is in the process of forming a separate non-profit foundation to promote FreeEed and other open source software for eDiscovery. The name of the foundation is “EddFoundation”.  SHMsoft is currently working with eDiscovery processing bureaus as well as with enterprises, not only in Houston but also nationwide, to offer support and facilitate the use and acceptance of the SHMcloud software.

Architecture and software development processing

Processing is organized on the Hadoop framework.  The input data is combined (“staged”) into zip archives for processing and chain-of-custody purposes.  During processing, each file is read from the archive and assigned a unique ID.  The data is then processed with Tika, which extracts text and metadata.  Metadata, text, and the file itself are delivered as processed results.
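The staging and ID-assignment steps can be illustrated with a minimal Python sketch. It is not the SHMcloud code: it works on an in-memory archive, uses random UUIDs as a stand-in for whatever ID scheme the product uses, and skips the Tika extraction step entirely. The function names are hypothetical.

```python
import io
import uuid
import zipfile


def stage_files(files):
    """Pack a {name: bytes} mapping into a single in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


def read_staged(archive_bytes):
    """Yield (unique_id, name, content) for every file in the archive.
    A random UUID stands in for the real ID scheme."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for name in archive.namelist():
            yield uuid.uuid4().hex, name, archive.read(name)
```

Reading each file back out of the archive, rather than from the original location, is what keeps the staged archive as the single source for chain-of-custody purposes.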

The current and future building blocks of the system are HDFS, Hadoop, HBase, Tika, Lucene, Solr, Mahout, Hive, and Pig.  A proprietary enhancement used for quick searches and review will include DataStax technologies.

Indexing and searching

Culling is accomplished through the use of an open source search engine called Lucene. An efficient in-memory index is created for each document, and all of the project’s keywords are run against this index. If the index contains any of the keyword combinations, the document is considered responsive and is sent for further processing.
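The culling logic can be sketched without Lucene. In the simplified Python below, a set of lowercased terms stands in for the per-document in-memory index, and a keyword combination is assumed to mean "all of these terms appear in the document"; real Lucene queries are far richer, and the function names are hypothetical.

```python
import re


def build_index(text):
    """Stand-in for the per-document in-memory index: the set of
    lowercased alphanumeric terms found in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_responsive(text, keyword_combinations):
    """A document is responsive if at least one keyword combination
    has all of its terms present in the document's index."""
    index = build_index(text)
    return any(
        all(term.lower() in index for term in combination)
        for combination in keyword_combinations
    )
```

For example, with the combinations `[("merger", "price"), ("fraud",)]`, a document mentioning both "merger" and "price" is kept, while one that mentions neither combination is culled.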

A feature that is currently being tested is the capability to store each search index for each document in a complete Lucene index. This allows for additional searching and culling to be performed once the project processing is completed.

This process is made even more efficient and flexible because each node on a Hadoop cluster is creating its own Lucene index. The indexes can then be used for searches, where the software queries all of them in a combined query. For the sake of efficiency, the indexes get merged into the project’s search index during the next step of processing.

Output

Metadata results are output as a CSV file, while the native files and the extracted text are stored in one or more zip files. The end results can be used for culling and producing native files for legal review.
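The shape of this output can be sketched in Python as follows. The record layout, the folder names inside the archive, and the column order are all assumptions made for illustration; they are not the actual SHMcloud load-file format.

```python
import csv
import io
import zipfile


def write_results(records):
    """Write metadata to CSV text and native files plus extracted text to a zip.

    Each record is a dict with 'id', 'name', 'native' (bytes), 'text' (str),
    and any number of extra metadata fields. Returns (csv_text, zip_bytes).
    """
    # Every key except the file payloads becomes a metadata column.
    meta_fields = sorted({key for r in records for key in r} - {"native", "text"})
    csv_buffer = io.StringIO()
    writer = csv.DictWriter(csv_buffer, fieldnames=meta_fields)
    writer.writeheader()
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "w") as archive:
        for r in records:
            writer.writerow({key: r.get(key, "") for key in meta_fields})
            archive.writestr(f"native/{r['id']}_{r['name']}", r["native"])
            archive.writestr(f"text/{r['id']}.txt", r["text"])
    return csv_buffer.getvalue(), zip_buffer.getvalue()
```

Keeping the document ID in both the CSV row and the archive entry names is what lets a review platform match each metadata row to its native file and text.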

Supported file formats

MS Office formats
PST processing
PDF
Images

Speed of processing

On regular commodity servers, SHMcloud processes about 2 GB of data per hour per server. The speed increases linearly with the number of servers in the Hadoop cluster. Thus, at a recent demo for HALSM using 50 computers on the Amazon EC2 cloud, SHMcloud processed 100 GB of Enron data in 1 hour.
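The arithmetic behind the demo figure is simple enough to write down. The Python sketch below assumes ideal linear scaling at the 2 GB per hour per server rate quoted above; real clusters lose some time to startup and coordination, so treat the result as a lower bound on processing time. The function name is hypothetical.

```python
def estimated_hours(data_gb, servers, gb_per_server_hour=2.0):
    """Estimate processing time under ideal linear scaling."""
    if servers < 1:
        raise ValueError("at least one server is required")
    return data_gb / (servers * gb_per_server_hour)


# 100 GB on 50 servers at 2 GB/hour each takes 1 hour;
# the same data on a single workstation would take 50 hours.
demo_hours = estimated_hours(100, 50)
workstation_hours = estimated_hours(100, 1)
```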

Testing

SHMsoft has a full-time tester dedicated to testing the stand-alone and cloud-based versions of FreeEed/SHMcloud. The testing is done using standard data sets, in particular the Enron set. The results of the complete Enron data processing can be found at FreeEed.org, or by navigating to http://freeeed.org/index.php/documentation/testing-with-enron-data.

Controlling the software

The SHMcloud software is controlled through a desktop application called a “Player”. The Player allows the user to set and organize projects, add data to the project, set and update processing parameters, stage the data (copy it to archive files for deployment on the Hadoop cluster or on the Amazon AWS cloud), and then to start and control processing.

The web browser-based GUI is under development, first for the search and culling, and later to replace the Player.

The back-end processing, residing on an internal Hadoop cluster or in the private AWS Amazon Cloud, is referred to as SHMcloud.  It consists of the same SHMcloud software deployed to every cluster node. The Player organizes the cluster processing. This is illustrated in the diagram below.





In the near future, SHMcloud processing will have the following enhancements:

  1. Browser interface, instead of a desktop application
  2. Optimized data harvesting
  3. Added proprietary data sources and databases
  4. Allow searches and first-pass review directly on the cluster
  5. Allow additional culling, based on previous results

The enhanced near-term processing is illustrated below:




The next enhancements for SHMcloud will include:

  1. Advanced analytics
  2. Review built on Big Data


Comparison of FreeEed / SHMcloud editions


Edition/Feature | FreeEed | SHMcloud for Amazon AWS | SHMcloud for Hadoop cluster - support
License | Apache | Proprietary | Proprietary
Player (desktop application) for local one-workstation processing | Free | Free | Free
Player app to control cluster processing | It works, but you do it yourself | Enterprise support | Enterprise support
Levels of support | Email, community | Training, implementation, and support, 8 through 5, or 24x7 | Training, implementation, and support, 8 through 5, or 24x7
Pricing | Free | $0/month + $1 / server instance hour + AWS charges | Yearly: $2,500 per node on Hadoop cluster
Text and metadata extraction, culling, load file | Yes | Yes | Yes
OCR | No | Yes | Yes
Imaging | No | Yes | Yes
Deduplication | No on Windows, yes on Linux | Yes | Yes
Speed | 2 GB / hour | 2 GB / hour * number of machines in the cluster (which is limited only by your AWS account) | 2 GB / hour * number of machines in the cluster
Formats: MS Office, PST, PDF, images | Yes | Yes | Yes
Custom formats | No | Optional | Optional
Databases | No | Optional | Optional
Integration support | No | Available | Available
Training | No | Available | Available
Scalability | Limited by one workstation, or by your own support on the cluster | Basically, unlimited - you only need to increase your maximum number of instances assigned by Amazon | Like any Hadoop cluster, depends on your hardware



Third-party validations

1. Beta testing.

FreeEed / SHMcloud has been tested by a number of parties, including PricewaterhouseCoopers and various eDiscovery service bureaus.

2. Publications.

LTN regularly writes about FreeEed and SHMcloud, see for example, here, http://www.law.com/jsp/lawtechnologynews/PubArticleLTN.jsp?id=1202556672056&SHMsoft_Tests_Open_Source_EDiscovery_App_in_Cloud&slreturn=20120813164425

3. Comparisons to other eDiscovery software.

"Capital Toomey," an e-discovery blogger from Albany, New York, recently posted about his tests of the core FreeEed and said he's optimistic about the program's future.  He noted that as with many open-source applications, FreeEed requires some technical know-how and has room for improvement in its user interfaces.  After making his data available from FreeEed and LexisNexis Concordance tests, Toomey writes: “[...] most any e-discovery tool -- in the proper hands -- can be employed successfully." See here, http://capitaltoomey.blogspot.com/2012/04/ediscovery-lower-in-stack-ptiv-wrap.html

4. University research.

a) Marcel Miersebach, a student of computer security in Vienna, Austria, at Fachhochschule St. Pölten, wrote the following paper, “eDiscovery with Hadoop: Is open-source an option?” In this 100+ page work, Marcel compares FreeEed to NUIX, against the general background of eDiscovery.  The draft version of the paper (in German, with a summary in English) can be found here: http://shmsoft.com/images/stories/Diplomarbeit_Marcel_Miersebach-v3.0.pdf.

b) A group of MBA students at the University of Houston chose SHMsoft as their graduation project. Their work includes the analysis of the company strategy and improvement suggestions. Here are the final presentation: http://shmsoft.com/images/stories/SHMsoft%20Project%20Presentation.pptx, and the executive summary: http://shmsoft.com/images/stories/Exec-Summary.pdf.

Monday, December 3, 2012

SHMsoft News - eDiscovery GUI

Let's take a quick look at what happened at SHMsoft last week:

  1. We started receiving feedback from our beta testers and advisors on the user interface for eDiscovery processing and culling, and we already started implementing the changes. It will be in the browser, simple, and powerful, and it will run on a single workstation, internal Hadoop cluster, or EC2 - take your pick.
  2. We are deepening our Hadoop training expertise with administrators’ and operations needs, and with advanced MapReduce and HBase programming techniques.
  3. We are expecting a great week to come working with law firms, hosting providers etc. - stay tuned.

Thank you and best regards from the SHMsoft team.
12.03.2012

Sunday, November 18, 2012

SHMsoft News: eDiscovery, bio-informatics and security

Let's take a quick look at what happened at SHMsoft last week:

  1. While still perfecting search, OCR and imaging capabilities for SHMcloud, we are now working on tagging and export;
  2. SHMsoft got a very active reseller in London, England. The business model that we created  for this seems to work, and more eDiscovery consultants are welcome to apply.
  3. We have performed our first pilot Big Data projects for a large chemical company, dealing with comparative genomics and with crop genomics. Accordingly, we are implementing company-wide bioinformatics education with courses like this one.
  4. Performed Big Data training for a major hardware manufacturer, and more courses are scheduled.
  5. SHMsoft is in talks with a fast-growing cloud security provider, to provide cloud implementation services for their clients.

Thank you and best regards from the SHMsoft team.
11.18.2012

Monday, November 5, 2012

SHMsoft news


  • Search, OCR, and imaging are released. You can look in the manual here, http://shmsoft.com/manual.html, or email us asking how you can use those features for your needs.
  • SHMcloud pricing for the EC2 and for internal Hadoop cluster maintenance is ready and will be announced next week.
  • We are working on further improvements to SHMcloud.
  • We are taking steps toward an analytics platform.

Thank you and best regards from the SHMsoft team.
11.04.2012

Thursday, October 25, 2012

SHMsoft News: It’s All About Team Building

Let's take a quick look at what happened at SHMsoft last week:

  • Our team of programmers grew.  Together with our tester they are having a wonderful time developing, testing, fixing bugs, and documenting the new features.
  • OCR, imaging, and search have been added to SHMcloud and are in testing by our team.
  • Our beta testers are excited about the enhancements and are eagerly anticipating their use.
  • Our lead Natural Language Processing (NLP) developer started his work on the integration of the NLP technology for SHMcloud, in close collaboration with the technology provider.
  • SHMsoft’s total headcount grew to over twenty!
  • Our Lead Data Scientist is currently working on the analytics design for our NLP developer to implement.
  • Our marketing specialist and our technical editor are working full swing on our new web site.

Saturday, October 13, 2012

SHMsoft / FreeEed / SHMcloud weekly news, 10/07/12

Let's take a quick look at what happened at SHMsoft last week:

  • Instant search for our SHMcloud platform is now ready and undergoing testing before the final release.
  • SHMsoft was the 2nd place winner of the 2012 Goradia Innovation Awards at a prestigious  innovation conference organized by the Houston Technology Center. Read the full story soon. 
  • SHMsoft’s EDD Foundation is featured in Law Technology News as an important step in open-source, law technology projects. Read the full article here
  • One of the largest and best-known companies in IT has chosen SHMsoft for its big data and Hadoop training, as well as its pilot software products implementation.
  • Gallery Furniture, a big and very well known Houston retailer, selected SHMsoft as its partner for researching, developing and implementing a big data solution for targeted and direct marketing. 
  • SHMsoft was chosen to co-present with Jim McIngvale at the American Marketing Association forum event in Houston on Dec. 7, to talk about big data marketing and how Gallery Furniture is using it to leverage sales. 

Wednesday, October 3, 2012

Houston Hadoop Meetup - Pig hands-on

Ravi Mutyala of Hortonworks (but also great by himself) presented Pig, and covered a lot of material in just a couple of hours. Everybody got a VM in the cloud, and Ravi provided the data and the code for the labs, which can be found here:

Slides with all links: http://www.slideshare.net/rmutyala/introduction-to-pig-14641274
Data: https://s3.amazonaws.com/freeeed.org/labs.tar.gz

We overcame all obstacles, such as internet connectivity problems and figuring out which AMI to use for training (use mine! - no, use mine!).

Thank you everybody, and we will do a better job preparing the AMIs, code, and data next time, when Ravi presents Hive!

Monday, September 24, 2012

SHMsoft office at Caroline Collective

We got furniture and  whiteboards, and created our first "Hello World" application.

SHMsoft Week in Review - September 23, 2012

Let's take a quick look at what happened at SHMsoft last week:

  • With OCR completed, we are working on the instant search feature for SHMcloud.
  • An IT company based in London is working to become a reseller of SHMcloud technology for private clouds.
  • We are in the midst of continued demos and conversations with a major eDiscovery software publisher.
  • SHMsoft is set up as a strategic partner for Hadoop / HBase pilot projects with a major hardware manufacturer.
  • SHMsoft is being set up to provide Hadoop administrative services for a major US bank.
  • We delivered a final report for a large Houston retailer.  Next week we will be doing a “proof-of-concept” for their marketing application, based on Big Data.

Tuesday, September 18, 2012

SHMsoft Week in Review - September 16, 2012

Let's take a quick look at what happened at SHMsoft last week


  • We have completed a white paper called "FreeEed - Open Source Big Data Solution for eDiscovery"; you can download it here.
  • A more complete version of the SHMcloud (TM) manual can be found here.
  • We are finally listed on Amazon Marketplace here, and it comes up as the top result in searches for “e-Discovery”.
  • Working closely with our partners, we are developing leads for clients that understand the benefits of open-source, corporate e-discovery based on Hadoop clusters.
  • We have initiated the formation of the “Edd Foundation,” to house our open-source FreeEed application as well as to foster other open-source e-Discovery projects.
  • We have finalized our presentation and we are ready to go for the first prize in the Goradia startup competition at the IC&S event of the Houston Technology Center.
  • A large computer hardware manufacturer selected SHMsoft as an implementation partner for Hadoop/Big Data pilot projects for its clients.




Monday, September 10, 2012

SHMsoft Week in Review - September 09, 2012

Let's take a quick look at what happened at SHMsoft last week

  • SHMcloud is at version 4.1.6, with OCR, which exhibits performance of 7 seconds per image on an Amazon (AWS) medium-power machine, and up to 96% accuracy.
  • Our advisory board is complete, including specialists in eDiscovery, forensics, and Big Data marketing http://shmsoft.com/index.php/advisory-board.
  • We had a most successful Hadoop Meetup in our new office, located at 4820 Caroline Suite 103 in Houston. Twenty people came and enjoyed pizza while watching a presentation by Dianhui Zhu on "Genomic data analysis with Hadoop."
  • Work with an energy company using SHMcloud continues. We outlined a mutual roadmap for them and we will be working on defining the scope of their SHMcloud use.
  • An energetic lawyer joined the FreeEed project to market, advocate, and evangelize it. He will also provide technical guidance.
  • SHMsoft was selected as a finalist in the Goradia start-up competition. Our final presentation will happen on October 3.

Thursday, September 6, 2012

Houston Hadoop Meetup, September 2012, guest post from the presenter, Dianhui Zhu

This is what Dianhui (a.k.a Dennis) reported.

Dianhui (Dennis) Zhu presented his work on “Genomic Data Analysis with Hadoop”. Some of the Hadoopers who didn’t attend his first presentation were very interested in his work and requested this 2nd presentation.

The presentation was very successful. It started with a short description of the challenges that the researchers are facing in the field of genomic data analysis. It was then followed by a use case study on the use of Hadoop framework to detect genomic sequence patterns.

Dianhui walked the Houston Hadoopers through the code and showed how to test it. In the last 20 minutes, Dianhui demonstrated how to set up a 4-node Hadoop cluster from scratch and then did a live demo on the cluster. A performance comparison was conducted based on a real dataset. The attendees saw that processing on the 4-node cluster was roughly three times as fast as on a single node.

The meetup was held for the first time at SHMsoft's new office location. Hadoopers were energized by pizza (thanks to the hospitality of SHMsoft). The presentation was accompanied by a lively interactive session with excellent questions and answers. It finished with a perfect live demo.

See y'all there next time!

Art: Vassily Kandinsky - Untitled First Abstract Watercolor

Monday, September 3, 2012

SHMsoft Week in Review - September 02, 2012

Let's take a quick look at what happened at SHMsoft last week

  • SHMcloud V4.1.6 has been released, and now it is using a better way of distributing our software, called Elastic Block Storage, or EBS, which is a prerequisite for being listed on AWS Marketplace.
  • OCR integrated and tested, locally and on the cloud.
  • Formulated a product development roadmap, based on market research, expert input and projected funding.
  • Through our partnership with Hortonworks, we have established a relationship with a  NY law firm interested in Hadoop/Big Data eDiscovery.
  • We are working with a big Legal Process Outsourcing (LPO) firm, which is looking forward to testing and suggesting improvements in the SHMcloud software for their needs.
  • Got a new office at Caroline Collective at 4820 Caroline, Suite 103, Houston, TX 77004.
  • Got a new adviser, Big Data architect Amit Rathore.
  • Started working on search engine optimization (SEO), to drive traffic to our website.
  • Started working on the web site itself, to better explain our offerings.

Thursday, August 30, 2012

FreeEed gets a home



FreeEed got a home, and SHMsoft got an office - to better support our users. The place is called Caroline Collective, and the feeling is just like in Silicon Valley. Plus, it's in the Museum District, so in that sense it is like Cision in Chicago, where you can walk to the museum (with the famous lions).

Monday, August 27, 2012

SHMsoft / FreeEed / SHMcloud weekly news, 8/26/12


  • SHMcloud V4.1.4 has been released; GUI bugs fixed.
  • OCR tested, adding it to SHMcloud install on Amazon.
  • Started on specifications for search in Early Case Assessment (ECA).
  • Started on phase 1 of Big Data social media marketing project for a large local retailer.
  • Staff: an experienced data scientist leads this Big Data social media project.
  • Having been accepted as an APN Amazon partner, SHMsoft is in the final stages of getting listed on AWS Marketplace with our SHMcloud(TM) product.
  • A large LPO firm in India started working on SHMcloud, to make sure that the product satisfies their needs for scalable processing.
  • A new financial consultant started working on SHMsoft’s business and financial plan.

Sunday, August 19, 2012

SHMsoft / FreeEed / SHMcloud weekly news, 8/19/12

  • SHMcloud V4.1.2 has been released; it features all processing parameters arranged in tabbed dialog for the convenience of the operator, and more parameters have been added
  • OCR capability is in testing, expected to be released soon
  • Staff additions: a software developer, a Hadoop administrator, an eDiscovery project manager, and an eDiscovery business advisor all started this past week
  • We are in meetings with two energy companies, who are planning to use SHMcloud as part of their internal eDiscovery handling
  • A New York law firm working with Hortonworks saw our Hadoop-based eDiscovery and asked for a presentation next week
  • SHMsoft has become an official member of the Amazon Partner Network (APN), of Rice Alliance for Technology and Entrepreneurship, and was selected to present to the judges of the Goradia startup competition