Tuesday, July 7, 2015

The power of text analytics at DARPA/Memex

One of the things we are doing in the DARPA Memex program is text analytics. One of the outcomes of it is an open source project called MemexGATE.

By itself, GATE stands for Generic Architecture for Text Engineering, and it is a mature and widely-used tool. It is up to you to create something useful with GATE, and MemexGATE is our first step. This is an application configured to understand court documents. It will detect people mentioned in the documents, dates, places, and many more characteristics that take you beyond plain key word searches.

To achieve this, GATE combines processing pipelines (such as sentence splitter, language-specific word tokenizer, part of speech tagger, etc) with gazetteers. Now, what is a gazetteer? -- It is a list of people, places etc. that can occur in your documents. MemexGATE includes scripts that collect all US judges, for example, so that they can be detected, when found in a document.

But MemexGATE does more: it is scalable. Building on the Behemot framework, it can parallelise processing for the Hadoop cluster, thus putting no limit on the size of the corpus. MemexGATE was designed and implemented by Jet Propulsion Lab team, and the project committer is Lewis McGibbney.

The picture shown above gives an example of a processed document (from NY court of appeals), with specific finds color-coded. In this way, we process more than 100,000 documents. Why is this useful for us at Memex? - Because we are trying to find and parse court documents related to labor trafficking, so that we can analyze them and better understand the indicators of labor trafficking.

It is very exciting to work on the Memex program. Our team is called "Hyperion Gray" and has been featured in Forbes lately.

What's next? One of the plans is to add understanding of documents to FreeEed, the open source eDiscovery. Instead of just doing keyword searches through the document, the lawyers will be able, by the addition of text analytics, make more sense of the documents: detect people, dates, organizations, etc. This will in turn help create the picture of the case in an automated way.

Disclaimer: we are not official speakers for Memex.
.

No comments: