Case study
We have collected all appeal documents from the NY Court of Appeals. For that, we crawled the court website and collected approximately 100,000 documents.
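The crawling step can be sketched as a small "polite" loop that pauses between requests so the court website is not overloaded. This is a minimal illustration, not the actual CourtDocs crawler; the function name, the injectable `fetch` parameter, and the delay value are assumptions for the sketch.

```python
import time
import urllib.request

def polite_crawl(urls, delay_seconds=2.0, fetch=None):
    """Download each URL in turn, pausing between requests.

    `fetch` is injectable so the politeness logic can be tested
    without network access; by default it uses urllib.
    """
    if fetch is None:
        fetch = lambda url: urllib.request.urlopen(url, timeout=30).read()
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # politeness delay between requests
        pages[url] = fetch(url)      # store the raw document body
    return pages
```

With a couple of seconds between requests, collecting on the order of 100,000 documents takes a few days, which is why the pre-crawled archive linked in the comments below is handy.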
We have then configured the GATE (General Architecture for Text Engineering) tool to extract the information of interest from every document.
Here is a screenshot of the GATE screen configured to extract the information. It takes a few minutes to extract this information from the 100,000 appeal cases, and the output is a CSV file that can be opened as a spreadsheet.
To verify the quality of the information extraction, we review the statistics. Below is an example from one of the latest runs. It shows, for each field, the percentage of cases in which the information was present in the document and successfully extracted by the software.
Files in dir: 111018
Docs processed: 100.0%
Case number: 100.0%
Metadata extracted: 100.0%
Civil: 71.0%
Criminal: 29.0%
Court: 94.7%
Gap days: 92.7%
First date: 92.8%
Appeal date: 100.0%
Judge: 85.8%
Other judges present: 98.4%
District attorney: 61.3%
Assistant district attorney: 100.0%
Crimes: 37.7%
County: 91.7%
Mode of conviction: 53.9%
Keywords: 93.3%
Interest of justice: 4.9%
References to cases: 19.9%
Number of output files: 12
Runtime: 2086 seconds
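Statistics like those above can be recomputed directly from the output CSV by counting, for each column, the fraction of rows where the field is non-empty. The sketch below assumes generic column names; the actual field names in the GATE output may differ.

```python
import csv
from collections import Counter

def field_coverage(csv_path):
    """Percentage of rows in which each CSV column is non-empty."""
    filled = Counter()
    rows = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            rows += 1
            for field, value in row.items():
                if value is not None and value.strip():
                    filled[field] += 1  # field was extracted for this case
        fields = reader.fieldnames or []
    if rows == 0:
        return {}
    return {field: round(100.0 * filled[field] / rows, 1) for field in fields}
```

Running this after each extraction run makes regressions easy to spot: a field whose coverage drops sharply usually signals a change in the documents or a broken extraction rule.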
Our verification confirmed that the rate of successful extraction (when the information is actually present in the document) is high.
Below is an example screenshot of the information obtained. The output for all documents (25 MB) can be downloaded from here.
Adding this information to eDiscovery
There are two ways to add this information to FreeEed:
- The metadata fields can be added to the documents, and FreeEed configured to add them to the review; or
- The GATE workflow can be compiled and run directly within FreeEed.
Conclusions
Configuring the GATE tool is an acquired skill, but even the out-of-the-box extractors provide useful information. This work was done as part of the DARPA Memex project, and the researchers found the extracted information extremely useful.
By the way, we provide training in all these technologies.
6 comments:
Just curious, how did you collect the documents [from the Appellate Division] used for this analysis?
Here is the code for the 'polite' crawler that will download all the documents from the court website: https://github.com/shmsoft/CourtDocs
And here is a link to the parsed documents, https://s3.amazonaws.com/elephantscale-public/courtdocs/ny_court_docs.tar.gz, so that you don't have to run the crawler.
I hope you see my answer; since you posted as anonymous, you might not get a notification.
Message received, thank you.