Thursday, December 29, 2016

Using FreeEed in the Memex program for investigations

A common problem in investigations is that the authors of the research software produced in the course of the Memex program are themselves not authorized to see the data that the investigating agencies deal with.

To address this problem, we added hash search to FreeEed. First, we added a metadata screen (which was not previously available), so users can now see the metadata.

This screenshot presents the view of the metadata table. Metadata, of course, is "data about data." The table shows all the fields collected from the documents being searched, together with their "a.k.a." names, or synonyms. For example, in this screenshot, you can see that field 22 can be called "From," but it can also be called "Author" or "Message-From." You can also see that there is now a new field, called "Hash."

Next, the file hash was added to the metadata field settings. Users had been requesting this feature, and now it is available. For emails, the hash is computed from the popular email fields. In FreeEed, this is configurable through the database.
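To make the idea of an email hash built from fields concrete, here is a minimal sketch in Python. It is not FreeEed's actual code: the list of fields, the MD5 algorithm, and the function name email_hash are all assumptions made for illustration. In FreeEed the real field list comes from the database configuration.

```python
import hashlib

# Hypothetical field list; in FreeEed the actual fields are configured
# in the database and may differ from these.
EMAIL_HASH_FIELDS = ("From", "To", "Subject", "Date")


def email_hash(metadata, fields=EMAIL_HASH_FIELDS):
    """Hash the selected email fields in a fixed order (MD5 is an assumption)."""
    digest = hashlib.md5()
    for name in fields:
        value = metadata.get(name, "").strip()
        # Separate names and values with a NUL byte so that adjacent
        # fields cannot run together and produce the same input.
        digest.update(name.encode("utf-8") + b"\x00")
        digest.update(value.encode("utf-8") + b"\x00")
    return digest.hexdigest()


message = {
    "From": "alice@example.com",
    "To": "bob@example.com",
    "Subject": "Quarterly report",
    "Date": "Thu, 29 Dec 2016 10:00:00 -0600",
}
print(email_hash(message))
```

Because only the listed fields feed the hash, two copies of the same email that differ in some other header still get the same value.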

This hash is shown in the screenshot on the left, which represents the 'load file' output by FreeEed. There it is seen with other popular metadata fields, which were recently added by request, such as Message-ID.

The investigating agency can simply compute the hashes of the objects they are looking for, such as texts, phones, images, or anything else, and search for those hashes without revealing to the authors of the software or to the processors what they are searching for. Entities other than investigating agencies may find this feature useful as well.
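Computing such a hash on the agency's side takes only a few lines. Here is a minimal sketch in Python using the standard hashlib module. The function names are ours, and the default of MD5 over the raw file bytes is an assumption: check which algorithm and input your FreeEed configuration uses before comparing values.

```python
import hashlib


def hash_bytes(data, algorithm="md5"):
    """Return the hex digest of an in-memory byte string."""
    return hashlib.new(algorithm, data).hexdigest()


def hash_file(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read until an empty byte string signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


print(hash_bytes(b"hello"))          # MD5 by default
print(hash_bytes(b"hello", "sha1"))  # SHA-1 on request
```

Only the resulting hex strings need to leave the agency; the objects themselves never do.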


Now, this shows in the processing results, but is it searchable? For that, Hash has been added to the schema of the FreeEedUI search engine (which is Solr). Hash now shows up as one of the fields for each document, as the screenshot shows.

The last question: can one search with just the hash value? The answer is yes, you can search on the hash alone. To verify this, pick one of the hashes that you saw in the documents and search for that value. You will find exactly that one document. This is to be expected, since hashes such as MD5 and SHA-1 are designed so that two different documents are extremely unlikely to share the same value. The last screenshot illustrates this.
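If you prefer to query Solr directly rather than through the FreeEedUI screen, the same search can be expressed as a URL. Here is a minimal sketch in Python that only builds the URL and does not contact any server. The host, port, and core name are hypothetical placeholders, and the field name "Hash" is taken from the screenshot description above, so adjust all of them to your install.

```python
from urllib.parse import urlencode


def build_hash_query_url(base_url, core, hash_value, field="Hash"):
    """Build a Solr select URL that matches one exact hash value."""
    params = {
        # Quote the value so Solr treats it as a single exact term.
        "q": '{}:"{}"'.format(field, hash_value),
        "wt": "json",
    }
    return "{}/{}/select?{}".format(base_url.rstrip("/"), core, urlencode(params))


# Hypothetical Solr location and core name, for illustration only.
url = build_hash_query_url(
    "http://localhost:8983/solr",
    "example_core",
    "5d41402abc4b2a76b9719d911017c592",
)
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the one matching document if the hash is in the index.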

Additionally, FreeEed can provide the results sorted by user-defined "document significance," using the user-provided functions. Such functions are supplied by the Memex groups.


Sunday, December 25, 2016

Word clouds in FreeEed

Word clouds have been added to FreeEed as an early release. To try them, download the jar from https://s3.amazonaws.com/shmsoft/releases/freeeed-processing-1.0-SNAPSHOT-jar-with-dependencies.jar and replace the jar of the same name in your install. Then run freeeed_player.sh (.bat) as usual.

Here is an example of a word cloud and a screenshot of the Analytics menu, which features word clouds.

The word cloud is from the project included with FreeEed, which is just a collection of unconnected documents, so the cloud is not very meaningful. With your own data, you should get something more useful and related to your use cases.

Your feedback will be very much appreciated.

Monday, December 12, 2016

Hadoop going to China

Actually, Hadoop is already in China. Here is the largest Hadoop distribution company in China, called Transwarp: three hundred customers and counting, one hundred engineers and growing, and five training centers across China.

Nevertheless, there is still a "way to go" in this direction, as our cartoon aptly shows.

Thursday, December 1, 2016

Kent Graziano presents Snowflake at Houston Hadoop & Spark Meetup

Another great presentation at the Meetup, by Kent Graziano. Read all about the presenter, the subject, and the feedback here: https://www.meetup.com/Houston-Hadoop-Meetup-Group/events/235608911/

And here are the slides: http://www.slideshare.net/elephantscale/changing-the-game-with-cloud-dw

See you next time!