Friday, November 4, 2016

Using FreeEed for social media discovery

One of the areas in which the Memex/DARPA teams excel is crawling. FreeEed and the people behind it are part of the Memex program, so it was quite natural to integrate discovery of crawl results into FreeEed processing and review.

Here is a recent Forbes article about the team.

Search of websites and social media has been available in FreeEed since version 7. Crawl results are commonly stored as JSON, where each JSON object describes a website page, a user post, or a similar item.

Each JSON entry occupies a single line in the archive file. The archive is given the extension *.jl, which stands for "JSON Lines".
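To make the one-entry-per-line layout concrete, here is a minimal sketch of writing crawl items to a *.jl file with the Python standard library. The helper name write_jl, the file name crawl.jl, the example.com URLs, and the "url" field are illustrative assumptions; only "text" and "authors" are fields named in this post.

```python
import json
import os
import tempfile


def write_jl(records, path):
    """Write each record as one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps without indent never emits a raw newline;
            # newlines inside strings are escaped as \n, so each
            # record stays on a single line.
            f.write(json.dumps(record) + "\n")


# Made-up sample items; field names other than text/authors are assumptions.
records = [
    {"url": "http://example.com/post/1",
     "text": "First post\nwith two lines",
     "authors": ["alice"]},
    {"url": "http://example.com/post/2",
     "text": "Second post",
     "authors": ["bob"]},
]

path = os.path.join(tempfile.mkdtemp(), "crawl.jl")
write_jl(records, path)

# Read the file back: one line per record, each line valid JSON on its own.
with open(path, encoding="utf-8") as f:
    lines = f.readlines()
```

Because every line is a complete JSON document, such a file can be appended to, split, or processed line by line without loading the whole archive.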

FreeEed understands the *.jl extension, parses the JSON content of every line in the *.jl file, indexes fields such as text, authors, etc., and makes them searchable in the FreeEed Review tool.
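The parse-then-index idea can be illustrated with a small sketch. This is not FreeEed's actual implementation; it is a toy inverted index in Python, and the function names read_jl and build_index, as well as the sample data, are hypothetical.

```python
import json
from collections import defaultdict


def read_jl(lines):
    """Yield one dict per line, skipping blank lines and invalid JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            item = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a damaged line instead of aborting
        if isinstance(item, dict):
            yield item


def build_index(items, fields=("text", "authors")):
    """Map (field, lowercase token) to the set of item ids containing it."""
    index = defaultdict(set)
    # doc_id counts successfully parsed items, not raw line numbers.
    for doc_id, item in enumerate(items):
        for field in fields:
            value = item.get(field, "")
            if isinstance(value, list):
                value = " ".join(str(v) for v in value)
            for token in str(value).lower().split():
                index[(field, token)].add(doc_id)
    return index


# Made-up sample lines, including one that is not valid JSON.
sample = [
    '{"text": "Hello from the crawl", "authors": ["alice"]}',
    'not json',
    '{"text": "Second post", "authors": ["bob", "alice"]}',
]
index = build_index(read_jl(sample))
print(sorted(index[("authors", "alice")]))  # prints [0, 1]
```

A real review tool would use a full-text search engine rather than whitespace tokenization, but the flow is the same: one line, one JSON object, selected fields made searchable.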

Below is a screenshot of FreeEedUI review, illustrating searches in a collection from an escort services website.

How do you create your own crawler? You can use the crawler from Scraping Hub, also a member of the Memex team. Or you can use a trusted friend, Apache Nutch. Nutch has been around for more than ten years, and it is the project from which Hadoop originated.

By the way, we provide training in all these technologies.
