Wednesday, June 10, 2015

Joe Witt of Onyara presented Apache NiFi

Joe Witt and the team of Onyara came to present Apache Nifi at Houston Hadoop Meetup. The NiFi project is the result of eight years of development at NSA, which has been open sourced in November of 2014.

The project is for automating enterprise dataflows, and its salient use cases are
  • Remote sensor delivery
  • Inter-site/global distribution
  • Intra-site distribution
  • "Big Data" ingest
  • Data Processing (enrichment, filtering, sanitization)
For the rest, in the words of Shakespeare

"Let Lion, Moonshine, Wall, and lovers twain

At large discourse, while here they do remain."

Meaning, in our case, here are the slides, courteously provided by Joe.

Oh, and there WAS a live demo, so those who missed it - missed it.

As always, pizza was provided by Elephant Scale LLC, Big Data training and consulting.

Monday, June 8, 2015

Big Data Cartoon - Summer Fun


Summer is the time to have fun and to get some rest! While their moms and dads are presumably coding away some new Big Data app, their kids can go to the summer camp. So did our Big Data cartoonist, who is now working as a summer camp artistic director. (These "cartoons" are really the large size decorations there.)

But you can see the same themes, albeit hidden: the tiger is no doubt the new elephant, and the magicians are the software engineers.

Thursday, June 4, 2015

Review of "Apache Flume" by Steve Hoffman (Packt)

This is a second edition of the Apache Flume book, and it covers the latest Flume version 5.2. The author works at Orbitz, so he can draw on a lot of practical Big Data experience.

The intro chapter takes you through the history, versions, requirements, and the install and sample run of Flume. The author gives you the information on useful undocumented options and takes you to the cutting edge with submitting new requests to the Flume team (using has request as an example).

That should be enough, but the justification for the existence of the book and all the additional architectural options with Flume are this: real life will give you data collection troubles you never before though of. There will be memory and storage limitation on any node where you would install Flume, and that is why your real-world architectures will be multi-tiered, with part of the system being down for significant lengths of time. This is where more knowledge will be required.

Channel and sinks get their own individual chapters. You will learn about file rotation and data compressions and serialization mechanisms (such as Avro) to be used in Flume. Load balancing and failover descriptions will help you create robust data collection.

Flume can collect data from a variety of sources, and chapter five describes them, with a lot of in-the-know information and best practices and potential gotchas.

Interceptors (and in particular the Morphline interceptor) are a less known, but very powerful libraries to improve your data flows in Flume. They are a part of KiteSDK.

Chapter seven, “Putting it all together” leads you through a practical example of collecting the data and storing it in ElasticSearch, under specific Service Level Agreements (SLA), and the setting up Kibana for viewing the results.

The chapter on monitoring is useful because monitoring, while important, is as yet not complete in Flume, and the more up-do-date information on it you can get, the better – to avoid flying the dark. Imagine someone telling you that you've been loosing data for the month, and that parts of your system were not working, unbeknownst st to you. To avoid this, use monitoring!

The last chapter gives advice on deploying Flume in multiple data centers and on the “evils” of time zones.

All in all, a must for anyone needing data collection skills in Big Data and Flume.



Friday, May 8, 2015

Big Data Cartoon: Spark workshop

So what is Apache Spark workshop? You would imagine that sparks are flying off the participants and, at least in the eyes of our illustrator, this is absolutely true. By the way, Elephant Scale will be teaching a one-day workshop soon, in Dallas, TX, so stay tuned.

(What is the inspiration behind this drawing? It Rembrandt's - genius painting but gruesome subject - you have been warned - here.)

Wednesday, April 29, 2015

I am a reviewer on "Real-time Analytics with Storm and Cassandra"


  • Create your own data processing topology and implement it in various real-time scenarios using Storm and Cassandra
  • Build highly available and linearly scalable applications using Storm and Cassandra that will process voluminous data at lightning speed
  • A pragmatic and example-oriented guide to implement various applications built with Storm and Cassandra

Tuesday, April 28, 2015

Big Data Cartoon - In the Big Small world

Working with Big Data, you are also working across the globe. As Sujee Maniyam puts it in his presentation on Launching your career in Big Data: "If you think of a nine-to-five job - forget it!" You are working with multiple teams at all hours. The reward, in the words of Shakespeare

Why, then the world's mine oyster.

Many people quote this, but in the play it's better. Falstaff fires Pistol, his servant, and refuses to lend him money. Pistol decides to fend for himself. Fastaff is being witty, at his own expense: Pistol will betray him.


Enter FALSTAFF and PISTOL
FALSTAFF
I will not lend thee a penny.
PISTOL
Why, then the world's mine oyster.
Which I with sword will open.
FALSTAFF
Not a penny. I have been content, sir, you should
lay my countenance to pawn...