This is the second
edition of the Apache Flume book, and it covers the latest Flume
version 1.5.2. The author works at Orbitz, so he can draw on a lot of
practical Big Data experience.
The intro chapter
takes you through the history, versions, and requirements of Flume,
then through installing it and doing a sample run. The author gives
you information on useful undocumented options and takes you to the
cutting edge with submitting new requests to the Flume team (using
his own request as an example).
That should be
enough, but the justification for the existence of the book and all
the additional architectural options with Flume is this: real life
will give you data collection troubles you never before thought of.
There will be memory and storage limitations on any node where you
install Flume, and that is why your real-world architectures
will be multi-tiered, with parts of the system being down for
significant lengths of time. This is where more knowledge will be
required.
Channels and sinks
get their own individual chapters. You will learn about file rotation,
data compression, and the serialization mechanisms (such as Avro) that
can be used in Flume. The descriptions of load balancing and failover
will help you create robust data collection.
Flume can collect
data from a variety of sources, and chapter five describes them, with
a lot of insider information, best practices, and potential
gotchas.
Interceptors (and in
particular the Morphline interceptor) are a lesser-known but very
powerful way to improve your data flows in Flume. Morphlines are a
part of KiteSDK.
Chapter seven,
“Putting it all together,” leads you through a practical example
of collecting data and storing it in ElasticSearch under
specific Service Level Agreements (SLAs), and then setting up Kibana
for viewing the results.
The chapter on
monitoring is useful because monitoring, while important, is as yet
not complete in Flume, and the more up-to-date information on it you
can get, the better, to avoid flying in the dark. Imagine someone
telling you that you've been losing data for a month, and that
parts of your system were not working, unbeknownst to you. To
avoid this, use monitoring!
The last chapter
gives advice on deploying Flume in multiple data centers and on the
“evils” of time zones.
All in all, a must
for anyone needing data collection skills in Big Data and Flume.