Apache Flume: Distributed Log Collection for Hadoop (What You Need to Know)

By Steve Hoffman

If your role includes moving datasets into Hadoop, this book will help you do it more efficiently using Apache Flume. From installation to customization, it is a complete step-by-step guide on making the service work for you.

Overview

  • Integrate Flume with your data sources
  • Transcode your data en-route in Flume
  • Route and separate your data using regular expression matching
  • Configure failover paths and load-balancing to remove single points of failure
  • Utilize Gzip compression for files written to HDFS

In Detail

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with many failover and recovery mechanisms.

Apache Flume: Distributed Log Collection for Hadoop covers problems with HDFS and streaming data/logs, and how Flume can resolve these problems. This book explains the generalized architecture of Flume, which includes moving data to/from databases and NO-SQL-ish data stores, as well as optimizing performance. This book includes real-world scenarios on Flume implementation.

Apache Flume: Distributed Log Collection for Hadoop starts with an architectural overview of Flume and then discusses each component in detail. It guides you through the complete installation process and compilation of Flume.

It will give you a heads-up on how to use channels and channel selectors. For each architectural component (Sources, Channels, Sinks, Channel Processors, Sink Groups, and so on), the various implementations are covered in detail along with configuration options. You can use it to customize Flume to your specific needs. There are also pointers on writing custom implementations to help you learn and implement them.

By the end, you should be able to construct a series of Flume agents to transport your streaming data and logs from your systems into Hadoop in near real time.

What you will learn from this book

    • Understand the Flume architecture
    • Download and install open source Flume from Apache
    • Discover when to use a memory or file-backed channel
    • Understand and configure the Hadoop File System (HDFS) sink
    • Learn how to use sink groups to create redundant data flows
    • Configure and use various sources for ingesting data
    • Inspect data records and route to different or multiple destinations based on payload content
    • Transform data en-route to Hadoop
    • Monitor your data flows

    Approach

    A starter guide that covers Apache Flume in detail.

    Who this book is written for

    Apache Flume: Distributed Log Collection for Hadoop is intended for people who are responsible for moving datasets into Hadoop in a timely and reliable manner, such as software engineers, database administrators, and data warehouse administrators.


    Best software development books

    Error Control Coding: Fundamentals and Applications (Prentice-Hall Computer Applications in Electrical Engineerin)

    Using a minimum of mathematics, this volume covers the fundamentals of coding and the applications of codes to the design of real error control systems.

    Agile Software Construction

    Introduces the core concepts and evaluates how successful they can be, as well as what problems may be encountered. Dispels numerous myths surrounding agile development.

    Fathom 2: Eine Einführung (German Edition)

    Fathom 2 is a unique dynamic stochastics and data analysis software package that meets the particular needs of school and university teaching, presented here for the first time in a German adaptation. The introduction to Fathom 2 offers a quick and successful entry into this tool through numerous examples of statistical data analysis, stochastic simulation, and mathematical aspects of stochastics.

    Building Web Apps for Google TV

    By integrating the web with traditional TV, Google TV offers developers an important new channel for content. But creating apps for Google TV requires learning some new skills; in fact, what you may already know about mobile or desktop web apps is not entirely applicable. Building Web Apps for Google TV will help you make the transition to Google TV as you learn the tools and techniques necessary to build sophisticated web apps for this platform.

    Extra resources for Apache Flume: Distributed Log Collection for Hadoop (What You Need to Know)

    Sample text

    IdleTimeout property. The amount of work to close the files is pretty small so increasing this value from the default of one worker is unlikely. rollTimerPoolSize property as it is not used.

    Sink groups

    In order to remove single points of failure in your data processing pipeline, Flume has the ability to send events to different sinks using either load balancing or failover. In order to do this we need to introduce a new concept called a sink group. A sink group is used to create a logical grouping of sinks.
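The two sink group strategies can be illustrated with a minimal Python sketch. This is not Flume code or Flume's API: `SinkFailed`, `failover_send`, and `load_balance` are hypothetical names invented for illustration, and the sketch assumes that failover prefers the highest-priority sink and that load balancing uses simple round-robin rotation.

```python
from itertools import cycle


class SinkFailed(Exception):
    """Raised by a hypothetical sink when it cannot deliver an event."""


def failover_send(event, sinks):
    """Deliver `event` to the highest-priority sink that accepts it.

    `sinks` is a list of (priority, name, send_function) tuples.
    Returns the name of the sink that accepted the event.
    """
    ordered = sorted(sinks, key=lambda sink: sink[0], reverse=True)
    for _priority, name, send in ordered:
        try:
            send(event)
            return name
        except SinkFailed:
            continue  # try the next sink in priority order
    raise SinkFailed("all sinks in the group failed")


def load_balance(events, sink_names):
    """Pair each event with a sink name in round-robin order."""
    rotation = cycle(sink_names)
    return [(next(rotation), event) for event in events]
```

With failover, a lower-priority sink only receives events while every higher-priority sink is failing; with load balancing, every sink in the group shares the traffic.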

    If your Events have a large variance in size, you may be tempted to use these settings to adjust capacity, but be warned that calculations are estimated from the event's body only. If you have any headers, which you will, your actual memory usage will be higher than the configured values. Finally, the keep-alive parameter is the time the thread writing data into the channel will wait when the channel is full before giving up. Since data is being drained from the channel at the same time, if space opens up before the timeout expires, the data will be written to the channel rather than throwing an exception back to the source.
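The keep-alive behavior described above can be sketched in Python as a bounded queue whose writer waits a limited time for space. This is only an analogy, not Flume's implementation: `BoundedChannel`, `ChannelFullError`, and `keep_alive_seconds` are hypothetical names chosen for this example.

```python
import queue


class ChannelFullError(Exception):
    """Hypothetical error raised when the channel stays full past the keep-alive time."""


class BoundedChannel:
    """A fixed-capacity channel whose writers wait briefly for free space."""

    def __init__(self, capacity, keep_alive_seconds):
        self._queue = queue.Queue(maxsize=capacity)
        self._keep_alive = keep_alive_seconds

    def put(self, event):
        """Write an event, waiting up to the keep-alive time if the channel is full."""
        try:
            self._queue.put(event, timeout=self._keep_alive)
        except queue.Full:
            raise ChannelFullError("channel still full after keep-alive") from None

    def take(self):
        """Remove and return the oldest event (raises queue.Empty if none)."""
        return self._queue.get_nowait()
```

If another thread calls `take()` before the keep-alive time runs out, the blocked `put()` completes normally; otherwise the error propagates back to the writer, which plays the role of the source.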

    This is considerably slower than using the non-durable memory channel, but provides recoverability in the event of system or Flume agent restarts. Conversely, the memory channel is much faster, but a failure results in data loss, and it has much lower storage capacity when compared with the multi-terabyte disks backing the file channel. Which channel you choose depends on your specific use cases, failure scenarios, and risk tolerance. That said, regardless of what channel you choose, if your rate of ingest from the sources into the channel is greater than the rate at which the sink can write data, you will exceed the capacity of the channel and a ChannelException will be thrown.
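The capacity warning comes down to simple arithmetic: a channel fills at the difference between the ingest rate and the drain rate. The sketch below is a back-of-the-envelope estimate with a hypothetical function name and made-up numbers; it assumes both rates are constant and measured in events per second.

```python
def seconds_until_full(capacity, ingest_rate, drain_rate, backlog=0):
    """Estimate how long a channel can absorb a sustained overload.

    capacity    -- maximum number of events the channel holds
    ingest_rate -- events per second written by the sources
    drain_rate  -- events per second removed by the sink
    backlog     -- events already waiting in the channel
    Returns None when the sink keeps up and the channel never fills.
    """
    if ingest_rate <= drain_rate:
        return None
    return (capacity - backlog) / (ingest_rate - drain_rate)


# Illustrative numbers only: a 1,000,000-event channel fed at 5,000 events/s
# and drained at 4,000 events/s gains 1,000 events every second.
print(seconds_until_full(1_000_000, 5_000, 4_000))  # 1000.0
```

A larger channel only buys time during bursts; if the sources outpace the sink indefinitely, any finite capacity is eventually exceeded.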

