Using Kafka with Flume

Wesley13
• 阅读 677

这个文档是 Cloudera Distribution of Apache Kafka 1.3.x. 其他版本的文档在Cloudera Documentation.

Using Kafka with Flume

在CDH 5.2.0 及更高的版本中, Flume 包含一个Kafka source and sink。使用它们可以让数据从Kafka流入Hadoop或者从任何Flume source 流入Kafka。

重要提示:不能配置一个Kafka source发送数据到 a Kafka sink.如果这么做, the Kafka source sets the topic in the event header, overriding the sink configuration and creating an infinite loop, sending messages back and forth between the source and sink. If you need to use both a source and a sink, use an interceptor to modify the event header and set a different topic.

Kafka Source

使用Kafka source 让数据从Kafka topics 流入 Hadoop. The Kafka source 可以与任何Flume sink合并, 这样很容易把数据从 Kafka 写到 HDFS, HBase, 以及Solr.

下面的 Flume 配置示例,是使用 Kafka source 发送数据到 HDFS sink:

tier1.sources  = source1
 tier1.channels = channel1
 tier1.sinks = sink1
 
 tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
 tier1.sources.source1.zookeeperConnect = zk01.example.com:2181
 tier1.sources.source1.topic = weblogs
 tier1.sources.source1.groupId = flume
 tier1.sources.source1.channels = channel1
 tier1.sources.source1.interceptors = i1
 tier1.sources.source1.interceptors.i1.type = timestamp
 tier1.sources.source1.kafka.consumer.timeout.ms = 100
 
 tier1.channels.channel1.type = memory
 tier1.channels.channel1.capacity = 10000
 tier1.channels.channel1.transactionCapacity = 1000
 
 tier1.sinks.sink1.type = hdfs
 tier1.sinks.sink1.hdfs.path = /tmp/kafka/%{topic}/%y-%m-%d
 tier1.sinks.sink1.hdfs.rollInterval = 5
 tier1.sinks.sink1.hdfs.rollSize = 0
 tier1.sinks.sink1.hdfs.rollCount = 0
 tier1.sinks.sink1.hdfs.fileType = DataStream
 tier1.sinks.sink1.channel = channel1

为了更高的吞吐量, 可以配置多个Kafka sources读取一个 topic.如果所有sources配置一个相同的groupID, 并且topic 有多个分区, 设置每一个source 从不同的分区读取数据,就可以改善效率.

下面的列表描述Kafka source 支持的参数; 必须的参数使用粗体列出 .

Table 1. Kafka Source Properties

Property Name

Default Value

Description

type

 

必须设置为org.apache.flume.source.kafka.KafkaSource.

zookeeperConnect

 

The URI of the ZooKeeper server or quorum used by Kafka. This can be a single node (for example, zk01.example.com:2181) or a comma-separated list of nodes in a ZooKeeper quorum (for example, zk01.example.com:2181,zk02.example.com:2181, zk03.example.com:2181).

topic

 

source 读取消息的Kafka topic。 Flume 每个source只支持一个 topic.。

groupID

flume

The unique identifier of the Kafka consumer group. Set the same groupID in all sources to indicate that they belong to the same consumer group.

batchSize

1000

向channel写入消息的最多条数

batchDurationMillis

1000

向channel书写的最大时间 (毫秒)  。

其他Kafka consumer  支持的属性

 

通过Kafka source配置Kafka consumer。可以使用任何consumer 支持的属性。 Prepend the consumer property name with the prefix kafka. (for example, kafka.fetch.min.bytes). See the Kafka documentation for the full list of Kafka consumer properties.

调优

Kafka source 重写了两个Kafka consumer 的属性:

  1. auto.commit.enable 设置为 false by the source, and every batch is committed. 为了改善性能, 设置为 true 改为使用 kafka.auto.commit.enable。 这个可能会丢失数据 if the source goes down before committing.
  2. consumer.timeout.ms设置为 10, so when Flume polls Kafka for new data, it waits no more than 10 ms for the data to be available. Setting this to a higher value can reduce CPU utilization due to less frequent polling, but introduces latency in writing batches to the channel.

Kafka Sink

使用Kafka sink 从一个 Flume source发送数据到 Kafka . You can use the Kafka sink in addition to Flume sinks such as HBase or HDFS.

The following Flume configuration example uses a Kafka sink with an exec source:

tier1.sources  = source1
 tier1.channels = channel1
 tier1.sinks = sink1
 
 tier1.sources.source1.type = exec
 tier1.sources.source1.command = /usr/bin/vmstat 1
 tier1.sources.source1.channels = channel1
 
 tier1.channels.channel1.type = memory
 tier1.channels.channel1.capacity = 10000
 tier1.channels.channel1.transactionCapacity = 1000
 
 tier1.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink
 tier1.sinks.sink1.topic = sink1
 tier1.sinks.sink1.brokerList = kafka01.example.com:9092,kafka02.example.com:9092
 tier1.sinks.sink1.channel = channel1
 tier1.sinks.sink1.batchSize = 20

The following table describes parameters the Kafka sink supports; required properties are listed in bold.

Table 2. Kafka Sink Properties

Property Name

Default Value

Description

type

 

必须设置为: org.apache.flume.sink.kafka.KafkaSink.

brokerList

 

The brokers the Kafka sink uses to discover topic partitions, formatted as a comma-separated list of hostname:port entries. You do not need to specify the entire list of brokers, but Cloudera recommends that you specify at least two for high availability.

topic

default-flume-topic

The Kafka topic to which messages are published by default. If the event header contains a topic field, the event is published to the designated topic, overriding the configured topic.

batchSize

100

The number of messages to process in a single batch. Specifying a larger batchSize can improve throughput and increase latency.

requiredAcks

1

The number of replicas that must acknowledge a message before it is written successfully. Possible values are 0 (do not wait for an acknowledgement), 1 (wait for the leader to acknowledge only), and -1 (wait for all replicas to acknowledge). To avoid potential loss of data in case of a leader failure, set this to -1.

其他Kafka producer所支持的属性

 

Used to configure the Kafka producer used by the Kafka sink. You can use any producer properties supported by Kafka. Prepend the producer property name with the prefix kafka. (for example, kafka.compression.codec). See the Kafka documentation for the full list of Kafka producer properties.

Kafka sink 使用 topic 以及 key properties from the FlumeEvent headers to determine where to send events in Kafka. If the header contains the topic property, that event is sent to the designated topic, overriding the configured topic. If the header contains the key property, that key is used to partition events within the topic. Events with the same key are sent to the same partition. If the key parameter is not specified, events are distributed randomly to partitions. Use these properties to control the topics and partitions to which events are sent through the Flume source or interceptor.

Kafka Channel

CDH 5.3 以及更高的版本包含一个Kafka channel to Flume in addition to the existing memory and file channels. 可以使用Kafka channel:

  • To write to Hadoop directly from Kafka without using a source.不使用source,从Kafka直接向hadoop中写数据。
  • To write to Kafka directly from Flume sources without additional buffering.不使用额外的缓冲区直接从Flume source向Kafka写数据。
  • As a reliable and highly available channel for any source/sink combination.可以与任何source/sink结合。

如下的 Flume 配置使用了一个Kafka channel 以及一个exec source 和 hdfs sink:

tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1

tier1.sources.source1.type = exec
tier1.sources.source1.command = /usr/bin/vmstat 1
tier1.sources.source1.channels = channel1

tier1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000
tier1.channels.channel1.brokerList = kafka02.example.com:9092,kafka03.example.com:9092
tier1.channels.channel1.topic = channel2
tier1.channels.channel1.zookeeperConnect = zk01.example.com:2181
tier1.channels.channel1.parseAsFlumeEvent = true

tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.path = /tmp/kafka/channel
tier1.sinks.sink1.hdfs.rollInterval = 5
tier1.sinks.sink1.hdfs.rollSize = 0
tier1.sinks.sink1.hdfs.rollCount = 0
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.channel = channel1

下面的列表描述了Kafka channel 所支持的参数; 粗体为必要参数.

Table 3. Kafka Channel Properties

Property Name

Default Value

Description

type

 

必须设置为:org.apache.flume.channel.kafka.KafkaChannel.

brokerList

 

The brokers the Kafka channel uses to discover topic partitions, formatted as a comma-separated list of hostname:port entries. You do not need to specify the entire list of brokers, but Cloudera recommends that you specify at least two for high availability.

zookeeperConnect

 

The URI of the ZooKeeper server or quorum used by Kafka. This can be a single node (for example, zk01.example.com:2181) or a comma-separated list of nodes in a ZooKeeper quorum (for example, zk01.example.com:2181,zk02.example.com:2181, zk03.example.com:2181).

topic

flume-channel

The Kafka topic the channel will use.

groupID

flume

The unique identifier of the Kafka consumer group the channel uses to register with Kafka.

parseAsFlumeEvent

true

Set to true if a Flume source is writing to the channel and expects AvroDataums with the FlumeEvent schema (org.apache.flume.source.avro.AvroFlumeEvent) in the channel. Set to false if other producers are writing to the topic that the channel is using.

readSmallestOffset

false

If true, reads all data in the topic. If false, reads only data written after the channel has started. Only used when parseAsFlumeEvent is false.

kafka.consumer.timeout.ms

100

当向sink写数据时轮询的间隔时间.

其他Kafka producer所支持的属性

 

Used to configure the Kafka producer. You can use any producer properties supported by Kafka. Prepend the producer property name with the prefix kafka. (for example, kafka.compression.codec). See the Kafka documentation for the full list of Kafka producer properties.

原文地址:http://www.cloudera.com/content/cloudera/en/documentation/cloudera-kafka/latest/topics/kafka\_flume.html

点赞
收藏
评论区
推荐文章
blmius blmius
3年前
MySQL:[Err] 1292 - Incorrect datetime value: ‘0000-00-00 00:00:00‘ for column ‘CREATE_TIME‘ at row 1
文章目录问题用navicat导入数据时,报错:原因这是因为当前的MySQL不支持datetime为0的情况。解决修改sql\mode:sql\mode:SQLMode定义了MySQL应支持的SQL语法、数据校验等,这样可以更容易地在不同的环境中使用MySQL。全局s
Easter79 Easter79
3年前
swap空间的增减方法
(1)增大swap空间去激活swap交换区:swapoff v /dev/vg00/lvswap扩展交换lv:lvextend L 10G /dev/vg00/lvswap重新生成swap交换区:mkswap /dev/vg00/lvswap激活新生成的交换区:swapon v /dev/vg00/lvswap
皕杰报表之UUID
​在我们用皕杰报表工具设计填报报表时,如何在新增行里自动增加id呢?能新增整数排序id吗?目前可以在新增行里自动增加id,但只能用uuid函数增加UUID编码,不能新增整数排序id。uuid函数说明:获取一个UUID,可以在填报表中用来创建数据ID语法:uuid()或uuid(sep)参数说明:sep布尔值,生成的uuid中是否包含分隔符'',缺省为
Jacquelyn38 Jacquelyn38
3年前
2020年前端实用代码段,为你的工作保驾护航
有空的时候,自己总结了几个代码段,在开发中也经常使用,谢谢。1、使用解构获取json数据let jsonData  id: 1,status: "OK",data: 'a', 'b';let  id, status, data: number   jsonData;console.log(id, status, number )
Wesley13 Wesley13
3年前
Java获得今日零时零分零秒的时间(Date型)
publicDatezeroTime()throwsParseException{    DatetimenewDate();    SimpleDateFormatsimpnewSimpleDateFormat("yyyyMMdd00:00:00");    SimpleDateFormatsimp2newS
Wesley13 Wesley13
3年前
mysql设置时区
mysql设置时区mysql\_query("SETtime\_zone'8:00'")ordie('时区设置失败,请联系管理员!');中国在东8区所以加8方法二:selectcount(user\_id)asdevice,CONVERT\_TZ(FROM\_UNIXTIME(reg\_time),'08:00','0
Wesley13 Wesley13
3年前
00:Java简单了解
浅谈Java之概述Java是SUN(StanfordUniversityNetwork),斯坦福大学网络公司)1995年推出的一门高级编程语言。Java是一种面向Internet的编程语言。随着Java技术在web方面的不断成熟,已经成为Web应用程序的首选开发语言。Java是简单易学,完全面向对象,安全可靠,与平台无关的编程语言。
Stella981 Stella981
3年前
Django中Admin中的一些参数配置
设置在列表中显示的字段,id为django模型默认的主键list_display('id','name','sex','profession','email','qq','phone','status','create_time')设置在列表可编辑字段list_editable
Wesley13 Wesley13
3年前
MySQL部分从库上面因为大量的临时表tmp_table造成慢查询
背景描述Time:20190124T00:08:14.70572408:00User@Host:@Id:Schema:sentrymetaLast_errno:0Killed:0Query_time:0.315758Lock_
Python进阶者 Python进阶者
8个月前
Excel中这日期老是出来00:00:00,怎么用Pandas把这个去除
大家好,我是皮皮。一、前言前几天在Python白银交流群【上海新年人】问了一个Pandas数据筛选的问题。问题如下:这日期老是出来00:00:00,怎么把这个去除。二、实现过程后来【论草莓如何成为冻干莓】给了一个思路和代码如下:pd.toexcel之前把这