# fluent-plugin-webhdfs

[Fluentd](http://fluentd.org/) output plugin to write data into Hadoop HDFS over WebHDFS/HttpFs.

WebHDFSOutput slices data by time (in a specified unit) and stores it as plain-text files on HDFS. You can specify to:

* format the whole record as serialized JSON, a single attribute, or separated multiple attributes
* or as LTSV, labeled-TSV (see http://ltsv.org/ )
* include time as a line header, or not
* include tag as a line header, or not
* change the field separator (default: TAB)
* add a newline as termination, or not

If you specify the output file path as 'path /path/to/dir/access.%Y%m%d.log', you get '/path/to/dir/access.20120316.log' on HDFS.

## Configuration

### WebHDFSOutput

To store data by time, tag, and JSON (same as '@type file') over WebHDFS:

```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.log
</match>
```

If you want the JSON object only (without time and/or tag at the head of each line), specify `output_include_time` or `output_include_tag` (default: true):

```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.log
  output_include_time false
  output_include_tag false
</match>
```

To specify the namenode, `namenode` is also available:

```
<match access.**>
  @type webhdfs
  namenode master.your.cluster.local:50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.log
</match>
```

To store data as LTSV without time and tag over WebHDFS:

```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.log
  output_data_type ltsv
</match>
```

With a username for pseudo authentication:

```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.log
  username hdfsuser
</match>
```

To store data over HttpFs (instead of WebHDFS):

```
<match access.**>
  @type webhdfs
  host httpfs.node.your.cluster.local
  port 14000
  path /path/on/hdfs/access.log.%Y%m%d_%H.log
  httpfs true
</match>
```

To store data as TSV (TAB-separated values) of specified keys, without time, with tag (with the prefix 'access' removed):

```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.log

  field_separator TAB   # or 'SPACE', 'COMMA' or 'SOH' (Start Of Heading: \001)
  output_include_time false
  output_include_tag true
  remove_prefix access

  output_data_type attr:path,status,referer,agent,bytes
</match>
```

If a message doesn't have a specified attribute, fluent-plugin-webhdfs outputs 'NULL' instead of the value.

With SSL:

```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.log
  ssl true
  ssl_ca_file /path/to/ca_file.pem   # if needed
  ssl_verify_mode peer               # if needed (peer or none)
</match>
```

Here `ssl_verify_mode peer` means to verify the server's certificate; you can turn it off by setting `ssl_verify_mode none`. The default is `peer`. See the [net/http](http://www.ruby-doc.org/stdlib-2.1.3/libdoc/net/http/rdoc/Net/HTTP.html) and [openssl](http://www.ruby-doc.org/stdlib-2.1.3/libdoc/openssl/rdoc/OpenSSL.html) documentation for further details.

With Kerberos authentication:

```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.log
  kerberos true
</match>
```

If you want to compress data before storing it:

```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/access.log.%Y%m%d_%H
  compress gzip   # or 'bzip2', 'snappy', 'lzo_command'
</match>
```

Note that if you set `compress gzip`, the suffix `.gz` will be added to the path (or `.bz2`, `.sz`, `.lzo` respectively). Note that you have to install the snappy gem if you want to set `compress snappy`.
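For reference, with the defaults (`output_include_time true`, `output_include_tag true`, JSON data, TAB separator) each event is written as one `time TAB tag TAB record` line, following the '@type file' convention. A hypothetical record `{"status":200,"bytes":512}` tagged `access.test` would come out roughly like this (the exact timestamp format depends on your settings):

```
2012-03-16T10:00:00+09:00	access.test	{"status":200,"bytes":512}
```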
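Putting several of the options above together, a sketch of one complete configuration (the hostnames, paths, attribute names, and the `access.**` tag pattern are placeholders; adjust them for your cluster) that writes selected attributes as gzip-compressed TSV:

```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /log/access/%Y%m%d/access.%H   # '.gz' suffix is appended by 'compress gzip'

  output_data_type attr:path,status,bytes
  field_separator TAB
  output_include_time true
  output_include_tag false

  compress gzip
</match>
```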
### Namenode HA / Auto retry for WebHDFS known errors

`fluent-plugin-webhdfs` (v0.2.0 or later) accepts two namenodes for Namenode HA (active/standby). Use `standby_namenode` like this:

```
<match access.**>
  @type webhdfs
  namenode master1.your.cluster.local:50070
  standby_namenode master2.your.cluster.local:50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.log
</match>
```

You can also specify that known HDFS errors (such as `LeaseExpiredException`) should be retried automatically. With this configuration, fluentd doesn't write logs for these errors if the retry succeeds:

```
<match access.**>
  @type webhdfs
  namenode master1.your.cluster.local:50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.log
  retry_known_errors yes
  retry_times 1      # default 1
  retry_interval 1   # [sec] default 1
</match>
```

### Performance notifications

Writing data to a single HDFS file from two or more fluentd nodes produces many bad blocks on HDFS. If you want to run two or more fluentd nodes with fluent-plugin-webhdfs, you should configure 'path' for each node. You can use the '${hostname}' or '${uuid:random}' placeholders in the configuration for this purpose.

For hostname:

```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /log/access/%Y%m%d/${hostname}.log
</match>
```

Or with a random filename (to avoid duplicated file names only):

```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /log/access/%Y%m%d/${uuid:random}.log
</match>
```

With the configurations above, you can handle all files under '/log/access/20120820/*' as access logs for the specified time slice.

For highly loaded cluster nodes, you can specify timeouts for HTTP requests:

```
<match access.**>
  @type webhdfs
  namenode master.your.cluster.local:50070
  path /log/access/%Y%m%d/${hostname}.log
  open_timeout 180   # [sec] default: 30
  read_timeout 180   # [sec] default: 60
</match>
```

### For unstable Namenodes

With the default configuration, fluent-plugin-webhdfs checks the HDFS filesystem status and raises an error for inactive NameNodes. If you are using unstable NameNodes and want to ignore NameNode errors on startup of fluentd, enable the `ignore_start_check_error` option like below:

```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /log/access/%Y%m%d/${hostname}.log
  ignore_start_check_error true
</match>
```

### For unstable Datanodes

With unstable datanodes that go down frequently, appending over WebHDFS may produce broken files. In such cases, specify `append no` and the `${chunk_id}` placeholder:

```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070

  append no
  path /log/access/%Y%m%d/${hostname}.${chunk_id}.log
</match>
```

`out_webhdfs` creates a new file on HDFS per flush of fluentd, named with the chunk id, so you don't need to worry about files broken by append operations.

## TODO

* configuration example for Hadoop Namenode HA
  * here, or docs.fluentd.org?
* patches welcome!

## Copyright

* Copyright (c) 2012- TAGOMORI Satoshi (tagomoris)
* License
  * Apache License, Version 2.0