# Crawler filter plugin for Embulk

An Embulk filter plugin that crawls web pages, starting from the base URL found in a column of each input record.

## Overview

* **Plugin type**: filter

## Configuration

- **target_key**: name of the column that holds the base URL to crawl (string, required)
- **max_depth_of_crawling**: maximum crawl depth (integer, default: unlimited)
- **number_of_crawlers**: number of crawlers to run in parallel (integer, default: 1)
- **max_pages_to_fetch**: maximum number of pages to fetch (integer, default: unlimited)
- **crawl_storage_folder**: folder used for crawl storage (string, required)
- **politeness_delay**: politeness delay between requests (integer, default: null)
- **user_agent_string**: user agent string sent with requests (string, default: null)
- **output_prefix**: output prefix (string, default: "")
- **connection_timeout**: connection timeout in milliseconds (integer, default: 30000)
- **socket_timeout**: socket timeout in milliseconds (integer, default: 20000)

## Example

```yaml
in:
  type: mysql
  host: dbs04
  user: application
  password: XXXXXXXX
  database: iap
  query: |
    select url from companies limit 100
filters:
  - type: crawler
    target_key: url
    number_of_crawlers: 10
    max_depth_of_crawling: 4
    politeness_delay: 100
    crawl_storage_folder: "/tmp/crawl/%s"
out:
  type: stdout
```
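
As a rough sketch, the `filters` section below sets every option from the Configuration list in one place. All values are illustrative assumptions, not recommendations; the `user_agent_string` and `output_prefix` values in particular are hypothetical, and `crawl_storage_folder` is copied from the example above.

```yaml
filters:
  - type: crawler
    # Required options
    target_key: url                        # column holding the base URL
    crawl_storage_folder: "/tmp/crawl/%s"  # same value as the example above
    # Optional limits (both default to unlimited when omitted)
    max_depth_of_crawling: 2
    max_pages_to_fetch: 500
    # Optional parallelism and pacing
    number_of_crawlers: 4
    politeness_delay: 200
    # Optional request and output settings (hypothetical values)
    user_agent_string: "example-crawler"
    output_prefix: "crawl_"
    # Timeouts in milliseconds (the documented defaults)
    connection_timeout: 30000
    socket_timeout: 20000
```

Bounding `max_depth_of_crawling` and `max_pages_to_fetch` is worth considering for a first run, since both are unlimited by default.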


## Build

```
$ ./gradlew gem  # add -t to watch for file changes and rebuild continuously
```
