# Yasuri

## What is Yasuri

`Yasuri` (鑢) is a library for declarative web scraping and a command line tool for scraping with it.

You describe the expected result in a simple declarative notation, and Yasuri performs the scraping.

Yasuri makes it easy to write common scraping operations.
For example, the following processes can be implemented easily:

+ Scrape multiple texts in a page and name them into a Hash
+ Open multiple links in a page and get the result of scraping each page as a Hash
+ Scrape each table that appears repeatedly in the page and get the result as an array
+ Scrape only the first three of the pages provided by pagination

## Quick Start

#### Install
```ruby
# Gemfile, for Ruby 2.3.2
gem 'yasuri', '~> 2.0', '>= 2.0.13'
```

or

```sh
# for Ruby 3.0.0 or later
$ gem install yasuri
```

#### Use as library
```ruby
require 'yasuri'
require 'mechanize'

# Construct the node tree with the DSL
root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
  text_title '//*[@id="contents"]/h2'
  text_content '//*[@id="contents"]/p[1]'
end

result = root.scrape("http://some.scraping.page.tac42.net/")
# => [
#      {"title" => "PageTitle 01", "content" => "Page Contents 01" },
#      {"title" => "PageTitle 02", "content" => "Page Contents 02" },
#      ...
#      {"title" => "PageTitle N",  "content" => "Page Contents N" }
#    ]
```

This example opens each page linked from the XPath of the LinksNode (`links_root`) and scrapes the two texts specified by the XPaths of the TextNodes (`text_title`, `text_content`).

(In other words, it opens each link matched by `//*[@id="menu"]/ul/li/a` and scrapes `//*[@id="contents"]/h2` and `//*[@id="contents"]/p[1]` from it.)

#### Use as CLI tool
The same thing as above can be executed as a CLI command.
```sh
$ yasuri scrape "http://some.scraping.page.tac42.net/" -j '
{
  "links_root": {
    "path": "//*[@id=\"menu\"]/ul/li/a",
    "text_title": "//*[@id=\"contents\"]/h2",
    "text_content": "//*[@id=\"contents\"]/p[1]"
  }
}'

[
  {"title":"PageTitle 01","content":"Page Contents 01"},
  {"title":"PageTitle 02","content":"Page Contents 02"},
  ...,
  {"title":"PageTitle N","content":"Page Contents N"}
]
```

The result is returned as a JSON-formatted string.

----------------------------
## Parse Tree

A parse tree is tree-structured data that declaratively defines the elements to be scraped and the structure of the output.

A parse tree consists of nested `Node`s, each of which has `Type`, `Name`, `Path`, `Children`, and `Options` attributes, and scrapes according to its `Type`. (Note that only `MapNode` does not have `Path`.)

The parse tree is defined in the following format:

```ruby
# A simple tree consisting of one node
Yasuri.<Type>_<Name> <Path> [,<Options>]

# Nested tree
Yasuri.<Type>_<Name> <Path> [,<Options>] do
  <Type>_<Name> <Path> [,<Options>] do
    <Type>_<Name> <Path> [,<Options>]
    ...
  end
end
```

**Example**

```ruby
# A simple tree consisting of one node
Yasuri.text_title '/html/head/title', truncate:/^[^,]+/

# Nested tree
Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
  struct_table './tr' do
    text_title    './td[1]'
    text_pub_date './td[2]'
  end
end
```

Parse trees can be defined in Ruby DSL, JSON, or YAML.
The following examples define the same parse tree in each notation.

**Case of defining as Ruby DSL**
```ruby
Yasuri.links_title '/html/body/a' do
  text_name '/html/body/p'
end
```

**Case of defining as JSON**
```json
{
  "links_title": {
    "path": "/html/body/a",
    "text_name": "/html/body/p"
  }
}
```

**Case of defining as YAML**
```yaml
links_title:
  path: "/html/body/a"
  text_name: "/html/body/p"
```

**Special case of parse tree**

If there is only one element directly under the root, that element is returned directly instead of a Hash (Object).
```json
{
  "text_title": "/html/head/title",
  "text_body": "/html/body"
}
# => {"title": "Welcome to yasuri!", "body": "Yasuri is ..."}

{
  "text_title": "/html/head/title"
}
# => Welcome to yasuri!
```

In JSON or YAML format, an attribute can specify `path` directly as its value if it does not have any child Node. The following two JSON documents produce the same parse tree.

```json
{
  "text_name": "/html/body/p"
}

{
  "text_name": {
    "path": "/html/body/p"
  }
}
```

### Run ParseTree
Call the `Node#scrape(uri, opt={})` method on the root node of the parse tree.

**Example**
```ruby
root = Yasuri.links_root '//*[@id="menu"]/ul/li/a' do
  text_title '//*[@id="contents"]/h2'
  text_content '//*[@id="contents"]/p[1]'
end

result = root.scrape("http://some.scraping.page.tac42.net/", interval_ms: 1000)
```

+ `uri` is the URI of the page to be scraped.
+ `opt` is a Hash of options. The available options are described under `opt` below.

Yasuri uses `Mechanize` internally as the agent that does the scraping.
If you want to supply your own agent instance, call `Node#scrape_with_agent(uri, agent, opt={})`.

```ruby
require 'logger'
require 'mechanize'

agent = Mechanize.new
agent.log = Logger.new $stderr
agent.request_headers = {
  # ...
}

result = root.scrape_with_agent(
  "http://some.scraping.page.tac42.net/",
  agent,
  interval_ms: 1000)
```

### `opt`
#### `interval_ms`
Interval in milliseconds between requests when multiple pages are requested.

If omitted, requests are made back to back with no interval. If you expect to request many pages, it is strongly recommended to specify an interval to avoid putting a high load on the target host.

#### `retry_count`
Number of retries when fetching a page fails. If omitted, it will retry 5 times.

#### `symbolize_names`
If true, the keys of the result set are returned as symbols.

--------------------------
## Node

A Node is a branch or leaf of the parse tree. It has `Type`, `Name`, `Path`, `Children`, and `Options`, and scrapes according to its `Type`. (Note that only `MapNode` does not have `Path`.)

#### Type
Type determines the behavior of the Node.
- *Text*
- *Struct*
- *Links*
- *Paginate*
- *Map*

See the description of each node for details.

#### Name
Name is used as the key in the returned hash.

#### Path
Path determines the target node by XPath or CSS selector. It is passed to Mechanize's `search`.

#### Children
Child nodes. A TextNode always has an empty set of children, because a TextNode is a leaf.

#### Options
Parse options. They differ for each type. You can get the options and their values with the `opt` method.

```ruby
# TextNode example
node = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
node.opt #=> {:truncate => /^[^,]+/, :proc => nil}
```

## Text Node
A TextNode returns the scraped text. This node has to be a leaf.

### Example

```html
<!-- http://yasuri.example.tac42.net -->
<html>
  <head></head>
  <body>
    <p>Hello,World</p>
    <p>Hello,Yasuri</p>
  </body>
</html>
```

```ruby
p1  = Yasuri.text_title '/html/body/p[1]'
p1t = Yasuri.text_title '/html/body/p[1]', truncate:/^[^,]+/
p1u = Yasuri.text_title '/html/body/p[1]', proc: :upcase

p1.scrape("http://yasuri.example.tac42.net")  #=> "Hello,World"
p1t.scrape("http://yasuri.example.tac42.net") #=> "Hello"
p1u.scrape("http://yasuri.example.tac42.net") #=> "HELLO,WORLD"
```

Note that if you want to scrape multiple elements in the same page at once, use `MapNode`. See the `MapNode` example for details.

### Options
##### `truncate`
Matches the text against a regexp and keeps only the matched part. When the regexp contains a group, only the first matched group is returned.

```ruby
# p[2] is "Hello,Yasuri" in the example page above
node = Yasuri.text_example '/html/body/p[2]', truncate:/H(.+)i/
node.scrape(uri)
#=> "ello,Yasur"
```

##### `proc`
Applies a method to the text. The method is given as a Symbol.
If the `truncate` option is also given, the method is applied after truncation.

```ruby
# p[2] is "Hello,Yasuri" in the example page above
node = Yasuri.text_example '/html/body/p[2]', proc: :upcase, truncate:/H(.+)i/
node.scrape(uri)
#=> "ELLO,YASUR"
```

## Struct Node
Struct Node returns structured text.

First, Struct Node narrows down the sub-tags by `Path`.
The child nodes parse the narrowed tags, and Struct Node returns a hash containing the parsed result.
If the `Path` of a Struct Node matches multiple sub-tags, the child nodes parse each of them and Struct Node returns an array.

### Example

```html
<!-- http://yasuri.example.tac42.net -->
<html>
  <head>
    <title>Books</title>
  </head>
  <body>
    <h1>1996</h1>
    <table>
      <thead>
        <tr><th>Title</th>                     <th>Publication Date</th></tr>
      </thead>
      <tr><td>The Perfect Insider</td>         <td>1996/4/5</td></tr>
      <tr><td>Doctors in Isolated Room</td>    <td>1996/7/5</td></tr>
      <tr><td>Mathematical Goodbye</td>        <td>1996/9/5</td></tr>
    </table>
    <h1>1997</h1>
    <table>
      <thead>
        <tr><th>Title</th>                     <th>Publication Date</th></tr>
      </thead>
      <tr><td>Jack the Poetical Private</td>   <td>1997/1/5</td></tr>
      <tr><td>Who Inside</td>                  <td>1997/4/5</td></tr>
      <tr><td>Illusion Acts Like Magic</td>    <td>1997/10/5</td></tr>
    </table>
    <h1>1998</h1>
    <table>
      <thead>
        <tr><th>Title</th>                     <th>Publication Date</th></tr>
      </thead>
      <tr><td>Replaceable Summer</td>          <td>1998/1/7</td></tr>
      <tr><td>Switch Back</td>                 <td>1998/4/5</td></tr>
      <tr><td>Numerical Models</td>            <td>1998/7/5</td></tr>
      <tr><td>The Perfect Outsider</td>        <td>1998/10/5</td></tr>
    </table>
  </body>
</html>
```

```ruby
node = Yasuri.struct_table '/html/body/table[1]/tr' do
  text_title    './td[1]'
  text_pub_date './td[2]'
end

node.scrape("http://yasuri.example.tac42.net")
#=> [ { "title"    => "The Perfect Insider",
#       "pub_date" => "1996/4/5" },
#     { "title"    => "Doctors in Isolated Room",
#       "pub_date" => "1996/7/5" },
#     { "title"    => "Mathematical Goodbye",
#       "pub_date" => "1996/9/5" }, ]
```

The Struct Node narrows down to the `<tr>` tags in the first `<table>` using `'/html/body/table[1]/tr'`.
Each matched `<tr>` tag is then parsed by the Struct Node's two child nodes.

In this case, the first `<table>` contains three matching `<tr>` tags (not four: the `<tr>` inside `<thead>` does not match `Path`), so the Struct Node returns three hashes.

Each hash contains the text parsed by the Text Nodes.

A Struct Node can contain more than just Text Nodes.
### Example

```ruby
node = Yasuri.struct_tables '/html/body/table' do
  struct_table './tr' do
    text_title    './td[1]'
    text_pub_date './td[2]'
  end
end

node.scrape("http://yasuri.example.tac42.net")

#=> [ { "table" => [ { "title"    => "The Perfect Insider",
#                      "pub_date" => "1996/4/5" },
#                    { "title"    => "Doctors in Isolated Room",
#                      "pub_date" => "1996/7/5" },
#                    { "title"    => "Mathematical Goodbye",
#                      "pub_date" => "1996/9/5" }]},
#     { "table" => [ { "title"    => "Jack the Poetical Private",
#                      "pub_date" => "1997/1/5" },
#                    { "title"    => "Who Inside",
#                      "pub_date" => "1997/4/5" },
#                    { "title"    => "Illusion Acts Like Magic",
#                      "pub_date" => "1997/10/5" }]},
#     { "table" => [ { "title"    => "Replaceable Summer",
#                      "pub_date" => "1998/1/7" },
#                    { "title"    => "Switch Back",
#                      "pub_date" => "1998/4/5" },
#                    { "title"    => "Numerical Models",
#                      "pub_date" => "1998/7/5" },
#                    { "title"    => "The Perfect Outsider",
#                      "pub_date" => "1998/10/5" }]}
#   ]
```

### Options
None.

## Links Node
Links Node returns the text parsed from each linked page.
### Example
```html
<!-- http://yasuri.example.tac42.net -->
<html>
  <head><title>Yasuri Test</title></head>
  <body>
    <p>Hello,Yasuri</p>
    <a href="./child01.html">child01</a>
    <a href="./child02.html">child02</a>
    <a href="./child03.html">child03</a>
  </body>
</html>
```

```html
<!-- http://yasuri.example.tac42.net/child01.html -->
<html>
  <head><title>Child 01 Test</title></head>
  <body>
    <p>Child 01 page.</p>
    <ul>
      <li><a href="./child01_sub.html">Child01_Sub</a></li>
      <li><a href="./child02_sub.html">Child02_Sub</a></li>
    </ul>
  </body>
</html>
```

```html
<!-- http://yasuri.example.tac42.net/child02.html -->
<html>
  <head><title>Child 02 Test</title></head>
  <body>
    <p>Child 02 page.</p>
  </body>
</html>
```

```html
<!-- http://yasuri.example.tac42.net/child03.html -->
<html>
  <head><title>Child 03 Test</title></head>
  <body>
    <p>Child 03 page.</p>
    <ul>
      <li><a href="./child03_sub.html">Child03_Sub</a></li>
    </ul>
  </body>
</html>
```

```ruby
node = Yasuri.links_title '/html/body/a' do
  text_content '/html/body/p'
end

node.scrape("http://yasuri.example.tac42.net")
#=> [ {"content" => "Child 01 page."},
#     {"content" => "Child 02 page."},
#     {"content" => "Child 03 page."}]
```

First, Links Node finds all links in the page that match the path. In this case, it finds the `/html/body/a` tags in `http://yasuri.example.tac42.net`.

Next, it opens the page in each `href` attribute (`./child01.html`, `./child02.html` and `./child03.html`).

Finally, Links Node applies its child nodes to each opened page and returns the results as an array, one entry per page.

### Options
None.

## Paginate Node
Paginate Node parses and returns each of the pages provided by pagination.

### Example
The target page `page01.html` looks like this. `page02.html` to `page04.html` are similar.
```html
<!-- http://yasuri.example.tac42.net/page01.html -->
<html>
  <head><title>Page01</title></head>
  <body>
    <p>Pagination01</p>

    <nav class='pagination'>
      <span class='prev'> PreviousPage </span>
      <span class='page'> 1 </span>
      <span class='page'> <a href="./page02.html">2</a> </span>
      <span class='page'> <a href="./page03.html">3</a> </span>
      <span class='page'> <a href="./page04.html">4</a> </span>
      <span class='next'> <a href="./page02.html" class="next" rel="next"> NextPage </a> </span>
    </nav>
  </body>
</html>
```

```ruby
node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:3 do
  text_content '/html/body/p'
end

node.scrape("http://yasuri.example.tac42.net/page01.html")
#=> [ {"content" => "Pagination01"},
#     {"content" => "Pagination02"},
#     {"content" => "Pagination03"}]
```

Paginate Node requires a link to the next page.
In this case, it is the `NextPage` link, `/html/body/nav/span/a[@class='next']`.

### Options
##### `limit`
Upper limit on the number of pages opened during pagination.

```ruby
node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:2 do
  text_content '/html/body/p'
end

node.scrape(uri)
#=> [ {"content" => "Pagination01"}, {"content" => "Pagination02"}]
```

Paginate Node opens at most 2 pages, as given by `limit`. Here the pagination has 4 pages, but the resulting Array has 2 texts because `limit:2` was given.

##### `flatten`
The `flatten` option expands the results of each page into a single flat array.
```ruby
# Without flatten: one Hash per page
node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:3 do
  text_title   '/html/head/title'
  text_content '/html/body/p'
end

node.scrape("http://yasuri.example.tac42.net/page01.html")
#=> [ {"title"   => "Page01",
#      "content" => "Pagination01"},
#     {"title"   => "Page02",
#      "content" => "Pagination02"},
#     {"title"   => "Page03",
#      "content" => "Pagination03"}]

# With flatten: the values of every page in one flat Array
node = Yasuri.pages_root "/html/body/nav/span/a[@class='next']" , limit:3, flatten:true do
  text_title   '/html/head/title'
  text_content '/html/body/p'
end

node.scrape("http://yasuri.example.tac42.net/page01.html")
#=> [ "Page01",
#     "Pagination01",
#     "Page02",
#     "Pagination02",
#     "Page03",
#     "Pagination03"]
```

## Map Node
*MapNode* is a node that groups the results of scraping under named keys. This node is always a branch node in the parse tree.

### Example

```html
<!-- http://yasuri.example.tac42.net -->
<html>
  <head><title>Yasuri Example</title></head>
  <body>
    <p>Hello,World</p>
    <p>Hello,Yasuri</p>
  </body>
</html>
```

```ruby
tree = Yasuri.map_root do
  text_title  '/html/head/title'
  text_body_p '/html/body/p[1]'
end

tree.scrape("http://yasuri.example.tac42.net") #=> { "title" => "Yasuri Example", "body_p" => "Hello,World" }


# The following paths assume a page that contains <a> tags,
# such as the top page in the Links Node example.
tree = Yasuri.map_root do
  map_group1 { text_child01 '/html/body/a[1]' }
  map_group2 do
    text_child01 '/html/body/a[1]'
    text_child03 '/html/body/a[3]'
  end
end

tree.scrape("http://yasuri.example.tac42.net") #=> {
#   "group1" => {
#     "child01" => "child01"
#   },
#   "group2" => {
#     "child01" => "child01",
#     "child03" => "child03"
#   }
# }
```

### Options
None.

-------------------------
## Usage

### Use as library
When used as a library, the tree can be defined in DSL, JSON, or YAML format.

```ruby
require 'yasuri'

# 1. Create a parse tree.
# Define by Ruby's DSL
tree = Yasuri.links_title '/html/body/a' do
  text_name '/html/body/p'
end

# Define by JSON
src = <<-EOJSON
{
  "links_title": {
    "path": "/html/body/a",
    "text_name": "/html/body/p"
  }
}
EOJSON
tree = Yasuri.json2tree(src)

# Define by YAML
src = <<-EOYAML
links_title:
  path: "/html/body/a"
  text_name: "/html/body/p"
EOYAML
tree = Yasuri.yaml2tree(src)

# 2. Give the URL to start parsing
tree.scrape(uri)
```

### Use as CLI tool

**Help**
```sh
$ yasuri help scrape
Usage:
  yasuri scrape <URI> [[--file <TREE_FILE>] or [--json <JSON>]]

Options:
  f, [--file=FILE]          # path to file that written yasuri tree as json or yaml
  j, [--json=JSON]          # yasuri tree format json string
  i, [--interval=N]         # interval each request [ms]

Getting from <URI> and scrape it.
  with <JSON> or json/yml from <TREE_FILE>.
  They should be Yasuri's format json or yaml string.
```

In the CLI tool, you can specify the parse tree in either of the following ways.
+ `--file`, `-f` : reads the parse tree from a file written in JSON or YAML format.
+ `--json`, `-j` : specifies the parse tree directly as a JSON string.

**Example of specifying a parse tree as a file**
```sh
% cat sample.yml
text_title: "/html/head/title"
text_desc: "//*[@id=\"intro\"]/p"

% yasuri scrape "https://www.ruby-lang.org/en/" --file sample.yml
{"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}

% cat sample.json
{
  "text_title": "/html/head/title",
  "text_desc": "//*[@id=\"intro\"]/p"
}

% yasuri scrape "https://www.ruby-lang.org/en/" --file sample.json
{"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
```

Whether the file is written in JSON or YAML is determined automatically.
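The README does not say how the format is detected. As a minimal sketch, assuming a "try JSON first, fall back to YAML" strategy, the hypothetical helper `parse_tree_source` below (not part of Yasuri) shows that the two sample files above describe the same tree. It uses only Ruby's standard `json` and `yaml` libraries.

```ruby
require 'json'
require 'yaml'

# Hypothetical helper, NOT Yasuri's implementation.
# Tries JSON first; if the text is not valid JSON, parses it as YAML.
# Returns [detected_format, parsed_data].
def parse_tree_source(text)
  [:json, JSON.parse(text)]
rescue JSON::ParserError
  [:yaml, YAML.safe_load(text)]
end

# Same content as sample.json above
json_src = '{ "text_title": "/html/head/title", "text_desc": "//*[@id=\"intro\"]/p" }'

# Same content as sample.yml above (non-interpolating heredoc keeps the backslashes)
yaml_src = <<~'EOYAML'
  text_title: "/html/head/title"
  text_desc: "//*[@id=\"intro\"]/p"
EOYAML

json_format, json_tree = parse_tree_source(json_src)
yaml_format, yaml_tree = parse_tree_source(yaml_src)

puts json_format            # the JSON source is detected as :json
puts yaml_format            # the YAML source is detected as :yaml
puts json_tree == yaml_tree # both describe the same tree
```

Trying JSON first keeps the sketch short, because plain YAML mappings like the one above are not valid JSON and fall through to the YAML parser.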
**Example of specifying a parse tree directly in json**
```sh
$ yasuri scrape "https://www.ruby-lang.org/en/" -j '
{
  "text_title": "/html/head/title",
  "text_desc": "//*[@id=\"intro\"]/p"
}'

{"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}
```

#### Other options
+ `--interval`, `-i` : The interval [milliseconds] for requesting multiple pages.

**Example: Request at 1 second intervals**
```sh
$ yasuri scrape "https://www.ruby-lang.org/en/" --file sample.yml --interval 1000
```
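Because the CLI prints its result as a JSON string, the output can be consumed from another program. The following is a minimal sketch, assuming the output shown in the examples above has been captured into a string (here it is pasted in as a literal; in practice it might come from a pipe or a file). The variable names are illustrative only.

```ruby
require 'json'

# Output copied from the CLI example above. In a single-quoted Ruby string
# the "\n" sequences stay as written, and JSON.parse turns them into newlines.
cli_output = '{"title":"Ruby Programming Language","desc":"\n A dynamic, open source programming language with a focus on\n simplicity and productivity. It has an elegant syntax that is\n natural to read and easy to write.\n "}'

# Parse the JSON string; symbolize_names here is an option of Ruby's JSON library.
result = JSON.parse(cli_output, symbolize_names: true)

# The scraped text keeps the page's line breaks and indentation.
# Collapse every run of whitespace into a single space.
desc = result[:desc].split.join(' ')

puts result[:title] # => Ruby Programming Language
puts desc
```

Normalizing whitespace after scraping keeps the parse tree itself simple; the tree only says where the text is, and the caller decides how to clean it.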