# XmlDataExtractor
This gem provides a DSL for extracting formatted data from any XML structure.
## Installation
Add this line to your application's Gemfile:
```ruby
gem 'xml_data_extractor'
```
And then execute:

    $ bundle install

Or install it yourself as:

    $ gem install xml_data_extractor
## Usage
The general idea is to declare a Ruby hash that represents the field structure, with instructions on how each piece of data should be retrieved from the XML document.
```ruby
structure = { schemas: { character: { path: "xml/FirstName" } } }
xml = "<xml><FirstName>Gandalf</FirstName></xml>"
result = XmlDataExtractor.new(structure).parse(xml)
# result -> { character: "Gandalf" }
```
For convenience, you can write the structure in YAML, which can be easily converted to a Ruby hash using `YAML.load(yml).deep_symbolize_keys` (`deep_symbolize_keys` is provided by ActiveSupport).
Consider the following YAML and XML:
```yml
schemas:
  description:
    path: xml/desc
    modifier: downcase
  amount:
    path: xml/info/price
    modifier: to_f
```
```xml
<xml>
  <desc>HELLO WORLD</desc>
  <info>
    <price>123</price>
  </info>
</xml>
```
The output is:
```ruby
{
  description: "hello world",
  amount: 123.0
}
```
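The YAML above can also be converted without ActiveSupport. A minimal sketch, assuming Ruby 2.6 or newer, where `YAML.safe_load` accepts the `symbolize_names` option; the variable names are only illustrative:

```ruby
require "yaml"

yml = <<~YAML
  schemas:
    description:
      path: xml/desc
      modifier: downcase
    amount:
      path: xml/info/price
      modifier: to_f
YAML

# symbolize_names turns every key into a Symbol, including nested ones.
structure = YAML.safe_load(yml, symbolize_names: true)
```

The resulting `structure` hash has the same shape as the one passed to `XmlDataExtractor.new` in the first example.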
### Defining the structure
The structure should be defined as a hash inside the `schemas` key. See the [complete example](https://github.com/monde-sistemas/xml_data_extractor/blob/master/spec/complete_example_spec.rb#L5).
When defining the structure you can combine any available command in order to extract and format the data as needed.
The available commands fall into two general purposes:
- [Navigation & Extraction](#navigation--extraction)
- [Formatting](#formatting)
### Navigation & Extraction:
The data extraction process is based on XPath, using Nokogiri.
* [Xpath introduction](https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath-with-examples)
* [Xpath cheatsheet](https://devhints.io/xpath)
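To get a feel for how an XPath lookup and a modifier could work together, here is a rough sketch. It is not the gem's implementation: it uses REXML, which ships with Ruby, instead of Nokogiri so that it runs without extra gems, and `extract` is a hypothetical helper.

```ruby
require "rexml/document"

# Hypothetical helper: reads the text found at an XPath and optionally
# calls a method (the "modifier") on it.
def extract(doc, path, modifier = nil)
  node = REXML::XPath.first(doc, path)
  return nil if node.nil?

  value = node.text
  modifier ? value.public_send(modifier) : value
end

doc = REXML::Document.new(
  "<xml><desc>HELLO WORLD</desc><info><price>123</price></info></xml>"
)

description = extract(doc, "xml/desc", :downcase)
amount = extract(doc, "xml/info/price", :to_f)
```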
#### path
Defines the `xpath` of the element.
The `path` is the default command of a field definition, so this:
```yml
schemas:
  description:
    path: xml/desc
```
is equivalent to this:
```yml
schemas:
  description: xml/desc
```
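One way to picture the shorthand is as a normalization step that wraps a bare value in a hash with a `path` key. The helper below is a hypothetical illustration of that idea, not code taken from the gem:

```ruby
# Hypothetical helper: expands the shorthand form into the explicit form.
def normalize_field(definition)
  definition.is_a?(Hash) ? definition : { path: definition }
end

short = normalize_field("xml/desc")
explicit = normalize_field({ path: "xml/desc", modifier: "downcase" })
```

Both forms end up as a hash, so later steps only need to handle one shape.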
It can be defined as a string:
```yml
schemas:
  description:
    path: xml/some_field
```
```xml
<xml>
  <some_field>ABC</some_field>
</xml>
```
```ruby
{ description: "ABC" }
```
Or as a string array:
```yml
schemas:
  address:
    path: [street, info/city]
```
```xml
<xml>
  <street>Diagon Alley</street>
  <info>
    <city>London</city>
  </info>
</xml>
```
```ruby
{ address: ["Diagon Alley", "London"] }
```
And even as a hash array, for complex operations:
```yml
schemas:
  address:
    path:
      - path: street
        modifier: downcase
      - path: info/city
        modifier: upcase
```
```ruby
{ address: ["diagon alley", "LONDON"] }
```
#### attr
Defines the tag attribute from which the value should be extracted, instead of the tag's text content:
```yml
schemas:
  description:
    path: xml/info
    attr: desc
```
```xml
<xml>
  <info desc="ABC">some stuff</info>
</xml>
```
```ruby
{ description: "ABC" }
```
Like the path, it can also be defined as a string array.
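The difference between a tag's text and one of its attributes can be sketched with REXML (bundled with Ruby) rather than Nokogiri. The sample document is an assumption inferred from the `path` and `attr` values in the example, and this is not the gem's own code:

```ruby
require "rexml/document"

doc = REXML::Document.new('<xml><info desc="ABC">some stuff</info></xml>')
node = REXML::XPath.first(doc, "xml/info")

text_value = node.text               # the tag's own text content
attr_value = node.attributes["desc"] # the value of the desc attribute
```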
#### within
Defines a root path for the nested fields:
```yml
schemas:
  movie:
    within: info/movie_data
    title: original_title
    actor: main_actor
```
```xml
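<xml>
  <info>
    <movie_data>
      <original_title>The Irishman</original_title>
      <main_actor>Robert De Niro</main_actor>
    </movie_data>
  </info>
</xml>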
```
```ruby
{ movie: { title: "The Irishman", actor: "Robert De Niro" } }
```
#### unescape
This option is useful when XML or HTML is embedded inside a tag in escaped form, like CDATA elements, and needs to be unescaped before its content can be parsed:
```yml
schemas:
  movie:
    unescape: response
    title: response/original_title
    actor: response/main_actor
```
```xml
<xml>
  <response>
    &lt;original_title&gt;The Irishman&lt;/original_title&gt;&lt;main_actor&gt;Robert De Niro&lt;/main_actor&gt;
  </response>
</xml>
```
During parsing, the escaped content is unescaped and the XML becomes:
```xml
<xml>
  <response>
    <original_title>The Irishman</original_title>
    <main_actor>Robert De Niro</main_actor>
  </response>
</xml>
```
```ruby
{ movie: { title: "The Irishman", actor: "Robert De Niro" } }
```
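The unescaping step itself can be illustrated with Ruby's standard library. This sketch assumes HTML-style entities and uses `CGI.unescapeHTML`; how the gem performs the step internally is not shown here:

```ruby
require "cgi"

escaped = "&lt;original_title&gt;The Irishman&lt;/original_title&gt;"

# Entities such as &lt; and &gt; become real angle brackets again,
# so the result can be parsed as XML.
unescaped = CGI.unescapeHTML(escaped)
```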
#### array_of
Defines the path to an XML collection, which is looped over to generate an array of hashes:
```yml
schemas:
  people:
    array_of: characters/character
    name: firstname
    age: age
```
```xml
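<xml>
  <characters>
    <character>
      <firstname>Geralt</firstname>
      <age>97</age>
    </character>
    <character>
      <firstname>Yennefer</firstname>
      <age>102</age>
    </character>
  </characters>
</xml>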
```
```ruby
{
  people: [
    { name: "Geralt", age: "97" },
    { name: "Yennefer", age: "102" }
  ]
}
```
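Conceptually, `array_of` selects every node matching a path and builds one hash per node, resolving the field paths relative to that node. A rough sketch of that loop, using REXML (bundled with Ruby) instead of Nokogiri and an assumed sample document; the `fields` hash is only illustrative:

```ruby
require "rexml/document"

xml = <<~XML
  <xml>
    <characters>
      <character><firstname>Geralt</firstname><age>97</age></character>
      <character><firstname>Yennefer</firstname><age>102</age></character>
    </characters>
  </xml>
XML

doc = REXML::Document.new(xml)
fields = { name: "firstname", age: "age" }

# One hash per <character>, with each field path resolved inside that node.
people = REXML::XPath.match(doc, "xml/characters/character").map do |node|
  fields.transform_values { |path| REXML::XPath.first(node, path).text }
end
```

Note that the values stay strings unless a modifier such as `to_i` is applied.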
If you need to loop through nested collections, you can define an array of paths:
```yml
schemas:
  show:
    within: show_data
    title: description
    people:
      array_of: [characters/character, info]
      name: name
```
```xml
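<xml>
  <show_data>
    <description>Peaky Blinders</description>
    <characters>
      <character>
        <info>
          <name>Tommy Shelby</name>
        </info>
        <info>
          <name>Arthur Shelby</name>
        </info>
      </character>
      <character>
        <info>
          <name>Alfie Solomons</name>
        </info>
      </character>
    </characters>
  </show_data>
</xml>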
```
```ruby
{
  show: {
    title: "Peaky Blinders",
    people: [
      { name: "Tommy Shelby" },
      { name: "Arthur Shelby" },
      { name: "Alfie Solomons" }
    ]
  }
}
```
### Formatting:
#### fixed
Defines a fixed value for the field:
```yml
currency:
  fixed: BRL
```
```ruby
{ currency: "BRL" }
```
#### mapper
Uses a hash of predefined values to replace the extracted value with the option it belongs to.
If the extracted value is not found in any of the mapper options, it is replaced by the `default` value. If no default is defined, the extracted value is returned unchanged.
```yml
mappers:
  currencies:
    default: unknown
    options:
      BRL: R$
      USD: [US$, $]
schemas:
  money:
    array_of: curr_types/type
    path: symbol
    mapper: currencies
```
```xml
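<xml>
  <curr_types>
    <type>
      <symbol>US$</symbol>
    </type>
    <type>
      <symbol>R$</symbol>
    </type>
    <type>
      <symbol>RB</symbol>
    </type>
    <type>
      <symbol>$</symbol>
    </type>
  </curr_types>
</xml>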
```
```ruby
{
  money: ["USD", "BRL", "unknown", "USD"]
}
```
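The lookup rules described above can be sketched in plain Ruby. `apply_mapper` is a hypothetical helper written for illustration, not part of the gem's API; it assumes the mapper has already been loaded as a hash with symbol keys:

```ruby
# Hypothetical helper: returns the option key whose accepted values include
# the extracted value, else the default, else the value itself.
def apply_mapper(value, mapper)
  match = mapper[:options].find do |_key, accepted|
    Array(accepted).include?(value)
  end
  return match.first.to_s if match

  mapper.fetch(:default, value)
end

currencies = {
  default: "unknown",
  options: { BRL: "R$", USD: ["US$", "$"] }
}

mapped = ["US$", "R$", "RB", "$"].map do |symbol|
  apply_mapper(symbol, currencies)
end
```

`Array(accepted)` lets an option hold either a single value (`R$`) or a list of values (`[US$, $]`).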
#### modifier
Defines a method to be called on the returned value.
```yml
schemas:
  name:
    path: some_field
    modifier: upcase
```
```xml
<xml>
  <some_field>Lewandovski</some_field>
</xml>
```
```ruby
{ name: "LEWANDOVSKI" }
```
You can also pass parameters to the method. In this case, declare the modifier as an array in which each entry is either a method name or a hash with the `name` and `params` keys:
```yml
schemas:
  name:
    path: [firstname, lastname]
    modifier:
      - name: join
        params: [" "]
      - downcase
```
```xml
<xml>
  <firstname>Robert</firstname>
  <lastname>Martin</lastname>
</xml>
```
```ruby
{ name: "robert martin" }
```
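A chain of modifiers like this one can be pictured as a fold over the value, where each step calls a method on the result of the previous step. The sketch below is a hypothetical illustration of that idea, not the gem's code, and assumes the modifier list uses symbol keys:

```ruby
# Hypothetical helper: applies each modifier in order to the running value.
def apply_modifiers(value, modifiers)
  # A single modifier (String or Hash) is treated as a one-element list.
  modifiers = [modifiers] unless modifiers.is_a?(Array)

  modifiers.reduce(value) do |current, modifier|
    if modifier.is_a?(Hash)
      current.public_send(modifier[:name], *Array(modifier[:params]))
    else
      current.public_send(modifier)
    end
  end
end

full_name = apply_modifiers(
  ["Robert", "Martin"],
  [{ name: "join", params: [" "] }, "downcase"]
)
```

The order matters here: `join` has to run first, because `downcase` is a String method and the extracted value starts out as an array.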
If you need to use custom methods, you can pass an object that implements them when initializing the extractor. Each custom method receives the extracted value as its parameter:
```yml
schemas:
  price:
    path: final_price
    modifier: format_as_float
```
```xml
<xml>
  <final_price>R$ 12.99</final_price>
</xml>
```
```ruby
class MyMethods
  # Removes everything except digits and dots, then converts to a Float.
  def format_as_float(value)
    value.gsub(/[^\d.]/, "").to_f
  end
end

XmlDataExtractor.new(yml, MyMethods.new).parse(xml)
```
```ruby
{ price: 12.99 }
```
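One plausible way to combine custom and built-in modifiers is to check the helper object first and fall back to calling the method on the value. This is only a sketch of that dispatch idea under that assumption; `PriceHelpers` and `apply_modifier` are hypothetical names, and the gem may resolve methods differently:

```ruby
class PriceHelpers
  # Removes everything except digits and dots, then converts to a Float.
  def format_as_float(value)
    value.gsub(/[^\d.]/, "").to_f
  end
end

# Hypothetical dispatcher: prefers a method on the helper object and
# otherwise calls the method on the value itself.
def apply_modifier(value, name, helpers = nil)
  if helpers&.respond_to?(name)
    helpers.public_send(name, value)
  else
    value.public_send(name)
  end
end

helpers = PriceHelpers.new
price = apply_modifier("R$ 12.99", "format_as_float", helpers)
shout = apply_modifier("abc", "upcase", helpers)
```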