# XmlDataExtractor
This gem provides a DSL for extracting formatted data from any XML structure.
## Installation
Add this line to your application's Gemfile:
```ruby
gem 'xml_data_extractor'
```
And then execute:

    $ bundle install

Or install it yourself as:

    $ gem install xml_data_extractor

## Usage
The general idea is to declare a Ruby Hash that represents the field structure, containing instructions for how each piece of data should be retrieved from the XML document.
```ruby
structure = { schemas: { character: { path: "xml/FirstName" } } }
xml = "<xml><FirstName>Gandalf</FirstName></xml>"
result = XmlDataExtractor.new(structure).parse(xml)
# result -> { character: "Gandalf" }
```
For convenience, you can write the structure in YAML, which can be easily converted to a Ruby hash using `YAML.load(yml).deep_symbolize_keys` (`deep_symbolize_keys` is provided by ActiveSupport).
Consider the following YAML and XML:
```yml
schemas:
  description:
    path: xml/desc
    modifier: downcase
  amount:
    path: xml/info/price
    modifier: to_f
```
```xml
<xml>
  <desc>HELLO WORLD</desc>
  <info>
    <price>123</price>
  </info>
</xml>
```
The output is:
```ruby
{
  description: "hello world",
  amount: 123.0
}
```
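If ActiveSupport is not available, the YAML-to-hash conversion mentioned above can be sketched with the standard library alone. The `load_structure` helper below is a hypothetical name used only for illustration and is not part of the gem; it relies on the `symbolize_names` option of `YAML.safe_load`.

```ruby
require "yaml"

# Hypothetical helper (not part of the gem): parses a YAML structure and
# symbolizes every key, without depending on ActiveSupport.
def load_structure(yml)
  YAML.safe_load(yml, symbolize_names: true)
end

yml = <<~YAML
  schemas:
    description:
      path: xml/desc
      modifier: downcase
    amount:
      path: xml/info/price
      modifier: to_f
YAML

structure = load_structure(yml)
# structure[:schemas][:amount] is { path: "xml/info/price", modifier: "to_f" }
```

The resulting hash should be usable wherever a hand-written structure hash is expected, assuming the extractor only needs symbol keys.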
### Defining the structure
The structure should be defined as a hash inside the `schemas` key. See the [complete example](https://github.com/monde-sistemas/xml_data_extractor/blob/master/spec/complete_example_spec.rb#L5).
When defining the structure you can combine any available command in order to extract and format the data as needed.
The available commands fall into two general categories:
- [Navigation & Extraction](#navigation--extraction)
- [Formatting](#formatting)
### Navigation & Extraction:
The data extraction process is based on XPath, using Nokogiri.
* [XPath introduction](https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath-with-examples)
* [XPath cheatsheet](https://devhints.io/xpath)
#### path
Defines the `xpath` of the element.
The `path` is the default command of a field definition, so this:
```yml
schemas:
  description:
    path: xml/desc
```
is equivalent to this:
```yml
schemas:
  description: xml/desc
```
It can be defined as a string:
```yml
schemas:
  description:
    path: xml/some_field
```
```xml
<xml>
  <some_field>ABC</some_field>
</xml>
```
```ruby
{ description: "ABC" }
```
Or as a string array:
```yml
schemas:
  address:
    path: [street, info/city]
```
```xml
<xml>
  <street>Diagon Alley</street>
  <info>
    <city>London</city>
  </info>
</xml>
```
```ruby
{ address: ["Diagon Alley", "London"] }
```
And even as a hash array, for complex operations:
```yml
schemas:
  address:
    path:
      - path: street
        modifier: downcase
      - path: info/city
        modifier: upcase
```
```ruby
{ address: ["diagon alley", "LONDON"] }
```
#### attr
Defines the tag attribute from which the value should be extracted, instead of the tag's own value:
```yml
schemas:
  description:
    path: xml/info
    attr: desc
```
```xml
<xml>
  <info desc="ABC">some stuff</info>
</xml>
```
```ruby
{ description: "ABC" }
```
Like the path, it can also be defined as a string array.
#### within
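Defines a root path for the nested fields:
```yml
schemas:
  movie:
    within: info/movie_data
    title: original_title
    actor: main_actor
```
```xml
<xml>
  <info>
    <movie_data>
      <original_title>The Irishman</original_title>
      <main_actor>Robert De Niro</main_actor>
    </movie_data>
  </info>
</xml>
```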
```ruby
{ movie: { title: "The Irishman", actor: "Robert De Niro" } }
```
#### unescape
This option is useful when XML or HTML is embedded inside a tag, as in CDATA elements, and you need to unescape it first in order to parse its content:
```yml
schemas:
  movie:
    unescape: response
    title: response/original_title
    actor: response/main_actor
```
```xml
<xml>
  <response>
    &lt;original_title&gt;The Irishman&lt;/original_title&gt;&lt;main_actor&gt;Robert De Niro&lt;/main_actor&gt;
  </response>
</xml>
```
During parsing, this XML is turned into the following:
```xml
<xml>
  <response>
    <original_title>The Irishman</original_title>
    <main_actor>Robert De Niro</main_actor>
  </response>
</xml>
```
```ruby
{ movie: { title: "The Irishman", actor: "Robert De Niro" } }
```
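To illustrate what unescaping means here, the sketch below decodes entity-escaped markup using Ruby's standard `CGI.unescapeHTML`. This only demonstrates the idea; it is an assumption that the gem does something comparable internally, and `unescape_markup` is a hypothetical helper name.

```ruby
require "cgi"

# Hypothetical helper: turns entity-escaped markup back into real tags
# so that it can be parsed as XML.
def unescape_markup(text)
  CGI.unescapeHTML(text)
end

escaped = "&lt;original_title&gt;The Irishman&lt;/original_title&gt;"
unescaped = unescape_markup(escaped)
# unescaped is "<original_title>The Irishman</original_title>"
```

Once the content is unescaped, nested paths such as `response/original_title` can be resolved like any other element.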
#### array_of
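Defines the path to an XML collection, which is iterated over to generate an array of hashes:
```yml
schemas:
  people:
    array_of: characters/character
    name: firstname
    age: age
```
```xml
<xml>
  <characters>
    <character>
      <firstname>Geralt</firstname>
      <age>97</age>
    </character>
    <character>
      <firstname>Yennefer</firstname>
      <age>102</age>
    </character>
  </characters>
</xml>
```
```ruby
{
  people: [
    { name: "Geralt", age: "97" },
    { name: "Yennefer", age: "102" }
  ]
}
```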
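If you need to loop through nested collections, you can define an array of paths:
```yml
schemas:
  show:
    within: show_data
    title: description
    people:
      array_of: [characters/character, info]
      name: name
```
```xml
<xml>
  <show_data>
    <description>Peaky Blinders</description>
    <characters>
      <character>
        <info>
          <name>Tommy Shelby</name>
        </info>
        <info>
          <name>Arthur Shelby</name>
        </info>
      </character>
      <character>
        <info>
          <name>Alfie Solomons</name>
        </info>
      </character>
    </characters>
  </show_data>
</xml>
```
```ruby
{
  show: {
    title: "Peaky Blinders",
    people: [
      { name: "Tommy Shelby" },
      { name: "Arthur Shelby" },
      { name: "Alfie Solomons" }
    ]
  }
}
```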
#### link
This command is useful when the XML contains references to other nodes; it works like a SQL JOIN. The path must be an expression containing the `<link>` identifier, which will be replaced by the value fetched from the `link:` command.
Example:
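```yml
schemas:
  bookings:
    array_of: booking
    date: booking_date
    document: id
    products:
      array_of:
      accomodation:
        path: ../hotel[booking_id=<link>]/accomodation
        link: id
```
```xml
<xml>
  <booking>
    <id>1</id>
    <booking_date>2020-01-01</booking_date>
  </booking>
  <booking>
    <id>2</id>
    <booking_date>2020-01-02</booking_date>
  </booking>
  <hotel>
    <booking_id>1</booking_id>
    <accomodation>Standard</accomodation>
  </hotel>
  <hotel>
    <booking_id>2</booking_id>
    <accomodation>Premium</accomodation>
  </hotel>
</xml>
```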
```ruby
{
  bookings: [
    {
      date: "2020-01-01",
      document: "1",
      products: [
        { accomodation: "Standard" }
      ]
    },
    {
      date: "2020-01-02",
      document: "2",
      products: [
        { accomodation: "Premium" }
      ]
    }
  ]
}
```
In this example, without `link` to select only the hotel of each booking, both accommodations would have been returned for every booking, and `accomodation` would have been extracted as an array of all the accommodations instead of a single string.
You can combine `link` with `array_of` to search for a list of elements filtered by some field; just provide the `path` and the `link`:
```yml
schemas:
  bookings:
    array_of: booking
    date: date
    document: id
    products:
      array_of:
        path: ../products[booking_id=<link>]
        link: id
      ....
```
#### uniq_by
Can only be used with **array_of**.
This functionality is useful when some XML nodes are duplicated and you want to extract data from the first occurrence only. It behaves similarly to Ruby's `uniq` method on arrays.
For each path generated from `array_of`, the value fetched using `uniq_by` is checked against the collection generated so far, and the path is discarded if the value already exists.
```yml
schemas:
  bookings:
    array_of:
      path: booking
      uniq_by: id
    date: bdate
    document: id
```
```xml
<xml>
  <booking>
    <id>1</id>
    <bdate>2020-01-01</bdate>
  </booking>
  <booking>
    <id>1</id>
    <bdate>2020-01-01</bdate>
  </booking>
</xml>
```
```ruby
{
  bookings: [
    {
      date: "2020-01-01",
      document: "1"
    }
  ]
}
```
In this example, without the `uniq_by` option, two elements with the same data would be extracted:
```ruby
{
  bookings: [
    {
      date: "2020-01-01",
      document: "1"
    },
    {
      date: "2020-01-01",
      document: "1"
    }
  ]
}
```
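The "keep the first occurrence only" behavior described above can be sketched in plain Ruby. The `uniq_by_key` helper is hypothetical and operates on already-extracted hashes, whereas the gem applies the check to XML paths; it is only meant to show the filtering rule.

```ruby
# Hypothetical helper (not part of the gem): keeps only the first item
# for each distinct value found under +key+.
def uniq_by_key(items, key)
  seen = []
  items.reject do |item|
    value = item[key]
    duplicate = seen.include?(value)
    seen << value unless duplicate
    duplicate
  end
end

bookings = [
  { document: "1", date: "2020-01-01" },
  { document: "1", date: "2020-01-01" }
]
unique = uniq_by_key(bookings, :document)
# unique is [{ document: "1", date: "2020-01-01" }]
```

For in-memory arrays this is equivalent to `items.uniq { |item| item[key] }`; the explicit loop is shown only to mirror the description of checking each value against the collection built so far.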
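#### array_presence: first_only
A field with this property is added only to the first item of the array.
Can only be used in fields that belong to a node of `array_of`.
```yml
passengers:
  array_of: bookings/booking/passengers/passenger
  id:
    path: document
    modifier: to_s
  name:
    attr: [FirstName, LastName]
    modifier:
      - name: join
        params: [" "]
  tax_rav:
    array_presence: first_only
    path: ../rav
    modifier: to_f
```
```xml
<xml>
  <bookings>
    <booking>
      <passengers>
        <rav>150</rav>
        <passenger>
          <document>109.111.019-79</document>
          <FirstName>Marcelo</FirstName>
          <LastName>Lauxen</LastName>
        </passenger>
        <passenger>
          <document>110.155.019-78</document>
          <FirstName>Corona</FirstName>
          <LastName>Virus</LastName>
        </passenger>
      </passengers>
    </booking>
  </bookings>
</xml>
```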
```ruby
{
  bookings: [
    {
      passengers: [
        {
          id: "109.111.019-79",
          name: "Marcelo Lauxen",
          tax_rav: 150.00
        },
        {
          id: "110.155.019-78",
          name: "Corona Virus"
        }
      ]
    }
  ]
}
```
In this example the field `tax_rav` was only included on the first passenger because this field has the `array_presence: first_only` property.
#### in_parent
This option allows you to navigate to a parent node of the current node.
```yml
passengers:
  array_of: bookings/booking/passengers/passenger
  id:
    path: document
    modifier: to_s
  bookings_id:
    in_parent: bookings
    path: id
```
```xml
<xml>
  <bookings>
    <id>8888</id>
    <booking>
      <passengers>
        <passenger>
          <document>109.111.019-79</document>
        </passenger>
        <passenger>
          <document>110.155.019-78</document>
        </passenger>
      </passengers>
    </booking>
  </bookings>
</xml>
```
```ruby
{
  bookings: [
    {
      passengers: [
        {
          id: "109.111.019-79",
          bookings_id: "8888"
        },
        {
          id: "110.155.019-78",
          bookings_id: "8888"
        }
      ]
    }
  ]
}
```
In this example the value of `bookings_id` is extracted starting at the node given in `in_parent` instead of the current node. You can also navigate to a parent node with `../` (XPath provides this), but with `in_parent` you only need to provide the name of the parent node: the extractor navigates up until that node is found, no matter how many levels above it is.
#### keep_if
This option keeps a block of the hash in the final result only if the condition matches.
```yml
schemas:
  dummy:
    within: data
    description: additional_desc
    exchange: currency_info/value
    price: price
    payment:
      type: payment_info/method
      value: payment_info/price
      keep_if: "'type' == 'invoice'"
```
```xml
Keep walking4.1555.09card55.482333
```
```ruby
{
  dummy: {
    description: "Keep walking",
    exchange: "4.15",
    price: "55.09"
  }
}
```
In this example the condition didn't match, since the payment method was `card` instead of `invoice`, so the extracted payment hash was removed from the final result.
### Formatting:
#### fixed
Defines a fixed value for the field:
```yml
currency:
  fixed: BRL
```
```ruby
{ currency: "BRL" }
```
#### mapper
Uses a hash of predefined values to replace the extracted value with its respective option.
If the extracted value is not found in any of the mapper options, it is replaced by the `default` value; if no default is defined, the extracted value is returned unchanged.
```yml
mappers:
  currencies:
    default: unknown
    options:
      BRL: R$
      USD: [US$, $]
schemas:
  money:
    array_of: curr_types/type
    path: symbol
    mapper: currencies
```
```xml
<xml>
  <curr_types>
    <type>
      <symbol>US$</symbol>
    </type>
    <type>
      <symbol>R$</symbol>
    </type>
    <type>
      <symbol>RB$</symbol>
    </type>
    <type>
      <symbol>$</symbol>
    </type>
  </curr_types>
</xml>
```
```ruby
{
  money: ["USD", "BRL", "unknown", "USD"]
}
```
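The lookup rule described above (match the extracted value against each option, fall back to `default`, otherwise keep the value) can be sketched in plain Ruby. `map_value` is a hypothetical helper written for illustration and is not the gem's implementation.

```ruby
# Hypothetical helper (not part of the gem): returns the option key whose
# accepted values include +value+, then the default, then the value itself.
def map_value(mapper, value)
  key, _accepted = mapper[:options].find do |_key, accepted|
    Array(accepted).include?(value)
  end
  return key.to_s if key

  mapper.fetch(:default, value)
end

currencies = {
  default: "unknown",
  options: { BRL: "R$", USD: ["US$", "$"] }
}

mapped = ["US$", "R$", "RB$", "$"].map { |symbol| map_value(currencies, symbol) }
# mapped is ["USD", "BRL", "unknown", "USD"]
```

`Array(accepted)` lets a single string and a list of strings be handled the same way, which matches how `BRL` and `USD` are declared in the YAML above.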
#### modifier
Defines a method to be called on the returned value.
```yml
schemas:
  name:
    path: some_field
    modifier: upcase
```
```xml
<xml>
  <some_field>Lewandovski</some_field>
</xml>
```
```ruby
{ name: "LEWANDOVSKI" }
```
You can also pass parameters to the method. In this case you will have to declare the modifier as an array of hashes, with the `name` and `params` keys:
```yml
schemas:
  name:
    path: [firstname, lastname]
    modifier:
      - name: join
        params: [" "]
      - downcase
```
```xml
<xml>
  <firstname>Robert</firstname>
  <lastname>Martin</lastname>
</xml>
```
```ruby
{ name: "robert martin" }
```
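As a rough illustration of how a chain of modifiers might be applied, the sketch below calls each modifier in order on the running value using `public_send`. `apply_modifiers` is a hypothetical helper, and it is an assumption that the gem works along these lines.

```ruby
# Hypothetical helper (not part of the gem): applies each modifier in order.
# A modifier is either a method name or a hash with :name and :params.
def apply_modifiers(value, modifiers)
  list = modifiers.is_a?(Array) ? modifiers : [modifiers]
  list.reduce(value) do |current, modifier|
    if modifier.is_a?(Hash)
      current.public_send(modifier[:name], *modifier[:params])
    else
      current.public_send(modifier)
    end
  end
end

full_name = apply_modifiers(
  ["Robert", "Martin"],
  [{ name: "join", params: [" "] }, "downcase"]
)
# full_name is "robert martin"
```

Because each step receives the result of the previous one, the order of the list matters: `join` must run before `downcase`, since `downcase` is not defined on arrays.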
If you need to use custom methods, you can pass an object containing the methods at initialization. The custom method receives the value as its parameter:
```yml
schemas:
  price:
    path: final_price
    modifier: format_as_float
```
```xml
<xml>
  <final_price>R$ 12.99</final_price>
</xml>
```
```ruby
class MyMethods
  # Strips everything except digits and dots, then converts to Float.
  def format_as_float(value)
    value.gsub(/[^\d.]/, "").to_f
  end
end
XmlDataExtractor.new(yml, MyMethods.new).parse(xml)
```
```ruby
{ price: 12.99 }
```