# XmlDataExtractor

This gem provides a DSL for extracting formatted data from any XML structure.

## Installation

Add this line to your application's Gemfile:

```ruby
gem 'xml_data_extractor'
```

And then execute:

    $ bundle install

Or install it yourself as:

    $ gem install xml_data_extractor

## Usage

The general idea is to declare a Ruby hash that represents the field structure and contains instructions on how each piece of data should be retrieved from the XML document.

```ruby
structure = { schemas: { character: { path: "xml/FirstName" } } }
xml = "<xml><FirstName>Gandalf</FirstName></xml>"

result = XmlDataExtractor.new(structure).parse(xml)

# result -> { character: "Gandalf" }
```

For convenience, you can write the structure in YAML, which can be easily converted to a Ruby hash using `YAML.load(yml).deep_symbolize_keys` (`deep_symbolize_keys` comes from ActiveSupport).

Considering the following YAML and XML:

```yml
schemas:
  description:
    path: xml/desc
    modifier: downcase
  amount:
    path: xml/info/price
    modifier: to_f
```

```xml
<xml>
  <desc>HELLO WORLD</desc>
  <info>
    <price>123</price>
  </info>
</xml>
```

The output is:

```ruby
{
  description: "hello world",
  amount: 123.0
}
```

### Defining the structure

The structure should be defined as a hash inside the `schemas` key. See the [complete example](https://github.com/monde-sistemas/xml_data_extractor/blob/master/spec/complete_example_spec.rb#L5).

When defining the structure you can combine any of the available commands to extract and format the data as needed.

The available commands fall into two general purposes:

- [Navigation & Extraction](#navigation--extraction)
- [Formatting](#formatting)

### Navigation & Extraction:

The data extraction process is based on XPath, using Nokogiri.

* [Xpath introduction](https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath-with-examples)
* [Xpath cheatsheet](https://devhints.io/xpath)

#### path

Defines the `xpath` of the element.
The `path` is the default command of a field definition, so this:

```yml
schemas:
  description:
    path: xml/desc
```

Is equivalent to this:

```yml
schemas:
  description: xml/desc
```

It can be defined as a string:

```yml
schemas:
  description:
    path: xml/some_field
```

```xml
<xml>
  <some_field>ABC</some_field>
</xml>
```

```ruby
{ description: "ABC" }
```

Or as a string array:

```yml
schemas:
  address:
    path: [street, info/city]
```

```xml
<xml>
  <street>Diagon Alley</street>
  <info>
    <city>London</city>
  </info>
</xml>
```

```ruby
{ address: ["Diagon Alley", "London"] }
```

And even as a hash array, for complex operations:

```yml
schemas:
  address:
    path:
      - path: street
        modifier: downcase
      - path: info/city
        modifier: upcase
```

```ruby
{ address: ["diagon alley", "LONDON"] }
```

#### attr

Defines a tag attribute from which the value should be extracted, instead of the tag value itself:

```yml
schemas:
  description:
    path: xml/info
    attr: desc
```

```xml
<xml>
  <info desc="ABC">some stuff</info>
</xml>
```

```ruby
{ description: "ABC" }
```

Like the path, it can also be defined as a string array.

#### within

Defines a root path for the fields:

```yml
schemas:
  movie:
    within: info/movie_data
    title: original_title
    actor: main_actor
```

```xml
<xml>
  <info>
    <movie_data>
      <original_title>The Irishman</original_title>
      <main_actor>Robert De Niro</main_actor>
    </movie_data>
  </info>
</xml>
```

```ruby
{ movie: { title: "The Irishman", actor: "Robert De Niro" } }
```

#### unescape

This option is useful when you have XML or HTML embedded inside a tag, such as escaped markup or CDATA content, and you need to unescape it first in order to parse its content:

```yml
schemas:
  movie:
    unescape: response
    title: response/original_title
    actor: response/main_actor
```

```xml
<xml>
  <response>
    &lt;original_title&gt;The Irishman&lt;/original_title&gt;&lt;main_actor&gt;Robert De Niro&lt;/main_actor&gt;
  </response>
</xml>
```

This XML will be turned into this one during the parsing:

```xml
<xml>
  <response>
    <original_title>The Irishman</original_title>
    <main_actor>Robert De Niro</main_actor>
  </response>
</xml>
```

```ruby
{ movie: { title: "The Irishman", actor: "Robert De Niro" } }
```

#### array_of

Defines the path to an XML collection, which will be looped over to generate an array of hashes:

```yml
schemas:
  people:
    array_of: characters/character
    name: firstname
    age: age
```

```xml
<xml>
  <characters>
    <character>
      <firstname>Geralt</firstname>
      <age>97</age>
    </character>
    <character>
      <firstname>Yennefer</firstname>
      <age>102</age>
    </character>
  </characters>
</xml>
```

```ruby
{
  people: [
    { name: "Geralt", age: "97" },
    { name: "Yennefer", age: "102" }
  ]
}
```

If you need to loop through nested collections, you can define an array of paths:

```yml
schemas:
  show:
    within: show_data
    title: description
    people:
      array_of: [characters/character, info]
      name: name
```

```xml
<xml>
  <show_data>
    <description>Peaky Blinders</description>
    <characters>
      <character>
        <info>
          <name>Tommy Shelby</name>
        </info>
        <info>
          <name>Arthur Shelby</name>
        </info>
      </character>
      <character>
        <info>
          <name>Alfie Solomons</name>
        </info>
      </character>
    </characters>
  </show_data>
</xml>
```

```ruby
{
  show: {
    title: "Peaky Blinders",
    people: [
      { name: "Tommy Shelby" },
      { name: "Arthur Shelby" },
      { name: "Alfie Solomons" }
    ]
  }
}
```

#### link

This command is useful when the XML contains references to other nodes; it works like a SQL JOIN.

The path must be an expression containing the `<link>` identifier, which will be replaced by the value fetched by the `link:` command.

Example:

```yml
schemas:
  bookings:
    array_of: booking
    date: booking_date
    document: id
    products:
      array_of:
      accomodation:
        path: ../hotel[booking_id=<link>]/accomodation
        link: id
```

```xml
<xml>
  <booking>
    <id>1</id>
    <booking_date>2020-01-01</booking_date>
  </booking>
  <booking>
    <id>2</id>
    <booking_date>2020-01-02</booking_date>
  </booking>
  <hotel>
    <booking_id>1</booking_id>
    <accomodation>Standard</accomodation>
  </hotel>
  <hotel>
    <booking_id>2</booking_id>
    <accomodation>Premium</accomodation>
  </hotel>
</xml>
```

```ruby
{
  bookings: [
    {
      date: "2020-01-01",
      document: "1",
      products: [
        { accomodation: "Standard" }
      ]
    },
    {
      date: "2020-01-02",
      document: "2",
      products: [
        { accomodation: "Premium" }
      ]
    }
  ]
}
```

In this example, without `link` to select only the hotel of each booking, both accommodations would be returned for every booking, and instead of a string the `accomodation` field would contain an array with all of them.

You can combine `link` with `array_of` if you want to search for a list of elements filtered by some field; just provide the `path` and the `link`:

```yml
schemas:
  bookings:
    array_of: booking
    date: date
    document: id
    products:
      array_of:
        path: ../products[booking_id=<link>]
        link: id
      ....
```

#### uniq_by

Can only be used with **array_of**.

This functionality is useful when some XML nodes are duplicated and you want to extract data from the first occurrence only. It behaves similarly to Ruby's **uniq** method on arrays.
For each path generated by `array_of`, the value fetched using `uniq_by` is checked against the collection generated so far, and the path is discarded if the value already exists.

```yml
schemas:
  bookings:
    array_of:
      path: booking
      uniq_by: id
    date: bdate
    document: id
```

```xml
<xml>
  <booking>
    <id>1</id>
    <bdate>2020-01-01</bdate>
  </booking>
  <booking>
    <id>1</id>
    <bdate>2020-01-01</bdate>
  </booking>
</xml>
```

```ruby
{
  bookings: [
    { date: "2020-01-01", document: "1" }
  ]
}
```

In this example, without `uniq_by`, two elements with the same data would be extracted:

```ruby
{
  bookings: [
    { date: "2020-01-01", document: "1" },
    { date: "2020-01-01", document: "1" }
  ]
}
```

#### array_presence: first_only

A field with this property is added only to the first item of the array. It can only be used in fields that belong to an `array_of` node.

```yml
passengers:
  array_of: bookings/booking/passengers/passenger
  id:
    path: document
    modifier: to_s
  name:
    path: [FirstName, LastName]
    modifier:
      - name: join
        params: [" "]
  rav_tax:
    array_presence: first_only
    path: ../rav
    modifier: to_f
```

```xml
<bookings>
  <booking>
    <passengers>
      <rav>150</rav>
      <passenger>
        <document>109.111.019-79</document>
        <FirstName>Marcelo</FirstName>
        <LastName>Lauxen</LastName>
      </passenger>
      <passenger>
        <document>110.155.019-78</document>
        <FirstName>Corona</FirstName>
        <LastName>Virus</LastName>
      </passenger>
    </passengers>
  </booking>
</bookings>
```

```ruby
{
  bookings: [
    {
      passengers: [
        { id: "109.111.019-79", name: "Marcelo Lauxen", rav_tax: 150.0 },
        { id: "110.155.019-78", name: "Corona Virus" }
      ]
    }
  ]
}
```

In this example the field `rav_tax` was included only on the first passenger because it has the `array_presence: first_only` property.

#### in_parent

This option allows you to navigate to a parent node of the current node.

```yml
passengers:
  array_of: bookings/booking/passengers/passenger
  id:
    path: document
    modifier: to_s
  bookings_id:
    in_parent: bookings
    path: id
```

```xml
<bookings>
  <id>8888</id>
  <booking>
    <passengers>
      <passenger>
        <document>109.111.019-79</document>
      </passenger>
      <passenger>
        <document>110.155.019-78</document>
      </passenger>
    </passengers>
  </booking>
</bookings>
```

```ruby
{
  bookings: [
    {
      passengers: [
        { id: "109.111.019-79", bookings_id: "8888" },
        { id: "110.155.019-78", bookings_id: "8888" }
      ]
    }
  ]
}
```

In this example the value of `bookings_id` is extracted starting at the node given in `in_parent` instead of the current node.
It's also possible to navigate to a parent node with `../` (XPath provides this), but with `in_parent` you only need to provide the name of the parent node; it navigates up until that node is found, no matter how many levels away it is.

#### keep_if

This option keeps a block of the hash in the final result only if the condition matches.

```yml
schemas:
  dummy:
    within: data
    description: additional_desc
    exchange: currency_info/value
    price: price
    payment:
      type: payment_info/method
      value: payment_info/price
      keep_if: "'type' == 'invoice'"
```

```xml
<xml>
  <data>
    <additional_desc>Keep walking</additional_desc>
    <currency_info>
      <value>4.15</value>
    </currency_info>
    <price>55.09</price>
    <payment_info>
      <method>card</method>
      <price>55.48</price>
    </payment_info>
  </data>
</xml>
```

```ruby
{
  dummy: {
    description: "Keep walking",
    exchange: "4.15",
    price: "55.09"
  }
}
```

In this example the condition didn't match, since the payment method was `card` instead of `invoice`, so the extracted `payment` hash was removed from the final result.

### Formatting:

#### fixed

Defines a fixed value for the field:

```yml
currency:
  fixed: BRL
```

```ruby
{ currency: "BRL" }
```

#### mapper

Uses a hash of predefined values to replace the extracted value with its respective option. If the extracted value is not found in any of the mapper options, it is replaced by the `default` value; if no default is defined, the extracted value is returned unchanged.

```yml
mappers:
  currencies:
    default: unknown
    options:
      BRL: R$
      USD: [US$, $]
schemas:
  money:
    array_of: curr_types/type
    path: symbol
    mapper: currencies
```

```xml
<xml>
  <curr_types>
    <type>
      <symbol>US$</symbol>
    </type>
    <type>
      <symbol>R$</symbol>
    </type>
    <type>
      <symbol>RB</symbol>
    </type>
    <type>
      <symbol>$</symbol>
    </type>
  </curr_types>
</xml>
```

```ruby
{ money: ["USD", "BRL", "unknown", "USD"] }
```

#### modifier

Defines a method to be called on the returned value.

```yml
schemas:
  name:
    path: some_field
    modifier: upcase
```

```xml
<xml>
  <some_field>Lewandovski</some_field>
</xml>
```

```ruby
{ name: "LEWANDOVSKI" }
```

You can also pass parameters to the method.
In this case, declare the modifier as an array. Each entry is either a plain method name or a hash with the `name` and `params` keys:

```yml
schemas:
  name:
    path: [firstname, lastname]
    modifier:
      - name: join
        params: [" "]
      - downcase
```

```xml
<xml>
  <firstname>Robert</firstname>
  <lastname>Martin</lastname>
</xml>
```

```ruby
{ name: "robert martin" }
```

If you need to use custom methods, you can pass an object containing the methods in the initialization. The custom method will receive the value as a parameter:

```yml
schemas:
  price:
    path: final_price
    modifier: format_as_float
```

```xml
<xml>
  <final_price>R$ 12.99</final_price>
</xml>
```

```ruby
# Object holding the custom modifiers; each method receives the extracted value.
class MyMethods
  def format_as_float(value)
    # Drop everything except digits and the decimal point, then convert.
    value.gsub(/[^\d.]/, "").to_f
  end
end

XmlDataExtractor.new(yml, MyMethods.new).parse(xml)
```

```ruby
{ price: 12.99 }
```
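The custom-modifier contract above can be tried out without the gem. The following is a minimal sketch using only the Ruby standard library: it loads the YAML structure with symbol keys through `YAML.safe_load(..., symbolize_names: true)` (a stdlib alternative to `deep_symbolize_keys`, used here as an assumption about what the extractor accepts, based on the hash shown in the Usage section) and calls the modifier directly on a sample value. The final call to `XmlDataExtractor` is left as a comment so the snippet runs on its own.

```ruby
require "yaml"

# Structure from the custom-modifier example, written as YAML.
yml = <<~YAML
  schemas:
    price:
      path: final_price
      modifier: format_as_float
YAML

# symbolize_names: true makes Psych return symbol keys at every level,
# so ActiveSupport is not required for this step.
structure = YAML.safe_load(yml, symbolize_names: true)

# Object holding the custom modifier; the method receives the extracted
# string and returns the formatted value.
class MyMethods
  def format_as_float(value)
    value.gsub(/[^\d.]/, "").to_f
  end
end

# Exercise the modifier in isolation with the value from the XML example.
price = MyMethods.new.format_as_float("R$ 12.99")

# With the gem installed, the pieces would be combined as in the README:
#   XmlDataExtractor.new(structure, MyMethods.new).parse(xml)
```

Testing the modifier object on its own like this keeps formatting logic verifiable independently of the XML navigation.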