Working with XML
Extensible Markup Language (XML) lets users stored data in a semi structured format for information exchange. Users can validate or compare XML data with any other data source. Users can access the XML Files using SQL through the configured XML Connector.
Understanding XML Structure
An XML file is a text-based document that lets you save the data with .xml as the file extension.
<CRMList>
<Organization>Acme Inc</Organization>
<OrganizationID>100</OrganizationID>
<isActive>true</isActive>
<Location>US</Location>
<Location>IND</Location>
<Contacts>
<ID>11012</ID>
<Name>Lauren Walker</Name>
<BirthDate>1972-01-18</BirthDate>
<Address Type="Home">
<Zip>10999</Zip>
<Country>USA</Country>
</Address>
<Address Type="Office">
<Zip>95819</Zip>
<Country>USA</Country>
</Address>
</Contacts>
</CRMList>
- XML Document: The content between the start and end of a tag, e.g.
<xml></xml>
, is called an XML document. - XML Elements: All the tags within an XML document are called XML Elements that contains text attributes and other elements.
- Root Element: Each XML document begins with a primary tag, which is called the root element. In above example
<CRMList></CRMList>
is the root element. - XML Attribute: Each element can have a descriptor which is called an attribute. In following element
<Address Type="Office"/>
,Type
is the attribute. - XML Content: The data/ values in the XML document is called XML content. In above example
Acme Inc
,US
,10999
is the content.
Understanding XPath
XPath is a query language for identifying elements, attributes and extracting data from an XML document. XPath expressions are used to traverse the XML document.
Typically, the XPaths are auto-identified by the connector by parsing first 100 records, users may need to manually specify XPaths in certain cases where the connector's identification falls short.
Following are the different XPath expressions and their output for above XML example.
XPath: /CRMList
Output: Returns the complete XML document
XPath: /CRMList/Contacts
Output: Returns all the Contacts node/ document.
<Contacts>
<ID>11012</ID>
<Name>Lauren Walker</Name>
<BirthDate>1972-01-18</BirthDate>
<Address Type="Home">
<Zip>10999</Zip>
<Country>USA</Country>
</Address>
<Address Type="Office">
<Zip>95819</Zip>
<Country>USA</Country>
</Address>
</Contacts>
XPath: /CRMList/Contacts/Address[1]
Output: Returns address node at the 1st index.
<Address Type="Home">
<Zip>10999</Zip>
<Country>USA</Country>
</Address>
Parsing XML Data
For this example follow test data customer-order.xml has been used.
By default, the connector partially flattens only the root element, retaining nested elements as XML strings. However, it can also be configured to comprehensively flattens XML documents, transforming all elements and nodes into a relational structure.
Majority of use cases will require flattening of the entire document. In the flattening approach the connector creates distinct tables for object arrays, implicitly joined to their parent table. Notably, nested sibling XPaths are handled as SQL CROSS JOINs, ensuring accurate data representation within the relational structure.
When DataModel=FLATTENEDDOCUMENTS
and we run the following SQL against the customer-order.xml
file, the connector will:
- Flatten and implicitly join the
customers
object array and its nestedaddress
object array. - Flatten and implicitly join the
orders
object array and its nestedproducts
object array. - Cross join the results of steps 1 and 2 to generate the final table.
Example
SELECT id, name, zip, coordinates, customer_id, order_id, product_name, [metrics.price]
FROM customer_order
WHERE zip IN (12345, 94105) OR order_id in (1, 3)
Output
id | name | zip | coordinates | customer_id | order_id | product_name | metrics.price |
---|---|---|---|---|---|---|---|
1 | Ian Walker | 12345 | <coordinates>-98.808</coordinates><coordinates>48.11</coordinates> | null | null | null | null |
2 | Jenny Blob | 94105 | null | null | null | null | null |
null | null | null | null | 1 | 1 | Shirt | 50 |
null | null | null | null | 1 | 1 | Shoes | 50 |
null | null | null | null | 2 | 3 | Dress | 150 |
null | null | null | null | 2 | 3 | Bag | 50 |
Push Attributes
This property controls whether attributes within XML documents are transformed into individual columns.
Let's consider following document.
<address flag="active">
<state>CA</state>
<zip>90001</zip>
<coordinates>-90</coordinates>
<coordinates>60</coordinates>
<start_date>2023-07-01</start_date>
<end_date/>
</address>
When this property is set to true
, the flag
attribute will convert into a call and the output table will have the following structure:
Column | Type |
---|---|
address:_id | VARCHAR |
coordinates | VARCHAR |
end_date | DATE |
flag | VARCHAR |
start_date | DATE |
state | VARCHAR |
zip | INT |
and the data will look like this.
address:_id | coordinates | end_date | flag | start_date | state | zip |
---|---|---|---|---|---|---|
1 | <coordinates>-98.808</coordinates><coordinates>48.11</coordinates> | 2023-06-30 | inactive | 2023-01-01 | CA | 12345 |
Flatten Arrays
It is another key property that can be used to flatten array values into columns of their own. However, It is best to leave other unbounded arrays as they are and piece out the data for them as needed using XML Functions.
Let's consider the same document
<address flag="active">
<state>CA</state>
<zip>90001</zip>
<coordinates>-90</coordinates>
<coordinates>60</coordinates>
<start_date>2023-07-01</start_date>
<end_date/>
</address>
When this property value is set to 2, the coordinates array will convert into two columns and the output table will have the following structure:
Column | Type |
---|---|
address:_id | VARCHAR |
coordinates | VARCHAR |
coordinates.0 | VARCHAR |
coordinates.1 | VARCHAR |
end_date | DATE |
start_date | DATE |
state | VARCHAR |
zip | INT |
and the data will look like this.
address:_id | coordinates.0 | coordinates.1 | start_date | state | zip |
---|---|---|---|---|---|
1 | -98.808 | 48.11 | 2023-01-01 | CA | 12345 |
RowScanDepth
It determines how many object arrays the connector scans to find rows and columns. It follows nested objects, counting each object array as a row. The default value is 100.
Other Properties
- FlattenRowLimit: The maximum number of rows that can result from a single flattened element. Default is 250000.
Understanding Settings (Source/ Target)
XML file connector has specific configurations. Please read the following properties carefully to understand the behaviour of the connector.
Single File
This property specifies whether the driver should read data from a single XML file or a folder of XML files. If this property is enabled, the driver will read data from a single XML file with the .xml
extension. If this property is disabled, the driver will read data from all the .xml
files in the folder and union them together. The structure of all the XML files in the folder must be exactly the same.
Parameterised Path This property allows users to parameterize the file path and name using the parameter key. For this to work, the rule must have a default parameter file associated with it.
XPath Users can explicitly provide an XPath for the connector to identify the correct structure (elements and attributes) in the XML document. By default, the connector will identify the XPath by reading the first 200 records.
Additional Properties
The connector has some default properties that users can modify or add to, which changes how the connector processes XML data.
Property | Value | Note |
---|---|---|
Table Name | It is auto generated which the user can modify. | |
DataModel | FlattenedDocuments, Documents (Default) | |
PushAttributes | False (Default), True | View description |
FlattenArrays | Any numeric value. 0 is default. | View description |
FlattenObject | True (Default), False | |
QualifyColumns | None (Default), Parent, Full | Qualify column names based on nested level. |
FlattenRowLimit | Any numeric value. 250000 is default. | Limits the numbers of output rows. |
RowScanDepth | Any numeric value. 100 is default. | Rows (objects) to scan when dynamically determining the schema. |
Schema
Schema defines the structure of an XML file by identifying its nodes, attributes, and data types. Schemas are stored as separate objects in a repository, each with a unique schema ID.
The Get Schema
and Save Schema
actions suggest a user interface or application managing schemas and XML files.
Depending on the context, "Saving" might involve persisting the schema to a dedicated repository or embedding it within the XML file.
Schema and XML file together allow SQL to generate accurate output.
How To: Read XML File
This video shows you how to read data from customer-order.xml file.
XML Functions
The driver can return XML as column values. The driver enables you to use SQL functions to work with these column values. The following sections provide examples; for a reference, see STRING Functions. The examples in this section use the following array (see Parsing Hierarchical Data for more information on parsing XML objects and arrays):
<root>
<customer>
<id>1</id>
<name>Ian Walker</name>
</customer>
<customer>
<id>2</id>
<name>Abby Walker</name>
</customer>
<customer>
<id>3</id>
<name>Jenny Walker</name>
</customer>
</root>
TBD