Skip to main content

Working with XML


Extensible Markup Language (XML) lets users stored data in a semi structured format for information exchange. Users can validate or compare XML data with any other data source. Users can access the XML Files using SQL through the configured XML Connector.

Understanding XML Structure

An XML file is a text-based document that lets you save the data with .xml as the file extension.

<CRMList>
<Organization>Acme Inc</Organization>
<OrganizationID>100</OrganizationID>
<isActive>true</isActive>
<Location>US</Location>
<Location>IND</Location>
<Contacts>
<ID>11012</ID>
<Name>Lauren Walker</Name>
<BirthDate>1972-01-18</BirthDate>
<Address Type="Home">
<Zip>10999</Zip>
<Country>USA</Country>
</Address>
<Address Type="Office">
<Zip>95819</Zip>
<Country>USA</Country>
</Address>
</Contacts>
</CRMList>
  • XML Document: The content between the start and end of a tag, e.g. <xml></xml>, is called an XML document.
  • XML Elements: All the tags within an XML document are called XML Elements that contains text attributes and other elements.
  • Root Element: Each XML document begins with a primary tag, which is called the root element. In above example <CRMList></CRMList> is the root element.
  • XML Attribute: Each element can have a descriptor which is called an attribute. In following element <Address Type="Office"/>, Type is the attribute.
  • XML Content: The data/ values in the XML document is called XML content. In above example Acme Inc, US, 10999 is the content.

Understanding XPath

XPath is a query language for identifying elements, attributes and extracting data from an XML document. XPath expressions are used to traverse the XML document.
Typically, the XPaths are auto-identified by the connector by parsing first 100 records, users may need to manually specify XPaths in certain cases where the connector's identification falls short.

Following are the different XPath expressions and their output for above XML example.

XPath: /CRMList Output: Returns the complete XML document

XPath: /CRMList/Contacts Output: Returns all the Contacts node/ document.

<Contacts>
<ID>11012</ID>
<Name>Lauren Walker</Name>
<BirthDate>1972-01-18</BirthDate>
<Address Type="Home">
<Zip>10999</Zip>
<Country>USA</Country>
</Address>
<Address Type="Office">
<Zip>95819</Zip>
<Country>USA</Country>
</Address>
</Contacts>

XPath: /CRMList/Contacts/Address[1] Output: Returns address node at the 1st index.

<Address Type="Home">
<Zip>10999</Zip>
<Country>USA</Country>
</Address>

Parsing XML Data

For this example follow test data customer-order.xml has been used.

By default, the connector partially flattens only the root element, retaining nested elements as XML strings. However, it can also be configured to comprehensively flattens XML documents, transforming all elements and nodes into a relational structure.

Majority of use cases will require flattening of the entire document. In the flattening approach the connector creates distinct tables for object arrays, implicitly joined to their parent table. Notably, nested sibling XPaths are handled as SQL CROSS JOINs, ensuring accurate data representation within the relational structure.

When DataModel=FLATTENEDDOCUMENTS and we run the following SQL against the customer-order.xml file, the connector will:

  1. Flatten and implicitly join the customers object array and its nested address object array.
  2. Flatten and implicitly join the orders object array and its nested products object array.
  3. Cross join the results of steps 1 and 2 to generate the final table.

Example

SELECT id, name, zip, coordinates, customer_id, order_id, product_name, [metrics.price] 
FROM customer_order
WHERE zip IN (12345, 94105) OR order_id in (1, 3)

Output

idnamezipcoordinatescustomer_idorder_idproduct_namemetrics.price
1Ian Walker12345<coordinates>-98.808</coordinates><coordinates>48.11</coordinates>nullnullnullnull
2Jenny Blob94105nullnullnullnullnull
nullnullnullnull11Shirt50
nullnullnullnull11Shoes50
nullnullnullnull23Dress150
nullnullnullnull23Bag50

Push Attributes

This property controls whether attributes within XML documents are transformed into individual columns.

Let's consider following document.

<address flag="active">
<state>CA</state>
<zip>90001</zip>
<coordinates>-90</coordinates>
<coordinates>60</coordinates>
<start_date>2023-07-01</start_date>
<end_date/>
</address>

When this property is set to true, the flag attribute will convert into a call and the output table will have the following structure:

ColumnType
address:_idVARCHAR
coordinatesVARCHAR
end_dateDATE
flagVARCHAR
start_dateDATE
stateVARCHAR
zipINT

and the data will look like this.

address:_idcoordinatesend_dateflagstart_datestatezip
1<coordinates>-98.808</coordinates><coordinates>48.11</coordinates>2023-06-30inactive2023-01-01CA12345

Flatten Arrays

It is another key property that can be used to flatten array values into columns of their own. However, It is best to leave other unbounded arrays as they are and piece out the data for them as needed using XML Functions.

Let's consider the same document

<address flag="active">
<state>CA</state>
<zip>90001</zip>
<coordinates>-90</coordinates>
<coordinates>60</coordinates>
<start_date>2023-07-01</start_date>
<end_date/>
</address>

When this property value is set to 2, the coordinates array will convert into two columns and the output table will have the following structure:

ColumnType
address:_idVARCHAR
coordinatesVARCHAR
coordinates.0VARCHAR
coordinates.1VARCHAR
end_dateDATE
start_dateDATE
stateVARCHAR
zipINT

and the data will look like this.

address:_idcoordinates.0coordinates.1start_datestatezip
1-98.80848.112023-01-01CA12345

RowScanDepth

It determines how many object arrays the connector scans to find rows and columns. It follows nested objects, counting each object array as a row. The default value is 100.

Other Properties

  • FlattenRowLimit: The maximum number of rows that can result from a single flattened element. Default is 250000.

Understanding Settings (Source/ Target)

XML file connector has specific configurations. Please read the following properties carefully to understand the behaviour of the connector.

Single File This property specifies whether the driver should read data from a single XML file or a folder of XML files. If this property is enabled, the driver will read data from a single XML file with the .xml extension. If this property is disabled, the driver will read data from all the .xml files in the folder and union them together. The structure of all the XML files in the folder must be exactly the same.

Parameterised Path This property allows users to parameterize the file path and name using the parameter key. For this to work, the rule must have a default parameter file associated with it.

XPath Users can explicitly provide an XPath for the connector to identify the correct structure (elements and attributes) in the XML document. By default, the connector will identify the XPath by reading the first 200 records.


Additional Properties

The connector has some default properties that users can modify or add to, which changes how the connector processes XML data.

PropertyValueNote
Table NameIt is auto generated which the user can modify.
DataModelFlattenedDocuments, Documents (Default)
PushAttributesFalse (Default), TrueView description
FlattenArraysAny numeric value. 0 is default.View description
FlattenObjectTrue (Default), False
QualifyColumnsNone (Default), Parent, FullQualify column names based on nested level.
FlattenRowLimitAny numeric value. 250000 is default.Limits the numbers of output rows.
RowScanDepthAny numeric value. 100 is default.Rows (objects) to scan when dynamically determining the schema.

Schema

Schema defines the structure of an XML file by identifying its nodes, attributes, and data types. Schemas are stored as separate objects in a repository, each with a unique schema ID.

The Get Schema and Save Schema actions suggest a user interface or application managing schemas and XML files. Depending on the context, "Saving" might involve persisting the schema to a dedicated repository or embedding it within the XML file.

Schema and XML file together allow SQL to generate accurate output.


How To: Read XML File

This video shows you how to read data from customer-order.xml file.


XML Functions

The driver can return XML as column values. The driver enables you to use SQL functions to work with these column values. The following sections provide examples; for a reference, see STRING Functions. The examples in this section use the following array (see Parsing Hierarchical Data for more information on parsing XML objects and arrays):

<root>
<customer>
<id>1</id>
<name>Ian Walker</name>
</customer>
<customer>
<id>2</id>
<name>Abby Walker</name>
</customer>
<customer>
<id>3</id>
<name>Jenny Walker</name>
</customer>
</root>

TBD