Working with XML

Extensible Markup Language (XML) lets users stored data in a semi structured format for information exchange. Users can validate or compare XML data with any other data source. Users can access the XML Files using SQL through the configured XML Connector.

Understanding XML Structure

An XML file is a text-based document that lets you save the data with .xml as the file extension.

<CRMList>
	<Organization>Acme Inc</Organization>
	<OrganizationID>100</OrganizationID>
	<isActive>true</isActive>
	<Location>US</Location>
	<Location>IND</Location>
	<Contacts>
		<ID>11012</ID>
		<Name>Lauren Walker</Name>
		<BirthDate>1972-01-18</BirthDate>
		<Address Type="Home">
			<Zip>10999</Zip>
			<Country>USA</Country>
		</Address>
		<Address Type="Office">
			<Zip>95819</Zip>
			<Country>USA</Country>
		</Address>
	</Contacts>
</CRMList>

XML Document: The content between the start and end of a tag, e.g. <xml></xml>, is called an XML document.
XML Elements: All the tags within an XML document are called XML Elements that contains text attributes and other elements.
Root Element: Each XML document begins with a primary tag, which is called the root element. In above example <CRMList></CRMList> is the root element.
XML Attribute: Each element can have a descriptor which is called an attribute. In following element <Address Type="Office"/>, Type is the attribute.
XML Content: The data/ values in the XML document is called XML content. In above example Acme Inc, US, 10999 is the content.

Understanding XPath

XPath is a query language for identifying elements, attributes and extracting data from an XML document. XPath expressions are used to traverse the XML document.
Typically, the XPaths are auto-identified by the connector by parsing first 100 records, users may need to manually specify XPaths in certain cases where the connector's identification falls short.

Following are the different XPath expressions and their output for above XML example.

XPath: /CRMList Output: Returns the complete XML document

XPath: /CRMList/Contacts Output: Returns all the Contacts node/ document.

<Contacts>
		<ID>11012</ID>
		<Name>Lauren Walker</Name>
		<BirthDate>1972-01-18</BirthDate>
		<Address Type="Home">
			   <Zip>10999</Zip>
			   <Country>USA</Country>
		</Address>
		<Address Type="Office">
			   <Zip>95819</Zip>
			   <Country>USA</Country>
		</Address>
	</Contacts>

XPath: /CRMList/Contacts/Address[1] Output: Returns address node at the 1st index.

<Address Type="Home">
			<Zip>10999</Zip>
			<Country>USA</Country>
		</Address>

Parsing XML Data

For this example follow test data customer-order.xml has been used.

By default, the connector partially flattens only the root element, retaining nested elements as XML strings. However, it can also be configured to comprehensively flattens XML documents, transforming all elements and nodes into a relational structure.

Majority of use cases will require flattening of the entire document. In the flattening approach the connector creates distinct tables for object arrays, implicitly joined to their parent table. Notably, nested sibling XPaths are handled as SQL CROSS JOINs, ensuring accurate data representation within the relational structure.

When DataModel=FLATTENEDDOCUMENTS and we run the following SQL against the customer-order.xml file, the connector will:

Flatten and implicitly join the customers object array and its nested address object array.
Flatten and implicitly join the orders object array and its nested products object array.
Cross join the results of steps 1 and 2 to generate the final table.

Example

SELECT id, name, zip, coordinates, customer_id, order_id, product_name, [metrics.price] 
    FROM customer_order 
        WHERE zip IN (12345, 94105) OR order_id in (1, 3)

Output

id	name	zip	coordinates	customer_id	order_id	product_name	metrics.price
1	Ian Walker	12345	`<coordinates>-98.808</coordinates><coordinates>48.11</coordinates>`	null	null	null	null
2	Jenny Blob	94105	null	null	null	null	null
null	null	null	null	1	1	Shirt	50
null	null	null	null	1	1	Shoes	50
null	null	null	null	2	3	Dress	150
null	null	null	null	2	3	Bag	50

Push Attributes

This property controls whether attributes within XML documents are transformed into individual columns.

Let's consider following document.

<address flag="active">
	<state>CA</state>
	<zip>90001</zip>
	<coordinates>-90</coordinates>
	<coordinates>60</coordinates>
	<start_date>2023-07-01</start_date>
	<end_date/>
</address>

When this property is set to true, the flag attribute will convert into a call and the output table will have the following structure:

Column	Type
address:_id	VARCHAR
coordinates	VARCHAR
end_date	DATE
flag	VARCHAR
start_date	DATE
state	VARCHAR
zip	INT

and the data will look like this.

address:_id	coordinates	end_date	flag	start_date	state	zip
1	`<coordinates>-98.808</coordinates><coordinates>48.11</coordinates>`	2023-06-30	inactive	2023-01-01	CA	12345

Flatten Arrays

It is another key property that can be used to flatten array values into columns of their own. However, It is best to leave other unbounded arrays as they are and piece out the data for them as needed using XML Functions.

Let's consider the same document

<address flag="active">
	<state>CA</state>
	<zip>90001</zip>
	<coordinates>-90</coordinates>
	<coordinates>60</coordinates>
	<start_date>2023-07-01</start_date>
	<end_date/>
</address>

When this property value is set to 2, the coordinates array will convert into two columns and the output table will have the following structure:

Column	Type
address:_id	VARCHAR
coordinates	VARCHAR
coordinates.0	VARCHAR
coordinates.1	VARCHAR
end_date	DATE
start_date	DATE
state	VARCHAR
zip	INT

and the data will look like this.

address:_id	coordinates.0	coordinates.1	start_date	state	zip
1	-98.808	48.11	2023-01-01	CA	12345

RowScanDepth

It determines how many object arrays the connector scans to find rows and columns. It follows nested objects, counting each object array as a row. The default value is 100.

Other Properties

FlattenRowLimit: The maximum number of rows that can result from a single flattened element. Default is 250000.

Understanding Settings (Source/ Target)

XML file connector has specific configurations. Please read the following properties carefully to understand the behaviour of the connector.

Single File This property specifies whether the driver should read data from a single XML file or a folder of XML files. If this property is enabled, the driver will read data from a single XML file with the .xml extension. If this property is disabled, the driver will read data from all the .xml files in the folder and union them together. The structure of all the XML files in the folder must be exactly the same.

Parameterised Path This property allows users to parameterize the file path and name using the parameter key. For this to work, the rule must have a default parameter file associated with it.

XPath Users can explicitly provide an XPath for the connector to identify the correct structure (elements and attributes) in the XML document. By default, the connector will identify the XPath by reading the first 200 records.

Additional Properties

The connector has some default properties that users can modify or add to, which changes how the connector processes XML data.

Property	Value	Note
Table Name		It is auto generated which the user can modify.
DataModel	FlattenedDocuments, Documents (Default)
PushAttributes	False (Default), True	View description
FlattenArrays	Any numeric value. 0 is default.	View description
FlattenObject	True (Default), False
QualifyColumns	None (Default), Parent, Full	Qualify column names based on nested level.
FlattenRowLimit	Any numeric value. 250000 is default.	Limits the numbers of output rows.
RowScanDepth	Any numeric value. 100 is default.	Rows (objects) to scan when dynamically determining the schema.

Schema

Schema defines the structure of an XML file by identifying its nodes, attributes, and data types. Schemas are stored as separate objects in a repository, each with a unique schema ID.

The Get Schema and Save Schema actions suggest a user interface or application managing schemas and XML files. Depending on the context, "Saving" might involve persisting the schema to a dedicated repository or embedding it within the XML file.

Schema and XML file together allow SQL to generate accurate output.

How To: Read XML File

This video shows you how to read data from customer-order.xml file.

XML Functions

The driver can return XML as column values. The driver enables you to use SQL functions to work with these column values. The following sections provide examples; for a reference, see STRING Functions. The examples in this section use the following array (see Parsing Hierarchical Data for more information on parsing XML objects and arrays):

<root>
	<customer>
		<id>1</id>
		<name>Ian Walker</name>
	</customer>
	<customer>
		<id>2</id>
		<name>Abby Walker</name>
	</customer>
	<customer>
		<id>3</id>
		<name>Jenny Walker</name>
	</customer>
</root>

TBD

Understanding XML Structure​

Understanding XPath​

Parsing XML Data​

Example​

Output​

Push Attributes​

Flatten Arrays​

RowScanDepth​

Other Properties​

Understanding Settings (Source/ Target)​

Additional Properties​

Schema​

How To: Read XML File​

XML Functions​