Parquet
Parquet is a columnar storage file format designed for big data workloads. The Parquet connector enables efficient reading and writing of Parquet files, improving performance and storage efficiency in analytical environments.
Prerequisites
- Verify that the storage location or folder is accessible from the cluster.
- Gather valid user credentials and ensure the user has appropriate permissions to read the files.
- Ensure the folder contains at least one Parquet file.
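
For network endpoints such as FTP or SFTP servers, the first prerequisite can be spot-checked with a plain TCP probe run from a cluster node. The sketch below is only an illustration: the function name `is_reachable` and the 5-second default timeout are assumptions, not part of the connector, and a successful probe says nothing about credentials or file permissions.

```python
import socket


def is_reachable(host: str, port: int, timeout_seconds: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    Hypothetical helper: it only proves that the port accepts connections
    from the machine running this code.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `is_reachable("192.168.44.31", 21)` would probe the FTP host used in the example values on this page.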
Connecting to Amazon S3
AWS EC2 Role Authentication
Use the following properties to create a valid connection. Properties marked with an asterisk (*) are required.
Name | Description | Example Values |
---|---|---|
Connection Name * | Name that uniquely identifies the connection. | Prod_file_conn |
Driver * | Driver that is used to establish the connection. By default, one driver is available. | Parquet Custom JDBC |
Container * | The name of the S3 bucket that contains the Parquet files. | acme-prod-bucket |
Folder Path * | The directory path within the bucket that points to the folder containing Parquet files. | /data/parquet-files/ |
Region * | The AWS region where the S3 bucket is located. | us-east-1 |
Type * | For EC2 role authentication, the connection type must be configured as a System Connection. Refer to Connections for more details. | System |
AWS IAM Authentication
Use the following properties to create a valid connection. Properties marked with an asterisk (*) are required.
Name | Description | Example Values |
---|---|---|
Connection Name * | A unique name that identifies the connection. | Prod_file_conn |
Driver * | The driver used to establish the connection. By default, one driver is available. | Parquet Custom JDBC |
Container * | The name of the S3 bucket that contains the Parquet files. | prod-bucket |
Folder Path * | The directory path within the bucket that points to the folder containing Parquet files. | /data/parquet-files/ |
Region * | The AWS region where the S3 bucket is located. | us-east-1 |
Type * | The connection type, either System Connection or User Connection. Refer to Connections for more details. | System |
Access Key * | The AWS access key used to authenticate the connection to the S3 bucket. | zsdrhg456dfhsz8jhm,khujm56y3 |
Secret Key * | The AWS secret key used to authenticate the connection to the S3 bucket. | FXdzEJr////d81nSTEXAMPLETOKEN== |
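
Because every property in the table above is required, it can help to check a set of values for gaps before entering them. The following is a minimal sketch, assuming the properties are held in a plain Python dictionary keyed by the names in the table. The dictionary layout and the `missing_properties` helper are hypothetical and are not part of the connector.

```python
from typing import Dict, List, Sequence

# Required properties for AWS IAM authentication, as listed in the table above.
REQUIRED_S3_IAM_PROPERTIES = (
    "Connection Name",
    "Driver",
    "Container",
    "Folder Path",
    "Region",
    "Type",
    "Access Key",
    "Secret Key",
)


def missing_properties(
    properties: Dict[str, str],
    required: Sequence[str] = REQUIRED_S3_IAM_PROPERTIES,
) -> List[str]:
    """Return the required property names that are absent or blank."""
    return [
        name
        for name in required
        if not str(properties.get(name, "")).strip()
    ]
```

Passing a different `required` sequence lets the same check cover the other connection types on this page.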
Connecting to Azure Blob Storage
Azure Access Key Authentication
Use the following properties to create a valid connection. Properties marked with an asterisk (*) are required.
Name | Description | Example Values |
---|---|---|
Connection Name * | A unique name that identifies the connection. | Prod_file_conn |
Driver * | The driver used to establish the connection. By default, one driver is available. | Parquet Custom JDBC |
Container * | The name of the storage container where the Parquet files are stored. | sales-data-container |
Folder Path * | The directory path within the container that points to the folder containing Parquet files. | /data/parquet-files/ |
Type * | The connection type, either System Connection or User Connection. Refer to Connections for more details. | System |
Account Name * | The storage account name used for authentication. | salesstoreadmin |
Access Key * | The key used to authenticate the connection to the storage account. | werghaqehglisfgeshzdfhuotyrg |
Connecting to Azure Data Lake
Azure Access Key Authentication
Use the following properties to create a valid connection. Properties marked with an asterisk (*) are required.
Name | Description | Example Values |
---|---|---|
Connection Name * | A unique name that identifies the connection. | Prod_file_conn |
Driver * | The driver used to establish the connection. By default, one driver is available. | Parquet Custom JDBC |
Container * | The name of the storage container where the Parquet files are stored. | sales-data-container |
Folder Path * | The directory path within the container that points to the folder containing Parquet files. | /data/parquet-files/ |
Type * | The connection type, either System Connection or User Connection. Refer to Connections for more details. | System |
Account Name * | The storage account name used for authentication. | salesstorageadmin |
Access Key * | The key used to authenticate the connection to the storage account. | asgtd482tyujkmkiuyg456798xfbdzst |
Connecting to FTP
Anonymous Authentication
Use the following properties to create a valid connection. Properties marked with an asterisk (*) are required.
Name | Description | Example Values |
---|---|---|
Connection Name * | A unique name that identifies the connection. | Prod_file_conn |
Driver * | The driver used to establish the connection. By default, one driver is available. | Parquet Custom JDBC |
Host * | The IP address or hostname of the FTP server. | 192.168.44.31 |
Port * | The port on which the FTP server listens. | 21 |
Folder Path * | The directory path on the server that points to the folder containing Parquet files. | /data/parquet-files/ |
Type * | For anonymous authentication, the connection type must be configured as a System Connection. Refer to Connections for more details. | System |
Username and Password Authentication
Use the following properties to create a valid connection. Properties marked with an asterisk (*) are required.
Name | Description | Example Values |
---|---|---|
Connection Name * | A unique name that identifies the connection. | Prod_file_conn |
Driver * | The driver used to establish the connection. By default, one driver is available. | Parquet Custom JDBC |
Host * | The IP address or hostname of the FTP server. | 192.168.44.31 |
Port * | The port on which the FTP server listens. | 21 |
Folder Path * | The directory path on the server that points to the folder containing Parquet files. | /data/parquet-files/ |
Type * | The connection type, either System Connection or User Connection. Refer to Connections for more details. | System |
Username * | The FTP username with necessary privileges. | app_service_account |
Password * | The password associated with the specified username. | App$erv1ceP@ss2025 |
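
Before creating an FTP connection, the host, port, folder path, and credentials can be tried out independently of the connector. The sketch below uses Python's standard `ftplib` and is only an illustration: the function names are hypothetical, and filtering on a `.parquet` file extension is an assumption, since this page does not state how files are named.

```python
from ftplib import FTP
from typing import Iterable, List, Optional


def only_parquet(names: Iterable[str]) -> List[str]:
    """Keep entries ending in .parquet (case-insensitive); the extension is an assumption."""
    return [name for name in names if name.lower().endswith(".parquet")]


def list_parquet_files(
    host: str,
    port: int,
    folder_path: str,
    username: Optional[str] = None,
    password: Optional[str] = None,
    timeout_seconds: float = 60,
) -> List[str]:
    """List Parquet-looking entries in folder_path on an FTP server."""
    ftp = FTP()
    ftp.connect(host, port, timeout=timeout_seconds)
    try:
        if username is None:
            ftp.login()  # ftplib logs in as "anonymous" when no user is given
        else:
            ftp.login(username, password or "")
        return only_parquet(ftp.nlst(folder_path))
    finally:
        ftp.close()
```

Calling `list_parquet_files("192.168.44.31", 21, "/data/parquet-files/")` would exercise the anonymous case, and supplying `username` and `password` would exercise the second table.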
Connecting to Google Cloud Storage
Service Account Authentication
Use the following properties to create a valid connection. Properties marked with an asterisk (*) are required.
Name | Description | Example Values |
---|---|---|
Connection Name * | A unique name that identifies the connection. | Prod_file_conn |
Driver * | The driver used to establish the connection. By default, one driver is available. | Parquet Custom JDBC |
Project ID * | The unique identifier of the Google Cloud project where the storage bucket resides. | my-gcp-project-123456 |
Container * | The name of the Google Cloud Storage bucket where the Parquet files are stored. | parquet-data-container |
Folder Path * | The directory path within the bucket that points to the folder containing Parquet files. | data/parquet-files/ |
Type * | For service account authentication, the connection type must be configured as a System Connection. Refer to Connections for more details. | System |
Connecting to Local Storage
Users can read files from local storage, which can be a server directory for uploaded files or a network-mounted directory accessible to the cluster. Use the following properties to create a valid connection. Properties marked with an asterisk (*) are required.
Name | Description | Example Values |
---|---|---|
Connection Name * | A unique name that identifies the connection. | Prod_file_conn |
Driver * | The driver used to establish the connection. By default, one driver is available. | Parquet Custom JDBC |
Folder Path * | The directory path that points to the folder containing Parquet files. | /data/parquet-files/ |
The connector does not support authentication for accessing files from local storage.
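
For local storage, the folder can be checked for Parquet files before the connection is created. The sketch below relies on the Parquet file layout, in which a file begins and ends with the 4-byte magic `PAR1`. It is a lightweight heuristic, not a full validation, and the function names are hypothetical.

```python
from pathlib import Path
from typing import List

PARQUET_MAGIC = b"PAR1"
# Smallest possible layout: leading magic + 4-byte footer length + trailing magic.
MIN_PARQUET_SIZE = 12


def looks_like_parquet(path: Path) -> bool:
    """Return True if the file starts and ends with the Parquet magic bytes."""
    if path.stat().st_size < MIN_PARQUET_SIZE:
        return False
    with path.open("rb") as handle:
        head = handle.read(4)
        handle.seek(-4, 2)  # 2 = seek relative to the end of the file
        tail = handle.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def find_parquet_files(folder_path: str) -> List[Path]:
    """Return the files directly inside folder_path that look like Parquet files."""
    folder = Path(folder_path)
    return sorted(
        entry
        for entry in folder.iterdir()
        if entry.is_file() and looks_like_parquet(entry)
    )
```

An empty result from `find_parquet_files("/data/parquet-files/")` would suggest that the prerequisite of having at least one file in the folder is not yet met.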
Connecting to SFTP
Username and Password Authentication
Use the following properties to create a valid connection. Properties marked with an asterisk (*) are required.
Name | Description | Example Values |
---|---|---|
Connection Name * | A unique name that identifies the connection. | Prod_file_conn |
Driver * | The driver used to establish the connection. By default, one driver is available. | Parquet Custom JDBC |
Host * | The IP address or hostname of the SFTP server. | 192.168.44.31 |
Port * | The port on which the SFTP server listens. | 22 |
Folder Path * | The directory path on the server that points to the folder containing Parquet files. | /data/parquet-files/ |
Type * | The connection type, either System Connection or User Connection. Refer to Connections for more details. | System |
Username * | The SFTP username with necessary privileges. | dev_ops_user |
Password * | The password associated with the specified username. | D0v0ps#Secure2025! |
Custom Properties
The following optional connection properties can be configured as needed:
Property | Default Value | Possible Values | Description |
---|---|---|---|
BatchSize | 0 | Numeric value | Specifies the maximum number of rows included in each batch operation. Set to 0 to submit the entire batch as a single request. |
MaxRow | -1 | Numeric value | Limits the number of rows returned when no aggregation or GROUP BY is used in the query. |
RowScanDepth | 100 | Numeric value | The number of rows to scan when dynamically determining columns for the table. |
Pagesize | 1000 | Numeric value | The number of rows to return per page from the file. |
Timeout | 60 | Numeric value | The time in seconds before a timeout error is thrown and the operation is canceled. Set to 0 to disable the timeout. |
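
The defaults in the table can be treated as a baseline that individual overrides are merged into. The following sketch shows that idea in Python. The dictionary representation and the `resolve_custom_properties` helper are assumptions made for illustration, since this page does not describe how custom properties are entered in the product.

```python
from typing import Dict

# Default values copied from the Custom Properties table above.
DEFAULT_CUSTOM_PROPERTIES = {
    "BatchSize": 0,
    "MaxRow": -1,
    "RowScanDepth": 100,
    "Pagesize": 1000,
    "Timeout": 60,
}


def resolve_custom_properties(overrides: Dict[str, int]) -> Dict[str, int]:
    """Merge overrides into the documented defaults, rejecting bad input."""
    unknown = sorted(set(overrides) - set(DEFAULT_CUSTOM_PROPERTIES))
    if unknown:
        raise KeyError(f"Unknown custom properties: {unknown}")
    resolved = dict(DEFAULT_CUSTOM_PROPERTIES)
    for name, value in overrides.items():
        # bool is a subclass of int, so it is excluded explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{name} must be a whole number, got {value!r}")
        resolved[name] = value
    return resolved
```

For instance, `resolve_custom_properties({"Timeout": 0})` keeps every default except `Timeout`, which is set to 0.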
Supported Datatypes
The following data types are supported:
- NUMBER
- INT32
- INT64
- DECIMAL
- FLOAT
- DOUBLE
- VARCHAR (BINARY UTF8)
- BOOLEAN
- DATE
- TIME
- TIMESTAMP
- MAP
- STRUCT
- GROUP (LIST)
Unsupported Datatypes
The following data types are not supported:
- TIMESTAMP_NANOS
- TIME_NANOS
- TIMESTAMP_TZ
- DATETIME WITH TIMEZONES
- LISTS (outside group-annotated arrays)
- DICTS
- SETS
- BINARY or IMAGE data
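
The two lists above can be turned into a simple pre-check for a file's columns. The sketch below assumes the column types are already available as a mapping from column name to a type label spelled as in the lists. How those labels are obtained from a Parquet file, and how a given reader names its types, is outside this sketch, and the helper is hypothetical. Entries with qualifiers, such as lists outside group-annotated arrays, are left out because they cannot be identified by label alone.

```python
from typing import Dict, List

# Labels taken from the "Supported Datatypes" list (qualifiers dropped).
SUPPORTED_TYPES = frozenset({
    "NUMBER", "INT32", "INT64", "DECIMAL", "FLOAT", "DOUBLE", "VARCHAR",
    "BOOLEAN", "DATE", "TIME", "TIMESTAMP", "MAP", "STRUCT", "GROUP",
})

# Single-label entries from the "Unsupported Datatypes" list.
UNSUPPORTED_TYPES = frozenset({
    "TIMESTAMP_NANOS", "TIME_NANOS", "TIMESTAMP_TZ", "DICTS", "SETS",
})


def classify_columns(column_types: Dict[str, str]) -> Dict[str, List[str]]:
    """Group column names by whether their type label is listed as supported."""
    report = {"supported": [], "unsupported": [], "unknown": []}
    for column, type_name in column_types.items():
        label = type_name.strip().upper()
        if label in SUPPORTED_TYPES:
            report["supported"].append(column)
        elif label in UNSUPPORTED_TYPES:
            report["unsupported"].append(column)
        else:
            # Not mentioned in either list; needs a manual check.
            report["unknown"].append(column)
    return report
```

Columns reported under `unknown` are not covered by either list and would need to be checked manually.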