Supported Connections
A list of supported data source connection types.
The following is a list of drivers certified for production use.
Connection | Certified | Tested | Packaged | Optionally packaged | Pushdown | Estimate job | Filtergram | Analyze data | Schedule | Spark agent | Yarn agent | Parallel JDBC | Session state | Kerberos password | Kerberos password manager | Kerberos keytab | Kerberos TGT | Standalone (non-Livy) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Athena | ||||||||||||||||||
BigQuery | ||||||||||||||||||
Databricks JDBC | ||||||||||||||||||
DB2 | ||||||||||||||||||
Dremio | ||||||||||||||||||
Hive | ||||||||||||||||||
Impala | ||||||||||||||||||
MSSQL | ||||||||||||||||||
MYSQL | ||||||||||||||||||
Oracle | ||||||||||||||||||
Postgres | ||||||||||||||||||
Presto | ||||||||||||||||||
Redshift | ||||||||||||||||||
Snowflake | ||||||||||||||||||
Sybase | ||||||||||||||||||
Teradata | ||||||||||||||||||
Connection | Certified | Tested | Packaged | Optionally packaged | Pushdown | Estimate job | Filtergram | Analyze data | Spark agent | Yarn agent |
---|---|---|---|---|---|---|---|---|---|---|
Azure Data Lake (Gen2) | ||||||||||
Google Cloud Storage | ||||||||||
HDFS | ||||||||||
S3 | ||||||||||
The following is a list of drivers that are under evaluation and not yet certified for production use. These connections are currently ineligible for escalated support services.
Connection | Certified | Tested | Packaged | Optional packaging | Pushdown | Estimate job | Filtergram | Analyze data | Schedule | Spark agent | Yarn agent | Parallel JDBC | Session state | Kerberos password | Kerberos password manager | Kerberos keytab | Kerberos TGT | Standalone (non-Livy) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cassandra | ||||||||||||||||||
MongoDB | ||||||||||||||||||
SAP Hana | ||||||||||||||||||
Solr | ||||||||||||||||||
Connection | Certified | Tested | Packaged | Optional packaging | Pushdown | Estimate job | Filtergram | Analyze data | Schedule | Spark agent | Yarn agent | Parallel JDBC | Session state | Kerberos password | Kerberos password manager | Kerberos TGT | CRDB metastore | Standalone (non-Livy) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Kafka | ||||||||||||||||||
File type | Supported |
---|---|
CSV (and all delimiters) | |
Parquet | |
AVRO | |
JSON | |
DELTA |
Authentication
- DQ jobs that require a Kerberos TGT are not yet supported on Spark Standalone or Local deployments
- We recommend submitting jobs via Yarn or K8s
File Sizes
- Files with more than 250 columns are not supported in File Explorer, unless you have Livy enabled.
- Files larger than 5 GB are not supported in File Explorer, unless you have Livy enabled.
- Smaller files allow for skip scanning and more efficient processing.
- Advanced features such as replay, scheduling, and historical lookbacks require a date signature in the folder or file path, as shown in the sketch below.
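For example, a date signature in the folder path lets a run date resolve to a specific set of files. A minimal sketch of reading one day of a date-partitioned layout with Spark; the bucket, dataset, and dates are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical date-partitioned layout; the run date appears in the folder path:
//   s3a://my-bucket/trades/2022-05-01/trades.csv
//   s3a://my-bucket/trades/2022-05-02/trades.csv
val spark = SparkSession.builder().appName("date-signature-example").getOrCreate()

val runDate = "2022-05-02"
// Point the scan at the folder for a single run date so replay/lookbacks can iterate over dates
val df = spark.read.option("header", "true").csv(s"s3a://my-bucket/trades/$runDate/")
df.show(5)
```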
S3
- Ensure there are no spaces in the S3 connection name
- Remember to select the 'Save Credentials' checkbox when establishing the connection
- Point the connection to the root bucket, not to subfolders
Local Files
- Local files can only be run using the NO_AGENT default
- This is intended for quick testing, smaller files, and demonstration purposes.
- Local file scanning is not intended for large-scale production use.
Livy
- MapR is EOL, and the MapR Spark engine is not supported for running CDQ jobs.
Databricks
The only supported Databricks spark-submit option is to use a notebook to initiate the job (Scala and PySpark options). This is intended for pipeline developers and users familiar with Databricks and notebooks. This form factor is ideal for incorporating data quality into existing Spark ETL data flows. The results are still available for business users to consume, but the configuration is not intended for business users to implement. There are three ways Databricks users can run CDQ jobs, using a Databricks cluster or a JDBC connection.
1. Notebook
Users can open a notebook directly, upload the CDQ jars, and run a CDQ job on a Databricks cluster. The full steps are explained on the page below. CDQ supports this flow in production.
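A minimal sketch of the notebook flow, assuming the CDQ core jars are already attached to the cluster and `spark` is the notebook's SparkSession; the OwlOptions/OwlUtils class and method names follow the CDQ notebook examples and may vary by version, and the dataset, host, and path values are placeholders:

```scala
import com.owl.common.options.OwlOptions
import com.owl.core.util.OwlUtils

// Placeholder job and metastore settings
val opt = new OwlOptions()
opt.dataset = "databricks_notebook_demo"      // DQ dataset name
opt.runId   = "2022-05-01"                    // run date for this scan
opt.host    = "<dq-metastore-host>"           // DQ metastore host
opt.port    = "5432/dev?currentSchema=public" // DQ metastore port/database

// Any DataFrame built in the notebook can be scanned
val df = spark.read.parquet("dbfs:/FileStore/demo/2022-05-01/")

// Register the dataset and run the DQ check against the DataFrame
val owl = OwlUtils.OwlContext(df, opt)
owl.register(opt)
owl.owlCheck()
```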
2. Spark-Submit
There are two ways to run a spark-submit job on a Databricks cluster: the first is to run a CDQ spark-submit job through the Databricks UI, and the second is to invoke the Databricks REST API.
We have tested both approaches against different Databricks cluster versions (see the table below). The full documentation demonstrating these paths is available at:
https://dq-docs.collibra.com/apis-1/notebook/cdq-+-databricks/dq-databricks-submit
Please note that these are only examples that demonstrate how to submit a DQ spark-submit job to a Databricks cluster. These paths are not supported in production, and the DQ team does not provide bug fixes, professional services, or support for customer questions about these flows.
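For orientation only, a rough sketch of the REST API path using the Databricks Jobs runs-submit endpoint; the workspace URL, token, jar location, main class, and job parameters are all placeholders, and the exact values for the CDQ jar and main class are on the page linked above:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Placeholders throughout: workspace URL, token, jar location, main class, and parameters
val workspace = "https://<your-workspace>.cloud.databricks.com"
val token     = sys.env("DATABRICKS_TOKEN")

val payload =
  """{
    |  "run_name": "cdq-spark-submit-example",
    |  "new_cluster": {
    |    "spark_version": "7.3.x-scala2.12",
    |    "node_type_id": "i3.xlarge",
    |    "num_workers": 2
    |  },
    |  "libraries": [ { "jar": "dbfs:/FileStore/jars/<cdq-core>.jar" } ],
    |  "spark_jar_task": {
    |    "main_class_name": "<cdq-main-class>",
    |    "parameters": ["-ds", "example_dataset", "-rd", "2022-05-01"]
    |  }
    |}""".stripMargin

// Submit a one-time run via the Databricks Jobs runs-submit endpoint
val request = HttpRequest.newBuilder()
  .uri(URI.create(s"$workspace/api/2.0/jobs/runs/submit"))
  .header("Authorization", s"Bearer $token")
  .header("Content-Type", "application/json")
  .POST(HttpRequest.BodyPublishers.ofString(payload))
  .build()

val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
println(response.body()) // returns a run_id on success
```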
3. JDBC
CDQ users can create JDBC connections in the CDQ UI and connect to their Databricks database. This is scheduled for the 2022.05 release.
Delta Lake and JDBC connectivity have been validated against the Spark 3.0.1 CDQ package, Databricks 7.3 LTS, and SparkJDBC41.jar. This is available as a Preview. No other combinations have been certified at this time.
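As an illustration of that combination, a Spark JDBC read against Databricks with the SparkJDBC41.jar driver might look like the following; the workspace host, HTTP path, and token are placeholders, and the driver class name assumes the Simba 4.1 driver and may differ by driver version:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("databricks-jdbc-example").getOrCreate()

// Placeholder workspace host, cluster HTTP path, and personal access token
val jdbcUrl = "jdbc:spark://<workspace-host>:443/default;transportMode=http;ssl=1;" +
  "httpPath=sql/protocolv1/o/<org-id>/<cluster-id>;AuthMech=3"

val df = spark.read
  .format("jdbc")
  .option("driver", "com.simba.spark.jdbc41.Driver") // driver class packaged in SparkJDBC41.jar
  .option("url", jdbcUrl)
  .option("dbtable", "default.example_table")
  .option("user", "token")                           // literal "token" user with a PAT password
  .option("password", sys.env("DATABRICKS_TOKEN"))
  .load()

df.show(10)
```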
Spark submit using the Databricks Spark master URL is not supported.

CDQ Production support for Databricks.