Collibra DQ User Guide
2022.10

Supported Connections

A list of supported data source connection types.
We've moved! To improve customer experience, the Collibra Data Quality User Guide has moved to the Collibra Documentation Center as part of the Collibra Data Quality 2022.11 release. To ensure a seamless transition, dq-docs.collibra.com will remain accessible, but the DQ User Guide is now maintained exclusively in the Documentation Center.
To access our old Supported Connections page, please refer to 2022.03.

Production

The following is a list of drivers certified for production use.

Connections - Currently Supported

Each connection is tracked against the following capabilities: Certified, Tested, Packaged, Optionally packaged, Pushdown, Estimate job, Filtergram, Analyze data, Schedule, Spark agent, Yarn agent, Parallel JDBC, Session state, Kerberos password, Kerberos password manager, Kerberos keytab, Kerberos TGT, and Standalone (non-Livy).

Currently supported connections:
  • Athena
  • BigQuery
  • Databricks JDBC
  • DB2
  • Dremio
  • Hive
  • Impala
  • MSSQL
  • MySQL
  • Oracle
  • Postgres
  • Presto
  • Redshift
  • Snowflake
  • Sybase
  • Teradata

Remote Connections - Currently Supported

Each remote connection is tracked against the following capabilities: Certified, Tested, Packaged, Optionally packaged, Pushdown, Estimate job, Filtergram, Analyze data, Spark agent, and Yarn agent.

Currently supported remote connections:
  • Azure Data Lake (Gen2)
  • Google Cloud Storage
  • HDFS
  • S3

Under Evaluation

The following is a list of drivers that are under evaluation (not yet certified for production use). These connections are currently ineligible for escalated support services.

Connections - Tech Preview

Each connection is tracked against the following capabilities: Certified, Tested, Packaged, Optional packaging, Pushdown, Estimate job, Filtergram, Analyze data, Schedule, Spark agent, Yarn agent, Parallel JDBC, Session state, Kerberos password, Kerberos password manager, Kerberos keytab, Kerberos TGT, and Standalone (non-Livy).

Connections in Tech Preview:
  • Cassandra
  • MongoDB
  • SAP HANA
  • Solr

Streaming - Tech Preview

Each connection is tracked against the following capabilities: Certified, Tested, Packaged, Optional packaging, Pushdown, Estimate job, Filtergram, Analyze data, Schedule, Spark agent, Yarn agent, Parallel JDBC, Session state, Kerberos password, Kerberos password manager, Kerberos TGT, CRDB metastore, and Standalone (non-Livy).

Streaming connections in Tech Preview:
  • Kafka

Files

Supported file types:
  • CSV (and all delimiters)
  • Parquet
  • AVRO
  • JSON
  • DELTA

Limitations

Authentication
  • DQ jobs that require Kerberos TGT are not yet supported on Spark Standalone or Local deployments
    • It is recommended to submit these jobs via Yarn or K8s

File Limitations

File Sizes
  • Files with more than 250 columns are not supported in File Explorer, unless you have Livy enabled.
  • Files larger than 5 GB are not supported in File Explorer, unless you have Livy enabled.
  • Smaller files allow for skip scanning and more efficient processing.
  • Advanced features like replay, scheduling, and historical lookbacks require a date signature in the folder or file path (see the example below).
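For example, a hypothetical layout with the run date embedded in the folder path lets DQ resolve each run; the bucket, dataset, and file names below are illustrative only:

```
s3://my-bucket/orders/2022-10-01/orders.csv
s3://my-bucket/orders/2022-10-02/orders.csv
s3://my-bucket/orders/2022-10-03/orders.csv
```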
S3
  • Please ensure there are no spaces in the S3 connection name
  • Please remember to select the 'Save Credentials' checkbox when establishing the connection
  • Please point to the root bucket, not to subfolders
Local Files
  • Local files can only be run using the NO_AGENT default
  • This is intended for quick testing, smaller files, and demonstration purposes.
  • Local file scanning is not intended for large-scale production use.
Livy
  • Livy is only supported for K8s environments

Spark Engine Support

  • MapR is EOL, and the MapR Spark engine is not supported for running CDQ jobs.

Databricks

Please refer to this page for more details on Databricks support.
The only supported Databricks spark-submit option is to use a notebook to initiate the job (Scala and PySpark options). This is intended for pipeline developers and users knowledgeable about Databricks and notebooks. This form factor is ideal for incorporating data quality within existing Spark ETL data flows; the results are still available for business users to consume, but the configuration is not intended for business users to implement. There are three ways that Databricks users can run CDQ jobs using a Databricks cluster or a JDBC connection.
1. Notebook
Users can open a notebook directly, upload the CDQ jars, and run a CDQ job on a Databricks cluster. The full steps are explained on the page below. CDQ supports this flow in production.
2. Spark-Submit
There are two ways to run a spark-submit job on a Databricks cluster: the first approach is to run a CDQ spark-submit job through the Databricks UI, and the second is to invoke the Databricks REST API. We have tested both approaches against different Databricks cluster versions (see the table below). The full documentation demonstrating these paths is available at https://dq-docs.collibra.com/apis-1/notebook/cdq-+-databricks/dq-databricks-submit
Please note that these are only examples to demonstrate how to achieve a DQ spark submit to a Databricks cluster. These paths are not supported in production, and the DQ team does not provide bug fixes, professional services, or answers to customer questions for these flows.
3. JDBC
CDQ users can create JDBC connections in the CDQ UI and connect to their Databricks database. This is scheduled for the 2022.05 release.
Delta Lake and JDBC connectivity have been validated against the Spark 3.0.1 CDQ package, Databricks 7.3 LTS, and SparkJDBC41.jar. This is available as Preview. No other combinations have been certified at this time.
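As an illustration only (not taken from this guide), a JDBC URL for the SparkJDBC41 (Simba Spark) driver generally takes the form below. Every bracketed value is a placeholder for your own workspace details; confirm the exact URL format with the driver documentation or copy it from the cluster's JDBC/ODBC settings in Databricks.

```
jdbc:spark://<workspace-host>:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/<org-id>/<cluster-id>;AuthMech=3;UID=token;PWD=<personal-access-token>
```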
Spark submit using the Databricks Spark master URL is not supported.
CDQ Production support for Databricks.