Collibra DQ User Guide
2022.03

Supported Connections

A list of supported data source connection types.

Production

The following is a list of drivers certified for production use.
Note: The illustrative grades provided below represent the level of support for data types, authentication methods, and features. Example scenarios are as follows:
  • Data Types
    • A Grade: Majority supported, with fewer exceptions
    • B- Grade: Modest support, with more exceptions (e.g. Boolean, Geography, advanced Date/Time)
  • Authentication
    • A Grade: Multiple methods, including Kerberos keytabs
    • B Grade: User/pass only
  • Features
    • A Grade: Pushdown support, views, special characters, Validate Source, versatile driver support
    • C Grade: No pushdown support, no special characters, specific drivers required (e.g. multiple JARs)
    • D Grade: Not packaged in K8s containers and not certified with each release. These drivers are not available in the cloud version of the software.
| Connection Type | Driver | Certification | Grade | Auth Type | Comments |
|---|---|---|---|---|---|
| Teradata | Native | Production | B+ | User / Pass | Some SQL nuances; fairly easy to use and performs well |
| Oracle | Native | Production | A- | User / Pass | Performance depends on how the DB is configured and on the fetch size in the driver |
| MS SQL Server | Native | Production | A- | User / Pass | Allows whitespace in column headers |
| Snowflake | Native | Production | A+ | User / Pass | Easy to use and performs well |
| S3 | S3 SDK in Web / S3a connector in Core | Production | A | Secret / Key, Instance Profile, Assume Role | Supports bucket keys, IAM roles, and caching |
| Hive | Simba JDBC | Production | B+ | User / Pass, Kerberos | Many variations of Tez, MapReduce, and other nuances make it difficult to get started |
| Impala | Simba JDBC | Production | A- | User / Pass | Tends to be a quicker start than Hive, but still has many config nuances |
| Postgres | Native | Production | A+ | User / Pass | Easy to use and performs well |
| MySQL | Native | Production | A+ | User / Pass | Easy to use and performs well |
| DB2 | Native | Production | A- | User / Pass, Kerberos | Easy to use and performs well; FETCH syntax vs. LIMIT and other nuances |
| Greenplum | Postgres | Production | A- | User / Pass | Easy to use and performs well |
| Redshift | Simba JDBC | Production | B+ | | |
| Athena | Simba JDBC | Production | B+ | | |
| Presto | Simba JDBC | Production | B+ | | |
| HDFS | HDFS connector | Production | B | Kerberos | Works well, but usually a few Hadoop/Spark nuances to get right |
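
Most of the JDBC connections above are configured the same way once the driver JAR is on the classpath and a JDBC URL, user, and password are supplied. As a minimal sketch (the host, database, credentials, and table below are placeholders, not values from this guide), a Spark-side read from Postgres looks roughly like this:

```scala
// Minimal Spark JDBC read sketch; the host, database, credentials, and
// table names are placeholders, not values from this guide.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dq-jdbc-example").getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/exampledb")
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "public.orders")
  .option("user", "example_user")
  .option("password", "example_pass")
  // Fetch size matters for drivers like Oracle, per the table above.
  .option("fetchsize", "5000")
  .load()

df.printSchema()
```

The same pattern applies to Teradata, Oracle, MySQL, and the other JDBC sources; only the URL format, driver class, and tuning options (such as fetch size) change.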

Preview

The following is a list of drivers that are under evaluation and not yet certified for production use.
| Connection Type | Driver | Certification | Grade | Auth Type | Comments |
|---|---|---|---|---|---|
| MongoDB | UnityJDBC | Preview | C+ | User / Pass | Depends on which driver you use to turn a document store into a relational view |
| BigQuery | Simba JDBC (Web) / Spark Connector (Core) | Preview | B+ | JSON service account | Views are now supported with the viewEnabled=true flag; joins work but are not officially supported. Requires 20 JARs (compared to 1) and accepts pushdown, but has nuances |
| GCS (Google Cloud Storage) | Google Cloud SDK for Web / GCS Spark Connector in Core | Preview | B | JSON service account | More config than usual |
| Azure Cloud Storage (ADLSv2) | Azure SDK for Web / Azure Data Explorer connector for Spark in Core | Preview | B- | Key / Service Principal | More config than average |
| Solr | Solr JDBC | Preview | D | User / Pass | Not a relational store, so requires some understanding of Solr |
| MS SQL Data Warehouse | Native | Preview | B+ | User / Pass | |
| Delta Lake | Native | Preview | D | User / Pass | Not packaged with Collibra DQ containers (K8s build); use at your own risk with standalone installs. Validated against an environment using the DQ Spark 3.0.1 package and Databricks 7.3 LTS |
| SAP HANA | Native | Preview | D | User / Pass | Not packaged with Collibra DQ containers (K8s build); use at your own risk with standalone installs. Works with most common data types using Spark execution |
| MariaDB | MySQL Driver | Preview | B+ | User / Pass | Uses the MySQL driver; some nuances |
| Dremio | Dremio JDBC | Preview | B | User / Pass | |
| Kafka | Native | Preview | B- | | In most cases, teams don't know enough about Kafka administration, schema registry, and other nuances |
| Sybase | Native | Preview | B+ | | |
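
As the Kafka row above notes, most of the difficulty comes from Kafka administration and schema-registry knowledge rather than the connection itself. For orientation only, here is a minimal Spark structured-streaming read; the broker and topic names are placeholders, and the spark-sql-kafka package must be on the classpath:

```scala
// Minimal Spark structured-streaming read from Kafka; broker and topic
// names are placeholders, not values from this guide.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dq-kafka-example").getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "example-topic")
  .option("startingOffsets", "earliest")
  .load()

// Kafka delivers raw bytes; the value column must be cast and parsed
// explicitly, which is where schema-registry knowledge comes in.
val values = stream.selectExpr("CAST(value AS STRING) AS json_value")
```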

Files

| File Type Support | Grade | Comments |
|---|---|---|
| Parquet | B+ | Schema is in the file; less common, but works well |
| CSV (and all delimiters) | A+ | Very common and easy to use |
| JSON | A- | Works well; depends on how many levels of nesting |
| XML | B | JSON is recommended instead |
| AVRO | B | |
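
For the file types above, the usual entry point is Spark's file readers. A minimal sketch (the paths below are placeholders) showing a non-comma delimiter for the CSV family, and a Parquet read where the schema travels with the file:

```scala
// Reading delimited and Parquet files; all paths are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dq-file-example").getOrCreate()

// CSV family: any single-character delimiter can be supplied.
val pipes = spark.read
  .option("header", "true")
  .option("delimiter", "|")
  .csv("/data/example/orders_2022-03-01.psv")

// Parquet carries its schema in the file itself (per the grade above).
val parquet = spark.read.parquet("/data/example/orders.parquet")
```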

Limitations

Authentication
  • DQ Jobs that require a Kerberos TGT are not yet supported on Spark Standalone or Local deployments
    • It is recommended to submit such jobs via YARN or K8s

File Limitations

File Sizes
  • Files with more than 250 columns are not supported in File Explorer, unless you have Livy enabled.
  • Files larger than 5 GB are not supported in File Explorer, unless you have Livy enabled.
  • Smaller file sizes allow for skip scanning and more efficient processing.
  • Advanced features like replay, scheduling, and historical lookbacks require a date signature in the folder or file path (e.g. a path such as /data/orders/2022-03-01/orders.csv).
S3
  • Ensure there are no spaces in the S3 connection name.
  • Remember to select the 'Save Credentials' checkbox when establishing the connection.
  • Point the connection at the root bucket, not at sub-folders.
Local Files
  • Local files can only be run using the NO_AGENT default deployment.
  • This is intended for quick testing, smaller files, and demonstration purposes.
  • Local file scanning is not intended for large-scale production use.
Livy
  • Livy is only supported for K8s environments

Spark Engine Support

  • MapR is EOL, and the MapR Spark engine is not supported for running Collibra DQ jobs.

Databricks

The only supported Databricks spark-submit option is to use a notebook to initiate the job (Scala and PySpark options). This is intended for pipeline developers and users familiar with Databricks and notebooks. This form factor is ideal for incorporating data quality within existing Spark ETL data flows, and the results remain available for business users to consume; the configuration, however, is not intended for business users to implement.
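
As a rough sketch of this notebook pattern: the OwlOptions and OwlUtils class names below come from the legacy OwlDQ Scala SDK and may differ between releases, and the Delta path, dataset name, and metastore host are placeholders, not values from this guide.

```scala
// Hedged sketch of initiating a DQ job from a Databricks Scala notebook.
// OwlOptions / OwlUtils are legacy OwlDQ SDK names and may differ by
// release; all paths, names, and credentials below are placeholders.
import com.owl.common.options.OwlOptions
import com.owl.core.util.OwlUtils

// In a Databricks notebook, `spark` is predefined. Read the data to
// check from an existing Delta table.
val df = spark.read.format("delta").load("/delta/example/orders")

val opt = new OwlOptions()
opt.dataset = "orders_delta"      // dataset name shown in the DQ UI
opt.runId   = "2022-03-01"        // run date, used for historical lookbacks
opt.host    = "dq-metastore-host" // DQ metadata store host (placeholder)
opt.port    = "5432"

// Register and run the check; results land in the DQ web app for
// business users to consume, as described above.
val owl = OwlUtils.OwlContext(df, opt)
owl.register(opt)
owl.owlCheck()
```
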
Delta Lake and JDBC connectivity have been validated against the Spark 3.0.1 Collibra DQ package and our standalone deployment packages. This is available as a Preview for standalone deployments only; Delta Lake JDBC is not available as part of the container packages.
Spark submit using the Databricks Spark master URL is not supported.