Supported Connections
A list of supported data source connection types.
The following is a list of drivers certified for production use.
Note: The illustrative grades below represent the level of support for data types, authentication methods, and features. Example scenarios are as follows:
- Data Types
  - A Grade: Majority supported, with few exceptions
  - B- Grade: Modest support, with more exceptions, e.g. Boolean, Geography, advanced Date/Time
- Authentication
  - A Grade: Multiple methods, including Kerberos keytabs
  - B Grade: Username/password only
- Features
  - A Grade: Pushdown support, views, special characters, Validate Source, versatile driver support
  - C Grade: No pushdown support, no special characters, specific drivers required (e.g. multiple JARs)
  - D Grade: Not packaged in K8s containers and not certified with each release. These are not available in the cloud version of the software.
| Connection Type | Driver | Certification | Grade | Auth Type | Comments |
| --- | --- | --- | --- | --- | --- |
| Teradata | Native | Production | B+ | User / Pass | Some SQL nuances; fairly easy to use and performs well |
| Oracle | Native | Production | A- | User / Pass | Performance depends on how the DB is configured and on the driver fetch size (see the sketch after this table) |
| MS SQL Server | Native | Production | A- | User / Pass | Allows whitespace in column headers |
| Snowflake | Native | Production | A+ | User / Pass | Easy to use and performs well |
| S3 | S3 SDK in Web / S3a connector in Core | Production | A | Secret / Key, Instance Profile, Assume Role | Supports bucket keys, IAM roles, and caching |
| Hive | Simba JDBC | Production | B+ | User / Pass, Kerberos | Many variations of Tez, MapReduce, and other nuances can make it difficult to get started |
| Impala | Simba JDBC | Production | A- | User / Pass | Tends to be a quicker start than Hive but still has many configuration nuances |
| Postgres | Native | Production | A+ | User / Pass | Easy to use, performs well |
| MySQL | Native | Production | A+ | User / Pass | Easy to use, performs well |
| DB2 | Native | Production | A- | User / Pass, Kerberos | Easy to use, performs well; fetch syntax vs. limit and other nuances |
| GreenPlum | Postgres | Production | A- | User / Pass | Easy to use, performs well |
| Redshift | Simba JDBC | Production | B+ | | |
| Athena | Simba JDBC | Production | B+ | | |
| Presto | Simba JDBC | Production | B+ | | |
| HDFS | HDFS connector | Production | B | Kerberos | Works well, but usually a few Hadoop/Spark nuances to get right |
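The Oracle comment above refers to the driver fetch size. As a hedged illustration (not Collibra DQ-specific configuration), a Spark JDBC read can tune this with the `fetchsize` option; the host, service, table, and credentials below are placeholders.

```python
# Illustrative only: tuning JDBC fetch size when Spark reads from Oracle.
# The connection details are placeholders, not Collibra DQ configuration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-fetchsize-example").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCLPDB1")  # placeholder URL
    .option("dbtable", "SALES.ORDERS")                           # placeholder table
    .option("user", "dq_user")
    .option("password", "****")
    .option("fetchsize", "10000")  # larger fetch sizes reduce round trips on big scans
    .load()
)
df.show(5)
```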
The following is a list of drivers under evaluation (not yet certified for production use).
| Connection Type | Driver | Certification | Grade | Auth Type | Comments |
| --- | --- | --- | --- | --- | --- |
| MongoDB | unityJDBC | Preview | C+ | User / Pass | Depends on which driver you use to turn a document store into a relational view |
| BigQuery | Simba JDBC (Web), Spark Connector (Core) | Preview | B+ | JSON service account | Views are now supported with the viewEnabled=true flag; joins work but are not supported. Requires 20 JARs compared to 1 and accepts pushdown, but has nuances (see the sketch after this table) |
| GCS (Google Cloud Storage) | Google Cloud SDK for Web / GCS Spark Connector in Core | Preview | B | JSON service account | More configuration than usual |
| Azure Cloud Storage (ADLSv2) | Azure SDK for Web / Azure Data Explorer connector for Spark in Core | Preview | B- | Key / Service Principal | More configuration than average |
| Solr | Solr JDBC | Preview | D | User / Pass | Not a relational store, so requires some understanding of Solr |
| MS SQL Data Warehouse | Native | Preview | B+ | User / Pass | |
| Delta Lake | Native | Preview | D | User / Pass | Not packaged with Collibra DQ containers (K8s build); use at your own risk with standalone installs. Validated against an environment using DQ Spark 3.0.1 and Databricks 7.3 LTS |
| SAP HANA | Native | Preview | D | User / Pass | Not packaged with Collibra DQ containers (K8s build); use at your own risk with standalone installs. Works with most common data types using Spark execution |
| MariaDB | MySQL Driver | Preview | B+ | User / Pass | Uses the MySQL driver; some nuances |
| Dremio | Dremio JDBC | Preview | B | User / Pass | |
| Kafka | Native | Preview | B- | | In most cases the group doesn't know enough about Kafka administration, schema registry, and other nuances |
| Sybase | Native | Preview | B+ | | |
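The BigQuery row above mentions a view flag on the Core (Spark connector) path. As a hedged illustration based on the open-source spark-bigquery connector (where the option is spelled `viewsEnabled` and a materialization dataset is also required), a read of a BigQuery view might look like the sketch below; the project, dataset, and view names are placeholders.

```python
# Illustrative only: reading a BigQuery view through the Spark BigQuery connector.
# Project/dataset/view names are placeholders; authentication assumes a JSON service account.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigquery-view-example").getOrCreate()

df = (
    spark.read.format("bigquery")
    .option("credentialsFile", "/path/to/service-account.json")  # JSON service account key
    .option("viewsEnabled", "true")              # required to read logical views
    .option("materializationDataset", "dq_tmp")  # dataset used to materialize view results
    .load("my-project.analytics.orders_view")
)
df.printSchema()
```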
| File Type Support | Grade | Comments |
| --- | --- | --- |
| Parquet | B+ | Schema in file; less common but works well |
| CSV (and all delimiters) | A+ | Very common, easy to use |
| JSON | A- | Works great; depends how many levels are nested |
| XML | B | JSON recommended instead |
| AVRO | B | |
Authentication
- DQ jobs that require a Kerberos TGT are not yet supported on Spark Standalone or Local deployments
- It is recommended to submit jobs via YARN or K8s
File Sizes
- Files with more than 250 columns are not supported in File Explorer, unless you have Livy enabled.
- Files larger than 5 GB are not supported in File Explorer, unless you have Livy enabled.
- Smaller file sizes allow for skip scanning and more efficient processing.
- Advanced features like replay, scheduling, and historical lookbacks require a date signature in the folder or file path (for example, a folder such as /orders/2024-01-15/).
S3
- Please ensure there are no spaces in the S3 connection name
- Please remember to select the 'Save Credentials' checkbox when establishing the connection
- Please point to the root bucket, not sub-folders (see the sketch after this list)
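As a hedged illustration of the Core-side S3a connector mentioned in the connection table, a Spark session can be pointed at a root bucket using either an instance profile or key/secret credentials. The bucket name, folder layout, and credentials below are placeholders.

```python
# Illustrative only: common Hadoop S3A settings when Spark reads from S3.
# The bucket name and credentials are placeholders, not Collibra DQ configuration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("s3a-example")
    # Option 1: rely on the instance profile attached to the node
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.InstanceProfileCredentialsProvider",
    )
    # Option 2 (instead of the provider above): explicit key/secret
    # .config("spark.hadoop.fs.s3a.access.key", "AKIA...")
    # .config("spark.hadoop.fs.s3a.secret.key", "****")
    .getOrCreate()
)

# Point at the root bucket; a date-signed folder layout lives underneath it.
df = spark.read.csv("s3a://my-dq-bucket/orders/2024-01-15/", header=True)
df.show(5)
```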
Local Files
- Local files can only be run using the NO_AGENT default
- This is for quick testing, smaller files, and demonstration purposes.
- Local file scanning is not intended for large-scale production use.
Livy
- Livy is only supported for K8s environments
- MapR is EOL, and the MapR Spark engine is not supported for running CDQ jobs.
The only supported Databricks Spark submit option is to use a notebook to initiate the job (Scala and PySpark options), as sketched below. This is intended for pipeline developers and users familiar with Databricks and notebooks. This form factor is ideal for incorporating data quality within existing Spark ETL data flows. The results are still available for business users to consume, but the configuration is not intended for business users to implement.
Delta Lake JDBC connectivity has been validated against the Spark 3.0.1 CDQ package and our standalone deployment packages. This is available as Preview for standalone installs only; Delta Lake JDBC is not available as part of the container packages.
Spark submit using the Databricks Spark master URL is not supported.
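As a rough sketch of the notebook-initiated pattern described above (assumptions: the DQ libraries are attached to the cluster, and `run_dq_check` is a hypothetical stand-in for the product's actual job entry point, which is documented separately), a PySpark notebook cell might look like this:

```python
# Databricks notebook cell: illustrative shape of a notebook-initiated DQ step.
# `run_dq_check` is a hypothetical placeholder for the actual Collibra DQ entry point;
# refer to the product documentation for the real Scala/PySpark API.
from pyspark.sql import functions as F

# Read the data the existing ETL flow already produces (the Delta table name is a placeholder).
orders = spark.read.format("delta").table("prod.orders")

# Narrow to the partition being processed in this pipeline run.
run_date = "2024-01-15"
batch = orders.where(F.col("order_date") == run_date)

# Hand the DataFrame to the DQ job from within the pipeline; results remain
# visible to business users in the DQ web application.
# run_dq_check(dataset="orders", run_date=run_date, df=batch)   # hypothetical call
```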