Collibra DQ User Guide
2022.10
Search…
⌃K

Pushdown

Pushdown feature description
We've moved! To improve customer experience, the Collibra Data Quality User Guide has moved to the Collibra Documentation Center as part of the Collibra Data Quality 2022.11 release. To ensure a seamless transition, dq-docs.collibra.com will remain accessible, but the DQ User Guide is now maintained exclusively in the Documentation Center.
Some of the DQ features support pushdown to avoid transferring large data sets from a data source (Database, Cloud storage, file systems, and so on) into Spark. When pushdown is enabled and supported, the DQ Job generates SQL queries to offload the compute to the data source, reducing the amount of data transfer and Spark computation of the DQ Job. Not all features support pushdown nor pushdown completely eliminate data transfer.

Why use pushdown

To analyze a data set for DQ findings as part of a DQ Job, CDQ uses Spark as the compute engine to power the analysis. This requires the data set to be loaded into Spark as a Spark DataFrame, and the DQ Job performance (job completion speed) is limited by Spark resources available and the complexity of the job. This data transfer from data source to Spark is dependent on two of the following factors:
  1. 1.
    Bandwidth limitation at the data source If the DQ Job requires pulling 100 million rows from a SQL Database, then the Input/Output limit of transferring data out of SQL Database into the Spark cluster will greatly increase the overall speed of the DQ Job.
  2. 2.
    Compute limitation at the data source for complex queries If the DQ Job requires complex queries to create the data set (via -q), then this computation is done at the SQL Database level. Some examples of "complex" queries are: -q "SELECT * FROM public.very_long_table_of_transactions WHERE date = ${rd} and department = 'finance'" This query can take long time at the database level due to WHERE clause filtering a very long table on columns date and department that may not be indexed properly, leading to a full table scan without a LIMIT clause specified. -q "SELECT * FROM public.very_long_table_of_transactions transactions INNER JOIN public.departments departments ON transactions.department_id = departiments.id WHERE transactions.date = ${rd} and departments.name = 'finance'" This query can take long time at the database level due to a join between tables.
  3. 3.
    Resource limitation at the data source The data source may not have enough hardware resource available to efficiently fulfill the query and/or handle multiple DQ Jobs & other non-CDQ applications requesting data.
The bottleneck due to data transfer can be viewed in Jobs page > Job Logs > LOAD stage. This data loading step is the first step for all DQ Jobs. However, pushdown feature can be used to reduce (NOT eliminate) the data transfer & compute to Spark from the data source, if the DQ Job specified does not require loading all the data into Spark. In summary, speed of loading the data from data source to Spark is a -q query compute time at data source + network transfer of -q result between data source and Spark. Pushdown can reduce both elements, but the efficiency gained is dependent on the complexity of -q query and how big the data set from -qresult would have been without pushdown.

How pushdown is efficient

Some of the Spark compute performed by the DQ Job can be translated into SQL queries that most relational databases support natively. In such a case, the DQ Job does not need to load all the rows of the data set. Instead, the DQ Job can query the data source for the results of those SQL queries and reduce the amount of data transferring out of the data source. The results of these SQL queries are almost always lead to smaller amount of data compared to the full data set defined by -q. Only some of the DQ Job features require the full data set to be loaded into Spark. Therefore, pushdown can be a useful tool to speed up the overall DQ Job speed -- provided that the speed of executing these SQL queries are faster than the speed of transferring the data out of the data source into Spark. In most use cases, pushdown leads to faster DQ Job execution for large data sets. If the -q query is sufficiently complex, then the speed reduced by transferring less data into Spark can be cancelled out by the multiple frequent SQL queries made to the data source by the Pushdown process (because each query may have have redundant compute due to the complexity of -q)

How pushdown works in CDQ

Using pushdown only reduces the amount of data transferred out of the data source. It does NOT skip the LOAD stage in DQ Job. Every DQ Job requires a small sample of rows (10-20) of the data set defined by -q in order to generate Data Preview and analyze schema information for the data set run. This means the -q query may be fully computed at data source before the sampling can occur (depending on the complexity of the -q). In such a case, sampling 10-20 rows of data is not a quick and immediate LOAD stage and only efficiency gain comes from lack of transferring data between data source and Spark. Therefore, pushdown feature would be most efficient if -q is a simple select query with simple where filtering. The benefit comes from the fact that if your data set defined by -q results in 100 million rows, only 10-20 rows of the data set defined by -q will be loaded into Spark.
Profile with pushdown will then generate a series of SQL queries and query the data source again for aggregate metric data. Depending on the data set, these multiple SQL queries can be more efficient than loading all the data into Spark and computing these aggregate metric in Spark. The results between Profile with pushdown and Profile without pushdown are (practically) identical.

Profile: pushdown vs no pushdown

Here is the summary of Profile activity with details regarding pushdown support
Feature
Supports pushdown
Description
Row Count
Yes
Computes row count of the data set.
Distinct Count
Yes
The number of distinct values in a column
Mean
Yes
The average of all the values in the column. Supports numeric columns only.
Min / Max
Yes
The minimum and maximum values of the column. Supports numeric and boolean columns only
NULL Count
Yes
The number of null values in the column.
EMPTY Count
Yes
The number of empty values in the column. Supports string columns only.
TYPE Count
Yes
The number of different types inferred in a column (if any)
TopN / BottomN
No
Computes the top 5 most frequent (TopN) and least frequent (BottomN) values. Supports all types. This result is displayed as a frequency bar chart in Profile page. If pushdown is enabled, then TopN and BottomN values are not displayed. Related features like Stat Rules (Distribution) are also disabled.
Data Shape Detection
No
Detects shapes of the values based on Shape parameters provided (automatic or manual).
Histogram
No
Creates histogram of the values in the column.
Correlation Matrix
No
Creates correlation matrix. Supports only numeric columns