Collibra DQ User Guide


We've moved! To improve customer experience, the Collibra Data Quality User Guide has moved to the Collibra Documentation Center as part of the Collibra Data Quality 2022.11 release. To ensure a seamless transition, will remain accessible, but the DQ User Guide is now maintained exclusively in the Documentation Center.
A DQ Check is a bash script that is essentially the launch point for any DQ Job to scan a data set. A data set can be a flat file, such as textfile, json file, parquet file, etc, or a table from any number of databases, such as Oracle, Postgres, Mysql, Greenplum, DB2, SQLServer, Teradata, etc.
Example Run a Data Quality check on any file by setting the file path.
./owlcheck -ds stock_trades -rd 2019-02-23 -f /path/to/file.csv -d ,
Example output below. A hoot is a valid JSON response
"dataset": "stock_trades",
"runId": "2019-02-03",
"score": 100,
"behaviorScore": 0,
"rows": 477261,
"passFail": 1,
"peak": 1,
"dayOfWeek": "Sun",
"avgRows": 0,
"cols": 5,
"activeRules": 0,
"activeAlerts": 0,
"runTime": "00:00:23",
"dqItems": {},
"datashapes": [],
"validateSrc": [],
"alerts": [],
"prettyPrint": true

Monthly Data

Sometimes you may want to run monthly profiles with aggregated data. In this case, the scheduling tool can supply the ${rd} as a variable such as $runDate and the end date as $endDate. 1 line examples for bash or shell below.
echo "Hello World Owl"
runDate=$(date +"%Y-%m-%d")
endDate=$(date -d "$runDate +1 month" +%Y-%m-%d)
echo $runDate
echo $endDate
./owlcheck \
-q "select * from table where date >= '$runDate' and date < '$endDate' " \
-ds example \
-rd $runDate \
-tbin MONTH

Monthly BackRun (Using Collibra Data Quality's built-in Monthly)

Collibra Data Quality has 2 convenient features here:
  1. 1.
    The use of built-in ${rd} and ${rdEnd} removes the need for any shell scripting.
  2. 2.
    Using -br, DQ will replay 20 months of data using this template automatically.
./owlcheck \
-q "select * from table where date >= '${rd}' and date < '${rdEnd}' " \
-ds example
-rd 2019-01-01
-rdEnd 2019-02-01
-tbin MONTH
-br 20

Daily Data

One of the most common examples is data loading or running once a day. A job control framework can pass in this value or you can pull it from shell.
echo "Hello World Owl"
runDate=$(date +"%Y-%m-%d")
echo $runDate
./owlcheck \
-q "select * from table where date = '$runDate' " \
-ds example \
-rd $runDate \
-tbin DAY

Daily Data (Using Collibra Data Quality's built-in Daily)

./owlcheck \
-q "select * from table where date = '${rd}' " \
-ds example \
-rd 2019-03-14

Daily Data with Timestamp instead of Date

./owlcheck \
-q "select * from table where TS >= '${rd} 00:00:00' and TS <= '${rd} 23:59:59' " \
-ds example \
-rd 2019-03-14

OR Timestamp using ${rdEnd}

./owlcheck \
-q "select * from table where TS >= '${rd} 00:00:00' and TS < '${rdEnd} 00:00:00' " \
-ds example \
-rd 2019-03-14 \
-rdEnd 2019-03-15 \
-tbin DAY

Hourly Data

./owlcheck \
-q "select * from table where TS >= '${rd}' and TS < '${rdEnd}' " \
-ds example \
-rd "2019-03-14 09:00:00" \
-rdEnd "2019-03-14 10:00:00" \
-tbin HOUR

DQ Check Template with Service Hook

The best practice is to make a generic job that would be repeatable for every DQ Check. Below is an example that first hits Collibra Data Quality using a REST call and then runs the response.
curl -X GET "http://$host/v2/getowlchecktemplate?dataset=lake.loan_customer" \
-H "accept: application/json"
The above REST call returns the below DQ Check. It is left up to the Job Control to replace the ${rd} with the date from the Job Control system. You can use Collibra DQ's built-in scheduler to save these steps.
./owlcheck \
-lib "/home/danielrice/owl/drivers/mysql/" \
-cxn mysql \
-q "select * from lake.loan_customer where load_dt = '${rd}' " \
-key post_cd_num -ds lake.loan_customer \
-rd ${rd} \
-dc load_dt -dl -dlkey usr_name,post_cd_num -dllb 5 \
-tbin DAY -by DAY -dupe -dupeinc ip_address_home,usr_name -dupecutoff 85 \
-fpgon -fpgkey usr_name,post_cd_num -fpgdc load_dt -fpglb 5 -fpgtbin DAY \
-loglevel INFO \
-h $host:5432/owltrunk \
-owluser {user}

REST API End Point

The easiest option is to use the runtemplate end point API call to make requests to from cmdLine or JobControl System. This endpoint gets the DQ Check saved in Collibra instead of the client needing to know the DQ Check details.

Curl example for the above Rest Call

TOKEN=$(curl -s -X POST http://$host/auth/signin -H "Content-Type:application/json" -d "{\"username\":\"$username\", \"password\":\"$password\"}" | jq -r '.token')
curl -i -H 'Accept: application/json' \
-H "Authorization: Bearer ${TOKEN}" \

Bash Script

A generic and repeatable owlcheck script for job schedulers, that hooks into Collibra to get the template.
#1 authenticate
curl -sb -X POST -d username={user} -d password={password} http://$OWL_HOST/login -c cookies.txt
#2 get template
owlcheck_args=$(curl -b cookies.txt -H "accept: application/json" -X GET http://$OWL_HOST/v2/getowlcheckcmdlinebydataset\?dataset=insurance | sed 's/.*\[\(.*\)\]/\1/' | sed -e "s/^\"//" -e "s/\"$//" | sed 's/\\\"\(.*\)\\\"/\x27\1\x27/')
#3 replace ${rd} with job_run_date
job_run_date="2019-03-14 10:00:00"
#4 run owlcheck
eval owlcheck $owlcheck_args
For more Information on Collibra Data Quality's Scheduler, visit the DQ Check Cron page.