2021.10
An OwlCheck is a bash script that is essentially the launch point for any Owl job that scans a dataset. A dataset can be a flat file (such as a text, JSON, or Parquet file) or a table from any number of databases (such as Oracle, Postgres, MySQL, Greenplum, DB2, SQL Server, Teradata, etc.).
Example: run a data quality check on any file by setting the file path.
./owlcheck -ds stock_trades -rd 2019-02-03 -f /path/to/file.csv -d ,
Example output is shown below. A hoot is a valid JSON response.
{
  "dataset": "stock_trades",
  "runId": "2019-02-03",
  "score": 100,
  "behaviorScore": 0,
  "rows": 477261,
  "passFail": 1,
  "peak": 1,
  "dayOfWeek": "Sun",
  "avgRows": 0,
  "cols": 5,
  "activeRules": 0,
  "activeAlerts": 0,
  "runTime": "00:00:23",
  "dqItems": {},
  "datashapes": [],
  "validateSrc": [],
  "alerts": [],
  "prettyPrint": true
}
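
A scheduler step can act on the hoot. The sketch below is one hedged way to pull the score out of the JSON and fail the step when it falls below a threshold. It assumes the hoot was captured into a shell variable, that a higher score is better, and that 90 is an acceptable cutoff; `hoot_score`, `check_hoot`, and the threshold are hypothetical names and values, not part of Owl. The sed extraction only handles a flat numeric field, so a JSON-aware tool such as jq would be more robust where it is available.

```shell
#!/bin/sh
# Sample hoot, trimmed from the example output above.
hoot='{
  "dataset": "stock_trades",
  "runId": "2019-02-03",
  "score": 100,
  "rows": 477261
}'

# hoot_score (hypothetical helper): print the numeric "score" field of a hoot.
hoot_score() {
  printf '%s\n' "$1" | sed -n 's/.*"score": *\([0-9][0-9]*\).*/\1/p'
}

# check_hoot (hypothetical helper): succeed only if score >= threshold ($2).
check_hoot() {
  score=$(hoot_score "$1")
  [ -n "$score" ] && [ "$score" -ge "$2" ]
}

if check_hoot "$hoot" 90; then
  echo "score $(hoot_score "$hoot"): OK"
else
  echo "score below threshold or missing" >&2
fi
```

A missing score is treated as a failure, so a malformed or empty response does not silently pass the gate.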

Monthly Data

Sometimes you may want to run monthly profiles over aggregated data. In this case the scheduling tool can supply the run date (${rd}) as a variable such as $runDate and the end date as $endDate. A short bash/shell example follows.
echo "Hello World Owl"

# Run date is today; the end date is one month later (date -d requires GNU date)
runDate=$(date +"%Y-%m-%d")
endDate=$(date -d "$runDate +1 month" +%Y-%m-%d)

echo "$runDate"
echo "$endDate"

./owlcheck \
-q "select * from table where date >= '$runDate' and date < '$endDate' " \
-ds example \
-rd "$runDate" \
-tbin MONTH
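
When the job does not run on the first of the month, the window above starts mid-month, and GNU date can roll into the following month when a month is added to a day such as the 31st. A hedged alternative is to pin the run date to the first day of the month before adding the month. The sketch assumes GNU date (for -d) and uses a hypothetical helper name, `month_window`; it only prints the dates and does not call owlcheck.

```shell
#!/bin/sh
# month_window (hypothetical helper): given any date YYYY-MM-DD, print the
# first day of that month and the first day of the next month.
# Assumes GNU date; -u keeps the arithmetic in UTC.
month_window() {
  start=$(date -u -d "$1" +%Y-%m-01)
  end=$(date -u -d "$start +1 month" +%Y-%m-%d)
  echo "$start $end"
}

month_window "2019-01-31"   # prints: 2019-01-01 2019-02-01

# Build the window for the current month and split it into two variables.
window=$(month_window "$(date -u +%Y-%m-%d)")
runDate=${window% *}
endDate=${window#* }
echo "runDate=$runDate endDate=$endDate"
```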

Monthly BackRun (Using Owl's built in Monthly)

Owl has two convenient features here: (1) the built-in ${rd} and ${rdEnd} variables remove the need for any shell scripting; (2) with -br, Owl automatically replays 20 months of data using this template.
./owlcheck \
-q "select * from table where date >= '${rd}' and date < '${rdEnd}' " \
-ds example \
-rd 2019-01-01 \
-rdEnd 2019-02-01 \
-tbin MONTH \
-br 20
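
One caveat applies when such a command is typed directly into a shell: inside double quotes, the shell expands ${rd} itself before owlcheck sees it. The sketch below shows the difference. It assumes owlcheck should receive the literal ${rd} token; whether escaping is needed depends on how the command is launched, since a scheduler or the Owl UI may pass the text through unchanged.

```shell
#!/bin/sh
unset rd   # no shell variable named rd is defined

# Unescaped: the shell substitutes ${rd} (here, with an empty string).
expanded="select * from table where date >= '${rd}'"

# Escaped with a backslash: the literal text ${rd} is preserved.
literal="select * from table where date >= '\${rd}'"

echo "$expanded"   # select * from table where date >= ''
echo "$literal"    # select * from table where date >= '${rd}'
```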

Daily Data

One of the most common cases is data that loads or runs once a day. A job control framework can pass in the run date, or you can pull it from the shell.
echo "Hello World Owl"

runDate=$(date +"%Y-%m-%d")
echo "$runDate"

./owlcheck \
-q "select * from table where date = '$runDate' " \
-ds example \
-rd "$runDate" \
-tbin DAY
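
When a job control framework passes the run date in, a small wrapper can accept it as an argument and fall back to today's date. This is a minimal sketch; `resolve_run_date` is a hypothetical helper, and the format check only verifies the YYYY-MM-DD shape, not that the date exists on the calendar.

```shell
#!/bin/sh
# resolve_run_date (hypothetical helper): print the date passed as $1,
# or today's date when no argument is given. Rejects malformed values.
resolve_run_date() {
  candidate=${1:-$(date +%Y-%m-%d)}
  case "$candidate" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) echo "$candidate" ;;
    *) echo "invalid run date: $candidate" >&2; return 1 ;;
  esac
}

runDate=$(resolve_run_date "2019-03-14") || exit 1
echo "$runDate"   # 2019-03-14
```

In a scheduler script, the literal date would typically be replaced by the script's first argument, so the job can be rerun for any past day.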

Daily Data (Using Owl's built in Daily)

./owlcheck \
-q "select * from table where date = '${rd}' " \
-ds example \
-rd 2019-03-14

Daily Data with Timestamp instead of Date

./owlcheck \
-q "select * from table where TS >= '${rd} 00:00:00' and TS <= '${rd} 23:59:59' " \
-ds example \
-rd 2019-03-14

OR Timestamp using ${rdEnd}

./owlcheck \
-q "select * from table where TS >= '${rd} 00:00:00' and TS < '${rdEnd} 00:00:00' " \
-ds example \
-rd 2019-03-14 \
-rdEnd 2019-03-15 \
-tbin DAY

Hourly Data

./owlcheck \
-q "select * from table where TS >= '${rd}' and TS < '${rdEnd}' " \
-ds example \
-rd "2019-03-14 09:00:00" \
-rdEnd "2019-03-14 10:00:00" \
-tbin HOUR
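
For an hourly schedule, the end of the window can be derived from the start. The sketch below is one way to do that, assuming GNU date. It converts the start to epoch seconds and adds 3600, because a "+1 hour" suffix placed after a time of day can be misread by GNU date as a time zone offset. `hour_end` is a hypothetical helper name.

```shell
#!/bin/sh
# hour_end (hypothetical helper): print the timestamp one hour after $1.
# Assumes GNU date; -u keeps the arithmetic in UTC.
hour_end() {
  start_epoch=$(date -u -d "$1" +%s)
  date -u -d "@$((start_epoch + 3600))" "+%Y-%m-%d %H:%M:%S"
}

rd="2019-03-14 09:00:00"
rdEnd=$(hour_end "$rd")
echo "$rd -> $rdEnd"   # 2019-03-14 09:00:00 -> 2019-03-14 10:00:00
```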

OwlCheck Template with Service Hook

The best practice is to make a generic job that is repeatable for every OwlCheck. The example below first calls Owl's REST API to fetch the template and then runs the OwlCheck returned in the response.
curl -X GET "http://$host/v2/getowlchecktemplate?dataset=lake.loan_customer" \
-H "accept: application/json"
The REST call above returns the OwlCheck below. It is left to the job control system to replace ${rd} with its own run date. You can use Owl's built-in scheduler to skip these steps.
./owlcheck \
-lib "/home/danielrice/owl/drivers/mysql/" \
-cxn mysql \
-q "select * from lake.loan_customer where load_dt = '${rd}' " \
-key post_cd_num -ds lake.loan_customer \
-rd ${rd} \
-dc load_dt -dl -dlkey usr_name,post_cd_num -dllb 5 \
-tbin DAY -by DAY -dupe -dupeinc ip_address_home,usr_name -dupecutoff 85 \
-fpgon -fpgkey usr_name,post_cd_num -fpgdc load_dt -fpglb 5 -fpgtbin DAY \
-loglevel INFO \
-h $host:5432/owltrunk \
-owluser {user}
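
The substitution step left to the job control system can be as small as one sed call. The sketch below is hedged: `fill_run_date` is a hypothetical helper, the template string is shortened from the example above, and it assumes the run date contains no "/" or "&" characters, which are special in a sed replacement.

```shell
#!/bin/sh
# fill_run_date (hypothetical helper): replace every literal ${rd} token
# in the template ($1) with the run date ($2).
fill_run_date() {
  printf '%s\n' "$1" | sed 's/[$]{rd}/'"$2"'/g'
}

# Single quotes keep ${rd} literal so the shell does not expand it.
template='-q "select * from lake.loan_customer where load_dt = '\''${rd}'\'' " -rd ${rd}'

fill_run_date "$template" "2019-03-14"
```

Both occurrences of the token are replaced, so the query filter and the -rd argument stay in sync.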

REST API End Point

The easiest option is to call the runtemplate endpoint from the command line or a job control system. This endpoint uses the OwlCheck saved in Owl, so the client does not need to know the OwlCheck details.
POST http://$host/v2/runtemplate?dataset=lake.spotify (RunTemplate)

Curl example for the above REST call

# Sign in and extract the bearer token from the JSON response
TOKEN=$(curl -s -X POST "http://$host/auth/signin" -H "Content-Type: application/json" -d "{\"username\":\"$username\", \"password\":\"$password\"}" | jq -r '.token')

# Call the runtemplate endpoint (documented above as POST) with the token
curl -i -X POST -H 'Accept: application/json' \
-H "Authorization: Bearer ${TOKEN}" \
"http://$host/v2/runtemplate?dataset=lake.spotify"
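
If the sign-in response has no token field, jq -r prints the string null, and the second call then goes out with an unusable Authorization header. A hedged guard is sketched below. `require_token` is a hypothetical helper, and the assumption that a failed sign-in yields an empty or null token is not confirmed by the text above.

```shell
#!/bin/sh
# require_token (hypothetical helper): fail when the token is empty or "null".
require_token() {
  case "$1" in
    ""|null) echo "sign-in failed: no token returned" >&2; return 1 ;;
    *) return 0 ;;
  esac
}

# Example with a placeholder value (no network call is made here).
TOKEN="null"
if require_token "$TOKEN"; then
  echo "token ok, safe to call runtemplate"
else
  echo "aborting before calling runtemplate"
fi
```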

Bash Script

A generic, repeatable OwlCheck script for job schedulers that hooks into Owl to get the template.
# 1. Authenticate and store the session cookie
curl -s -X POST -d username={user} -d password={password} "http://$OWL_HOST/login" -c cookies.txt

# 2. Get the template, then strip the JSON array brackets and outer quotes
owlcheck_args=$(curl -s -b cookies.txt -H "accept: application/json" -X GET "http://$OWL_HOST/v2/getowlcheckcmdlinebydataset?dataset=insurance" | sed 's/.*\[\(.*\)\]/\1/' | sed -e "s/^\"//" -e "s/\"$//" | sed 's/\\\"\(.*\)\\\"/\x27\1\x27/')

# 3. Replace ${rd} with job_run_date
job_run_date="2019-03-14 10:00:00"
owlcheck_args=${owlcheck_args//'${rd}'/$job_run_date}

# 4. Run the OwlCheck
eval owlcheck $owlcheck_args
For more information on Owl's scheduler, see the OwlCheck Cron page.