Outliers
We've moved! To improve customer experience, the Collibra Data Quality User Guide has moved to the Collibra Documentation Center as part of the Collibra Data Quality 2022.11 release. To ensure a seamless transition, dq-docs.collibra.com will remain accessible, but the DQ User Guide is now maintained exclusively in the Documentation Center.
This real life use-case is when you have a large file or data frame with many days of data but you want the run profile to be the current day so that it trends properly overtime. Another nuance to this use-case is that the customer_id is a unique field to the user and it should not show up in the analytics i.e. an outlier. But the customer_id should be available when the user wants to query the rest api end points. The customer_id is then used to link back the users original dataset. A bloomberg_Id (BB_ID) is a common example.
fname,app_date,age,customer_id
Kirk,2018-02-24,18,31
Kirk,2018-02-23,11,4
Kirk,2018-02-22,10,3
Kirk,2018-02-21,12,2
Kirk,2018-02-20,10,1
val filePath = getClass.getResource("/notebooktest.csv").getPath
val spark = SparkSession.builder
.master("local")
.appName("test")
.getOrCreate()
val opt = new OwlOptions()
opt.dataset = "dataset_outlier"
opt.runId = "2018-02-24"
opt.outlier.on = true
opt.outlier.key = Array("fname")
opt.outlier.dateColumn = "app_date"
opt.outlier.timeBin = OutlierOpt.TimeBin.DAY
opt.outlier.lookback = 5
opt.outlier.excludes = Array("customer_id")
val dfHist = OwlUtils.load(filePath = filePath, delim = ",", sparkSession = spark)
val dfCurrent = dfHist.where(s"app_date = '${opt.runId}' ")
val owl = OwlUtils.OwlContextWithHistory(dfCurrent=dfCurrent, dfHist=dfHist, opt=opt)
owl.register(opt)
owl.owlCheck()
Score drops from 100 to 99 based on the single outlier in the file. Row count is 1 because there is only 1 row in the current data frame. The historical data frame was provided for context and you can see those rows in the outlier drill-in. The customer_id is available in the data preview and can be used as an API hook to link back to the original dataset.

After you run an owlcheck using owl.owlcheck you might want to check individual scores to see what type of issues were in the data. Owl can send back the records with issues in the format of a DataFrame using the notebook cmds or JSON from the REST api.
val hoot = owl.hoot
println(s"SHAPE ${hoot.shapeScore} ")
println(s"DUPE ${hoot.dupeScore} ")
println(s"OUTLIER ${hoot.outlierScore} ")
println(s"PATTERN ${hoot.patternScore} ")
println(s"RECORD ${hoot.recordScore} ")
println(s"SCHEMA ${hoot.schemaScore} ")
println(s"BEHAVIOR${hoot.behaviorScore} ")
println(s"SOURCE ${hoot.sourceScore} ")
println(s"RULES ${hoot.ruleScore} ")
if (hoot.shapeScore > 0) {
owl.getShapeRecords.show
}
if (hoot.dupeScore > 0) {
owl.getDupeRecords.show
}
+-------+---------+--------------------+--------+-----------+-------+------+
|row_cnt|obs_score| row_key|obs_type|customer_id| fname|owl_id|
+-------+---------+--------------------+--------+-----------+-------+------+
| 21| 46|afa89984ce472a409...| DUPE| 32| Kirk| 1|
| 22| 46|afa89984ce472a409...| DUPE| 31|Kirk's.| 2|
| 23| 60|41ea2d828b1a5fbf2...| DUPE| 30| Dan| 3|
| 24| 60|41ea2d828b1a5fbf2...| DUPE| 27| Dan| 6|
+---------------+--------------------+--------+----------+--------------+--------+-------+-------+---+--------------------+-----------+-------+------+--------+
| dataset| run_id|col_name|col_format|col_format_cnt|owl_rank|row_cnt|row_key|age| app_date|customer_id| fname|owl_id|time_bin|
+---------------+--------------------+--------+----------+--------------+--------+-------+-------+---+--------------------+-----------+-------+------+--------+
|dataset_outlier|2018-02-24 00:00:...| fname| xxxx'x.| 1| 1| 2|xxxx'x.| 18|2018-02-24 00:00:...| 31|Kirk's.| 2| null|
+---------------+--------------------+--------+----------+--------------+--------+-------+-------+---+--------------------+-----------+-------+------+--------+
get
http://$host
/v2/getoutlier?dataset=dataset_outlier&runId=2018-02-24
GetOutlier
get
http://$host
/v2/getdatashapes?dataset=dataset_outlier&runId=2018-02-24
GetShape
Last modified 6mo ago