Outliers
This real-life use case arises when you have a large file or data frame containing many days of data, but you want the run profile to cover only the current day so that it trends properly over time. A further nuance is that customer_id is a field unique to the user and should not show up in the analytics, for example as an outlier. The customer_id should still be available when the user queries the REST API endpoints, where it is used to link back to the user's original dataset. A Bloomberg ID (BB_ID) is a common example.

CSV File

fname,app_date,age,customer_id
Kirk,2018-02-24,18,31
Kirk,2018-02-23,11,4
Kirk,2018-02-22,10,3
Kirk,2018-02-21,12,2
Kirk,2018-02-20,10,1

Notebook Code (Spark Scala)

import org.apache.spark.sql.SparkSession

// Path to the sample CSV on the classpath
val filePath = getClass.getResource("/notebooktest.csv").getPath

val spark = SparkSession.builder
  .master("local")
  .appName("test")
  .getOrCreate()

val opt = new OwlOptions()
opt.dataset = "dataset_outlier"
opt.runId = "2018-02-24"
opt.outlier.on = true
opt.outlier.key = Array("fname")
opt.outlier.dateColumn = "app_date"
opt.outlier.timeBin = OutlierOpt.TimeBin.DAY
opt.outlier.lookback = 5
// Keep customer_id out of the outlier analytics
opt.outlier.excludes = Array("customer_id")

// Historical frame: every day in the file
val dfHist = OwlUtils.load(filePath = filePath, delim = ",", sparkSession = spark)
// Current frame: only the rows for the run date
val dfCurrent = dfHist.where(s"app_date = '${opt.runId}' ")

val owl = OwlUtils.OwlContextWithHistory(dfCurrent=dfCurrent, dfHist=dfHist, opt=opt)
owl.register(opt)
owl.owlCheck()

Owl Web UI

The score drops from 100 to 99 because of the single outlier in the file. The row count is 1 because there is only one row in the current data frame. The historical data frame was provided for context, and you can see those rows in the outlier drill-in. The customer_id is available in the data preview and can be used as an API hook to link back to the original dataset.
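
The row count of 1 follows directly from how the current frame is filtered out of the historical one. The following is a minimal sketch of that split using plain Scala collections in place of Spark DataFrames, so it runs without Spark or Owl. The names AppRow and RunSplit are hypothetical and are not part of the Owl API.

```scala
// Illustrative only: plain Scala collections stand in for the Spark DataFrames.
// AppRow and RunSplit are hypothetical names, not part of the Owl API.
final case class AppRow(fname: String, appDate: String, age: Int, customerId: Int)

object RunSplit {
  // The sample CSV from this page, header included.
  val csv: String =
    """fname,app_date,age,customer_id
      |Kirk,2018-02-24,18,31
      |Kirk,2018-02-23,11,4
      |Kirk,2018-02-22,10,3
      |Kirk,2018-02-21,12,2
      |Kirk,2018-02-20,10,1""".stripMargin

  // Parse every non-header line into an AppRow.
  def parse(text: String): List[AppRow] =
    text.linesIterator.drop(1).map(_.split(",")).collect {
      case Array(f, d, a, c) => AppRow(f, d, a.trim.toInt, c.trim.toInt)
    }.toList

  // Rows whose app_date equals the run id, mirroring the where() filter in the notebook code.
  def current(rows: List[AppRow], runId: String): List[AppRow] =
    rows.filter(_.appDate == runId)

  def main(args: Array[String]): Unit = {
    val hist = parse(csv)
    val cur = current(hist, "2018-02-24")
    // prints: history rows = 5, current rows = 1
    println(s"history rows = ${hist.size}, current rows = ${cur.size}")
  }
}
```

The history keeps all five days for context, while only the single row dated 2018-02-24 is profiled as the current run.
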
After you run an owlcheck with owl.owlCheck(), you might want to check individual scores to see what types of issues are in the data. Owl can return the records with issues as a DataFrame using the notebook commands, or as JSON from the REST API.

val hoot = owl.hoot

println(s"SHAPE ${hoot.shapeScore} ")
println(s"DUPE ${hoot.dupeScore} ")
println(s"OUTLIER ${hoot.outlierScore} ")
println(s"PATTERN ${hoot.patternScore} ")
println(s"RECORD ${hoot.recordScore} ")
println(s"SCHEMA ${hoot.schemaScore} ")
println(s"BEHAVIOR ${hoot.behaviorScore} ")
println(s"SOURCE ${hoot.sourceScore} ")
println(s"RULES ${hoot.ruleScore} ")

// Show the offending records only for categories that scored points
if (hoot.shapeScore > 0) {
  owl.getShapeRecords.show
}
if (hoot.dupeScore > 0) {
  owl.getDupeRecords.show
}

+-------+---------+--------------------+--------+-----------+-------+------+
|row_cnt|obs_score|             row_key|obs_type|customer_id|  fname|owl_id|
+-------+---------+--------------------+--------+-----------+-------+------+
|     21|       46|afa89984ce472a409...|    DUPE|         32|   Kirk|     1|
|     22|       46|afa89984ce472a409...|    DUPE|         31|Kirk's.|     2|
|     23|       60|41ea2d828b1a5fbf2...|    DUPE|         30|    Dan|     3|
|     24|       60|41ea2d828b1a5fbf2...|    DUPE|         27|    Dan|     6|

+---------------+--------------------+--------+----------+--------------+--------+-------+-------+---+--------------------+-----------+-------+------+--------+
|        dataset|              run_id|col_name|col_format|col_format_cnt|owl_rank|row_cnt|row_key|age|            app_date|customer_id|  fname|owl_id|time_bin|
+---------------+--------------------+--------+----------+--------------+--------+-------+-------+---+--------------------+-----------+-------+------+--------+
|dataset_outlier|2018-02-24 00:00:...|   fname|   xxxx'x.|             1|       1|      2|xxxx'x.| 18|2018-02-24 00:00:...|         31|Kirk's.|     2|    null|
+---------------+--------------------+--------+----------+--------------+--------+-------+-------+---+--------------------+-----------+-------+------+--------+
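
Because customer_id travels with each returned record, it can serve as the key for linking a finding back to the original dataset. Here is a minimal sketch of that lookup in plain Scala, using the sample CSV rows and the customer_id 31 from the shape record. The names SourceRow, Finding and LinkBack are hypothetical and are not part of the Owl API.

```scala
// Hypothetical types for illustration; not part of the Owl API.
final case class SourceRow(fname: String, appDate: String, age: Int, customerId: Int)
final case class Finding(obsType: String, customerId: Int)

object LinkBack {
  // Original dataset, taken from the sample CSV on this page.
  val source: List[SourceRow] = List(
    SourceRow("Kirk", "2018-02-24", 18, 31),
    SourceRow("Kirk", "2018-02-23", 11, 4),
    SourceRow("Kirk", "2018-02-22", 10, 3),
    SourceRow("Kirk", "2018-02-21", 12, 2),
    SourceRow("Kirk", "2018-02-20", 10, 1)
  )

  // Index the source rows by customer_id once, so each lookup is a map access.
  val byCustomerId: Map[Int, SourceRow] =
    source.map(r => r.customerId -> r).toMap

  // Pair each finding with its source row; findings with an unknown id are dropped.
  def link(findings: List[Finding]): List[(Finding, SourceRow)] =
    findings.flatMap(f => byCustomerId.get(f.customerId).map(r => f -> r))

  def main(args: Array[String]): Unit = {
    // 31 exists in the sample CSV; 99 is a made-up id that matches nothing.
    val linked = link(List(Finding("SHAPE", 31), Finding("DUPE", 99)))
    linked.foreach { case (f, r) => println(s"${f.obsType} -> $r") }
  }
}
```

With Spark DataFrames the same idea would typically be expressed as a join on the customer_id column. The sketch uses a map lookup only to stay self-contained.
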

GET http://$host/v2/getoutlier?dataset=dataset_outlier&runId=2018-02-24
GetOutlier

GET http://$host/v2/getdatashapes?dataset=dataset_outlier&runId=2018-02-24
GetShape
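
The two endpoints can also be called from code. Below is a minimal sketch using the JDK's built-in HTTP client (Java 11 or later). The object name OwlRest and the fallback host are hypothetical, and the sketch assumes the endpoints are reachable over plain HTTP without authentication, which may not hold for your deployment.

```scala
import java.net.{URI, URLEncoder}
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.nio.charset.StandardCharsets

// Hypothetical helper for illustration; not part of the Owl API.
object OwlRest {
  // Build http://<host>/v2/<endpoint>?dataset=...&runId=... with URL-encoded values.
  def url(host: String, endpoint: String, dataset: String, runId: String): String = {
    def enc(s: String): String = URLEncoder.encode(s, StandardCharsets.UTF_8)
    s"http://$host/v2/$endpoint?dataset=${enc(dataset)}&runId=${enc(runId)}"
  }

  // Issue a GET request and return the response body as a string.
  // Assumption: no authentication headers are needed.
  def get(target: String): String = {
    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder(URI.create(target)).GET().build()
    client.send(request, HttpResponse.BodyHandlers.ofString()).body()
  }

  def main(args: Array[String]): Unit = {
    // The host is a placeholder; pass the real one as the first argument.
    val host = args.headOption.getOrElse("localhost")
    val outlierUrl = url(host, "getoutlier", "dataset_outlier", "2018-02-24")
    val shapeUrl = url(host, "getdatashapes", "dataset_outlier", "2018-02-24")
    println(outlierUrl)
    println(shapeUrl)
    // Uncomment once the host points at a running instance:
    // println(get(outlierUrl))
  }
}
```

The main method only prints the URLs, so the sketch runs without a live server. The get call is left commented out until the host is known.
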