Simple

Load a Table using Spark JDBC

//--- Configure Table From Database ---//
val connProps = Map(
  "driver"   -> "org.postgresql.Driver",
  "user"     -> s"${user}",
  "password" -> s"${pass}",
  "url"      -> s"jdbc:postgresql://${host}:${port}/${database}",
  "dbtable"  -> "owl_test.nyse"
)

//--- Load Spark DataFrame ---//
val jdbcDF = spark.read.format("jdbc").options(connProps).load
jdbcDF.show

Configure Owl Options

Connect to Owl's metadata database and control DQ scan options, then wrap the Spark DataFrame with an Owl context.
import com.owl.common.options._
import com.owl.core.util.OwlUtils

val opt = new OwlOptions
//--- Owl Metastore ---//
opt.host = s"$owlHost"
opt.port = s"5432/postgres?currentSchema=public"
opt.pgUser = s"$owlUser"
opt.pgPassword = s"$owlPass"
//--- Run Options ---//
opt.dataset = "owl_test.nyse"
opt.runId = "2018-01-10"
opt.datasetSafeOff = true

val owl = OwlUtils.OwlContext(jdbcDF, opt)

Register with Catalog and Run Profile

//--- Register with Owl Catalog ---//
owl.register(opt)

//--- Profile Dataset ---//
val profile = owl.profileDF
profile.show
Notice that Owl returns its results as DataFrames. This is a powerful abstraction: you can ignore domain objects and custom types and interact with a scalable, generic result set using familiar operations like where, filter, save, and write, all executed in parallel.
+--------------+-----+-------+-----------+---+---+--------+-----------+------+----+------+-------+-------+------+----+---------+
|        column|nulls|empties|cardinality|min|max|is_mixed|mixed_ratio|   Int|Long|String|Decimal|Boolean|Double|Date|Timestamp|
+--------------+-----+-------+-----------+---+---+--------+-----------+------+----+------+-------+-------+------+----+---------+
|     tenant_id|    0|      0|         60|  0|  9|   false|        0.0|100000|   0|     0|      0|      0|     0|   0|        0|
|           a11|    0|      0|          1|a11|a11|   false|        0.0|     0|   0|100000|      0|      0|     0|   0|        0|
|           a10|    0|      0|          1|a10|a10|   false|        0.0|     0|   0|100000|      0|      0|     0|   0|        0|
|  account_type|    0|      0|          3| 02| 06|   false|        0.0|100000|   0|     0|      0|      0|     0|   0|        0|
|           a13|    0|      0|          1|a13|a13|   false|        0.0|     0|   0|100000|      0|      0|     0|   0|        0|
|security_alias|    0|      0|          3|  0|  2|   false|        0.0|100000|   0|     0|      0|      0|     0|   0|        0|
|           a12|    0|      0|          1|a12|a12|   false|        0.0|     0|   0|100000|      0|      0|     0|   0|        0|
+--------------+-----+-------+-----------+---+---+--------+-----------+------+----+------+-------+-------+------+----+---------+
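The per-column statistics above can be reproduced conceptually without Spark. Below is a minimal plain-Scala sketch of how nulls, empties, cardinality, min, and max might be derived for a single string column; Owl's actual profiler computes these in parallel across every column of the DataFrame, so the `ColumnProfile` case class and `profileColumn` helper here are illustrative only, not part of the Owl API.

```scala
// Conceptual per-column profile: nulls, empties, cardinality, min, max.
// Plain-Scala stand-in for what Owl computes for each DataFrame column.
case class ColumnProfile(nulls: Int, empties: Int, cardinality: Int,
                         min: String, max: String)

def profileColumn(values: Seq[Option[String]]): ColumnProfile = {
  val nulls    = values.count(_.isEmpty)          // missing entries
  val present  = values.flatten                   // non-null entries
  val empties  = present.count(_.trim.isEmpty)    // present but blank
  val nonEmpty = present.filter(_.trim.nonEmpty)
  ColumnProfile(
    nulls       = nulls,
    empties     = empties,
    cardinality = nonEmpty.distinct.size,
    min         = if (nonEmpty.isEmpty) "" else nonEmpty.min,
    max         = if (nonEmpty.isEmpty) "" else nonEmpty.max
  )
}

val symbols = Seq(Some("AAPL"), Some("MSFT"), Some("AAPL"), None, Some(""))
val p = profileColumn(symbols)
// p: ColumnProfile(nulls = 1, empties = 1, cardinality = 2, min = "AAPL", max = "MSFT")
```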

Profile UI

While Spark's DF.show() is a convenient output format, you may prefer a rich UI that tracks data tests over time. The UI provides trend analysis, data drift, data relationships, and more.

Duplicates

Take duplicate detection as an example: a common use-case is a business wanting to ensure a table contains no repeated or duplicate records. Set the lowerBound to the percent fuzzy match you are willing to accept; commonly 87% or higher is an interesting match. You may also want to limit the check to a single day, week, or month within which duplicates should not occur. Notice the .where function, which is used to pass a custom DataFrame into the Owl context.
opt.dupe.on = true
opt.dupe.lowerBound = 99
opt.dupe.include = Array("SYMBOL", "EXCH")

val df1Day = jdbcDF.where("TRADE_DATE = '2018-01-10' ")
val owl = OwlUtils.OwlContext(df1Day, opt)

val dupes = owl.dupesDF
dupes.show

// rdd collect
dupes.rdd.collect.foreach(println)

// records linked together for remediation
owl.getDupeRecords.show
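Owl's similarity scoring is internal to the library, but the lowerBound threshold can be read as a percent similarity between two candidate records. As a rough sketch of the idea, the `similarityPct` helper below uses Levenshtein edit distance as a hypothetical stand-in metric; it is not the metric Owl actually uses.

```scala
// Percent similarity between two strings via Levenshtein edit distance.
// Hypothetical stand-in for Owl's internal fuzzy-match score: a
// lowerBound of 87 would keep pairs scoring 87% or higher.
def levenshtein(a: String, b: String): Int = {
  // dist(i)(j) = edits to turn the first i chars of a into the first j of b
  val dist = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
    if (i == 0) j else if (j == 0) i else 0
  }
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    dist(i)(j) = math.min(math.min(dist(i - 1)(j) + 1, dist(i)(j - 1) + 1),
                          dist(i - 1)(j - 1) + cost)
  }
  dist(a.length)(b.length)
}

def similarityPct(a: String, b: String): Double = {
  val maxLen = math.max(a.length, b.length)
  if (maxLen == 0) 100.0
  else (1.0 - levenshtein(a, b).toDouble / maxLen) * 100.0
}

similarityPct("APPLE INC", "APPLE  INC")  // 90.0: would pass a lowerBound of 87
```

A lowerBound of 99, as in the example above, effectively restricts matches to near-exact duplicates.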

Outliers

Gaining an understanding of your outliers is a commonly desired DQ function. Owl has several configurations to help find the most meaningful outliers in your dataset and over time. The example below compares the current day against a baseline of days in the historical DataFrame.
opt.outlier.on = true
opt.outlier.lookback = 6
opt.outlier.dateColumn = "TRADE_DATE"
opt.outlier.timeBin = OutlierOpt.TimeBin.DAY
opt.outlier.key = Array("SYMBOL")

val df1Day = jdbcDF2.where("TRADE_DATE = '2018-01-10' ")
val owl = OwlUtils.OwlContextWithHistory(dfCurrent = df1Day, dfHist = jdbcDF2, opt = opt)
val outliers = owl.outliersDF
outliers.show
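Conceptually, comparing the current day against a historical baseline reduces to a per-key deviation test. Owl's detection is considerably more sophisticated, but a minimal z-score sketch in plain Scala illustrates the idea; the `isOutlier` helper and the sample prices are assumptions for illustration, not Owl API.

```scala
// Flag a current value as an outlier when it deviates from the
// historical baseline by more than `threshold` standard deviations.
// Simplified stand-in for Owl's per-key, per-time-bin outlier logic.
def isOutlier(history: Seq[Double], current: Double,
              threshold: Double = 3.0): Boolean = {
  val mean   = history.sum / history.size
  val stdDev = math.sqrt(history.map(v => math.pow(v - mean, 2)).sum / history.size)
  stdDev > 0 && math.abs(current - mean) / stdDev > threshold
}

// Six-day baseline of closing prices for one SYMBOL (lookback = 6),
// then today's value.
val baseline = Seq(101.0, 99.5, 100.2, 100.8, 99.9, 100.6)
isOutlier(baseline, 100.4)  // false: within normal variation
isOutlier(baseline, 140.0)  // true: far outside the baseline
```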