AWS Databricks

Getting started

First, use vanilla Spark code to set up connection properties and access a database table via Spark JDBC. The entire code example is available at the end for copy and paste.
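These are the relevant lines from the full snippet at the end; `host`, `user`, and `pass` are placeholders you must define for your own environment (for example, from Databricks secrets), and the snippet assumes a notebook with a live `spark` session:

```scala
//--- JDBC connection to the source database (Postgres in this example) ---//
// host, user and pass are placeholders for your environment
val url = s"jdbc:postgresql://${host}:5432/postgres?currentSchema=owl_test"
val connectionProps = new java.util.Properties()
connectionProps.setProperty("driver", "org.postgresql.Driver")
connectionProps.setProperty("user", user)
connectionProps.setProperty("password", pass)

//--- Load, cache and count the table via Spark JDBC ---//
val jdbcDF2 = spark.read.jdbc(url, "owl_test.nyse", connectionProps)
jdbcDF2.printSchema  // produces the schema output shown below
jdbcDF2.cache
jdbcDF2.count
```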

Schema output, Row Count and Runtime

root
 |-- EXCH: string (nullable = true)
 |-- SYMBOL: string (nullable = true)
 |-- TRADE_DATE: date (nullable = true)
 |-- OPEN: decimal(9,3) (nullable = true)
 |-- HIGH: decimal(9,3) (nullable = true)
 |-- LOW: decimal(9,3) (nullable = true)
 |-- CLOSE: decimal(9,3) (nullable = true)
 |-- VOLUME: integer (nullable = true)
 |-- PART_DATE_STR: date (nullable = true)

Row Count: 102,817
Runtime: 00:00:03

Next Configure Owl Options and Point to Owl Metastore

This requires that the Owl libraries have been imported into your notebook or Databricks environment.
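For reference, these are the imports and metastore options from the full snippet at the end; `host`, `user`, and `pass` here refer to your Owl metastore credentials, not the source database:

```scala
//--- Owl Library Imports ---//
import com.owl.common.options._
import com.owl.core.Owl
import com.owl.core.util.OwlUtils

val opt = new OwlOptions()
//--- Point Owl at its Postgres metastore ---//
opt.host = s"${host}"
opt.port = s"5432/postgres?currentSchema=public"
opt.pgUser = s"$user"
opt.pgPassword = s"$pass"
```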

Next Run a Profile

+-------------+-----+-------+-----------+--------+-----------+------+----+------+-------+-------+------+------+---------+
|       column|nulls|empties|cardinality|is_mixed|mixed_ratio|   Int|Long|String|Decimal|Boolean|Double|  Date|Timestamp|
+-------------+-----+-------+-----------+--------+-----------+------+----+------+-------+-------+------+------+---------+
|         HIGH|    0|      0|      19159|   false|        0.0|     0|   0|     0| 102817|      0|     0|     0|        0|
|       SYMBOL|    0|      0|       3137|   false|        0.0|     0|   0|102817|      0|      0|     0|     0|        0|
|          LOW|    0|      0|      18845|   false|        0.0|     0|   0|     0| 102817|      0|     0|     0|        0|
|       VOLUME|    0|      0|      25856|   false|        0.0|102817|   0|     0|      0|      0|     0|     0|        0|
|   TRADE_DATE|    0|      0|         33|   false|        0.0|     0|   0|     0|      0|      0|     0|102817|        0|
|         EXCH|    0|      0|          2|   false|        0.0|     0|   0|102817|      0|      0|     0|     0|        0|
|        CLOSE|    0|      0|      15781|   false|        0.0|     0|   0|     0| 102817|      0|     0|     0|        0|
|PART_DATE_STR|    0|      0|         33|   false|        0.0|     0|   0|     0|      0|      0|     0|102817|        0|
|         OPEN|    0|      0|      16013|   false|        0.0|     0|   0|     0| 102817|      0|     0|     0|        0|
+-------------+-----+-------+-----------+--------+-----------+------+----+------+-------+-------+------+------+---------+
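The profile table above is produced by the pipeline commands from the full snippet at the end; with the Owl context initialized, it is a register call plus one DataFrame:

```scala
//--- Register the run with the metastore, then pull the profile ---//
owl.register(opt)
val profile = owl.profileDF
profile.show(false)  // renders the profile table shown above
```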

Next Check for Duplicates

Notice that a duplicate is discovered: the NYSE AAN record exists twice on 10/1/2018. This should not happen, as end-of-day stock data should have only one record per stock symbol. A great DQ finding.
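The duplicate check is driven by these options from the full snippet at the end; the comments are assumptions about their meaning based on the values used here:

```scala
opt.dupe.on = true
opt.dupe.lowerBound = 99                    // match-score threshold (assumed to be a percent)
opt.dupe.include = Array("SYMBOL", "EXCH")  // columns checked for duplicate values
// ... after owl.register(opt):
val dupes = owl.dupesDF                     // duplicate findings as a DataFrame
```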

Next Scan for Outliers

Notice that KOD.W, the camera company Kodak, commonly trades at less than 2 cents and here jumps to $2.35. Absolutely an outlier. This was a news event named "KodakCoin"; google it.
+-----+------+---------+----------+----------+
|  key|column|    value|prediction|confidence|
+-----+------+---------+----------+----------+
|TPG.E|VOLUME|  23400.0|       0.0|         0|
|MTB-C|VOLUME|      0.0|     100.0|         0|
|KOD.W|  OPEN|     2.35|     0.015|         1|
+-----+------+---------+----------+----------+
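The outlier scan is configured by these options from the full snippet at the end; the comments are assumptions inferred from the values, not official option documentation:

```scala
opt.outlier.on = true
opt.outlier.lookback = 6                     // history window to model against (6 DAY bins here)
opt.outlier.dateColumn = "TRADE_DATE"        // column that orders the time series
opt.outlier.timeBin = OutlierOpt.TimeBin.DAY
opt.outlier.key = Array("SYMBOL")            // one series per stock symbol
opt.outlier.measurementUnit = "VOLUME=100000000,HIGH=0.1,LOW=0.1,OPEN=0.1,CLOSE=0.1"
// ... after owl.register(opt):
val outliers = owl.outliersDF                // outlier findings as a DataFrame
```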

Entire Code Snippet

//--- GCP Postgres Connection ---//
val url = s"jdbc:postgresql://${host}:5432/postgres?currentSchema=owl_test"
val connectionProps = new java.util.Properties()
connectionProps.setProperty("driver", "org.postgresql.Driver")
connectionProps.setProperty("user", user)
connectionProps.setProperty("password", pass)
connectionProps.setProperty("connectionUrl", url)

//--- Load DataFrame From GCP Postgres ---//
val jdbcDF2 = spark.read.jdbc(url, "owl_test.nyse", connectionProps)
jdbcDF2.printSchema
jdbcDF2.cache
jdbcDF2.count

//--- Owl Library Imports ---//
import com.owl.common.options._
import com.owl.core.Owl
import com.owl.core.util.OwlUtils

val opt = new OwlOptions()
//--- Owl Metastore ---//
opt.host = s"${host}"
opt.port = s"5432/postgres?currentSchema=public"
opt.pgUser = s"$user"
opt.pgPassword = s"$pass"

//--- Run Options ---//
opt.dataset = "nyse_notebook_pipeline"
opt.runId = "2018-01-10"
opt.datasetSafeOff = true

opt.dupe.on = true
opt.dupe.lowerBound = 99
opt.dupe.include = Array("SYMBOL", "EXCH")

opt.outlier.on = true
opt.outlier.lookback = 6
opt.outlier.dateColumn = "TRADE_DATE"
opt.outlier.timeBin = OutlierOpt.TimeBin.DAY
opt.outlier.key = Array("SYMBOL")
opt.outlier.measurementUnit = "VOLUME=100000000,HIGH=0.1,LOW=0.1,OPEN=0.1,CLOSE=0.1"

//--- Initialize Owl ---//
val currentDay = jdbcDF2.where(s"TRADE_DATE = '${opt.runId}'")
val owl = OwlUtils.OwlContextWithHistory(dfCurrent = currentDay,
                                         dfHist = jdbcDF2, opt = opt)

//--- Pipeline Cmds ---//
owl.register(opt)
val profile = owl.profileDF
val outliers = owl.outliersDF
val dupes = owl.dupesDF

Required configuration

  • Databricks Runtime: 5.4 (includes Apache Spark 2.4.3, Scala 2.11)
  • Python Version: 3

Required libraries

  • OWL jar file. Example: owl_core_trunk_jar_with_dependencies.jar
  • JDBC driver: org.springframework:spring-jdbc:4.3.16.RELEASE
  • Database-specific JDBC connector drivers. Example #1: mysql:mysql-connector-java:8.0.17. Example #2: org.postgresql:postgresql:jar:42.2.8