Intraday Positions
It is common for financial organizations to receive a steady stream of files containing hourly or minute-level data. The files might trail the market in near real time. Below is an example directory layout with one file per hour:
    --positions/
       |--2019/
          |--01/
             |--22/
                position_2019_01_22_09.csv
                position_2019_01_22_10.csv
                position_2019_01_22_11.csv
                position_2019_01_22_12.csv
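
As an illustration of this layout, the path for a given hour can be derived from a timestamp. This is a minimal sketch in plain Scala, not part of Owl: the names `PositionPaths` and `pathFor` are hypothetical, and the `position_yyyy_MM_dd_HH.csv` naming pattern is assumed from the listing above.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object PositionPaths {
  // Assumed pattern: positions/yyyy/MM/dd/position_yyyy_MM_dd_HH.csv
  private val dirFormat  = DateTimeFormatter.ofPattern("yyyy/MM/dd")
  private val fileFormat = DateTimeFormatter.ofPattern("yyyy_MM_dd_HH")

  // Builds the relative path of the hourly position file for a timestamp.
  def pathFor(ts: LocalDateTime): String =
    s"positions/${ts.format(dirFormat)}/position_${ts.format(fileFormat)}.csv"
}
```

For example, `PositionPaths.pathFor(LocalDateTime.of(2019, 1, 22, 9, 0))` returns `positions/2019/01/22/position_2019_01_22_09.csv`.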

File Contents @ 9am

    TIME              COMPANY    TICK  SIDE   QTY
    2019-01-22 09:00  T&G        xyz   LONG   300
    2019-01-22 09:00  Fisher     abc   SHORT  20
    2019-01-22 09:00  TradeServ  def   LONG   120

File Contents @ 10am

    TIME              COMPANY    TICK  SIDE   QTY
    2019-01-22 10:00  T&G        xyz   LONG   280
    2019-01-22 10:00  BlackTR    ghi   SHORT  45

Notice that a position may or may not be recorded for every company in each file during the day. We need a way to link each company to its position throughout the day without alerting when a company simply did not trade or adjust its position. Owl offers real-time outlier detection for this scenario (see code snippet below). We also need to make sure that each company's position is represented only once per file (per hour in this case): positions are already the aggregate view of the trades, so they should be unique. Owl offers duplicate detection (see code snippet below).
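
To make the two requirements concrete, here is a minimal sketch in plain Scala collections, independent of Owl. It does not show how Owl implements outlier or duplicate detection; the `Position` case class and the helper names are hypothetical. `duplicateKeys` flags any (COMPANY, TICK) pair that appears more than once in a single file, and `quantityByCompany` links each company to its hourly quantities, leaving hours with no record absent rather than treating them as errors.

```scala
object PositionChecks {
  // One row of a position file, with the columns shown in the tables above.
  final case class Position(time: String, company: String, tick: String, side: String, qty: Int)

  // (COMPANY, TICK) pairs that occur more than once within a single file.
  def duplicateKeys(rows: Seq[Position]): Set[(String, String)] =
    rows.groupBy(r => (r.company, r.tick))
      .filter { case (_, group) => group.size > 1 }
      .keySet

  // Company -> (time -> quantity). A company missing from a file has no entry for that hour.
  def quantityByCompany(files: Seq[Seq[Position]]): Map[String, Map[String, Int]] =
    files.flatten
      .groupBy(_.company)
      .map { case (company, rows) => company -> rows.map(r => r.time -> r.qty).toMap }

  // Sample data copied from the 9am and 10am file contents above.
  val nineAm: Seq[Position] = Seq(
    Position("2019-01-22 09:00", "T&G", "xyz", "LONG", 300),
    Position("2019-01-22 09:00", "Fisher", "abc", "SHORT", 20),
    Position("2019-01-22 09:00", "TradeServ", "def", "LONG", 120)
  )
  val tenAm: Seq[Position] = Seq(
    Position("2019-01-22 10:00", "T&G", "xyz", "LONG", 280),
    Position("2019-01-22 10:00", "BlackTR", "ghi", "SHORT", 45)
  )
}
```

With the sample data, `duplicateKeys(nineAm)` is empty, and `quantityByCompany(Seq(nineAm, tenAm))("T&G")` holds both the 09:00 and 10:00 quantities, while `"Fisher"` has only a 09:00 entry.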

Owl DQ Pipeline

    import java.io.File

    // Part of your pipeline includes the ingestion of files that have the date
    // and hour encoded in the file name. How do you process those files using Owl?
    //
    // Format: <name>_<year>_<month>_<day>_<hour>.csv
    //
    // The Owl classes (OwlOptions, OwlUtils, TimeBin) and the SparkSession `spark`
    // are assumed to be in scope.

    val filePath = "positions/2019/01/22/position_2019_01_22_09.csv" // <set this>
    val file = new File(filePath)

    // Configure Owl.
    val opt = new OwlOptions
    opt.dataset = "positions"
    opt.load.delimiter = ","
    opt.load.fileQuery = "select * from dataset"
    opt.load.filePath = file.getPath

    // Outlier detection: follow each company's position from hour to hour.
    opt.outlier.on = true
    opt.outlier.key = Array("COMPANY")
    opt.outlier.timeBin = TimeBin.HOUR

    // Duplicate detection: a company/ticker pair should appear once per file.
    opt.dupe.on = true
    opt.dupe.include = Array("COMPANY", "TICK")
    opt.dupe.exactMatch = true

    // Parse the filename to construct the run date (-rd) that will be passed
    // to Owl. For position_2019_01_22_09 the parts are
    // (position, 2019, 01, 22, 09).
    val name = file.getName.split('.').head
    val parts = name.split("_")
    val date = parts.slice(1, 4).mkString("-")
    val hour = parts.last

    // Must be in format 'yyyy-MM-dd' or 'yyyy-MM-dd HH:mm'.
    val rd = s"${date} ${hour}:00"

    // Tell Owl which run this data belongs to.
    opt.runId = rd

    // Create a DataFrame from the file.
    val df = OwlUtils.load(opt.load.filePath, opt.load.delimiter, spark)

    // Instantiate an OwlContext with the dataframe and our custom configuration.
    val owl = OwlUtils.OwlContext(df, spark, opt)

    // Make sure Owl has catalogued the dataset.
    owl.register(opt)

    // Let Owl do the rest!
    owl.owlCheck
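
The filename-parsing step can be isolated and checked without Spark or Owl. The following is a sketch under the assumption that files are named `<name>_<year>_<month>_<day>_<hour>.csv`, as in the directory listing at the top of this page. `RunDate` and `fromFileName` are hypothetical names and are not part of Owl.

```scala
object RunDate {
  // Matches names such as position_2019_01_22_09.csv and captures
  // year, month, day, and hour. The whole name must match.
  private val HourlyFile = """.*_(\d{4})_(\d{2})_(\d{2})_(\d{2})\.csv""".r

  // Returns the run date as 'yyyy-MM-dd HH:mm', or None if the name does not match.
  def fromFileName(fileName: String): Option[String] = fileName match {
    case HourlyFile(year, month, day, hour) => Some(s"$year-$month-$day $hour:00")
    case _                                  => None
  }
}
```

Returning an `Option` lets the caller skip or report unexpected files instead of passing a malformed run date along. For example, `RunDate.fromFileName("position_2019_01_22_09.csv")` yields `Some("2019-01-22 09:00")`.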

Owl Web

DQ Coverage for Position data

    Schema Evolution
    Profiling
    Correlation Analysis
    Segmentation
    Outlier Detection
    Duplicate Detection
    Pattern Mining