Cluster Health Report
How much is your redundant data costing you?

Reclaim Gigabytes of Redundant Data

As data engineers, we first copy files into a landing zone, then load the files into a staging area. After that, we transform (ETL) the data into the final table. Soon that same data is copied to a lake so other groups can run analytics on it. Eventually a group of analysts needs the data in another format, and a data engineer copies it again in a newly joined or transposed form. Sound familiar?
The result is that the same data, or similar columns of the same data, gets copied many times. One answer is to buy more hardware. Another is to run an Owl health report to understand how much data could be removed, reclaim disk space, and see a return on investment right after clicking the button.

Tabular breakdown of % fingerprint match
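
To make the idea of a per-column match percentage concrete, here is a minimal sketch in plain Scala. It is not Owl's fingerprinting algorithm, which this page does not describe. It simply assumes that a "match" means the share of distinct values in one column that also appear in another. The `Table` type, `matchPercent`, and `compare` are hypothetical names used only for illustration.

```scala
// Illustrative only: NOT Owl's fingerprinting algorithm.
object ColumnOverlapSketch {
  // Hypothetical model: a table is a map from column name to its values.
  type Table = Map[String, Seq[String]]

  /** Percentage of distinct values in `left` that also occur in `right`. */
  def matchPercent(left: Seq[String], right: Seq[String]): Double = {
    val l = left.toSet
    if (l.isEmpty) 0.0
    else {
      val r = right.toSet
      100.0 * l.count(r.contains) / l.size
    }
  }

  /** Match percentage for every (column of a, column of b) pair, highest first. */
  def compare(a: Table, b: Table): Seq[(String, String, Double)] = {
    val pairs = for {
      (colA, valsA) <- a.toSeq
      (colB, valsB) <- b.toSeq
    } yield (colA, colB, matchPercent(valsA, valsB))
    pairs.sortBy { case (_, _, pct) => -pct }
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample data, small enough to check by hand.
    val account: Table = Map(
      "id"       -> Seq("1", "2", "3", "4"),
      "acc_name" -> Seq("ann", "bob", "cy", "dee")
    )
    val userAccount: Table = Map(
      "acc_name" -> Seq("ann", "bob", "cy", "eve")
    )
    compare(account, userAccount).foreach { case (colA, colB, pct) =>
      println(f"$colA%-10s $colB%-10s $pct%.1f%%")
    }
  }
}
```

In the sample data, three of the four distinct `acc_name` values appear in both tables, so that pair scores 75%, while `id` shares no values with `acc_name`. Comparing distinct value sets is a deliberate simplification: it ignores row counts and duplicates, and it only works when both tables fit in memory.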

Sometimes it's not as simple as comparing two tables from the same database. Owl allows a technical user to set up multiple DB connections before executing an Owl health check.
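import com.owl.common.Props
import com.owl.core.Owl
// Needed for .asJava on the connection list below
import scala.collection.JavaConverters._

// Note: Connection must also be in scope; its import is not shown in the original example.

// First connection: the source account table
val c1 = new Connection()
c1.dataset = "silo.account"
c1.user = "user"
c1.password = "pass"
c1.query = "select id, networth, acc_name, acc_branch from silo.account limit 200000"
c1.url = "jdbc:mysql://owldatalake.chzid9w0hpyi.us-east-1.rds.amazonaws.com:3306"

// Second connection: a dataset that overlaps with the first
val c2 = new Connection()
c2.dataset = "silo.user_account"
c2.user = "user"
c2.password = "pass"
c2.query = "SELECT acc_name, acc_branch, networth FROM silo.account limit 200000"
c2.url = "jdbc:mysql://owldatalake.chzid9w0hpyi.us-east-1.rds.amazonaws.com:3306"

// Health check properties: name the run and register both connections
val props = new Props()
props.dataset = "colMatchTest1"
props.runId = "2017-02-04"
props.connectionList = List(c1,c2).asJava
// Column-match settings (batch size and duration in minutes, per the property names)
props.colMatchBatchSize = 2
props.colMatchDurationMins = 3

// Run the column match and display the results
val matchDF = new Owl(props).colMatchDF
matchDF.show

// Register the results as a temp view so they can be queried with SQL
matchDF.createOrReplaceTempView("matches")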

High level view of data overlap

Last modified 2yr ago