Collibra DQ User Guide
2022.10

Profile (automatic)

Create profiles based on a table, view, or file
We've moved! To improve customer experience, the Collibra Data Quality User Guide has moved to the Collibra Documentation Center as part of the Collibra Data Quality 2022.11 release. To ensure a seamless transition, dq-docs.collibra.com will remain accessible, but the DQ User Guide is now maintained exclusively in the Documentation Center.
You can scan the entire dataset or apply custom filtering to limit the depth (row filtering) and width (column selection) of the scan.

Select the Scope

You can find detailed instructions about selecting the scope in the Explorer section. You can limit a scan by row count or by time, or run a full table scan if you have enough resources.
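The scope choices above (columns, row filter, row limit) can be pictured as the query that feeds the profile. This is a minimal sketch only: `build_scope_query`, the table name, and the column names are hypothetical, and this is not Collibra DQ's API. The `LIMIT` syntax also varies by DBMS.

```python
from typing import Optional, Sequence


def build_scope_query(
    table: str,
    columns: Optional[Sequence[str]] = None,
    where: Optional[str] = None,
    limit: Optional[int] = None,
) -> str:
    """Build a SELECT statement that narrows a scan by width, depth, and row limit."""
    # Width: an explicit column list, or every column when none is given.
    select_list = ", ".join(columns) if columns else "*"
    query = f"SELECT {select_list} FROM {table}"
    # Depth: an optional row filter, for example a date predicate.
    if where:
        query += f" WHERE {where}"
    # Row limit: cap the number of rows returned to the profiler.
    if limit is not None:
        query += f" LIMIT {limit}"
    return query


# Full table scan versus a scan scoped by columns, date, and row limit.
full_scan = build_scope_query("public.orders")
scoped_scan = build_scope_query(
    "public.orders",
    columns=["order_id", "amount", "order_date"],
    where="order_date = '2022-10-01'",
    limit=100000,
)
print(full_scan)
print(scoped_scan)
```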

Select Options (or leave defaults)

Save / Run

Profiling is on by default and is part of onboarding a dataset.

View the Results

Automatically Profile

Collibra Data Quality automatically profiles datasets over time to enable drill-in for detailed insights and automated data quality. A profile is just the first step toward extensive automatic discovery. Visualize segments of the dataset and how the dataset is changing over time.
Collibra DQ offers click or code options to run profiling.

Data Set Profile

Collibra Data Quality creates a detailed profile of each dataset under management. This profile will later be used to both provide insight and automatically identify data quality issues.

Pushdown Profiling

Collibra DQ can compute the profile of a dataset either with Spark (the default) or by using the data warehouse where the data lives as the engine (Profile Pushdown). When the profile is computed by the data source DBMS, the user can choose between two levels of pushdown:
  • Full Profile - Perform full profile calculation except for TopN
  • Count - Only perform row and column counts
The following database management systems support Profile Pushdown:
  • Impala
  • Hive
  • Snowflake
  • Presto
  • Teradata
  • SQL Server
  • Postgres
  • Redshift
  • MySQL
  • Oracle
  • DB2
Pushdown and parallel JDBC cannot be used together. If you are using pushdown, do not select the parallel JDBC option.
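To make the two pushdown levels described above concrete, the sketch below runs the aggregates inside the database, so only the results leave it. It uses an in-memory SQLite database purely as a stand-in for a warehouse. The table, columns, and SQL are illustrative assumptions, not the statements Collibra DQ generates.

```python
import sqlite3

# In-memory SQLite stands in for the data warehouse (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, state TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "NY"), (2, 25.5, "CA"), (3, None, "NY"), (4, 7.25, None)],
)

# "Count" level: only the row count and the column count.
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
column_count = len(conn.execute("SELECT * FROM orders LIMIT 0").description)

# "Full Profile" level: per-column aggregates computed by the database (no TopN).
nulls, distinct, minimum, maximum, mean = conn.execute(
    """
    SELECT SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END),
           COUNT(DISTINCT amount),
           MIN(amount),
           MAX(amount),
           AVG(amount)
    FROM orders
    """
).fetchone()
conn.close()

print(row_count, column_count)
print(nulls, distinct, minimum, maximum, mean)
```

Because only a handful of aggregate values are returned, the rows themselves never have to be moved to a separate compute engine.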

Profile Insights

By gathering a variety of statistics, Collibra DQ's profile can provide a great deal of insight about a dataset.
To see the difference between baseline (historical) and current values, Collibra DQ provides a Delta % change column. In the Delta % change column, data is represented in a pie chart for quick visualization of the changes.
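The source does not state the exact formula behind the Delta % change column. As a rough sketch, assuming it is the ordinary relative change between a baseline value and the current value, it could be computed as follows. The function name `delta_percent` and the zero-baseline handling are assumptions.

```python
import math


def delta_percent(baseline: float, current: float) -> float:
    """Percent change of the current value relative to the baseline."""
    if baseline == 0:
        # Relative change is undefined for a zero baseline.
        # Report 0 when nothing changed, otherwise a signed infinity.
        return 0.0 if current == 0 else math.copysign(math.inf, current)
    return (current - baseline) / abs(baseline) * 100.0


# Example: the null percentage of a column moved from 2.0 to 3.0.
null_delta = delta_percent(baseline=2.0, current=3.0)
print(null_delta)
```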
To elaborate on the quality metrics: the profile discovers whether each column is numeric or non-numeric, then reports the following metrics relative to that discovered type.
  • Filled - [1] Integer - The percentage of data that is numeric (or non-numeric) in a numeric (or non-numeric) discovered column.
  • Mixed - [String] Integer - The percentage of data that is non-numeric (or numeric) in a numeric (or non-numeric) discovered column.
  • Null - [] - The percentage of data that has no value at all.
  • Empty - [""] - The percentage of data that has a string instance of zero length.
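The four metrics above can be sketched for a single column as follows. This is an illustration under stated assumptions, not Collibra DQ's implementation: the discovered type is taken to be whichever kind of value (numeric or non-numeric) is in the majority, percentages use the total row count as the denominator, and the helper names are hypothetical.

```python
from typing import Dict, List, Optional


def is_numeric(value: str) -> bool:
    """Return True when the string parses as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def quality_metrics(values: List[Optional[str]]) -> Dict[str, float]:
    """Return Filled, Mixed, Null, and Empty percentages for one column."""
    total = len(values)
    if total == 0:
        return {"filled": 0.0, "mixed": 0.0, "null": 0.0, "empty": 0.0}
    nulls = sum(1 for v in values if v is None)
    empties = sum(1 for v in values if v == "")
    present = [v for v in values if v is not None and v != ""]
    numeric = sum(1 for v in present if is_numeric(v))
    non_numeric = len(present) - numeric
    # Assumption: the majority kind is the discovered type (Filled),
    # and the minority kind is reported as Mixed.
    filled, mixed = max(numeric, non_numeric), min(numeric, non_numeric)
    counts = {"filled": filled, "mixed": mixed, "null": nulls, "empty": empties}
    return {name: 100.0 * count / total for name, count in counts.items()}


metrics = quality_metrics(["1", "2", "3", "4", "5", "6", "abc", None, "", "7"])
print(metrics)
```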
The profile includes the following statistics:
  • Actual Datatype
  • Discovered Datatypes
  • Percent Null
  • Percent Empty
  • Percent Mixed Types
  • Cardinality
  • Minimum
  • Maximum
  • Mean
  • TopN / BottomN
  • Value Quartiles
  • Minimum (String) Length
  • Maximum (String) Length
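Several of the statistics listed above can be reproduced for one column with the Python standard library. The sample values are invented, and the quartile method (the default of `statistics.quantiles`) is an assumption, since the source does not say how Collibra DQ computes quartiles.

```python
import statistics
from collections import Counter

# Invented sample values for a single numeric column.
values = [3, 17, 17, 2, 9, 17, 2, 105]

counts = Counter(values)
lengths = [len(str(v)) for v in values]

profile = {
    "cardinality": len(counts),                       # number of distinct values
    "minimum": min(values),
    "maximum": max(values),
    "mean": statistics.mean(values),
    "quartiles": statistics.quantiles(values, n=4),   # three cut points
    "top_n": counts.most_common(2),                   # most frequent values
    "bottom_n": counts.most_common()[:-3:-1],         # least frequent values
    "min_length": min(lengths),                       # shortest value as a string
    "max_length": max(lengths),                       # longest value as a string
}

for name, value in profile.items():
    print(name, value)
```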

Sensitive Data Detection (Semantic)

Collibra Data Quality can automatically identify columns that contain common types of PII.
Collibra Data Quality can detect the following types of PII:
  • EMAIL
  • PHONE
  • ZIP CODE
  • STATE CD
  • CREDIT CARD
  • GENDER
  • SSN
  • IP ADDRESS
  • EIN
Once detected, Collibra Data Quality tags the column in the Profile with the discovered type and automatically applies a rule. The user can decline any discovered tag by clicking it and confirming the delete action. This action can also remove the rule associated with the tag.
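One common way to approach this kind of detection is pattern matching over a column's values. The sketch below is a simplified illustration of that idea and not Collibra DQ's detection logic: the regular expressions, the 80% threshold, and the `detect_pii` function are all assumptions, and only four of the listed types are covered.

```python
import re
from typing import List, Optional

# Hypothetical patterns for illustration; production rules would be stricter.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "IP ADDRESS": re.compile(r"(\d{1,3}\.){3}\d{1,3}"),
    "ZIP CODE": re.compile(r"\d{5}(-\d{4})?"),
}


def detect_pii(values: List[str], threshold: float = 0.8) -> Optional[str]:
    """Return the first tag whose pattern fully matches enough of the values."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return None
    for tag, pattern in PII_PATTERNS.items():
        matches = sum(1 for v in non_empty if pattern.fullmatch(v))
        # Tag the column only when most values look like this PII type.
        if matches / len(non_empty) >= threshold:
            return tag
    return None


print(detect_pii(["a@b.com", "jane@example.org"]))
print(detect_pii(["123-45-6789", "987-65-4321"]))
print(detect_pii(["hello", "world"]))
```

A threshold below 100% tolerates a few malformed values in an otherwise sensitive column, which is one reason a detected tag may still need review by a user.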

Correlation Matrix (Relationship)

Discover hidden relationships and measure the strength of those relationships.
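The source does not say which correlation measure is used. As a minimal sketch, assuming the Pearson coefficient between numeric columns, a correlation matrix can be built as follows. The column names and values are invented.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation; report 0 here.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Correlate every column with every other column."""
    return {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}


columns = {
    "quantity": [1.0, 2.0, 3.0, 4.0],
    "total": [10.0, 20.0, 30.0, 40.0],
    "discount": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, row)
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.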

Histograms

Often the first step in a data science project is to segment the data. Collibra Data Quality automatically does this using histograms.
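As an illustration of the segmentation a histogram performs, the sketch below splits a numeric column into equal-width bins and counts the values in each. The bin count and equal-width strategy are assumptions. The source does not describe how Collibra DQ chooses its bins.

```python
from typing import List, Tuple


def histogram(values: List[float], bins: int) -> List[Tuple[float, float, int]]:
    """Return (lower edge, upper edge, count) for each equal-width bin."""
    low, high = min(values), max(values)
    # Fall back to a width of 1.0 when every value is identical.
    width = (high - low) / bins or 1.0
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin, not a bin past the end.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return [(low + i * width, low + (i + 1) * width, counts[i]) for i in range(bins)]


buckets = histogram([1, 2, 2, 3, 5, 8, 9, 10], bins=3)
for lower, upper, count in buckets:
    print(f"{lower:5.1f} to {upper:5.1f}: {'#' * count}")
```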

Data Preview

After profiling the data, Collibra Data Quality provides a glimpse of the dataset to users with appropriate rights. The Data Preview tab also provides some basic insights, such as highlights of Data Shape issues and Outliers (if enabled), and a Column Filtergram visualization.