When large-scale, high-concurrency checks are not required, DQ can be installed and operated entirely on a single host. In this mode, DQ uses a Spark Standalone pseudo cluster in which the master and workers run on, and draw resources from, the same server. DQ also requires a Postgres database for storage and Java 8 for running the DQ web application. You can install Spark, Postgres, and Java 8 separately and then install DQ on top of the existing components. However, we offer a full installation package that installs these components in off-line mode and installs DQ on the same server.
We assume that a server running CentOS 7 or RHEL 7 is set up and ready for DQ to be installed in the home directory (base path: OWL_BASE) under the subdirectory $OWL_BASE/owl. DQ does not have to be installed in the home directory, but the DQ Full Installation script may fail with permission-denied errors during local Postgres server installation if a path other than the home directory is used. If so, adjust your directory permissions to give the installation script write access to the Postgres data folder.
This tutorial assumes that you are installing DQ on a brand-new compute instance on Google Cloud Platform, and that the Google Cloud SDK is set up with the proper user permissions. Using GCP is optional: you are free to create a Full Standalone Installation on any cloud service provider or on-premises.
Please refer to the GOAL paragraph for the intended outcome of each step and modify accordingly.
# Create new GCP Compute Instance named "install"
gcloud compute instances create install \
    --image=centos-7-v20210701 \
    --image-project=centos-cloud \
    --machine-type=e2-standard-4

# SSH into the instance as user "centos"
gcloud compute ssh --zone "us-central1-a" --project "gcp-example-project" "centos@install"
Download the full package tarball using the signed link provided by the DQ Team. Replace <signed-link-to-full-package> with the link provided.
### Go to the OWL_BASE (home directory of the user is most common)
### In this example we use /home/owldq, installing as the user owldq
cd /home/owldq

### Download & untar
curl -o dq-full-package.tar.gz "<signed-link-to-full-package>"
tar -xvf dq-full-package.tar.gz

### Clean-up unnecessary tarball (optional)
rm dq-full-package.tar.gz
First, set some variables: OWL_BASE (where to install DQ; in this tutorial, you are already in the directory where you want to install), OWL_METASTORE_USER (the Postgres username used by the DQ Web Application to access Postgres storage), and OWL_METASTORE_PASS (the Postgres password used by the DQ Web Application to access Postgres storage).
### Base path where you want owl installed. No trailing slash
export OWL_BASE=$(pwd)
export OWL_METASTORE_USER=postgres
export OWL_METASTORE_PASS=password
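Before running the installer, it can help to confirm that the three variables are actually set in the current shell. The following is a minimal sketch, not part of the DQ package: check_dq_env is a hypothetical helper name, and reading the "No trailing" comment above as "no trailing slash" is an assumption.

```shell
# Hypothetical helper, not shipped with DQ: sanity-check the three
# variables from this step before calling setup.sh.
check_dq_env() {
  if [ -z "${OWL_BASE:-}" ]; then
    echo "OWL_BASE is not set" >&2
    return 1
  fi
  # Assumption: the base path should not end with a slash.
  case "$OWL_BASE" in
    */) echo "OWL_BASE must not end with a slash: $OWL_BASE" >&2; return 1 ;;
  esac
  if [ ! -d "$OWL_BASE" ]; then
    echo "OWL_BASE is not an existing directory: $OWL_BASE" >&2
    return 1
  fi
  if [ -z "${OWL_METASTORE_USER:-}" ] || [ -z "${OWL_METASTORE_PASS:-}" ]; then
    echo "OWL_METASTORE_USER and OWL_METASTORE_PASS must both be set" >&2
    return 1
  fi
  # Print only non-secret values as a confirmation.
  echo "OWL_BASE=$OWL_BASE OWL_METASTORE_USER=$OWL_METASTORE_USER"
}
```

Calling check_dq_env right after the export lines prints the base path and user on success and returns a non-zero status with a message on stderr otherwise.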
The dq-full-package.tar.gz tarball that you untarred contains installation packages for Java 8, Postgres 11, and Spark, so there is no need to download these components separately. These off-line installation components are located in
One of the files extracted from the tarball is setup.sh. This script installs DQ and the required components. If a component already exists (e.g. Java 8 is already installed and $JAVA_HOME is set), then that component is not installed (i.e. the Java 8 installation is skipped).
To control which components are installed, use the -options=... parameter. The argument provided should be a comma-delimited list of components to install (valid inputs:
-options=postgres,spark,owlweb,owlagent means "install Postgres, the Spark pseudo cluster, the Owl Web Application, and the Owl Agent". Note that Java is not part of the options. Java 8 is checked automatically and is installed or skipped depending on availability.
At minimum, you must specify -options=spark,owlweb,owlagent if you installed Postgres independently or are using an external Postgres connection (see Step #3 if you choose that installation route).
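The choice between the two option lists can be captured in a small wrapper. This is only a sketch: dq_options and its "local"/"external" arguments are hypothetical names introduced here, and the two option strings are the ones quoted in the text above.

```shell
# Hypothetical helper: pick the -options value based on where Postgres lives.
#   local    -> setup.sh also installs Postgres on this host
#   external -> Postgres already exists (installed separately or remote)
dq_options() {
  case "$1" in
    local)    echo "postgres,spark,owlweb,owlagent" ;;
    external) echo "spark,owlweb,owlagent" ;;
    *)        echo "usage: dq_options local|external" >&2; return 2 ;;
  esac
}
```

The result could then be passed to the installer as -options="$(dq_options local)".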
### The following installs PostgresDB locally as part of OwlDQ install
./setup.sh \
    -owlbase=$OWL_BASE \
    -user=$OWL_METASTORE_USER \
    -pgpassword=$OWL_METASTORE_PASS \
    -options=postgres,spark,owlweb,owlagent
If no exceptions occurred and installation was successful, then the process will complete with the following output.
installing owlweb
starting owlweb
starting owl-web
installing agent
not starting agent
install complete
please use owl owlmanage utility to configure license key and start owl-agent after owl-web successfully starts up
If you have already installed DQ in the previous step, skip this step. It is only for those who want to use an external Postgres (e.g. the GCP Cloud SQL service as the Postgres metadata storage). If you have an existing Postgres installation, then everything in the previous step applies except the Postgres data path prompt and the
Refer to the Step #2 for details on what
OWL_METASTORE_USER , and
# Base path where you want owl installed. No trailing slash
export OWL_BASE=$(pwd)
export OWL_METASTORE_USER=postgres
export OWL_METASTORE_PASS=password
Run the following installation script. Note the missing "postgres" in -options and the new parameter -pgserver, which can point to any URL that the standalone instance has access to.
# The following does not install PostgresDB and
# uses existing PostgresDB server located in localhost:5432 with "postgres" database
./setup.sh \
    -owlbase=$OWL_BASE \
    -user=$OWL_METASTORE_USER \
    -pgpassword=$OWL_METASTORE_PASS \
    -options=spark,owlweb,owlagent \
    -pgserver="localhost:5432/postgres"
The database named postgres is used by default as the DQ metadata storage. Changing this database name is out of scope for the Full Standalone Installation. Contact the DQ Team for assistance.
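When pointing at an external server, it can be useful to split the -pgserver value into its parts, for example to test connectivity before running the installer. The sketch below assumes the value always has the host:port/database shape seen in the example above; parse_pgserver is a hypothetical helper, not part of the DQ package.

```shell
# Hypothetical helper: split a value such as "localhost:5432/postgres"
# into host, port, and database name.
parse_pgserver() {
  case "$1" in
    *:*/*) ;;  # looks like host:port/database
    *) echo "expected host:port/database, got: $1" >&2; return 1 ;;
  esac
  hostport="${1%%/*}"     # everything before the first "/"
  db="${1#*/}"            # everything after the first "/"
  host="${hostport%%:*}"  # everything before the first ":"
  port="${hostport##*:}"  # everything after the last ":"
  echo "host=$host port=$port db=$db"
}
```

If the PostgreSQL client tools happen to be installed, the parsed host and port can be handed to pg_isready (for example pg_isready -h localhost -p 5432) to confirm that the server accepts connections.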
The installation process starts the DQ Web Application, which initializes the Postgres metadata storage schema (under the database named postgres). This initialization must complete successfully before the DQ Agent can be started. Wait approximately 1 minute for the schema to be populated. If you can open <url-to-dq-web>:9000 in a web browser, you have successfully installed DQ.
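Rather than waiting a fixed minute, the check can be retried until the web application answers. The following is a generic retry sketch, not a DQ utility: retry is a hypothetical helper that runs any command up to a given number of attempts with a pause in between.

```shell
# Hypothetical helper: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt is still to come.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (left commented out because it needs a running DQ Web instance):
# retry 12 5 curl -fsS -o /dev/null "http://<url-to-dq-web>:9000"
```

With 12 attempts and a 5 second delay, the commented example would poll for about a minute, which matches the waiting time suggested above.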
Next, verify that the Spark cluster has started and is available to run DQ checks by opening <url-to-dq-web>:8080. Take note of the Spark Master URL (starting with spark://...). This will be required during DQ Agent configuration.
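If a browser is not convenient, the Spark Master URL can also be pulled out of the page text on the command line. This is a rough sketch: extract_spark_url is a hypothetical helper, and it assumes the master page contains the URL as plain text in the form spark://host:port.

```shell
# Hypothetical helper: read text on stdin and print the first
# spark://host:port string found in it.
extract_spark_url() {
  grep -o 'spark://[A-Za-z0-9._-]*:[0-9]*' | head -n 1
}

# Example (left commented out because it needs a running Spark master):
# curl -s "http://<url-to-dq-web>:8080" | extract_spark_url
```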
In order for DQ to run checks on data, the DQ Agent must be configured with a license key. Replace <license-key> with a valid license key provided by the DQ Team.
cd $OWL_BASE/owl/bin
./owlmanage.sh setlic=<license-key>

# expected output:
# > License Accepted new date: <expiration-date>
Next, start the DQ Agent process to enable processing of DQ checks.
cd $OWL_BASE/owl/bin
./owlmanage.sh start=owlagent

# Verify "agent.properties" file is created
cd $OWL_BASE/owl/config
When the script runs successfully, the $OWL_BASE/owl/config folder will contain a file called agent.properties. This file contains the agent id numbers of the agents installed on this machine. Since this is the first non-default agent installed, the expected agent id is 2. Verify that the agent.properties file has been created. Your agent.properties is expected to have a different timestamp, but you should see the same agent id.
cd $OWL_BASE/owl/config
cat agent.properties

# expected output:
> #Tue Jul 13 22:26:19 UTC 2021
> agentid=2
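The same verification can be scripted. The sketch below assumes agent.properties is a plain key=value file with a commented timestamp line, as the expected output above suggests; read_agent_id is a hypothetical helper, not part of owlmanage.sh.

```shell
# Hypothetical helper: print the numeric agent id stored in a
# properties file given as the first argument.
read_agent_id() {
  # Lines starting with "#" (such as the timestamp) do not match and are skipped.
  sed -n 's/^agentid=\([0-9][0-9]*\)[[:space:]]*$/\1/p' "$1"
}

# Example (left commented out because it needs an installed agent):
# read_agent_id "$OWL_BASE/owl/config/agent.properties"
```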
Once the DQ Agent starts, it needs to be configured in DQ Web in order to successfully submit jobs to the local Spark (pseudo) cluster.
The new agent has been setup with the template base path
/opt and install path
owlmanage.sh start=owlagent script does not respect
OWL_BASE environment. We need to edit the Agent Configuration to follow our
Follow the steps on the How To Configure Agent via UI page to configure the newly created DQ Agent, and edit the following parameters in DQ Agent #2.
Replace all occurrences of /opt/owl with your $OWL_BASE/owl/ in Base Path, Collibra DQ Core JAR, Collibra DQ Core Logs, Collibra DQ Script, and Collibra DQ Web Logs.
Note that Base Path here does not refer to
Replace the Default Master value with the Spark URL from Fig 3.
Set Default Client Mode to "Cluster".
Set Number of Executors(s), Executor Memory (GB), and Driver Memory (GB) to reasonable defaults (depending on how large your instance is).
Refer to Agent Configuration Parameters for parameter descriptions.
Follow the steps on the How to Add DB Connection via UI page to add the metastore database connection. For demo purposes, we will run a DQ Job against the local DQ Metadata Storage.
Follow the steps on the How To Link DB Connection to Agent via UI page to link the connection to the newly created DQ Agent.
Click the compass icon in the navigation pane to navigate to the Explorer Page. Click on the "metastore" connection, select the "public" schema, and then select the first table in the resulting list of tables. Once the preview and scope tab comes up, click "Build Model". When the Profile page comes up, click the "Run" button.
On the Run page, click the "Estimate Job" button, acknowledge the resource recommendations, and then click the "Run" button.
Click the clock icon in the navigation pane to navigate to the Jobs Page. Wait 10 seconds and then click refresh several times with a few seconds in between clicks. The test DQ check should appear and progress through a sequence of activities before settling in "Finished" status.
### Setting permissions on your pem file for ssh access
chmod 400 ~/Downloads/ssh_pem_key

### Postgres data directory initialization failed
### Postgres permission denied errors
### sed: can't read /home/owldq/owl/postgres/data/postgresql.conf: Permission denied
sudo rm -rf /home/owldq/owl/postgres
chmod -R 755 /home/owldq
### Reinstall just postgres
./setup.sh -owlbase=$OWL_BASE -user=$OWL_METASTORE_USER -pgpassword=$OWL_METASTORE_PASS -options=postgres

### Spark standalone permission denied after using ./start-all.sh
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

### Changing permissions on individual pid files
sudo chmod 777 /home/owldq/owl/pids/owl-agent.pid
sudo chmod 777 /home/owldq/owl/pids/owl-web.pid

### Getting the hostname of the instance
hostname -f

### Checking worker nodes disk space
sudo du -ah | sort -hr | head -5
sudo find /home/owldq/owl/spark/work/* -mtime +1 -type f -delete

### Adding ENV variables to bash profile
vi ~/.bash_profile
export SPARK_HOME=/home/owldq/owl/spark
export PATH=$SPARK_HOME/bin:$PATH
### Add to owl-env.sh for standalone install
vi /home/owldq/owl/config/owl-env.sh
export SPARK_HOME=/home/owldq/owl/spark
export PATH=$SPARK_HOME/bin:$PATH

### Checking PIDS for different components
ps -aef|grep postgres
ps -aef|grep owl-web
ps -aef|grep owl-agent
ps -aef|grep spark

### Restart different components
cd /home/owldq/owl/bin/
./owlmanage.sh start=postgres
./owlmanage.sh start=owlagent
./owlmanage.sh start=owlweb
cd /home/owldq/owl/spark/sbin/
./stop-all.sh
./start-all.sh
To launch a Spark standalone cluster with the launch scripts, you should create a file called conf/workers in your Spark directory, which must contain the hostnames of all the machines where you intend to start Spark workers, one per line. If conf/workers does not exist, the launch scripts default to a single machine (localhost), which is useful for testing. Note that the master machine accesses each of the worker machines via ssh. By default, ssh is run in parallel and requires password-less access (using a private key) to be set up. If you do not have a password-less setup, you can set the environment variable SPARK_SSH_FOREGROUND and serially provide a password for each worker.
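As an illustration of the file format described above, the sketch below writes one hostname per line into conf/workers under a given Spark directory. write_workers_file is a hypothetical helper and the hostnames in the example are placeholders; for the single-host setup in this tutorial the file is not needed, since the scripts fall back to localhost.

```shell
# Hypothetical helper: write_workers_file SPARK_DIR HOST [HOST...]
# Creates SPARK_DIR/conf/workers with one hostname per line.
write_workers_file() {
  if [ "$#" -lt 2 ]; then
    echo "usage: write_workers_file SPARK_DIR HOST [HOST...]" >&2
    return 2
  fi
  spark_dir="$1"
  shift
  mkdir -p "$spark_dir/conf"
  # printf repeats the format for every remaining argument.
  printf '%s\n' "$@" > "$spark_dir/conf/workers"
}

# Example with placeholder hostnames (commented out to avoid touching a real install):
# write_workers_file /home/owldq/owl/spark worker-1 worker-2
```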
Once you’ve set up this file, you can launch or stop your cluster with the following shell scripts, based on Hadoop’s deploy scripts, and available in
sbin/start-master.sh - Starts a master instance on the machine the script is executed on.
sbin/start-workers.sh - Starts a worker instance on each machine specified in the
sbin/start-worker.sh - Starts a worker instance on the machine the script is executed on.
sbin/start-all.sh - Starts both a master and a number of workers as described above.
sbin/stop-master.sh - Stops the master that was started via the
sbin/stop-worker.sh - Stops all worker instances on the machine the script is executed on.
sbin/stop-workers.sh - Stops all worker instances on the machines specified in the
sbin/stop-all.sh - Stops both the master and the workers as described above.
Note that these scripts must be executed on the machine you want to run the Spark master on, not your local machine.
### Starting Spark Standalone
cd /home/owldq/owl/spark/sbin
./start-all.sh

### Stopping Spark
cd /home/owldq/owl/spark/sbin
./stop-all.sh
### Starting Spark with Separate Workers
/home/owldq/owl/spark/sbin/start-master.sh

### Export the worker settings so that the start script (a child process) can see them
export SPARK_WORKER_OPTS=" -Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1799 -Dspark.worker.cleanup.appDataTtl=3600"
export SPARK_WORKER_INSTANCES=3
/home/owldq/owl/spark/sbin/start-slave.sh spark://$(hostname):7077 -c 5 -m 20g