Hadoop Install
In some cases, the required Hadoop client configuration means the DQ Agent must run on a Hadoop Edge Node within the cluster. This can happen when native dependency packages are required, when the cluster is network-isolated from the subnet hosting the DQ server, when the security configuration is complex, etc. In these circumstances, simply deploy the DQ Agent on a cluster Edge Node that already contains the required configuration and packages. In this setup, the DQ Agent uses the existing Hadoop configuration and packages to run DQ checks on the Hadoop cluster.

Hadoop Config Setup

Hadoop configuration can be incredibly complex. There can be hundreds of "knobs" across dozens of different components. However, DQ's goal is simply to leverage Hadoop to allocate compute resources in order to execute DQ checks (Spark jobs). This means that the only client-side configurations required are:
    Security protocol definition
    Yarn Resource Manager endpoints
    Storage service (HDFS or Cloud storage)
Once the Hadoop client configuration is defined, it is only a matter of pointing the DQ Agent at the folder that contains the client configuration files. The DQ Agent is then able to use the Hadoop client configuration to submit jobs to the specified Hadoop cluster.
DQ jobs running on Hadoop are Spark jobs. DQ uses the storage platform defined in the "fs.defaultFS" setting to distribute all of the required Spark libraries and specified dependency packages, such as driver files. This allows DQ to use a version of Spark that is different from the one provided by the cluster. If it is a requirement to use the Spark version provided by the target Hadoop cluster, obtain and use a copy of the yarn-site.xml and core-site.xml from the cluster.

Create Config Folder

cd $OWL_HOME
mkdir -p config/hadoop
echo "export HADOOP_CONF_DIR=$OWL_HOME/config/hadoop" >> config/owl-env.sh
bin/owlmanage.sh restart=owlagent
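To confirm the agent picked up the new directory, a quick sanity check can be run (a minimal sketch; it only verifies that the folder exists and that owl-env.sh now exports HADOOP_CONF_DIR):
ls -ld $OWL_HOME/config/hadoop
grep HADOOP_CONF_DIR $OWL_HOME/config/owl-env.sh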

Minimum Config (Kerberos Disabled, TLS Disabled)

This configuration is typically only applicable in Cloud Hadoop scenarios (EMR/Dataproc/HDI). Cloud Hadoop clusters are ephemeral and do not store any data; the data is stored in, and secured by, Cloud Storage.
export RESOURCE_MANAGER=<yarn-resource-manager-host>
export NAME_NODE=<namenode>

echo "
<configuration>
    <property>
        <name>hadoop.security.authentication</name>
        <value>simple</value>
    </property>
    <property>
        <name>hadoop.rpc.protection</name>
        <value>authentication</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://$NAME_NODE:8020</value>
    </property>
</configuration>
" >> $OWL_HOME/config/hadoop/core-site.xml

echo "
<configuration>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>$RESOURCE_MANAGER:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>$RESOURCE_MANAGER:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>$RESOURCE_MANAGER:8088</value>
    </property>
</configuration>
" >> $OWL_HOME/config/hadoop/yarn-site.xml
When deploying a Cloud Service Hadoop cluster from any of the major Cloud platforms, it is possible to use Cloud Storage rather than HDFS for dependency package staging and distribution. To achieve this, create a new storage bucket and ensure that both the Hadoop cluster and the instance running the DQ Agent have access to it. This is usually accomplished using a Role attached to the infrastructure, for example an AWS Instance Role with bucket access policies. Then, set "fs.defaultFS" in core-site.xml to the bucket path instead of HDFS.
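For example, only the "fs.defaultFS" property changes relative to the core-site.xml above (a sketch only; "dq-staging-bucket" is a hypothetical bucket name, and the filesystem scheme depends on the platform and installed connector, typically s3:// on EMR, gs:// on Dataproc, and abfs(s):// or wasb(s):// on HDI):
<property>
    <name>fs.defaultFS</name>
    <value>s3://dq-staging-bucket/dq</value>
</property>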
Once the Hadoop client configuration has been created, navigate to the Agent Management console from the Admin Console, configure the agent to use Yarn (the Hadoop resource scheduler) as the Default Master, and set the Default Deployment Mode to "Cluster".
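These two settings correspond to the standard spark-submit options for YARN; conceptually, each DQ check is launched along the lines of the following (illustrative only; the actual command, jar, and arguments are assembled by the DQ Agent):
spark-submit --master yarn --deploy-mode cluster <dq-check-jar-and-options>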

Kerberos Secured with Resource Manager TLS Enabled

Typically, Hadoop clusters deployed on-premises are multi-tenant and not ephemeral. This means they must be secured using Kerberos. In addition, all HTTP endpoints will have TLS enabled, and HDFS may be configured for more secure communication using additional RPC encryption.
export RESOURCE_MANAGER=<yarn-resource-manager-host>
export NAME_NODE=<namenode>
export KERBEROS_DOMAIN=<kerberos-domain-on-cluster>
export HDFS_RPC_PROTECTION=<authentication || privacy || integrity>

echo "
<configuration>
    <property>
        <name>hadoop.security.authentication</name>
        <value>kerberos</value>
    </property>
    <property>
        <name>hadoop.rpc.protection</name>
        <value>$HDFS_RPC_PROTECTION</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://$NAME_NODE:8020</value>
    </property>
</configuration>
" >> $OWL_HOME/config/hadoop/core-site.xml

echo "
<configuration>
    <property>
        <name>dfs.namenode.kerberos.principal</name>
        <value>hdfs/_HOST@$KERBEROS_DOMAIN</value>
    </property>
</configuration>
" >> $OWL_HOME/config/hadoop/hdfs-site.xml

echo "
<configuration>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>$RESOURCE_MANAGER:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>$RESOURCE_MANAGER:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.https.address</name>
        <value>$RESOURCE_MANAGER:8090</value>
    </property>
</configuration>
" >> $OWL_HOME/config/hadoop/yarn-site.xml
When the target Hadoop cluster is secured by Kerberos, DQ checks require a Kerberos credential. This typically means that the DQ Agent needs to be configured to include a Kerberos keytab with each DQ check. Access the DQ Agent configuration page from the Admin Console and configure the "Freeform Append" setting with -sparkprinc <spark-submit-principal> -sparkkeytab <path-to-keytab>.
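For example, with a hypothetical principal and keytab path (neither is provisioned by DQ; substitute site-specific values), the Freeform Append value would look like the first line below, and klist can be used to confirm that the keytab actually holds that principal:
-sparkprinc dq-agent@EXAMPLE.COM -sparkkeytab /opt/owl/config/dq-agent.keytab

klist -kt /opt/owl/config/dq-agent.keytab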