DQ Job Hive
Run a Data Quality check on a Hive table. Use the -hive flag for a native connection via HCat. This does not require a JDBC connection and is optimized for distributed speed and scale.

Hive Native, no JDBC Connection

Open source platforms such as HDP, EMR and CDH follow well-known standards, so Owl can read Hive tables through HCat. This removes the need for JDBC connection details and offers optimum data read speeds. Owl recommends and supports this approach with the -hive flag.
./owlcheck -ds hive_table -rd 2019-03-13 \
-q "select * from hive_table" -hive
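To run the same native check for several run dates, the command above can be wrapped in a small loop. The following is a minimal sketch and not part of Owl: `hive_check_cmd` is a hypothetical helper that only prints the command line so it can be reviewed before anything is executed. It uses only the flags shown above (-ds, -rd, -q, -hive), and the dates are illustrative.

```shell
# Hypothetical helper: print the owlcheck command for one dataset and run date.
# It only prints the command; nothing is executed against Hive.
hive_check_cmd() {
  dataset=$1
  run_date=$2
  printf './owlcheck -ds %s -rd %s -q "select * from %s" -hive\n' \
    "$dataset" "$run_date" "$dataset"
}

# Print one command per run date (dates are illustrative).
for d in 2019-03-13 2019-03-14; do
  hive_check_cmd hive_table "$d"
done
```

Once the printed commands look right, they can be pasted into a shell on a node where owlcheck is installed.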
Example output. A hoot is a valid JSON response
{
  "dataset": "hive_table",
  "runId": "2019-02-03",
  "score": 100,
  "behaviorScore": 0,
  "rows": 477261,
  "prettyPrint": true
}
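Because the hoot is JSON, a wrapper script can read the score and react to it. The sketch below is an assumption-laden illustration, not an Owl feature: `hoot_field` is a hypothetical helper that relies on the pretty-printed, one-key-per-line layout shown above, and the threshold of 90 is arbitrary. For anything beyond a quick check, a real JSON parser is the safer choice.

```shell
# Hypothetical helper: read a hoot on stdin and print the numeric value of one key.
# Assumes the pretty-printed, one-key-per-line layout shown above.
hoot_field() {
  sed -n "s/^[[:space:]]*\"$1\":[[:space:]]*\([0-9][0-9]*\).*/\1/p"
}

# Sample hoot copied from the example output above.
hoot='{
"dataset": "hive_table",
"runId": "2019-02-03",
"score": 100,
"behaviorScore": 0,
"rows": 477261,
"prettyPrint": true
}'

score=$(printf '%s\n' "$hoot" | hoot_field score)

# 90 is an arbitrary threshold chosen for illustration.
if [ "$score" -lt 90 ]; then
  echo "DQ score $score is below threshold" >&2
else
  echo "DQ score $score passed"
fi
```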

Hive JDBC

    1. You need to use the Hive JDBC driver, commonly org.apache.hive.jdbc.HiveDriver
    2. You need to locate the JDBC driver jar for the version that came with your EMR, HDP or CDH
      1. This jar is commonly found on an edge node under /opt/hdp/libs/hive/hive-jdbc.jar etc...
./owlcheck -rd 2019-06-07 -ds hive_table \
-u <user> -p <pass> -q "select * from hive_table" \
-c "jdbc:hive2://<HOST>:10000/default" \
-driver org.apache.hive.jdbc.HiveDriver \
-lib /opt/owl/drivers/hive/ \
-master yarn -deploymode client
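Before pointing -lib at a directory, it can help to confirm that a driver jar is actually there. The following is a minimal sketch: `find_hive_jdbc_jars` and the `OWL_HIVE_LIB` variable are hypothetical names, and the `hive-jdbc*.jar` pattern only matches Apache-style file names such as the hive-jdbc.jar mentioned above. Adjust the pattern if your distribution names its jars differently.

```shell
# Hypothetical helper: list Hive JDBC jars under a directory.
# Errors (for example a missing directory) are silenced, so no match prints nothing.
find_hive_jdbc_jars() {
  find "$1" -type f -name 'hive-jdbc*.jar' 2>/dev/null
}

# Check the directory you intend to pass to -lib.
# OWL_HIVE_LIB is a made-up variable; the default is the path used in the command above.
lib_dir=${OWL_HIVE_LIB:-/opt/owl/drivers/hive}
jars=$(find_hive_jdbc_jars "$lib_dir")
if [ -n "$jars" ]; then
  printf 'Found driver jar(s):\n%s\n' "$jars"
else
  echo "No hive-jdbc*.jar under $lib_dir" >&2
fi
```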

HDP Driver - org.apache.hive.jdbc.HiveDriver

CDH Driver - com.cloudera.hive.jdbc41.Datasource

For CDH, all the drivers are packaged in HiveJDBC41_cdhversion.zip

Troubleshooting

A common JDBC connection property to set is hive.resultset.use.unique.column.names=false
This can be added directly to the JDBC connection URL string or to the driver properties section.
Test your Hive connection via beeline to make sure it is correct before going further.
beeline -u 'jdbc:hive2://<HOST>:10000/default;principal=hive/[email protected];useSSL=true' -d org.apache.hive.jdbc.HiveDriver
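The hive.resultset.use.unique.column.names property mentioned above can be appended to the connection URL. Hive JDBC URLs generally take the form `jdbc:hive2://host:port/db;sessionVars?hiveConfs`, with Hive configuration properties placed after the `?` and separated by `;`. The sketch below assumes that layout and a URL without a trailing `#` section; `add_hive_conf` is a hypothetical helper, not an Owl or Hive command.

```shell
# Hypothetical helper: append a Hive configuration property to a jdbc:hive2 URL.
# Properties go after '?'; further properties are separated by ';'.
# Assumes the URL has no trailing '#hiveVars' section.
add_hive_conf() {
  url=$1
  prop=$2
  case "$url" in
    *\?*) printf '%s;%s\n' "$url" "$prop" ;;  # conf list already started
    *)    printf '%s?%s\n' "$url" "$prop" ;;  # first conf property
  esac
}

add_hive_conf 'jdbc:hive2://<HOST>:10000/default' \
  'hive.resultset.use.unique.column.names=false'
```

The resulting string can be passed to beeline -u for a quick test before it is used in the -c argument of owlcheck.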

Kerberos Example

jdbc:hive2://<HOST>:10000/default;principal=hive/[email protected];useSSL=true

Connecting Owl WebApp to Hive JDBC

Notice the driver properties for Kerberos and principals.
In very rare cases where you can't get the jar files to connect properly, one workaround is to add the driver directory to -Dloader.path in the owl-web startup script:
$JAVA_HOME/bin/java -Dloader.path=lib,/home/danielrice/owl/drivers/hive/ \
-DowlAppender=owlRollingFile \
-DowlLogFile=owl-web -Dlog4j.configurationFile=file://$INSTALL_PATH/config/log4j2.xml \
$HBASE_KERBEROS -jar $owlweb $ZKHOST_KER \
--logging.level.org.springframework=INFO $TIMEOUT \
--server.session.timeout=$TIMEOUT \
--server.port=9001 > $LOG_PATH/owl-web-app.out 2>&1 & echo $! >$INSTALL_PATH/pids/owl-web.pid

Class Not Found apache or client or log4j etc...

Any "class not found" error means that you either do not have the "standalone-jar" or are missing some of the jars needed for the driver.

Hive JDBC Jars

It is common for Hive to need a lot of .jar files to complete the driver setup.

Java jar cmds

Sometimes it is helpful to look inside the jar and make sure it has all the needed files.
jar -tvf hive-jdbc.jar
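When a driver directory holds many jars, the same listing can be repeated for each of them to find which jar, if any, provides a missing class. This is a minimal sketch: `find_class_in_jars` is a hypothetical helper, and it assumes the JDK `jar` tool is on the PATH.

```shell
# Hypothetical helper: print every jar in a directory that contains a given class.
# Usage: find_class_in_jars <dir> <fully.qualified.ClassName>
find_class_in_jars() {
  # Jar entries use '/' separators and end in .class
  entry=$(printf '%s' "$2" | tr '.' '/').class
  for j in "$1"/*.jar; do
    [ -f "$j" ] || continue          # skip when the glob matches nothing
    if jar -tf "$j" | grep -qxF "$entry"; then
      echo "$j"
    fi
  done
}

# Example (not run here):
#   find_class_in_jars /opt/owl/drivers/hive org.apache.hive.jdbc.HiveDriver
```

If the helper prints nothing for the class named in the error, that class is not in any jar in the directory and another jar still has to be added.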