DQ Job Hive
We've moved! To improve customer experience, the Collibra Data Quality User Guide has moved to the Collibra Documentation Center as part of the Collibra Data Quality 2022.11 release. To ensure a seamless transition, dq-docs.collibra.com will remain accessible, but the DQ User Guide is now maintained exclusively in the Documentation Center.
Run a Data Quality check on a Hive table. Use the -hive flag for a native connection via HCat. This does not require a JDBC connection and is optimized for distributed speed and scale.
Open source platforms such as HDP, EMR, and CDH use well-known standards, so Owl can take advantage of components like HCat. This removes the need for JDBC connection details and offers optimal data read speeds. Owl recommends and supports this approach with the -hive flag.
./owlcheck -ds hive_table -rd 2019-03-13 \
-q "select * from hive_table" -hive
Example output. A hoot is a valid JSON response
To connect through JDBC instead of HCat:
1. You need to use the Hive JDBC driver, commonly org.apache.hive.jdbc.HiveDriver.
2. You need to locate the JDBC driver jar in the version that came with your EMR, HDP, or CDH distribution.
   1. This jar is commonly found on an edge node, for example under /opt/hdp/libs/hive/hive-jdbc.jar.
./owlcheck -rd 2019-06-07 -ds hive_table \
-u <user> -p <pass> -q "select * from table" \
-c "jdbc:hive2://<HOST>:10000/default" \
-driver org.apache.hive.jdbc.HiveDriver \
-lib /opt/owl/drivers/hive/ \
-master yarn -deploymode client
For CDH, all the drivers are packaged in HiveJDBC41_cdhversion.zip.
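If you are unsure where the driver jar lives on an edge node, a file search can narrow it down. The following is a minimal sketch, not part of Owl: the function name find_hive_jdbc_jars is hypothetical, the directories you pass in are your own guesses, and it assumes a find implementation that supports -iname (GNU and BSD find both do).

```shell
# Hypothetical helper: list candidate Hive JDBC jars under the given directories.
# The case-insensitive pattern matches names such as hive-jdbc-<version>.jar
# as well as HiveJDBC41.jar.
find_hive_jdbc_jars() {
  find "$@" -type f -iname '*hive*jdbc*.jar' 2>/dev/null
}

# Usage (directories are examples only):
#   find_hive_jdbc_jars /opt /usr/lib
```

Any jar this turns up can then be copied into the directory passed to -lib.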
A common JDBC connection property is hive.resultset.use.unique.column.names=false.
This can be added directly to the JDBC connection URL string or to the driver properties section.
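As an illustration of adding the property to the URL string, the sketch below appends a Hive configuration property to a connection URL. It assumes the Apache Hive driver's URL layout (jdbc:hive2://host:port/db;sessionVars?hiveConfs#hiveVars) and a URL with no '#' section. Vendor-packaged drivers may use a different syntax, so check your driver's documentation. The function name add_hive_conf is hypothetical.

```shell
# Hypothetical helper: append a Hive configuration property to a HiveServer2 JDBC URL.
# In the Apache Hive URL layout, the first conf property is introduced with '?'
# and further properties are joined with ';'.
# Assumes the URL has no '#hiveVars' section.
add_hive_conf() {
  url="$1"
  conf="$2"
  case "$url" in
    *\?*) printf '%s;%s\n' "$url" "$conf" ;;
    *)    printf '%s?%s\n' "$url" "$conf" ;;
  esac
}

# Usage:
#   add_hive_conf 'jdbc:hive2://<HOST>:10000/default' \
#     'hive.resultset.use.unique.column.names=false'
```

The resulting string is what you would pass to -c. Keep it quoted so the shell does not interpret the ';' and '?' characters.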
Test your Hive connection with beeline to make sure it is correct before going further.
beeline -u 'jdbc:hive2://<HOST>:10000/default;principal=hive/<HOST>@<REALM>;useSSL=true' -d org.apache.hive.jdbc.HiveDriver
Notice the driver properties for Kerberos and principals.
In very rare cases where you can't get the jar files to connect properly, one workaround is to add the driver directory to -Dloader.path in the owl-web startup script:
$JAVA_HOME/bin/java -Dloader.path=lib,/home/danielrice/owl/drivers/hive/ \
-DowlLogFile=owl-web -Dlog4j.configurationFile=file://$INSTALL_PATH/config/log4j2.xml \
$HBASE_KERBEROS -jar $owlweb $ZKHOST_KER \
--logging.level.org.springframework=INFO $TIMEOUT \
--server.port=9001 > $LOG_PATH/owl-web-app.out 2>&1 & echo $! >$INSTALL_PATH/pids/owl-web.pid
Any class-not-found error means that you either do not have the standalone jar or do not have all the jars the driver needs.
It is common for Hive to need a lot of .jar files to complete the driver setup.
Sometimes it is helpful to look inside the jar and make sure it has all the needed files.
jar -tvf hive-jdbc.jar
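Rather than reading the full listing by eye, you can filter it for the driver class. This is a minimal sketch under stated assumptions: the function name has_class is hypothetical, and it expects a plain listing (one entry path per line, as produced by jar -tf without the v flag) on standard input.

```shell
# Hypothetical helper: succeed if a jar listing on stdin contains the given class.
# The fully qualified class name is converted to its entry path inside the jar,
# e.g. org.apache.hive.jdbc.HiveDriver -> org/apache/hive/jdbc/HiveDriver.class
has_class() {
  entry="$(printf '%s' "$1" | tr '.' '/').class"
  # -x matches the whole line, -F treats the entry as a fixed string
  grep -qxF "$entry"
}

# Usage (assumes the JDK jar tool is on PATH):
#   jar -tf hive-jdbc.jar | has_class org.apache.hive.jdbc.HiveDriver \
#     && echo "driver class found"
```

If the class is missing, the jar is probably not the standalone build, and the driver class lives in one of the other jars that must also be on the -lib path.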