Collibra DQ User Guide
2022.10

DQ Job S3

We've moved! To improve customer experience, the Collibra Data Quality User Guide has moved to the Collibra Documentation Center as part of the Collibra Data Quality 2022.11 release. To ensure a seamless transition, dq-docs.collibra.com will remain accessible, but the DQ User Guide is now maintained exclusively in the Documentation Center.
S3 permissions must be set up appropriately.
S3 connections should be defined using the root bucket. Nested S3 connections (a bucket plus a subfolder path) are not supported.
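For example (hypothetical bucket and path names, shown only to illustrate the restriction):

s3a://my-data-bucket (supported: connection defined at the root bucket)
s3a://my-data-bucket/landing/raw (not supported: nested path)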

Example Minimum Permissions

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucketMultipartUploads",
        "s3:ListBucket",
        "s3:ListMultipartUploadParts",
        "s3:PutObject",
        "s3:GetObject",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:athena:*:<AWSAccountID>:workgroup/primary",
        "arn:aws:s3:::<S3 bucket name>/*",
        "arn:aws:s3:::<S3 bucket name>",
        "arn:aws:glue:*:<AWSAccountID>:catalog",
        "arn:aws:glue:*:<AWSAccountID>:database/<database name>",
        "arn:aws:glue:*:<AWSAccountID>:table/<database name>/*"
      ]
    }
  ]
}
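Replace <AWSAccountID>, <S3 bucket name>, and <database name> with your own values. The Athena workgroup and Glue catalog ARNs apply when the job also reads through Athena/Glue; for direct S3 access, the two S3 ARNs are typically sufficient.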
This requires the Hadoop AWS driver (for example, hadoop-aws-2.7.3.2.6.5.0-292.jar), available from http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/.
-f "s3a://s3-location/testfile.csv" \
-d "," \
-rd "2018-01-08" \
-ds "salary_data_s3" \
-deploymode client \
-lib /home/ec2-user/owl/drivers/aws/
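In this example, -f is the S3 file to scan, -d the column delimiter, -rd the run date, -ds the dataset name, -deploymode the Spark deploy mode, and -lib the directory containing the AWS driver jar.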

Databricks Utils or Spark Conf

// AWS credentials and bucket details (replace with your own values)
val AccessKey = "xxx"
val SecretKey = "xxxyyyzzz"
// If the secret key contains "/", it must be URL-encoded before mounting:
//val EncodedSecretKey = SecretKey.replace("/", "%2F")
val AwsBucketName = "s3-location"
val MountName = "kirk"
// Unmount first in case the mount point already exists
dbutils.fs.unmount(s"/mnt/$MountName")
dbutils.fs.mount(s"s3a://${AccessKey}:${SecretKey}@${AwsBucketName}", s"/mnt/$MountName")
// Verify the mount
//display(dbutils.fs.ls(s"/mnt/$MountName"))
// SSE-S3 (server-side encryption) example
dbutils.fs.mount(s"s3a://$AccessKey:$SecretKey@$AwsBucketName", s"/mnt/$MountName", "sse-s3")

Databricks Notebooks using S3 buckets

val AccessKey = "ABCDED"
val SecretKey = "aaasdfwerwerasdfB"
// URL-encode any "/" characters in the secret key
val EncodedSecretKey = SecretKey.replace("/", "%2F")
val AwsBucketName = "s3-location"
val MountName = "abc"
// unmount first; re-mounting an existing mount point fails
dbutils.fs.unmount(s"/mnt/$MountName")
// mount the S3 bucket
dbutils.fs.mount(s"s3a://${AccessKey}:${EncodedSecretKey}@${AwsBucketName}", s"/mnt/$MountName")
display(dbutils.fs.ls(s"/mnt/$MountName"))
// read the file into a DataFrame (one string column per line)
val df = spark.read.text(s"/mnt/$MountName/atm_customer/atm_customer_2019_01_28.csv")
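Because spark.read.text loads each line as a single string column, parsing the file as CSV is usually more useful. A minimal sketch, assuming the same mount and that the file's first row contains column names:

// Sketch: parse the mounted file as CSV instead of raw text
// (assumes a header row; inferSchema is convenient but optional)
val dfCsv = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(s"/mnt/$MountName/atm_customer/atm_customer_2019_01_28.csv")
dfCsv.printSchema()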