
I need to mask certain attributes of a column with the array<struct> data type in Hive. For example, given a field biodata = [{'name':'Rahul','age':20,'gender':'male'},{'name':'Kavita','age':25,'gender':'female'}]

Here, I need to mask/encrypt the 'name' attribute and return an array<struct> as below: biodata = [{'name':'xvdff','age':20,'gender':'male'},{'name':'ddkfld','age':25,'gender':'female'}]

How can I achieve this by writing a Hive UDF?

1 Answer


If you want to do it without exploding, then you need to write a custom UDF.

A sha256 hash (in Hive, the sha2(input, 256) function) is a good method for data obfuscation because it is a collision-resistant, deterministic, one-way function. One-way means the hash cannot feasibly be reversed (it is cryptographically strong). Collision-resistant means there is a very low probability of getting the same hash for different input values. Deterministic means the same input always produces the same hash; this property lets you join on the hashed attribute, count distinct hashed values, and run other analytics and aggregations the same way as if the values were not hashed.

Using native Hive functions, you can explode the array, apply sha256, then collect the array again.

For example:

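-- Explode each biodata array into one row per struct, hash the name,
-- then rebuild the array per id (assumes id identifies each row of mytable).
select t.id,
       collect_list(named_struct('name', sha2(e.name, 256), 'age', e.age, 'gender', e.gender)) as result_array
  from mytable t
       lateral view outer inline(t.biodata) e as name, age, gender
 group by t.id;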

Applying sha256 consistently across all the data in your data warehouse still lets you analyze and join by hashed values, while the hashes cannot be reversed without an original value -> hash mapping.

Additionally, you may want to set empty values or other "special values" to NULL or an empty string instead of hashing them, like this: case when name = '' or name = 'NA' then '' else sha2(name, 256) end. This makes such values more convenient to analyze and filter.

A sha256 hash is always 64 hex digits long, regardless of input length. Example for the 'test' input string: 9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08

For older Hive versions without native sha2, you can call the DigestUtils method using reflect or java_method: reflect('org.apache.commons.codec.digest.DigestUtils', 'sha256Hex', input)

MD5 is a less secure and less collision-resistant hashing method: md5(input).
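To make the fixed length and determinism concrete, here is a minimal Java sketch that computes the same kind of digest with the JDK's MessageDigest. The class name Sha256Demo and the helper sha256Hex are illustrative names, not part of Hive, and the sketch assumes the input is hashed as UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Sha256Demo {
    // Returns the lowercase hex SHA-256 digest of the UTF-8 bytes of input.
    // Assumption: UTF-8 encoding, chosen for this sketch.
    static String sha256Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to an unsigned value so every byte yields two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        // Prints 9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
        System.out.println(sha256Hex("test"));
        // Always 64 hex digits, whatever the input length.
        System.out.println(sha256Hex("Rahul").length());
        // Deterministic: the same input gives the same hash, so joins still work.
        System.out.println(sha256Hex("Rahul").equals(sha256Hex("Rahul")));
    }
}
```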

Hive also has the mask_hash function for masking data. It is based on MD5 in Hive 2.x and was changed to use sha256 in Hive 3.0 (see code changes). You can reuse that code, and also read this blog on how to sort array<struct> by a specified struct field in a GenericUDF; this will give you a good start if you want a custom UDF.
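As a rough idea of what the core of such a UDF's evaluate method might do, here is a plain-Java sketch with no Hive dependencies. It assumes each struct arrives as a List<Object> in field order (name, age, gender); the class BiodataMasker, the constant NAME_INDEX and the method maskNames are hypothetical names. A real GenericUDF would also need initialize() and the ObjectInspector wiring to read the input and declare the array<struct> return type, which is left out here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BiodataMasker {
    // Assumption: 'name' is the first field of each struct.
    static final int NAME_INDEX = 0;

    // Lowercase hex SHA-256 of the UTF-8 bytes of input.
    static String sha256Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder(64);
            for (byte b : md.digest(input.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Returns a new array of structs with the name field hashed.
    // The input is not modified; null arrays, structs and names pass through.
    static List<List<Object>> maskNames(List<List<Object>> biodata) {
        if (biodata == null) {
            return null;
        }
        List<List<Object>> result = new ArrayList<>(biodata.size());
        for (List<Object> struct : biodata) {
            if (struct == null) {
                result.add(null);
                continue;
            }
            List<Object> copy = new ArrayList<>(struct);
            Object name = copy.get(NAME_INDEX);
            if (name != null) {
                copy.set(NAME_INDEX, sha256Hex(name.toString()));
            }
            result.add(copy);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Object>> biodata = Arrays.asList(
                Arrays.<Object>asList("Rahul", 20, "male"),
                Arrays.<Object>asList("Kavita", 25, "female"));
        // age and gender are unchanged; each name becomes a 64-digit hex hash.
        System.out.println(maskNames(biodata));
    }
}
```

Copying each struct rather than editing it in place is a deliberate choice in this sketch, so the masking step never alters the objects it was handed.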


2 Comments

Thanks for the valuable help, but the focus/expectation is more on how to pass an array<struct> as a parameter to a Hive UDF and return an array<struct> with specific fields masked within it. Please let me know how to achieve this.
@AbhishekGupta The last link describes in detail a UDF that receives an array<struct> and returns an array<struct>. All you need to do is implement your own evaluate method and remove everything you do not need: comparators, sorting.
