Base85 Encoding using built-in Vertica SQL functions
Maybe you need to squeeze 64 hex characters (32 bytes) into something smaller while maintaining human-readability. This might be useful if you've de-identified a column using the SHA256 function, but you have a field width limitation of 40 characters, for example.
The SQL expression below condenses 64 hex characters (or less) into 40 human-readable characters, using Base85 encoding. Base85 encoding is nice because it fits binary data of N bytes into a string of N*1.25 human-readable characters.
This expression isn't terribly fast, but on our 8-node cluster it encodes 10 million distinct values in well under a minute. That time includes running SHA256() on the input values as well.
This SQL expression was adapted from the definition of Base85 encoding found on Wikipedia: Ascii85
Normally one would request dbadmins to create a user-defined function (UDF) for this, but we're discouraged from that due to the maintenance overhead of maintaining many clusters.
WITH HexData AS ( SELECT TO_HEX('Man '::VARBINARY) AS "x" UNION ALL SELECT TO_HEX('Man is distinguished, not only b'::VARBINARY) AS "x" UNION ALL SELECT SHA256('Man is distinguished, not only by his reason, but by this singular passion from other animals, which is a lust of the mind, that by a perseverance of delight in the continued and indefatigable generation of knowledge, exceeds the short vehemence of any carnal pleasure.') ) SELECT -- Input value "x" should be 64 hex characters (or less). -- Output value "Base85Encoded" is 40 human-readable characters in the 90-character range ASCII(33) to ASCII(122). -- Note that numbers 52200625, 614125, and 7225 are equal to the number 85 raised to the power of 4, 3, and 2, respectively. "x", -- ##### 4-tuple 1 ##### Base85 characters 1 - 5 CHR(33 + TRUNC(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 1, 8)) / 52200625, 0)::INT) || CHR(33 + TRUNC(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 1, 8)), 52200625) / 614125, 0)::INT) || CHR(33 + TRUNC(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 1, 8)), 52200625), 614125) / 7225, 0)::INT) || CHR(33 + TRUNC(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 1, 8)), 52200625), 614125), 7225) / 85, 0)::INT) || CHR(33 + MOD(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 1, 8)), 52200625), 614125), 7225), 85)::INT) || -- ##### 4-tuple 2 ##### Base85 characters 6 - 10 CHR(33 + TRUNC(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 9, 8)) / 52200625, 0)::INT) || CHR(33 + TRUNC(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 9, 8)), 52200625) / 614125, 0)::INT) || CHR(33 + TRUNC(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 9, 8)), 52200625), 614125) / 7225, 0)::INT) || CHR(33 + TRUNC(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 9, 8)), 52200625), 614125), 7225) / 85, 0)::INT) || CHR(33 + MOD(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 9, 8)), 52200625), 614125), 7225), 85)::INT) || -- ##### 4-tuple 3 ##### Base85 characters 11 - 15 CHR(33 + TRUNC(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 17, 8)) / 52200625, 0)::INT) || CHR(33 + TRUNC(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 17, 8)), 52200625) / 614125, 0)::INT) || CHR(33 + TRUNC(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 17, 8)), 52200625), 614125) / 7225, 0)::INT) || CHR(33 + TRUNC(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 17, 8)), 52200625), 614125), 7225) / 85, 0)::INT) || CHR(33 + MOD(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 17, 8)), 52200625), 614125), 7225), 85)::INT) || -- ##### 4-tuple 4 ##### Base85 characters 16 - 20 CHR(33 + TRUNC(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 25, 8)) / 52200625, 0)::INT) || CHR(33 + TRUNC(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 25, 8)), 52200625) / 614125, 0)::INT) || CHR(33 + TRUNC(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 25, 8)), 52200625), 614125) / 7225, 0)::INT) || CHR(33 + TRUNC(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 25, 8)), 52200625), 614125), 7225) / 85, 0)::INT) || CHR(33 + MOD(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 25, 8)), 52200625), 614125), 7225), 85)::INT) || -- ##### 4-tuple 5 ##### Base85 characters 21 - 25 CHR(33 + TRUNC(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 33, 8)) / 52200625, 0)::INT) || CHR(33 + TRUNC(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 33, 8)), 52200625) / 614125, 0)::INT) || CHR(33 + TRUNC(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 33, 8)), 52200625), 614125) / 7225, 0)::INT) || CHR(33 + TRUNC(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 33, 8)), 52200625), 614125), 7225) / 85, 0)::INT) || CHR(33 + MOD(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 33, 8)), 52200625), 614125), 7225), 85)::INT) || -- ##### 4-tuple 6 ##### Base85 characters 26 - 30 CHR(33 + TRUNC(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 41, 8)) / 52200625, 0)::INT) || CHR(33 + TRUNC(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 41, 8)), 52200625) / 614125, 0)::INT) || CHR(33 + TRUNC(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 41, 8)), 52200625), 614125) / 7225, 0)::INT) || CHR(33 + TRUNC(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 41, 8)), 52200625), 614125), 7225) / 85, 0)::INT) || CHR(33 + MOD(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 41, 8)), 52200625), 614125), 7225), 85)::INT) || -- ##### 4-tuple 7 ##### Base85 characters 31 - 35 CHR(33 + TRUNC(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 49, 8)) / 52200625, 0)::INT) || CHR(33 + TRUNC(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 49, 8)), 52200625) / 614125, 0)::INT) || CHR(33 + TRUNC(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 49, 8)), 52200625), 614125) / 7225, 0)::INT) || CHR(33 + TRUNC(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 49, 8)), 52200625), 614125), 7225) / 85, 0)::INT) || CHR(33 + MOD(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 49, 8)), 52200625), 614125), 7225), 85)::INT) || -- ##### 4-tuple 8 ##### Base85 characters 36 - 40 CHR(33 + TRUNC(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 57, 8)) / 52200625, 0)::INT) || CHR(33 + TRUNC(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 57, 8)), 52200625) / 614125, 0)::INT) || CHR(33 + TRUNC(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 57, 8)), 52200625), 614125) / 7225, 0)::INT) || CHR(33 + TRUNC(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 57, 8)), 52200625), 614125), 7225) / 85, 0)::INT) || CHR(33 + MOD(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 57, 8)), 52200625), 614125), 7225), 85)::INT) AS "Base85Encoded" FROM HexData ;
Results:
x | Base85Encoded |
---|---|
4d616e20 | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!9jqo^ |
4d616e2069732064697374696e677569736865642c206e6f74206f6e6c792062 | 9jqo^BlbD-BleB1DJ+*+F(f,q/0JhKFCj@.4 |
78fe75026c4390ceccc4e9e6a9428ba8ae5968b458e60b5eebecd682cf24bbf2 | GlDgeCdX<0bf&c.WBuNAY$#GF=QTutlg32ScQp-n |
Comments
Thanks for sharing your encoding.
I enhanced a Python UDX I had shared here previously : https://forum.vertica.com/discussion/241198/base64-decode-encode-functions-in-vertica-via-python-udx-sdk
to also provide base85 and ascii85 encoding/decoding based on the Base64 python module. Enhanced UDX attached.
SQL to create the new base85/ascii85 functions:
However I do get different results than in your test (although the forum might not like some of the generated encodings, base64 is far cleaner to pass around).
Anyways, feel fee to reuse that UDX.
Hi @SergeB , I really appreciate your sharing, and I'm sure others will benefit from it as well. Unfortunately, our engineers don't have permission to create UDFs in our systems, but your code will certainly benefit those who can.
I would like to check our results to find out why there is a discrepancy in the outputs. One thing I noticed is your input text is missing a period at the end. Still, I cannot reproduce your output either way. I used a third-party tool to check the encoding and it seems to agree with my results. The specific case I'm checking is when you first process the text with SHA256 and then encode the binary output of that as Base85.
Here are my two results of encoding the long phrase after running it through SHA256, both with and without the final period:
GlDgeCdX<0bf&c.WBuNAY$#GF=QTutlg32ScQp-n
;Fq2\J(apostrophe)d14"=&Cfd<u'PZCaS)IRQ]k4'QB(apostrophe)8[+t:
This is different from your output:
2)7!K0KE]R3&+u'0JmBM0JkO@2eFm,A2-#R0JYF?@PTrU1c8c#2e=g'1c8iR1L(apostrophe)4N2E3]S1c]#!@5BiO
...
Ah, now I understand the discrepancy. When you call va85encode(sha256(x)), the encoder is processing the hexadeximal text output of sha256, while my huge SQL expression above encodes the binary output of SHA256, so it encodes only 32 bytes, not 64.
Also, I totally agree that base64 and base32 are more portable than base85, because of the character sets. base85 doesn't play well with XML, for example. I'm even having trouble posting the base85 output in this forum because of the apostrophes. My goal in using base85 was to produce the shortest human-readable representation.
There is a really cool site https://cryptii.com/ that will show intermediate output of multiple processing steps. I also used this site to reproduce your output by feeding the textual output of SHA256 to the Ascii85 encoder.