Base85 Encoding using built-in Vertica SQL functions

verticauser810verticauser810 Vertica Customer
edited November 2021 in General Discussion

Maybe you need to squeeze 64 hex characters (32 bytes) into something smaller while maintaining human-readability. This might be useful if you've de-identified a column using the SHA256 function, but you have a field width limitation of 40 characters, for example.

The SQL expression below condenses 64 hex characters (or less) into 40 human-readable characters, using Base85 encoding. Base85 encoding is nice because it fits binary data of N bytes into a string of N*1.25 human-readable characters.

This expression isn't terribly fast, but on our 8-node cluster it encodes 10 million distinct values in well under a minute. That time includes running SHA256() on the input values as well.

This SQL expression was adapted from the definition of Base85 encoding found on Wikipedia: Ascii85

Normally one would request dbadmins to create a user-defined function (UDF) for this, but we're discouraged from that due to the maintenance overhead of maintaining many clusters.

WITH HexData AS (
    SELECT TO_HEX('Man '::VARBINARY) AS "x"
    UNION ALL SELECT TO_HEX('Man is distinguished, not only b'::VARBINARY) AS "x"
    UNION ALL SELECT SHA256('Man is distinguished, not only by his reason, but by this singular passion from other animals, which is a lust of the mind, that by a perseverance of delight in the continued and indefatigable generation of knowledge, exceeds the short vehemence of any carnal pleasure.')
)
SELECT 
    -- Input value "x" should be 64 hex characters (or less).
    -- Output value "Base85Encoded" is 40 human-readable characters in the 90-character range ASCII(33) to ASCII(122).
    -- Note that numbers 52200625, 614125, and 7225 are equal to the number 85 raised to the power of 4, 3, and 2, respectively.
    "x",
    -- ##### 4-tuple 1 ##### Base85 characters 1 - 5
    CHR(33 + TRUNC(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 1, 8)) / 52200625, 0)::INT) ||
    CHR(33 + TRUNC(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 1, 8)), 52200625) / 614125, 0)::INT) ||
    CHR(33 + TRUNC(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 1, 8)), 52200625), 614125) / 7225, 0)::INT) ||
    CHR(33 + TRUNC(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 1, 8)), 52200625), 614125), 7225) / 85, 0)::INT) ||
    CHR(33 + MOD(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 1, 8)), 52200625), 614125), 7225), 85)::INT) ||
    -- ##### 4-tuple 2 ##### Base85 characters 6 - 10
    CHR(33 + TRUNC(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 9, 8)) / 52200625, 0)::INT) ||
    CHR(33 + TRUNC(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 9, 8)), 52200625) / 614125, 0)::INT) ||
    CHR(33 + TRUNC(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 9, 8)), 52200625), 614125) / 7225, 0)::INT) ||
    CHR(33 + TRUNC(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 9, 8)), 52200625), 614125), 7225) / 85, 0)::INT) ||
    CHR(33 + MOD(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 9, 8)), 52200625), 614125), 7225), 85)::INT) ||
    -- ##### 4-tuple 3 ##### Base85 characters 11 - 15
    CHR(33 + TRUNC(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 17, 8)) / 52200625, 0)::INT) ||
    CHR(33 + TRUNC(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 17, 8)), 52200625) / 614125, 0)::INT) ||
    CHR(33 + TRUNC(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 17, 8)), 52200625), 614125) / 7225, 0)::INT) ||
    CHR(33 + TRUNC(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 17, 8)), 52200625), 614125), 7225) / 85, 0)::INT) ||
    CHR(33 + MOD(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 17, 8)), 52200625), 614125), 7225), 85)::INT) ||
    -- ##### 4-tuple 4 ##### Base85 characters 16 - 20
    CHR(33 + TRUNC(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 25, 8)) / 52200625, 0)::INT) ||
    CHR(33 + TRUNC(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 25, 8)), 52200625) / 614125, 0)::INT) ||
    CHR(33 + TRUNC(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 25, 8)), 52200625), 614125) / 7225, 0)::INT) ||
    CHR(33 + TRUNC(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 25, 8)), 52200625), 614125), 7225) / 85, 0)::INT) ||
    CHR(33 + MOD(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 25, 8)), 52200625), 614125), 7225), 85)::INT) ||
    -- ##### 4-tuple 5 ##### Base85 characters 21 - 25
    CHR(33 + TRUNC(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 33, 8)) / 52200625, 0)::INT) ||
    CHR(33 + TRUNC(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 33, 8)), 52200625) / 614125, 0)::INT) ||
    CHR(33 + TRUNC(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 33, 8)), 52200625), 614125) / 7225, 0)::INT) ||
    CHR(33 + TRUNC(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 33, 8)), 52200625), 614125), 7225) / 85, 0)::INT) ||
    CHR(33 + MOD(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 33, 8)), 52200625), 614125), 7225), 85)::INT) ||
    -- ##### 4-tuple 6 ##### Base85 characters 26 - 30
    CHR(33 + TRUNC(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 41, 8)) / 52200625, 0)::INT) ||
    CHR(33 + TRUNC(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 41, 8)), 52200625) / 614125, 0)::INT) ||
    CHR(33 + TRUNC(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 41, 8)), 52200625), 614125) / 7225, 0)::INT) ||
    CHR(33 + TRUNC(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 41, 8)), 52200625), 614125), 7225) / 85, 0)::INT) ||
    CHR(33 + MOD(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 41, 8)), 52200625), 614125), 7225), 85)::INT) ||
    -- ##### 4-tuple 7 ##### Base85 characters 31 - 35
    CHR(33 + TRUNC(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 49, 8)) / 52200625, 0)::INT) ||
    CHR(33 + TRUNC(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 49, 8)), 52200625) / 614125, 0)::INT) ||
    CHR(33 + TRUNC(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 49, 8)), 52200625), 614125) / 7225, 0)::INT) ||
    CHR(33 + TRUNC(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 49, 8)), 52200625), 614125), 7225) / 85, 0)::INT) ||
    CHR(33 + MOD(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 49, 8)), 52200625), 614125), 7225), 85)::INT) ||
    -- ##### 4-tuple 8 ##### Base85 characters 36 - 40
    CHR(33 + TRUNC(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 57, 8)) / 52200625, 0)::INT) ||
    CHR(33 + TRUNC(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 57, 8)), 52200625) / 614125, 0)::INT) ||
    CHR(33 + TRUNC(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 57, 8)), 52200625), 614125) / 7225, 0)::INT) ||
    CHR(33 + TRUNC(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 57, 8)), 52200625), 614125), 7225) / 85, 0)::INT) ||
    CHR(33 + MOD(MOD(MOD(MOD(HEX_TO_INTEGER(SUBSTRING(LPAD("x", 64, '0'), 57, 8)), 52200625), 614125), 7225), 85)::INT) AS "Base85Encoded"
FROM HexData
;

Results:

x Base85Encoded
4d616e20 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!9jqo^
4d616e2069732064697374696e677569736865642c206e6f74206f6e6c792062 9jqo^BlbD-BleB1DJ+*+F(f,q/0JhKFCj@.4
78fe75026c4390ceccc4e9e6a9428ba8ae5968b458e60b5eebecd682cf24bbf2 GlDgeCdX<0bf&c.WBuNAY$#GF=QTutlg32ScQp-n

Comments

  • SergeBSergeB Employee
    edited November 2021

    Thanks for sharing your encoding.

    I enhanced a Python UDX I had shared here previously : https://forum.vertica.com/discussion/241198/base64-decode-encode-functions-in-vertica-via-python-udx-sdk

    to also provide base85 and ascii85 encoding/decoding based on the Base64 python module. Enhanced UDX attached.

    SQL to create the new base85/ascii85 functions:

    CREATE FUNCTION vb85encode AS LANGUAGE 'Python' NAME 'vb85encode_factory' LIBRARY pylib fenced;
    CREATE FUNCTION vb85decode AS LANGUAGE 'Python' NAME 'vb85decode_factory' LIBRARY pylib fenced;
    CREATE FUNCTION va85decode AS LANGUAGE 'Python' NAME 'va85decode_factory' LIBRARY pylib fenced;
    CREATE FUNCTION va85encode AS LANGUAGE 'Python' NAME 'va85encode_factory' LIBRARY pylib fenced;

    However I do get different results than in your test (although the forum might not like some of the generated encodings, base64 is far cleaner to pass around).

    with ztext as
    (select 'Man is distinguished, not only by his reason, but by this singular passion from other animals, which is a lust of the mind, that by a perseverance of delight in the continued and indefatigable generation of knowledge, exceeds the short vehemence of any carnal pleasure' as x)
    select x, va85encode(sha256(x)) as va85sha256, va85encode(x), vb85encode(x), vb64encode(x) from ztext;
    -[ RECORD 1 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    x | Man is distinguished, not only by his reason, but by this singular passion from other animals, which is a lust of the mind, that by a perseverance of delight in the continued and indefatigable generation of knowledge, exceeds the short vehemence of any carnal pleasure
    va85sha256 | 2)7!K0KE]R3&+u'0JmBM0JkO@2eFm,A2-#R0JYF?@PTrU1c8c#2e=g'1c8iR1L4N2E3]S1c]#!@5BiO va85encode | 9jqo^BlbD-BleB1DJ+*+F(f,q/0JhKF<GL>Cj@.4Gp$d7F!,L7@<6@)/0JDEF<G%<+EV:2F!,O<DJ+*.@<*K0@<6L(Df-\0Ec5e;DffZ(EZee.Bl.9pF"AGXBPCsi+DGm>@3BB/F*&OCAfu2/AKYi(DIb:@FD,*)+C]U=@3BN#EcYf8ATD3s@q?d$AftVqCh[NqF<G:8+EV:.+Cf>-FD5W8ARlolDIal(DId<j@<?3r@:F%a+D58'ATD4$Bl@l3De:,-DJs8ARoFb/0JMK@qB4^F!,R<AKZ&-DfTqBG%G>uD.RTpAKYo'+CT/5+Cei#DII?(E,9)oF2M7
    vb85encode | O<^zX>%ZCX>)XGZfA9Ab7*BEFf-gbRchTY<VDJc_3(Mb0BhMVRLV8EFfZabRc4RAarPHb0BkRZfA9DVR9gFVRLh7Z
    CxFa&K)QZ**v7av))DX>DO_b1WctXlY|;AZc?TVIXXEb95kYW~HEWgu;7Ze%PVbZB98AYyqSVIXj2a&uNWpZI|VU(3W*}rY-wjbRcPNAarPDAY*TCbZKsNWn>^>Ze$>7Ze(R<VRUI{VPb4$AZKN6WpZJ3X>V>IZ)PBCZf|#NWn^b%EFfigVXJzb0BnRWgv5CZ*p`Xc4cT~ZDnp_Wgu^6AYpEKAY);2ZeeU7aBO8^b9HiM
    vb64encode | TWFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5IGJ5IGhpcyByZWFzb24sIGJ1dCBieSB0aGlzIHNpbmd1bGFyIHBhc3Npb24gZnJvbSBvdGhlciBhbmltYWxzLCB3aGljaCBpcyBhIGx1c3Qgb2YgdGhlIG1pbmQsIHRoYXQgYnkgYSBwZXJzZXZlcmFuY2Ugb2YgZGVsaWdodCBpbiB0aGUgY29udGludWVkIGFuZCBpbmRlZmF0aWdhYmxlIGdlbmVyYXRpb24gb2Yga25vd2xlZGdlLCBleGNlZWRzIHRoZSBzaG9ydCB2ZWhlbWVuY2Ugb2YgYW55IGNhcm5hbCBwbGVhc3VyZQ==

    Anyways, feel fee to reuse that UDX.

  • verticauser810verticauser810 Vertica Customer
    edited November 2021

    Hi @SergeB , I really appreciate your sharing, and I'm sure others will benefit from it as well. Unfortunately, our engineers don't have permission to create UDFs in our systems, but your code will certainly benefit those who can.

    I would like to check our results to find out why there is a discrepancy in the outputs. One thing I noticed is your input text is missing a period at the end. Still, I cannot reproduce your output either way. I used a third-party tool to check the encoding and it seems to agree with my results. The specific case I'm checking is when you first process the text with SHA256 and then encode the binary output of that as Base85.

    Here are my two results of encoding the long phrase after running it through SHA256, both with and without the final period:
    GlDgeCdX<0bf&c.WBuNAY$#GF=QTutlg32ScQp-n
    ;Fq2\J(apostrophe)d14"=&Cfd<u'PZCaS)IRQ]k4'QB(apostrophe)8[+t:

    This is different from your output:
    2)7!K0KE]R3&+u'0JmBM0JkO@2eFm,A2-#R0JYF?@PTrU1c8c#2e=g'1c8iR1L(apostrophe)4N2E3]S1c]#!@5BiO

    ...
    Ah, now I understand the discrepancy. When you call va85encode(sha256(x)), the encoder is processing the hexadeximal text output of sha256, while my huge SQL expression above encodes the binary output of SHA256, so it encodes only 32 bytes, not 64.

    Also, I totally agree that base64 and base32 are more portable than base85, because of the character sets. base85 doesn't play well with XML, for example. I'm even having trouble posting the base85 output in this forum because of the apostrophes. My goal in using base85 was to produce the shortest human-readable representation.

  • verticauser810verticauser810 Vertica Customer
    edited November 2021

    There is a really cool site https://cryptii.com/ that will show intermediate output of multiple processing steps. I also used this site to reproduce your output by feeding the textual output of SHA256 to the Ascii85 encoder.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file