Specifying length of VARCHAR
In Vertica, is there a benefit to frugally specifying length limits on VARCHAR columns? Does the declared length allocate space, such that e.g. VARCHAR(5) for ZIP codes is better than a bare VARCHAR? Conversely, is there a downside to specifying large values, e.g. VARCHAR(65535), when the string length of possible values might be larger than the default of 80 but the upper limit is not known?
By contrast, I understand that in Postgres there's no advantage to specifying the length: https://dba.stackexchange.com/questions/20974/should-i-add-an-arbitrary-length-limit-to-varchar-columns
Answers
Prior to version 9.3, yes, there is a performance penalty in Vertica for specifying a "large" VARCHAR length. Primarily this is because, although disk storage uses only the actual length of the data, memory allocation happens for the full declared length of the column, even if the data is much smaller.
There are two ways the performance penalty manifests.
The first is when a VARCHAR(500) (or any large value) is used in a JOIN condition with a HASH join; I have seen this happen in production systems. The hash table will be unnecessarily large, resulting in OOM or disk spilling. Vertica 9.3 introduced the JFMT hint and the JoinDefaultTupleFormat configuration parameter (settable at session and database level), which mitigate this problem effectively, though the default value of the parameter is still "fixed" (not "variable").
https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/AnalyzingData/Optimizations/VariableLengthDataJoin.htm?zoom_highlight=JoinDefaultTupleFormat
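For reference, a minimal sketch of the three places you can switch the tuple format, per the doc page above (object names here are made up; check the syntax against your version):

-- Session level:
ALTER SESSION SET JoinDefaultTupleFormat = 'variable';

-- Database level (the shipped default is still 'fixed'):
ALTER DATABASE DEFAULT SET JoinDefaultTupleFormat = 'variable';

-- Per query, via the JFMT hint (works only together with SYNTACTIC_JOIN):
SELECT /*+SYNTACTIC_JOIN*/ f.order_id, d.long_description
FROM   orders f
JOIN   /*+JFMT(v)*/ item_dim d ON f.item_key = d.item_key;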
The other way the penalty manifests, apart from the larger memory footprint, is that if Vertica has to transmit the result to the client, it will use full-length buffers. I have seen a LONG VARCHAR declared at 2.5 million where the actual text data is in the range of 10-100K.
This can be mitigated by casting or taking a substring when selecting data from such columns.
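For instance, something along these lines (table and column names are made up):

-- Cast down to a realistic width before the result is sent to the client:
SELECT id, note::VARCHAR(200) AS note
FROM   audit_log;

-- Or take an explicit substring:
SELECT id, SUBSTR(note, 1, 200) AS note
FROM   audit_log;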
IMHO, anytime a column is used in segmentation, ORDER BY, GROUP BY, DISTINCT, etc., it is also more efficient to specify a reasonable/expected length for the VARCHAR column, as in the sketch below.
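For illustration, a hypothetical table definition where the column used for sorting and segmentation is kept tight:

CREATE TABLE customer_events (
    zip    VARCHAR(5),      -- right-sized: cheap in ORDER BY / GROUP BY / segmentation
    detail VARCHAR(65000)   -- oversized: avoid using this in any of the above
)
ORDER BY zip
SEGMENTED BY HASH(zip) ALL NODES;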
A quick example shows what @Sudhakar_B is referring to.
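A minimal sketch of that kind of comparison, assuming two tables identical except for the declared width (names and data are made up):

CREATE TABLE t_wide  (k INT, s VARCHAR(65000));
CREATE TABLE t_tight (k INT, s VARCHAR(32));
-- ...load the same data into both...

-- Profile the same hash join against each:
PROFILE SELECT COUNT(*) FROM t_wide  a JOIN t_wide  b ON a.s = b.s;
PROFILE SELECT COUNT(*) FROM t_tight a JOIN t_tight b ON a.s = b.s;

-- Then compare the memory each statement acquired:
SELECT transaction_id, statement_id, memory_inuse_kb
FROM   v_monitor.resource_acquisitions
ORDER  BY queue_entry_timestamp DESC;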
When I first started working with Vertica as a consultant back in 2011, a client had defined ALL of their character fields as VARCHAR(65000)! Needless to say, they were complaining about query performance. We "right-sized" their VARCHARs, and voila, performance went through the roof!
Thank you for this awesomely clear answer. Providing those example performance checks is also extremely helpful in showing how to test before posting to the forum.
@Pieter_Sheth-Vo - That was a great question!
Note that the documentation does hint that it's a bad idea to oversize your VARCHARs.
See: https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/SQLReferenceManual/DataTypes/CharacterDataTypes.htm
@Jim_Knicely ,
Is there a downside to setting JoinDefaultTupleFormat to "variable" at the database level? Even in 10.x the default is "fixed".
The reason I am asking is that the parameter's documentation suggests it should be changed only under Vertica support guidance.
Furthermore, the JFMT hint works only together with the SYNTACTIC_JOIN hint, at the individual query level.
So is it possible that setting this at the database level disables some other optimizer options?
I have done some extensive testing and this seems to work well.
TIA
Well, @Sudhakar_B ,
As a matter of fact, yes, there is a downside in doing so.
Setting JoinDefaultTupleFormat to "variable" will not resolve all issues with oversized column types: loading and sorting remain penalised at the database level; and, what hurts much more, all SQL clients - even vsql - will over-allocate fetch buffers for the maximum possible lengths. And those maximum possible lengths are multiplied by the rowset - the number of rows you fetch with each fetch call. With VARCHAR(65000) and a rowset of 10,000 rows, that is already roughly 650 MB of fetch buffer for a single column. You'll run out of memory quickly on the middle tier or the client side - or both. This applies to BI tools, ETL tools, and your favourite SQL client, to name a few.
And if you use your judgment and size your columns wisely, then with JoinDefaultTupleFormat set to "variable" you're slower than with fixed formats: with fixed widths, simple pointer arithmetic lets you navigate between tuples, whereas with variable buffer lengths you always have to calculate at the atomic level to navigate the memory.
My suggestion is to run a
SELECT MAX(OCTET_LENGTH(RTRIM(s)))
for all suspicious strings, and use the result for re-dimensioning your VARCHARs. Fix the problems as high up in the food chain as you can.
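Spelled out against a hypothetical table, that flow might look like this (in-place shrinking of a VARCHAR may be refused depending on version, so a copy-and-swap is shown):

-- 1. Find the widest value actually stored:
SELECT MAX(OCTET_LENGTH(RTRIM(s))) FROM sales.raw_feed;

-- 2. Suppose it returns 42: rebuild the column with some headroom.
CREATE TABLE sales.raw_feed_resized AS
SELECT k, s::VARCHAR(64) AS s
FROM   sales.raw_feed;

-- 3. Swap the tables once the copy is verified:
DROP TABLE sales.raw_feed;
ALTER TABLE sales.raw_feed_resized RENAME TO raw_feed;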
@marcothesane ,
Thank you for the detailed response. I agree with your overall assessment and guidance.
While I understand the impact of using "wide" columns, in the industry and among Vertica customers the following points come into play.
1. Vertica customers who build a data lake in Vertica tend to lean towards ELT rather than ETL. At some customer sites, data from several different independent sources (and databases) is continuously ingested into Vertica with minimal transformation, and all the transformations (technical and business) then happen within Vertica. In such situations customers want to maintain "high fidelity" of data types and models relative to the respective source systems, over which we have no control. Furthermore, the expectation is that the Vertica data lake/EDW be "more forgiving about data types" than the respective source operational systems.
2. In such environments, data ingest and DML normally happen at the individual table level with minimal to no joins. Though they are penalized for wide-column usage, the penalty is acceptable and can be managed, as the data-ingest team has control over these processes.
3. Data consumers from Vertica - BI tools, R/Python platforms, vsql, or other clients - typically run large, complex analytic and exploratory queries in Vertica. Such queries typically perform several inefficient joins (on wide VARCHAR columns) and much computation and aggregation within Vertica on very large data sets, and return much smaller aggregated results to the client. This is why I was focusing more on the JOIN (especially HASH) in Vertica and the JoinDefaultTupleFormat parameter.
4. Though I have seen some clients extract large result sets from Vertica for analytic processing on other platforms, I dissuade them from that practice, as Vertica is designed for in-DB analytic workloads.
5. I have done some extensive testing with this parameter at one customer site (version 9.3.1) and found very favorable results. However, since this is a new parameter in 9.3, I wanted to reach out to see if there are any known specific downsides to setting the value to "variable".
As I can't share the customer test cases here, I found a repeatable use case on the Vertica VMware version 10.0.0 image to demonstrate the effectiveness of the JoinDefaultTupleFormat parameter.
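A minimal sketch of the shape of such a test, assuming a wide join key (names are made up; the point is comparing the memory acquired under the two formats):

ALTER SESSION SET JoinDefaultTupleFormat = 'fixed';
PROFILE SELECT COUNT(*) FROM fact f JOIN dim d ON f.wide_key = d.wide_key;

ALTER SESSION SET JoinDefaultTupleFormat = 'variable';
PROFILE SELECT COUNT(*) FROM fact f JOIN dim d ON f.wide_key = d.wide_key;

-- Compare memory_inuse_kb for the two PROFILE runs in
-- v_monitor.resource_acquisitions; 'variable' should acquire far less
-- memory when wide_key is an oversized VARCHAR.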