Bug: UNION ALL of CTEs produces invalid UTF-8 when one CTE references the other (25.1.0-5)
Vertica Version: 25.1.0-5
Summary:
I'm encountering invalid UTF-8 characters in string columns after performing a UNION ALL of two CTEs, where one CTE references the other via a LEFT JOIN on a grouped subquery. Each CTE individually contains only valid UTF-8 data. The corruption only appears after the UNION ALL.
Error:
Severity: ERROR Message: Regexp encountered an invalid UTF-8 character Sqlstate: 22021 Error Code: 4551 Detail: String = E'\177\377\377\377\377\377\377'
How to reproduce:
The pattern that triggers this is:
cte1— produces string columns from joined tablescte2— referencescte1via a LEFT JOIN on an aggregated subquery (GROUP BY+MAX)- Final query performs
UNION ALLofcte1andcte2 - A downstream
REGEXP_SUBSTRon the unioned result fails with invalid UTF-8
Key observations:
SELECT * FROM cte1 WHERE NOT isutf8(col1)returns 0 rowsSELECT * FROM cte2 WHERE NOT isutf8(col1)returns 0 rows- The
UNION ALLof both returns rows with invalid UTF-8 - Removing the
LEFT JOINfromcte2makes the problem disappear - The issue seems related to how the optimizer inlines the CTE when it is referenced multiple times (once in
cte2's left join, once in the finalUNION ALL)
Workaround:
Using the query hint /*+ENABLE_WITH_CLAUSE_MATERIALIZATION*/ or materializing cte1 into a temp table appears to resolve the issue.
Questions:
1. Is this a known issue?
2. Is there a fix planned or available in a later patch?
3. Is ENABLE_WITH_CLAUSE_MATERIALIZATION a safe setting to enable globally as a workaround?
Thank you for any guidance.
0