Bug: UNION ALL of CTEs produces invalid UTF-8 when one CTE references the other (25.1.0-5)

shane_breezeshane_breeze Vertica Customer
edited 10:19AM in General Discussion

Vertica Version: 25.1.0-5

Summary:
I'm encountering invalid UTF-8 characters in string columns after performing a UNION ALL of two CTEs, where one CTE references the other via a LEFT JOIN on a grouped subquery. Each CTE individually contains only valid UTF-8 data. The corruption only appears after the UNION ALL.

Error:

Severity: ERROR
Message: Regexp encountered an invalid UTF-8 character
Sqlstate: 22021
Error Code: 4551
Detail: String = E'\177\377\377\377\377\377\377'

How to reproduce:

The pattern that triggers this is:

  1. cte1 — produces string columns from joined tables
  2. cte2 — references cte1 via a LEFT JOIN on an aggregated subquery (GROUP BY + MAX)
  3. Final query performs UNION ALL of cte1 and cte2
  4. A downstream REGEXP_SUBSTR on the unioned result fails with invalid UTF-8

Key observations:

  • SELECT * FROM cte1 WHERE NOT isutf8(col1) returns 0 rows
  • SELECT * FROM cte2 WHERE NOT isutf8(col1) returns 0 rows
  • The UNION ALL of both returns rows with invalid UTF-8
  • Removing the LEFT JOIN from cte2 makes the problem disappear
  • The issue seems related to how the optimizer inlines the CTE when it is referenced multiple times (once in cte2's left join, once in the final UNION ALL)

Workaround:
Using the query hint /*+ENABLE_WITH_CLAUSE_MATERIALIZATION*/ or materializing cte1 into a temp table appears to resolve the issue.

Questions:
1. Is this a known issue?
2. Is there a fix planned or available in a later patch?
3. Is ENABLE_WITH_CLAUSE_MATERIALIZATION a safe setting to enable globally as a workaround?

Thank you for any guidance.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file