MultiPhase Transform UDx are pointless; stacking separate Transform UDx provides more functionality
I took a look at using MultiPhase Transform UDx.
I am not sure why anybody would want to use them. They restrict functionality compared to stacking several regular Transform UDx.
Here is what I found:
The idea behind MultiPhase is to provide canned functionality, the same as what you get when calling stacked Transforms:
select transform1(v1.*) over() from (
select transform2(v2.*) over() from (
....
) v2
) v1
The output of transform2 goes to the input of transform1, creating stacked Transforms.
The same call with a MultiPhase transform would be
select MPtransform(*) over() from (xxxx);
With regular transforms in stacked calls, I can control where each stage is executed.
I can use over() for a single instance of the transform executed on the initiator node, over(partition nodes) for a single instance of the transform on each node, or over(partition best) for several instances of the transform per node. More importantly, I can specify individually where each transform is executed. For example, I can request that each transform is executed on every node in the cluster. And it works perfectly fine.
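To make the per-stage control concrete, here is a minimal sketch of a stacked call in which each stage gets its own partition clause. transform1 and transform2 are the placeholder names used above, source_table is a hypothetical input relation, and the choice of clause per stage is only an example, not a recommendation.

```sql
-- Sketch only: transform1/transform2 are placeholder UDTFs,
-- source_table is a hypothetical input relation.
SELECT transform1(v1.*) OVER (PARTITION NODES)     -- outer stage: one instance per node
FROM (
    SELECT transform2(v2.*) OVER (PARTITION BEST)  -- inner stage: several instances per node
    FROM (
        SELECT * FROM source_table
    ) v2
) v1;
```

Swapping either clause for over() would run that stage as a single instance on the initiator node while leaving the other stage untouched.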
What I found is that a MultiPhase transform is not that flexible. The first stage of a MultiPhase transform can be configured like a regular transform, but all other stages are forced to execute in a single thread on the initiator node.
So... I do not see a reason to use UDx MultiPhase Transforms. Stacked transforms work better, are more flexible, and have more functionality. Performance is the same for stacked transforms and MultiPhase transforms.
And, looking at the bigger picture, I do not see why Vertica released MultiPhase transforms. Regular UDx transforms in a stacked configuration already exist, work just fine, and are flexible and configurable.
What is the point of MultiPhase UDx Transforms?
Looking forward to comments from a Vertica UDx developer.
Answers
Hi Sergey,
Thanks for your very good question. As you have correctly explained, a stack of UDTFs (User Defined Transform Functions) can do the job of an MPTF (Multi-Phase Transform Function), and it would be even more flexible. As you suspect, using an MPTF may not be an advantage if the developer and the user of the functions are the same person. However, an MPTF can be very handy when the developer of a piece of functionality (e.g., a data engineer) is not the same as its consumer (e.g., a business analyst or a data scientist). An MPTF wraps several UDTFs in a single function; therefore, the end user doesn't need to know the details of each UDTF or how to cascade them together.
Some of Vertica's own functions are also implemented as MPTFs. For example, the CORR_MATRIX function, which is installed automatically with Vertica's Machine-Learning library, is an MPTF. This function takes an input relation with numeric columns and calculates the Pearson Correlation Coefficient between each pair of its input columns. Can you imagine how awkward it would be if we had provided two UDTFs instead, and had instructed end users to cascade them together to get the right results?
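As a rough illustration of that single-call convenience, an invocation might look like the sketch below. The table and column names are hypothetical, and the exact argument list should be checked against the CORR_MATRIX documentation for your Vertica version.

```sql
-- Hypothetical table (baseball_stats) and numeric columns, for illustration only.
-- The end user issues one call; the phases behind it stay hidden.
SELECT CORR_MATRIX(hits, home_runs, runs_batted_in) OVER()
FROM baseball_stats;
```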
I'd like to add that MPTFs are supported by the Vertica SDK in three languages: C++, Java, and Python.
Yes, MPTFs are encapsulated and easier to use than stacked transforms.
Do not assume users are dumber than Vertica developers. In my case, they are on par or better. Maybe they have not spent as much time with Vertica, but they can easily follow instructions on stacking transforms.
The point here is that some key functionality is lost with MPTF, namely an equivalent of "partition by node|best|auto".
MPTF can already specify segmentation for each step. It is halfway there; it needs to add support for partition by.