Is this a Vertica C++ SDK bug ?

cfreely · May 2015

Hi,

I have some problems when develop a vertica custom analytics function with C++ SDK.

I simplify this problem, and here is what I do to reproduction it:

1. edit the example NthValue analytic function (/opt/vertica/sdk/examples/AnalyticFunctions/NthValue.cpp):

int i = 0, j = 0;
do {
    if (currentRow == nParam) {
        nthRowExists = true;
        // break;
    } else {
        currentRow++;
    }
    i++;
} while (inputReader.next());
// Output the n-th value.
if (nthRowExists) {
    do {
        outputWriter.copyFromInput(0 /*dest*/, inputReader, argCols.at(0) /*src*/);
        j++;
    } while (outputWriter.next());
} else {
    // The partition has less than n rows. Output NULL values.
    do {
        outputWriter.setNull(0);
        j++;
    } while (outputWriter.next());
}
srvInterface.log("SSSSSSSSSS:%d %d", i, j);

2. compile the examples

make AnalyticFunctions

3.run example test

vsql -f AnalyticFunctions.sql

4. view the UDx Log

16:47:46.398 [C++-instance-nwxsxyjq-25959:0x66e-32496] 0x7f1afaf31700 [UserMessage] nth_value - SSSSSSSSSS:3 131072
16:47:46.400 [C++-instance-nwxsxyjq-25959:0x66e-32496] 0x7f1afb732700 [UserMessage] nth_value - SSSSSSSSSS:3 131072
16:47:46.446 [C++-instance-nwxsxyjq-25959:0x66e-32496] 0x7f1afb732700 [UserMessage] nth_value - SSSSSSSSSS:3 131072
16:47:46.575 [C++-instance-nwxsxyjq-25959:0x66e-32496] 0x7f1afaf31700 Exiting UDx side process

Here is the problem: the number '3' and '131072' in log message is obviously incorrect, Why is this happened?

Thank You!

cfreely · May 2015

My environment information is：

Operation System: Centos 6.5
Vertica: 7.1.1

GCC: 4.4.7 20120313 (Red Hat 4.4.7-11) (GCC)

Thank you.

[Deleted User] · May 2015

Hi cfreely,

I believe I can explain at least half of what you are seeing. I suspect you're testing on a multi-node Vertica cluster?

If so, each node is going to get a partition of data to process whenever you call the UDx. In the Nth row example, the table is being partitioned on the first column in the table:

/***** Example 4: Nth-Value *****/
DROP TABLE T;
CREATE TABLE T(prod_id INTEGER, sale NUMERIC(5));
COPY T FROM STDIN DELIMITER ',';
1,10
2,20
3,50
1,40
1,60
2,50
2,70
3,40
3,10
\.

-- Invoke UDF
SELECT prod_id, sale, nth_value(sale, 2)
OVER(PARTITION BY prod_id ORDER BY sale) as "2nd_value" FROM T ORDER BY 1;

So, the input table is partitioned by the first column, which has three distinct values. Each partition will contain three rows, since each distict value appears three times in the table.

Each partition is likely to get sent off to a different node in your cluster to be processed. One thing to keep in mind is that the srvInterface.log() function only writes to the log file on the node on which the UDx instance is running. So, to get the full picture of what your UDx is doing, you need to check the log files on all of your nodes.

I added a bunch more data to the table and re-ran you altered example, and got the following results on node 2 of my three-node cluster:

 10:39:40.615 [C++-docd01.verticacorp.-10812:0x27b77-3632]  0x7fb5870d9700 [UserMessage] nth_value - SSSSSSSSSS:3 131072
 10:39:40.661 [C++-docd01.verticacorp.-10812:0x27b77-3632]  0x7fb5870d9700 [UserMessage] nth_value - SSSSSSSSSS:3 131072
 10:39:40.705 [C++-docd01.verticacorp.-10812:0x27b77-3632]  0x7fb5870d9700 [UserMessage] nth_value - SSSSSSSSSS:3 131072
 10:39:40.749 [C++-docd01.verticacorp.-10812:0x27b77-3632]  0x7fb5870d9700 [UserMessage] nth_value - SSSSSSSSSS:2 131072
 10:39:40.792 [C++-docd01.verticacorp.-10812:0x27b77-3632]  0x7fb5870d9700 [UserMessage] nth_value - SSSSSSSSSS:3 131072

So, node 2 handled 4 partitions of data (my node 1, which was the initiator node, only handled 1).

That explains why the "i" value is usually 3. Each partiton being processed in the example only has three rows. Note that one of the partitions from the list above only had two rows. Play with the input table a bit and you'll see how the work is being partitioned and sent off to the different nodes in your cluster.

As for why the output row count is so much higher... I'm not sure. I'll look into that, but I suspect it migth have something to do with a minimum output partition size.

cfreely · May 2015

Hi Gary_G,

Yes, you reproduced my problem, thank you very much!

So why the output row count is so much higher? It seems to be a bug in Vertica Community Edition. Or maybe there is something I didn't know?

I will waiting for your further results, thanks again for your help.

[Deleted User] · June 2015

OK, so I can confirm that this is a bug. What's happening is that the AnalyticPartitionWriter.next() function does not return false until the output buffer is filled. This doesn't result in an error or incorrect results because HP Vertica only retrieves the number of rows from the function that it sent to it. The rest of the junk values in the output block are ignored.

However, this does result in a lot of wasted iterations.

There is a workaround: instead of relying on .next() to return false to indicate we have output enough rows, we can loop until we produce the number of output rows matching the number of input rows returned by AnalyticPartitionReader.getNumRows().

So, based on the example in this message thread, we can change the code to exit once we have enough rows. Here's an updated version of the NthValue class:

class NthValue : public AnalyticFunction
{
    virtual void processPartition(ServerInterface &srvInterface, 
                                  AnalyticPartitionReader &inputReader, 
                                  AnalyticPartitionWriter &outputWriter)
    {
        try {

            //get the value of n from the parameters
            ParamReader paramReader = srvInterface.getParamReader();
            const VString label = paramReader.getStringRef("label");

            // Get # of rows we need to output
            const int numInRows = inputReader.getNumRows();


            const SizedColumnTypes &inTypes = inputReader.getTypeMetaData();
            vector<size_t> argCols; // Argument column indexes.
            inTypes.getArgumentColumns(argCols);
            vint nParam = 0;

            // Two function arguments expected.
            if (argCols.size() != 2 ||
                !inTypes.getColumnType(argCols.at(0)).isNumeric() || // a argument
                !inTypes.getColumnType(argCols.at(1)).isInt() || // n argument
                (nParam = inputReader.getIntRef(argCols.at(1))) < 1) // n positive?
                vt_report_error(1, "Two arguments (numeric, int) are expected with the second argument being a positive integer.");

            vint currentRow = 1;
            bool nthRowExists = false;

            // Find the value of the n-th row.
            int i = 0;
            int j = 0;
            do {
                if (currentRow == nParam) {
                    nthRowExists = true;
                    // break;
                } else {
                    currentRow++;
                }
                i++;
            } while (inputReader.next());


            // Output the n-th value.
            if (nthRowExists) {
                do {
                    outputWriter.copyFromInput(0 /*dest*/, inputReader, argCols.at(0) /*src*/);
                    j++;
                    outputWriter.next();
                } while (j <= numInRows); // Go until we have produced enough rows
            } else {
                // The partition has less than n rows. Output NULL values.
                do {
                    outputWriter.setNull(0);
                    outputWriter.next();
                } while (j <= numInRows);
            }
            srvInterface.log("SSSSSSSSSS:%d %d", i, j);
        } catch(exception& e) {
            // Standard exception. Quit.
            vt_report_error(0, "Exception while processing partition: [%s]", e.what());
        }
    }
};

Running this code results in messages like the following in the UDx log:

 14:44:23.411 [C++-...]  0x7f2283563700 [UserMessage] nth_value - SSSSSSSSSS:2001 2002
 14:44:23.413 [C++-...]  0x7f2283563700 [UserMessage] nth_value - SSSSSSSSSS:1982 1983
 14:44:23.415 [C++-...]  0x7f2283563700 [UserMessage] nth_value - SSSSSSSSSS:2034 2035
 14:44:23.416 [C++-...]  0x7f2283563700 [UserMessage] nth_value - SSSSSSSSSS:1924 1925
 14:44:23.418 [C++-...]  0x7f2283563700 [UserMessage] nth_value - SSSSSSSSSS:1946 1947
 14:44:23.420 [C++-...]  0x7f2283563700 [UserMessage] nth_value - SSSSSSSSSS:1967 1968

The off-by-one is caused by incrementing j one past the number of rows needed.

Note: We're using AnalyticPartitionReader.getNumRows() because there is a separate bug in AnalyticPartitionWriter.getNumRows(). The two values should always be the same. Unfortunately, AnalyticPartitionWriter.getNumRows() is producing the wroung count.

We're Moving!

Create My New Community Account Now

Is this a Vertica C++ SDK bug ?

Comments

Leave a Comment