Is this a Vertica C++ SDK bug?

Hi,

 

I ran into a problem while developing a Vertica custom analytic function with the C++ SDK.

I have simplified the problem; here is how to reproduce it:

 

1. Edit the example NthValue analytic function (/opt/vertica/sdk/examples/AnalyticFunctions/NthValue.cpp):

int i = 0, j = 0;
do {
    if (currentRow == nParam) {
        nthRowExists = true;
        // break;
    } else {
        currentRow++;
    }
    i++;
} while (inputReader.next());
// Output the n-th value.
if (nthRowExists) {
    do {
        outputWriter.copyFromInput(0 /*dest*/, inputReader, argCols.at(0) /*src*/);
        j++;
    } while (outputWriter.next());
} else {
    // The partition has less than n rows. Output NULL values.
    do {
        outputWriter.setNull(0);
        j++;
    } while (outputWriter.next());
}
srvInterface.log("SSSSSSSSSS:%d %d", i, j);

 

2. Compile the examples

 

make AnalyticFunctions

 

3. Run the example test

 

vsql -f AnalyticFunctions.sql

 

4. View the UDx log

 

16:47:46.398 [C++-instance-nwxsxyjq-25959:0x66e-32496] 0x7f1afaf31700 [UserMessage] nth_value - SSSSSSSSSS:3 131072
16:47:46.400 [C++-instance-nwxsxyjq-25959:0x66e-32496] 0x7f1afb732700 [UserMessage] nth_value - SSSSSSSSSS:3 131072
16:47:46.446 [C++-instance-nwxsxyjq-25959:0x66e-32496] 0x7f1afb732700 [UserMessage] nth_value - SSSSSSSSSS:3 131072
16:47:46.575 [C++-instance-nwxsxyjq-25959:0x66e-32496] 0x7f1afaf31700 Exiting UDx side process

 

Here is the problem: the numbers '3' and '131072' in the log messages are obviously incorrect. Why is this happening?

 

Thank You!

Comments

  • My environment information is:

     

Operating System: CentOS 6.5
    Vertica: 7.1.1

    GCC: 4.4.7 20120313 (Red Hat 4.4.7-11) (GCC) 

     

    Thank you.

  • Hi cfreely,

     

    I believe I can explain at least half of what you are seeing. I suspect you're testing on a multi-node Vertica cluster?

     

If so, each node is going to get a partition of data to process whenever you call the UDx. In the NthValue example, the table is partitioned on its first column:

     

    /***** Example 4: Nth-Value *****/
    DROP TABLE T;
    CREATE TABLE T(prod_id INTEGER, sale NUMERIC(5));
    COPY T FROM STDIN DELIMITER ',';
    1,10
    2,20
    3,50
    1,40
    1,60
    2,50
    2,70
    3,40
    3,10
    \.

    -- Invoke UDF
    SELECT prod_id, sale, nth_value(sale, 2)
    OVER(PARTITION BY prod_id ORDER BY sale) as "2nd_value" FROM T ORDER BY 1;

So, the input table is partitioned by the first column, which has three distinct values. Each partition will contain three rows, since each distinct value appears three times in the table.
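    As a sanity check, the partition sizes for this sample table can be computed with a quick standalone sketch (ordinary C++, not UDx code; the names here are just for illustration):

    ```cpp
    #include <map>
    #include <utility>
    #include <vector>

    // Rows of table T as (prod_id, sale) pairs, matching the COPY data above.
    static const std::vector<std::pair<int, int>> kRows = {
        {1, 10}, {2, 20}, {3, 50}, {1, 40}, {1, 60},
        {2, 50}, {2, 70}, {3, 40}, {3, 10}
    };

    // Counts rows per PARTITION BY key (prod_id).
    std::map<int, int> partitionSizes(const std::vector<std::pair<int, int>>& rows) {
        std::map<int, int> sizes;
        for (const auto& row : rows) {
            ++sizes[row.first];  // row.first is prod_id
        }
        return sizes;
    }
    ```

    Each of the three keys maps to three rows, so every partition handed to the UDx holds exactly three rows.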

     

    Each partition is likely to get sent off to a different node in your cluster to be processed. One thing to keep in mind is that the srvInterface.log() function only writes to the log file on the node on which the UDx instance is running. So, to get the full picture of what your UDx is doing, you need to check the log files on all of your nodes.

     

    I added a bunch more data to the table and re-ran your altered example, and got the following results on node 2 of my three-node cluster:

     

    10:39:40.615 [C++-docd01.verticacorp.-10812:0x27b77-3632] 0x7fb5870d9700 [UserMessage] nth_value - SSSSSSSSSS:3 131072
    10:39:40.661 [C++-docd01.verticacorp.-10812:0x27b77-3632] 0x7fb5870d9700 [UserMessage] nth_value - SSSSSSSSSS:3 131072
    10:39:40.705 [C++-docd01.verticacorp.-10812:0x27b77-3632] 0x7fb5870d9700 [UserMessage] nth_value - SSSSSSSSSS:3 131072
    10:39:40.749 [C++-docd01.verticacorp.-10812:0x27b77-3632] 0x7fb5870d9700 [UserMessage] nth_value - SSSSSSSSSS:2 131072
    10:39:40.792 [C++-docd01.verticacorp.-10812:0x27b77-3632] 0x7fb5870d9700 [UserMessage] nth_value - SSSSSSSSSS:3 131072

    So, node 2 handled five partitions of data (my node 1, which was the initiator node, only handled one).

     

    That explains why the "i" value is usually 3: each partition being processed in the example has only three rows. Note that one of the partitions in the list above had only two rows. Play with the input table a bit and you'll see how the work is partitioned and sent off to the different nodes in your cluster.

     

    As for why the output row count is so much higher... I'm not sure. I'll look into that, but I suspect it might have something to do with a minimum output partition size.

     

  • Hi Gary_G,

     

    Yes, you reproduced my problem, thank you very much! 

     

    So why is the output row count so much higher? It seems to be a bug in Vertica Community Edition. Or maybe there is something I don't know?

     

    I will wait for your further results. Thanks again for your help.

  • OK, so I can confirm that this is a bug. What's happening is that the AnalyticPartitionWriter.next() function does not return false until the output buffer is filled. This doesn't result in an error or incorrect results, because HP Vertica only reads back as many rows from the function as it sent in; the rest of the junk values in the output block are ignored.

     

    However, this does result in a lot of wasted iterations.
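    To see why the loop overshoots, here's a minimal standalone sketch of the pattern (plain C++ with a mock writer class made up for illustration; this is not the Vertica SDK, and the capacity value is just the one observed in the logs):

    ```cpp
    // Hypothetical stand-in for AnalyticPartitionWriter: next() keeps
    // returning true until the fixed-size output buffer is full, which is
    // the behavior described above.
    struct MockPartitionWriter {
        int capacity;  // e.g. 131072 rows in the observed logs
        int pos;
        explicit MockPartitionWriter(int cap) : capacity(cap), pos(0) {}
        bool next() { ++pos; return pos < capacity; }  // true while there is room
    };

    // Reproduces the original do/while pattern: it writes until next()
    // returns false, i.e. until the buffer fills, not until the partition ends.
    int rowsWrittenByNextLoop(int bufferCapacity) {
        MockPartitionWriter writer(bufferCapacity);
        int j = 0;
        do {
            // outputWriter.copyFromInput(...) would go here
            ++j;
        } while (writer.next());
        return j;
    }
    ```

    With a buffer of 131072 rows, the loop body runs 131072 times regardless of how many rows the partition actually contains, which matches the j values in the logs above.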

     

    There is a workaround: instead of relying on .next() to return false to indicate we have output enough rows, we can loop until we have produced as many output rows as there are input rows, as reported by AnalyticPartitionReader.getNumRows().

     

    So, based on the example in this message thread, we can change the code to exit once we have enough rows. Here's an updated version of the NthValue class:

     

    class NthValue : public AnalyticFunction
    {
        virtual void processPartition(ServerInterface &srvInterface,
                                      AnalyticPartitionReader &inputReader,
                                      AnalyticPartitionWriter &outputWriter)
        {
            try {
                // Get the value of n from the parameters.
                ParamReader paramReader = srvInterface.getParamReader();
                const VString label = paramReader.getStringRef("label");

                // Get the number of rows we need to output.
                const int numInRows = inputReader.getNumRows();

                const SizedColumnTypes &inTypes = inputReader.getTypeMetaData();
                vector<size_t> argCols; // Argument column indexes.
                inTypes.getArgumentColumns(argCols);
                vint nParam = 0;

                // Two function arguments expected.
                if (argCols.size() != 2 ||
                    !inTypes.getColumnType(argCols.at(0)).isNumeric() || // a argument
                    !inTypes.getColumnType(argCols.at(1)).isInt() ||     // n argument
                    (nParam = inputReader.getIntRef(argCols.at(1))) < 1) // n positive?
                    vt_report_error(1, "Two arguments (numeric, int) are expected with the second argument being a positive integer.");

                vint currentRow = 1;
                bool nthRowExists = false;

                // Find the value of the n-th row.
                int i = 0;
                int j = 0;
                do {
                    if (currentRow == nParam) {
                        nthRowExists = true;
                        // break;
                    } else {
                        currentRow++;
                    }
                    i++;
                } while (inputReader.next());

                // Output the n-th value.
                if (nthRowExists) {
                    do {
                        outputWriter.copyFromInput(0 /*dest*/, inputReader, argCols.at(0) /*src*/);
                        j++;
                        outputWriter.next();
                    } while (j <= numInRows); // Go until we have produced enough rows.
                } else {
                    // The partition has fewer than n rows. Output NULL values.
                    do {
                        outputWriter.setNull(0);
                        j++; // Without this increment the loop would never terminate.
                        outputWriter.next();
                    } while (j <= numInRows);
                }
                srvInterface.log("SSSSSSSSSS:%d %d", i, j);
            } catch(exception& e) {
                // Standard exception. Quit.
                vt_report_error(0, "Exception while processing partition: [%s]", e.what());
            }
        }
    };

    Running this code results in messages like the following in the UDx log:

     

    14:44:23.411 [C++-...] 0x7f2283563700 [UserMessage] nth_value - SSSSSSSSSS:2001 2002
    14:44:23.413 [C++-...] 0x7f2283563700 [UserMessage] nth_value - SSSSSSSSSS:1982 1983
    14:44:23.415 [C++-...] 0x7f2283563700 [UserMessage] nth_value - SSSSSSSSSS:2034 2035
    14:44:23.416 [C++-...] 0x7f2283563700 [UserMessage] nth_value - SSSSSSSSSS:1924 1925
    14:44:23.418 [C++-...] 0x7f2283563700 [UserMessage] nth_value - SSSSSSSSSS:1946 1947
    14:44:23.420 [C++-...] 0x7f2283563700 [UserMessage] nth_value - SSSSSSSSSS:1967 1968

    The off-by-one in the log (j is one greater than i) comes from the do/while condition j <= numInRows, which runs the body one extra time after j reaches numInRows.
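    The pattern behind the off-by-one can be seen in isolation (a standalone sketch, not SDK code): a do/while guarded by `j <= n` executes its body n + 1 times, because the body runs once more after j reaches n.

    ```cpp
    // Counts how many times the body of `do { ... } while (j <= n)` runs.
    int iterationsWithLessOrEqual(int n) {
        int j = 0;
        do {
            ++j;  // mirrors the j++ in the output loop above
        } while (j <= n);
        return j;  // always n + 1
    }
    ```

    Guarding with `j < numInRows` instead would produce exactly numInRows output rows.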

     

    Note: We're using AnalyticPartitionReader.getNumRows() because there is a separate bug in AnalyticPartitionWriter.getNumRows(). The two values should always be the same; unfortunately, AnalyticPartitionWriter.getNumRows() is producing the wrong count.
