Error when review kmeans result

lijianglijiang Registered User

select kmeans('timeout_kmeans_model_1', 'timeout_anomaly_detect_view', '*', 4
using parameters max_iterations=20,
output_view='myKmeansTimeoutView1',
exclude_columns = 'sql_interval, my_node_address, partner_carrier_no',
key_columns= 'sql_interval, my_node_address, partner_carrier_no');
Finished in 6 iterations
select * from myKmeansTimeoutView1;
Got the below error:

ERROR 5861: Error calling setup() in User Function apply_kmeans at [src/Common/GPredict.cpp:42], error code: 0, message: Error in setup: [Error in setup: Invalid column '']

Comments

  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert

    Which version of Vertica are you using?

  • lijianglijiang Registered User

    Vertica Analytic Database v8.1.1-10

  • lijianglijiang Registered User

    The view, myKmeansTimeoutView1, is created. I can see it from the catalog.

  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert

    Hi,

    Can you run the following and post the results?

    select export_objects('','myKmeansTimeoutView1');

  • lijianglijiang Registered User

    Yes.
    Here is the result.
    export_objects|

    CREATE VIEW ML.myKmeansTimeoutView1 AS
    SELECT timeout_anomaly_detect_view.sql_interval,
    timeout_anomaly_detect_view.my_node_address,
    timeout_anomaly_detect_view.partner_carrier_no,
    public.apply_kmeans(timeout_anomaly_detect_view.hist_comparator, timeout_anomaly_detect_view.peer_comparator USING PARAMETERS model_name='ML.timeout_kmeans_model_1'::varchar(386)) AS cluster_id
    FROM ML.timeout_anomaly_detect_view;

    SELECT MARK_DESIGN_KSAFE(1);
    |

  • lijianglijiang Registered User

    Thank you for your help.

  • sKwasKwa Registered User
    edited February 7

    @lijiang
    Your problem in USING PARAMETERS definition:

    • exclude_columns = 'sql_interval, my_node_address, partner_carrier_no'
    • key_columns= 'sql_interval, my_node_address, partner_carrier_no'

    You excluded all columns, thats why you got error: Invalid column ''. Do you understand what do kmeans and how to use it?

    @Jim_Knicely
    Will Vertica open API for "how to create db objects like view with UDF"? kmeans is an UDF Transform function and it creates VIEW, but how(its not documented)?
    (Answer: No, its for internal usage - also accepted).

  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert

    Hi,

    I'm using Vertica 9. It works okay:

    dbadmin=> select kmeans('timeout_kmeans_model_1', 'timeout_anomaly_detect_view', '*', 4
    dbadmin(> using parameters max_iterations=20,
    dbadmin(> output_view='myKmeansTimeoutView1',
    dbadmin(> exclude_columns = 'sql_interval, my_node_address, partner_carrier_no',
    dbadmin(> key_columns= 'sql_interval, my_node_address, partner_carrier_no');
    WARNING 7858:  Only found 2 non-empty clusters. You may want to try again or use a better set of initial centers; or there may be fewer than k distinct datapoints in the table
              kmeans
    ---------------------------
     Finished in 1 iterations
    
    (1 row)
    
    dbadmin=> select * from myKmeansTimeoutView1;
     sql_interval | my_node_address | partner_carrier_no | cluster_id
    --------------+-----------------+--------------------+------------
     TEST         | TEST            |                  1 |          1
     TEST2        | TEST2           |                  2 |          0
     TEST2        | TEST2           |                  2 |          0
     TEST2        | TEST2           |                  2 |          0
                  |                 |                    |
     TEST2        | TEST2           |                  2 |          0
    (6 rows)
    

    I will check if there is a bug in 8.1.1.

    @lijiang Can you run the following and post the results?

    select export_objects('','timeout_anomaly_detect_view');

  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert
    edited February 7

    @sKwa - Will Vertica open API for "how to create db objects like view with UDF"?

    No, it's for internal usage ;)

  • lijianglijiang Registered User

    I did not exclude all columns from the training data.
    There are 5 columns, sql_interval, my_node_address, partner_carrier_no, hist_comparator, and peer_comparator, in the data view. The three columns specified are the key columns. I'm asking KMEAN to label the data in 4 clusters based on hist_comparator and peer_comparator for each key value, which is the combination of the three columns specified in the select statement.

  • lijianglijiang Registered User

    select export_objects('','timeout_anomaly_detect_view');
    returns the results below:
    export_objects|
    CREATE VIEW ML.timeout_anomaly_detect_view AS
    SELECT a.sql_interval,
    a.my_node_address,
    a.partner_carrier_no,
    round(CASE WHEN (a.timeout_msg_stddev = 0::numeric(18,0)) THEN 0::numeric(18,0) ELSE (abs((a.timeout_tdr_count - a.timeout_msg_avg)) / a.timeout_msg_stddev) END, 0) AS hist_comparator,
    (a.timeout_rate / a.timeout_rate_all) AS peer_comparator
    FROM ( SELECT kmean_training_data_hr.sql_interval,
    kmean_training_data_hr.my_node_address,
    kmean_training_data_hr.partner_carrier_no,
    kmean_training_data_hr.timeout_tdr_count,
    CASE WHEN (kmean_training_data_hr.total_tdr_count = 0) THEN 0::numeric(18,0) ELSE (kmean_training_data_hr.timeout_tdr_count / kmean_training_data_hr.total_tdr_count) END AS timeout_rate,
    kmean_training_data_hr.timeout_msg_avg,
    kmean_training_data_hr.timeout_msg_stddev,
    round((kmean_training_data_hr.total_msg_avg / sum(kmean_training_data_hr.total_msg_avg) OVER (PARTITION BY kmean_training_data_hr.sql_interval)), 2) AS traffic_percentage,
    CASE WHEN (sum(kmean_training_data_hr.total_tdr_count) OVER (PARTITION BY kmean_training_data_hr.sql_interval) = 0) THEN 0::numeric(18,0) ELSE (sum(kmean_training_data_hr.timeout_tdr_count) OVER (PARTITION BY kmean_training_data_hr.sql_interval) / sum(kmean_training_data_hr.total_tdr_count) OVER (PARTITION BY kmean_training_data_hr.sql_interval)) END AS timeout_rate_all
    FROM ML.kmean_training_data_hr) a
    WHERE ((a.traffic_percentage > 0.001) AND (a.timeout_rate_all > 0::numeric(18,0)));

    SELECT MARK_DESIGN_KSAFE(1);
    |

  • sKwasKwa Registered User

    @Jim_Knicely

    So can you give a link to a source code of kmeans? o:)

  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert

    @lijiang - Sorry, one more:

    select export_objects('','ML.kmean_training_data_hr');

    I want to see the data types.

  • sKwasKwa Registered User
    edited February 7

    @lijiang

    I did not exclude all columns from the training data.
    There are 5 columns [...]

    Don't get me wrong, I just asked do you understand what you do. Question do not includes all info(probably @Jim_Knicely has access to more info).

    here is formatted output:

    CREATE VIEW ml.timeout_anomaly_detect_view
    AS
      SELECT a.sql_interval,
             a.my_node_address,
             a.partner_carrier_no,
             Round(CASE
                     WHEN ( a.timeout_msg_stddev = 0 :: Numeric(18, 0) ) THEN 0 ::
                     Numeric(18, 0)
                     ELSE ( Abs(( a.timeout_tdr_count - a.timeout_msg_avg )) /
                            a.timeout_msg_stddev )
                   END, 0)                           AS hist_comparator,
             ( a.timeout_rate / a.timeout_rate_all ) AS peer_comparator
      FROM   (SELECT kmean_training_data_hr.sql_interval,
                     kmean_training_data_hr.my_node_address,
                     kmean_training_data_hr.partner_carrier_no,
                     kmean_training_data_hr.timeout_tdr_count,
                     CASE
                       WHEN ( kmean_training_data_hr.total_tdr_count = 0 ) THEN
                       0 :: Numeric(18, 0)
                       ELSE ( kmean_training_data_hr.timeout_tdr_count /
                                     kmean_training_data_hr.total_tdr_count )
                     END
                            AS timeout_rate,
                     kmean_training_data_hr.timeout_msg_avg,
                     kmean_training_data_hr.timeout_msg_stddev,
                     Round(( kmean_training_data_hr.total_msg_avg / Sum(
                             kmean_training_data_hr.total_msg_avg)
                                     OVER (
                                       PARTITION BY
                                     kmean_training_data_hr.sql_interval) ), 2
                            ) AS
                     traffic_percentage,
                     CASE
                       WHEN ( Sum(kmean_training_data_hr.total_tdr_count)
                                OVER (
                                  PARTITION BY kmean_training_data_hr.sql_interval)
                              =
                              0 )
                            THEN 0 ::
                       Numeric(18, 0)
                       ELSE ( Sum(kmean_training_data_hr.timeout_tdr_count)
                                OVER (
                                  PARTITION BY kmean_training_data_hr.sql_interval)
                              /
                              Sum(
                                     kmean_training_data_hr.total_tdr_count)
                                       OVER (
                                         PARTITION BY
                                       kmean_training_data_hr.sql_interval) )
                     END
                            AS timeout_rate_all
              FROM   ml.kmean_training_data_hr) a
      WHERE  ( ( a.traffic_percentage > 0.001 )
    
  • lijianglijiang Registered User

    select export_objects('','ML.kmean_training_data_hr');
    Returns:
    export_objects|

    CREATE TABLE ML.kmean_training_data_hr
    (
    sql_interval timestamp NOT NULL,
    my_carrier_no int NOT NULL,
    message_type varchar(20) NOT NULL,
    my_node_address varchar(32) NOT NULL,
    partner_carrier_no int NOT NULL,
    failure_tdr_count int NOT NULL,
    timeout_tdr_count int NOT NULL,
    failure_imsi_count int NOT NULL,
    timeout_imsi_count int NOT NULL,
    success_imsi_count int NOT NULL,
    total_tdr_count int NOT NULL,
    total_msg_avg int NOT NULL,
    total_msg_stddev numeric(16,6) NOT NULL,
    success_msg_avg int NOT NULL,
    success_msg_stddev numeric(16,6) NOT NULL,
    failure_msg_avg int NOT NULL,
    failure_msg_stddev numeric(16,6) NOT NULL,
    timeout_msg_avg int NOT NULL,
    timeout_msg_stddev numeric(16,6) NOT NULL,
    total_roamer_avg int NOT NULL,
    total_roamer_stddev numeric(16,6) NOT NULL,
    success_roamer_avg int NOT NULL,
    success_roamer_stddev numeric(16,6) NOT NULL,
    failure_roamer_avg int NOT NULL,
    failure_roamer_stddev numeric(16,6) NOT NULL,
    timeout_roamer_avg int NOT NULL,
    timeout_roamer_stddev numeric(16,6) NOT NULL
    )
    PARTITION BY ((kmean_training_data_hr.sql_interval)::date);

    CREATE PROJECTION ML.kmean_training_data_hr_super
    (
    sql_interval ENCODING RLE,
    my_carrier_no ENCODING RLE,
    message_type ENCODING RLE,
    my_node_address,
    partner_carrier_no,
    failure_tdr_count,
    timeout_tdr_count,
    failure_imsi_count,
    timeout_imsi_count,
    success_imsi_count,
    total_tdr_count,
    total_msg_avg,
    total_msg_stddev,
    success_msg_avg,
    success_msg_stddev,
    failure_msg_avg,
    failure_msg_stddev,
    timeout_msg_avg,
    timeout_msg_stddev,
    total_roamer_avg,
    total_roamer_stddev,
    success_roamer_avg,
    success_roamer_stddev,
    failure_roamer_avg,
    failure_roamer_stddev,
    timeout_roamer_avg,
    timeout_roamer_stddev
    )
    AS
    SELECT kmean_training_data_hr.sql_interval,
    kmean_training_data_hr.my_carrier_no,
    kmean_training_data_hr.message_type,
    kmean_training_data_hr.my_node_address,
    kmean_training_data_hr.partner_carrier_no,
    kmean_training_data_hr.failure_tdr_count,
    kmean_training_data_hr.timeout_tdr_count,
    kmean_training_data_hr.failure_imsi_count,
    kmean_training_data_hr.timeout_imsi_count,
    kmean_training_data_hr.success_imsi_count,
    kmean_training_data_hr.total_tdr_count,
    kmean_training_data_hr.total_msg_avg,
    kmean_training_data_hr.total_msg_stddev,
    kmean_training_data_hr.success_msg_avg,
    kmean_training_data_hr.success_msg_stddev,
    kmean_training_data_hr.failure_msg_avg,
    kmean_training_data_hr.failure_msg_stddev,
    kmean_training_data_hr.timeout_msg_avg,
    kmean_training_data_hr.timeout_msg_stddev,
    kmean_training_data_hr.total_roamer_avg,
    kmean_training_data_hr.total_roamer_stddev,
    kmean_training_data_hr.success_roamer_avg,
    kmean_training_data_hr.success_roamer_stddev,
    kmean_training_data_hr.failure_roamer_avg,
    kmean_training_data_hr.failure_roamer_stddev,
    kmean_training_data_hr.timeout_roamer_avg,
    kmean_training_data_hr.timeout_roamer_stddev
    FROM ML.kmean_training_data_hr
    ORDER BY kmean_training_data_hr.my_carrier_no,
    kmean_training_data_hr.message_type,
    kmean_training_data_hr.sql_interval,
    kmean_training_data_hr.my_node_address,
    kmean_training_data_hr.partner_carrier_no
    SEGMENTED BY hash(kmean_training_data_hr.sql_interval, kmean_training_data_hr.total_msg_avg) ALL NODES KSAFE 1;

    SELECT MARK_DESIGN_KSAFE(1);
    |

    Thank you for your help.

  • lijianglijiang Registered User

    @Jim_Knicely

    FYI, the training data view, timeout_anomaly_detect_view, returns 402 rows. Some of the values of hist_comparator and peer_comparator are very small. For example, 0E-54. Could this be the problem?

    I'll replace the small values by zero and give it a try.

  • lijianglijiang Registered User

    No luck. Still get the same error. Training data are attached.

  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert

    Can you try creating the model specifying the schema name ML for the objects?

    select kmeans('ML.timeout_kmeans_model_1', 'ML.timeout_anomaly_detect_view', '*', 4 
    using parameters max_iterations=20, 
    output_view='ML.myKmeansTimeoutView1',
    exclude_columns = 'sql_interval, my_node_address, partner_carrier_no', 
    key_columns= 'sql_interval, my_node_address, partner_carrier_no');
    
    select * from ML.myKmeansTimeoutView1;
    
  • lijianglijiang Registered User

    run the following and got the same error.
    select kmeans('ML.timeout_kmeans_model_3', 'ML.timeout_anomaly_detect_view', '*', 4
    using parameters max_iterations=20,
    output_view='ML.myKmeansTimeoutView4',
    exclude_columns = 'sql_interval, my_node_address, partner_carrier_no',
    key_columns= 'sql_interval, my_node_address, partner_carrier_no');

    select * from ML.myKmeansTimeoutView4;

  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert

    Ok, thanks! Let me look into this a little deeper for you.

  • lijianglijiang Registered User

    Much appreciated.

  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert

    @lijiang - I believe that you are hitting a bug in 8.1.1 where column aliases were not being passed to UDXs (i.e. apply_kmeans). This is fixed in Vertica 9.0.1.

    If you can't upgrade, see if this works:

    dbadmin=> SELECT set_vertica_options('OPT', 'DISABLE_FLATTEN_SUBQUERY');
                             set_vertica_options
    ----------------------------------------------------------------------
    
    Opt Vertica Options
    --------------------
    DISABLE_FLATTEN_SUBQUERY
    
    
    (1 row)
    
    dbadmin=> SELECT * FROM ML.myKmeansTimeoutView1;
            sql_interval        | my_node_address | partner_carrier_no | cluster_id
    ----------------------------+-----------------+--------------------+------------
     2018-02-07 11:39:58.281504 | T               |                  1 |          0
     2018-02-07 11:40:02.270718 | T               |                  1 |          0
     2018-02-08 11:40:51.948821 | T               |                  2 |          0
     2018-02-08 11:40:54.343808 | T               |                  2 |          0
     2018-02-10 11:41:07.615797 | T               |                  2 |          0
     2018-02-10 11:41:08.799745 | T               |                  2 |          0
    (6 rows)
    
    dbadmin=> SELECT clr_vertica_options('OPT', 'DISABLE_FLATTEN_SUBQUERY');
     clr_vertica_options
    ---------------------
    
    
    (1 row)
    
  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert
    edited February 8

    @lijiang - Another work around might be to create a top level view that does not have column aliases for calculated fields...

    Example:

    dbadmin=> CREATE OR REPLACE VIEW ML.timeout_anomaly_detect_view_top AS SELECT * FROM ML.timeout_anomaly_detect_view;
    CREATE VIEW
    
    dbadmin=> SELECT sql_interval, my_node_address, partner_carrier_no,
    dbadmin->        public.apply_kmeans(hist_comparator, peer_comparator USING PARAMETERS model_name='ML.timeout_kmeans_model_1'::varchar(386)) AS cluster_id
    dbadmin->   FROM ML.timeout_anomaly_detect_view_top;
            sql_interval        | my_node_address | partner_carrier_no | cluster_id
    ----------------------------+-----------------+--------------------+------------
     2018-02-07 11:39:58.281504 | T               |                  1 |          0
     2018-02-07 11:40:02.270718 | T               |                  1 |          0
     2018-02-08 11:40:51.948821 | T               |                  2 |          0
     2018-02-08 11:40:54.343808 | T               |                  2 |          0
     2018-02-10 11:41:07.615797 | T               |                  2 |          0
     2018-02-10 11:41:08.799745 | T               |                  2 |          0
    (6 rows)
    

    If the view does not help, creating a table will work for sure...

    dbadmin=> CREATE TABLE ML.timeout_anomaly_detect_view_table AS SELECT * FROM ML.timeout_anomaly_detect_view;
    CREATE TABLE
    
    dbadmin=> SELECT sql_interval, my_node_address, partner_carrier_no,
    dbadmin->        public.apply_kmeans(hist_comparator, peer_comparator USING PARAMETERS model_name='ML.timeout_kmeans_model_1'::varchar(386)) AS cluster_id
    dbadmin->   FROM ML.timeout_anomaly_detect_view_table;
            sql_interval        | my_node_address | partner_carrier_no | cluster_id
    ----------------------------+-----------------+--------------------+------------
     2018-02-07 11:39:58.281504 | T               |                  1 |          0
     2018-02-07 11:40:02.270718 | T               |                  1 |          0
     2018-02-08 11:40:51.948821 | T               |                  2 |          0
     2018-02-08 11:40:54.343808 | T               |                  2 |          0
     2018-02-10 11:41:07.615797 | T               |                  2 |          0
     2018-02-10 11:41:08.799745 | T               |                  2 |          0
    (6 rows)
    

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file