Error when review kmeans result
select kmeans('timeout_kmeans_model_1', 'timeout_anomaly_detect_view', '*', 4
using parameters max_iterations=20,
output_view='myKmeansTimeoutView1',
exclude_columns = 'sql_interval, my_node_address, partner_carrier_no',
key_columns= 'sql_interval, my_node_address, partner_carrier_no');
Finished in 6 iterations
select * from myKmeansTimeoutView1;
Got the below error:
ERROR 5861: Error calling setup() in User Function apply_kmeans at [src/Common/GPredict.cpp:42], error code: 0, message: Error in setup: [Error in setup: Invalid column '']
0
Comments
Which version of Vertica are you using?
Vertica Analytic Database v8.1.1-10
The view, myKmeansTimeoutView1, is created. I can see it from the catalog.
Hi,
Can you run the following and post the results?
select export_objects('','myKmeansTimeoutView1');
Yes.
Here is the result.
export_objects|
CREATE VIEW ML.myKmeansTimeoutView1 AS
SELECT timeout_anomaly_detect_view.sql_interval,
timeout_anomaly_detect_view.my_node_address,
timeout_anomaly_detect_view.partner_carrier_no,
public.apply_kmeans(timeout_anomaly_detect_view.hist_comparator, timeout_anomaly_detect_view.peer_comparator USING PARAMETERS model_name='ML.timeout_kmeans_model_1'::varchar(386)) AS cluster_id
FROM ML.timeout_anomaly_detect_view;
SELECT MARK_DESIGN_KSAFE(1);
|
Thank you for your help.
@lijiang
Your problem in
USING PARAMETERSdefinition:exclude_columns = 'sql_interval, my_node_address, partner_carrier_no'key_columns= 'sql_interval, my_node_address, partner_carrier_no'You excluded all columns, thats why you got error:
Invalid column ''. Do you understand what dokmeansand how to use it?@Jim_Knicely
Will Vertica open API for "how to create db objects like view with UDF"?
kmeansis an UDF Transform function and it createsVIEW, but how(its not documented)?(Answer: No, its for internal usage - also accepted).
Hi,
I'm using Vertica 9. It works okay:
dbadmin=> select kmeans('timeout_kmeans_model_1', 'timeout_anomaly_detect_view', '*', 4 dbadmin(> using parameters max_iterations=20, dbadmin(> output_view='myKmeansTimeoutView1', dbadmin(> exclude_columns = 'sql_interval, my_node_address, partner_carrier_no', dbadmin(> key_columns= 'sql_interval, my_node_address, partner_carrier_no'); WARNING 7858: Only found 2 non-empty clusters. You may want to try again or use a better set of initial centers; or there may be fewer than k distinct datapoints in the table kmeans --------------------------- Finished in 1 iterations (1 row) dbadmin=> select * from myKmeansTimeoutView1; sql_interval | my_node_address | partner_carrier_no | cluster_id --------------+-----------------+--------------------+------------ TEST | TEST | 1 | 1 TEST2 | TEST2 | 2 | 0 TEST2 | TEST2 | 2 | 0 TEST2 | TEST2 | 2 | 0 | | | TEST2 | TEST2 | 2 | 0 (6 rows)I will check if there is a bug in 8.1.1.
@lijiang Can you run the following and post the results?
select export_objects('','timeout_anomaly_detect_view');@sKwa - Will Vertica open API for "how to create db objects like view with UDF"?
No, it's for internal usage
I did not exclude all columns from the training data.
There are 5 columns, sql_interval, my_node_address, partner_carrier_no, hist_comparator, and peer_comparator, in the data view. The three columns specified are the key columns. I'm asking KMEAN to label the data in 4 clusters based on hist_comparator and peer_comparator for each key value, which is the combination of the three columns specified in the select statement.
select export_objects('','timeout_anomaly_detect_view');
returns the results below:
export_objects|
CREATE VIEW ML.timeout_anomaly_detect_view AS
SELECT a.sql_interval,
a.my_node_address,
a.partner_carrier_no,
round(CASE WHEN (a.timeout_msg_stddev = 0::numeric(18,0)) THEN 0::numeric(18,0) ELSE (abs((a.timeout_tdr_count - a.timeout_msg_avg)) / a.timeout_msg_stddev) END, 0) AS hist_comparator,
(a.timeout_rate / a.timeout_rate_all) AS peer_comparator
FROM ( SELECT kmean_training_data_hr.sql_interval,
kmean_training_data_hr.my_node_address,
kmean_training_data_hr.partner_carrier_no,
kmean_training_data_hr.timeout_tdr_count,
CASE WHEN (kmean_training_data_hr.total_tdr_count = 0) THEN 0::numeric(18,0) ELSE (kmean_training_data_hr.timeout_tdr_count / kmean_training_data_hr.total_tdr_count) END AS timeout_rate,
kmean_training_data_hr.timeout_msg_avg,
kmean_training_data_hr.timeout_msg_stddev,
round((kmean_training_data_hr.total_msg_avg / sum(kmean_training_data_hr.total_msg_avg) OVER (PARTITION BY kmean_training_data_hr.sql_interval)), 2) AS traffic_percentage,
CASE WHEN (sum(kmean_training_data_hr.total_tdr_count) OVER (PARTITION BY kmean_training_data_hr.sql_interval) = 0) THEN 0::numeric(18,0) ELSE (sum(kmean_training_data_hr.timeout_tdr_count) OVER (PARTITION BY kmean_training_data_hr.sql_interval) / sum(kmean_training_data_hr.total_tdr_count) OVER (PARTITION BY kmean_training_data_hr.sql_interval)) END AS timeout_rate_all
FROM ML.kmean_training_data_hr) a
WHERE ((a.traffic_percentage > 0.001) AND (a.timeout_rate_all > 0::numeric(18,0)));
SELECT MARK_DESIGN_KSAFE(1);
|
@Jim_Knicely
So can you give a link to a source code of
kmeans?@lijiang - Sorry, one more:
select export_objects('','ML.kmean_training_data_hr');I want to see the data types.
@lijiang
Don't get me wrong, I just asked do you understand what you do. Question do not includes all info(probably @Jim_Knicely has access to more info).
here is formatted output:
CREATE VIEW ml.timeout_anomaly_detect_view AS SELECT a.sql_interval, a.my_node_address, a.partner_carrier_no, Round(CASE WHEN ( a.timeout_msg_stddev = 0 :: Numeric(18, 0) ) THEN 0 :: Numeric(18, 0) ELSE ( Abs(( a.timeout_tdr_count - a.timeout_msg_avg )) / a.timeout_msg_stddev ) END, 0) AS hist_comparator, ( a.timeout_rate / a.timeout_rate_all ) AS peer_comparator FROM (SELECT kmean_training_data_hr.sql_interval, kmean_training_data_hr.my_node_address, kmean_training_data_hr.partner_carrier_no, kmean_training_data_hr.timeout_tdr_count, CASE WHEN ( kmean_training_data_hr.total_tdr_count = 0 ) THEN 0 :: Numeric(18, 0) ELSE ( kmean_training_data_hr.timeout_tdr_count / kmean_training_data_hr.total_tdr_count ) END AS timeout_rate, kmean_training_data_hr.timeout_msg_avg, kmean_training_data_hr.timeout_msg_stddev, Round(( kmean_training_data_hr.total_msg_avg / Sum( kmean_training_data_hr.total_msg_avg) OVER ( PARTITION BY kmean_training_data_hr.sql_interval) ), 2 ) AS traffic_percentage, CASE WHEN ( Sum(kmean_training_data_hr.total_tdr_count) OVER ( PARTITION BY kmean_training_data_hr.sql_interval) = 0 ) THEN 0 :: Numeric(18, 0) ELSE ( Sum(kmean_training_data_hr.timeout_tdr_count) OVER ( PARTITION BY kmean_training_data_hr.sql_interval) / Sum( kmean_training_data_hr.total_tdr_count) OVER ( PARTITION BY kmean_training_data_hr.sql_interval) ) END AS timeout_rate_all FROM ml.kmean_training_data_hr) a WHERE ( ( a.traffic_percentage > 0.001 )select export_objects('','ML.kmean_training_data_hr');
Returns:
export_objects|
CREATE TABLE ML.kmean_training_data_hr
(
sql_interval timestamp NOT NULL,
my_carrier_no int NOT NULL,
message_type varchar(20) NOT NULL,
my_node_address varchar(32) NOT NULL,
partner_carrier_no int NOT NULL,
failure_tdr_count int NOT NULL,
timeout_tdr_count int NOT NULL,
failure_imsi_count int NOT NULL,
timeout_imsi_count int NOT NULL,
success_imsi_count int NOT NULL,
total_tdr_count int NOT NULL,
total_msg_avg int NOT NULL,
total_msg_stddev numeric(16,6) NOT NULL,
success_msg_avg int NOT NULL,
success_msg_stddev numeric(16,6) NOT NULL,
failure_msg_avg int NOT NULL,
failure_msg_stddev numeric(16,6) NOT NULL,
timeout_msg_avg int NOT NULL,
timeout_msg_stddev numeric(16,6) NOT NULL,
total_roamer_avg int NOT NULL,
total_roamer_stddev numeric(16,6) NOT NULL,
success_roamer_avg int NOT NULL,
success_roamer_stddev numeric(16,6) NOT NULL,
failure_roamer_avg int NOT NULL,
failure_roamer_stddev numeric(16,6) NOT NULL,
timeout_roamer_avg int NOT NULL,
timeout_roamer_stddev numeric(16,6) NOT NULL
)
PARTITION BY ((kmean_training_data_hr.sql_interval)::date);
CREATE PROJECTION ML.kmean_training_data_hr_super
(
sql_interval ENCODING RLE,
my_carrier_no ENCODING RLE,
message_type ENCODING RLE,
my_node_address,
partner_carrier_no,
failure_tdr_count,
timeout_tdr_count,
failure_imsi_count,
timeout_imsi_count,
success_imsi_count,
total_tdr_count,
total_msg_avg,
total_msg_stddev,
success_msg_avg,
success_msg_stddev,
failure_msg_avg,
failure_msg_stddev,
timeout_msg_avg,
timeout_msg_stddev,
total_roamer_avg,
total_roamer_stddev,
success_roamer_avg,
success_roamer_stddev,
failure_roamer_avg,
failure_roamer_stddev,
timeout_roamer_avg,
timeout_roamer_stddev
)
AS
SELECT kmean_training_data_hr.sql_interval,
kmean_training_data_hr.my_carrier_no,
kmean_training_data_hr.message_type,
kmean_training_data_hr.my_node_address,
kmean_training_data_hr.partner_carrier_no,
kmean_training_data_hr.failure_tdr_count,
kmean_training_data_hr.timeout_tdr_count,
kmean_training_data_hr.failure_imsi_count,
kmean_training_data_hr.timeout_imsi_count,
kmean_training_data_hr.success_imsi_count,
kmean_training_data_hr.total_tdr_count,
kmean_training_data_hr.total_msg_avg,
kmean_training_data_hr.total_msg_stddev,
kmean_training_data_hr.success_msg_avg,
kmean_training_data_hr.success_msg_stddev,
kmean_training_data_hr.failure_msg_avg,
kmean_training_data_hr.failure_msg_stddev,
kmean_training_data_hr.timeout_msg_avg,
kmean_training_data_hr.timeout_msg_stddev,
kmean_training_data_hr.total_roamer_avg,
kmean_training_data_hr.total_roamer_stddev,
kmean_training_data_hr.success_roamer_avg,
kmean_training_data_hr.success_roamer_stddev,
kmean_training_data_hr.failure_roamer_avg,
kmean_training_data_hr.failure_roamer_stddev,
kmean_training_data_hr.timeout_roamer_avg,
kmean_training_data_hr.timeout_roamer_stddev
FROM ML.kmean_training_data_hr
ORDER BY kmean_training_data_hr.my_carrier_no,
kmean_training_data_hr.message_type,
kmean_training_data_hr.sql_interval,
kmean_training_data_hr.my_node_address,
kmean_training_data_hr.partner_carrier_no
SEGMENTED BY hash(kmean_training_data_hr.sql_interval, kmean_training_data_hr.total_msg_avg) ALL NODES KSAFE 1;
SELECT MARK_DESIGN_KSAFE(1);
|
Thank you for your help.
@Jim_Knicely
FYI, the training data view, timeout_anomaly_detect_view, returns 402 rows. Some of the values of hist_comparator and peer_comparator are very small. For example, 0E-54. Could this be the problem?
I'll replace the small values by zero and give it a try.
No luck. Still get the same error. Training data are attached.
Can you try creating the model specifying the schema name ML for the objects?
select kmeans('ML.timeout_kmeans_model_1', 'ML.timeout_anomaly_detect_view', '*', 4 using parameters max_iterations=20, output_view='ML.myKmeansTimeoutView1', exclude_columns = 'sql_interval, my_node_address, partner_carrier_no', key_columns= 'sql_interval, my_node_address, partner_carrier_no'); select * from ML.myKmeansTimeoutView1;run the following and got the same error.
select kmeans('ML.timeout_kmeans_model_3', 'ML.timeout_anomaly_detect_view', '*', 4
using parameters max_iterations=20,
output_view='ML.myKmeansTimeoutView4',
exclude_columns = 'sql_interval, my_node_address, partner_carrier_no',
key_columns= 'sql_interval, my_node_address, partner_carrier_no');
select * from ML.myKmeansTimeoutView4;
Ok, thanks! Let me look into this a little deeper for you.
Much appreciated.
@lijiang - I believe that you are hitting a bug in 8.1.1 where column aliases were not being passed to UDXs (i.e. apply_kmeans). This is fixed in Vertica 9.0.1.
If you can't upgrade, see if this works:
dbadmin=> SELECT set_vertica_options('OPT', 'DISABLE_FLATTEN_SUBQUERY'); set_vertica_options ---------------------------------------------------------------------- Opt Vertica Options -------------------- DISABLE_FLATTEN_SUBQUERY (1 row) dbadmin=> SELECT * FROM ML.myKmeansTimeoutView1; sql_interval | my_node_address | partner_carrier_no | cluster_id ----------------------------+-----------------+--------------------+------------ 2018-02-07 11:39:58.281504 | T | 1 | 0 2018-02-07 11:40:02.270718 | T | 1 | 0 2018-02-08 11:40:51.948821 | T | 2 | 0 2018-02-08 11:40:54.343808 | T | 2 | 0 2018-02-10 11:41:07.615797 | T | 2 | 0 2018-02-10 11:41:08.799745 | T | 2 | 0 (6 rows) dbadmin=> SELECT clr_vertica_options('OPT', 'DISABLE_FLATTEN_SUBQUERY'); clr_vertica_options --------------------- (1 row)@lijiang - Another work around might be to create a top level view that does not have column aliases for calculated fields...
Example:
dbadmin=> CREATE OR REPLACE VIEW ML.timeout_anomaly_detect_view_top AS SELECT * FROM ML.timeout_anomaly_detect_view; CREATE VIEW dbadmin=> SELECT sql_interval, my_node_address, partner_carrier_no, dbadmin-> public.apply_kmeans(hist_comparator, peer_comparator USING PARAMETERS model_name='ML.timeout_kmeans_model_1'::varchar(386)) AS cluster_id dbadmin-> FROM ML.timeout_anomaly_detect_view_top; sql_interval | my_node_address | partner_carrier_no | cluster_id ----------------------------+-----------------+--------------------+------------ 2018-02-07 11:39:58.281504 | T | 1 | 0 2018-02-07 11:40:02.270718 | T | 1 | 0 2018-02-08 11:40:51.948821 | T | 2 | 0 2018-02-08 11:40:54.343808 | T | 2 | 0 2018-02-10 11:41:07.615797 | T | 2 | 0 2018-02-10 11:41:08.799745 | T | 2 | 0 (6 rows)If the view does not help, creating a table will work for sure...
dbadmin=> CREATE TABLE ML.timeout_anomaly_detect_view_table AS SELECT * FROM ML.timeout_anomaly_detect_view; CREATE TABLE dbadmin=> SELECT sql_interval, my_node_address, partner_carrier_no, dbadmin-> public.apply_kmeans(hist_comparator, peer_comparator USING PARAMETERS model_name='ML.timeout_kmeans_model_1'::varchar(386)) AS cluster_id dbadmin-> FROM ML.timeout_anomaly_detect_view_table; sql_interval | my_node_address | partner_carrier_no | cluster_id ----------------------------+-----------------+--------------------+------------ 2018-02-07 11:39:58.281504 | T | 1 | 0 2018-02-07 11:40:02.270718 | T | 1 | 0 2018-02-08 11:40:51.948821 | T | 2 | 0 2018-02-08 11:40:54.343808 | T | 2 | 0 2018-02-10 11:41:07.615797 | T | 2 | 0 2018-02-10 11:41:08.799745 | T | 2 | 0 (6 rows)