XGBoost - features_importance() and NULL handling not implemented?

chinga · July 2022

Hello - I have created and fit an XGBoost model in Vertica 12 CE. I can call methods like .score and .roc_curve, but .features_importance returns an error:

FunctionError: Method 'features_importance' for 'XGBoostClassifier' doesn't exist.

According to the documentation (https://www.vertica.com/python/documentation_last/learn/XGBoostClassifier/index.php) the method should exist, and it does work with the Random Forest Classifier.

I believe that XGBoost (the algorithm) handles missing values without needing to impute, but Vertica's implementation appears to reject rows where a predictor (X) contains NULL. Is that correct?

Thanks.

SruthiA · July 2022

@chinga: As of today, features_importance works only with RandomForest, since the Vertica server supports it only for random forest. In VerticaPy, there is a feature request open to support it with XGBoost as well:

https://github.com/vertica/VerticaPy/issues

https://www.vertica.com/docs/11.1.x/HTML/Content/Authoring/SQLReferenceManual/Functions/MachineLearning/RF_PREDICTOR_IMPORTANCE.htm

XGBoost will generally handle missing values without the need to impute. Could you please share an example where model creation with NULL values works and prediction fails?

chinga · July 2022

from verticapy.datasets import load_titanic
vdf = load_titanic()

from verticapy.learn.ensemble import XGBoostClassifier
model = XGBoostClassifier(name = 'xgb_titanic',
                  max_ntree = 10,
                  max_depth = 5,
                  nbins = 32,
                  split_proposal_method = "global",
                  tol = 0.001,
                  learning_rate = 0.1,
                  min_split_loss = 0.0,
                  weight_reg = 0.0,
                  sample = 1.0,
                  col_sample_by_tree =  1.0,
                  col_sample_by_node = 1.0,)

The age column contains NULLs. Note the number of rejected rows when fitting the classifier:

model.fit(vdf,
         X = ["age"],
         y = "survived")

===========
call_string
===========
xgb_classifier('public.xgb_titanic', '"public"."_verticapy_tmp_view_dbadmin_40888_925548101_"', '"survived"', '"age"' USING PARAMETERS exclude_columns='', max_ntree=10, max_depth=5, learning_rate=0.1, min_split_loss=0, weight_reg=0, nbins=32, objective=crossentropy, sampling_size=1, col_sample_by_tree=1, col_sample_by_node=1)

=======
details
=======
predictor|      type      
---------+----------------
   age   |float or numeric


==================
initial_prediction
==================
response_label| value  
--------------+--------
      0       | 0.00000
      1       | 0.00000


===============
Additional Info
===============
       Name       |Value
------------------+-----
    tree_count    | 10  
rejected_row_count| 237 
accepted_row_count| 997

Count the non-NULL values in the age column, fill the NULLs, then count again.

vdf["age"].count()

997.0

vdf.fillna()["age"].count()

1234.0

No rows are rejected now when fitting the classifier:

model.fit(vdf,
         X = ["age"],
         y = "survived")

===========
call_string
===========
xgb_classifier('public.xgb_titanic', '"public"."_verticapy_tmp_view_dbadmin_40888_3753016991_"', '"survived"', '"age"' USING PARAMETERS exclude_columns='', max_ntree=10, max_depth=5, learning_rate=0.1, min_split_loss=0, weight_reg=0, nbins=32, objective=crossentropy, sampling_size=1, col_sample_by_tree=1, col_sample_by_node=1)

=======
details
=======
predictor|      type      
---------+----------------
   age   |float or numeric


==================
initial_prediction
==================
response_label| value  
--------------+--------
      0       | 0.00000
      1       | 0.00000


===============
Additional Info
===============
       Name       |Value
------------------+-----
    tree_count    | 10  
rejected_row_count|  0  
accepted_row_count|1234

chinga · July 2022

Note that I am not predicting at this stage - just fitting the model where the predictor can have NULLs.

In the example posted, the number of rejected rows goes from 237 to 0 when the predictor (age) is filled. I was expecting XGBoost to handle NULLs in the predictors rather than reject the entire row.

Thanks for clarifying that features_importance() is not available for XGBoost yet.

SruthiA · July 2022

@chinga: We don't support handling NULL values for XGBoost because there is no specific order in the data, so we reject the entire row prior to training. For our time series algorithms, we allow different ways of filling in the missing data because the data is ordered.

chinga · July 2022

Hi there – I didn’t understand the answer regarding NULL values and XGBoost. To clarify, XGBoost supports missing data as detailed in section 3.4 of this paper:

https://arxiv.org/pdf/1603.02754.pdf

The section is called “Sparsity-Aware Split Finding”. If I understand the feature correctly, I shouldn’t need to fill in the NULLs if NULLs are treated as “missing”. I hope this clarifies the question. More details about the feature I am talking about can be found here:

Frequently Asked Questions — xgboost 1.6.1 documentation

https://xgboost.readthedocs.io/en/latest/faq.html#how-to-deal-with-missing-values
"XGBoost supports missing values by default. In tree algorithms, branch directions for missing values are learned during training."
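As a rough illustration of what "learned branch directions" means, here is a minimal sketch in plain Python. It is not XGBoost's or Vertica's actual code: the function names (split_score, best_split_with_default), the regularization value lam, and the toy data are assumptions made up for this example. For each candidate threshold it tries sending all missing values left and then right, and keeps whichever direction gives the larger gain.

```python
from typing import List, Optional, Tuple


def split_score(g_sum: float, h_sum: float, lam: float) -> float:
    """Structure score of one leaf from its summed gradients and hessians."""
    return g_sum * g_sum / (h_sum + lam)


def best_split_with_default(
    x: List[Optional[float]],
    g: List[float],
    h: List[float],
    lam: float = 1.0,
) -> Tuple[float, Optional[float], Optional[str]]:
    """Return (gain, threshold, default_direction) for a single feature.

    Missing values (None) are never dropped; their gradient statistics are
    added to the left or the right child, whichever yields the higher gain.
    """
    present = sorted((xi, gi, hi) for xi, gi, hi in zip(x, g, h) if xi is not None)
    g_miss = sum(gi for xi, gi in zip(x, g) if xi is None)
    h_miss = sum(hi for xi, hi in zip(x, h) if xi is None)
    g_tot, h_tot = sum(g), sum(h)
    parent = split_score(g_tot, h_tot, lam)

    best: Tuple[float, Optional[float], Optional[str]] = (0.0, None, None)
    g_left = h_left = 0.0
    for i in range(len(present) - 1):
        xi, gi, hi = present[i]
        g_left += gi
        h_left += hi
        if present[i + 1][0] == xi:
            continue  # no threshold fits between equal values
        threshold = (xi + present[i + 1][0]) / 2
        # Try both default directions for the missing rows.
        for direction, gm, hm in (("left", g_miss, h_miss), ("right", 0.0, 0.0)):
            gl, hl = g_left + gm, h_left + hm
            gr, hr = g_tot - gl, h_tot - hl
            gain = 0.5 * (split_score(gl, hl, lam) + split_score(gr, hr, lam) - parent)
            if gain > best[0]:
                best = (gain, threshold, direction)
    return best


# Toy data: the two rows with a missing x share label 0 with the low-x rows.
x = [1.0, 2.0, None, 8.0, 9.0, None]
y = [0, 0, 0, 1, 1, 0]
# Logistic-loss gradients and hessians at an initial probability of 0.5.
g = [0.5 - yi for yi in y]
h = [0.25 for _ in y]

gain, threshold, direction = best_split_with_default(x, g, h)
print(threshold, direction)  # 5.0 left
```

With this approach no row is rejected: the rows where x is missing still contribute their gradients to the chosen child.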

I’d be grateful if you could advise if “Sparsity-Aware Split Finding” is available in Vertica’s implementation.
Thanks,

SruthiA · August 2022

@chinga: XGBoost supports missing data, but we reject rows with NULL values prior to training. If you want them to be considered, please open a support case describing your use case.
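In the meantime, one generic way to keep those rows is to encode missingness explicitly before fitting, rather than relying on a plain fill. The sketch below is plain Python, not VerticaPy and not an official Vertica recommendation; the helper name add_missing_indicator, the "_is_null" column suffix, and the sentinel value -1.0 are assumptions chosen for illustration.

```python
from typing import Any, Dict, List


def add_missing_indicator(
    rows: List[Dict[str, Any]],
    column: str,
    fill_value: float = -1.0,
) -> List[Dict[str, Any]]:
    """Return copies of rows with a 0/1 flag column and NULLs replaced.

    The flag lets a tree model split on "was this value missing?", while
    the sentinel keeps the row from being rejected for containing NULL.
    """
    out = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        is_missing = row.get(column) is None
        new_row[column + "_is_null"] = 1 if is_missing else 0
        if is_missing:
            new_row[column] = fill_value
        out.append(new_row)
    return out


rows = [
    {"age": 29.0, "survived": 1},
    {"age": None, "survived": 0},
]
prepared = add_missing_indicator(rows, "age")
print(prepared[1])  # {'age': -1.0, 'survived': 0, 'age_is_null': 1}
```

Both columns (age and age_is_null) would then be passed as predictors. The sentinel should lie outside the real range of the column so that it cannot be confused with an observed value.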
