XGBoost - features_importance() and NULL handling not implemented?

chinga Community Edition User

Hello - I have created and fit an XGBoost model on Vertica 12 CE. I can call methods like .score() and .roc_curve(); however, .features_importance() returns an error:

FunctionError: Method 'features_importance' for 'XGBoostClassifier' doesn't exist.

The documentation suggests it should exist (https://www.vertica.com/python/documentation_last/learn/XGBoostClassifier/index.php), and it does work with the RandomForestClassifier.

I believe that XGBoost (the algorithm) handles missing values without needing imputation; however, Vertica's implementation appears to reject rows where a predictor (X) contains a NULL. Is that correct?

Thanks.

Answers

  • SruthiA Administrator
    edited July 2022

    @chinga: as of today, features_importance() only works with RandomForest, because the Vertica server supports predictor importance only for random forests. There is a feature request open to support it for XGBoost in VerticaPy as well:

    https://github.com/vertica/VerticaPy/issues

    https://www.vertica.com/docs/11.1.x/HTML/Content/Authoring/SQLReferenceManual/Functions/MachineLearning/RF_PREDICTOR_IMPORTANCE.htm

    XGBoost will generally handle missing values without needing imputation. Could you please share an example where model creation with NULL values works and prediction fails?

  • chinga Community Edition User
    from verticapy.datasets import load_titanic
    vdf = load_titanic()

    from verticapy.learn.ensemble import XGBoostClassifier
    model = XGBoostClassifier(name="xgb_titanic",
                              max_ntree=10,
                              max_depth=5,
                              nbins=32,
                              split_proposal_method="global",
                              tol=0.001,
                              learning_rate=0.1,
                              min_split_loss=0.0,
                              weight_reg=0.0,
                              sample=1.0,
                              col_sample_by_tree=1.0,
                              col_sample_by_node=1.0)
    

    The age column contains NULLs; note the number of rejected rows when fitting the classifier:

    model.fit(vdf,
              X=["age"],
              y="survived")
    
    ===========
    call_string
    ===========
    xgb_classifier('public.xgb_titanic', '"public"."_verticapy_tmp_view_dbadmin_40888_925548101_"', '"survived"', '"age"' USING PARAMETERS exclude_columns='', max_ntree=10, max_depth=5, learning_rate=0.1, min_split_loss=0, weight_reg=0, nbins=32, objective=crossentropy, sampling_size=1, col_sample_by_tree=1, col_sample_by_node=1)
    
    =======
    details
    =======
    predictor|      type      
    ---------+----------------
       age   |float or numeric
    
    
    ==================
    initial_prediction
    ==================
    response_label| value  
    --------------+--------
          0       | 0.00000
          1       | 0.00000
    
    
    ===============
    Additional Info
    ===============
           Name       |Value
    ------------------+-----
        tree_count    | 10  
    rejected_row_count| 237 
    accepted_row_count| 997 
    

    Check the count of NULLs in the age column. Fill them, then count again.

    vdf["age"].count()
    
    997.0
    
    vdf.fillna()["age"].count()
    
    1234.0
    

    No rows are now rejected when fitting the classifier

    model.fit(vdf,
              X=["age"],
              y="survived")
    
    ===========
    call_string
    ===========
    xgb_classifier('public.xgb_titanic', '"public"."_verticapy_tmp_view_dbadmin_40888_3753016991_"', '"survived"', '"age"' USING PARAMETERS exclude_columns='', max_ntree=10, max_depth=5, learning_rate=0.1, min_split_loss=0, weight_reg=0, nbins=32, objective=crossentropy, sampling_size=1, col_sample_by_tree=1, col_sample_by_node=1)
    
    =======
    details
    =======
    predictor|      type      
    ---------+----------------
       age   |float or numeric
    
    
    ==================
    initial_prediction
    ==================
    response_label| value  
    --------------+--------
          0       | 0.00000
          1       | 0.00000
    
    
    ===============
    Additional Info
    ===============
           Name       |Value
    ------------------+-----
        tree_count    | 10  
    rejected_row_count|  0  
    accepted_row_count|1234 
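    The rejected/accepted bookkeeping in the two fit summaries can be reproduced in a few lines of plain Python. This is only a sketch of the apparent behaviour (the server dropping any row whose predictor is NULL), not Vertica's actual implementation:

```python
def split_rows(ages):
    """Partition rows into accepted (non-NULL predictor) and rejected (NULL)."""
    accepted = [a for a in ages if a is not None]
    rejected = [a for a in ages if a is None]
    return accepted, rejected

# Titanic-like toy column: 1234 rows, 237 of them NULL.
ages = [30.0] * 997 + [None] * 237
accepted, rejected = split_rows(ages)
assert len(accepted) == 997 and len(rejected) == 237  # matches the fit summary
```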
    
  • chinga Community Edition User

    Note that I am not predicting at this stage - just fitting the model where the predictor can have NULLs.

    In the example posted, the number of rejected rows goes from 237 to 0 once the predictor (age) is filled. I was expecting XGBoost to handle NULLs in the predictors rather than reject the entire row.

    Thanks for clarifying that features_importance() is not available for XGBoost yet.

  • SruthiA Administrator

    @chinga: we don't support handling NULL values for XGBoost because there is no inherent order in the data, so we reject the entire row prior to training. For our time series algorithms, we allow different ways of filling in missing data because that data is ordered.
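    Given that behaviour, the client-side workaround is to impute before fitting, as the fillna() call in the example above did. Below is a minimal pure-Python sketch of mean imputation; whether fillna() actually uses the mean for numeric columns is an assumption here, not confirmed Vertica behaviour:

```python
def fill_with_mean(values):
    """Replace None entries with the mean of the observed values,
    so that no rows are rejected at training time."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [22.0, None, 30.0, None, 26.0]
filled = fill_with_mean(ages)  # -> [22.0, 26.0, 30.0, 26.0, 26.0]
```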

  • chinga Community Edition User

    Hi there – I didn’t understand the answer regarding NULL values and XGBoost. To clarify, XGBoost supports missing data as detailed in section 3.4 of this paper:

    https://arxiv.org/pdf/1603.02754.pdf

    The section is called “Sparsity-Aware Split Finding”. If I understand the feature correctly, I shouldn’t need to fill in the NULLs if NULLs are treated as “missing”. I hope this clarifies the question. More details about the feature I am talking about can be found here:

    Frequently Asked Questions — xgboost 1.6.1 documentation

    https://xgboost.readthedocs.io/en/latest/faq.html#how-to-deal-with-missing-values
    _XGBoost supports missing values by default. In tree algorithms, branch directions for missing values are learned during training._

    I’d be grateful if you could advise whether “Sparsity-Aware Split Finding” is available in Vertica’s implementation.
    Thanks,

  • SruthiA Administrator

    @chinga: XGBoost itself supports missing data, but we reject the rows with NULL values prior to training. If you want them to be considered, please open a support case describing your use case.
