XGBoost - features_importance() and NULL handling not implemented?
chinga
Community Edition User
Hello - I have created and fit an XGBoost model using Vertica 12 CE. I can call methods like .score() and .roc_curve(), but .features_importance() returns an error:
FunctionError: Method 'features_importance' for 'XGBoostClassifier' doesn't exist.
According to the documentation (https://www.vertica.com/python/documentation_last/learn/XGBoostClassifier/index.php) it should exist, and it does work with the Random Forest classifier.
I believe that XGBoost (the algorithm) handles missing values without requiring imputation, but Vertica's implementation appears to reject any row where a predictor (X) contains NULL. Is that correct?
Thanks.
Answers
@chinga: As of today, features_importance() only works with RandomForest, because the Vertica server supports predictor importance only for random forest. There is a feature request open in VerticaPy to support it for XGBoost as well:
https://github.com/vertica/VerticaPy/issues
https://www.vertica.com/docs/11.1.x/HTML/Content/Authoring/SQLReferenceManual/Functions/MachineLearning/RF_PREDICTOR_IMPORTANCE.htm
XGBoost will generally handle missing values without needing imputation. Could you please share an example where model creation with NULL values works but prediction fails?
The age column contains NULLS - note the number of rejected rows when fitting the classifier
Check the count of NULLs in the age column. Fill them, then count again.
No rows are now rejected when fitting the classifier
Note that I am not predicting at this stage - just fitting a model whose predictor contains NULLs.
In the example posted, the number of rejected rows goes from 237 to 0 when the predictor (age) is filled. I was expecting XGBoost to handle NULLs in the predictors rather than reject the entire row.
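The count-fill-count check described above can be sketched without a database using plain Python lists in place of a Vertica column. The column name "age" matches the example; the data values here are invented for illustration:

```python
# Database-free sketch of the check above: count the NULLs in a predictor,
# fill them with the column mean, then count again. In the real workflow the
# filled column no longer causes rows to be rejected during fit().

def count_nulls(column):
    """Count missing (None) entries in a column."""
    return sum(1 for v in column if v is None)

def fill_with_mean(column):
    """Replace None entries with the mean of the non-missing values."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]

age = [22.0, None, 38.0, None, 26.0, 35.0]
print(count_nulls(age))            # 2 -> these rows would be rejected
age_filled = fill_with_mean(age)
print(count_nulls(age_filled))     # 0 -> no rows rejected after filling
```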
Thanks for clarifying that features_importance() is not available for XGBoost yet.
@chinga: We don't support handling NULL values for XGBoost because there is no specific order in the data; hence we reject the entire row prior to training. For our time-series algorithms, we allow different ways of filling in missing data because the data is ordered.
Hi there – I didn’t understand the answer regarding NULL values and XGBoost. To clarify, XGBoost supports missing data as detailed in section 3.4 of this paper:
https://arxiv.org/pdf/1603.02754.pdf
The section is called “Sparsity-Aware Split Finding”. If I understand the feature correctly, I shouldn’t need to fill in the NULLs if NULLs are treated as “missing”. I hope this clarifies the question. More details about the feature I am talking about can be found here:
Frequently Asked Questions — xgboost 1.6.1 documentation
https://xgboost.readthedocs.io/en/latest/faq.html#how-to-deal-with-missing-values
*XGBoost supports missing values by default. In tree algorithms, branch directions for missing values are learned during training.*
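The idea quoted above can be illustrated in a few lines: for a candidate split, samples with a missing feature value are tried in both branches, and the default direction with the lower loss is learned. This is a simplified, database-free sketch of the concept from the paper, not Vertica's or xgboost's actual implementation; the threshold and data are invented:

```python
# Toy illustration of "sparsity-aware split finding": learn the default
# branch direction for missing values by trying both sides of a split and
# keeping whichever minimizes the squared error of the two branches.

def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty branch)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_default_direction(x, y, threshold):
    """Learn whether samples with missing x should go left or right."""
    left    = [t for v, t in zip(x, y) if v is not None and v < threshold]
    right   = [t for v, t in zip(x, y) if v is not None and v >= threshold]
    missing = [t for v, t in zip(x, y) if v is None]
    loss_left  = sse(left + missing) + sse(right)   # send missing left
    loss_right = sse(left) + sse(right + missing)   # send missing right
    return "left" if loss_left <= loss_right else "right"

# The missing samples' targets resemble the high-x group, so the learned
# default direction is "right".
x = [1.0, 2.0, None, 8.0, 9.0, None]
y = [0.0, 0.1, 1.0, 0.9, 1.1, 1.05]
print(best_default_direction(x, y, threshold=5.0))  # right
```

The point of the follow-up question is that with this mechanism, NULLs in predictors carry signal rather than forcing row rejection.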
I’d be grateful if you could advise if “Sparsity-Aware Split Finding” is available in Vertica’s implementation.
Thanks,
@chinga: XGBoost supports missing data, but we reject rows with NULL values prior to training. If you want them to be considered, please open a support case describing your use case.