Which comes first? Multiple Imputation, Splitting into train/test, or Standardization/Normalization

https://datascience.stackexchange.com/questions/53138

normalization
multiclass-classification
data-imputation

01-11-2019
|

Pregunta

I am working on a multi-class classification problem, with ~65 features and ~150K instances. 30% of features are categorical and the rest are numerical (continuous). I understand that standardization or normalization should be done after splitting the data into train and test subsets, but I am not still sure about the imputation process. For the classification task, I am planning to use Random Forest, Logistic Regression, and XGBOOST (which are not distance-based).

Could someone please explain which should come first? Split > imputation or imputation>split? In case that split>imputation is correct, should I follow imputation>standardization or standardization>imputation?

No hay solución correcta

Licenciado bajo: CC-BY-SA con atribución

No afiliado a datascience.stackexchange