Handling Missing Data Titanic case study
Published: July 21, 2024
Handling Missing Data: A Case Study with the Titanic Dataset
Dealing with missing data is a common challenge when working with tabular datasets. The key question often arises: should you keep a column with null values, remove the entire rows containing null values, or fill the null values with imputed data? The answer, unsurprisingly, is: it depends. The best approach varies case by case.
Let’s explore these options using the Titanic dataset as an example. The dataset includes a feature called “Age,” which has 177 null values out of 891 entries. Since we cannot train a model with missing values, we need to address them using one of the following strategies:
1. Remove the “Age” column entirely
2. Remove the 177 rows containing null values
3. Impute the 177 missing values using an imputation technique
To determine which method yields the best results in terms of accuracy, let’s examine the performance of several machine learning models using each strategy. We’ll compare Logistic Regression, K-Nearest Neighbors (KNN), Decision Tree, Gaussian Naive Bayes, and Support Vector Classifier (SVC).
Accuracy Results:
From these results, we can observe:
- Removing the “Age” Column: This method shows better consistency in accuracy across most models, though it slightly underperforms compared to removing null rows in Logistic Regression and GaussianNB.
- Removing Null Rows: This approach yields the highest accuracy for Logistic Regression and GaussianNB, but it significantly impacts the performance of KNN, Decision Tree, and SVC.
- Imputing Missing Values: This method tends to provide the lowest accuracy among the three strategies, especially affecting KNN and SVC.
In conclusion, the decision on how to handle missing values should be informed by the specific characteristics of the dataset and the models being used. In the case of the Titanic dataset, while removing the “Age” column provides a more consistent accuracy across various models, removing rows with null values might be the best option for certain models like Logistic Regression and GaussianNB.
Github: Code