Titanic
Abstract—This paper provides a comprehensive exploration of the Titanic machine learning challenge, addressing key concepts in data parsing, wrangling, and advanced machine learning. We detail Random Forest classification with ensemble learning and hyperparameter tuning, as well as the evaluation of predictions using confusion matrices. Additionally, techniques such as gradient descent and feature normalization are discussed to optimize performance. This approach achieves competitive accuracy on Kaggle's leaderboard.

I. INTRODUCTION

For binary classification problems, where the goal is to predict whether a passenger survived based on a variety of variables, the Titanic dataset offers a well-known benchmark. These variables include travel-related information, such as fare, embarkation port, and whether the traveler was traveling alone or with family, as well as demographic data, such as age, sex, and socioeconomic standing (e.g., passenger class). The dataset offers a strong case for applying machine learning, demonstrating how feature engineering, data preprocessing, and model selection interact to solve practical classification problems.

In this research, we present a complete machine learning pipeline that aims to optimize both interpretability and accuracy. To ensure the dataset is ready for analysis, the procedure starts with data preprocessing, which includes addressing missing values, encoding categorical variables, and normalizing continuous features. Feature engineering is then used to extract additional information, such as calculating family size from the number of parents and siblings on board or producing binary indicators for titles (e.g., Mr., Mrs., Miss) extracted from passenger names. By offering vital context, these engineered features can improve the model's capacity to identify patterns of survival.

A variety of classification techniques, including Support Vector Machines (SVM), Random Forests, Decision Trees, and Logistic Regression, are applied after preprocessing and feature engineering. To determine the ideal setup for maximizing predictive performance, each model is tuned using hyperparameter optimization approaches such as grid search or random search. Finally, metrics including accuracy, precision, recall, and the F1-score are used to assess the models, and cross-validation is used to ensure the models are resilient against overfitting. This pipeline prioritizes interpretability in addition to high accuracy, allowing for a more thorough understanding of the variables affecting Titanic survival.

II. DATA PARSING AND WRANGLING

The dataset consists of training ('train.csv') and testing ('test.csv') files. Preprocessing was crucial for handling missing values and preparing the data for machine learning algorithms. The loading step assumed by the listings is sketched below.
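The listings in this paper operate on tables named titanic_train and titanic_test, whose creation is not shown. As a minimal sketch, assuming the standard Kaggle file names, the tables can be loaded with readtable:

% Assumed loading step (not shown in the original listings)
titanic_train = readtable('train.csv');
titanic_test  = readtable('test.csv');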
A. Handling Missing Values

Missing values in 'Age' and 'Fare' were imputed using the mean of the respective columns to avoid loss of data during analysis.

Listing 1: Filling Missing Values
mean_age = mean(titanic_train.Age, 'omitnan');
mean_fare = mean(titanic_train.Fare, 'omitnan');
titanic_train.Age = fillmissing(titanic_train.Age, 'constant', mean_age);
titanic_train.Fare = fillmissing(titanic_train.Fare, 'constant', mean_fare);

B. Feature Engineering and Transformation

Feature engineering enhances the dataset by adding new, meaningful variables:
• FamilySize: The sum of siblings/spouses and parents/children aboard, plus one (the passenger themselves).
• IsAlone: A binary variable indicating whether the passenger traveled alone.
• AgeGroup: Age values binned into categories such as 'Child', 'Teen', 'Adult', etc., to reduce noise and enhance interpretability.

Listing 2: Feature Engineering
titanic_train.FamilySize = titanic_train.SibSp + titanic_train.Parch + 1;
titanic_train.IsAlone = double(titanic_train.FamilySize == 1);

age_bins = [0, 12, 18, 35, 60, Inf];
age_labels = {'Child', 'Teen', 'Adult', 'Middle_Aged', 'Senior'};
titanic_train.AgeGroup = discretize(titanic_train.Age, age_bins, ...
    'categorical', age_labels);

C. Data Normalization and Gradient Descent

Large variations in feature magnitudes can result in slow convergence and wasteful updates, making gradient descent algorithms extremely sensitive to the scale of input features. To address this, features such as Age, Fare, and FamilySize were normalized, scaling their values between 0 and 1. This keeps features with wider ranges from dominating the gradient updates and guarantees that each feature contributes equally throughout the optimization process. By bringing these features to a common scale, normalization speeds up convergence and enhances the stability and performance of gradient descent-based models, producing predictions that are more accurate and dependable.

Listing 3: Data Normalization
num_features = {'Age', 'Fare', 'SibSp', 'Parch', 'FamilySize'};
for i = 1:length(num_features)
    % 'range' rescales each feature to [0, 1], matching the description above
    titanic_train.(num_features{i}) = normalize(titanic_train.(num_features{i}), 'range');
end
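The models reported later in this paper are trained with fitcensemble rather than an explicit gradient descent loop, so the following is only an illustrative sketch of why the scaling above matters: a hypothetical logistic regression fit by batch gradient descent on the normalized features. The feature subset, learning rate, and iteration count are assumptions, not values from the original experiments, and Survived is assumed to be a numeric 0/1 column as read from 'train.csv'.

% Illustrative only: batch gradient descent for logistic regression
% on the normalized features (assumed feature set and hyperparameters).
X = [ones(height(titanic_train), 1), ...
     titanic_train.Age, titanic_train.Fare, titanic_train.FamilySize];
y = titanic_train.Survived;

w = zeros(size(X, 2), 1);   % initial weights
alpha = 0.1;                % learning rate (assumed)
for iter = 1:2000
    p = 1 ./ (1 + exp(-X * w));                    % sigmoid predictions
    grad = X' * (p - y) / height(titanic_train);   % gradient of the log-loss
    w = w - alpha * grad;                          % gradient descent update
end

Because all three features lie in [0, 1] after normalization, a single learning rate works reasonably well for every weight; with raw Fare values in the hundreds, the same loop would need a much smaller step size to remain stable.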
III. HIGHER DIMENSIONAL CLASSIFICATION WITH DUMMY VARIABLES

To enable numerical interpretation in machine learning models, categorical data, including 'Sex', 'Embarked', and 'AgeGroup', were transformed into dummy variables.

Listing 4: Dummy Variable Conversion
function data = categorical_data_to_dummy_variables(data, vars)
    for i = 1:length(vars)
        if ismember(vars{i}, data.Properties.VariableNames)
            if ~iscategorical(data.(vars{i}))
                data.(vars{i}) = categorical(data.(vars{i}));
            end
            dummy_vars = dummyvar(data.(vars{i}));
            dummy_table = array2table(dummy_vars, 'VariableNames', ...
                strcat(vars{i}, '_', string(categories(data.(vars{i})))));
            data = [data, dummy_table];
            data.(vars{i}) = [];
        end
    end
end
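The call to this function is not shown in the paper; as a usage sketch, and assuming the variable list from the description above, it would be applied to both tables before training so that the training and test sets share the same encoded columns:

% Assumed invocation of the conversion function on both splits
cat_vars = {'Sex', 'Embarked', 'AgeGroup'};
titanic_train = categorical_data_to_dummy_variables(titanic_train, cat_vars);
titanic_test  = categorical_data_to_dummy_variables(titanic_test, cat_vars);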
IV. RANDOM FOREST CLASSIFICATION AND ENSEMBLE LEARNING

A. Random Forest Overview

Random Forest is an ensemble learning technique that combines several decision trees to increase classification accuracy and avoid overfitting. It works by training multiple trees on random subsets of the data and features.

B. Cross-Validation for Model Tuning

A 12-fold cross-validation strategy was used to evaluate hyperparameter combinations:
• MinLeafSize: Minimum observations required at a leaf node.
• MaxNumSplits: Maximum number of splits in each tree.
• NumLearningCycles: Number of trees in the ensemble.

Listing 5: Random Forest Implementation
cv = cvpartition(size(titanic_train, 1), 'KFold', 12);
for leaf = [11, 22, 27]
    for splits = [11, 22, 33]
        for cycles = [100, 250, 500]
            temp_model = fitcensemble(titanic_train, 'Survived', ...
                'Method', 'Bag', ...
                'NumLearningCycles', cycles, ...
                'Learners', templateTree('MaxNumSplits', splits, ...
                    'MinLeafSize', leaf), ...
                'CrossVal', 'on', 'CVPartition', cv);
        end
    end
end
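Listing 5 fits a cross-validated ensemble for each hyperparameter combination but does not show how the best configuration is chosen or how final_model, used in Listing 6, is obtained. The following is a minimal sketch of that selection step; the bookkeeping variables (best_loss, best_params, best_cv_model) are assumptions, not code from the original study.

% Assumed selection step: track the lowest cross-validation loss and
% refit a final (non-cross-validated) ensemble with the chosen settings.
best_loss = Inf;
for leaf = [11, 22, 27]
    for splits = [11, 22, 33]
        for cycles = [100, 250, 500]
            temp_model = fitcensemble(titanic_train, 'Survived', ...
                'Method', 'Bag', 'NumLearningCycles', cycles, ...
                'Learners', templateTree('MaxNumSplits', splits, ...
                    'MinLeafSize', leaf), ...
                'CrossVal', 'on', 'CVPartition', cv);
            loss = kfoldLoss(temp_model);   % cross-validated misclassification rate
            if loss < best_loss
                best_loss = loss;
                best_cv_model = temp_model; % kept for later evaluation
                best_params = struct('leaf', leaf, 'splits', splits, ...
                    'cycles', cycles);
            end
        end
    end
end

final_model = fitcensemble(titanic_train, 'Survived', ...
    'Method', 'Bag', 'NumLearningCycles', best_params.cycles, ...
    'Learners', templateTree('MaxNumSplits', best_params.splits, ...
        'MinLeafSize', best_params.leaf));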
V. EVALUATION OF PREDICTIONS

A. Confusion Matrix

The confusion matrix quantifies the model's predictive performance. Table I shows the confusion matrix generated using cross-validation predictions; a sketch of how such a matrix can be computed follows the table.

TABLE I: Confusion Matrix for Training Predictions

                 Predicted: No   Predicted: Yes
Actual: No            491              58
Actual: Yes            97             245
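The code that produced Table I is not shown in the paper. A minimal sketch, assuming the cross-validated ensemble retained in the selection sketch above (best_cv_model) and a numeric 0/1 Survived column, is:

% Assumed evaluation step: cross-validation predictions on the training set
cv_predictions = kfoldPredict(best_cv_model);
conf_mat = confusionmat(titanic_train.Survived, cv_predictions);
disp(conf_mat);            % counts as in Table I
confusionchart(conf_mat);  % optional graphical view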
B. Kaggle Submission

A submission file was generated for Kaggle evaluation:

Listing 6: Kaggle Submission
test_predictions = predict(final_model, titanic_test);
submission = table(titanic_test.PassengerId, ...
    test_predictions, 'VariableNames', {'PassengerId', 'Survived'});
writetable(submission, 'submission.csv');

C. Kaggle Submission and Results

The final submission, generated with the code in Listing 6, achieved a Kaggle score of 0.80708, placing it in the top quartile of the leaderboard.

D. Visualization

Figure 2 illustrates the distribution of accuracies from hyperparameter tuning, and Figure 3 shows the histogram representation of the confusion matrix. A sketch of how comparable plots can be produced is given below.
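Figures 2 and 3 are not reproduced here; the following sketch only illustrates how comparable plots could be generated, assuming the per-combination accuracies from the tuning loop were collected in a hypothetical vector cv_accuracies and that conf_mat is the matrix behind Table I.

% Hypothetical plotting step for figures like Fig. 2 and Fig. 3
figure;
histogram(cv_accuracies);               % distribution of tuning accuracies
xlabel('Cross-validation accuracy');
ylabel('Count');

figure;
bar(conf_mat(:));                       % histogram-style view of confusion counts
set(gca, 'XTickLabel', {'TN', 'FN', 'FP', 'TP'});
ylabel('Count');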
Fig. 1: Kaggle score

VI. CONCLUSION

Using the advantages of ensemble learning, this project demonstrates how well Random Forest handles classification challenges. Random Forest is a reliable option for complicated datasets because it minimizes overfitting and enhances generalization by combining predictions from several decision trees. The model's performance is further improved via hyperparameter optimization, which includes tuning the number of trees, the maximum depth, and the minimum samples per split. By balancing bias and variance, this procedure ensures that the model achieves high accuracy and stability. Incorporating gradient descent optimization into specific feature selection processes or model elements guarantees effective convergence and improves predictive power.

Cross-validation, which provides a more accurate assessment of the model's performance on unseen data, is used to verify the model's robustness and avoid overfitting. This method of iterative evaluation helps uncover discrepancies between different data splits. Future developments might investigate the use of neural networks, which could identify more intricate patterns in the data. Furthermore, the accuracy and interpretability of the model may be improved by revealing hidden correlations between variables through deeper feature engineering, such as interaction terms or non-linear transformations.