
DAS 601 Project, Spring 2024

To Read:

To do well on this assignment, first read the following paper. It covers best
practices for carrying out a machine learning project, and you should follow
each of its recommendations.

1. How to Avoid Machine Learning Pitfalls: A Guide for Academic Researchers
(https://arxiv.org/html/2108.02497v4)

Part I. Problem Statement & Dataset (20 Marks)

The initial phase of the project is to identify a problem that can be addressed
with machine learning techniques. The problem may come from any domain,
including science, social science, health, business, and finance. You must also
gather a suitable dataset for addressing the problem. The following resources
are useful for finding data:

1. https://www.kaggle.com/datasets (Kaggle)
2. https://archive.ics.uci.edu/ (UCI Machine Learning Repository)
3. https://paperswithcode.com/datasets (Papers with Code)

Once the problem has been chosen and the data gathered, write an Introduction
chapter consisting of the following paragraphs:

1. Background and Context: In the first paragraph, give a brief overview of
the field of study, the specific area of machine learning, and the problem
you are focusing on, and explain why this area is important. In the second
paragraph, summarize what has been done previously, citing prior work and
contributions with IEEE-style references.
2. Motivation: In the third paragraph, clearly state the problem your research
addresses. Highlight the gaps or limitations in existing research that your
work aims to fill and explain how it will advance knowledge or benefit
society.
3. Objective and Scope: In the fourth paragraph, define the specific goals of
your research. What are you trying to achieve, and what are the boundaries
of your work?
4. Contributions: In the fifth paragraph, give a concise overview of the
primary contributions of your research. What are the main findings or
advances your paper introduces? Initially this paragraph will only state
your expectations, since the contributions cannot be confirmed until the
project is complete; revise it once you have finished.
5. Structure of the Paper: Briefly outline the organization of the paper. This
helps the reader understand what to expect in the subsequent sections. Your
paper will contain Methodology, Results, Discussion, and Conclusion
sections.

Resource and Reference:

The introduction sections of the following papers may serve as models. Study
the structure and flow of each introduction and organize your own in a
similar way.

1. Automatic COVID-19 prediction using explainable machine learning techniques
(https://www.sciencedirect.com/science/article/pii/S2666307423000037).
2. Detection and classification of brain tumor using hybrid deep learning
models (https://www.nature.com/articles/s41598-023-50505-6).

Part II. Exploratory Data Analysis (20 Marks)

In this section, your task is to delve into the dataset and systematically identify
and document its various characteristics. You are required to compose different
sections of the Methodology and Results chapters. The detailed instructions for
each chapter are as follows:

1. Dataset Composition: Create a subsection titled "Dataset" under the
Methodology chapter. In this subsection, provide a comprehensive overview
of the dataset. Address the following aspects with appropriate references:

a. Dataset Description: Explain what the dataset encompasses, including
its context and subject matter.
b. Data Collection Source: Identify the source from which the dataset was
collected, including the original authors and the purpose behind its
creation.
c. Dataset Purpose: Discuss the objectives for which the dataset was
originally compiled and any specific applications or use cases it serves.

Additionally, include information on the following aspects:

d. Review Data Types: Identify and classify the data types within the
dataset, such as numeric, categorical, and date/time.
e. Check Data Dimensions: Report the number of rows and columns in the
dataset.
f. Column Names and Definitions: Provide a detailed understanding of
what each column represents and any relevant definitions.

2. Determine and document how much of the data you will retain for training
and testing, and split the dataset accordingly. Use the training split
exclusively for all subsequent parts of the project.

3. Create a subsection titled "Data" under the Results chapter. In this
subsection, include the following visualizations along with proper captions,
and provide a written analysis of the insights derived from them:

a. Visualization of Categorical Variables:
i. Bar Chart: Visualize individual categorical variables using bar
charts.
ii. Grouped Bar Chart: Display multiple categorical variables to
compare their distributions.
iii. Stacked Bar Chart: Illustrate the composition of different
categories within a single chart.
iv. Horizontal Bar Chart: Present categorical data in a horizontal
format for better clarity and comparison.
b. Visualization of Numeric Variables:
i. Histograms: Show the distribution of numeric variables.
ii. Density Plot: Visualize the probability density of a numeric
variable.
iii. Box Plot: Display the spread and outliers within numeric data.
c. Scatter Plot:
i. Two Numeric Variables: Create a scatter plot to explore the
relationship between two numeric variables.
d. Combined Visualization:
i. Scatter Plot / Bubble Plot: Visualize two numeric variables along
with a categorical variable for added dimension and insight.
ii. Visualize correlations between all numeric variables using Pair
Plots.
e. Heatmap:
i. Two Categorical Variables and One Numeric Variable: Use a
heatmap to depict the relationship between two categorical
variables and their impact on a numeric variable.

Each visualization should be accompanied by a detailed explanation of the
patterns, trends, and insights observed from the data.
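As a concrete illustration, the split in step 2 and a few of the visualizations above can be sketched as follows. This is a minimal sketch assuming pandas, scikit-learn, and matplotlib are available; the Iris data is only a hypothetical stand-in for your own dataset, and the column names are specific to it:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in dataset: Iris as a DataFrame (150 rows, 4 numeric features + target).
df = load_iris(as_frame=True).frame
df["species"] = df["target"].map(dict(enumerate(load_iris().target_names)))

# Step 2: hold out 20% for testing; stratifying preserves class proportions.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["species"], random_state=42
)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# (a) Bar chart of a categorical variable.
train_df["species"].value_counts().plot.bar(ax=axes[0, 0], title="Bar Chart")
# (b) Histogram and box plot of numeric variables.
train_df["sepal length (cm)"].plot.hist(bins=20, ax=axes[0, 1], title="Histogram")
train_df[["sepal length (cm)", "petal length (cm)"]].plot.box(
    ax=axes[1, 0], title="Box Plot"
)
# (c/d) Scatter plot of two numeric variables, colored by a categorical one.
for name, grp in train_df.groupby("species"):
    axes[1, 1].scatter(grp["sepal length (cm)"], grp["petal length (cm)"],
                       label=name, s=12)
axes[1, 1].set_title("Scatter Plot")
axes[1, 1].legend()

fig.tight_layout()
fig.savefig("eda_plots.png")  # embed figures like this one in your report
```

Swap in your own DataFrame and column names; each figure in the report needs a caption and a short written analysis, as required above.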

Resource & Reference:

1. Exploratory Data Analysis: Translating Statistics into Decisions
(https://drive.google.com/file/d/1FTf4mhgXheckEhjKCJ6XSza1QPcL0_dx/view?usp=sharing)

Part III. Feature Engineering (20 Marks)

Create a subsection titled Feature Engineering under the Methodology chapter.
In this subsection you will derive new features from the existing ones. Apply
the following feature engineering techniques and document them here:

1. Polynomial Features: Create new features by raising existing features to a
power, often used in polynomial regression.
2. Interaction Features: Combine two or more features multiplicatively to
capture interactions between variables.
3. Domain-Specific Features: Generate new features based on domain
knowledge, such as ratios, differences, or custom transformations relevant to
the problem at hand.
4. Principal Component Analysis (PCA): Reduces dimensionality by projecting
data into a lower-dimensional space while retaining most of the variance.
5. Normalization: Adjusts the scale of features to have a specific range, such as
[0,1].
6. Standardization: Transforms features to have zero mean and unit variance.
7. Log Transformation: Applies a logarithmic transformation to reduce
skewness and normalize distributions.
8. One-Hot Encoding: Converts categorical variables into a set of binary
indicators.
9. Label Encoding: Assigns a unique integer to each category.
10. Frequency Encoding: Replaces each category with the frequency of its
occurrence in the dataset.
11. Min-Max Scaling: Scales features to a given range, typically [0,1].
12. Equal Width Binning: Divides numeric data into bins of equal width.
13. Equal Frequency Binning: Divides data into bins with an equal number of
data points.
14. Custom Binning: Uses domain knowledge to create meaningful bins.
15. Mean/Median Imputation: Replaces missing values with the mean or median
of the column.
16. K-Nearest Neighbors Imputation: Uses the values of the nearest neighbors to
impute missing values.
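Several of these techniques can be sketched with pandas and scikit-learn as follows. The tiny frame is a hypothetical stand-in for your training data, and the numbers in the comments refer to the list above:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures, StandardScaler

# Hypothetical toy training frame with one missing value.
df = pd.DataFrame({
    "age": [22.0, 35.0, np.nan, 58.0],
    "income": [30_000.0, 52_000.0, 61_000.0, 90_000.0],
    "city": ["dhaka", "khulna", "dhaka", "sylhet"],
})

# 15. Mean imputation for the missing numeric value.
df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()

# 1-2. Polynomial and interaction features (degree 2 gives age², income², age*income).
poly_feats = PolynomialFeatures(degree=2, include_bias=False).fit_transform(
    df[["age", "income"]]
)

# 5-6, 11. Min-max scaling to [0, 1] and standardization (zero mean, unit variance).
minmax = MinMaxScaler().fit_transform(df[["income"]])
standard = StandardScaler().fit_transform(df[["income"]])

# 7. Log transform to reduce skewness.
df["log_income"] = np.log1p(df["income"])

# 8. One-hot encoding of a categorical column.
onehot = pd.get_dummies(df["city"], prefix="city")

# 10. Frequency encoding: replace each category with its count.
df["city_freq"] = df["city"].map(df["city"].value_counts())

# 12. Equal-width binning into 3 bins.
df["age_bin"] = pd.cut(df["age"], bins=3, labels=False)
```

In the real project, fit every imputer, scaler, and encoder on the training split only, then reuse the fitted objects on the test split.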

After that, you will select the best features among the existing ones and the newly
created ones using the following methods:

1. Filter Methods: Use statistical techniques such as correlation, the
Chi-square test, or mutual information to select relevant features.
2. Wrapper Methods: Use model performance metrics to evaluate the subset of
features (e.g., recursive feature elimination).
3. Embedded Methods: Integrate feature selection within model training (e.g.,
LASSO, Ridge regression).

Part IV. Model Selection (20 Marks)

Models

This section covers the development of your machine learning models. First,
identify at least five distinct models (you may choose more) based on your
machine learning expertise and the results of your literature review. Then
create a subsection titled "Models" and give a detailed account of the theory,
advantages, and disadvantages of each model, supported by diagrams and
references, along with your reasons for choosing them.

Candidate Model Selection

Next, train the models with their default hyperparameters using 5-fold
cross-validation, applying the features you selected in Part III. For
regression, choose an appropriate evaluation metric such as MSE, MAE, or RMSE;
for classification, choose accuracy, precision, recall, or AUC-ROC. Compute the
mean and standard deviation of the metric across the folds, compare the models'
performance, and select the top three. Document this candidate-selection
procedure in the "Model Selection" subsection of the Methodology chapter, and
report each model's results in a table in the "Model Selection" section under
the Results chapter.
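A sketch of this candidate-selection step, assuming a classification problem and scikit-learn; the dataset and the five models here are only illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in for your training split; substitute the features chosen in Part III.
X, y = load_breast_cancer(return_X_y=True)

# Five candidate models with default hyperparameters.
models = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "svm": make_pipeline(StandardScaler(), SVC()),
    "tree": DecisionTreeClassifier(random_state=42),
    "forest": RandomForestClassifier(random_state=42),
}

# 5-fold cross-validation: record mean and standard deviation per model.
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name:8s} accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")

# Keep the top three by mean score as candidates for tuning.
top3 = sorted(results, key=lambda m: results[m][0], reverse=True)[:3]
print("candidates:", top3)
```

The per-model means and standard deviations printed here are exactly what belongs in the results table required above.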

Hyperparameter Tuning & Final Model Selection


Conduct thorough hyperparameter optimization on the three selected models
using methods such as grid search, random search, or Bayesian optimization,
and verify the performance gains with cross-validation. Then choose the final
model through a thorough assessment covering performance metrics,
computational efficiency, and model complexity.
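Grid search is the simplest of the listed tuning methods; a sketch with scikit-learn, using an SVM pipeline as an illustrative model and a deliberately small parameter grid:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for your training split.
X, y = load_breast_cancer(return_X_y=True)

# Grid search over SVM hyperparameters, scored by 5-fold cross-validation.
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
grid = GridSearchCV(
    pipe,
    param_grid={"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.01, 0.001]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)
print("best params:", grid.best_params_)
print(f"best cross-validated accuracy: {grid.best_score_:.3f}")
```

`RandomizedSearchCV` has the same interface and samples the grid instead of exhausting it, which is often the better trade-off when the parameter space is large.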

Part V. Model Evaluation (20 Marks)

Under the Methodology chapter, create a distinct subsection called "Model
Evaluation" and outline the specific approach you will use to assess the
performance of your model. Thoroughly describe the metrics you plan to use,
including their definitions, equations, benefits, drawbacks, limitations, and
the rationale for your selection.

Fit the preprocessing steps (e.g., normalization, encoding) on the training
data only and apply the fitted transformations to the test set, so that no
information leaks from the test set. Choose metrics that align with the
business objective and the type of problem (classification or regression). For
classification, report Accuracy, Precision, Recall, F1-Score, AUC-ROC, and a
Confusion Matrix; for regression, report Mean Squared Error (MSE), Root Mean
Squared Error (RMSE), Mean Absolute Error (MAE), and R² Score.
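The classification metrics above might be computed like this, with preprocessing fitted on the training split only; the dataset and model are hypothetical stand-ins for your tuned final model:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training set only, then apply it to the test set.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1-score :", f1_score(y_test, y_pred))
print("auc-roc  :", roc_auc_score(y_test, y_prob))
print("confusion matrix:\n", confusion_matrix(y_test, y_pred))
```

For a regression problem, `mean_squared_error`, `mean_absolute_error`, and `r2_score` from the same `sklearn.metrics` module take the place of these calls.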

Overall Structure

0 Abstract
1 Introduction
2 Methodology
2.1 Dataset
2.2 Feature Engineering
2.3 Model Selection
2.4 Model Evaluation
3 Results
3.1 Data
3.2 Model Selection
3.3 Model Evaluation
3.4 Error Analysis
4 Discussion
5 Conclusion
