S-11
(1) Data Preprocessing: You are tasked with building a machine learning model
using data from multiple sources with different formats. How would you
preprocess and standardize this data for analysis?
Data preprocessing is a crucial step in building machine learning models, especially when
dealing with data from multiple sources with different formats. The goal is to clean, transform,
and standardize the data so that it can be effectively used for analysis and modeling. Here’s
how I would approach preprocessing and standardizing the data for analysis. First, before any preprocessing, it is important to understand the structure, content, and quality of the data from the different sources. Data might arrive in various forms: CSV, JSON, XML, SQL databases, or APIs; with different encodings or delimiters; and as time-series, categorical, or textual data. Begin by examining the data and looking for inconsistencies or patterns that indicate which transformation steps are necessary, as sketched below.
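As a hedged illustration of this first step, the sketch below loads data from a few common formats into pandas DataFrames for inspection; the file names and the SQLite database are hypothetical placeholders, not files referenced in this answer.

```python
# Hypothetical sketch: consolidating data from several source formats into
# pandas DataFrames before cleaning. File names and the SQLite path are
# placeholders for the actual sources.
import pandas as pd
import sqlite3

# CSV with a non-default delimiter and encoding
sales_csv = pd.read_csv("sales.csv", sep=";", encoding="latin-1")

# JSON records exported from an API
events_json = pd.read_json("events.json")

# A table pulled from a SQL database
with sqlite3.connect("warehouse.db") as conn:
    customers_sql = pd.read_sql("SELECT * FROM customers", conn)

# Quick structural inspection of each source
for name, df in [("csv", sales_csv), ("json", events_json), ("sql", customers_sql)]:
    print(name, df.shape)
    print(df.dtypes)
```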
For data cleaning: data from multiple sources often contains missing values, outliers, or incorrect entries. The following steps help clean the data. Beginning with missing-data handling, we first identify missing values using functions like `isnull()` or `isna()` in pandas, and then either impute them or drop rows or columns, depending on the extent of the missingness. Next comes outlier detection: outliers can be found with statistical methods or visual methods such as box plots, and we then decide whether to transform, cap, or remove them depending on the nature of the data and the problem at hand. Similarly, for fixing incorrect data, we look for entries such as typos or values that don't make sense in context. A minimal cleaning sketch is shown below.
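A minimal cleaning sketch, assuming a small hypothetical DataFrame with a numeric `amount` column and a categorical `category` column, could look like this:

```python
# Hedged cleaning sketch: missing values plus IQR-based outlier capping.
# The DataFrame and its columns are illustrative assumptions.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "amount": [10.0, 12.5, np.nan, 11.0, 300.0],
    "category": ["a", "b", "b", None, "a"],
})

# 1. Identify missing values
print(df.isnull().sum())

# 2. Impute: median for numeric, mode for categorical
df["amount"] = df["amount"].fillna(df["amount"].median())
df["category"] = df["category"].fillna(df["category"].mode()[0])

# 3. Detect and cap outliers with the IQR rule
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["amount"] = df["amount"].clip(lower, upper)

print(df)
```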
(2) Feature Engineering: You are working on a project to predict student
performance based on factors such as study habits, hours of sleep, and
extracurricular activities. How would you engineer features from these variables?
To engineer features from the given variables, we can apply the following steps. Categorical features: create categories based on study habits, such as "good," "average," and "poor," depending on whether the student studies regularly, occasionally, or rarely; this could be derived from a rating scale or self-reported data. If available, add a study-time feature representing the number of hours spent studying per day or week, and a binary or continuous feature indicating the consistency of study habits. For study material type, we can categorize or encode the types of study materials the student uses (a small encoding sketch follows).
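As a hedged sketch of these categorical features, assuming hypothetical column names for self-reported study data:

```python
# Hedged sketch: deriving study-habit categories and encoding material type.
# Column names and bin edges are illustrative assumptions.
import pandas as pd

students = pd.DataFrame({
    "study_hours_per_week": [2, 8, 15, 5],
    "study_material": ["videos", "textbook", "textbook", "notes"],
})

# Bin raw study hours into "poor" / "average" / "good" habit categories
students["study_habit"] = pd.cut(
    students["study_hours_per_week"],
    bins=[0, 4, 10, float("inf")],
    labels=["poor", "average", "good"],
)

# One-hot encode the study material type
students = pd.get_dummies(students, columns=["study_material"])
print(students)
```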
Interaction features: create features that capture the interaction between study habits and sleep, as they might affect performance together; for example, combine "hours of study" and "hours of sleep" to see how a balance between the two affects performance. Similarly, create interaction terms between study habits and extracurricular involvement to assess whether balancing these factors influences performance. Normalization and transformation: normalize continuous features like hours of sleep and study time to a common scale so they have comparable influence on the model, and apply transformations (for example, a log transform) to skewed features. Additional derived features: create a study-to-extracurricular ratio to evaluate the balance between academic focus and outside activities, and a sleep-to-study ratio to assess the relationship between rest and study time (see the sketch below). By carefully transforming these variables into meaningful features, we give the model a well-rounded view of the student's lifestyle, which should help improve the prediction of student performance.
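A minimal sketch of the interaction, ratio, and scaling features described above, using hypothetical column names:

```python
# Hedged sketch of interaction, ratio, and scaling features.
# The DataFrame and its columns are illustrative placeholders.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "study_hours": [2.0, 8.0, 15.0, 5.0],
    "sleep_hours": [6.0, 7.5, 5.0, 8.0],
    "extracurricular_hours": [4.0, 2.0, 1.0, 6.0],
})

# Interaction terms
df["study_x_sleep"] = df["study_hours"] * df["sleep_hours"]
df["study_x_extracurricular"] = df["study_hours"] * df["extracurricular_hours"]

# Ratio features (small epsilon avoids division by zero)
eps = 1e-6
df["study_to_extra_ratio"] = df["study_hours"] / (df["extracurricular_hours"] + eps)
df["sleep_to_study_ratio"] = df["sleep_hours"] / (df["study_hours"] + eps)

# Put continuous features on a common scale
cols = ["study_hours", "sleep_hours", "extracurricular_hours"]
df[cols] = StandardScaler().fit_transform(df[cols])
print(df.round(2))
```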
(3) Model Selection: You need to build a model to detect fraudulent transactions
in real-time for a financial institution. Which machine learning model would you
choose, and how would you balance accuracy with computation time?
Fraud detection is essentially an anomaly-catching problem, and such mechanisms matter more than ever: if anomalous transactions go undetected, the consequences for a financial institution can be serious. For example, on an e-commerce site where customers buy items and choose among several payment methods, a machine learning model can flag anomalous payment behavior as it happens. Turning to the question itself,
when selecting a machine learning model to detect fraudulent transactions in real-time for a
financial institution, there are several key factors to consider: model accuracy, computation
time, interpretability, and scalability. Here's how I would approach this problem. Model selection: given that this is a real-time fraud detection problem, I would focus on models that balance high predictive accuracy with low latency (quick inference time).
- Logistic Regression: simple, interpretable, and works well with high-dimensional data, especially with features engineered from transaction history; however, it may not always capture complex relationships in the data.
- Random Forest / Gradient Boosting: more sophisticated ensemble methods that capture complex patterns and provide strong performance in classification tasks like fraud detection. XGBoost and LightGBM, in particular, have optimizations for faster training and inference, making them more suitable for real-time predictions.
- Neural Networks: tend to perform well when large amounts of data are available and there are complex relationships between features; however, they are generally slower to train and require more computational resources, which might impact real-time performance unless specifically optimized.
- Anomaly Detection (Isolation Forest, One-Class SVM): if fraudulent transactions are much rarer than legitimate ones, an unsupervised anomaly detection model could work well. Isolation Forest, for example, is lightweight and efficient at identifying outliers, which could represent fraudulent transactions (a minimal sketch follows the list).
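As a hedged illustration of the unsupervised option, the sketch below fits scikit-learn's `IsolationForest` on synthetic transaction amounts; the data and the 1% contamination rate are illustrative assumptions.

```python
# Hedged sketch: flagging anomalous transactions with Isolation Forest.
# The synthetic amounts and the contamination rate are assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 990 "normal" transaction amounts plus 10 unusually large ones
amounts = np.concatenate([rng.normal(50, 10, 990), rng.normal(500, 50, 10)])
X = amounts.reshape(-1, 1)

model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(X)  # -1 = anomaly, 1 = normal
print("flagged as anomalous:", (labels == -1).sum())
```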
Given these considerations, Gradient Boosting models would be a good balance between
accuracy and computational efficiency. These models are highly effective for structured data
and provide scalable, accurate predictions with relatively fast inference times.
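To make the chosen approach concrete, here is a minimal sketch using scikit-learn's histogram-based gradient boosting (assuming scikit-learn 1.0 or later) on synthetic, imbalanced data; in production, XGBoost or LightGBM would be tuned on real transaction features, and precision and recall on the rare fraud class, together with measured inference latency, would guide the accuracy/computation trade-off rather than accuracy alone.

```python
# Hedged sketch: gradient boosting on an imbalanced, synthetic fraud dataset.
# Features, class ratio, and hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=20_000, n_features=20, weights=[0.99, 0.01], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

clf = HistGradientBoostingClassifier(max_iter=200, learning_rate=0.1, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), digits=3))
```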
(4) Cross-Validation: You are given a time-series dataset of sales data over the past
five years. How would you implement cross-validation for this time-series data,
and which validation technique would you use?
For time-series data, traditional cross-validation techniques like k-fold cross-validation aren't
suitable because they don't account for the temporal order of the data. Instead, we can use a
time-series specific validation technique such as TimeSeriesSplit or walk-forward validation. I would implement cross-validation for this dataset using TimeSeriesSplit, one of the most common techniques for time-series cross-validation. The key idea is to split the data into training and test sets while preserving the temporal order of the observations. The steps: divide the dataset into a series of train-test splits, ensuring that the training set always consists of the data up to the test set and that the test set is always a future block of observations.
For example, if we have data points 1 to 100: - Split 1: Train on 1–70, test on 71–80. - Split 2:
Train on 1–80, test on 81–90. - Split 3: Train on 1–90, test on 91–100. This way, we train on increasing amounts of data, and each model is tested on a future block, avoiding leakage of future information into the training process (a code sketch follows below).
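A minimal sketch with scikit-learn's `TimeSeriesSplit` (assuming scikit-learn 0.24 or later for the `test_size` argument; the 60-point monthly series is synthetic) reproduces the expanding-window splits described above:

```python
# Hedged sketch: expanding-window cross-validation with TimeSeriesSplit.
# The sales series is a synthetic stand-in for five years of monthly data.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

sales = np.arange(60)  # 5 years x 12 months of placeholder values
tscv = TimeSeriesSplit(n_splits=5, test_size=6)

for i, (train_idx, test_idx) in enumerate(tscv.split(sales), start=1):
    print(f"Split {i}: train on 0-{train_idx[-1]}, test on {test_idx[0]}-{test_idx[-1]}")
```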
Walk-forward validation works in a similar way, retraining the model at each step on all data up to the next test window; with either approach, the essential point is that no information from the future leaks into the training process.
(5) Hyperparameter Tuning: You are training a support vector machine (SVM) for
image classification, but your model is underperforming. Describe how you would
tune the kernel and other hyperparameters to improve its accuracy.
To improve the accuracy of an underperforming support vector machine (SVM) for image classification, hyperparameter tuning is crucial. Here is how we can approach tuning the kernel and the other key hyperparameters. Kernel selection: the choice of kernel determines the transformation of the feature space. The main kernels in SVM are: the linear kernel, which is best when the data is linearly separable and is faster and less computationally expensive; the polynomial kernel, useful when the data is not linearly separable but can be separated with a polynomial decision boundary; the radial basis function (RBF) kernel, the most common choice for non-linear problems because it maps the data into a higher-dimensional space where it can be separated with a hyperplane; and the sigmoid kernel, which resembles a neural network activation function but is less commonly used. Tuning the kernel: start with the RBF kernel, as it is typically a good default, and use cross-validation to compare the performance of different kernels; if the RBF kernel underperforms, experiment with the others in case a different transformation suits the data better. If we use an RBF kernel, the following hyperparameters should be tuned. C (regularization parameter) controls the trade-off between a smooth decision boundary and classifying training points correctly: a small value allows more margin violations, while a large value places more emphasis on correct classification. A practical approach is to start with a small value such as 0.1 and gradually increase it (e.g., 1, 10, 100) to check whether performance improves; larger values can lead to overfitting, while smaller values can result in underfitting. Gamma defines the influence of a single training example: a small gamma gives each example a far-reaching influence, resulting in a smoother decision boundary, while a large gamma restricts influence to nearby points and can cause overfitting by capturing noise in the data. A good approach is to start with a default value such as 'scale' or 'auto' and experiment with values like 0.01, 0.1, 1, and 10 to find the optimum.
Other hyperparameters: degree, if you choose a polynomial kernel, controls the degree of the polynomial; higher degrees can capture more complex decision boundaries but might lead to overfitting. A reasonable approach is to start with degree 3 (the default) and try values from 1 to 5, checking performance on the validation set. Moving to class weight: if the classes are imbalanced, the SVM might prioritize the majority class, so setting `class_weight` to `'balanced'`, or manually setting per-class weights based on the class distribution in your dataset, can improve performance for minority classes. Grid search can be used to explore combinations of hyperparameters systematically, and random search is a useful alternative that can sometimes find good settings faster, especially when the search space is large. Cross-validation: we can use k-fold cross-validation to evaluate the SVM under different hyperparameter settings; this helps prevent overfitting and ensures the model generalizes well to unseen data. Feature scaling: ensure the data is preprocessed properly before training, because SVMs are sensitive to the scale of features, so standardizing or normalizing the image pixel values is essential for optimal performance. Lastly, for evaluation, after tuning the hyperparameters, evaluate the model on a separate test set to verify that the improvements generalize to unseen data. By systematically tuning the kernel type, regularization parameter, gamma, and the other SVM hyperparameters, we can significantly improve the model's accuracy on image classification tasks. A hedged tuning sketch is given below.
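As a hedged sketch of the overall tuning procedure, the example below runs a grid search over kernel, C, and gamma with feature scaling and 5-fold cross-validation; the digits dataset and the grid values are illustrative stand-ins for the actual image data.

```python
# Hedged sketch: grid search over SVM kernel, C, and gamma with scaling
# and k-fold cross-validation. Dataset and grid values are illustrative.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scale", StandardScaler()),            # SVMs are sensitive to feature scale
    ("svm", SVC(class_weight="balanced")),  # guard against class imbalance
])

param_grid = {
    "svm__kernel": ["rbf", "linear"],
    "svm__C": [0.1, 1, 10, 100],
    "svm__gamma": ["scale", 0.01, 0.1],
}

search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print("test accuracy:", search.best_estimator_.score(X_test, y_test))
```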