Assignment 1
theoretical derivations.
Problem 1: Data Exploration and Preprocessing for E-commerce Customer Behavior Analysis.
Imagine you are working for an e-commerce company that wants to understand its customer behavior better.
You are given a dataset (you can find a sample dataset online, e.g., on Kaggle or the UCI Machine Learning
Repository; search for "e-commerce customer behavior dataset" or "online retail dataset"). Such a dataset typically
contains information about customer interactions on the website, such as:
● Event Type: Type of interaction (e.g., view product, add to cart, purchase, search).
Tasks:
Dataset Selection & Justification: Choose an e-commerce customer behavior dataset. Briefly describe the
dataset you selected and explain why you chose it (e.g., size, features, relevance to the problem). Provide a link
to the dataset if possible.
Data Loading and Inspection: Load the dataset into a suitable environment (like Python with Pandas). Display
the first few rows and get basic information about the data types, missing values, etc.
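As a rough illustration of the loading and inspection step, the sketch below reads a tiny inline CSV with Pandas. The column names (customer_id, event_type, price, event_time) and the four rows are made up for this example; a real dataset will have its own columns and would be read from a file path instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for a downloaded CSV file; real column names
# depend on the dataset you choose.
csv_text = """customer_id,event_type,price,event_time
1,view,19.99,2023-01-05 10:00:00
1,add_to_cart,19.99,2023-01-05 10:02:00
2,purchase,,2023-01-06 14:30:00
3,search,5.50,2023-01-07 09:15:00
"""

# With a real file this would be pd.read_csv("path/to/file.csv", ...)
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["event_time"])

print(df.head())           # first few rows
df.info()                  # column dtypes and non-null counts
missing = df.isna().sum()  # number of missing values per column
print(missing)
```

Here df.info() and df.isna().sum() together cover the "data types, missing values" part of the task; df.describe() is a common addition for numeric columns.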
Feature Analysis: Choose at least three features from the dataset that you think are important for understanding
customer behavior. For each chosen feature:
Describe the feature and its data type.
Calculate descriptive statistics (mean, median, mode, standard deviation, range, etc., as appropriate for the data
type).
Create at least one meaningful visualization (e.g., histogram, bar chart, box plot, scatter plot – choose
appropriate visualizations based on the feature type and your analysis goal) to understand the distribution or
patterns of the feature. Explain what insights you gain from the visualization.
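One possible way to approach the statistics and visualization for a single numeric feature is sketched below. The feature name order_value and its seven values are invented for illustration, and the histogram is written to an in-memory buffer only so the sketch runs without a display; in a notebook you would typically call plt.show() or save to a file.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numeric feature: one order value per customer.
order_value = pd.Series([12.0, 15.0, 15.0, 20.0, 22.0, 30.0, 95.0],
                        name="order_value")

stats = {
    "mean": order_value.mean(),
    "median": order_value.median(),
    "mode": order_value.mode().iloc[0],  # first mode if there are ties
    "std": order_value.std(),            # sample standard deviation (ddof=1)
    "range": order_value.max() - order_value.min(),
}
print(stats)

# Histogram of the feature's distribution.
fig, ax = plt.subplots()
ax.hist(order_value, bins=5)
ax.set_xlabel("order value")
ax.set_ylabel("number of customers")
buf = io.BytesIO()
fig.savefig(buf, format="png")
plt.close(fig)
```

In this toy sample the single large value (95) pulls the mean well above the median, which is the kind of observation the "insights" part of the task asks for. For a categorical feature such as event type, value_counts() and a bar chart would be the more natural choice.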
Data Similarity (Optional): If applicable to your chosen dataset and features, think about how you might
measure the similarity between customers or products based on the features you analyzed. Briefly discuss
potential similarity measures (like Euclidean distance, Cosine similarity, etc.) and why they might be relevant in
this context.
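To make the two similarity measures named above concrete, here is a small sketch using NumPy. The customer vectors (number of orders, average order value) are hypothetical, and the helper names euclidean and cosine_similarity are defined here for illustration only.

```python
import numpy as np

# Hypothetical customer vectors: [number of orders, average order value]
a = np.array([2.0, 50.0])
b = np.array([4.0, 100.0])   # same proportions as a, at twice the scale
c = np.array([10.0, 10.0])   # a different mix of frequency and value


def euclidean(u, v):
    """Straight-line distance between two feature vectors."""
    return float(np.linalg.norm(u - v))


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (1 = same direction)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


print("Euclidean a-b:", euclidean(a, b))
print("Cosine    a-b:", cosine_similarity(a, b))
print("Cosine    a-c:", cosine_similarity(a, c))
```

Customers a and b are far apart in Euclidean terms but point in the same direction, so their cosine similarity is essentially 1. That contrast is one reason cosine similarity is often considered when the pattern of behavior matters more than its overall volume.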
Data Preprocessing Plan: Based on your data exploration, identify at least two data preprocessing steps that you
think would be necessary or beneficial before applying machine learning algorithms to this dataset. Justify why
these preprocessing steps are needed (e.g., handling missing values, dealing with categorical data, scaling
numerical features, etc.). Briefly describe how you would implement these preprocessing steps.
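The three preprocessing examples mentioned above (missing values, categorical data, scaling) could be combined roughly as follows with Scikit-learn. This is only a sketch under assumptions: the columns price and event_type and their values are hypothetical, and median imputation is just one of several reasonable choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with one missing price and one categorical column.
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, 20.0],
    "event_type": ["view", "purchase", "view", "search"],
})

# Numeric column: fill missing values with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical column: one-hot encode each category.
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["price"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["event_type"]),
])

X = preprocess.fit_transform(df)
if hasattr(X, "toarray"):  # the result may be a sparse matrix
    X = X.toarray()
print(X.shape)  # one scaled numeric column plus one column per category
```

Wrapping the steps in a ColumnTransformer keeps the fitted imputation and scaling parameters together, so the same transformation can later be applied to new data with preprocess.transform.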
Continuing with the e-commerce customer behavior scenario from Problem 1, let's focus on predicting a
numerical value related to customer purchase behavior. Let's assume your dataset includes a feature that
represents the total amount spent by each customer over a certain period (or a similar numerical target variable
related to purchase value).
Tasks:
Target Variable Selection: Clearly identify and describe the target variable you will be trying to predict. Explain
why this variable is a relevant indicator of customer purchase behavior.
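If the dataset is an event log rather than one row per customer, a target such as total amount spent has to be derived first. The sketch below shows one way this might be done with Pandas; the column names and values are hypothetical, and it assumes purchases are marked by event_type == "purchase".

```python
import pandas as pd

# Hypothetical event log: one row per interaction.
events = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "event_type": ["purchase", "view", "purchase", "purchase", "view"],
    "price": [20.0, 20.0, 5.0, 7.5, 9.0],
})

# Keep only purchase events, then sum the price per customer.
purchases = events[events["event_type"] == "purchase"]
total_spent = (
    purchases.groupby("customer_id")["price"]
    .sum()
    # Customers with no purchases get a total of 0 instead of being dropped.
    .reindex(events["customer_id"].unique(), fill_value=0.0)
)
print(total_spent)
```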
Regression Model Selection and Justification: Choose one regression algorithm from the list covered in the
syllabus (Linear Regression or Polynomial Regression; you may also consider Decision Tree Regression if you want to
explore non-linear relationships). Justify your choice of algorithm based on the characteristics of your dataset
and the problem.
Model Training (Basic): If you are comfortable with coding, you can perform a basic implementation using
Python and Scikit-learn. Train your chosen regression model on a portion of your data. If you are not
comfortable with coding yet, you can describe conceptually how you would train the model and what input
features you would use.
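A basic training run with Scikit-learn might look like the sketch below. The data is synthetic and generated only for illustration: the features visits and cart_adds, the coefficients used to build total_spent, and the 80/20 split are all assumptions, not properties of any real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, hypothetical data: two behavioral features per customer.
rng = np.random.default_rng(0)
n = 200
visits = rng.integers(1, 30, size=n)
cart_adds = rng.integers(0, 15, size=n)
# Made-up linear relationship plus noise, used as the target.
total_spent = 5.0 * visits + 12.0 * cart_adds + rng.normal(0, 10, size=n)

X = np.column_stack([visits, cart_adds])
y = total_spent

# Hold out 20% of the customers for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```

Swapping LinearRegression for DecisionTreeRegressor (from sklearn.tree) leaves the rest of the code unchanged, which makes it easy to compare the candidate algorithms.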
Performance Metrics: Choose at least two appropriate performance metrics for evaluating your regression
model (e.g., Mean Squared Error, Root Mean Squared Error, Mean Absolute Error, R-squared). Explain why
these metrics are suitable for evaluating regression performance.
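For reference, the four metrics named above have the following standard definitions, where y_i is the actual value, \hat{y}_i the predicted value, \bar{y} the mean of the actual values, and n the number of samples:

```latex
\mathrm{MSE}  = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
\qquad
\mathrm{MAE}  = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
               {\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
```

RMSE and MAE are expressed in the same units as the target (e.g., currency), while MSE is in squared units and penalizes large errors more heavily; R-squared is unitless.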
Performance Interpretation: If you implemented and trained a model, calculate the chosen performance metrics
on a held-out test set (if you have no separate test set for this assignment, use the training set, but note that
training-set metrics tend to overstate performance). Interpret the metrics you obtained. What do they tell you about the model's
ability to predict customer purchase behavior? If you took the conceptual approach, describe how you would
evaluate the model's performance.
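The calculation itself can be sketched with Scikit-learn's metric functions. The four actual and predicted values below are invented and kept small so the results can be checked by hand; in practice y_true and y_pred would be the test-set targets and the model's predictions on the test set.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted total spend for four test customers.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 230.0])

mse = mean_squared_error(y_true, y_pred)   # (100 + 100 + 100 + 400) / 4
rmse = float(np.sqrt(mse))                 # back in the target's units
mae = mean_absolute_error(y_true, y_pred)  # (10 + 10 + 10 + 20) / 4
r2 = r2_score(y_true, y_pred)              # 1 - 700 / 12500

print(f"MSE={mse:.1f}  RMSE={rmse:.2f}  MAE={mae:.1f}  R2={r2:.3f}")
```

Read this way, an MAE of 12.5 means predictions are off by 12.5 currency units on average, and the RMSE being somewhat larger than the MAE reflects the one bigger error (20) in the sample.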
Model Limitations: Discuss at least one limitation of the regression model you chose or the approach you took
for predicting customer purchase behavior. Consider factors like data quality, model assumptions, or the
complexity of real-world customer behavior.
Now, let's use unsupervised learning to segment customers into different groups based on their behavior. Using
the same (or a similar) e-commerce customer behavior dataset, the goal is to identify distinct customer segments
that the company can target with different marketing strategies or personalized experiences.
Tasks:
Feature Selection for Clustering: Choose at least two features from your dataset that you believe are relevant
for clustering customers into meaningful segments. Justify your feature selection. These features should ideally
capture different aspects of customer behavior (e.g., purchase frequency, average order value, product
categories purchased, website interaction patterns, etc.).
Clustering Algorithm Selection and Justification: Choose one clustering algorithm from the syllabus (k-Means
or Hierarchical Clustering). Justify your choice of algorithm for this customer segmentation task. Consider
factors like the expected shape of clusters, scalability, and interpretability of results.
Data Preparation for Clustering: Describe how you would prepare your chosen features for clustering (e.g.,
scaling numerical features and, if needed, encoding categorical features). As before, you can either implement basic preprocessing
or just describe the steps conceptually.
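To illustrate why scaling matters for distance-based clustering, the sketch below compares distances between three hypothetical customers before and after standardization. The features (purchase frequency, average order value) and their values are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [purchase frequency, average order value]
X = np.array([
    [2.0, 400.0],   # customer 0
    [20.0, 410.0],  # customer 1: much more frequent, similar order value
    [3.0, 50.0],    # customer 2: similar frequency, much lower order value
])

# Raw distances are dominated by the feature with the larger numeric range.
raw_d01 = np.linalg.norm(X[0] - X[1])
raw_d02 = np.linalg.norm(X[0] - X[2])

# After standardization each feature has mean 0 and unit variance.
X_scaled = StandardScaler().fit_transform(X)
scaled_d01 = np.linalg.norm(X_scaled[0] - X_scaled[1])
scaled_d02 = np.linalg.norm(X_scaled[0] - X_scaled[2])

print("raw:   ", raw_d01, raw_d02)
print("scaled:", scaled_d01, scaled_d02)
```

On the raw values the difference in order value swamps the difference in frequency; after scaling, both features contribute on a comparable footing to the distance that k-Means or Hierarchical Clustering would use.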
Clustering Execution (Basic): If you are comfortable with coding, implement your chosen clustering algorithm
(e.g., using k-Means in Scikit-learn). Determine an appropriate number of clusters (you can use methods like
the Elbow method for k-Means, or decide based on business intuition). If you are not coding, describe
conceptually how you would apply the clustering algorithm.
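A minimal sketch of k-Means with the Elbow method is shown below. The data is synthetic: three groups of 50 points are generated around made-up centers, standing in for already-scaled customer features, and the range k = 1..6 is an arbitrary choice for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for two scaled customer features, drawn around
# three hypothetical group centers.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(0, 0.5, size=(50, 2)) for c in centers])

# Elbow method: fit k-Means for several k and record the inertia
# (within-cluster sum of squared distances to the nearest centroid).
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
    print(k, round(km.inertia_, 1))

# Pick k where the curve stops dropping sharply, then assign clusters.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```

Plotting inertias against k gives the usual elbow curve; with data generated like this the drop flattens after k = 3. On real customer data the elbow is often less clear, which is where business intuition comes in.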
Cluster Interpretation: After clustering, analyze the characteristics of each cluster. Calculate the mean or
median values of your chosen features for each cluster. Describe the profile of each customer segment you have
identified. For example, you might find segments like "High-Value Spenders," "Frequent Visitors,"
"Category-Specific Buyers," etc. Explain the business implications of these customer segments – how could the
e-commerce company use this segmentation to improve its strategies?
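Once each customer has a cluster label, the per-cluster profile can be computed with a Pandas groupby, roughly as follows. The six customers, the feature names, and the cluster assignments are hypothetical.

```python
import pandas as pd

# Hypothetical customers with cluster labels already assigned.
df = pd.DataFrame({
    "purchase_frequency": [1, 2, 1, 15, 18, 20],
    "avg_order_value": [200.0, 220.0, 210.0, 20.0, 25.0, 30.0],
    "cluster": [0, 0, 0, 1, 1, 1],
})

features = ["purchase_frequency", "avg_order_value"]

# Mean and median of each feature within each cluster.
profile_mean = df.groupby("cluster")[features].mean()
profile_median = df.groupby("cluster")[features].median()
# Number of customers in each cluster.
sizes = df["cluster"].value_counts().sort_index()

print(profile_mean)
print(profile_median)
print(sizes)
```

In this toy example cluster 0 buys rarely but spends a lot per order, while cluster 1 buys often in small amounts; profiles like these are what the segment names and business implications are then built on.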
Algorithm Limitations: Discuss at least one limitation of the clustering algorithm you chose or the approach you
took for customer segmentation. Consider factors like the sensitivity of the algorithm to initial parameters
(k-Means), the computational complexity (Hierarchical), or the assumptions made by the algorithm about
cluster shapes.