Data Science Project Report
SUBMITTED BY
RUBI PATHAK [B1892R14300034]
MCA Department
AKS University SATNA-485001
28 April, 2020
A K S University
Panna Road, Sherganj, Satna, Madhya Pradesh
485001
CERTIFICATE
SUBMITTED TO SUBMITTED TO
Signature:…………. Signature:………….
Name:……………... Name:……………...
(Internal Examiner) (External Examiner)
Date: Date:
I hereby declare that the internship major project report entitled "INTRODUCING DATA
SCIENCE" is an authentic record of my own work, carried out for the award of the degree of
MCA (LE) 6th Semester at AKS University, Satna (M.P.), under the guidance of Prof. Vijay
Vishwakarma.
Every project is a scheduled, guided, and coordinated team effort aimed at achieving
common goals, and those goals cannot be achieved without the guidance of a guide.
I hereby express my heartfelt thanks to the people who helped us and spent their
valuable time and effort to guide us through the project work. Without their guidance and
cooperation, the project work would not have been successful.
I take this opportunity to express my honest thanks to everyone who encouraged us to
prepare this project work. I would like to give special thanks to AKS University, Satna,
for its vigilant advice.
I would like to thank our college, our teachers, and our lab-course teachers. In
particular, I give special thanks to our HOD, Prof. Akhilesh A. Waoo, and to Prof. Vijay
Vishwakarma, for their vigilant advice.
I have no words to express my thanks to our project guide, Mr. Deepak Vishwakarma, for
his guidance, cooperation, and interest in completing this project work.
Finally, I express deep gratitude to my family for their moral support and
encouragement. I am also thankful to my friends for their cooperation and helping nature
during this period.
Student Name
Rubi Pathak
Table of Contents
1. Abstract
2. Objective
3. Target Audience
4. Tools & Technologies
5. System Architecture & Design
6. Data Exploration
7. Dataset Description
   7.1. Identifying Missing Values Count
   7.2. Identifying Unique Values of Each Feature
   7.3. Categorical Objects
8. Data Cleaning
9. Feature Engineering
   9.1. Creating Item Visibility Mean Ratio
   9.2. Creating Generalized Item Type
   9.3. Creating Outlet Store Age
   9.4. Modifying Item Fat Content
   9.5. Converting Categorical Features Into Numerical
2. Objective:
Build a predictive model to estimate the future sales of each product at a particular
store.
3. Target Audience:
4. Tools & Technologies:
Below are the technologies we used and where we used them in our project:
• Coding Platform: Python 3, Jupyter Notebook.
• Pandas: We used this library to load the CSV data and for data munging and
preparation. [1]
• scikit-learn: A machine learning library; we used it for regression analysis of the
sales data.
• Matplotlib: A 2D plotting library that produces good-quality images for downloading
and visualizing data. In this project, we used it to plot graphs in the data
exploration. [2]
• Seaborn: A data visualization library built on top of Matplotlib that provides a
high-level, high-quality interface for plotting attractive statistical graphs. We used
it in the data exploration section. [3]
• NumPy: The core library for scientific computing in Python, which also provides
high-performance multi-dimensional array objects. We used it to compute absolute
values, means, standard deviations, etc. [4]
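As a minimal sketch of how these libraries fit together, the snippet below loads a small CSV with pandas; the column names follow the report's dataset schema, but the rows here are invented for illustration.

```python
import io
import pandas as pd

# Stand-in for the project's training CSV; values are made up.
csv_data = io.StringIO(
    "Item_Identifier,Item_Weight,Item_MRP,Item_Outlet_Sales\n"
    "FDA15,9.3,249.81,3735.14\n"
    "DRC01,5.92,48.27,443.42\n"
)
df = pd.read_csv(csv_data)
print(df.shape)                 # rows x columns
print(df["Item_MRP"].mean())    # quick NumPy-backed statistic
```

In the actual project the same call would point at the dataset file, e.g. `pd.read_csv("Train.csv")`.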
This tells us how many stores and how many unique items are in the whole dataset.
Item_Visibility and Item_MRP show more variation in the dataset.
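Counting the unique values of each feature is a one-liner in pandas; a sketch with a hypothetical mini-dataset (the column names match the report, the rows do not):

```python
import pandas as pd

df = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDA15", "NCD19"],
    "Outlet_Identifier": ["OUT049", "OUT018", "OUT010", "OUT049"],
})
unique_counts = df.nunique()          # distinct values per column
print(unique_counts["Outlet_Identifier"])  # number of stores
print(unique_counts["Item_Identifier"])    # number of unique items
```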
7.3 Categorical Objects:
Here we find the categorical and numerical features, which will help in data
cleaning. Below are the categorical features in our dataset:
1. Item_Fat_Content
2. Item_Identifier
3. Outlet_Identifier
4. Item_Type
5. Outlet_Location_Type
6. Outlet_Size
7. Outlet_Type
8. Source
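One way to separate the categorical features from the numerical ones in pandas is to filter on dtype; a sketch with a toy frame (columns from the report's schema, values invented):

```python
import pandas as pd

df = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01"],
    "Item_Fat_Content": ["Low Fat", "Regular"],
    "Item_Weight": [9.3, 5.92],
})

# Non-numeric columns are the categorical candidates.
categorical = list(df.select_dtypes(exclude="number").columns)
print(categorical)
```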
8. Data Cleaning:
In the data exploration section, we found that two features have missing values:
Item_Weight and Outlet_Size. To make the right prediction, we filled missing
Item_Weight values with the average weight of the particular item, and Outlet_Size with
the mode of Outlet_Size for the corresponding outlet. There are no missing values in
the dataset after this step.
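A minimal sketch of this imputation with pandas `groupby().transform()`, assuming a toy frame (the grouping key for Outlet_Size is taken here as Outlet_Type; the report does not name it explicitly):

```python
import pandas as pd

df = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01"],
    "Item_Weight": [9.3, None, 5.92],
    "Outlet_Type": ["Supermarket Type1", "Supermarket Type1", "Grocery Store"],
    "Outlet_Size": ["Medium", None, "Small"],
})

# Fill Item_Weight with the average weight of that particular item.
df["Item_Weight"] = df.groupby("Item_Identifier")["Item_Weight"].transform(
    lambda s: s.fillna(s.mean())
)

# Fill Outlet_Size with the mode for the corresponding outlet type.
df["Outlet_Size"] = df.groupby("Outlet_Type")["Outlet_Size"].transform(
    lambda s: s.fillna(s.mode().iloc[0])
)

print(df.isnull().sum().sum())   # no missing values remain
```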
9. Feature Engineering:
We observed some features with abnormalities in the variation of the data in the
previous section. In this section we remove that kind of data and create new
features with the help of the existing ones.
In the data exploration section we saw that there are 16 unique item types in
the dataset, which helps in analyzing the data. The first two characters of the
Item_Identifier follow a pattern, such as FD, DR, or NC, which correspond to Food,
Drinks, and Non-Consumable. We created a new category named
"item_type_generalize":
• Food: 10201
• Non-Consumable: 2686
• Drinks: 1317
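The prefix mapping described above can be sketched in one pandas expression; the identifiers below are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"Item_Identifier": ["FDA15", "DRC01", "NCD19"]})

# The first two characters encode the broad category: FD, DR, NC.
df["item_type_generalize"] = df["Item_Identifier"].str[:2].map(
    {"FD": "Food", "DR": "Drinks", "NC": "Non-Consumable"}
)
print(df["item_type_generalize"].tolist())
```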
The sales of an item also depend on the age of the outlet store, because there is
a good chance of people going to a new store rather than an old one if both are
nearby, so we create a new feature, "Outlet_year".
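A sketch of deriving this feature, assuming the dataset's Outlet_Establishment_Year column and an illustrative reference year of 2013 (the report does not state which year it subtracted from):

```python
import pandas as pd

# Establishment years here are invented for illustration.
df = pd.DataFrame({"Outlet_Establishment_Year": [1999, 2009, 1985]})

# Store age = reference year minus establishment year.
df["Outlet_year"] = 2013 - df["Outlet_Establishment_Year"]
print(df["Outlet_year"].tolist())
```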
Here we find the distribution of the sales data of every item type at each store.
This gives broad information about how sales are distributed across item types.
Each violin displays the five-number summary of the data (minimum, first quartile,
median, third quartile, and maximum). Below are the five-number distributions of
each item type for each outlet.
Figure 8. Violin plot of item type of OUT049 store.
Figure 9. Violin plot of item type of OUT018 store.
Figure 10. Violin plot of item type of OUT010 store.
Figure 11. Violin plot of item type of OUT013 store.
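The five-number summary each violin encodes can be computed directly with pandas (the plots themselves would use `seaborn.violinplot`); a sketch with invented sales figures:

```python
import pandas as pd

# Made-up sales for two item types at one store.
df = pd.DataFrame({
    "Item_Type": ["Dairy", "Dairy", "Dairy", "Snack Foods", "Snack Foods"],
    "Item_Outlet_Sales": [120.0, 240.0, 480.0, 90.0, 300.0],
})

# min / Q1 / median / Q3 / max per item type: what each violin summarizes.
summary = df.groupby("Item_Type")["Item_Outlet_Sales"].describe()[
    ["min", "25%", "50%", "75%", "max"]
]
print(summary)
```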
This is an analysis plot of sales vs. price; here we can observe how sales behave
with respect to price. We can infer that most sales of items occur in the $150 to
$200 price range. We can also see a few outliers: for certain types of items on
certain days there are unusually high sales of a specific item, which we treat as
outliers/noise. We can replace those values with the mean to get a more accurate
model.
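The report only says the outliers are replaced with the mean; one common way to decide which points count as outliers is the IQR fence, sketched here on invented sales values:

```python
import pandas as pd

# Invented daily sales for one item, with a single spike.
sales = pd.Series([150.0, 160.0, 170.0, 180.0, 5000.0])

# Classic IQR rule: anything above Q3 + 1.5*IQR is treated as an outlier.
q1, q3 = sales.quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
is_outlier = sales > upper

# Replace outliers with the mean of the remaining values.
sales[is_outlier] = sales[~is_outlier].mean()
print(sales.tolist())
```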
10.5. Analysis Of Sales vs Item Type:
Here we analyze the sales of each item across all the stores with respect to item
type, which gives information about how sales are happening and which item types
sell more. For example, the Dairy item type has sales up to 200, whereas Fruits
and Vegetables has more than 400. The visualization shows the least sales for
seafood, which tells us that seafood items are in lower demand at the stores. This
gives a good understanding of which items sell more at each store.
10.6. Analysis Of Outlet Sales vs MRP:
Here we visualize the sales with respect to the MRP; we can infer some valuable
information from the plot below. Having discussed the sales of each store in a
different color above, we can now see the sales of all the different outlets in a
single plot, and clearly identify which stores perform more sales and which
perform less. Outlet27 performs the most sales, while Outlet10 and Outlet19 have
very low sales.
10.7. Analysis Of Outlet Sales vs Item Weight:
In this plot, we analyze the sales of each store with respect to the item weight
in that particular store. These visualizations clearly help in interpreting how
sales happen in the store and which weight ranges have more sales.
11. Model Building:
Boosting is used with short decision trees. After the first tree is created, its
performance on each training instance is used to weight how much attention the
next tree should pay to each training instance: training data that is hard to
predict is given more weight, whereas instances that are easy to predict are given
less weight.
Each instance in the training dataset is weighted. The initial weight is set to
Weight(xi) = 1/n
where xi is the i'th training instance and n is the number of training instances.
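The initial uniform weighting can be set up in one NumPy line; `n` below is a toy instance count:

```python
import numpy as np

n = 8                              # number of training instances (toy value)
weights = np.full(n, 1.0 / n)      # Weight(xi) = 1/n for every instance
print(weights.sum())               # weights form a distribution summing to 1
```

Boosting then renormalizes these weights after each tree, raising them for misclassified (hard) instances and lowering them for easy ones.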
Model Report :
RMSE : 1147
CV Score : Mean - 1159| STD - 40.96| Min - 1085| Max - 1230
We observed that the Linear Regression algorithm performs best, with the lowest
RMSE value compared to the other models.
Model Result:
RMSE: 1128
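A minimal sketch of how an RMSE and a cross-validation score like those reported could be computed with scikit-learn, using synthetic data in place of the engineered sales features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the real feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.5]) + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
rmse = mean_squared_error(y, model.predict(X)) ** 0.5

# Per-fold RMSE from 5-fold cross-validation (scores are negated MSE).
cv = np.sqrt(-cross_val_score(model, X, y, cv=5,
                              scoring="neg_mean_squared_error"))
print("RMSE:", rmse)
print("CV mean:", cv.mean(), "| std:", cv.std(),
      "| min:", cv.min(), "| max:", cv.max())
```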
14. References:
[1] Pandas (2018) Pandas Library. [Online]. URL: https://pandas.pydata.org/
[5] Kaggle. (2018) Data Exploration and Price Prediction, House Sales. [Online].
URL: https://www.kaggle.com/fg1983/data-exploration-and-price-prediction-house-sales
[6] Columbia University, A Data-Cleaning Tool for Building Better Prediction Models. [Online].
URL: https://datascience.columbia.edu/data-cleaning-tool-building-better-prediction-models