Notes_Uber_Data_analysis_project

The document contains a Python script that processes an Uber dataset using pandas, numpy, and seaborn for data analysis and visualization. It includes steps for data loading, preprocessing (such as handling missing values), and visualizing the data through count plots and line plots. The dataset consists of 1156 entries with various attributes including start and end dates, category, miles traveled, and purpose of the trips.

11/21/24, 4:44 PM Untitled

In [2]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]: dataset = pd.read_csv("UberDataset.csv")

In [6]: dataset

Out[6]:
      START_DATE        END_DATE          CATEGORY  START             STOP              MILES    PURPOSE
0     01-01-2016 21:11  01-01-2016 21:17  Business  Fort Pierce       Fort Pierce       5.1      Meal/Entertain
1     01-02-2016 01:25  01-02-2016 01:37  Business  Fort Pierce       Fort Pierce       5.0      NaN
2     01-02-2016 20:25  01-02-2016 20:38  Business  Fort Pierce       Fort Pierce       4.8      Errand/Supplies
3     01-05-2016 17:31  01-05-2016 17:45  Business  Fort Pierce       Fort Pierce       4.7      Meeting
4     01-06-2016 14:42  01-06-2016 15:49  Business  Fort Pierce       West Palm Beach   63.7     Customer Visit
...   ...               ...               ...       ...               ...               ...      ...
1151  12/31/2016 13:24  12/31/2016 13:42  Business  Kar?chi           Unknown Location  3.9      Temporary S
1152  12/31/2016 15:03  12/31/2016 15:38  Business  Unknown Location  Unknown Location  16.2     Meeting
1153  12/31/2016 21:32  12/31/2016 21:50  Business  Katunayake        Gampaha           6.4      Temporary S
1154  12/31/2016 22:08  12/31/2016 23:51  Business  Gampaha           Ilukwatta         48.2     Temporary S
1155  Totals            NaN               NaN       NaN               NaN               12204.7  NaN

1156 rows × 7 columns

In [8]: dataset.shape

Out[8]: (1156, 7)

In [10]: dataset.info()

file:///C:/Users/swati/Downloads/Untitled.html 1/11

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1156 entries, 0 to 1155
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 START_DATE 1156 non-null object
1 END_DATE 1155 non-null object
2 CATEGORY 1155 non-null object
3 START 1155 non-null object
4 STOP 1155 non-null object
5 MILES 1156 non-null float64
6 PURPOSE 653 non-null object
dtypes: float64(1), object(6)
memory usage: 63.3+ KB

Data Preprocessing
In [15]: dataset['PURPOSE'].fillna("NOT", inplace = True)

C:\Users\swati\AppData\Local\Temp\ipykernel_31136\4083644620.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.

  dataset['PURPOSE'].fillna("NOT", inplace = True)
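The FutureWarning above recommends assigning the result back to the column rather than calling an inplace method on a column selection. A minimal sketch of that form, using a small toy frame (the `toy` name and its values are illustrative assumptions, not the Uber data):

```python
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only).
toy = pd.DataFrame({"PURPOSE": ["Meeting", None, "Customer Visit"]})

# Assign the filled column back instead of using fillna(..., inplace=True)
# on toy["PURPOSE"]; this is the pattern the warning text suggests.
toy["PURPOSE"] = toy["PURPOSE"].fillna("NOT")

print(toy["PURPOSE"].tolist())
```

The same assignment pattern should apply to `dataset['PURPOSE']` and does not emit the chained-assignment warning.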

In [17]: dataset.head()

Out[17]:
   START_DATE        END_DATE          CATEGORY  START        STOP             MILES  PURPOSE
0  01-01-2016 21:11  01-01-2016 21:17  Business  Fort Pierce  Fort Pierce      5.1    Meal/Entertain
1  01-02-2016 01:25  01-02-2016 01:37  Business  Fort Pierce  Fort Pierce      5.0    NOT
2  01-02-2016 20:25  01-02-2016 20:38  Business  Fort Pierce  Fort Pierce      4.8    Errand/Supplies
3  01-05-2016 17:31  01-05-2016 17:45  Business  Fort Pierce  Fort Pierce      4.7    Meeting
4  01-06-2016 14:42  01-06-2016 15:49  Business  Fort Pierce  West Palm Beach  63.7   Customer Visit

In [19]: dataset['START_DATE'] = pd.to_datetime(dataset['START_DATE'], errors = 'coerce')

dataset['END_DATE'] = pd.to_datetime(dataset['END_DATE'], errors = 'coerce')


In [21]: dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1156 entries, 0 to 1155
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 START_DATE 421 non-null datetime64[ns]
1 END_DATE 420 non-null datetime64[ns]
2 CATEGORY 1155 non-null object
3 START 1155 non-null object
4 STOP 1155 non-null object
5 MILES 1156 non-null float64
6 PURPOSE 1156 non-null object
dtypes: datetime64[ns](2), float64(1), object(4)
memory usage: 63.3+ KB
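The info() output above shows that only 421 of the 1156 START_DATE values survived the conversion. The raw rows use two layouts (`01-01-2016 21:11` and `12/31/2016 13:24`), so one plausible cause is that the parser settled on the first layout and coerced the rest to NaT. A hedged sketch of one way to handle that, assuming both layouts are month-first; the `raw` series below is illustrative:

```python
import pandas as pd

# Illustrative values in the two layouts seen in the dataset, plus the
# non-date "Totals" marker from the last row.
raw = pd.Series(["01-01-2016 21:11", "12/31/2016 13:24", "Totals"])

# Normalise the separator so every timestamp follows one pattern,
# then parse with an explicit format; anything else becomes NaT.
normalised = raw.str.replace("-", "/", regex=False)
parsed = pd.to_datetime(normalised, format="%m/%d/%Y %H:%M", errors="coerce")

print(parsed.notna().sum())
```

With an explicit format, only genuinely non-date entries such as the `Totals` row should end up as NaT.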

In [23]: # 'date' keeps the calendar date and 'time' keeps only the hour (0-23) of each trip start
dataset['date'] = pd.DatetimeIndex(dataset['START_DATE']).date
dataset['time'] = pd.DatetimeIndex(dataset['START_DATE']).hour

In [25]: dataset.head()

Out[25]:
   START_DATE           END_DATE             CATEGORY  START        STOP             MILES  PURPOSE          date        t
0  2016-01-01 21:11:00  2016-01-01 21:17:00  Business  Fort Pierce  Fort Pierce      5.1    Meal/Entertain   2016-01-01
1  2016-01-02 01:25:00  2016-01-02 01:37:00  Business  Fort Pierce  Fort Pierce      5.0    NOT              2016-01-02
2  2016-01-02 20:25:00  2016-01-02 20:38:00  Business  Fort Pierce  Fort Pierce      4.8    Errand/Supplies  2016-01-02
3  2016-01-05 17:31:00  2016-01-05 17:45:00  Business  Fort Pierce  Fort Pierce      4.7    Meeting          2016-01-05
4  2016-01-06 14:42:00  2016-01-06 15:49:00  Business  Fort Pierce  West Palm Beach  63.7   Customer Visit   2016-01-06

In [27]: dataset['day-night'] = pd.cut(x=dataset['time'],bins = [0,10,15,19,24],labels =
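The label list in the cell above is cut off in this export, so the labels in the sketch below are hypothetical placeholders. The sketch only illustrates how `pd.cut` treats the bin edges `[0,10,15,19,24]`: intervals are right-closed by default, so an hour of 0 falls outside `(0, 10]` and becomes NaN unless `include_lowest=True` is passed.

```python
import pandas as pd

hours = pd.Series([0, 9, 12, 17, 21])
bins = [0, 10, 15, 19, 24]
labels = ["Morning", "Afternoon", "Evening", "Night"]  # hypothetical labels

# Default behaviour: bins are (0, 10], (10, 15], (15, 19], (19, 24].
default_cut = pd.cut(x=hours, bins=bins, labels=labels)

# include_lowest=True also places hour 0 in the first bin.
inclusive_cut = pd.cut(x=hours, bins=bins, labels=labels, include_lowest=True)

print(default_cut.isna().tolist())
print(inclusive_cut.astype(str).tolist())
```

Rows left as NaN by the default setting would be discarded by a later `dropna`, which may account for some of the rows lost in this notebook.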

In [29]: dataset.head()


Out[29]:
   START_DATE           END_DATE             CATEGORY  START        STOP             MILES  PURPOSE          date        t
0  2016-01-01 21:11:00  2016-01-01 21:17:00  Business  Fort Pierce  Fort Pierce      5.1    Meal/Entertain   2016-01-01
1  2016-01-02 01:25:00  2016-01-02 01:37:00  Business  Fort Pierce  Fort Pierce      5.0    NOT              2016-01-02
2  2016-01-02 20:25:00  2016-01-02 20:38:00  Business  Fort Pierce  Fort Pierce      4.8    Errand/Supplies  2016-01-02
3  2016-01-05 17:31:00  2016-01-05 17:45:00  Business  Fort Pierce  Fort Pierce      4.7    Meeting          2016-01-05
4  2016-01-06 14:42:00  2016-01-06 15:49:00  Business  Fort Pierce  West Palm Beach  63.7   Customer Visit   2016-01-06

In [33]: dataset.dropna(inplace = True)

In [35]: dataset.shape

Out[35]: (413, 10)
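The shape above shows that `dropna` reduced the data from 1156 rows to 413. Before dropping, it can help to count the missing values per column and to restrict the drop to the columns that matter. A small sketch under those assumptions, with an illustrative `toy` frame:

```python
import pandas as pd

# Illustrative frame: the middle row mimics a timestamp that failed to parse.
toy = pd.DataFrame({
    "START_DATE": pd.to_datetime(["2016-01-01 21:11", None, "2016-01-02 01:25"]),
    "MILES": [5.1, 12204.7, 5.0],
    "PURPOSE": ["Meal/Entertain", "NOT", "NOT"],
})

# How many values are missing in each column?
missing_per_column = toy.isna().sum()

# Drop only the rows that lack a start timestamp.
cleaned = toy.dropna(subset=["START_DATE"])

print(missing_per_column)
print(cleaned.shape)
```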

Data Visualization
In [46]: plt.figure(figsize=(20,5))

plt.subplot(1,2,1)

sns.countplot(dataset['CATEGORY'])
plt.xticks(rotation =90)

plt.subplot(1,2,2)
sns.countplot(dataset['PURPOSE'])

Out[46]: <Axes: xlabel='count', ylabel='PURPOSE'>
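Out[46] reports `xlabel='count'`, which means the bars were drawn horizontally, while the `plt.xticks(rotation =90)` call suggests vertical bars may have been intended. Naming the axis explicitly removes the ambiguity. A minimal sketch with an illustrative frame (the `Personal` value is a hypothetical second category):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative data; "Personal" is a hypothetical category value.
toy = pd.DataFrame({"CATEGORY": ["Business", "Business", "Personal"]})

fig, ax = plt.subplots(figsize=(6, 4))
# Passing x= explicitly puts the categories on the x-axis (vertical bars).
sns.countplot(data=toy, x="CATEGORY", ax=ax)
ax.tick_params(axis="x", rotation=90)
```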

In [48]: sns.countplot(dataset['day-night'])


Out[48]: <Axes: xlabel='count', ylabel='day-night'>

In [50]: dataset.head()

Out[50]:
   START_DATE           END_DATE             CATEGORY  START        STOP             MILES  PURPOSE          date        t
0  2016-01-01 21:11:00  2016-01-01 21:17:00  Business  Fort Pierce  Fort Pierce      5.1    Meal/Entertain   2016-01-01
1  2016-01-02 01:25:00  2016-01-02 01:37:00  Business  Fort Pierce  Fort Pierce      5.0    NOT              2016-01-02
2  2016-01-02 20:25:00  2016-01-02 20:38:00  Business  Fort Pierce  Fort Pierce      4.8    Errand/Supplies  2016-01-02
3  2016-01-05 17:31:00  2016-01-05 17:45:00  Business  Fort Pierce  Fort Pierce      4.7    Meeting          2016-01-05
4  2016-01-06 14:42:00  2016-01-06 15:49:00  Business  Fort Pierce  West Palm Beach  63.7   Customer Visit   2016-01-06

In [52]: dataset['MONTH'] = pd.DatetimeIndex(dataset['START_DATE']).month  # Month number taken from START_DATE

month_label = {1.0: 'Jan', 2.0: 'Feb', 3.0: 'Mar', 4.0: 'April',
               5.0: 'May', 6.0: 'June', 7.0: 'July', 8.0: 'Aug',
               9.0: 'Sep', 10.0: 'Oct', 11.0: 'Nov', 12.0: 'Dec'}  # Month numbers mapped to names

dataset["MONTH"] = dataset.MONTH.map(month_label)  # Replace month numbers with their names

mon = dataset.MONTH.value_counts(sort=False)  # Count the trips in each month

In [54]: dataset.head()

Out[54]:
   START_DATE           END_DATE             CATEGORY  START        STOP             MILES  PURPOSE          date        t
0  2016-01-01 21:11:00  2016-01-01 21:17:00  Business  Fort Pierce  Fort Pierce      5.1    Meal/Entertain   2016-01-01
1  2016-01-02 01:25:00  2016-01-02 01:37:00  Business  Fort Pierce  Fort Pierce      5.0    NOT              2016-01-02
2  2016-01-02 20:25:00  2016-01-02 20:38:00  Business  Fort Pierce  Fort Pierce      4.8    Errand/Supplies  2016-01-02
3  2016-01-05 17:31:00  2016-01-05 17:45:00  Business  Fort Pierce  Fort Pierce      4.7    Meeting          2016-01-05
4  2016-01-06 14:42:00  2016-01-06 15:49:00  Business  Fort Pierce  West Palm Beach  63.7   Customer Visit   2016-01-06

In [58]: df = pd.DataFrame({
    "MONTHS": mon.values,  # Number of trips in each month
    "VALUE COUNT": dataset.groupby('MONTH', sort=False)['MILES'].max()  # Maximum MILES recorded in each month
})

p = sns.lineplot(data=df)  # Draw the line plot
p.set(xlabel="MONTHS", ylabel="VALUE COUNT")  # Set the axis labels

Out[58]: [Text(0.5, 0, 'MONTHS'), Text(0, 0.5, 'VALUE COUNT')]
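In the cell above, the column named `MONTHS` actually holds the trip count per month and `VALUE COUNT` holds the maximum MILES per month. A possible alternative, sketched here with illustrative data and hypothetical column names (`trips`, `max_miles`), computes both statistics in one grouped aggregation so the names describe the contents:

```python
import pandas as pd

# Illustrative rows; the real notebook derives MONTH from START_DATE.
toy = pd.DataFrame({
    "MONTH": ["Jan", "Jan", "Feb", "Feb", "Feb"],
    "MILES": [5.1, 63.7, 4.8, 10.0, 7.2],
})

# One row per month, in order of first appearance (sort=False),
# with descriptive column names for each statistic.
monthly = toy.groupby("MONTH", sort=False)["MILES"].agg(trips="count", max_miles="max")

print(monthly)
```

Because both columns come from the same `groupby`, they share one month index and cannot drift out of alignment.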


In [60]: dataset.head()

Out[60]:
   START_DATE           END_DATE             CATEGORY  START        STOP             MILES  PURPOSE          date        t
0  2016-01-01 21:11:00  2016-01-01 21:17:00  Business  Fort Pierce  Fort Pierce      5.1    Meal/Entertain   2016-01-01
1  2016-01-02 01:25:00  2016-01-02 01:37:00  Business  Fort Pierce  Fort Pierce      5.0    NOT              2016-01-02
2  2016-01-02 20:25:00  2016-01-02 20:38:00  Business  Fort Pierce  Fort Pierce      4.8    Errand/Supplies  2016-01-02
3  2016-01-05 17:31:00  2016-01-05 17:45:00  Business  Fort Pierce  Fort Pierce      4.7    Meeting          2016-01-05
4  2016-01-06 14:42:00  2016-01-06 15:49:00  Business  Fort Pierce  West Palm Beach  63.7   Customer Visit   2016-01-06

In [64]: dataset['DAY'] = dataset.START_DATE.dt.weekday

day_label = {
0: 'Mon', 1:'Tues', 2:'Wed', 3:'Thur',4:'Fri', 5:'Sat', 6:'Sun'}

dataset['DAY'] = dataset['DAY'].map(day_label)
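The weekday mapping above relies on `dt.weekday` numbering Monday as 0 and Sunday as 6. As a cross-check, pandas can also return weekday names directly through `dt.day_name()`. A brief sketch using three of the start timestamps shown in the table output:

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime([
    "2016-01-01 21:11", "2016-01-02 01:25", "2016-01-05 17:31",
]))

weekday_number = stamps.dt.weekday   # Monday=0 ... Sunday=6
full_names = stamps.dt.day_name()    # full English weekday names

print(weekday_number.tolist())
print(full_names.tolist())
```

The hand-written dictionary remains useful when abbreviated labels such as `Tues` or `Thur` are wanted on the plot.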


In [66]: dataset.head()

Out[66]:
   START_DATE           END_DATE             CATEGORY  START        STOP             MILES  PURPOSE          date        t
0  2016-01-01 21:11:00  2016-01-01 21:17:00  Business  Fort Pierce  Fort Pierce      5.1    Meal/Entertain   2016-01-01
1  2016-01-02 01:25:00  2016-01-02 01:37:00  Business  Fort Pierce  Fort Pierce      5.0    NOT              2016-01-02
2  2016-01-02 20:25:00  2016-01-02 20:38:00  Business  Fort Pierce  Fort Pierce      4.8    Errand/Supplies  2016-01-02
3  2016-01-05 17:31:00  2016-01-05 17:45:00  Business  Fort Pierce  Fort Pierce      4.7    Meeting          2016-01-05
4  2016-01-06 14:42:00  2016-01-06 15:49:00  Business  Fort Pierce  West Palm Beach  63.7   Customer Visit   2016-01-06

In [68]: day_counts = dataset.DAY.value_counts()  # Trips per weekday; a separate name keeps the day_label dict intact

sns.barplot(x=day_counts.index, y=day_counts)
plt.xlabel('DAY')
plt.ylabel('COUNT')

Out[68]: Text(0, 0.5, 'COUNT')


In [70]: dataset.head()

Out[70]:
   START_DATE           END_DATE             CATEGORY  START        STOP             MILES  PURPOSE          date        t
0  2016-01-01 21:11:00  2016-01-01 21:17:00  Business  Fort Pierce  Fort Pierce      5.1    Meal/Entertain   2016-01-01
1  2016-01-02 01:25:00  2016-01-02 01:37:00  Business  Fort Pierce  Fort Pierce      5.0    NOT              2016-01-02
2  2016-01-02 20:25:00  2016-01-02 20:38:00  Business  Fort Pierce  Fort Pierce      4.8    Errand/Supplies  2016-01-02
3  2016-01-05 17:31:00  2016-01-05 17:45:00  Business  Fort Pierce  Fort Pierce      4.7    Meeting          2016-01-05
4  2016-01-06 14:42:00  2016-01-06 15:49:00  Business  Fort Pierce  West Palm Beach  63.7   Customer Visit   2016-01-06

In [74]: sns.boxplot(dataset['MILES'])

Out[74]: <Axes: ylabel='MILES'>

In [78]: sns.boxplot(dataset[dataset['MILES']<100]['MILES'])

Out[78]: <Axes: ylabel='MILES'>


In [82]: sns.boxplot(dataset[dataset['MILES']<40]['MILES'])

Out[82]: <Axes: ylabel='MILES'>
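The box plots above narrow the view by hand, first to MILES below 100 and then below 40. A box plot's whiskers conventionally end at 1.5 times the interquartile range beyond the quartiles, so the same rule can suggest a cutoff numerically. A rough sketch using the nine MILES values visible in the first table output (a tiny sample, so the resulting fence is illustrative only):

```python
import pandas as pd

# MILES values visible in the first table output, excluding the Totals row.
miles = pd.Series([5.1, 5.0, 4.8, 4.7, 63.7, 3.9, 16.2, 6.4, 48.2])

# Upper whisker limit under the usual 1.5 * IQR rule.
q1, q3 = miles.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# Keep only the trips at or below the fence.
typical = miles[miles <= upper_fence]

print(upper_fence, len(typical))
```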

In [86]: sns.distplot(dataset[dataset['MILES']<40]['MILES'])


C:\Users\swati\AppData\Local\Temp\ipykernel_31136\1678554178.py:1: UserWarning:

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

sns.distplot(dataset[dataset['MILES']<40]['MILES'])
Out[86]: <Axes: xlabel='MILES', ylabel='Density'>

In [ ]:
