Pandas
Step 1. Go to https://www.kaggle.com/openfoodfacts/world-food-facts/data
Step 2. Import the necessary libraries

import pandas as pd
import numpy as np
Step 3. Use the TSV file and assign it to a DataFrame called food
food = pd.read_csv('~/Desktop/en.openfoodfacts.org.products.tsv', sep='\t')
(pandas emits a DtypeWarning here because some columns have mixed types)
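If the warning matters, low_memory=False makes pandas read the file in a single pass so it can infer one dtype per column; a minimal sketch with the same path:

food = pd.read_csv('~/Desktop/en.openfoodfacts.org.products.tsv', sep='\t', low_memory=False)  # single-pass read, consistent dtype inference per column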
Step 4. See the first 5 entries

food.head()
Step 5. What is the number of observations in the dataset?

food.shape #will give you both (observations/rows, columns)
(356027, 163)

food.shape[0] #will give you only the observations/rows number
356027

Step 6. What is the number of columns in the dataset?

print(food.shape) #will give you both (observations/rows, columns)
print(food.shape[1]) #will give you only the columns number
#OR
food.info() #Columns: 163 entries
(356027, 163)
163
<class 'pandas.core.frame.DataFrame'>
Step 7. Print the name of all the columns

food.columns

Index([..., u'generic_name', u'quantity',
       ...
       u'fruits-vegetables-nuts_100g', u'fruits-vegetables-nuts-estimate_100g',
       u'carbon-footprint_100g', u'nutrition-score-fr_100g',
       u'nutrition-score-uk_100g', u'glycemic-index_100g',
       u'water-hardness_100g'],
      dtype='object', length=163)
Step 8. What is the name of the 105th column?

food.columns[104]
'-glucose_100g'
Step 9. What is the type of the observations of the 105th column?

food.dtypes['-glucose_100g']

dtype('float64')
Step 10. How is the dataset indexed?

food.index
Step 11. What is the product name of the 19th observation?

food.values[18][7]
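An equivalent, more idiomatic lookup for the same cell (a sketch):

food.iloc[18, 7]  # row 19, column 8: same value as food.values[18][7], without building the full values array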
This time we are going to pull data directly from the internet.
Special thanks to:
https://github.com/justmarkham for sharing the dataset and materials.
import pandas as pd
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'
chipo = pd.read_csv(url, sep = '\t')
# clean the item_price column and transform it into a float
prices = [float(value[1 : -1]) for value in chipo.item_price]
# reassign the column with the cleaned prices
chipo.item_price = prices
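The comprehension works, but pandas string methods express the same cleaning without a Python-level loop; a sketch, assuming item_price still holds raw strings such as '$2.39 ':

chipo.item_price = chipo.item_price.str[1:-1].astype(float)  # drop the leading '$' and trailing space, then cast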
Step 4. How many products cost more than $10.00?

# delete the duplicates in item_name and quantity
chipo_filtered = chipo.drop_duplicates(['item_name','quantity','choice_description'])
# chipo_filtered
# select only the products with quantity equals to 1
chipo_one_prod = chipo_filtered[chipo_filtered.quantity == 1]
chipo_one_prod
# chipo_one_prod[chipo_one_prod['item_price']>10].item_name.nunique()
# chipo_one_prod[chipo_one_prod['item_price']>10]
chipo_one_prod.query('item_price > 10').item_name.nunique()
(output fragment: e.g. row 4510, order_id 1793, quantity 1, Barbacoa Bowl, [Guacamole], 11.49)
Step 5. What is the price of each item?

# delete the duplicates in item_name and quantity
# chipo_filtered = chipo.drop_duplicates(['item_name','quantity'])
chipo[(chipo['item_name'] == 'Chicken Bowl') & (chipo['quantity'] == 1)]
# select only the products with quantity equals to 1
# chipo_one_prod = chipo_filtered[chipo_filtered.quantity == 1]
# select only the item_name and item_price columns
# price_per_item = chipo_one_prod[['item_name', 'item_price']]
# sort the values from the most to less expensive
# price_per_item.sort_values(by = "item_price", ascending = False).head(20)
Step 6. Sort by the name of the item

chipo.item_name.sort_values()
# OR
chipo.sort_values(by = "item_name")
Ex - GroupBy
Introduction:
GroupBy can be summarized as Split-Apply-Combine: split the data into groups, apply a function to each group, and combine the results back together.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.
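A toy frame makes the three phases concrete (the frame below is made up for illustration):

import pandas as pd

toy = pd.DataFrame({'continent': ['EU', 'EU', 'AF'],
                    'beer_servings': [100, 200, 60]})
# split the rows by continent, apply mean to each group,
# and combine the per-group results into one Series
toy.groupby('continent')['beer_servings'].mean()
# AF     60.0
# EU    150.0
# Name: beer_servings, dtype: float64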
import pandas as pd
drinks = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv')
drinks.head()
   country      beer_servings  spirit_servings  wine_servings
0  Afghanistan              0                0              0
1  Albania                 89              132             54
2  Algeria                 25                0             14
...
4  Angola                 217               57             45
Step 4. Which continent drinks more beer on average?

drinks.groupby('continent').beer_servings.mean()
continent
AF 61.471698
AS 37.045455
EU 193.777778
OC 89.687500
SA 175.083333
Step 5. For each continent print the statistics for wine consumption.
drinks.groupby('continent').wine_servings.describe()
continent
AF count 53.000000
mean 16.264151
std 38.846419
min 0.000000
25% 1.000000
50% 2.000000
75% 13.000000
max 233.000000
AS count 44.000000
mean 9.068182
std 21.667034
min 0.000000
25% 0.000000
50% 1.000000
75% 8.000000
max 123.000000
EU count 45.000000
mean 142.222222
std 97.421738
min 0.000000
25% 59.000000
50% 128.000000
75% 195.000000
max 370.000000
OC count 16.000000
mean 35.625000
std 64.555790
min 0.000000
25% 1.000000
50% 8.500000
75% 23.250000
max 212.000000
SA count 12.000000
mean 62.416667
std 88.620189
min 1.000000
25% 3.000000
50% 12.000000
75% 98.500000
max 221.000000
dtype: float64
Step 6. Print the mean alcohol consumption per continent for every column
drinks.groupby('continent').mean()
(output: the mean of every numeric column, indexed by continent)
Step 7. Print the median alcohol consumption per continent for every column

drinks.groupby('continent').median()

(output: the median of every numeric column, indexed by continent)
Step 8. Print the mean, min and max values for spirit consumption.
This time output a DataFrame
drinks.groupby('continent').spirit_servings.agg(['mean', 'min', 'max'])
                 mean  min  max
continent
AF          16.339623    0  152
AS          60.840909    0  326
EU         132.555556    0  373
OC          58.437500    0  254
SA         114.750000   25  302
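The same table can be produced with self-chosen column names via named aggregation (available from pandas 0.25 on; a sketch):

drinks.groupby('continent').spirit_servings.agg(
    spirit_mean='mean',  # the keyword on the left becomes the output column name
    spirit_min='min',
    spirit_max='max',
)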
Student Alcohol Consumption

Introduction:

This time you will download a dataset from the UCI.
import pandas as pd
import numpy
csv_url = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/04_Apply/Students_Alcohol_Consumption/student-mat.csv'
df = pd.read_csv(csv_url)
df.head()
(output: first 5 rows; columns school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, ...)
5 rows × 33 columns
Step 4. For the purpose of this exercise slice the dataframe from 'school'
until the 'guardian' column
stud_alcoh = df.loc[: , "school":"guardian"]
stud_alcoh.head()
(output: first 5 rows of the school through guardian columns)
Step 5. Create a lambda function that will capitalize strings

capitalizer = lambda x: x.capitalize()

Step 6. Capitalize both Mjob and Fjob

stud_alcoh['Mjob'].apply(capitalizer)
stud_alcoh['Fjob'].apply(capitalizer)
0 Teacher
1 Other
2 Other
3 Services
4 Other
5 Other
6 Other
7 Teacher
8 Other
9 Other
10 Health
11 Other
12 Services
13 Other
14 Other
15 Other
16 Services
17 Other
18 Services
19 Other
20 Other
21 Health
22 Other
23 Other
24 Health
25 Services
26 Other
27 Services
28 Other
29 Teacher
...
365 Other
366 Services
367 Services
368 Services
369 Teacher
370 Services
371 Services
372 At_home
373 Other
374 Other
375 Other
376 Other
377 Services
378 Other
379 Other
380 Teacher
381 Other
382 Services
383 Services
384 Other
385 Other
386 At_home
387 Other
388 Services
389 Other
390 Services
391 Services
392 Other
Step 7. Print the last elements of the data set

stud_alcoh.tail()

(output: last 5 rows; Mjob and Fjob are still lowercase)
Step 8. Did you notice the original dataframe is still lowercase? Why is that?
Fix it and capitalize Mjob and Fjob.
stud_alcoh['Mjob'] = stud_alcoh['Mjob'].apply(capitalizer)
stud_alcoh['Fjob'] = stud_alcoh['Fjob'].apply(capitalizer)
stud_alcoh.tail()
(output: last 5 rows; Mjob and Fjob are now capitalized)
Step 9. Create a function called majority that returns a boolean value to a new column called legal_drinker (consider majority as older than 17)

def majority(x):
    if x > 17:
        return True
    else:
        return False
stud_alcoh['legal_drinker'] = stud_alcoh['age'].apply(majority)
stud_alcoh.head()
(output: first 5 rows with the new legal_drinker column)
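The helper function is not strictly needed here: a comparison already returns booleans, so a vectorized sketch of the same column is:

stud_alcoh['legal_drinker'] = stud_alcoh['age'] > 17  # same result as apply(majority), no Python-level function calls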
Step 10. Multiply every number of the dataset by 10

def times10(x):
    if type(x) is int:
        return 10 * x
    return x

stud_alcoh.applymap(times10).head(10)
(output: first 10 rows with every numeric cell multiplied by 10)
Housing Market
Introduction:
This time we will create our own dataset with fictional numbers to describe a housing market. As we are going to create random data, don't try to read meaning into the numbers.
import pandas as pd
import numpy as np
Step 2. Create 3 different Series, each of length 100

s1 = pd.Series(np.random.randint(1, high=5, size=100, dtype='l'))
s2 = pd.Series(np.random.randint(1, high=4, size=100, dtype='l'))
s3 = pd.Series(np.random.randint(10000, high=30001, size=100, dtype='l'))
print(s1, s2, s3)
(output: the three Series printed in sequence, 100 random values each; your numbers will differ)
Step 3. Create a DataFrame by joining the Series by column

housemkt = pd.concat([s1, s2, s3], axis=1)
housemkt.head()
0 1 2
0 2 2 16957
1 2 3 24571
2 4 2 28303
3 2 3 14153
4 1 3 23445
Step 4. Change the name of the columns to bedrs, bathrs and price_sqr_meter

housemkt.rename(columns = {0: 'bedrs', 1: 'bathrs', 2: 'price_sqr_meter'}, inplace=True)
housemkt.head()

   bedrs  bathrs  price_sqr_meter
0      2       2            16957
1      2       3            24571
2      4       2            28303
3      2       3            14153
4      1       3            23445
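Building the frame from a dict of Series names the columns at construction time and makes the rename unnecessary; a sketch:

housemkt = pd.DataFrame({'bedrs': s1, 'bathrs': s2, 'price_sqr_meter': s3})  # columns align on the shared 0-99 index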
Step 5. Create a one column DataFrame with the values of the 3 Series and
assign it to 'bigcolumn'
# concatenate the values along the rows (axis=0)
bigcolumn = pd.concat([s1, s2, s3], axis=0)
# it is still a Series, so we need to transform it to a DataFrame
bigcolumn = bigcolumn.to_frame()
print(type(bigcolumn))
bigcolumn
<class 'pandas.core.frame.DataFrame'>

(output: 300 rows in one column; the index runs 0 to 99 three times because each Series keeps its own index)
Step 6. Oops, it seems the index goes only until 99. Is it true?

# no: the original Series indexes are kept, but the length of the DataFrame is 300
len(bigcolumn)
300
Step 7. Reindex the DataFrame so it goes from 0 to 299

bigcolumn.reset_index(drop=True, inplace=True)
bigcolumn
(output: the same 300 rows, now indexed 0 to 299)
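Steps 5 through 7 can also be collapsed into a single expression; a sketch:

bigcolumn = pd.concat([s1, s2, s3], ignore_index=True).to_frame()  # ignore_index builds the fresh 0-299 index during the concat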
Wind Statistics

Introduction:

The data have been modified to contain some missing values, identified by NaN.
"""
Yr Mo Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
61 1 1 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
61 1 2 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
61 1 3 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
"""
The first three columns are year, month and day. The remaining 12 columns are average windspeeds in knots at 12 locations in Ireland on that day.
import pandas as pd
import datetime
Step 3. Assign it to a variable called data and replace the first 3 columns by a
proper datetime index.
# parse_dates gets 0, 1, 2 columns and parses them as the index
data_url = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/Wind_Stats/wind.data'
data = pd.read_csv(data_url, sep = r"\s+", parse_dates = [[0,1,2]])
data.head()
     Yr_Mo_Dy    RPT    VAL    ROS   KIL    SHA   BIR    DUB    CLA    MUL    CLO    BEL    MAL
0  2061-01-01  15.04  14.96  13.17  9.29    NaN  9.87  13.67  10.25  10.83  12.58  18.50  15.04
1  2061-01-02  14.71    NaN  10.83  6.50  12.62  7.67  11.50  10.04   9.79   9.67  17.54  13.83
# The problem is that the dates are 2061 and so on...
# function that uses datetime
def fix_century(x):
    year = x.year - 100 if x.year > 1989 else x.year
    return datetime.date(year, x.month, x.day)

# apply the function fix_century on the column and replace the values with the right ones
data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_century)
# data.info()
data.head()
     Yr_Mo_Dy    RPT    VAL    ROS    KIL    SHA   BIR    DUB    CLA    MUL    CLO    BEL    MAL
0  1961-01-01  15.04  14.96  13.17   9.29    NaN  9.87  13.67  10.25  10.83  12.58  18.50  15.04
1  1961-01-02  14.71    NaN  10.83   6.50  12.62  7.67  11.50  10.04   9.79   9.67  17.54  13.83
2  1961-01-03  18.50  16.88  12.33  10.13  11.17  6.17  11.25    NaN   8.50   7.67  12.75  12.71
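A loop-free alternative to fix_century stays in datetime64 the whole time, which also makes the to_datetime round trip in the next step unnecessary; a sketch using pandas offsets:

data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].where(
    data['Yr_Mo_Dy'].dt.year <= 1989,              # keep dates that are already correct
    data['Yr_Mo_Dy'] - pd.DateOffset(years=100),   # shift 2061 back to 1961, and so on
)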
Step 5. Set the right dates as the index. Pay attention to the data type; it should be datetime64[ns].
# transform Yr_Mo_Dy it to date type datetime64
data["Yr_Mo_Dy"] = pd.to_datetime(data["Yr_Mo_Dy"])
# set 'Yr_Mo_Dy' as the index
data = data.set_index('Yr_Mo_Dy')
data.head()
# data.info()
              RPT    VAL    ROS    KIL    SHA   BIR    DUB    CLA    MUL    CLO    BEL    MAL
Yr_Mo_Dy
1961-01-01  15.04  14.96  13.17   9.29    NaN  9.87  13.67  10.25  10.83  12.58  18.50  15.04
1961-01-02  14.71    NaN  10.83   6.50  12.62  7.67  11.50  10.04   9.79   9.67  17.54  13.83
1961-01-03  18.50  16.88  12.33  10.13  11.17  6.17  11.25    NaN   8.50   7.67  12.75  12.71

Step 6. Compute how many values are missing for each location over the entire record.

They should be ignored in all calculations below.
# "Number of non-missing values for each location: "
data.isnull().sum()
RPT 6
VAL 3
ROS 2
KIL 5
SHA 2
BIR 0
DUB 3
CLA 2
MUL 3
CLO 1
BEL 0
MAL 4
dtype: int64
Step 7. Compute how many non-missing values there are.

# number of rows minus the number of missing values for each location
data.shape[0] - data.isnull().sum()
# or
data.notnull().sum()
RPT 6568
VAL 6571
ROS 6572
KIL 6569
SHA 6572
BIR 6574
DUB 6571
CLA 6572
MUL 6571
CLO 6573
BEL 6574
MAL 6570
dtype: int64
Step 8. Calculate the mean windspeeds of the windspeeds over all the
locations and all the times.
A single number for the entire dataset.
data.sum().sum() / data.notna().sum().sum()
10.227883764282167
Step 9. Create a DataFrame called loc_stats and calculate the min, max and
mean windspeeds and standard deviations of the windspeeds at each
location over all the days
A different set of numbers for each location.
data.describe(percentiles=[])
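describe answers the question but never assigns the requested loc_stats; a sketch that builds the DataFrame explicitly:

loc_stats = data.agg(['min', 'max', 'mean', 'std'])  # one row per statistic, one column per location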
Step 10. Create a DataFrame called day_stats and calculate the min, max and
mean windspeed and standard deviations of the windspeeds across all the
locations at each day.
A different set of numbers for each day.
# create the dataframe
day_stats = pd.DataFrame()
# this time we determine axis equals to one so it gets each row.
day_stats['min'] = data.min(axis = 1) # min
day_stats['max'] = data.max(axis = 1) # max
day_stats['mean'] = data.mean(axis = 1) # mean
day_stats['std'] = data.std(axis = 1) # standard deviations
day_stats.head()
(output: min, max, mean and std columns, indexed by Yr_Mo_Dy)
Step 11. Find the average windspeed in January for each location.
Treat January 1961 and January 1962 both as January.
data.loc[data.index.month == 1].mean()
RPT 14.847325
VAL 12.914560
ROS 13.299624
KIL 7.199498
SHA 11.667734
BIR 8.054839
DUB 11.819355
CLA 9.512047
MUL 9.543208
CLO 10.053566
BEL 14.550520
MAL 18.028763
dtype: float64
Step 12. Downsample the record to a yearly frequency for each location.
data.groupby(data.index.to_period('A')).mean()
(output: yearly mean windspeeds for each location, indexed by Yr_Mo_Dy)
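resample is the idiomatic downsampling tool and yields the same yearly means, indexed by year-end timestamps rather than periods; a sketch:

data.resample('A').mean()  # 'A' is the annual frequency alias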
Step 13. Downsample the record to a monthly frequency for each location.

data.groupby(data.index.to_period('M')).mean()

(output: monthly mean windspeeds for each location)
Step 14. Downsample the record to a weekly frequency for each location.
data.groupby(data.index.to_period('W')).mean()
(output, truncated to the first seven locations)
                             RPT        VAL        ROS       KIL        SHA        BIR        DUB
Yr_Mo_Dy
1961-07-17/1961-07-23   4.202857   4.255714   6.738571  3.300000   6.112857   2.715714   3.964286
...
1978-06-05/1978-06-11  12.022857   9.154286   9.488571  5.971429  10.637143   8.030000   8.678571
1978-06-12/1978-06-18   9.410000   8.770000  14.135714  6.457143   8.564286   6.898571   7.297143
1978-06-19/1978-06-25  12.707143  10.244286   8.912857  5.878571  10.372857   6.852857   7.648571
1978-06-26/1978-07-02  12.208571   9.640000  10.482857  7.011429  12.772857   9.005714  11.055714
1978-07-03/1978-07-09  18.052857  12.630000  11.984286  9.220000  13.414286  10.762857  11.368571
1978-07-10/1978-07-16   5.882857   3.244286   5.358571  2.250000   4.618571   2.631429   2.494286
1978-07-17/1978-07-23  13.654286  10.007143   9.915714  6.577143  10.757143   8.282857   8.147143
1978-07-24/1978-07-30  12.172857  11.854286  11.094286  6.631429   9.918571   8.707143   7.458571
1978-07-31/1978-08-06  12.475714   9.488571  10.584286  5.457143   8.724286   5.855714   7.065714
1978-08-07/1978-08-13  10.114286   9.600000   7.635714  4.790000   8.101429   6.702857   5.452857
1978-08-14/1978-08-20  11.100000  11.237143  10.505714  5.697143   9.910000   8.034286   7.267143
1978-08-21/1978-08-27   6.208571   5.060000   8.565714  3.121429   4.638571   4.077143   3.291429
1978-08-28/1978-09-03   8.232857   4.888571   7.767143  3.588571   3.892857   5.090000   6.184286
1978-09-04/1978-09-10  11.487143  12.742857  11.124286  5.702857  10.721429  10.927143   9.157143
This time we are going to pull data directly from the internet.
Special thanks to:
https://github.com/justmarkham for sharing the dataset and materials.
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
# set this so the plots are displayed inline in the notebook
%matplotlib inline
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'
chipo = pd.read_csv(url, sep = '\t')
chipo.head(10)
(output: first 10 rows; e.g. row 3: order_id 1, quantity 1, Chips and Tomatillo-Green Chili Salsa, NaN, $2.39)
Step 5. Create a histogram of the top 5 items bought

# get the Series of the names
x = chipo.item_name

# use the Counter class from collections to create a dictionary with keys (text) and frequencies
letter_counts = Counter(x)

# convert the dictionary to a DataFrame
df = pd.DataFrame.from_dict(letter_counts, orient='index')

# sort the counts ascending and keep the last five entries, i.e. the five most ordered items
df = df[0].sort_values(ascending = True)[-5:]
# create the plot
df.plot(kind='bar')
# Set the title and labels
plt.xlabel('Items')
plt.ylabel('Number of Times Ordered')
plt.title('Most ordered Chipotle\'s Items')
# show the plot
plt.show()
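value_counts collapses the Counter / from_dict / sort pipeline above into one call; a sketch:

chipo.item_name.value_counts().head(5).plot(kind='bar')  # value_counts sorts descending, so head(5) keeps the five most ordered items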
Step 6. Create a scatterplot with the number of items ordered per order price

Hint: Price should be on the X-axis and items ordered on the Y-axis
# create a list of prices
chipo.item_price = [float(value[1:-1]) for value in chipo.item_price] # strip the dollar sign and convert to float
# then groupby the orders and sum
orders = chipo.groupby('order_id').sum()
# creates the scatterplot
# plt.scatter(orders.quantity, orders.item_price, s = 50, c = 'green')
plt.scatter(x = orders.item_price, y = orders.quantity, s = 50, c = 'green')
# Set the title and labels
plt.xlabel('Order Price')
plt.ylabel('Items ordered')
plt.title('Number of items ordered per order price')
plt.ylim(0)
(0, 36.7178857951459)
Titanic Disaster

Introduction:

This exercise is based on the Titanic Disaster dataset available at Kaggle.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
url = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/Visualization/Titanic_Desaster/train.csv'
titanic = pd.read_csv(url)
titanic.head()
Braund,
0 1 0 3 Mr. Owen male 22.0 1 0 A/5 21171 7.
Harris
Cumings,
Mrs. John
Bradley
1 2 1 1 female 38.0 1 0 PC 17599 71.
(Florence
Briggs
Th
Step 4. Set PassengerId as the index

titanic.set_index('PassengerId').head()
             Survived  Pclass                                               Name     Sex   Age  SibSp  Parch     Ticket    Fare
PassengerId
1                   0       3                            Braund, Mr. Owen Harris    male  22.0      1      0  A/5 21171   7.250
2                   1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0   PC 17599  71.283

Step 5. Create a pie chart presenting the male/female proportion
# sum the instances of males and females
males = (titanic['Sex'] == 'male').sum()
females = (titanic['Sex'] == 'female').sum()
# put them into a list called proportions
proportions = [males, females]
# Create a pie chart
plt.pie(
    # using proportions
    proportions,
    # with the labels Males and Females
    labels = ['Males', 'Females'],
    # with no shadows
    shadow = False,
    # with colors
    colors = ['blue', 'red'],
    # with one slice exploded out
    explode = (0.15, 0),
    # with the start angle at 90 degrees
    startangle = 90,
    # with the percentages shown to one decimal place
    autopct = '%1.1f%%'
)

# equal axis scaling so the pie is drawn as a circle
plt.axis('equal')
# Set labels
plt.title("Sex Proportion")
# View the plot
plt.tight_layout()
plt.show()
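The two sums can also come from a single value_counts call; male is the more frequent value, so the order matches the labels above (a sketch):

proportions = titanic['Sex'].value_counts()  # counts for male and female in one pass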
Step 6. Create a scatterplot with the Fare paid and the Age; differentiate the plot color by gender

# creates the plot using seaborn's lmplot; fit_reg=False disables the regression line
lm = sns.lmplot(x = 'Age', y = 'Fare', data = titanic, hue = 'Sex', fit_reg=False)
# set title
lm.set(title = 'Fare x Age')
# get the axes object and tweak it
axes = lm.axes
axes[0,0].set_ylim(-5,)
axes[0,0].set_xlim(-5,85)
(-5, 85)
Step 7. Create a histogram with the Fare paid

# sort the Fare values from highest to lowest
df = titanic.Fare.sort_values(ascending = False)
df
# create bins interval using numpy
binsVal = np.arange(0,600,10)
binsVal
# create the plot
plt.hist(df, bins = binsVal)
# Set the title and labels
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.title('Fare Paid Histogram')
# show the plot
plt.show()