Basic Python Analysis
[2]: fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.4 0.70 0.00 1.9 0.076
1 7.8 0.88 0.00 2.6 0.098
2 7.8 0.76 0.04 2.3 0.092
3 11.2 0.28 0.56 1.9 0.075
4 7.4 0.70 0.00 1.9 0.076
alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
3 9.8 6
4 9.4 5
sulphates float64
alcohol float64
quality int64
dtype: object
1 EDA
[4]: df.describe()
1.1 Filtering
[5]: df['fixed acidity'] # Returns a Series when single brackets [] are used
[5]: 0 7.4
1 7.8
2 7.8
3 11.2
4 7.4
…
1594 6.2
1595 5.9
1596 6.3
1597 5.9
1598 6.0
Name: fixed acidity, Length: 1599, dtype: float64
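Single brackets return a Series, while wrapping the column name in a list returns a one-column DataFrame. A minimal sketch of the difference, assuming a small hypothetical frame `demo` (made-up values) that stands in for the wine data:

```python
import pandas as pd

# Hypothetical three-row stand-in for the wine DataFrame
demo = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 11.2],
    "pH": [3.51, 3.20, 3.16],
})

as_series = demo["fixed acidity"]    # single brackets -> Series
as_frame = demo[["fixed acidity"]]   # list of column names -> DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```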
[8]: df[df['fixed acidity'] > 9] # Rows where fixed acidity > 9; the boolean condition goes inside df[...]
[8]: fixed acidity volatile acidity citric acid residual sugar chlorides \
3 11.2 0.28 0.56 1.9 0.075
56 10.2 0.42 0.57 3.4 0.070
68 9.3 0.32 0.57 2.0 0.074
74 9.7 0.32 0.54 2.5 0.094
88 9.3 0.39 0.44 2.1 0.107
… … … … … …
1470 10.0 0.69 0.11 1.4 0.084
1474 9.9 0.50 0.50 13.8 0.205
1476 9.9 0.50 0.50 13.8 0.205
1543 11.1 0.44 0.42 2.2 0.064
1548 11.2 0.40 0.50 2.0 0.099
alcohol quality
3 9.8 6
56 9.6 5
68 10.7 5
74 9.6 5
88 9.5 5
… … …
1470 9.7 5
1474 8.8 5
1476 8.8 5
1543 10.4 6
1548 10.4 5
[9]: # Test 2
df[(df['fixed acidity'] > 9) & (df['citric acid'] > 0.5)] # Multiple row conditions; add more by wrapping each in parentheses and joining with &
[9]: fixed acidity volatile acidity citric acid residual sugar chlorides \
3 11.2 0.28 0.56 1.9 0.075
56 10.2 0.42 0.57 3.4 0.070
68 9.3 0.32 0.57 2.0 0.074
74 9.7 0.32 0.54 2.5 0.094
151 9.2 0.52 1.00 3.4 0.610
… … … … … …
1221 10.9 0.32 0.52 1.8 0.132
1319 9.1 0.76 0.68 1.7 0.414
1414 10.0 0.32 0.59 2.2 0.077
1416 10.0 0.32 0.59 2.2 0.077
1454 11.7 0.45 0.63 2.2 0.073
alcohol quality
3 9.8 6
56 9.6 5
68 10.7 5
74 9.6 5
151 9.4 4
… … …
1221 11.5 6
1319 9.1 6
1414 9.6 5
1416 9.6 5
1454 10.9 6
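Each comparison produces a boolean Series, and `&` (and) or `|` (or) combines them element by element. The parentheses around each comparison are needed because `&` binds more tightly than `>` in Python. A minimal sketch on a hypothetical four-row frame with made-up values:

```python
import pandas as pd

# Hypothetical stand-in for the wine DataFrame
demo = pd.DataFrame({
    "fixed acidity": [7.4, 11.2, 10.2, 9.3],
    "citric acid": [0.00, 0.56, 0.57, 0.30],
})

high_acid = demo["fixed acidity"] > 9      # boolean Series
high_citric = demo["citric acid"] > 0.5    # boolean Series

both = demo[high_acid & high_citric]       # rows meeting both conditions
either = demo[high_acid | high_citric]     # rows meeting at least one condition

print(both.index.tolist())
print(either.index.tolist())
```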
[10]: df[(df['fixed acidity'] > 9) & (df['citric acid'] > 0.5) & (df['pH'] >= 3)] # Test 3
[10]: fixed acidity volatile acidity citric acid residual sugar chlorides \
3 11.2 0.28 0.56 1.9 0.075
56 10.2 0.42 0.57 3.4 0.070
68 9.3 0.32 0.57 2.0 0.074
74 9.7 0.32 0.54 2.5 0.094
197 11.5 0.30 0.60 2.0 0.067
… … … … … …
1220 10.9 0.32 0.52 1.8 0.132
1221 10.9 0.32 0.52 1.8 0.132
1414 10.0 0.32 0.59 2.2 0.077
1416 10.0 0.32 0.59 2.2 0.077
1454 11.7 0.45 0.63 2.2 0.073
alcohol quality
3 9.8 6
56 9.6 5
68 10.7 5
74 9.6 5
197 10.1 6
… … …
1220 11.5 6
1221 11.5 6
1414 9.6 5
1416 9.6 5
1454 10.9 6
[11]: fixed acidity volatile acidity citric acid residual sugar chlorides \
3 11.2 0.28 0.56 1.9 0.075
16 8.5 0.28 0.56 1.8 0.092
19 7.9 0.32 0.51 1.8 0.341
47 8.7 0.29 0.52 1.6 0.113
56 10.2 0.42 0.57 3.4 0.070
… … … … … …
1548 11.2 0.40 0.50 2.0 0.099
1566 6.7 0.16 0.64 2.1 0.059
1570 6.4 0.36 0.53 2.2 0.230
1574 5.6 0.31 0.78 13.9 0.074
1576 8.0 0.30 0.63 1.6 0.081
alcohol quality
3 9.8 6
16 10.5 7
19 9.2 6
47 9.5 5
56 9.6 5
… … …
1548 10.4 5
1566 11.2 6
1570 12.4 6
1574 10.5 6
1576 10.8 6
[12]: # Row-wise filter that returns only selected columns for rows matching the condition
df.loc[df['fixed acidity'] == 9.2, ['fixed acidity', 'citric acid', 'pH']]
# Syntax: loc locates rows by the condition (== 9.2), followed by the list of columns to show.
# The output does not state how many rows matched; the next cell shows the count.
524 9.2 0.49 3.23
540 9.2 0.24 3.26
614 9.2 0.18 2.87
691 9.2 0.24 3.48
741 9.2 0.24 3.21
765 9.2 0.10 3.31
880 9.2 0.18 3.15
905 9.2 0.20 3.23
1093 9.2 0.36 3.33
1170 9.2 0.34 3.20
1225 9.2 0.23 3.15
1360 9.2 0.31 3.24
[13]: (16, 3)
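The number of rows returned by a filter can be read from `.shape` or `len()`. A sketch of that check, assuming a hypothetical frame `demo` with made-up values:

```python
import pandas as pd

# Hypothetical stand-in for the wine DataFrame
demo = pd.DataFrame({
    "fixed acidity": [9.2, 7.4, 9.2, 8.1],
    "citric acid": [0.49, 0.00, 0.24, 0.10],
    "pH": [3.23, 3.51, 3.26, 3.30],
    "alcohol": [9.4, 9.8, 10.0, 9.5],
})

# Row condition first, then the list of columns to keep
subset = demo.loc[demo["fixed acidity"] == 9.2, ["fixed acidity", "citric acid", "pH"]]

print(subset.shape)   # (rows, columns)
print(len(subset))    # number of matching rows
```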
[14]: df.tail(7)
[14]: fixed acidity volatile acidity citric acid residual sugar chlorides \
1592 6.3 0.510 0.13 2.3 0.076
1593 6.8 0.620 0.08 1.9 0.068
1594 6.2 0.600 0.08 2.0 0.090
1595 5.9 0.550 0.10 2.2 0.062
1596 6.3 0.510 0.13 2.3 0.076
1597 5.9 0.645 0.12 2.0 0.075
1598 6.0 0.310 0.47 3.6 0.067
alcohol quality
1592 11.0 6
1593 9.5 6
1594 10.5 5
1595 11.2 6
1596 11.0 6
1597 10.2 5
1598 11.0 6
[15]: df[5:11] # Rows at positions 5 through 10 (the slice end, 11, is excluded)
[15]: fixed acidity volatile acidity citric acid residual sugar chlorides \
5 7.4 0.66 0.00 1.8 0.075
6 7.9 0.60 0.06 1.6 0.069
7 7.3 0.65 0.00 1.2 0.065
8 7.8 0.58 0.02 2.0 0.073
9 7.5 0.50 0.36 6.1 0.071
10 6.7 0.58 0.08 1.8 0.097
alcohol quality
5 9.4 5
6 9.4 5
7 10.0 7
8 9.5 7
9 10.5 5
10 9.2 5
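Slicing with `df[5:11]` selects rows by position and excludes the end of the slice. The same rows can be selected with `iloc`, while `loc` slices by label and includes the end label. A sketch on a hypothetical frame with a default integer index:

```python
import pandas as pd

demo = pd.DataFrame({"value": range(20)})   # default index 0..19

by_slice = demo[5:11]        # positions 5..10
by_iloc = demo.iloc[5:11]    # same rows, explicit positional indexing
by_loc = demo.loc[5:10]      # label-based, end label included

print(by_slice.index.tolist())
print(by_slice.equals(by_iloc), by_slice.equals(by_loc))
```

With a default index the three forms agree; once the index is no longer 0, 1, 2, ... the label-based and position-based results can differ.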
# x.to_csv('1007_processedfile.csv')
2 Visualization
Very basic or Ordinary Visualization
[18]: # Plot a single column
df['fixed acidity'].plot.line()
[18]: <AxesSubplot:>
[19]: df['fixed acidity'].plot.line(figsize =(20,5), color='green')
[19]: <AxesSubplot:>
3 Better Visualization
3.1 Seaborn
[20]: import seaborn as sns
• There is a negative relationship between fixed acidity and pH; let's visualize the relationship between
these two.
• For this plot, sns.lineplot takes an independent variable x and a dependent variable y, where y
depends on x.
3.1.1 Categorical Data Visualization
• Male / Female type
• In this dataset, quality is categorical data
• Event
• The category by which you want to split the data must be passed to “hue”
• Checking Facebook traffic for day and night
3.2 Color Customization
[36]: p = sns.color_palette("flare", as_cmap=True) # Custom color palette stored in p, then passed to the following code
[38]: p = sns.color_palette("crest", as_cmap=True) # Another color palette
sns.lineplot(data=df, x='fixed acidity', y='pH', hue='quality', palette=p)
[39]: p = sns.color_palette("Spectral", as_cmap=True) # Another color palette
sns.lineplot(data=df, x='fixed acidity', y='pH', hue='quality', palette=p)
3.3 Creating New Dataset for Visualization
[52]: df2 = [['Alice', 30, 'Male', 55], ['Bobe', 17, 'Female', 25], ['Jeba', 11, 'Female', 12],
       ['Tom', 45, 'Male', 72], ['Rita', 21, 'Female', 35],]
df2
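A list of rows like this can be turned into a DataFrame by supplying column names. The source does not show the names, so `name`, `age`, `gender`, and `score` below are hypothetical placeholders:

```python
import pandas as pd

rows = [
    ["Alice", 30, "Male", 55],
    ["Bobe", 17, "Female", 25],
    ["Jeba", 11, "Female", 12],
    ["Tom", 45, "Male", 72],
    ["Rita", 21, "Female", 35],
]

# Column names are assumed for illustration only
people = pd.DataFrame(rows, columns=["name", "age", "gender", "score"])

print(people.shape)
print(people.groupby("gender")["score"].mean())
```

Once the rows carry column names, the categorical column can be passed to `hue` in the same way as `quality` in the wine data.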
• More applications: Facebook traffic by day and night
tips
[60]: Total Bill Tips Gender Smoker Day Time Size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.65 3.50 Male No Mon Dinner 4
4 24.59 3.61 Female No Sun Dinner 3
[71]: tips = pd.read_csv('tips.csv')
tips.head()
[74]: tips.shape
[74]: (244, 7)
3.4.2 Scatter plot
[76]: sns.scatterplot(data=tips, x='total_bill', y='tip', hue='time')
[77]: sns.scatterplot(data=tips, x='total_bill', y='tip', hue='time', style='time')
[78]: sns.scatterplot(data=tips, x='total_bill', y='tip', hue='size')
[79]: sns.scatterplot(data=tips, x='total_bill', y='tip', hue='size', palette="deep")
3.4.3 Tip rate
• If there are a large number of unique numeric values, the legend will show a representative,
evenly-spaced set:
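One way to obtain a numeric variable with many unique values is to derive a tip rate (tip divided by total bill) and pass it to `hue`. A sketch on a made-up five-row frame that mimics the `total_bill` and `tip` columns; the values are illustrative only:

```python
import pandas as pd
import seaborn as sns

# Hypothetical rows with the same column names as the tips data
demo = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
})

# Continuous variable: one distinct value per row
tip_rate = (demo["tip"] / demo["total_bill"]).rename("tip_rate")

# A Series can be passed to hue alongside data=
ax = sns.scatterplot(data=demo, x="total_bill", y="tip", hue=tip_rate)
print(tip_rate.round(3).tolist())
```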
• A numeric variable can also be assigned to size to apply a semantic mapping to the areas of
the points:
• Control the range of marker areas with sizes, and set legend="full" to force every unique
value to appear in the legend:
[86]: sns.scatterplot(
data=tips, x="total_bill", y="tip", hue="size", size="size",
sizes=(20, 200), legend="full"
)
• Pass a tuple of values or a matplotlib.colors.Normalize object to hue_norm to control the
quantitative hue mapping:
[87]: sns.scatterplot(
data=tips, x="total_bill", y="tip", hue="size", size="size",
sizes=(20, 200), hue_norm=(0, 7), legend="full"
)
• Control the specific markers used to map the style variable by passing a Python list or
dictionary of marker codes:
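A sketch of the dictionary form, assuming a made-up four-row frame with a `time` column; the marker codes `"s"` (square) and `"X"` (filled x) are standard matplotlib markers chosen here only for illustration:

```python
import pandas as pd
import seaborn as sns

# Hypothetical rows with the same column names as the tips data
demo = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "tip": [1.01, 1.66, 3.50, 3.31],
    "time": ["Dinner", "Dinner", "Lunch", "Lunch"],
})

# Map each level of the style variable to a matplotlib marker code
markers = {"Lunch": "s", "Dinner": "X"}

ax = sns.scatterplot(data=demo, x="total_bill", y="tip", style="time", markers=markers)
```

The dictionary needs an entry for every level of the style variable; a list assigns markers to levels in order instead.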
[89]: sns.scatterplot(data=tips, x="total_bill", y="tip", s=100, color=".2", marker="+")
[90]: index = pd.date_range("1 1 2000", periods=100, freq="m", name="date")
data = np.random.randn(100, 4).cumsum(axis=0)
wide_df = pd.DataFrame(data, index, ["a", "b", "c", "d"])
sns.scatterplot(data=wide_df)
[90]: <AxesSubplot:xlabel='date'>
• Use relplot() to combine scatterplot() and FacetGrid. This allows grouping within additional
categorical variables, and plotting them across multiple subplots.
• Using relplot() is safer than using FacetGrid directly, as it ensures synchronization of the
semantic mappings across facets.
[91]: sns.relplot(
data=tips, x="total_bill", y="tip",
col="time", hue="day", style="day",
kind="scatter"
)
[93]: # https://pandas.pydata.org/docs/user_guide/10min.html # ***