CURE Project Deliverable 1 Sep 17

This document provides instructions for a deliverable assessing skills in descriptive statistics, Python, and common machine learning datasets. Students are asked to complete tasks involving pandas to analyze particulate emissions data, including summarizing datasets numerically and visually. They also explore standard datasets from scikit-learn, reporting on the digits and breast cancer datasets. The deliverable is due by September 29th, 2023 for a grade comprising 3% of the overall course grade.

Uploaded by

diogo

ENDG 319 - Fall 23

CURE Project – Deliverable 1


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Descriptive Statistics, Python, and Common Machine Learning Datasets
Instructions:
• This deliverable is worth 3% of your overall grade.
• The entire deliverable will require you to use Python and MS Excel and all solutions involve pasting screenshots of your work in a Word file. Please keep the original Word file. Once done, convert the Word file into pdf with the file name: Last Name_First Name_UCID_CURE Deliverable 1.
• Submit the pdf file to Assessment > Dropbox > ‘CURE Project Deliverable 1’ by Sep 29, 2023, 11:59 pm (MT).
• You must submit this deliverable on time to be able to submit the upcoming deliverables.
• This is not a group project. Each student needs to submit his/her own report.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Research skill:
(i) Looking at data critically (Who collected the data? Why? What was the context? Are there missing data? What is the usefulness of the dataset?)
(ii) Data preparation for developing machine learning algorithms with Python; graphical/numerical diagnostics/analysis of data
Relevant course content: Descriptive Statistics (Ch 1)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

ENDG 319 (Fall 23) CURE Project Deliverable 1 1|Page


TASK 1. GAINING A BASIC UNDERSTANDING OF PANDAS DATAFRAME
Storing tabular data as pandas dataframe:
(a) Data preprocessing is one of the steps in machine learning. The pandas library in Python is well suited to tabular data. Create a variable ‘emissions’ and assign to it the following data (Table 1) as a pandas DataFrame. Create an Excel file ‘emissions_from_pandas.xlsx’ from the ‘emissions’ variable using Python.
Table 1. Particulate matter (PM) emissions (in g/gal) for 15 vehicles driven at low altitude and
another 15 vehicles driven at high altitude.
Low Altitude High Altitude
1.50 7.59
1.48 2.06
2.98 8.86
1.40 8.67
3.12 5.61
0.25 6.28
6.73 4.04
5.30 4.40
9.30 9.52
6.96 1.50
7.21 6.07
0.87 17.11
1.06 3.57
7.39 2.68
1.37 6.46
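
As a minimal sketch of part (a), one way to hold Table 1 in a DataFrame is to build it from a dictionary of columns. The column labels are taken from the table header; the list names `low` and `high` are only illustrative.

```python
import pandas as pd

# PM emissions (g/gal) copied from Table 1; list names are illustrative.
low = [1.50, 1.48, 2.98, 1.40, 3.12, 0.25, 6.73, 5.30,
       9.30, 6.96, 7.21, 0.87, 1.06, 7.39, 1.37]
high = [7.59, 2.06, 8.86, 8.67, 5.61, 6.28, 4.04, 4.40,
        9.52, 1.50, 6.07, 17.11, 3.57, 2.68, 6.46]

# Each dictionary key becomes a column label, each list a column.
emissions = pd.DataFrame({"Low Altitude": low, "High Altitude": high})

print(emissions.shape)   # (rows, columns)
print(emissions.head())  # first five rows
```

Building from a dictionary keeps each column next to its label, which makes typing errors easier to spot against the table.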

iloc[] method:
(b) Using .iloc[], we can access any part of the dataframe by integer position (row position first, then column position). Run the following commands and show the outputs:
emissions.head()
emissions.iloc[0,0]
emissions.iloc[1,1]
emissions.iloc[0:2,0:2]
emissions.iloc[2:4,:]
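
The sketch below illustrates how these position-based selections behave. To stay short it uses only the first five rows of Table 1, and the variable names on the left-hand side are illustrative.

```python
import pandas as pd

# First five rows of Table 1 only, to keep the example short.
emissions = pd.DataFrame({
    "Low Altitude": [1.50, 1.48, 2.98, 1.40, 3.12],
    "High Altitude": [7.59, 2.06, 8.86, 8.67, 5.61],
})

first_cell = emissions.iloc[0, 0]    # row 0, column 0: a single value
second_high = emissions.iloc[1, 1]   # row 1, column 1: a single value
top_left = emissions.iloc[0:2, 0:2]  # rows 0-1 and columns 0-1 (end index excluded)
rows_2_3 = emissions.iloc[2:4, :]    # rows 2-3, all columns

print(first_cell, second_high)
print(top_left)
print(rows_2_3)
```

Positions start at 0 and slices exclude their end index, so `0:2` selects two rows, not three.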



File conversion:
(c) Create an Excel file "emissions_from_pandas.xlsx" from the emissions variable using the .to_excel method. Paste a screenshot of the input command.
(d) Create an MS Excel file ‘emissions_excel.xlsx’ containing the data in Table 1 above with the column headers and save it on your computer. Create a variable ‘emissions_from_excel’ from the ‘emissions_excel.xlsx’ file using the pd.read_excel function. Show the first five rows using .head(). Paste a screenshot with the input commands used.
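
A hedged sketch of the write and read steps in parts (c) and (d) follows. It assumes the optional `openpyxl` package is installed, since pandas relies on it for .xlsx files. The helper `excel_round_trip` is hypothetical, and it reads back the file written by pandas as a stand-in for the hand-made ‘emissions_excel.xlsx’, so the file is written to a temporary folder rather than your working directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Three rows of Table 1 are enough to show the mechanics.
emissions = pd.DataFrame({
    "Low Altitude": [1.50, 1.48, 2.98],
    "High Altitude": [7.59, 2.06, 8.86],
})


def excel_round_trip(df, filename):
    """Write df to an .xlsx file in a temporary folder and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / filename
        df.to_excel(path, index=False)  # index=False leaves out the row labels
        return pd.read_excel(path)      # first row is used as column headers


try:
    emissions_from_excel = excel_round_trip(emissions, "emissions_from_pandas.xlsx")
except ImportError:
    # Raised when the Excel engine (openpyxl) is not installed.
    emissions_from_excel = None
    print("Install openpyxl to read and write .xlsx files.")
else:
    print(emissions_from_excel.head())
```

For the deliverable itself, the same two calls would be given ordinary file names in your working folder, for example `emissions.to_excel("emissions_from_pandas.xlsx")`.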
TASK 2. NUMERICAL AND GRAPHICAL SUMMARY OF DATASETS USING
PYTHON AND PANDAS
Report the following statistics for the emissions data (in Task 1) both at low and high altitude:
sample size or count, sample mean, sample standard deviation (std), minimum, maximum,
median, first and third quartile.
(Note: As mentioned in the textbook, different software packages calculate quartiles in slightly different ways, so do not worry if your quartile values are slightly different from those of someone else who uses a different method.)
(a) Use Python (any library, but pandas will be fast). You can paste a screenshot of both input and output (output will look like the following.)
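
One possible approach, sketched here with pandas, is `DataFrame.describe()`, which returns all eight requested statistics in one table. The quartiles come from pandas' default linear interpolation, so they may differ slightly from Excel or the textbook.

```python
import pandas as pd

emissions = pd.DataFrame({
    "Low Altitude": [1.50, 1.48, 2.98, 1.40, 3.12, 0.25, 6.73, 5.30,
                     9.30, 6.96, 7.21, 0.87, 1.06, 7.39, 1.37],
    "High Altitude": [7.59, 2.06, 8.86, 8.67, 5.61, 6.28, 4.04, 4.40,
                      9.52, 1.50, 6.07, 17.11, 3.57, 2.68, 6.46],
})

# Rows of the result: count, mean, std (sample, n - 1), min, 25%, 50%, 75%, max.
summary = emissions.describe()
print(summary.round(2))

# Individual statistics are available by row and column label.
median_low = summary.loc["50%", "Low Altitude"]
print("Median at low altitude:", median_low)
```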

(b) Use the pandas library in Python to generate a comparative boxplot of the emissions dataset. Interpret the boxplot (max 50 words).
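
A minimal sketch of the comparative boxplot, assuming matplotlib is installed (pandas uses it for plotting). The output file name is hypothetical, and the upper-fence check uses the common 1.5 × IQR rule that matplotlib applies by default when deciding which points to draw as outliers.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import pandas as pd

emissions = pd.DataFrame({
    "Low Altitude": [1.50, 1.48, 2.98, 1.40, 3.12, 0.25, 6.73, 5.30,
                     9.30, 6.96, 7.21, 0.87, 1.06, 7.39, 1.37],
    "High Altitude": [7.59, 2.06, 8.86, 8.67, 5.61, 6.28, 4.04, 4.40,
                      9.52, 1.50, 6.07, 17.11, 3.57, 2.68, 6.46],
})

# One box per column, drawn on shared axes.
ax = emissions.boxplot(grid=False)
ax.set_ylabel("PM emissions (g/gal)")
ax.set_title("PM emissions by altitude")

# Hypothetical output location: a file in the system temporary folder.
out_path = Path(tempfile.gettempdir()) / "emissions_boxplot.png"
ax.figure.savefig(out_path)

# Values above Q3 + 1.5 * IQR (only the upper fence is checked here).
q1 = emissions.quantile(0.25)
q3 = emissions.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = {
    col: emissions.loc[emissions[col] > upper_fence[col], col].tolist()
    for col in emissions.columns
}
print(outliers)
```

Listing the values beyond the fence alongside the plot can help when writing the interpretation, since it names the points the boxplot draws separately.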
(c) Use Excel to compute the statistics as discussed in part (a) and draw the comparative boxplot
mentioned in part (b). The screenshot with your work may look like the following for the
summary statistics part.



Low Altitude   High Altitude   Statistic   Low Altitude   High Altitude
1.50           7.59            count
1.48           2.06            mean
2.98           8.86            std
1.40           8.67            min
3.12           5.61            25%
0.25           6.28            50%
6.73           4.04            75%
5.30           4.40            max
9.30           9.52
6.96           1.50
7.21           6.07
0.87           17.11
1.06           3.57
7.39           2.68
1.37           6.46



TASK 3. IMPORTANCE OF GRAPHS
Observe the following two bivariate datasets.
Bivariate dataset I Bivariate dataset II

(a) Define a variable ‘dataset1’ of type dataframe using the bivariate dataset I given above. Find
the summary statistics using dataset1.describe().
[Hints: You can create the dataframe by any one of the techniques mentioned in Task 1. Or, load the dataset ‘anscombe’ from seaborn, then define dataset1 using the .iloc indexer.]
(b) Define a variable ‘dataset2’ of type dataframe using the bivariate dataset II given above. Find the summary statistics using dataset2.describe().
(c) Do you see any difference between the statistics that summarize the y variables in the two
datasets?
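
The two data tables are not reproduced in this copy of the handout. Since the hint points to seaborn's ‘anscombe’ dataset, the sketch below assumes that datasets I and II are sets I and II of Anscombe's quartet and types the values in directly, which avoids the download that seaborn would need. Check the values against the tables you were given.

```python
import pandas as pd

# ASSUMPTION: values of Anscombe's quartet, sets I and II (shared x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
dataset1 = pd.DataFrame({
    "x": x,
    "y": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
})
dataset2 = pd.DataFrame({
    "x": x,
    "y": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
})

# Put the two y summaries side by side for a direct comparison.
comparison = pd.concat(
    {"dataset1": dataset1["y"].describe(), "dataset2": dataset2["y"].describe()},
    axis=1,
)
print(comparison.round(2))
```

Placing the summaries in adjacent columns makes it easy to see which statistics agree and which do not.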
(d) Draw a diagram showing two scatterplots using the same axes to display dataset1 and dataset2 above. Do you see any difference between the two datasets? Comment using fewer than 50 words.
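
A sketch of two scatterplots on shared axes using matplotlib. As an assumption, the data are again sets I and II of Anscombe's quartet, typed in by hand because the handout's tables are not reproduced here; the output file name is hypothetical.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# ASSUMPTION: values of Anscombe's quartet, sets I and II (shared x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
dataset1 = pd.DataFrame({
    "x": x,
    "y": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
})
dataset2 = pd.DataFrame({
    "x": x,
    "y": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
})

# Both scatterplots are drawn on the same Axes object, with distinct markers.
fig, ax = plt.subplots()
ax.scatter(dataset1["x"], dataset1["y"], marker="o", label="dataset1")
ax.scatter(dataset2["x"], dataset2["y"], marker="x", label="dataset2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Hypothetical output location: a file in the system temporary folder.
out_path = Path(tempfile.gettempdir()) / "bivariate_scatter.png"
fig.savefig(out_path)
```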



TASK 4. EXPLORING AN EXISTING DATASET IN THE PYTHON LIBRARY
SKLEARN.
There are several Python packages that contain datasets often used in the study of machine learning. We saw two such packages: sklearn and seaborn. We used the following commands to load the iris dataset, the keys, and the description of the dataset from the sklearn package.
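
The original commands are not reproduced in this copy. A minimal sketch of loading the iris dataset and viewing its keys and description with scikit-learn could look like this:

```python
from sklearn.datasets import load_iris

# load_iris() returns a dictionary-like Bunch object.
iris = load_iris()

print(iris.keys())     # names of the stored items, e.g. data, target, DESCR
print(iris.data.shape) # (number of instances, number of attributes)
print(iris.DESCR)      # plain-text description of the dataset
```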

A list of some small standard datasets available in sklearn is given below with the functions to load them in Python.

For more, please see: https://scikit-learn.org/stable/datasets/toy_dataset.html
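
As an illustration, the sketch below calls six loader functions from `sklearn.datasets` and prints the shape of each data array. The dictionary name `loaders` is illustrative, and the selection reflects the toy datasets described on the page linked above rather than the handout's own list, which is not reproduced here.

```python
from sklearn import datasets

# Loader functions for small datasets bundled with scikit-learn.
loaders = {
    "iris": datasets.load_iris,
    "diabetes": datasets.load_diabetes,
    "digits": datasets.load_digits,
    "linnerud": datasets.load_linnerud,
    "wine": datasets.load_wine,
    "breast_cancer": datasets.load_breast_cancer,
}

# Call each loader and record (instances, attributes) of its data array.
shapes = {name: load().data.shape for name, load in loaders.items()}

for name, shape in shapes.items():
    print(f"{name}: {shape[0]} instances, {shape[1]} attributes")
```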


(a) Access and explore the ‘digits’ dataset, and report the following:
I. Number of Instances
II. Number of Attributes
III. Attribute Information
IV. Missing Attribute Values
V. Creator
VI. Date
VII. Class



(b) Access and explore the ‘breast_cancer’ dataset, and report the following:
I. Number of Instances
II. Number of Attributes
III. Attribute Information
IV. Missing Attribute Values
V. Creator
VI. Date
VII. Class

Note: You can take a screenshot directly and paste it as your answer.
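
Several of the items in parts (a) and (b) can be cross-checked against the arrays themselves. The sketch below uses a hypothetical helper, `quick_report`, to count instances, attributes, classes, and missing values. Attribute information, creator, and date are only available in the description text (`DESCR`).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer, load_digits


def quick_report(bunch):
    """Summarize a scikit-learn dataset from its data and target arrays."""
    n_instances, n_attributes = bunch.data.shape
    return {
        "instances": n_instances,
        "attributes": n_attributes,
        "classes": len(np.unique(bunch.target)),
        "missing_values": int(np.isnan(bunch.data).sum()),
    }


digits = load_digits()
cancer = load_breast_cancer()

digits_report = quick_report(digits)
cancer_report = quick_report(cancer)
print("digits:", digits_report)
print("breast_cancer:", cancer_report)

# The remaining items are stated in the description text.
print(digits.DESCR)
print(cancer.DESCR)
```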
