DS100 Sp22 Lec 01 - Course Overview, Data Science Lifecycle
Course Overview
An overview of data science, Data 100/200, and the data science lifecycle.
Roadmap
Lecture 01, Data 100 Spring 2022
Intro - Josh Hug
Science?
Why I Care About Data Science
(A Coronavirus Story)
The World is Complicated
Belief is Social
Data is a Tool for Finding Truth
But Data Can Be Misleading, and Analysis is Hard
From Tyler Vigen (http://tylervigen.com)
But Data is Easy to Abuse or Misinterpret
Even Important Entities Communicate Poorly!
Example of a Gap in Communication: Childcare
Be able to take data and produce useful insights on the world’s most challenging and
ambiguous problems.
What is Data Science?
Data is changing the world
Joey Gonzalez
(co-creator of this course)
Data Science Venn Diagram
There are many tools out there for data science, but they are merely tools.
● They don’t do any of the important thinking!
“The purpose of computing is insight, not numbers.” - R. Hamming. Numerical Methods for
Scientists and Engineers (1962).
Example Questions in Data Science
Roadmap:
• Intros
• What is data science?
• What will you learn in this class?
• Course overview
• Lots of important details
• Data Science Lifecycle
• Demo
What will you learn in this class?
What are the Principles and Techniques that We’ll Learn?
Course goals
Tentative List of Topics to be Covered in Data 100
Prerequisites
Course Overview
Staff
GSIs
GSIs teach discussion, hold office hours, and help create assignments and
exams. Contact info: ds100.org/sp22/staff.
Bold denotes a 20-hour GSI.
Tutors
Course Websites / Platforms
Online platforms
Programming Environment for our Course: JupyterLab
Learning Advanced JupyterLab
Course Logistics
Content and workflow
Weekly Flow
Hybrid format
This semester, Data 100 will be run in a hybrid format. There are a lot of moving parts; we want to
cover them all now so that everyone is on the same page.
● Please give us feedback throughout the semester! Based on the data, we may change
various aspects of the course throughout the semester.
● Note: In-person meetings are fully dependent on public health guidelines. We are
prepared to hold all course activities online should circumstances demand.
Useful links:
● The following information is all on the syllabus page of the website.
● The calendar page contains the scheduling for all live events.
● Ed contains the Zoom links for all live events.
Lecture format
Discussion Section
Structure:
● Worksheet posted and discussions held on Fridays.
● Two section types: Online and in person.
○ In person sections start week 3, pending public health guidance.
● Worksheet may include extra problems at the end that TAs will not have time to cover!
Discussion Section Attendance
Francis’s second online 4:00pm - 5:00pm section will be recorded and posted. Only sign up for
this section if you are OK with being recorded.
Attendance is optional, but can boost your grade if your homework score is less than perfect.
● More shortly.
Lab Format
There is one lab assignment per week. Labs are shorter programming assignments designed
to give you familiarity with new concepts.
● In a typical week, lab is released on Friday and is due the following Tuesday.
● All lab autograder tests are visible.
Support:
● Spoiler walkthroughs released with each lab.
○ Don’t just go straight to the spoiler video! Try on your own first.
● In-person lab support Tuesdays 5 - 8 PM.
○ Will start week 3, pending public health guidance.
● Labs are fully autograded.
Homework
Weekly Check-ins
● Released on Mondays, and are due the following Monday.
● Weekly surveys may also contain logistical questions.
○ For instance, the Week 1 survey asks what timezone you think you’ll be in this
semester.
● You submit weekly surveys via Google Forms.
○ The links to these forms will be on the website.
● Mandatory (for undergraduates).
○ We’ll drop up to four missed or late surveys.
Office hours
Office hours are listed on the calendar and will be held both virtually and in person.
● These are led by GSIs, tutors, and academic interns.
● Come to get help on assignments – labs, homeworks, and projects – and concepts.
● To access virtual office hours, join the queue at oh.ds100.org.
○ When joining the queue, specify which assignment and question you need help with.
○ Once it’s your turn, you will be given a Zoom link.
● In person office hours will be held in various locations specified on the calendar
○ To adhere to public health guidelines, we ask that students leave the OH room
after their questions have been answered.
Josh and Lisa will also be hosting their own office hours.
● Primary focus will be on non-HW, non-project, non-lab questions, but assignment
questions are also welcome.
● Details TBA.
Exams
Two midterms and a final:
● Midterm 1: Thursday, February 24, 7 - 9 PM Pacific.
○ Primary focus: Programming and tools.
● Midterm 2: Thursday, April 7, 7 - 8:30 PM Pacific.
○ Primary focus: Math and theory. Smaller, lighter-weight midterm.
● Final: Friday, May 13, 7 - 10 PM Pacific.
○ Comprehensive.
Format:
● Current plan: Primarily in-person exams with the option for virtual exams. Online details
TBD.
● Alternate exam times will be provided for all exams for pre-approved reasons, such as a
concurrent final exam.
● If you miss an exam due to a personal emergency or illness, please contact the Head TA
Andrew Lenz immediately.
Grading
Grading Logistics
Deadlines are firm at 11:59PM. Extensions are provided only to students with DSP
accommodations, or in the case of exceptional circumstances.
● No late homework or lab submissions will be accepted.
○ Gradescope may allow you to submit late, but you will be given a 0.
● You can submit projects up to 2 days late, at 10% off per day.
○ Rounded up to the next day: 2 minutes late = 1 day late.
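The project late policy above can be sketched as a small calculation. This is only an illustration, not the course's actual grading code: the function name `project_score` and its parameters are hypothetical, and it assumes that "10% off per day" means 10% of the earned score is deducted for each late day and that anything more than 2 days late receives a 0.

```python
import math

MINUTES_PER_DAY = 24 * 60


def project_score(raw_score, minutes_late, penalty_per_day=0.10, max_days_late=2):
    """Return the score after the late penalty (hypothetical helper).

    Assumptions: the penalty is a fraction of the earned score, and
    submissions more than `max_days_late` days late receive a 0.
    """
    if minutes_late <= 0:
        return raw_score  # on time: no penalty
    # Lateness is rounded up to the next whole day: 2 minutes late = 1 day late.
    days_late = math.ceil(minutes_late / MINUTES_PER_DAY)
    if days_late > max_days_late:
        return 0.0
    return raw_score * (1 - penalty_per_day * days_late)


# Example: a project submitted 2 minutes late counts as 1 day late.
print(project_score(100, minutes_late=2))
```

Rounding with `math.ceil` is what makes a 2-minute delay count as a full day.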
If you have DSP accommodations, you should receive an email from us shortly.
Collaboration and Academic Dishonesty
We will be following the EECS Department Policy on Academic Dishonesty, which states that using
work or resources that are not your own or permitted by the course constitutes plagiarism and may
lead to disciplinary actions.
Assignments
Data science is a collaborative activity! It is okay to discuss problems with friends.
● List their names at the top of your assignments. We provide a place to do this.
● You must write your solutions individually! Do not copy any other student’s work.
● If we suspect that you have submitted plagiarized work, we will call you in for a meeting. If we
then determine that plagiarism has occurred, we reserve the right to give you a negative full
score (-100%) or lower on the assignments in question, along with reporting your offense to the
Center for Student Conduct.
Exams
● Cheating on exams is a serious offense. We will have proctoring in place and will prosecute
those caught cheating, with serious consequences for your career – so don’t do it!
Weekly Announcements
We are Here to Help!
Data Science Lifecycle
The “data science lifecycle” you will see out in the wild may be slightly different than
the one we teach you, but the core ideas are all the same.
Data science lifecycle
The data science lifecycle is a high-level description of the data science workflow.
[Figure: lifecycle diagram; legible labels: Understand the World, Understand the Data, Reports, Decisions, and Solutions]
1. Question/Problem Formulation
2. Data Acquisition and Cleaning
3. Exploratory Data Analysis & Visualization
4. Prediction and Inference
Demo: The Data Science Lifecycle
Available on the course website: https://ds100.org/sp22/lecture/lec01/
[1] Ask a Question: Who are you?
[Figure: lifecycle diagram; legible labels: Ask a Question, Obtain Data, Understand the Data, Understand the World, Reports, Decisions]
[2] Data Acquisition and Cleaning
[3] Exploratory Data Analysis and Visualization
[3] A harder direction to explore
Diversity ...?
Unfortunately, surveys of data scientists
suggest that there are far fewer women:
What is the gender diversity of this class?
[1, 2] (again, but for Baby Names Data)
[4] Prediction and Inference: Simple Classifier
Simple classifier:
1. SSN: Proportion of F babies per name
2. Use step 1 to classify each student name
as F, M, or Unknown
3. Average step 2 to get a class prop. F
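The three steps above can be sketched in plain Python. This is a minimal illustration and not the demo notebook's code: the toy counts, the function names, the 0.5 threshold, and the choice to leave Unknown names out of the average are all assumptions made for this example.

```python
# Hypothetical toy counts standing in for the baby names data:
# name -> (number of F babies, number of M babies).
toy_counts = {
    "alice": (990, 10),
    "bob": (5, 995),
    "taylor": (600, 400),
}


def proportion_female(counts):
    """Step 1: proportion of F babies for each name."""
    return {name: f / (f + m) for name, (f, m) in counts.items()}


def classify(name, prop_f, threshold=0.5):
    """Step 2: label a student name as F, M, or Unknown.

    The 0.5 threshold is an assumption; names missing from the data
    are labeled Unknown.
    """
    p = prop_f.get(name.lower())
    if p is None:
        return "Unknown"
    return "F" if p > threshold else "M"


def class_proportion_female(names, prop_f):
    """Step 3: average the labels, ignoring Unknowns (an assumption)."""
    labels = [classify(name, prop_f) for name in names]
    known = [label for label in labels if label != "Unknown"]
    if not known:
        return None  # no classifiable names
    return sum(label == "F" for label in known) / len(known)


prop_f = proportion_female(toy_counts)
roster = ["Alice", "Bob", "Taylor", "Zyx"]  # hypothetical class roster
print(class_proportion_female(roster, prop_f))
```

Note that a name like "taylor" is labeled F with the same certainty as "alice", even though its proportion is much closer to 0.5.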
1. How do you feel about the estimated proportion of females in this class?
2. Do you trust it?
A Classifier that Captures Uncertainty
Updated classifier:
1. SSN: Proportion of F babies per name
2. For each student name, using step 1:
a. Pick a random number in [0.0, 1.0)
b. If the number from 2a is less than the SSN prop. F (or 0.5 for
Unknowns), classify the student as F.
Otherwise, classify as M.
3. Average step 2 to get a class prop. F
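The updated classifier can be sketched as follows. Again this is only an illustration under stated assumptions: the toy proportions, the function names, and the use of Python's `random.Random` (whose `random()` method returns a float in [0.0, 1.0)) are choices made for this example, not the demo's actual code.

```python
import random

# Hypothetical per-name proportions of F babies (the output of step 1).
toy_prop_f = {"alice": 0.99, "bob": 0.005, "taylor": 0.6}


def sample_label(name, prop_f, rng, unknown_prop=0.5):
    """Step 2: draw a number in [0.0, 1.0) and compare it to the name's prop. F.

    Names missing from the data use 0.5, as in the steps above.
    """
    p = prop_f.get(name.lower(), unknown_prop)
    return "F" if rng.random() < p else "M"


def simulate_class_proportion(names, prop_f, rng):
    """Step 3: average one round of sampled labels into a class prop. F."""
    labels = [sample_label(name, prop_f, rng) for name in names]
    return sum(label == "F" for label in labels) / len(labels)


rng = random.Random(42)  # seeded so repeated runs give the same draws
roster = ["Alice", "Bob", "Taylor", "Zyx"]  # hypothetical class roster

# Repeating the simulation gives a range of estimates rather than one number.
estimates = [simulate_class_proportion(roster, toy_prop_f, rng) for _ in range(1000)]
print(min(estimates), max(estimates))
```

Running the simulation many times is what exposes the uncertainty: the spread of `estimates` reflects how ambiguous the names on the roster are.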
Possible limitations:
● U.S. name data, not global data
● Everyone born since 1937
● No “rare” names
● Sex as a proxy for gender
See you soon!
Weekly Check 1 (due Mon 1/24)
https://forms.gle/y46QNWarM27i8BUp8
Preferences for online/in-person
Discussion Sign Up (first-come, first-served; attendance grade optional)
Some slots reserved for release at midnight for async students
In-person Lecture Sign Up (for Th 1/20, Tu 1/25, Th 1/27):
https://www.signupgenius.com/go/805094EA8AA28A3FD0-inperson
Discussion this Friday, Zoom links to be posted on Ed
LECTURE 1
Course Overview
Content credit: Suraj Rampure, Allen Shen, Joey Gonzalez, Josh Hug, and Sam Lau