EE477 Lecture 1 - Introduction
The 4th industrial revolution is upon us
Source: Wikipedia
Source: NCTA
And the amount of data is exploding
Source: Intel
Source: domo.com
Source: zdnet.com
Scary and exciting!
1 GB ≈ 1 cup of coffee
Source: IDC
No surprise: data skills are in high demand
Data science is also becoming important in manufacturing
Which techniques should I learn?
Look at the users
Don’t bother for now
Relational databases are still by far the most dominant technology
Enter EE477 Database and Big Data Systems
● We will cover:
● Relational Databases
○ Necessary fundamentals for Big data
● Spark
○ Best Big data system in the Hadoop & NoSQL category
● Companion course: EE412 (opens in Fall)
[Figure labels: Theory, Algorithms; Machine Learning; EE412 Big Data Analytics; EE477 Database & Big Data Systems]
Staff
● Instructor: Steven E. Whang
○ Leads the Data Intelligence Lab
○ Office hours: Mon/Wed 9:30am ~ 10:30am
○ Office: N1 516, Email: swhang@kaist.ac.kr
● TAs: Chanho Lee (Head TA), Junseok Seo, Hyunseung Hwang, Minsu Kim, Soyeon Kim
Staff
● TA Office hours
○ Chanho Lee: Wednesday 1:00pm ~ 2:00pm
○ Junseok Seo: Tuesday 3:00pm ~ 4:00pm
○ Hyunseung Hwang: Monday 1:00pm ~ 2:00pm
○ Minsu Kim: Friday 3:00pm ~ 4:00pm
○ Soyeon Kim: Thursday 4:00pm ~ 5:00pm
● Email: kaist.ee477.staff@gmail.com
Course work and grading
● 5 Homework assignments (40%)
● Midterm (25%)
● Final exam (30%)
● Gradiance quizzes (4%)
● Classum participation (1%)
Course materials
● Textbook: Database Systems: The Complete Book (2nd edition)
○ PDF searchable online
○ 5 copies in KAIST library
● Lecture slides and lecture videos will be posted on KLMS
○ Google drive containing all lectures
○ Please check for updates
<International Version>
Communication
● KLMS
○ Lecture slides (please check for updates)
○ Lecture videos
● Classum
○ To join: use code K97DM88EG (link: www.classum.com/K97DM88EG)
○ Use Classum for all questions and public communication
○ We will post course announcements to Classum
● Staff mailing list
○ kaist.ee477.staff@gmail.com
○ Use this only for questions that are not suitable for Classum
● Anonymous feedback
○ Link: https://goo.gl/forms/pE2CwGfeinf9DQW23
○ I appreciate any anonymous feedback to improve the course
Gradiance
● Online quizzes and SQL labs designed for self study
○ Please do not ask questions on Classum unless really necessary
● We will give 2 weeks per quiz (no late days!)
● To register on Gradiance please use <Gradiance>
Homeworks
● Five homeworks
○ SQL
○ Relational algebra
○ Database design
○ Web application using database (so you too can make $1B)
○ Spark
● We will use Python, Google Cloud SQL (MySQL), and Apache Spark
● Due at 11:59pm
● Each homework will have specific instructions on what to submit
Homework Schedule (tentative)
Date          Out     In
03/05 (Tue)   HW 0    -
03/12 (Tue)   HW 1    -
03/26 (Tue)   HW 2    HW 0, 1
04/16 (Tue)   HW 3    HW 2
05/07 (Tue)   HW 4    HW 3
05/28 (Tue)   HW 5    HW 4
06/15 (Sat)   -       HW 5
Late Day Policies for Homeworks
● Homeworks 1~4
○ 4 late days (in 24-hour chunks) can be used for any of these homeworks
○ Once the late days are exhausted, we will reduce your score by 20% per late day
○ However, no homework will be accepted more than 4 days after its due date
● Homework 5
○ No late submission allowed as we need to submit grades by 6/20
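The late-day rules for Homeworks 1~4 can be sketched as a small calculation. This is only an illustration under stated assumptions, not the official grading script: a partial day is assumed to round up to a full 24-hour chunk, the 20% reduction is assumed to apply to the earned score for each late day not covered by a free late day, and a submission past the 4-day cutoff is assumed to score 0. The function names and the year in the example dates are hypothetical.

```python
import math
from datetime import datetime

FREE_LATE_DAYS = 4       # shared across Homeworks 1~4
PENALTY_PER_DAY = 0.20   # applied once the free late days are exhausted
MAX_DAYS_LATE = 4        # nothing is accepted after this many days


def late_days(due, submitted):
    """Number of 24-hour chunks used (assumption: partial chunks round up)."""
    seconds_late = (submitted - due).total_seconds()
    if seconds_late <= 0:
        return 0
    return math.ceil(seconds_late / 86400)


def apply_late_policy(raw_score, days_late, free_days_left):
    """Return (final_score, free_days_left_after) for one homework."""
    if days_late > MAX_DAYS_LATE:
        # Assumption: a rejected submission scores 0 and uses no free days.
        return 0.0, free_days_left
    free_used = min(days_late, free_days_left)
    penalized_days = days_late - free_used
    factor = max(0.0, 1.0 - PENALTY_PER_DAY * penalized_days)
    return raw_score * factor, free_days_left - free_used


# Example: due 03/26 at 11:59pm, submitted 03/28 at 10:00am (year is arbitrary),
# with only one free late day left.
due = datetime(2024, 3, 26, 23, 59)
submitted = datetime(2024, 3, 28, 10, 0)
days = late_days(due, submitted)
score, remaining = apply_late_policy(90.0, days, free_days_left=1)
print(days, round(score, 2), remaining)
```

In this example the submission uses two late days: one is covered by the remaining free day and the other costs 20% of the earned score.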
Google Cloud Platform (GCP)
● This class is supported by a generous GCP Credit Award
● See the instructions for using GCP in HW0
● Cloud is the future, so KAIST should go cloud as well
● Add the skills you learned to your resume :-)
Exams
● Midterm
○ 4/17 (Wed) 9am-12pm
○ Covers first half of course
○ Other details TBD
● Final exam
○ 6/12 (Wed) 9am-12pm
○ Covers second half of course
○ Other details TBD
Class Participation
● We will not check class participation
○ Please come to class for the interaction
● Instead, we encourage you to answer questions on Classum
○ Credit is based on both the number of answers you post and the number of “likes” you receive
Grading
● Letter grade (A, B, C, …) only
Prerequisites
● These courses or anything equivalent
○ CS101: Python intro
○ EE205: data structures and algorithms
○ EE209: programming structures
○ EE213: discrete methods
Course policy -- VERY IMPORTANT!
● We will check for plagiarism
○ Do not copy solutions from other people or the Web
○ We will use MOSS, which is a robust tool for catching common code
■ Doing the homework yourself is easier than trying to fool it
○ Copying from ChatGPT is also considered plagiarism
○ If plagiarism is confirmed, the standard penalty is automatic failure (F)
● You can talk to others about the algorithm(s) to be used to solve a homework problem as long as you then mention their name(s) on the work you submit
● You should not use or look at the code of others when you write your own
● See the EE academic code of ethics
Workload
● Gradiance quizzes should take at most a few hours
● Some HWs may take >10 hours
● You may form study groups
Programming environment
● Your own computer
● Lab machines in Haedong lounge
○ Linux server machines running Ubuntu
○ Address range: eelabg#.kaist.ac.kr (#: 1~36)
○ Remote access (always powered on): eelabg1 and eelabg2
○ Use SSH to connect to a lab machine
○ We will create your accounts soon (due to maintenance, please log in after Wednesday)
■ We will keep on creating accounts for newly enrolled students
■ ID: your KAIST student id
■ Password: 4NSzdQS7cL<student id> (except <, >)
■ Please change your password using the ‘passwd’ command on the Linux system (not in Windows)
○ See https://ee.kaist.ac.kr/node/15074
Acknowledgements
● Some slides of this course are adapted from or inspired by these excellent courses
○ UW CSE344: Introduction to Data Management by Dan Suciu
○ SFU CMPT 354: Database Systems 1 by Jiannan Wang
○ Columbia W4111: Introduction to Databases by Eugene Wu
○ Stanford CS145: Data Management and Data Systems by Narayanan Shivakumar
○ Stanford CS245: Database Systems by Peter Bailis / Hector Garcia-Molina
○ CMU 15-415: Database Applications by Andy Pavlo and Christos Faloutsos
○ CMU 15-445: Database Systems by Andy Pavlo
Joke Break
Database
● What is a database?
○ A collection of files with related data
● Examples
○ Google and Naver (web documents)
○ YouTube (videos)
○ Amazon (products)
○ Banking systems (account information)
○ Airline systems (flights)
○ Corporate record keeping (employee information)
○ KLMS and portal (students and courses)
○ and many others
Database management system (DBMS)
● A piece of software that manages databases
○ Allows users to create new databases and specify their schemas (logical structure of data)
○ Gives users the ability to query and modify the data using a query language
○ Supports the storage of very large amounts of data
○ Enables recovery of the database in the face of failures, errors, or intentional misuse
○ Controls access to data from many users at once
● Examples
○ Commercial: Oracle, Microsoft SQL Server, IBM DB2, …
○ Open source: MySQL, SQLite, PostgreSQL, …
○ Cloud: Google Cloud SQL, Google Spanner, Microsoft Azure SQL Database, …
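As a rough illustration of the first two capabilities (defining a schema, then querying and modifying data with a query language), here is a minimal sketch using SQLite through Python's standard `sqlite3` module. The table `R`, its columns, and the sample rows are hypothetical and chosen only for illustration; the course homeworks use Google Cloud SQL (MySQL), whose setup differs.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")

# Specify a schema: the logical structure of the data.
conn.execute(
    "CREATE TABLE R ("
    " name TEXT PRIMARY KEY,"
    " dept TEXT NOT NULL,"
    " age INTEGER NOT NULL)"
)

# Modify the data: insert rows, then update one of them.
conn.executemany(
    "INSERT INTO R (name, dept, age) VALUES (?, ?, ?)",
    [("Adam", "EE", 27), ("Bill", "CS", 20), ("Carol", "IE", 22)],
)
conn.execute("UPDATE R SET age = age + 1 WHERE name = ?", ("Bill",))
conn.commit()

# Query the data declaratively: say what is wanted, not how to find it.
young = conn.execute(
    "SELECT name, age FROM R WHERE age < 25 ORDER BY name"
).fetchall()
print(young)

conn.close()
```

The program never says how rows are stored or searched; that separation is what the DBMS provides.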
Brief history
● Late 1960’s
○ First commercial DBMS’s appear
○ Used tree-based “hierarchical” and graph-based “network” data models
○ No high-level query language
● 1970
○ Ted Codd proposes relations (data organized as tables)
○ Relational model provides data independence
○ SQL (Structured Query Language) is introduced later in the 1970's
[Photo: Ted Codd, Turing Award 1981]
● 1990’s
○ Relational database systems become the norm
○ IBM DB2, Informix, Sybase, Oracle, SQL Server, MySQL, PostgreSQL, ...
Brief history
● 2000’s
○ Internet boom (Web 2.0), need high availability and scalability (10 to 1M users quickly)
○ NoSQL systems are introduced, but are non-relational and sacrifice ACID transactions
○ MapReduce/Hadoop, Spark, Bigtable, Cassandra, MongoDB, ...
● 2010’s and beyond
○ NewSQL systems support relational data and ACID transactions while having the same OLTP performance as NoSQL systems
○ CockroachDB, Google Spanner, Amazon Aurora, ...
Observations
- Relational databases are super important
- NoSQL systems are on the rise and evolving
Turing awards in data management
● Charles Bachman, 1973
○ IDS and CODASYL
● Ted Codd, 1981
○ Relational model
● Jim Gray, 1998
○ Transaction processing
● Michael Stonebraker, 2014
○ INGRES and Postgres
● Jeff Ullman, 2020
○ Compilers and algorithms
○ Also wrote our Database textbook and founded Gradiance :-)
Turing Award ≈ Nobel Prize of Computing
Why is implementing a DBMS hard?
● At first glance, looks simple
○ Relations -> Query -> Results
● Suppose we want to manage student information
● Naive implementation
○ Store a relation in an ASCII file
○ Use shell to run queries
● What could go wrong?
○ Can this be used in KLMS?

ASCII file
    Adam # EE # 27
    Bill # CS # 20
    Carol # IE # 22
    David # AE # 30
    ...

Example Query
    % select *
      from R
      where R.age < 25
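A minimal sketch of the naive implementation described on this slide: the relation lives in a `#`-separated ASCII file and every query re-reads and scans the whole file. The file name `R.txt`, the column names `name` and `dept`, and the helper functions are assumptions for illustration; only `age` appears in the slide's query.

```python
import os
import tempfile

# The four sample rows shown on the slide.
ASCII_RELATION = """\
Adam # EE # 27
Bill # CS # 20
Carol # IE # 22
David # AE # 30
"""


def load_relation(path):
    """Parse every line of the file into a dict (a full scan on each call)."""
    rows = []
    with open(path, encoding="ascii") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            name, dept, age = (field.strip() for field in line.split("#"))
            rows.append({"name": name, "dept": dept, "age": int(age)})
    return rows


def select_younger_than(path, max_age):
    """Rough equivalent of: select * from R where R.age < max_age."""
    return [row for row in load_relation(path) if row["age"] < max_age]


def demo():
    # Write the relation to a temporary file, then query it.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "R.txt")
        with open(path, "w", encoding="ascii") as f:
            f.write(ASCII_RELATION)
        return select_younger_than(path, 25)


result = demo()
print([row["name"] for row in result])
```

Even this tiny version hints at the trouble: each query parses the entire file, a name containing `#` would corrupt a row, and nothing protects the file if two users write to it at once.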
Problems with naive implementation
● Storage and deletion are expensive
● Search is expensive (no indexes)
● Query processing is slow
● No concurrency control
● No reliability
● ...
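To make the "no indexes" point concrete, the sketch below contrasts a full scan with a lookup through a sorted index on age. It is an in-memory illustration only; real DBMS indexes are disk-based structures such as B-trees, and the names `scan`, `lookup`, and `age_index` are hypothetical.

```python
import bisect

# Sample rows as (name, dept, age); the values are illustrative.
rows = [
    ("Adam", "EE", 27),
    ("Bill", "CS", 20),
    ("Carol", "IE", 22),
    ("David", "AE", 30),
]


def scan(rows, max_age):
    """Without an index: examine every row on every query."""
    return [row for row in rows if row[2] < max_age]


# Build the index once: (age, row position) pairs sorted by age.
age_index = sorted((age, pos) for pos, (_, _, age) in enumerate(rows))
ages = [age for age, _ in age_index]


def lookup(max_age):
    """With an index: binary-search the cut-off, then fetch only matching rows."""
    cut = bisect.bisect_left(ages, max_age)
    return [rows[pos] for _, pos in age_index[:cut]]


print(lookup(25))
```

Both functions return the same rows, but `lookup` finds the cut-off in logarithmic time, while `scan` always touches the whole relation. The price is that the index must be kept up to date on every insert and delete.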
Data models
● Relational
● Key/Value
● Graph
● Document
● Column-family
● Array/Matrix
● Hierarchical
● Network
Data models
● Relational: most DBMS’s
● Key/Value: NoSQL
● Graph: NoSQL
● Document: NoSQL
● Column-family: NoSQL
● Array/Matrix: machine learning
● Hierarchical: obsolete
● Network: obsolete
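To give a feel for how some of these data models differ, the sketch below represents the same student in four of them using plain Python structures. It is a loose analogy rather than how any particular system stores data, and all names and values are hypothetical.

```python
import json

# Relational: a fixed schema plus rows that all follow it.
schema = ("name", "dept", "age")
relation = [("Adam", "EE", 27), ("Bill", "CS", 20)]

# Key/Value: the store sees only a key and an opaque value.
kv_store = {"student:Adam": "EE,27", "student:Bill": "CS,20"}

# Document: a self-describing, possibly nested record.
document = {
    "name": "Adam",
    "dept": {"code": "EE"},
    "age": 27,
    "courses": ["EE477"],
}

# Graph: nodes connected by labeled edges.
edges = [("Adam", "majors_in", "EE"), ("Adam", "takes", "EE477")]

# The same lookup expressed against each model.
adam_relational = dict(zip(schema, relation[0]))
adam_kv = kv_store["student:Adam"]
adam_doc_json = json.dumps(document, sort_keys=True)
adam_neighbors = [dst for src, _, dst in edges if src == "Adam"]

print(adam_relational, adam_kv, adam_neighbors)
```

Only the relational form can be filtered on `age` without the application knowing how the value is encoded, which is one reason the relational model suits general-purpose querying.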
Course outline (tentative)
● Relational model
● SQL (basics, aggregates, subqueries)
● Relational algebra
● ER model
● Design theory
● Secondary storage
● Indexing
● Query optimization
● Transactions (concurrency, recovery)
● NoSQL, NewSQL
[Figure labels: Schema design, SQL query, Parse Query, Select logical query plan, Select physical plan, Query execution, Disk]
To-do items
● Register on Classum (www.classum.com/K97DM88EG)
● Register on Gradiance
○ Use kaist_<your KAIST student ID> for username
● Log in to a Haedong machine and change your password (after Wednesday)
○ ID: your KAIST student id
○ Password: 4NSzdQS7cL<student id> (except <, >)
● We will post HW0 next Tuesday