EE477 Lecture 1 - Introduction

This document provides an overview and introduction to a database and big data systems course. It outlines the course topics, staff, schedule, assignments, exams, prerequisites, and other logistical details. The course will cover relational databases, Spark, and big data fundamentals and technologies.

Uploaded by

Đỗ Duy Thảo

EE477

Database and Big Data Systems

Steven Euijong Whang

1


Agenda
● Introduction
● Course logistics
● Database overview

2
The 4th industrial revolution is upon us

Sources: Korea Times, Wikipedia, NCTA

3
And the amount of data is exploding

Sources: Intel, domo.com, zdnet.com
4
Scary and exciting!
1 GB ≈ 1 cup of coffee

1 ZB = 1,000,000,000,000 GB ≈ 1 Great Wall of China

Source: IDC

5
No surprise: data skills are in high demand

“Lack of manpower obstacle to (data business) market growth”

6
Data science is also becoming important in manufacturing

Source: Naver News

7


But there is also a myriad of Big data technologies

Source: Matt Turck


8
Which techniques should I learn?
Look at the users
● Don’t bother for now
● Hadoop & NoSQL: growing
● Relational databases are still by far the most dominant technology

10
Enter EE477 Database and Big Data Systems
● We will cover:
● Relational Databases
○ Necessary fundamentals for Big data
● Spark
○ Best Big data system in Hadoop & NoSQL category
● Companion course: EE412 (opens in Fall)
○ EE412: Big Data Analytics
○ EE477: Database & Big Data Systems
(Figure labels: Theory, Algorithms; Machine Learning)

11
Staff
● Instructor: Steven E. Whang
○ Leads the Data Intelligence Lab
○ Office hours: Mon/Wed 9:30am ~ 10:30am
○ Office: N1 516, Email: swhang@kaist.ac.kr
● TAs: Chanho Lee (Head TA), Junseok Seo,
Hyunseung Hwang, Minsu Kim, Soyeon Kim

12
Staff
● TA Office hours
○ Chanho Lee: Wednesday 1:00pm ~ 2:00pm
○ Junseok Seo: Tuesday 3:00pm ~ 4:00pm
○ Hyunseung Hwang: Monday 1:00pm ~ 2:00pm
○ Minsu Kim: Friday 3:00pm ~ 4:00pm
○ Soyeon Kim: Thursday 4:00pm ~ 5:00pm
● Email: kaist.ee477.staff@gmail.com

13
Course work and grading
● 5 Homework assignments (40%)
● Midterm (25%)
● Final exam (30%)
● Gradiance quizzes (4%)
● Classum participation (1%)
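
The weights above add up to 100%. As a minimal sketch of how they combine, assuming every component is scored on a 0-100 scale (the function name and the sample scores are hypothetical, and the slides do not describe the actual grading procedure):

```python
# Component weights from the slide, in percent.
WEIGHTS = {
    "homework": 40,
    "midterm": 25,
    "final": 30,
    "gradiance": 4,
    "classum": 1,
}


def weighted_total(scores):
    """Combine per-component scores (each 0-100) into one 0-100 total."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example = {
    "homework": 90,
    "midterm": 80,
    "final": 70,
    "gradiance": 100,
    "classum": 100,
}
print(weighted_total(example))
```

The homework, midterm, and final together carry 95% of the total, so the quizzes and Classum participation only matter at the margin.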

14
Course materials
● Textbook: Database Systems: The Complete Book (2nd edition, International Version)
○ PDF searchable online
○ 5 copies in KAIST library
● Lecture slides and lecture videos will be posted on KLMS
○ Google drive containing all lectures
○ Please check for updates

15
Communication
● KLMS
○ Lecture slides (please check for updates)
○ Lecture videos
● Classum
○ To join: use code K97DM88EG (link: www.classum.com/K97DM88EG)
○ Use Classum for all questions and public communication
○ We will post course announcements to Classum
● Staff mailing list
○ kaist.ee477.staff@gmail.com
○ Only ask questions not suitable for Classum
● Anonymous feedback
○ Link: https://goo.gl/forms/pE2CwGfeinf9DQW23
○ I appreciate any anonymous feedback to improve the course
16
Gradiance
● Online quizzes and SQL labs designed for self study
○ Please do not ask questions on Classum unless really necessary
● We will give 2 weeks per quiz (no late days!)
● To register on Gradiance, please use kaist_<your KAIST student ID> as your username (e.g., kaist_20241234)
○ We need to match your Gradiance ID with your student ID
○ Sign up at www.gradiance.com/services and enter class code 1B74397B
● You can submit as many times as you want, and your score is based on the most recent submission
○ The point is to learn the material rather than to chase higher scores
● After the due date, you can see the solutions by looking at your submission

17
Homeworks
● Five homeworks
○ SQL
○ Relational algebra
○ Database design
○ Web application using database (so you too can make $1B)
○ Spark
● We will use Python, Google Cloud SQL (MySQL), and Apache Spark
● Due at 11:59pm
● Each homework will have specific instructions on what to submit

18
Homework Schedule (tentative)
Date          Out     In
03/05 (Tue)   HW 0    -
03/12 (Tue)   HW 1    -
03/26 (Tue)   HW 2    HW 0, 1
04/16 (Tue)   HW 3    HW 2
05/07 (Tue)   HW 4    HW 3
05/28 (Tue)   HW 5    HW 4
06/15 (Sat)   -       HW 5

19
Late Day Policies for Homeworks
● Homeworks 1~4
○ 4 late days (in 24-hour chunks) can be used for any of these homeworks
○ Once the late days are exhausted, we will reduce your score by 20% per late day
○ However, no homework will be accepted more than 4 days after its due date
● Homework 5
○ No late submission allowed as we need to submit grades by 6/20
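
The late-day rules for Homeworks 1~4 can be read as a small calculation. The sketch below is one possible interpretation, not the staff's official procedure: it assumes lateness is rounded up to whole 24-hour chunks, that remaining free late days are always applied before any penalty, and that the 20% reduction is measured in percentage points per day. The function name and its arguments are hypothetical.

```python
import math

MAX_DAYS_LATE = 4       # nothing is accepted more than 4 days after the due date
PENALTY_PER_DAY = 20    # percent per late day once the free late days are used up


def late_penalty(hours_late, free_days_left):
    """Return (penalty_percent, free_days_left_after), or None if rejected.

    Assumption: free late days are consumed first, then each extra day
    costs PENALTY_PER_DAY percentage points.
    """
    if hours_late <= 0:
        return 0, free_days_left
    days_late = math.ceil(hours_late / 24)  # late days come in 24-hour chunks
    if days_late > MAX_DAYS_LATE:
        return None  # too late: the homework is not accepted
    free_used = min(days_late, free_days_left)
    penalized_days = days_late - free_used
    return PENALTY_PER_DAY * penalized_days, free_days_left - free_used


print(late_penalty(30, 4))   # 30 hours late with all 4 free days available
print(late_penalty(30, 1))   # 30 hours late with only 1 free day left
print(late_penalty(100, 4))  # more than 4 days late
```

Homework 5 is outside this calculation because no late submission is allowed for it.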

20
Google Cloud Platform (GCP)
● This class is supported by a generous GCP Credit Award
● See the instructions for using GCP in HW0
● Cloud is the future, so KAIST should go cloud as well
● Add the skills you learn to your resume :-)

21
Exams
● Midterm
○ 4/17 (Wed) 9am-12pm
○ Covers first half of course
○ Other details TBD
● Final exam
○ 6/12 (Wed) 9am-12pm
○ Covers second half of course
○ Other details TBD

22
Class Participation
● We will not check class participation
○ Please come to class for the interaction
● Instead, we encourage you to answer questions on Classum
○ We will give credit for both the number of answers you post and the number of “likes” you receive

23
Grading
● Letter grade (A, B, C, …) only

24
Prerequisites
● These courses or anything equivalent
○ CS101: Python intro
○ EE205: data structures and algorithms
○ EE209: programming structures
○ EE213: discrete methods

25
Course policy -- VERY IMPORTANT!
● We will check for plagiarism
○ Do not copy solutions from other people or the Web
○ We will use MOSS, which is a robust tool for catching common code
■ Doing the homework yourself is easier than trying to fool it
○ Copying from ChatGPT is also considered plagiarism
○ If plagiarism is confirmed, the standard penalty is automatic failure (F)
● You may discuss with others the algorithm(s) used to solve a homework problem, as long as you mention their name(s) on the work you submit
● You should not use others’ code or look at others’ code while you write your own
● See the EE academic code of ethics

26
Workload
● Gradiance quizzes should take at most a few hours
● Some HWs may take >10 hours
● You may form study groups

27
Programming environment
● Your own computer
● Lab machines in Haedong lounge
○ Linux server machines running Ubuntu
○ Address range: eelabg#.kaist.ac.kr (#: 1~36)
○ Remote access (always powered on): eelabg1 and eelabg2
○ Use SSH to connect to a lab machine
○ We will create your accounts soon (due to maintenance, please log in after Wednesday)
■ We will keep creating accounts for newly enrolled students
■ ID: your KAIST student ID
■ Password: 4NSzdQS7cL<student id> (without the < and >)
■ Please change your password with the ‘passwd’ command on the Linux system (not in Windows)
○ See https://ee.kaist.ac.kr/node/15074

28
Acknowledgements
● Some slides of this course are adapted from or inspired by these excellent courses
○ UW CSE344: Introduction to Data Management by Dan Suciu
○ SFU CMPT 354: Database Systems 1 by Jiannan Wang
○ Columbia W4111: Introduction to Databases by Eugene Wu
○ Stanford CS145: Data Management and Data Systems by Narayanan Shivakumar
○ Stanford CS245: Database Systems by Peter Bailis / Hector Garcia-Molina
○ CMU 15-415: Database Applications by Andy Pavlo and Christos Faloutsos
○ CMU 15-445: Database Systems by Andy Pavlo

29
Joke Break

30
Database
● What is a database?
○ A collection of files with related data
● Examples
○ Google and Naver (web documents)
○ Youtube (videos)
○ Amazon (products)
○ Banking systems (account information)
○ Airline systems (flights)
○ Corporate record keeping (employee information)
○ KLMS and portal (students and courses)
○ and many others

31
Database management system (DBMS)
● A piece of software that manages databases
○ Allows users to create new databases and specify their schemas (logical structure of data)
○ Gives users the ability to query and modify the data using a query language
○ Supports the storage of very large amounts of data
○ Enables recovery of the database in the face of failures, errors, or intentional misuse
○ Controls access to data from many users at once
● Examples
○ Commercial: Oracle, Microsoft SQL Server, IBM DB2, …
○ Open source: MySQL, SQLite, PostgreSQL, …
○ Cloud: Google Cloud SQL, Google Spanner, Microsoft Azure SQL Database, …
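
To make the first two capabilities concrete, here is a minimal sketch using SQLite through Python's built-in sqlite3 module. The table name and sample rows are assumptions for illustration (the rows reuse the student example that appears later in these slides); the course homeworks use Google Cloud SQL (MySQL), not SQLite.

```python
import sqlite3

# A throwaway in-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Create a database object and specify its schema.
conn.execute(
    "CREATE TABLE student (name TEXT PRIMARY KEY, dept TEXT, age INTEGER)"
)

# Modify the data through the query language.
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [("Adam", "EE", 27), ("Bill", "CS", 20), ("Carol", "IE", 22), ("David", "AE", 30)],
)
conn.execute("UPDATE student SET age = age + 1 WHERE name = 'Bill'")

# Query the data declaratively: say what you want, not how to find it.
rows = conn.execute(
    "SELECT name, age FROM student WHERE age < 25 ORDER BY name"
).fetchall()
print(rows)

conn.close()
```

The program never says how the rows are stored or scanned; that separation between the logical schema and the physical storage is what the DBMS provides.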

32
Brief history
● Late 1960’s
○ First commercial DBMS’s appear
○ Used tree-based “hierarchical” and graph-based “network” data models
○ No high-level query language
● 1970
○ Ted Codd proposes relations (data organized as tables); he receives the Turing Award in 1981
○ Relational model provides data independence
○ SQL (Structured Query Language) is used
● 1990’s
○ Relational database systems become the norm
○ IBM DB2, Informix, Sybase, Oracle, SQL Server, MySQL, PostgreSQL, ...

33
Brief history
● 2000’s
○ Internet boom (Web 2.0), need high availability and scalability (10 to 1M users quickly)
○ NoSQL systems are introduced, but are non-relational and sacrifice ACID transactions
○ MapReduce/Hadoop, Spark, Bigtable, Cassandra, MongoDB, ...
● 2010’s and beyond
○ NewSQL systems support relational data and ACID transactions while having the same OLTP performance as NoSQL systems
○ CockroachDB, Google Spanner, Amazon Aurora, ...

Observations
- Relational databases are super important
- NoSQL systems are on the rise and evolving

35
Turing awards in data management
● Charles Bachman, 1973
○ IDS and CODASYL
● Ted Codd, 1981
○ Relational model
● Jim Gray, 1998
○ Transaction processing
● Michael Stonebraker, 2014
○ INGRES and Postgres
● Jeff Ullman, 2020
○ Compilers and algorithms
○ Also wrote our Database textbook and founded Gradiance :-)

Turing Award ≈ Nobel Prize of Computing

36
Why is implementing a DBMS hard?
● At first glance, looks simple
○ Relations -> Query -> Results
● Suppose we want to manage student information
● Naive implementation
○ Store a relation in an ASCII file
○ Use shell to run queries
● What could go wrong?
○ Can this be used in KLMS?

ASCII file:
Adam # EE # 27
Bill # CS # 20
Carol # IE # 22
David # AE # 30
...

Example query:
% select *
  from R
  where R.age < 25
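
The naive implementation is easy to write, which is part of why it looks tempting. A minimal sketch in Python, assuming the file uses " # " as the field separator exactly as shown on the slide (the `scan` helper is hypothetical):

```python
# The "relation" R, stored as plain text with '#' as the field separator.
ASCII_FILE = """\
Adam # EE # 27
Bill # CS # 20
Carol # IE # 22
David # AE # 30
"""


def scan(text):
    """Parse every line into a (name, dept, age) tuple: a full scan of R."""
    for line in text.splitlines():
        name, dept, age = (field.strip() for field in line.split("#"))
        yield name, dept, int(age)


# Equivalent of: select * from R where R.age < 25
result = [row for row in scan(ASCII_FILE) if row[2] < 25]
print(result)
```

Every query re-reads and re-parses the whole file, a value containing "#" breaks the format, and nothing protects the file if two programs write to it at once.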

37
Problems with naive implementation
● Storage and deletion are expensive
● Search is expensive (no indexes)
● Query processing is slow
● No concurrency control
● No reliability
● ...
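
Two of these problems, search without indexes and lack of reliability, can be contrasted with what a real DBMS offers. The sketch below uses SQLite via Python's sqlite3 module as a stand-in; the index name and the simulated failure are assumptions for illustration, and the exact query-plan text depends on the SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (name TEXT, dept TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?)",
    [("Adam", "EE", 27), ("Bill", "CS", 20), ("Carol", "IE", 22), ("David", "AE", 30)],
)
# An index lets the DBMS avoid reading every row for a range search on age.
conn.execute("CREATE INDEX idx_r_age ON R(age)")
conn.commit()

# Ask the DBMS how it plans to run the query (the text varies by version).
plan = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM R WHERE age < 25").fetchall()
plan_text = " ".join(str(step[-1]) for step in plan)
print(plan_text)

# Reliability: a transaction that fails midway is rolled back as a whole.
try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("DELETE FROM R")
        raise RuntimeError("simulated crash in the middle of an update")
except RuntimeError:
    pass

count = conn.execute("SELECT COUNT(*) FROM R").fetchone()[0]
print(count)  # the DELETE was undone, so all rows are still there
conn.close()
```

With the ASCII-file approach, a crash halfway through rewriting the file would leave it in a partially updated state with no way to undo the change.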

38
Data models
● Relational (most DBMS’s)
● Key/Value (NoSQL)
● Graph (NoSQL)
● Document (NoSQL)
● Column-family (NoSQL)
● Array/Matrix (machine learning)
● Hierarchical (obsolete)
● Network (obsolete)
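
As a rough illustration of how some of these models differ, the sketch below represents the same small piece of student data in several of them using plain Python structures. All names and values are hypothetical, and real systems add storage, indexing, and query languages on top of these shapes; the column-family, hierarchical, and network models are left out for brevity.

```python
# Relational: flat rows that all follow one fixed schema (name, dept, age).
relation = [("Adam", "EE", 27), ("Bill", "CS", 20)]

# Key/value: an opaque value looked up by a single key.
key_value = {"student:Adam": "EE,27", "student:Bill": "CS,20"}

# Document: self-describing records that may be nested.
documents = [
    {"name": "Adam", "dept": "EE", "age": 27, "courses": ["EE477"]},
    {"name": "Bill", "dept": "CS", "age": 20, "courses": []},
]

# Graph: nodes connected by labeled edges.
edges = [("Adam", "enrolled_in", "EE477")]

# Array/matrix: numeric values addressed by position (one row per student).
matrix = [[27.0], [20.0]]

# The same question, "how old is Adam?", asked against three of the models.
age_rel = next(age for name, _, age in relation if name == "Adam")
age_kv = int(key_value["student:Adam"].split(",")[1])
age_doc = next(doc["age"] for doc in documents if doc["name"] == "Adam")
print(age_rel, age_kv, age_doc)
```

In the key/value version the application has to know how to decode the value, whereas the relational and document versions carry enough structure for the system itself to answer the question.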

40
Course outline (tentative)
● Relational model
● SQL (basics, aggregates, subqueries)
● Relational algebra
● ER model
● Design theory
● Secondary storage
● Indexing
● Query optimization
● Transactions (concurrency, recovery)
● NoSQL, NewSQL

Figure (query processing pipeline): Schema design -> SQL query -> Parse Query -> Select logical query plan -> Select physical plan -> Query execution -> Disk

41
To-do items
● Register on Classum
● Register on Gradiance
○ Use kaist_<your KAIST student ID> for username
● Log in to a Haedong machine and change your password (after Wednesday)
○ ID: your KAIST student id
○ Password: 4NSzdQS7cL<student id> (except <, >)
● We will post HW0 next Tuesday

42
