Introduction to Data Science
Introduction
• Data science is a buzzword
• One of the trending and hot topics nowadays
• Why?
• Data availability
• Computational power
• Newly implemented algorithms
• Companies are mainly focused on making “data-driven decisions”
What is data science
• It is a paradigm-changing term
• There are many definitions
• An overview:
• A collection of data sets
• Focused on many aspects, such as data cleansing and data wrangling
• A life cycle for producing cleaner data
• More advanced techniques involved:
• Big Data
• Data Mining
More…
Will start with:
• What is data
• What is information
• Common formats of data existence:
• Structured data
• Unstructured data
Image Credit - https://i1.wp.com/24pc.com.au/wp-content/uploads/2020/05/Structured-Data-Infographic.png?ssl=1
Steps to follow in a predictive analytics project
• Data extraction
• Data cleansing
• Exploratory Data Analysis
• Data preprocessing
• Feature engineering
• Feature selection
(Mostly focused on curating the most acceptable dataset for the problem)
• Data Analysis
• Model building
• Model validation
(Solve the problem using algorithms/techniques)
• Model deployment
• Monitoring the model
(Maintain the model)
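As a rough illustration, the sketch below walks through a minimal version of this workflow with pandas and scikit-learn. The file name (data.csv), the target column name, and the choice of model are assumptions made for illustration, not part of the steps themselves.

```python
# A minimal sketch of the workflow above, assuming a CSV file "data.csv"
# with numeric feature columns and a "target" column (placeholder names).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data extraction
df = pd.read_csv("data.csv")

# Data cleansing: drop duplicate rows and rows with missing values
df = df.drop_duplicates().dropna()

# Feature selection: keep everything except the target (placeholder choice)
X = df.drop(columns=["target"])
y = df["target"]

# Model building and validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = Pipeline([
    ("scale", StandardScaler()),    # data preprocessing
    ("clf", LogisticRegression()),  # model building
])
model.fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_test, model.predict(X_test)))
```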
Data Extraction
Introduction
• Data extraction is the process of retrieving or "pulling out" data from various sources and converting it into a structured format for analysis.
Extracting data
Interviews
Questionnaires/Surveys
APIs
Cookies
Icebreaker
https://forms.gle/xCRDLrY4nuW7AvZZ9
Data Extraction methods
Here are some common methods used for data extraction:
• Surveys: Surveys involve asking a set of questions to a sample
of individuals, either through face-to-face interviews, phone
calls, or online questionnaires.
• Interviews: Interviews involve one-on-one conversations with
individuals or groups, often in a structured or semi-structured
format.
• Observation: Observational methods involve observing and
recording the behavior of individuals or groups in a natural or
controlled setting.
More…
• Experiments: Experiments involve manipulating one or more variables and
observing the effect on other variables, often in a controlled laboratory setting.
• Case studies: Case studies involve in-depth analysis of a single individual, group,
or event, often using multiple sources of data.
• Secondary data sources: Secondary data sources involve using existing data that
has been collected for other purposes, such as government statistics, academic
research, or social media data.
• File parsing: File parsing involves extracting data from various file formats, such as
CSV, XML, or JSON, using programming languages or software tools (see the sketch after this list).
• OCR (Optical Character Recognition): OCR involves converting scanned images or
PDF files into machine-readable text using software tools.
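A minimal sketch of the file-parsing method above, using only Python's standard library; the file names records.csv and records.json are placeholders.

```python
# A small sketch of file parsing, assuming local files "records.csv"
# and "records.json" exist (placeholder names).
import csv
import json

# Parse a CSV file into a list of dictionaries (one per row)
with open("records.csv", newline="") as f:
    csv_rows = list(csv.DictReader(f))

# Parse a JSON file into Python objects (dicts, lists, strings, numbers)
with open("records.json") as f:
    json_data = json.load(f)

print(len(csv_rows), "CSV rows loaded")
print(type(json_data))
```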
More…
• Sensor data: Sensor data involves collecting data from devices, such
as smartphones, wearables, or IoT devices, that record various types
of information, such as location, activity, or physiological signals.
• Manual data entry: Manual data entry involves manually extracting
data from paper documents or digital sources and entering it into a
structured format.
More…
• Web scraping: Web scraping involves automatically extracting data
from websites by writing programs that can navigate through web
pages, parse the HTML code, and extract the relevant data.
• Database queries: Database queries involve writing SQL (Structured
Query Language) queries to extract data from databases based on
specific criteria.
• APIs (Application Programming Interfaces): APIs are interfaces that
allow programs to interact with other software applications, such as
social media platforms, to extract data in a structured format.
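The following sketches illustrate these three methods with commonly used Python libraries (requests, BeautifulSoup, sqlite3). The URLs, database file, table, and column names are all placeholders, not real services.

```python
# Illustrative sketches only; endpoints, database, and schema are made up.
import sqlite3
import requests
from bs4 import BeautifulSoup

# Web scraping: fetch a page, parse the HTML, and extract all link texts
html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "html.parser")
links = [a.get_text(strip=True) for a in soup.find_all("a")]

# Database query: extract rows matching specific criteria with SQL
conn = sqlite3.connect("sales.db")  # placeholder database file
rows = conn.execute(
    "SELECT customer, amount FROM orders WHERE amount > ?", (100,)
).fetchall()

# API: request data in a structured (JSON) format from a web service
response = requests.get("https://api.example.com/items", params={"page": 1})
items = response.json()
```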
Data cleaning
Introduction
• Data cleaning involves identifying and correcting errors,
inconsistencies, missing values, and other issues in the data
• The primary purpose is to improve the data's quality and reliability
• This phase is essential for accurate prediction and decision making
Techniques
• Removing duplicates: This involves identifying and removing rows that contain duplicate data.
• Handling missing values: This involves identifying missing data and either removing rows or filling in the missing values using techniques such as mean imputation, median imputation, or interpolation.
• Removing outliers: This involves identifying extreme values that are far from the average and either removing them or transforming them to be more in line with the rest of the data.
• Standardizing data: This involves converting data into a common format or scale so that it can be compared and analyzed more easily.
• Encoding categorical variables: This involves converting categorical data into numerical form so that it can be used in machine learning algorithms.
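A short sketch of these techniques using pandas; the tiny DataFrame and its column names (age, income, city) are made up purely for illustration.

```python
# Toy example of the cleaning techniques above (invented data).
import pandas as pd

df = pd.DataFrame({
    "age": [25, 25, 40, None, 200],   # duplicate row, missing value, extreme value
    "income": [50000, 50000, 62000, 58000, 61000],
    "city": ["Colombo", "Colombo", "Kandy", "Galle", "Kandy"],
})

# Removing duplicates
df = df.drop_duplicates()

# Handling missing values: mean imputation
df["age"] = df["age"].fillna(df["age"].mean())

# Removing outliers: keep values within 3 standard deviations of the mean
z = (df["age"] - df["age"].mean()) / df["age"].std()
df = df[z.abs() <= 3]

# Standardizing data: rescale income to zero mean and unit variance
df["income"] = (df["income"] - df["income"].mean()) / df["income"].std()

# Encoding categorical variables: one-hot encode the city column
df = pd.get_dummies(df, columns=["city"])
```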
Steps
• Data auditing: This involves examining the data to identify potential issues, such as missing values, duplicates, outliers, inconsistencies, and formatting errors.
• Data correction: This involves correcting errors and inconsistencies in the data, such as fixing typos, standardizing the format of data, and filling in missing values.
• Data transformation: This involves transforming the data into a format that is suitable for analysis by applying techniques such as feature scaling, normalization, encoding categorical variables, and reducing dimensionality.
• Data integration: This involves combining data from multiple sources to create a single dataset, which may involve resolving differences in variable names, formats, and units of measurement.
• Data verification: This involves verifying that the data has been cleaned and transformed correctly and that there are no remaining errors or inconsistencies.
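A brief sketch of the integration and verification steps with pandas; the two toy DataFrames and their differing key names are assumptions made for illustration.

```python
# Toy example of data integration and verification (invented data).
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Amal", "Nimal", "Kamal"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [1500, 700, 2400]})

# Data integration: resolve the differing key names, then combine the sources
orders = orders.rename(columns={"customer_id": "cust_id"})
merged = customers.merge(orders, on="cust_id", how="left")

# Data verification: confirm there are no missing keys or duplicate rows left
assert merged["cust_id"].notna().all()
assert not merged.duplicated().any()
print(merged)
```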
Practical