
Introduction to Data Science

Data science is a rapidly growing field focused on data-driven decision-making, leveraging data availability, computational power, and advanced algorithms. It encompasses a life cycle that includes data collection, cleansing, analysis, and model deployment, requiring domain knowledge, programming skills, and mathematical understanding. Key processes like data extraction and cleaning are essential for ensuring high-quality data for accurate predictions and insights.

Uploaded by

Yasiru
Copyright
© All Rights Reserved

Introduction to Data Science
Introduction
• Data science is a buzzword
• It is one of today's trending, hot topics
• Why?
• Data availability
• Computational power
• Newly implemented algorithms
• Companies are mainly focused on "data-driven decisions"
What is data science
• It is a paradigm-changing term
• It has many definitions
• An overview:
• A collection of data sets
• Focused on many aspects, such as data cleansing and data wrangling
• A life cycle for obtaining cleaner data
• More advanced techniques are involved:
• Big Data
• Data Mining
More…

• To perform well in data science, you will need:
• Domain knowledge
• Programming skills
• Math/Statistics
• By using data science, organizations can:
• Improve decision making
• Make predictions
Data science is hard to pin down to one specific flow, but a general workflow can be defined as below.
• Start with a problem / define a hypothesis
• Collect data
• Clean data
• Perform other relevant activities (e.g. data imputation, normalizing)
• Analyze data (pattern mining, data mining)
• Represent the result: present it with useful insights in a way the "company" can understand
• Take your data set to AI (apply an appropriate algorithm for prediction, classification, etc.)
What is data?

• We will start with data
• What is information?
• Common formats in which data exists:
• Structured data
• Unstructured data
Image Credit - https://i1.wp.com/24pc.com.au/wp-content/uploads/2020/05/Structured-Data-Infographic.png?ssl=1
Steps required to follow for a predictive analytical project
Mostly focused on curating the most acceptable dataset for the problem:
• Data extraction
• Data cleansing
• Exploratory data analysis
• Data preprocessing
• Feature engineering
• Feature selection
Solve the problem using algorithms/techniques:
• Data analysis
• Model building
• Model validation
Maintain the model:
• Model deployment
• Monitoring the model
Data Extraction
Introduction
• Data extraction is the process of retrieving or "pulling out" data
from various sources and converting it into a structured format for
analysis.
Extracting data

• Interviews
• Questionnaires/Surveys
• APIs
• Cookies
Icebreaker
https://forms.gle/xCRDLrY4nuW7AvZZ9
Data Extraction methods
Here are some common methods used for data extraction:
• Surveys: Surveys involve asking a set of questions to a sample
of individuals, either through face-to-face interviews, phone
calls, or online questionnaires.
• Interviews: Interviews involve one-on-one conversations with
individuals or groups, often in a structured or semi-structured
format.
• Observation: Observational methods involve observing and
recording the behavior of individuals or groups in a natural or
controlled setting.
More…
• Experiments: Experiments involve manipulating one or more variables and
observing the effect on other variables, often in a controlled laboratory setting.
• Case studies: Case studies involve in-depth analysis of a single individual, group,
or event, often using multiple sources of data.
• Secondary data sources: Secondary data sources involve using existing data that
has been collected for other purposes, such as government statistics, academic
research, or social media data.
• File parsing: File parsing involves extracting data from various file formats, such as
CSV, XML, or JSON, using programming languages or software tools.
• OCR (Optical Character Recognition): OCR involves converting scanned images or
PDF files into machine-readable text using software tools.
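As a minimal sketch of the file parsing bullet above, the snippet below reads the same two records from CSV, JSON, and XML using only the Python standard library. The sample data, the field names (`name`, `age`), and the helper function names are all hypothetical and chosen only for illustration; real files would be opened from disk rather than inlined as strings.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data, inlined so the sketch runs without any files.
CSV_TEXT = "name,age\nAsha,31\nBen,27\n"
JSON_TEXT = '[{"name": "Asha", "age": 31}, {"name": "Ben", "age": 27}]'
XML_TEXT = (
    "<people>"
    "<person><name>Asha</name><age>31</age></person>"
    "<person><name>Ben</name><age>27</age></person>"
    "</people>"
)


def parse_csv(text):
    """Return one dict per CSV row; CSV values are strings, so cast age."""
    rows = csv.DictReader(io.StringIO(text))
    return [{"name": r["name"], "age": int(r["age"])} for r in rows]


def parse_json(text):
    """JSON already carries types, so the records can be used directly."""
    return json.loads(text)


def parse_xml(text):
    """Walk each <person> element and read its child elements."""
    root = ET.fromstring(text)
    return [
        {"name": p.findtext("name"), "age": int(p.findtext("age"))}
        for p in root.findall("person")
    ]


records_from_csv = parse_csv(CSV_TEXT)
records_from_json = parse_json(JSON_TEXT)
records_from_xml = parse_xml(XML_TEXT)
```

All three parsers end up with the same list of dictionaries, which is the "structured format" the later steps work on. Note that CSV and XML deliver every value as text, so numeric fields need an explicit cast.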
More…
• Sensor data: Sensor data involves collecting data from devices, such
as smartphones, wearables, or IoT devices, that record various types
of information, such as location, activity, or physiological signals.
• Manual data entry: Manual data entry involves manually extracting
data from paper documents or digital sources and entering it into a
structured format.
More…
• Web scraping: Web scraping involves automatically extracting data
from websites by writing programs that can navigate through web
pages, parse the HTML code, and extract the relevant data.
• Database queries: Database queries involve writing SQL (Structured
Query Language) queries to extract data from databases based on
specific criteria.
• APIs (Application Programming Interfaces): APIs are interfaces that
allow programs to interact with other software applications, such as
social media platforms, to extract data in a structured format.
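Two of the methods above can be sketched with the Python standard library alone. The first part runs a SQL query with a filter against an in-memory SQLite database; the second part shows only the HTML parsing step of web scraping, on a hard-coded page, so nothing is downloaded. The `sales` table, its rows, the sample page, and the `LinkCollector` class are hypothetical examples, not part of the original slides.

```python
import sqlite3
from html.parser import HTMLParser

# --- Database query: an in-memory SQLite table with hypothetical rows ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 45.5)],
)
# The WHERE clause holds the "specific criteria"; the ? placeholder keeps
# the filter value separate from the SQL text.
north_rows = conn.execute(
    "SELECT region, amount FROM sales WHERE region = ? ORDER BY amount",
    ("north",),
).fetchall()
conn.close()


# --- Web scraping (parsing step only): collect link targets from HTML ---
class LinkCollector(HTMLParser):
    """Record the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical page content; a real scraper would fetch this over HTTP.
PAGE = '<html><body><a href="/a">A</a><p>text</p><a href="/b">B</a></body></html>'
collector = LinkCollector()
collector.feed(PAGE)
links = collector.links
```

In a real project the page would be downloaded first, and the site's terms of use should be checked before scraping. When a site offers an API, that is usually the more stable way to get the same data.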
Data cleaning
Introduction
• Data cleaning involves identifying and correcting errors,
inconsistencies, missing values, and other issues in the data
• Primary purpose is to improve its quality and reliability
• This phase is essential for accurate predictions and sound
decision making
Techniques
• Removing duplicates: This involves identifying and
removing rows that contain duplicate data.
• Handling missing values: This involves identifying
missing data and either removing rows or filling in
the missing values using techniques such as mean
imputation, median imputation, or interpolation.
• Removing outliers: This involves identifying extreme
values that are far from the average and either
removing them or transforming them to be more in
line with the rest of the data.
• Standardizing data: This involves converting data into
a common format or scale so that it can be
compared and analyzed more easily.
• Encoding categorical variables: This involves
converting categorical data into numerical form so
that it can be used in machine learning algorithms.
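The five techniques above can be sketched in a few lines of pandas. This is one possible way to do each step, not the only one: the data set is hypothetical, median imputation is chosen over the mean because the sample contains an extreme value, outliers are flagged with the common 1.5 × IQR rule (an assumption, the slides do not name a rule), and standardization is done as a z-score.

```python
import pandas as pd

# Hypothetical raw data with a duplicate row, a missing value, and an outlier.
raw = pd.DataFrame(
    {
        "city": ["Kandy", "Galle", "Galle", "Kandy", "Jaffna", "Galle"],
        "income": [50.0, 52.0, 52.0, None, 48.0, 500.0],
    }
)

# 1. Remove duplicates: drop rows that exactly repeat an earlier row.
df = raw.drop_duplicates().reset_index(drop=True)

# 2. Handle missing values: fill gaps with the column median.
df["income"] = df["income"].fillna(df["income"].median())

# 3. Remove outliers: keep values within 1.5 * IQR of the quartiles.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[in_range].reset_index(drop=True)

# 4. Standardize: rescale income to mean 0 and standard deviation 1.
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()

# 5. Encode categorical variables: one indicator column per city.
df = pd.get_dummies(df, columns=["city"])
```

The order matters here: imputing before outlier removal means the fill value is computed with the extreme value still present, which is why the median is the safer choice in this sketch.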
Steps
• Data auditing: This involves examining the data to identify
potential issues, such as missing values, duplicates, outliers,
inconsistencies, and formatting errors.
• Data correction: This involves correcting errors and
inconsistencies in the data, such as fixing typos, standardizing
the format of data, and filling in missing values.
• Data transformation: This involves transforming the data into a
format that is suitable for analysis by applying techniques such
as feature scaling, normalization, encoding categorical
variables, and reducing dimensionality.
• Data integration: This involves combining data from multiple
sources to create a single dataset, which may involve resolving
differences in variable names, formats, and units of
measurement.
• Data verification: This involves verifying that the data has been
cleaned and transformed correctly and that there are no
remaining errors or inconsistencies.
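A rough sketch of how auditing, integration, correction, and verification can fit together in pandas is shown below. The two source tables, their column names (`height_cm`, `height_in`), and the `audit` helper are hypothetical. The audit here only counts missing values and duplicate rows; a fuller audit would also look at outliers and formatting errors.

```python
import pandas as pd


def audit(df):
    """Summarize common data-quality issues as simple counts."""
    return {
        "rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Two hypothetical sources that disagree on column name and unit.
site_a = pd.DataFrame({"id": [1, 2, 2], "height_cm": [170.0, None, None]})
site_b = pd.DataFrame({"id": [3, 4], "height_in": [65.0, 70.0]})

# Data integration: convert inches to centimetres, align the column name,
# then stack the two sources into a single dataset.
site_b = site_b.assign(height_cm=site_b["height_in"] * 2.54).drop(columns="height_in")
combined = pd.concat([site_a, site_b], ignore_index=True)

# Data auditing: count the issues before any fix is applied.
before = audit(combined)

# Data correction: drop duplicate rows, then fill gaps with the median.
cleaned = combined.drop_duplicates().reset_index(drop=True)
cleaned["height_cm"] = cleaned["height_cm"].fillna(cleaned["height_cm"].median())

# Data verification: rerun the same audit and expect a clean report.
after = audit(cleaned)
```

Reusing one audit function before and after cleaning gives a simple, repeatable check that the corrections actually removed the issues that were found.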
Practical
