Unit-1


Introduction to Data Science
(IS1100-1)

Abhishek S. Rao
Evolution of Data Science
The evolution of Data Science has unfolded in several phases. It all started with
Statistics.
Simple statistical models have been employed to collect, analyze, and manage data
since the early 1800s. These principles underwent various modulations over time until
the rise of the digital age.
Once computers were introduced as mainstream public devices, the industry shifted
into the Digital Age.
A flood of data and digital information was created. This resulted in statistical
practices and models being computerized, giving rise to digital analytics.
Then came the rise of the internet, which grew the available data exponentially,
giving rise to what we know as Big Data.
This explosion of information available to the masses created the need for
expertise to process, manage, analyze, and visualize this data for decision making using
various models. This gave birth to the term Data Science.
What is Data Science?
Data science is the science of analyzing raw data using statistics and
machine learning techniques with the purpose of drawing conclusions
about that information.
What does the future hold?
Splitting of Data Science: Various designations and role descriptions are now associated
with data science, such as Data Analyst, Data Engineer, Data Visualization Specialist,
Data Architect, Machine Learning Engineer, and Business Intelligence Analyst, to name a few.
Data Explosion: Today enormous amounts of data are being produced daily. Every
organization is dependent on the data being created for its processes. Whether it is
medicine, entertainment, sports, manufacturing, agriculture, or transport, it is all
dependent on data.
Rise of Automation: With an increase in the complexity of operations, there is a constant
drive to simplify processes. In the future, most machine learning frameworks are expected
to contain libraries of models that are pre-structured and pre-trained. This would bring
about a paradigm shift in the work of a Data Scientist. Soft skills like Data
Visualization would come to the forefront of a Data Scientist’s skill set.
Data Science Lifecycle
[Lifecycle diagram; recoverable steps: Step 2: Data collection and preparation; Step 5: Deployment and maintenance]
Applications of Data Science
Toolboxes for Data Scientists
Why Python?
• Python is a mature programming language, but it also has excellent properties for
newcomers, making it ideal for people who have never programmed before.
• Some of the most remarkable of those properties are easy-to-read code, the omission
of non-mandatory delimiters, dynamic typing, and dynamic memory usage.
• Python is an interpreted language, so the code is executed immediately in the Python
console without needing the compilation step to machine language.
• Currently, Python is one of the most flexible programming languages. One of its main
characteristics that makes it so flexible is that it can be seen as a multiparadigm
language.
Python Libraries for Data Science
Many popular Python toolboxes/libraries:
• NumPy
• SciPy
• Pandas
• SciKit-Learn

Visualization libraries:
• matplotlib
• Seaborn

and many more …

(All of these libraries are installed on the SCC.)


Python Libraries for Data Science
NumPy:
 introduces objects for multidimensional arrays and matrices, as well as
functions that make it easy to perform advanced mathematical and statistical
operations on those objects

 provides vectorization of mathematical operations on arrays and matrices,
which significantly improves performance

 many other Python libraries are built on NumPy

Link: http://www.numpy.org/
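A minimal sketch of what vectorization looks like in practice (the array values here are arbitrary):

import numpy as np

# Element-wise arithmetic runs in compiled code, with no explicit Python loop
a = np.arange(1_000_000)
b = a * 2 + 1

# Multidimensional arrays support statistical and matrix operations directly
m = np.array([[1.0, 2.0], [3.0, 4.0]])
print(m.mean(), m.std())   # mean and standard deviation over all elements
print(m.T @ m)             # matrix product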

Python Libraries for Data Science
SciPy:
 collection of algorithms for linear algebra, differential equations, numerical
integration, optimization, statistics and more

 part of SciPy Stack

 built on NumPy

Link: https://www.scipy.org/scipylib/
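A minimal sketch of two typical SciPy tasks, numerical integration and scalar optimization (the functions integrated and minimized here are arbitrary examples):

import numpy as np
from scipy import integrate, optimize

# Numerical integration: integral of sin(x) over [0, pi] (exact value is 2)
value, error = integrate.quad(np.sin, 0, np.pi)
print(value)

# Optimization: minimum of (x - 3)^2, which lies at x = 3
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)
print(result.x)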

Python Libraries for Data Science
Pandas:
 adds data structures and tools designed to work with table-like data (similar
to Series and Data Frames in R)

 provides tools for data manipulation: reshaping, merging, sorting, slicing,
aggregation, etc.

 allows handling of missing data

Link: http://pandas.pydata.org/
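A minimal sketch of these tools on a small made-up table (the column names and values are invented for illustration):

import numpy as np
import pandas as pd

# A small table-like dataset (a DataFrame); the values are hypothetical
df = pd.DataFrame({
    "city":  ["A", "B", "A", "B"],
    "sales": [250, 300, np.nan, 280],
})

print(df.groupby("city")["sales"].mean())             # aggregation
print(df.sort_values("sales"))                        # sorting
df["sales"] = df["sales"].fillna(df["sales"].mean())  # handling missing data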

Python Libraries for Data Science
SciKit-Learn:
 provides machine learning algorithms: classification, regression, clustering,
model validation etc.

 built on NumPy, SciPy and matplotlib

Link: http://scikit-learn.org/
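A minimal sketch of the usual fit/predict workflow, using the bundled Iris dataset and a logistic regression classifier (one of many possible model choices):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=200)  # a simple classification model
model.fit(X_train, y_train)               # training
print(accuracy_score(y_test, model.predict(X_test)))  # model validation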

Python Libraries for Data Science
matplotlib:
 Python 2D plotting library which produces publication-quality figures in a
variety of hardcopy formats

 a set of functionalities similar to those of MATLAB

 line plots, scatter plots, bar charts, histograms, pie charts, etc.

 relatively low-level; some effort is needed to create advanced visualizations

Link: https://matplotlib.org/
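A minimal sketch of a line plot and a scatter plot saved to a hardcopy format (the data plotted here is arbitrary):

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label="sin(x)")                  # line plot
plt.scatter(x[::10], np.cos(x[::10]), label="samples")  # scatter plot
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.savefig("figure.png", dpi=300)  # publication-quality hardcopy output
plt.show()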

Python Libraries for Data Science
Seaborn:
 based on matplotlib

 provides a high-level interface for drawing attractive statistical graphics

 similar (in style) to the popular ggplot2 library in R

Link: https://seaborn.pydata.org/
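A minimal sketch using one of seaborn's bundled example datasets (fetching it requires an internet connection):

import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")                  # bundled example dataset
sns.boxplot(data=tips, x="day", y="total_bill")  # statistical graphic in one call
plt.show()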

Integrated Development Environments (IDE)
Web Integrated Development Environment (WIDE): Jupyter
As the project grew, the IPython notebook was separated from the IPython software and
became part of a larger project: Jupyter. Jupyter (for Julia, Python, and R) aims to
reuse the same WIDE for all these interpreted languages, not just Python.
Time for Hands-on Session

Data_Operations.ipynb
Descriptive Statistics

Descriptive statistics helps to simplify large amounts of data sensibly. Descriptive
statistics applies the concepts, measures, and terms that are used to describe the basic
features of the samples in a study. These procedures are essential to provide summaries
of the samples as an approximation of the population.
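A minimal sketch of basic descriptive measures on a small made-up sample:

import pandas as pd

sample = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])         # hypothetical sample values
print(sample.mean(), sample.median(), sample.std())  # centre and spread
print(sample.describe())  # count, mean, std, min, quartiles, max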
Data Preparation
The most common steps for data preparation are demonstrated in the following hands-on session.
Time for Hands-on Session
Data_Preparation.ipynb
Exploratory Data Analysis

Exploratory Data Analysis (EDA) involves understanding the structure and main characteristics of the dataset.

The typical steps are listed below, followed by a short example using Python with the
pandas, seaborn, and matplotlib libraries.

Step-by-Step EDA

• Loading the Data

• Basic Information

• Descriptive Statistics

• Data Visualization

• Correlation Analysis

• Handling Missing Values

• Outlier Detection
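A sketch of these steps on one of seaborn's bundled example datasets, standing in for whatever dataset you are studying:

import matplotlib.pyplot as plt
import seaborn as sns

# Loading the data (a bundled example dataset stands in for your own file)
df = sns.load_dataset("titanic")

# Basic information and descriptive statistics
df.info()
print(df.describe())

# Data visualization
sns.histplot(df["age"].dropna())
plt.show()

# Correlation analysis on the numeric columns
print(df.select_dtypes("number").corr())

# Handling missing values
print(df.isnull().sum())
df["age"] = df["age"].fillna(df["age"].median())

# Outlier detection with the interquartile-range (IQR) rule
q1, q3 = df["fare"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["fare"] < q1 - 1.5 * iqr) | (df["fare"] > q3 + 1.5 * iqr)]
print(len(outliers), "potential outliers in 'fare'")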
Data Summarization

Time for Hands-on Session

Data_Summarization.ipynb
Data Distributions

Time for Hands-on Session

Data_Distribution.ipynb
Measuring Asymmetry: Skewness and Pearson’s Median Skewness Coefficient
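As a reference: skewness is the third standardized moment, E[(X - mean)^3] / std^3, and Pearson’s median skewness coefficient is 3 * (mean - median) / std. A minimal sketch on a deliberately right-skewed simulated sample:

import numpy as np
from scipy.stats import skew

x = np.random.default_rng(0).exponential(size=1000)  # right-skewed sample

print(skew(x))  # sample skewness: third standardized moment

# Pearson's median skewness coefficient: 3 * (mean - median) / std
print(3 * (x.mean() - np.median(x)) / x.std())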

Time for Hands-on Session

Measuring_Assymetry.ipynb
Estimation

An important aspect when working with statistical data is being able to use estimates to
approximate the values of unknown parameters of the dataset. In this section, we will
review different kinds of estimators (estimated mean, variance, standard score, etc.).
Mean
Variance
Standard Score
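A minimal sketch of these three estimators on a simulated sample:

import numpy as np

x = np.random.default_rng(1).normal(loc=5, scale=2, size=100)  # simulated sample

mean = x.mean()                 # estimated mean
variance = x.var(ddof=1)        # unbiased sample variance
z = (x - mean) / x.std(ddof=1)  # standard scores: z = (x - mean) / std
print(mean, variance, z[:3])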
Statistical Inference
Statistical inference is a branch of statistics that involves drawing conclusions or making
decisions about a population based on data collected from a sample of that population. It
allows us to generalize from a sample to a larger population and make statements or
predictions about parameters, such as means, proportions, variances, etc.
Time for Hands-on Session

Sampling Distribution of Point Estimates.ipynb


Confidence Intervals

Confidence intervals are a statistical concept used to estimate the range of values
within which a population parameter, such as the mean or proportion, is likely to lie.
They provide a measure of the uncertainty or precision of our estimate based on
sample data.
Constructing a Confidence Interval
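For the mean, a common construction is x_bar ± t* × s / sqrt(n). A minimal sketch using SciPy on a simulated sample:

import numpy as np
from scipy import stats

x = np.random.default_rng(2).normal(loc=50, scale=10, size=40)  # simulated sample

# 95% confidence interval for the mean: x_bar +/- t* * s / sqrt(n)
ci = stats.t.interval(0.95, df=len(x) - 1, loc=x.mean(), scale=stats.sem(x))
print(ci)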
Time for Hands-on Session

Confidence_Interval.ipynb
Hypothesis Testing

Testing hypotheses using confidence intervals is a method to determine whether a sample
provides sufficient evidence to support a particular claim about a population parameter
(e.g., the mean).
Hypothesis Testing Using Confidence Intervals
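A minimal sketch of the idea: reject the null hypothesis at the 5% level when the hypothesized value falls outside the 95% confidence interval (the sample and the hypothesized mean are made up):

import numpy as np
from scipy import stats

x = np.random.default_rng(3).normal(loc=52, scale=10, size=40)  # simulated sample
mu0 = 50  # hypothesized population mean (H0: mean = 50)

low, high = stats.t.interval(0.95, df=len(x) - 1, loc=x.mean(), scale=stats.sem(x))
# Reject H0 at the 5% level if mu0 lies outside the 95% confidence interval
print("reject H0" if not (low <= mu0 <= high) else "fail to reject H0")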
Time for Hands-on Session

Hypothesis_Testing.ipynb
Testing Hypotheses Using p-Values

Testing hypotheses using p-values is a common method in statistical hypothesis
testing. It involves calculating the p-value, which indicates the probability of
obtaining a result at least as extreme as the one observed, assuming that the null
hypothesis is true.
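A minimal sketch using a one-sample t-test (the sample and the hypothesized mean are made up):

import numpy as np
from scipy import stats

x = np.random.default_rng(4).normal(loc=52, scale=10, size=40)  # simulated sample

t_stat, p_value = stats.ttest_1samp(x, popmean=50)  # H0: population mean = 50
print(p_value)  # reject H0 at the 5% level when p_value < 0.05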
Time for Hands-on Session

Hypotheses Using p-Values.ipynb


Network Analysis
Network data are generated when we consider relationships between two or more entities in
the data, like the highways connecting cities, friendships between people, or their phone
calls. In recent years, a huge amount of network data has been generated and analyzed in
different fields.
Example:
Infectious disease transmission networks are built in epidemiological studies to find the best
way to prevent infection of people in a territory, by isolating certain areas.
We also find examples in academia, where we can build co-authorship networks and citation
networks to analyze collaborations among universities.
Structuring data as networks can facilitate the study of the data for different goals:
• To discover the weaknesses of a structure. That could be the objective of a biologist
studying a community of plants and trying to establish which of its properties promote
quick transmission of disease.
• To find and exploit structures that work efficiently for the transmission of messages across
the network. This may be the goal of an advertising agent trying to find the best strategy
for spreading publicity.
In this chapter, we are going to discuss how to analyze networks and extract the features
we want to study.
Graphs

Depending on whether the edges of a graph are directed or undirected, the graph is called
a directed graph or an undirected graph, respectively.
• The degree of a node is the number of edges that connect to it.
• The figure shows an example of an undirected graph with 5 nodes and 5 edges. The degree of node C is 1, the
degree of nodes A, D, and E is 2, and the degree of node B is 3. If a network is directed, then nodes have two
different degrees: the in-degree, which is the number of incoming edges, and the out-degree, which is the
number of outgoing edges.
• We could add strengths or weights to the links between the nodes, to represent some real-world measure. For
instance, the length of the highways connecting the cities in a network. In this case, the graph is called a
weighted graph.
• Moreover, many applications of graphs require shortest paths to be computed. The shortest path problem is
the problem of finding a path between two nodes in a graph such that the length of the path or the sum of the
weights of edges in the path is minimized.
• A graph is said to be connected if, for every pair of nodes, there is a path between them. A graph is fully
connected or complete if each pair of nodes is connected by an edge. A connected component or simply a
component of a graph is a subset of its nodes such that every node in the subset has a path to every other one.
Social Network Analysis
Basics in NetworkX

NetworkX is a Python toolbox for the creation, manipulation, and study of the structure,
dynamics, and functions of complex networks. After importing the toolbox, we can create
an undirected graph with 5 nodes by adding the edges, as is done in the following code.
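A sketch of that code; the exact edge list is an assumption, chosen to match the degrees described for the earlier figure:

import networkx as nx

G = nx.Graph()  # an undirected graph
# Hypothetical edge list consistent with the degrees described earlier:
# degree(C) = 1, degree(A) = degree(D) = degree(E) = 2, degree(B) = 3
G.add_edges_from([("A", "B"), ("A", "E"), ("B", "C"), ("B", "D"), ("D", "E")])

print(G.number_of_nodes(), G.number_of_edges())  # 5 nodes, 5 edges
print(dict(G.degree()))                          # degree of every node
print(nx.shortest_path(G, "C", "E"))             # a shortest path between two nodes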
Practical Case: Facebook Dataset

The SNAP collection has links to a great variety of networks such as Facebook-style social
networks, citation networks, Twitter networks, or open communities like LiveJournal. The
Facebook dataset consists of a network representing friendships between Facebook users.
The Facebook data was anonymized by replacing the internal Facebook identifiers for each
user with a new value. The network corresponds to an undirected and unweighted graph that
contains users of Facebook (nodes) and their friendship relations (edges). The Facebook
dataset is defined by an edge list in a plain text file with one edge per line.

Link: https://snap.stanford.edu/data/ego-Facebook.html
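A minimal sketch of loading that edge list with NetworkX (assuming the file has been downloaded and unpacked locally):

import networkx as nx

# Read the edge list (one edge per line) as an undirected, unweighted graph
fb = nx.read_edgelist("facebook_combined.txt")
print(fb.number_of_nodes(), fb.number_of_edges())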
Time for Hands-on Session

Social_network analysis_facebook data.ipynb


facebook_combined.txt
Centrality
Centrality measures in graphs help identify the most important nodes based on their
connections. There are several centrality measures commonly used, such as degree
centrality, betweenness centrality, closeness centrality, and eigenvector centrality. Each
measures a different aspect of node importance within the graph.
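A minimal sketch of the four measures on a small built-in network, standing in for the Facebook graph (which takes longer to process):

import networkx as nx

G = nx.karate_club_graph()  # small built-in social network for illustration

measures = {
    "degree": nx.degree_centrality(G),            # share of nodes each node touches
    "betweenness": nx.betweenness_centrality(G),  # presence on shortest paths
    "closeness": nx.closeness_centrality(G),      # inverse average distance to others
    "eigenvector": nx.eigenvector_centrality(G),  # ties to well-connected nodes
}
for name, c in measures.items():
    print(name, max(c, key=c.get))  # most central node under each measure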
Time for Hands-on Session

centrality.ipynb
facebook_combined.txt
PageRank
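PageRank scores a node by the importance of the nodes linking to it, via a random-walk model with a damping factor. A minimal sketch on a small built-in network:

import networkx as nx

G = nx.karate_club_graph()
pr = nx.pagerank(G, alpha=0.85)  # alpha is the customary damping factor
print(max(pr, key=pr.get))       # node with the highest PageRank score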
Time for Hands-on Session

PageRank.ipynb
Ego-Networks
Ego networks, also known as egocentric networks or personal networks, are subgraphs
that focus on a central node (ego) and its immediate neighbors (alters) in a larger social
network. These networks provide insights into how individuals are connected directly
and indirectly to others within their immediate social circles.
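A minimal sketch extracting the ego network of one node from a small built-in graph:

import networkx as nx

G = nx.karate_club_graph()
ego = nx.ego_graph(G, 0, radius=1)  # ego node 0 plus its immediate alters
print(ego.number_of_nodes(), ego.number_of_edges())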
Time for Hands-on Session

Ego_network.ipynb
Community Detection
Community detection is a key task in network analysis that involves identifying groups of
nodes that are more densely connected to each other than to the rest of the network. These
communities can reveal important structural and functional properties of the network.
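A minimal sketch using modularity-based detection, one of several algorithms available in NetworkX:

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()
communities = greedy_modularity_communities(G)  # modularity-maximizing groups
for i, group in enumerate(communities):
    print(f"community {i}:", sorted(group))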
