Unit-1
Data Science
(IS1100-1)
Abhishek S. Rao
Evolution of Data Science
The evolution of Data Science over the years has taken form in many phases. It all
started with Statistics.
Simple statistics models have been employed to collect, analyze, and manage data
since the early 1800s. These principles underwent various modulations over time until
the rise of the digital age.
Once computers were introduced as mainstream public devices, there was a shift in
the industry to the Digital Age.
A flood of data and digital information was created. This resulted in statistical
practices and models being computerized, giving rise to digital analytics.
Then came the rise of the internet, which exponentially grew the available data, giving
rise to what we now know as Big Data.
This explosion of information available to the masses gave rise to the need for
expertise to process, manage, analyze, and visualize this data for decision making using
various models. This gave birth to the term Data Science.
What is Data Science?
Data science is the science of analyzing raw data using statistics and
machine learning techniques with the purpose of drawing conclusions
about that information.
What does the future hold?
Splitting of Data Science: Various designations and descriptions are associated
with data science, such as Data Analyst, Data Engineer, Data Visualization, Data Architect,
Machine Learning, and Business Intelligence, to name a few.
Data Explosion: Today enormous amounts of data are being produced daily. Every
organization is dependent on the data being created for its processes. Whether it is
medicine, entertainment, sports, manufacturing, agriculture, or transport, it is all
dependent on data.
Rise of Automation: With an increase in the complexity of operations, there is a constant
drive to simplify processes. In the future, it is likely that most machine learning
frameworks will contain libraries of models that are pre-structured and pre-trained. This
would bring about a paradigm shift in the working of a Data Scientist. Soft skills like
Data Visualization would come to the forefront of a Data scientist’s skill set.
Data Science Lifecycle
[Figure: the data science lifecycle; the labels recoverable from the slide are
Step 2: Data collection and preparation, and Step 5: Deployment and maintenance]
Applications of Data Science
Toolboxes for Data Scientists
Why Python?
• Python is a mature programming language, but it also has excellent properties for
newbie programmers, making it ideal for people who have never programmed before.
• Some of the most remarkable of those properties are easy-to-read code, omission of
non-mandatory delimiters, dynamic typing, and dynamic memory management.
• Python is an interpreted language, so the code is executed immediately in the Python
console without needing the compilation step to machine language.
• Currently, Python is one of the most flexible programming languages. One of its main
characteristics that makes it so flexible is that it can be seen as a multiparadigm
language.
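A tiny sketch of two of the properties above, dynamic typing and readable, indentation-based blocks (the function name is illustrative only):

```python
# Dynamic typing: the same name can be rebound to values of different types.
x = 42           # x is an int
x = "forty-two"  # now x is a str; no declaration or cast needed

# Readable, indentation-delimited blocks instead of braces and semicolons.
def describe(value):
    """Return a short description of a value's type and content."""
    return f"{type(value).__name__}: {value}"

print(describe(3.14))    # float: 3.14
print(describe([1, 2]))  # list: [1, 2]
```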
Python Libraries for Data Science
Many popular Python toolboxes/libraries:
• NumPy
• SciPy
• Pandas
• SciKit-Learn
(All these libraries are installed on the SCC.)
Visualization libraries
• matplotlib
• Seaborn
NumPy:
introduces objects for multidimensional arrays and matrices, along with functions
for fast mathematical operations on them
Link: http://www.numpy.org/
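A minimal sketch of NumPy's vectorized array operations:

```python
import numpy as np

# A 2x3 multidimensional array; arithmetic applies elementwise, no loops needed.
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape)        # (2, 3)
print(a.mean())       # 3.5
print((a * 2).sum())  # 42
```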
SciPy:
collection of algorithms for linear algebra, differential equations, numerical
integration, optimization, statistics and more
built on NumPy
Link: https://www.scipy.org/scipylib/
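A quick sketch of two of the areas listed above, numerical integration and optimization:

```python
import numpy as np
from scipy import integrate, optimize

# Numerical integration: integrate sin(x) from 0 to pi (exact value: 2).
value, err = integrate.quad(np.sin, 0, np.pi)
print(round(value, 6))  # 2.0

# Optimization: find the minimum of f(x) = (x - 3)^2.
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)
print(round(result.x, 6))  # 3.0
```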
Pandas:
adds data structures and tools designed to work with table-like data (similar
to Series and Data Frames in R)
Link: http://pandas.pydata.org/
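A minimal sketch of the DataFrame, pandas' table-like structure (column names and values are hypothetical):

```python
import pandas as pd

# A DataFrame holds labeled, column-oriented tabular data, like R's data frames.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "score": [85, 92, 78],
})
print(df["score"].mean())                     # 85.0
print(df[df["score"] > 80]["name"].tolist())  # ['Alice', 'Bob']
```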
SciKit-Learn:
provides machine learning algorithms: classification, regression, clustering,
model validation etc.
Link: http://scikit-learn.org/
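A short classification sketch using scikit-learn's bundled iris dataset (the model and split parameters are my choices, not prescribed by the slide):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Train a classifier on iris and validate it on a held-out test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out data
```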
matplotlib:
Python 2D plotting library that produces publication-quality figures in a
variety of hardcopy formats
Link: https://matplotlib.org/
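A minimal plotting sketch; the non-interactive Agg backend is my choice so the figure goes straight to a hardcopy file:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, renders straight to files
import matplotlib.pyplot as plt
import numpy as np

# Plot one period of sin(x) and save it as a PNG (PDF, SVG, etc. also work).
x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.legend()
fig.savefig("sine.png")
```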
Seaborn:
statistical data visualization library based on matplotlib, providing a
high-level interface for drawing informative statistical graphics
Link: https://seaborn.pydata.org/
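A small sketch of seaborn's high-level, DataFrame-aware interface (the dataset is hand-made and hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for saving to a file
import pandas as pd
import seaborn as sns

# Seaborn plots directly from DataFrame columns named in x= and y=.
df = pd.DataFrame({"group": ["a", "a", "b", "b"],
                   "value": [1.0, 2.0, 3.0, 4.0]})
ax = sns.barplot(data=df, x="group", y="value")
ax.figure.savefig("groups.png")
```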
Integrated Development Environments (IDE)
Web Integrated
Development Environment
(WIDE): Jupyter
Since the project has grown so much, the IPython notebook has been separated from the IPython
software and is now part of a larger project: Jupyter. Jupyter (for Julia, Python,
and R) aims to reuse the same WIDE for all these interpreted languages, not just Python.
Time for Hands-on Session
Data_Operations.ipynb
Descriptive Statistics
Exploratory Data Analysis (EDA) involves understanding the structure and main characteristics of a dataset.
The steps below outline an EDA workflow in Python using the pandas, seaborn, and matplotlib libraries.
Step-by-Step EDA
• Basic Information
• Descriptive Statistics
• Data Visualization
• Correlation Analysis
• Outlier Detection
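The steps above (except the visualization step) can be sketched on a small hand-made dataset; all column names and values are hypothetical, and the outlier rule shown is the common 1.5×IQR heuristic:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, 35, 31, 45, 29, 120],    # 120 is a deliberate outlier
    "income": [30, 55, 48, 70, 44, 52],  # in thousands (hypothetical units)
})

# Basic information and descriptive statistics.
df.info()             # dtypes and non-null counts
print(df.describe())  # count, mean, std, quartiles

# Correlation analysis.
print(df.corr())

# Simple outlier detection with the 1.5*IQR rule.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
print(outliers)
```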
Data Summarization
Data_Summarization.ipynb
Data Distributions
Data_Distribution.ipynb
Measuring Asymmetry: Skewness and Pearson's Median Skewness Coefficient
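A sketch of both asymmetry measures on a right-skewed toy sample (the data values are made up for illustration):

```python
import numpy as np
from scipy.stats import skew

# A right-skewed sample: most values are small, with a long tail of large ones.
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 10, 20], dtype=float)

# Sample skewness (third standardized moment); positive means right-skewed.
print(skew(data))

# Pearson's median skewness coefficient: 3 * (mean - median) / std.
pearson_median_skew = 3 * (data.mean() - np.median(data)) / data.std()
print(pearson_median_skew)
```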
Measuring_Assymetry.ipynb
Estimation
An important aspect when working with statistical data is being able to use estimates to
approximate the values of unknown parameters of the dataset. In this section, we will
review different kinds of estimators (estimated mean, variance, standard score, etc.).
Mean
Variance
Standard Score
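The three estimators above, sketched on a small hypothetical sample:

```python
import numpy as np

data = np.array([4.0, 8.0, 6.0, 5.0, 7.0])

# Estimated mean.
mean = data.mean()
# Sample variance; ddof=1 gives the unbiased estimator.
variance = data.var(ddof=1)
# Standard score (z-score): distance from the mean in standard deviations.
z = (data - mean) / data.std(ddof=1)

print(mean, variance)  # 6.0 2.5
print(z)
```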
Statistical Inference
Statistical inference is a branch of statistics that involves drawing conclusions or making
decisions about a population based on data collected from a sample of that population. It
allows us to generalize from a sample to a larger population and make statements or
predictions about parameters, such as means, proportions, variances, etc.
Time for Hands-on Session
Confidence intervals are a statistical concept used to estimate the range of values
within which a population parameter, such as the mean or proportion, is likely to lie.
They provide a measure of the uncertainty or precision of our estimate based on
sample data.
Constructing a Confidence Interval
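A sketch of constructing a 95% confidence interval for a population mean, using the t-distribution since the population standard deviation is unknown (the sample values are hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical measurements of some quantity.
sample = np.array([5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 5.0])
mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean

# 95% t-interval with n-1 degrees of freedom, centered on the sample mean.
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(low, high)
```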
Time for Hands-on Session
Confidence_Interval.ipynb
Hypothesis Testing
Hypothesis_Testing.ipynb
Testing Hypotheses Using p-Values
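A sketch of a one-sample t-test with a p-value decision rule; the data is synthetic, drawn so the null hypothesis is actually false:

```python
import numpy as np
from scipy import stats

# H0: the population mean is 5.0. The sample is drawn from N(5.5, 1),
# so H0 is false by construction (synthetic data, seeded for repeatability).
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.5, scale=1.0, size=50)

t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(t_stat, p_value)

# Reject H0 at the 5% significance level if the p-value is below 0.05.
reject = p_value < 0.05
print(reject)
```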
NetworkX is a Python toolbox for the creation, manipulation, and study of the structure,
dynamics, and functions of complex networks. After importing the toolbox, we can create
an undirected graph with 5 nodes by adding the edges, as is done in the following code.
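The code referred to above is not reproduced in these notes; a minimal equivalent might look like this (the edge list is an arbitrary example):

```python
import networkx as nx

# Create an undirected graph with 5 nodes by adding its edges;
# nodes are created implicitly as their edges are added.
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)])
print(G.number_of_nodes(), G.number_of_edges())  # 5 5
```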
Practical Case: Facebook Dataset
The SNAP collection has links to a great variety of networks such as Facebook-style social
networks, citation networks, Twitter networks, or open communities like Live Journal. The
Facebook dataset consists of a network representing friendships between Facebook users.
The Facebook data was anonymized by replacing the internal Facebook identifiers for each
user with a new value. The network corresponds to an undirected and unweighted graph that
contains users of Facebook (nodes) and their friendship relations (edges). The Facebook
dataset is defined by an edge list in a plain text file with one edge per line.
Link: https://snap.stanford.edu/data/ego-Facebook.html
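Loading such a plain-text edge list is a one-liner with NetworkX. Since the real `facebook_combined.txt` may not be present, the sketch below first writes a tiny file in the same one-edge-per-line format (the sample edges are made up):

```python
import networkx as nx

# Write a tiny edge list in the same "u v" per-line format as
# facebook_combined.txt, purely for illustration.
with open("sample_edges.txt", "w") as f:
    f.write("0 1\n0 2\n1 2\n2 3\n")

# read_edgelist builds an undirected, unweighted graph from the file.
G = nx.read_edgelist("sample_edges.txt", nodetype=int)
print(G.number_of_nodes(), G.number_of_edges())  # 4 4
```

For the real dataset, the same call with `"facebook_combined.txt"` as the path applies.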
Time for Hands-on Session
centrality.ipynb
facebook_combined.txt
PageRank
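PageRank scores each node by the stationary probability that a random surfer, who follows edges and occasionally teleports, is at that node. A minimal sketch with NetworkX on a toy graph (my own example, not from the hands-on notebook):

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)])

# alpha is the damping factor: probability of following an edge vs. teleporting.
ranks = nx.pagerank(G, alpha=0.85)

# Scores sum to 1; node 3, the highest-degree node, gets the largest score.
print(max(ranks, key=ranks.get))  # 3
```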
Time for Hands-on Session
PageRank.ipynb
Ego-Networks
Ego networks, also known as egocentric networks or personal networks, are subgraphs
that focus on a central node (ego) and its immediate neighbors (alters) in a larger social
network. These networks provide insights into how individuals are connected directly
and indirectly to others within their immediate social circles.
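A sketch of extracting an ego network with NetworkX (the graph is a toy example):

```python
import networkx as nx

# Build a small network and extract the ego network of node 0:
# the ego, its direct neighbors (alters), and the edges among them.
G = nx.Graph()
G.add_edges_from([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)])
ego = nx.ego_graph(G, 0, radius=1)
print(sorted(ego.nodes()))  # [0, 1, 2]
```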
Time for Hands-on Session
Ego_network.ipynb
Community Detection
Community detection is a key task in network analysis: identifying groups of
nodes that are more densely connected to each other than to the rest of the network. These
communities can reveal important structural and functional properties of the network.
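Many algorithms exist for this task; as one illustration, the sketch below uses NetworkX's greedy modularity maximization (my choice of method) on a graph built from two triangles joined by a single bridge edge:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Two dense triangles joined by one bridge edge.
G = nx.Graph()
G.add_edges_from([(0, 1), (0, 2), (1, 2),   # community A
                  (3, 4), (3, 5), (4, 5),   # community B
                  (2, 3)])                  # bridge

# Greedily merge communities to maximize modularity.
communities = greedy_modularity_communities(G)
print([sorted(c) for c in communities])  # [[0, 1, 2], [3, 4, 5]]
```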