What Is Data
Unit-I
What Can We Do With Data - Big Data and Data Science - Big Data Architectures - Small Data - What is Data? - A Short Taxonomy of Data Analytics - Descriptive Statistics: Scale Types - Descriptive Univariate Analysis: Univariate Frequencies - Univariate Data Visualization - Univariate Statistics - Common Univariate Probability Distributions - Descriptive Bivariate Analysis
What is Data?
Data is defined as individual facts, such as numbers, words, measurements, observations
or just descriptions of things.
For example, data might include individual prices, weights, addresses, ages, names,
temperatures, dates, or distances.
Data is a set of characters used to collect, store and transmit information for a specific purpose. Data can be in any form, i.e., text, image, audio, etc. The word comes from the Latin 'datum', which means 'something given'. When data is processed, it is termed 'information'.
There are three types of Big Data: Structured, Semi-structured and Unstructured
data.
1. Structured Data: Any data in a fixed format is known as structured data. It can
only be accessed, stored, or processed in a particular format. This type of data is
stored in the form of tables with rows and columns. Any Excel file or SQL file is an
example of structured data.
2. Unstructured Data: Unstructured data does not have a fixed format; it is stored without a known schema. An example of unstructured data is a web page with text, images, videos, etc.
3. Semi-structured Data: Semi-structured data is the combination of structured as
well as unstructured forms of data. It does not contain any table to show relations;
it contains tags or other markers to show hierarchy. JSON files, XML files, and CSV (Comma-Separated Values) files are examples of semi-structured data. The e-mails we send and receive are also an example of semi-structured data.
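As a quick illustration, here is a minimal Python sketch (all names and values made up) contrasting structured data, which fits a fixed table, with semi-structured data, whose keys mark the hierarchy:

```python
import json

import pandas as pd

# Structured data: fixed rows and columns, like an SQL table or Excel sheet.
structured = pd.DataFrame({"name": ["Asha", "Ravi"], "age": [28, 34]})
print(structured)

# Semi-structured data: no fixed table, but keys/tags mark the hierarchy.
semi_structured = '{"name": "Asha", "contacts": {"email": "a@example.com"}}'
record = json.loads(semi_structured)
print(record["contacts"]["email"])  # navigate by key, not by row and column
```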
Applications of Big Data
1. Social Media and Entertainment: You must have noticed streaming apps such as Netflix recommending shows and movies based on your previous searches and what you have watched. This is done using Big Data. Netflix and other streaming services create a custom profile for each user, storing data such as search history, watch history, most-watched genres, the time of day the user prefers to watch, streaming time per day, etc., then analyze it and give recommendations accordingly. This makes for a better streaming experience for the users.
2. Shopping: Websites like Amazon, Flipkart, etc., also use Big Data to recommend
products based on your previous purchases, search history, and interests. It is done
to maximize their profits and provide a better shopping experience to their
customers.
3. Education: Big Data helps in analyzing and monitoring the behavior and activities of
students, like the time they need to answer a question, the number of questions
skipped, and the difficulty level of the questions that are skipped, and thus helps
students to analyze their overall preparation, weak topics, strong topics, etc.
4. Healthcare: Healthcare providers use Big Data to track and analyze the health and fitness of patients, the number of visits, the number of appointments a patient has skipped, etc. Mass outbreaks of diseases can be predicted by analyzing such data with suitable algorithms.
5. Transportation: Traffic can be controlled by collecting and analyzing the data from the many sensors and cameras installed on roads and highways. Accident-prone areas can be detected with the help of Big Data analysis, so that the measures required to avoid accidents can be taken.
Evolution of Big Data
o The earliest records of tracking and analyzing data date back not decades but thousands of years, to when accounting was first introduced in Mesopotamia.
o In the 20th century, IBM developed the first large-scale data project: punched-card systems that tracked the information of millions of Americans.
o With the emergence of the World Wide Web and supercomputers in the 1990s, the
creation of data on a large scale started to grow at an exponential rate. It was in
the early 1990s when the term 'Big Data' was first used.
o The two main challenges regarding 'Big Data' were storing and processing such a
huge volume of data.
o In 2005, the open-source framework Hadoop, which stores and processes large data sets, was created, with Yahoo as its major early backer.
o The storage solution in Hadoop was named HDFS (Hadoop Distributed File System),
and the processing solution was named MapReduce.
o Later, Hadoop was handed over to a non-profit organization, the Apache Software Foundation.
o In 2008, Cloudera became the first company to provide commercial Hadoop
distribution.
o In 2013, the Creators of Apache Spark founded a company, Databricks, which
offers a platform for Big Data and Machine Learning solutions.
o Over the past few years, top Cloud providers such as Microsoft, Google, and
Amazon also started to provide Big Data solutions. These Cloud providers made it
much easier for users and companies to work on Big Data.
Importance of Big Data
1. A better understanding of market conditions.
2. Time and cost saving.
3. Solving advertisers' problems.
4. Offering better market insights.
5. Boosting customer acquisition and retention.
DATA SCIENCE
What is Data Science?
Data science combines math and statistics, specialized programming, advanced analytics,
artificial intelligence (AI), and machine learning with specific subject matter expertise to
uncover actionable insights hidden in an organization’s data. These insights can be used to
guide decision making and strategic planning.
The data science lifecycle involves various roles, tools, and processes, which enable analysts to glean actionable insights. Typically, a data science project undergoes the following stages:
1. Business Understanding: The complete cycle revolves around the business goal. What will you solve if you do not have a specific problem? It is extremely important to understand the business objective clearly, because that will be the ultimate aim of your analysis.
2. Data Understanding: After business understanding, the next step is data understanding. This involves collecting all the available data. Here you need to work closely with the business team, as they know what data is present, which data could be used for this business problem, and other relevant details.
3. Preparation of Data: Next comes the data preparation stage. This consists of steps like selecting the relevant data, integrating the data by merging data sets, cleaning it, treating missing values by either removing or imputing them, removing inaccurate records, and checking for outliers with box plots and handling them. It can also involve constructing new features derived from existing ones, formatting the data into the desired structure, and removing unwanted columns. Data preparation is the most time-consuming yet arguably the most important step in the entire life cycle: your model will only be as good as your data.
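Below is a minimal, hypothetical pandas sketch of two of these steps: imputing missing values with the median and flagging outliers with the interquartile-range (IQR) rule that a box plot uses. The column names and values are made up.

```python
import pandas as pd

# Hypothetical customer data with missing values and a suspicious entry.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "age": [25.0, None, 41.0, 38.0, 250.0],  # None = missing, 250 = likely error
    "spend": [120.0, 80.5, None, 230.0, 95.0],
})

# Treat missing values by imputing the column median.
df["age"] = df["age"].fillna(df["age"].median())
df["spend"] = df["spend"].fillna(df["spend"].median())

# Check for outliers with the IQR rule (the box-plot whisker rule).
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
print(outliers)  # the age-250 row is flagged for treatment
```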
4. Exploratory Data Analysis: This step involves getting some idea about the solution and the factors affecting it before building the actual model. The distribution of data within individual variables is explored graphically using bar graphs, and relations between different features are captured through graphical representations like scatter plots and heat maps. Many data visualization techniques are used extensively to explore every feature individually and in combination with other features.
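As a sketch of such exploration over a small made-up data set, the snippet below draws a bar graph for one categorical variable, a scatter plot for a pair of numeric features, and a heat map of their correlation matrix:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data set: two numeric features and one category.
df = pd.DataFrame({
    "temperature": [20, 25, 30, 35, 40],
    "sales": [100, 150, 210, 280, 400],
    "city": ["A", "B", "A", "B", "A"],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Bar graph: distribution of a single categorical variable.
df["city"].value_counts().plot.bar(ax=axes[0], title="city counts")

# Scatter plot: relation between two numeric features.
df.plot.scatter(x="temperature", y="sales", ax=axes[1], title="relation")

# Heat map of the correlation matrix between numeric features.
im = axes[2].imshow(df[["temperature", "sales"]].corr(), cmap="coolwarm")
axes[2].set_title("correlation heat map")
fig.colorbar(im, ax=axes[2])
plt.tight_layout()
plt.show()
```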
5. Data Modeling: Data modeling is the heart of data analysis. A model takes the prepared data as input and gives the desired output. This step involves choosing the appropriate model family, depending on whether the problem is a classification, regression, or clustering problem. After deciding on the model family, we need to carefully choose and implement algorithms from within that family.
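For instance, here is a minimal scikit-learn sketch (using the bundled iris data set) of that choice: the target is a class label, so we stay in the classification family and compare two candidate algorithms by cross-validation before committing to one.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# The target is a class label, so this is a classification problem;
# within that family we compare two candidate algorithms.
X, y = load_iris(return_X_y=True)
for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier()):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, round(scores.mean(), 3))
```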
6. Model Evaluation: Here the model is evaluated to check whether it is ready to be deployed. The model is tested on unseen data and scored on a carefully chosen set of evaluation metrics. We also need to make sure that the model conforms to reality. If we do not get a satisfactory result in the evaluation, we have to iterate over the whole modeling process until the desired level of the metrics is achieved.
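Continuing the same toy example, a minimal sketch of evaluation on held-out, unseen data with standard classification metrics (an illustration, not a full evaluation protocol):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Hold out 20% of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print(accuracy_score(y_test, y_pred))         # single summary metric
print(classification_report(y_test, y_pred))  # precision/recall per class
```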
7. Model Deployment: After rigorous evaluation, the model is finally deployed in the desired format and channel. This is the last step in the data science life cycle. Each step described above must be worked upon carefully: if any step is performed improperly, it will affect the steps that follow, and the entire effort may go to waste.
What is Big Data Architecture?
Big data architecture is a comprehensive system for processing vast amounts of data. The big data architecture framework lays out the blueprint of the solutions and infrastructure used to handle big data, depending on an organization’s needs. It clearly defines the components of big data analytics, the layers to be used, and the flow of information. The reference points are ingesting, processing, storing, managing, accessing, and analyzing the data. A typical big data architecture framework includes the following layers:
Stream processing
Stream processing differs from real-time message ingestion: it handles windowed streaming data (or whole streams) and writes the processed stream to an output sink. The tools here are Apache Spark, Apache Flink, and Apache Storm.
Analytical datastore
The analytical data store is a one-stop place for the prepared data. The data is held either in an interactive relational database that offers the metadata or in a NoSQL data warehouse. The prepared data set can then be queried and used for analysis with tools such as Hive, Spark SQL, and HBase.
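As a hypothetical sketch (the table name and values are invented), querying a prepared data set with Spark SQL might look like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("analytics").getOrCreate()

# A tiny prepared data set registered as a queryable view.
df = spark.createDataFrame(
    [("2024-01-01", 120.0), ("2024-01-02", 95.5)], ["day", "revenue"]
)
df.createOrReplaceTempView("sales")

# Analysts query the analytical store much like a warehouse table.
spark.sql("SELECT day, revenue FROM sales WHERE revenue > 100").show()
```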
Analytics Layer
The analytics layer interacts with the storage layer to extract valuable insights and business intelligence. The architecture needs a mix of multiple data tools for handling the unstructured data and performing the analysis.
SMALL DATA
Small Data: It can be defined as small datasets that are capable of influencing current decisions: anything that is currently ongoing and whose data can be accumulated in an Excel file. Small Data is also helpful in making decisions, but it does not aim to impact the business to a great extent; rather, it serves for a short span of time.
In a nutshell, data that is simple enough for human understanding, in a volume and structure that make it accessible, concise, and workable, is known as small data.
Big Data: It can be represented as large chunks of structured and unstructured data. The amount of data stored is immense, so it is important for analysts to dig through all of it thoroughly to make it relevant and useful for proper business decisions.
In short, datasets so huge and complex that conventional data processing techniques cannot manage them are known as big data.
Predictive Analytics
Predictive analytics turns data into valuable, actionable information. It uses data to determine the probable outcome of an event or the likelihood of a situation occurring. Predictive analytics draws on a variety of statistical techniques from modeling, machine learning, data mining, and game theory that analyze current and historical facts to make predictions about future events. Techniques used for predictive analytics (a linear-regression sketch follows the lists below) are:
Linear Regression
Time Series Analysis and Forecasting
Data Mining
Basic Cornerstones of Predictive Analytics
Predictive modeling
Decision Analysis and optimization
Transaction profiling
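As a toy illustration of the linear-regression technique (all numbers invented), the sketch below fits a least-squares line to historical ad spend versus sales and predicts the probable outcome of a new spend level:

```python
import numpy as np

# Predictive analytics in miniature: fit a line to historical ad spend
# vs. sales (made-up numbers) and predict the outcome of a new spend.
spend = np.array([10, 20, 30, 40, 50])   # historical ad spend
sales = np.array([25, 44, 61, 83, 100])  # observed sales

slope, intercept = np.polyfit(spend, sales, deg=1)  # least-squares line
predicted = slope * 60 + intercept                  # probable outcome at spend=60
print(round(predicted, 1))
```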
Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight into how to approach future events. It looks at past performance, mining historical data to understand the causes of success or failure in the past. Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of analysis.
The descriptive model quantifies relationships in data in a way that is often used to classify customers or prospects into groups. Unlike a predictive model, which focuses on predicting the behavior of a single customer, descriptive analytics identifies many different relationships between customers and products.
Common examples of descriptive analytics are company reports that provide historic reviews (a one-line pandas illustration follows the list), like:
Data Queries
Reports
Descriptive Statistics
Data dashboard
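For instance, descriptive statistics over historical data can be produced in one line with pandas (monthly revenue figures are made up):

```python
import pandas as pd

# Descriptive analytics in miniature: summarize past performance.
sales = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "revenue": [120.0, 95.5, 143.2, 130.8],
})
print(sales["revenue"].describe())  # count, mean, std, min, quartiles, max
```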
Prescriptive Analytics
Prescriptive analytics automatically synthesizes big data, mathematical science, business rules, and machine learning to make a prediction and then suggests decision options to take advantage of that prediction.
Prescriptive analytics goes beyond predicting future outcomes: it also suggests actions that benefit from the predictions and shows the decision maker the implications of each decision option. Prescriptive analytics not only anticipates what will happen and when it will happen, but also why it will happen. Further, it can suggest decision options for taking advantage of a future opportunity or mitigating a future risk, and illustrate the implications of each option.
For example, Prescriptive Analytics can benefit healthcare strategic planning by using
analytics to leverage operational and usage data combined with data of external factors
such as economic data, population demography, etc.
Diagnostic Analytics
In diagnostic analytics, we generally use historical data to answer a question or solve a problem: we look for dependencies and patterns in the historical data of the particular problem.
For example, companies go for this analysis because it gives great insight into a problem, and because they already keep detailed information at their disposal; otherwise, data collection might have to be repeated for every problem, which would be very time-consuming.
Common techniques used for Diagnostic Analytics are:
Data discovery
Data mining
Correlations
Descriptive statistics help you to understand the data, but before we go further we should know the different data types used in descriptive statistical analysis.
Types of Data
A data set is a grouping of information that is related to each other. A data set can be
either qualitative or quantitative. A qualitative data set consists of words that can be
observed, not measured. A quantitative data set consists of numbers that can be directly
measured. Months in a year would be an example of qualitative, while the weight of persons
would be an example of quantitative data.
Now, suppose you go to KFC with your friends to eat some burgers. You place the order at the coupon counter, and after collecting the food everyone eats what was ordered on their behalf. If the others are asked about the taste, the ratings will vary from one person to another, but if asked how many burgers were ordered, everyone will arrive at the same definite count. Here, the taste ratings represent Categorical Data and the number of burgers is Numerical Data.
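The same distinction in code, assuming a tiny made-up version of the burger data:

```python
import pandas as pd

# The burger-order example as a tiny made-up data set.
orders = pd.DataFrame({
    "taste_rating": ["good", "okay", "great", "good"],  # categorical
    "burgers_ordered": [2, 1, 3, 2],                    # numerical
})

print(orders["taste_rating"].value_counts())  # categories can be counted
print(orders["burgers_ordered"].sum())        # arithmetic only makes sense
                                              # on numerical data
```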
Types of Categorical Data:
1. Nominal Data: When there is no natural order between categories, the data is of nominal type.
Example: eye color, gender (male and female), blood type, political party, zip code, type of living accommodation (house, apartment, trailer, other), religious preference (Hindu, Buddhist, Muslim, Jewish, Christian, other), etc.
2. Ordinal Data: When there is a natural order between categories, the data is of ordinal type. Here, however, the differences between the ordered values do not matter.
Example: exam grades, socio-economic status (poor, middle class, rich), education level (kindergarten, primary, secondary, higher secondary, graduation), satisfaction rating (extremely dislike, dislike, neutral, like, extremely like), time of day (dawn, morning, noon, afternoon, evening, night), level of agreement (yes, maybe, no), the Likert scale (strongly disagree, disagree, neutral, agree, strongly agree), etc.
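In pandas, ordinal data can be modeled as an ordered categorical; a minimal sketch with made-up survey responses:

```python
import pandas as pd

# Ordinal data: ordered categories, but no meaningful numeric
# differences between them (hypothetical satisfaction responses).
responses = pd.Categorical(
    ["like", "neutral", "extremely like", "dislike"],
    categories=["extremely dislike", "dislike", "neutral",
                "like", "extremely like"],
    ordered=True,
)
print(responses.min(), "->", responses.max())  # ordering works
# But "like" minus "neutral" has no defined numeric difference.
```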
Scales of Measurement:
Data can be classified as being on one of four scales: nominal, ordinal, interval or
ratio. Each level of measurement has some important properties that are useful to know.
1. Nominal Scale: The nominal data type defined above falls into this category. Nominal values have no numeric meaning, so they can be neither added, subtracted, multiplied, nor divided. They also have no order; if they appear to have an order, then you probably have ordinal variables instead.
2. Ordinal Scale: Ordinal datatype defined above can be placed into this
category. The ordinal scale contains things that you can place in order. For
example, hottest to coldest, lightest to heaviest, richest to poorest. So, if
you can rank data by 1st, 2nd, 3rd place (and so on), then you have data that
is on an ordinal scale.
3. Interval Scale: An interval scale has ordered numbers with meaningful
divisions. Temperature is on the interval scale: a difference of 10 degrees
between 90 and 100 means the same as 10 degrees between 150 and 160.
Compare that to an Olympic running race (where finishing position is ordinal): the time difference between the winner and the runner-up might be 0.01 seconds, and between the second-last and last runners 0.5 seconds. If you have meaningful divisions, you have something on the interval scale.
4. Ratio Scale: The ratio scale has all the properties of the interval scale, with one major difference: zero is meaningful. When a value on the scale equals 0.0, there is none of that quantity. For example, a height of zero is meaningful (it means you don't exist), and a temperature of 0.0 Kelvin really does mean "no heat". Compare that to a temperature of zero degrees Celsius, which, while it exists, does not mean anything in particular (although admittedly, it is the freezing point of water).
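A small worked check of the interval-versus-ratio distinction, using temperature (the values are arbitrary):

```python
# Interval vs. ratio scale, illustrated with temperature.
celsius_a, celsius_b = 10.0, 20.0                            # interval scale
kelvin_a, kelvin_b = celsius_a + 273.15, celsius_b + 273.15  # ratio scale

# Differences are meaningful on both scales, and they agree:
print(celsius_b - celsius_a, kelvin_b - kelvin_a)  # 10.0 10.0

# Ratios are meaningful only on the ratio scale: 20 degrees C is not
# "twice as hot" as 10 degrees C, because 0 degrees C is not "no heat".
print(celsius_b / celsius_a)  # 2.0 (misleading)
print(kelvin_b / kelvin_a)    # about 1.035 (physically meaningful)
```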
2. Bivariate data
When the data involves two variables, it is categorized under bivariate. Suppose temperature and ice cream sales are the two variables of a bivariate data set. Here, the relationship is visible from the data: temperature and sales are directly proportional to each other, and thus related, because as the temperature increases, the sales also increase. Bivariate data analysis therefore involves comparisons, relationships, causes and explanations. These variables are often plotted on the X and Y axes of a graph for a better understanding of the data, and one of these variables is independent while the other is dependent.
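A minimal pandas sketch of such a bivariate analysis, with invented temperature and sales figures:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Bivariate data: temperature (independent) vs. ice cream sales (dependent).
df = pd.DataFrame({
    "temperature": [20, 24, 28, 32, 36],
    "ice_cream_sales": [110, 160, 220, 300, 410],
})

# A correlation near +1 backs up the "directly proportional" reading.
print(df["temperature"].corr(df["ice_cream_sales"]).round(2))

df.plot.scatter(x="temperature", y="ice_cream_sales")  # X vs. Y plot
plt.show()
```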
3. Multivariate data
When the data involves three or more variables, it is categorized under multivariate. As an example of this type of data, suppose an advertiser wants to compare the popularity of four advertisements on a website; their click rates could be measured for both men and women, and relationships between variables could then be examined. It is similar to bivariate analysis but contains more than one dependent variable. The way to analyze this data depends on the goals to be achieved; some of the techniques are regression analysis, path analysis, factor analysis and multivariate analysis of variance (MANOVA).
There are lots of different tools, techniques and methods that can be used to conduct your analysis: software libraries, visualization tools and statistical testing methods. Here, however, we will compare Univariate, Bivariate and Multivariate analysis.
| Univariate | Bivariate | Multivariate |
|---|---|---|
| It summarizes only one variable at a time. | It summarizes two variables. | It summarizes more than two variables. |
| It does not deal with causes and relationships. | It deals with causes and relationships, and analysis is done. | It does not deal with causes and relationships, and analysis is done. |
| It does not contain any dependent variable. | It contains only one dependent variable. | It is similar to bivariate but contains more than two variables. |
| The main purpose is to describe. | The main purpose is to explain. | The main purpose is to study the relationships among the variables. |
| Example: height. | Example: temperature and ice cream sales in summer vacation. | Example: an advertiser wants to compare the popularity of four advertisements on a website; click rates could be measured for both men and women and relationships between variables examined. |
A plot of how often each value occurs is called the frequency distribution of the data. You may see a smooth curve-like structure that describes the data, but do you notice an anomaly: an abnormally low frequency at a particular score range? The best guess would be that missing values cause the dent in the distribution.
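To make the idea concrete, the sketch below simulates scores whose frequency distribution has such a dent (all values are synthetic):

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic scores drawn from two groups, leaving a low-frequency
# "dent" between them, similar to the anomaly described above.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(50, 8, 300), rng.normal(85, 8, 200)])

plt.hist(scores, bins=30, edgecolor="black")  # the frequency distribution
plt.xlabel("score")
plt.ylabel("frequency")
plt.show()
```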
Types of Distributions
Here is a list of distributions types
1. Bernoulli Distribution
2. Uniform Distribution
3. Binomial Distribution
4. Normal or Gaussian Distribution
5. Exponential Distribution
6. Poisson Distribution
Bernoulli Distribution
Let’s start with the easiest distribution, which is Bernoulli Distribution. It is actually
easier to understand than it sounds!
All you cricket junkies out there! At the beginning of any cricket match, how do you decide who will bat or bowl? A toss! It all depends on whether you win or lose the toss, right? Let's say if the toss results in a head, you win; else, you lose. There's no midway.
A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and a single trial. So a random variable X with a Bernoulli distribution can take the value 1 with the probability of success, say p, and the value 0 with the probability of failure, say q = 1 - p.
Here, the occurrence of a head denotes success, and the occurrence of a tail denotes
failure.
Probability of getting a head = 0.5 = Probability of getting a tail since there are only
two possible outcomes.
The probability mass function is given by: P(X = x) = p^x (1 - p)^(1 - x), where x ∈ {0, 1}.
It can also be written as: P(X = 0) = 1 - p and P(X = 1) = p.
The probabilities of success and failure need not be equally likely, like the result of a fight between Undertaker and me. He is pretty much certain to win. So, in this case, the probability of my success is 0.15, while that of my failure is 0.85.
Bernoulli Distribution Example
Here, the probability of success (p = 0.15) is not the same as the probability of failure (q = 0.85): the Bernoulli distribution of our fight puts probability 0.85 on 0 and 0.15 on 1.
The expected value is exactly what it sounds like: if I punch you, I may expect you to punch me back. Basically, the expected value of any distribution is the mean of the distribution. The expected value of a random variable X from a Bernoulli distribution is found as follows:
E(X) = 1*p + 0*(1-p) = p
The variance of a random variable from a Bernoulli distribution is:
V(X) = E(X²) – [E(X)]² = p – p² = p(1-p)
There are many examples of the Bernoulli distribution, such as whether it will rain tomorrow or not (where rain denotes success and no rain denotes failure), or winning (success) versus losing (failure) a game.
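A quick numerical check with scipy (assuming scipy is available), reproducing the pmf values, mean, and variance above for p = 0.15:

```python
from scipy.stats import bernoulli

p = 0.15  # probability of success (winning the fight)

# pmf at x=1 and x=0 reproduces p^x * (1-p)^(1-x):
print(bernoulli.pmf(1, p), bernoulli.pmf(0, p))  # 0.15 0.85

# Mean and variance match E(X) = p and V(X) = p(1-p):
print(bernoulli.mean(p), bernoulli.var(p))       # 0.15 0.1275

# Simulate 10 trials (1 = success, 0 = failure):
print(bernoulli.rvs(p, size=10, random_state=0))
```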
Uniform Distribution
When you roll a fair die, the outcomes are 1 to 6. The probabilities of getting these
outcomes are equally likely, which is the basis of a uniform distribution. Unlike the Bernoulli distribution, all n possible outcomes of a uniform distribution are equally likely.
A variable X is said to be uniformly distributed if its density function is f(x) = 1/(b - a) for a ≤ x ≤ b, and 0 otherwise.
You can see that the shape of the uniform density is rectangular, which is why the uniform distribution is also called the rectangular distribution.
For a Uniform Distribution, a and b are the parameters.
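A short scipy sketch of both flavors: the fair die (discrete uniform) and a continuous uniform on [a, b], whose density is the constant 1/(b - a):

```python
from scipy.stats import randint, uniform

# Discrete uniform: a fair die, outcomes 1..6 equally likely.
die = randint(1, 7)  # upper bound is exclusive
print(die.pmf(3))    # 1/6 for every face

# Continuous uniform on [a, b]: density is the constant 1/(b - a).
a, b = 2.0, 5.0
u = uniform(loc=a, scale=b - a)
print(u.pdf(3.0), 1 / (b - a))  # both 1/3: the flat, rectangular shape
```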