ETCh2
Data Science
Introduction
In this chapter, you are going to learn more about data science, data vs.
information, data types and representation, data value chain, and basic
concepts of big data.
After completing this chapter, the students will be able to:
Describe what data science is and the role of data scientists.
Differentiate data and information.
Describe data processing life cycle.
Understand different data types from diverse perspectives.
Describe the data value chain in the emerging era of big data.
Understand the basics of Big Data.
Describe the purpose of the Hadoop ecosystem components.
Data Science
I’m sure you have seen smartwatches, or maybe you use one yourself. These
smart gadgets can measure your sleep quality, how much you walk, your heart
rate, etc. Let’s take sleep quality, for instance!
If you check every single day how you slept the night before, that’s one data
point per day. Let’s say that you enjoyed excellent sleep last night: you
slept 8 hours, you didn’t move too much, you didn’t have short awakenings,
etc. That’s a data point. The day after, you slept slightly worse: only 7
hours. That’s another data point.
Cont…
By collecting these data points for a whole month, you can start to draw trends
from them. Maybe, on the weekends, you sleep better and longer. Maybe if you
go to bed earlier, your sleep quality is better. Or you recognize that you have
short awakenings around 2 am every night…
By collecting the data for a year, you can create more complex analyses. You can
learn the best times for you to go to bed and wake up. You can identify the
more stressful parts of the year (when you worked too much and slept too little).
Even more, you might be able to predict these stressful periods and prepare
yourself for them! So, we are getting closer and closer to data science…
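As a concrete illustration, a week of these sleep readings can already be summarized with a few lines of Python (the numbers below are invented for illustration):

# Hypothetical sleep readings: one data point per day, in hours.
sleep_hours = [8.0, 7.0, 6.5, 7.5, 8.0, 9.0, 8.5]

average = sum(sleep_hours) / len(sleep_hours)
print(f"Average sleep: {average:.1f} h")   # 7.8 h
print(f"Best night:  {max(sleep_hours)} h")
print(f"Worst night: {min(sleep_hours)} h")

With a month or a year of such data points, the same idea scales up from simple averages to trends and predictions.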
What are data and information?
Data can be defined as a representation of facts, concepts, or instructions in
a formalized manner suitable for communication, interpretation, or
processing by humans or electronic machines.
It can be described as unprocessed facts and figures. It is represented with
the help of characters such as letters (A-Z, a-z), digits (0-9), or special
characters (+, -, /, *, <, >, =, etc.).
Cont…
Information is processed data on which decisions and actions are based.
It is data that has been processed into a form that is meaningful to the
recipient and is of real or perceived value in the recipient’s current or
prospective actions or decisions.
Furthermore, information is interpreted data: data created from organized,
structured, and processed data in a particular context.
Examples of Data and Information
• The history of temperature readings all over the world for the past 100 years is data. If this data is organized
and analyzed to find that global temperature is rising, then that is information.
• The number of visitors to a website by country is an example of data. Finding out that traffic from the U.S. is
increasing while that from Australia is decreasing is meaningful information.
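The first example can be sketched in miniature: the code below (with invented readings) shows raw values going in as data and a meaningful statement, information, coming out.

# Invented yearly average temperatures (°C); the raw values are data.
yearly_avg_temp = [13.9, 14.0, 14.1, 14.3, 14.4]

earlier = sum(yearly_avg_temp[:2]) / 2
recent = sum(yearly_avg_temp[-2:]) / 2

# Comparing the two periods turns the data into information.
if recent > earlier:
    print("Information: the average temperature is rising.")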
Data Processing Cycle
Data processing is the restructuring or reordering of data by people or
machines to increase its usefulness and add value for a particular
purpose.
Data processing consists of the following basic steps - input, processing,
and output.
Input − in this step, the input data is prepared in some convenient form for
processing; this can be text, voice, etc. Devices such as the keyboard, mouse,
scanner, and digital camera are considered input devices.
Processing − in this step, the input data is changed to produce data in a more
useful form.
Output − at this stage, the result of the preceding processing step is collected.
The particular form of the output data depends on the use of the data. For
example, output data may be payroll for employees.
Storage − the final stage of the data processing cycle is storage. After all of
the data is processed, it is stored for future use.
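The four steps can be traced in a minimal Python sketch, using the payroll example mentioned above (the names and figures are hypothetical):

import json

# Input: raw data prepared in a convenient form for processing.
record = {"employee": "A. Bekele", "hours": 160, "rate": 50}

# Processing: change the input into a more useful form.
payroll = {"employee": record["employee"],
           "pay": record["hours"] * record["rate"]}

# Output: the result of the processing step, in the form the use requires.
print(payroll["employee"], "earns", payroll["pay"])

# Storage: keep the processed result for future use.
with open("payroll.json", "w") as f:
    json.dump(payroll, f)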
Data types and their representation
A data type constrains the values that an expression, such as a variable or a
function, might take.
This data type defines the operations that can be done on the data, the meaning
of the data, and the way values of that type can be stored.
Data types can be described from diverse perspectives.
From a data analytics perspective, there are three common data types or
structures: structured, semi-structured, and unstructured data.
JSON and XML are common forms of semi-structured data.
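Both perspectives can be seen in a short Python sketch (the values are made up for illustration):

import json

# Programming perspective: each value has a type that fixes the
# operations allowed on it and how it is stored.
count = 42          # int
price = 19.99       # float
name = "Abebe"      # str
valid = True        # bool
print(type(count), type(name))

# Analytics perspective: JSON is semi-structured; it has keys/tags
# but no rigid table schema.
record = json.loads('{"name": "Abebe", "visits": [3, 5], "country": "ET"}')
print(record["visits"])   # [3, 5]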
Metadata – Data about Data
It is one of the most important elements for big data analysis and big data
solutions. Metadata is data about data. It provides additional information
about a specific set of data.
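For example, a file’s metadata (its size and modification time, not its contents) can be read in Python; "readings.csv" below is a hypothetical file name, so the call assumes such a file exists:

import os, datetime

info = os.stat("readings.csv")   # metadata about the file, not its contents
print("Size in bytes:", info.st_size)
print("Last modified:", datetime.datetime.fromtimestamp(info.st_mtime))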
Data Value Chain
The Data Value Chain is introduced to describe the information flow
within a big data system as a series of steps needed to generate value and
useful insights from data.
Cont…
The Big Data Value Chain identifies the following key high-level activities:
Data Acquisition, Data Analysis, Data Curation, Data Storage, and Data Usage.
Data Usage
Data usage in business decision-making can enhance competitiveness through
the reduction of costs, increased added value, or any other parameter that can
be measured against existing performance criteria.
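A minimal sketch of the whole chain as a pipeline (the function names and bodies are placeholders; real systems use dedicated tools for each stage):

# Data Acquisition: gather raw data (invented readings).
def acquire():
    return [14.1, 14.3, 14.4]

# Data Analysis: extract a useful summary from the raw data.
def analyze(data):
    return sum(data) / len(data)

# Data Curation and Storage: keep the result for future use
# (a dict stands in for a real data store).
store = {}
store["avg_temp"] = round(analyze(acquire()), 2)

# Data Usage: the stored insight feeds a decision.
print(store)   # {'avg_temp': 14.27}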
Clustered Computing
A cluster is simply a combination of many computers designed to work together as one system. A
Hadoop cluster is, therefore, a cluster of computers used to run Hadoop. Hadoop clusters are designed
specifically for storing and analyzing large amounts of unstructured data in distributed file systems.
To better address the high storage and computational needs of big data, computer clusters are a
better fit.
Using clusters requires a solution for managing cluster membership, coordinating resource
sharing, and scheduling actual work on individual nodes. Cluster membership and resource
allocation can be handled by software like Hadoop’s YARN (which stands for Yet Another
Resource Negotiator).
Big data clustering software combines the resources of many smaller machines, seeking to
provide a number of benefits:
Cont…
o Resource Pooling: Combining the available storage space to hold data is a clear
benefit, but CPU and memory pooling are also extremely important. Processing
large datasets requires large amounts of all three of these resources.
o High Availability: Clusters can provide varying levels of fault tolerance and
availability guarantees to prevent hardware or software failures from affecting
access to data and processing.
Hadoop has an ecosystem that has evolved from its four core components: HDFS
(distributed storage), YARN (resource management), MapReduce (processing), and
Hadoop Common (shared utilities).
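The processing idea behind Hadoop’s MapReduce can be simulated on a single machine; the sketch below only illustrates the map and reduce phases and is not actual Hadoop code:

from collections import Counter

documents = ["big data needs big clusters", "clusters pool resources"]

# Map phase: each node would emit (word, 1) pairs for its share of data.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Reduce phase: pairs with the same key are combined into one count.
counts = Counter()
for word, n in mapped:
    counts[word] += n

print(counts.most_common(2))   # [('big', 2), ('clusters', 2)]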
Review Questions
2) Discuss data and its types from the computer programming and data analytics
perspectives.
3) Discuss the series of steps needed to generate value and useful insights from
data.