Chapter 2 DS
1. Structured data –
Structured data is data whose elements are addressable for effective
analysis. It has been organized into a formatted repository, typically a
database, and covers all data that can be stored in an SQL database in a
table with rows and columns. Such data has relational keys and can easily be
mapped into pre-designed fields. Today, structured data is the most widely
processed kind of data and the simplest to manage. Example: relational data.
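The idea of rows, columns, and keys can be sketched with Python's built-in sqlite3 module. This is only an illustration: the students table, its columns, and the values are made up for the example.

```python
import sqlite3

# In-memory database; the table, columns, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL, marks INTEGER)"
)
conn.executemany(
    "INSERT INTO students (id, name, marks) VALUES (?, ?, ?)",
    [(1, "Asha", 82), (2, "Ben", 74), (3, "Chen", 91)],
)

# Every value is addressable by its row (the key) and its column (the field).
top_students = conn.execute(
    "SELECT name, marks FROM students WHERE marks > ? ORDER BY marks DESC",
    (80,),
).fetchall()
print(top_students)
conn.close()
```

Because every record fits the same pre-designed fields, a single query can filter and sort the whole table.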
2. Semi-structured data –
Semi-structured data is information that does not reside in a relational
database but has some organizational properties, such as tags or markers,
that make it easier to analyze. With some processing, it can be stored in a
relational database (this can be very hard for some kinds of semi-structured
data), but keeping it in semi-structured form can save space. Example: XML data.
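As a small sketch of how tags give XML data its organization, the snippet below parses a made-up XML fragment with the standard xml.etree.ElementTree module. The library/book/title/year element names are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment: the second book has no <year> element,
# which a fixed relational schema would not allow so easily.
xml_text = """
<library>
  <book id="1"><title>Data Science Basics</title><year>2021</year></book>
  <book id="2"><title>Statistics Primer</title></book>
</library>
""".strip()

root = ET.fromstring(xml_text)
books = []
for book in root.findall("book"):
    books.append(
        {
            "id": book.get("id"),
            "title": book.findtext("title"),
            "year": book.findtext("year"),  # None when the element is absent
        }
    )
print(books)
```

The tags make each field easy to find, yet records are free to omit or add fields.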
3. Unstructured data –
Unstructured data is data that is not organized in a predefined manner or does not
have a predefined data model; thus, it is not a good fit for a mainstream relational
database. Alternative platforms therefore exist for storing and managing unstructured
data. It is increasingly prevalent in IT systems and is used by organizations in
a variety of business intelligence and analytics applications. Examples: Word
documents, PDF files, text, and media logs.
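Since unstructured text has no fields to query, useful pieces usually have to be pulled out by pattern matching. The sketch below assumes a made-up free-text note and uses the standard re module to find dates written as YYYY-MM-DD.

```python
import re

# Hypothetical free-text note with no predefined data model.
note = "Call from Ravi on 2024-03-05 about invoice 4471; follow up by 2024-03-12."

# Pull out anything that looks like a YYYY-MM-DD date.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", note)

# Split the text into lowercase words for simple text analysis.
words = re.findall(r"[a-z]+", note.lower())

print(dates)
print(len(words))
```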
Differences between Structured, Semi-structured and Unstructured data:
Technology
• Structured data: It is based on relational database tables.
• Semi-structured data: It is based on XML/RDF (Resource Description Framework).
• Unstructured data: It is based on character and binary data.
Transaction management
• Structured data: Matured transaction management and various concurrency techniques.
• Semi-structured data: Transaction management is adapted from the DBMS and is not matured.
• Unstructured data: No transaction management and no concurrency.
Definition of primary and secondary data:
• Primary data refers to the first-hand data collected by the team. It is
collected based on the researcher’s needs.
• Secondary data has been collected by other teams in the past. It does not
necessarily need to be aligned with the researcher’s requirements.
Types of Data:
The data in statistics is classified into four categories:
• Nominal data
• Ordinal data
• Discrete data
• Continuous data
Another common classification in statistics uses four levels of measurement: nominal,
ordinal, interval, and ratio. Both classifications describe the nature of the data being
collected or analyzed, and they help determine the appropriate statistical tests to use.
Nominal Data
Nominal data is a type of data that consists of categories or names that cannot be
ordered or ranked. Nominal data is often used to categorize observations into groups,
and the groups are not comparable. In other words, nominal data has no inherent
order or ranking. Examples of nominal data include gender (male or female), race
(White, Black, Asian), religion (Hinduism, Christianity, Islam, Judaism), and blood
type (A, B, AB, O).
Nominal data can be represented using frequency tables and bar charts, which
display the number or proportion of observations in each category. For example, a
frequency table for gender might show the number of males and females in a sample
of people.
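A frequency table like the one described above can be sketched with collections.Counter. The blood-type sample below is invented for illustration.

```python
from collections import Counter

# Hypothetical sample of a nominal variable (blood type).
blood_types = ["A", "O", "B", "O", "AB", "A", "O", "A"]

counts = Counter(blood_types)
total = len(blood_types)

# Print each category with its count and its proportion of the sample.
for category, count in counts.most_common():
    print(f"{category:<3} {count:>2} {count / total:.1%}")
```

The categories are only counted, never averaged or ordered, because nominal values have no ranking.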
Nominal data is analyzed using non-parametric tests, which do not make any
assumptions about the underlying distribution of the data. Common non-parametric
tests for nominal data include Chi-Squared Tests and Fisher’s Exact Tests. These
tests are used to compare the frequency or proportion of observations in different
categories.
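To show what a Chi-Squared test compares, the sketch below computes the test statistic by hand for a 2×2 table of made-up counts. It assumes no continuity correction and stops at the statistic; in practice a statistics library would also report the p-value.

```python
# Hypothetical 2x2 contingency table of observed counts:
# rows are two groups, columns are two categories.
observed = [[30, 10], [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Count expected in this cell if group and category were independent.
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi2, 3), dof)
```

With 1 degree of freedom, a statistic above 3.841 is significant at the 5% level, so these invented counts would suggest that the two variables are related.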
Ordinal Data
Ordinal data is a type of data that consists of categories that can be ordered or ranked.
However, the distance between categories is not necessarily equal. Ordinal data is
often used to measure subjective attributes or opinions, where there is a natural order
to the responses. Examples of ordinal data include education level (Elementary,
Middle, High School, College), job position (Manager, Supervisor, Employee), etc.
Ordinal data can be represented using bar charts or line charts. These displays show
the order or ranking of the categories, but they do not imply that the distances
between categories are equal.
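One way to work with ordinal categories in code is to give each category a rank and then sort or take the median by that rank. The sketch below uses the education levels from the text; the responses themselves are invented.

```python
# Order of the categories, from lowest to highest.
education_order = ["Elementary", "Middle", "High School", "College"]
rank = {level: position for position, level in enumerate(education_order)}

# Hypothetical survey responses.
responses = ["College", "Middle", "High School", "Elementary", "Middle"]

# Sort by rank, not alphabetically.
sorted_responses = sorted(responses, key=rank.__getitem__)

# The median is meaningful for ordinal data; the mean of the ranks is not,
# because the distances between categories are not necessarily equal.
median_level = sorted_responses[len(sorted_responses) // 2]
print(sorted_responses, median_level)
```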
Ordinal data is analyzed using non-parametric tests, which make no assumptions
about the underlying distribution of the data. Common non-parametric tests for
ordinal data include the Wilcoxon Signed-Rank test and Mann-Whitney U test.
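As a rough sketch of how the Mann-Whitney U test uses only the order of the values, the code below computes the U statistic by hand for two made-up groups of ratings on a 1 to 5 scale. The helper names are hypothetical, and only the statistic is computed, not the p-value.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """Return the smaller of the two U statistics for the two groups."""
    ranks = average_ranks(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[: len(group_a)])
    u_a = rank_sum_a - len(group_a) * (len(group_a) + 1) / 2
    u_b = len(group_a) * len(group_b) - u_a
    return min(u_a, u_b)


# Hypothetical satisfaction ratings (1 = lowest, 5 = highest).
group_a = [1, 2, 2, 3, 4]
group_b = [3, 4, 4, 5, 5]
u_statistic = mann_whitney_u(group_a, group_b)
print(u_statistic)
```

A small U means the ratings of one group mostly rank below those of the other.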
Quantitative Data (Numerical Data)
Quantitative data is data that represents numerical values, which is why it is also
called numerical data. It is used to represent quantities such as height, weight,
and length. Quantitative data is further classified into two categories:
• Discrete Data
• Continuous Data
Discrete Data
Discrete data is a type of quantitative data that takes only distinct, separate
values. These values can be easily counted as whole numbers.
Examples of discrete data are,
• Number of students in a class
• Marks of the students in a class test
• Number of members in a family, etc.
Continuous Data
Continuous data is the type of quantitative data that can take any value within a
continuous range. Such values are measured rather than counted. Examples of
continuous data are,
• Temperature range
• Height and weight of students in a class
• Salary range of workers in a factory, etc.
Difference between discrete and continuous data:
• Discrete data has clear spaces between values; continuous data falls into a
continuous series.
• Discrete data is countable; continuous data is measurable.
• Discrete data has distinct or different values; continuous data includes every
value within a range.
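The countable versus measurable distinction also shows up in how the two kinds of data are summarized. In the sketch below, which uses invented values, discrete marks are counted directly, while continuous temperatures are first grouped into ranges (bins) of an assumed width of 2 degrees.

```python
from collections import Counter

# Hypothetical data: marks are counted, temperatures are measured.
marks = [7, 8, 8, 9, 10, 8, 7]
temperatures_c = [21.4, 23.9, 22.1, 25.6, 24.3, 21.9]

# Discrete values can be counted one by one.
mark_counts = Counter(marks)

# Continuous values rarely repeat exactly, so group them into ranges first.
bin_width = 2
temperature_bins = Counter(
    int(t // bin_width) * bin_width for t in temperatures_c
)

print(sorted(mark_counts.items()))
print(sorted(temperature_bins.items()))
```

Each key of temperature_bins is the lower edge of a range, so 20 stands for temperatures from 20 up to (but not including) 22.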
Data Sources:
A data source is the location where the data being used originates. A data
source may be the initial location where data is born or where physical information
is first digitized. However, even the most refined data may serve as a source, as long
as another process accesses and utilizes it.
Databases
A database is an organized collection of structured information, or data, typically
stored electronically in a computer system. A database is usually controlled by a
database management system (DBMS).
Types:
• Relational Database
• NoSQL Database
Files
Data can also be stored in files in various formats, such as text files, CSV files,
Excel spreadsheets, and more.
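Reading a CSV file can be sketched with the standard csv module. To keep the example self-contained, the file contents are given as a made-up string; with a real file, open(...) would take the place of io.StringIO.

```python
import csv
import io

# Hypothetical CSV content; the first line holds the column names.
csv_text = "name,city,age\nAsha,Pune,21\nBen,Leeds,23\n"

reader = csv.DictReader(io.StringIO(csv_text))
rows = [dict(row) for row in reader]

# CSV values arrive as text, so numbers must be converted explicitly.
ages = [int(row["age"]) for row in rows]
print(rows)
print(sum(ages) / len(ages))
```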
Web Scraping
Web scraping is the process of using bots to extract content and data from a website.
Unlike screen scraping, which only copies pixels displayed on screen, web scraping
extracts underlying HTML code, and, with it, data stored in a database. The scraper
can then replicate entire website content elsewhere.
Usage: Extracting news articles, product information, reviews, and more from
websites.
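The extraction step of web scraping can be sketched with the standard html.parser module. The page below is a made-up HTML string, and the HeadlineParser class and the "headline" class name are assumptions for the example. A real scraper would first download the page and should respect the website's terms of use.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text of every <h2 class="headline"> element."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "h2" and ("class", "headline") in attrs:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline and data.strip():
            self.headlines.append(data.strip())


# Hypothetical page content standing in for a downloaded web page.
page_html = """<html><body>
<h2 class="headline">Rain expected this week</h2>
<p>Details inside.</p>
<h2 class="headline">Local team wins final</h2>
</body></html>"""

parser = HeadlineParser()
parser.feed(page_html)
headlines = parser.headlines
print(headlines)
```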
Sensors
A sensor is a device that detects and responds to some type of input from the physical
environment. The input can be light, heat, motion, moisture, pressure, or any number
of other environmental phenomena. Sensors collect data from the environment or
devices, providing valuable information for various applications and IoT projects.
In the context of data science, sensor data is valuable for IoT applications,
environmental monitoring, healthcare, manufacturing, and more.
Social Media
Social media platforms generate vast amounts of data daily, including text messages,
videos, and user engagement metrics.
Usage: Analyzing trends, sentiments, user behavior, and engagement patterns.
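A very simple form of trend analysis is counting how often each hashtag appears. The sketch below uses a few invented posts; real platforms provide their data through their own APIs, which are not shown here.

```python
import re
from collections import Counter

# Hypothetical posts standing in for data collected from a platform.
posts = [
    "Loving the new phone! #tech #gadgets",
    "Battery life could be better #tech",
    "Great match tonight #sports",
]

# Count every hashtag, ignoring upper and lower case.
hashtags = Counter(
    tag.lower() for post in posts for tag in re.findall(r"#(\w+)", post)
)

top_tag, top_count = hashtags.most_common(1)[0]
print(top_tag, top_count)
```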
Chapter Ends…