
Chapter 2 DS

Chapter 2 discusses various data types, including structured, semi-structured, and unstructured data, along with their characteristics and differences. It also categorizes data based on its collection methods into primary and secondary data, and outlines different data types in statistics such as nominal, ordinal, discrete, and continuous data. Additionally, the chapter covers various data sources including databases, files, APIs, web scraping, sensors, and social media.

Uploaded by

trexwarrior92
Chapter-2

Data Types and Sources


Data Types and Sources: Different types of data: structured, unstructured,
semi-structured, Data sources: databases, files, APIs, web scraping, sensors,
social media

Data can be classified as structured, semi-structured, or unstructured.

1. Structured data
Structured data is data whose elements are addressable for effective
analysis. It is organized into a formatted repository, typically a
database, and covers all data that can be stored in an SQL database as
tables with rows and columns. Such data has relational keys and can easily
be mapped into pre-designed fields. Today it is the most widely processed
kind of data and the simplest to manage. Example: relational data.
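
The rows-and-columns idea can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name, column names, and sample values are assumptions made up for the example, not part of the chapter.

```python
import sqlite3

# In-memory database; the "students" table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, marks INTEGER)"
)
conn.executemany(
    "INSERT INTO students (id, name, marks) VALUES (?, ?, ?)",
    [(1, "Asha", 82), (2, "Ravi", 74), (3, "Meera", 91)],
)

# Every row has the same pre-designed fields, so a query can address them by name.
rows = conn.execute(
    "SELECT name, marks FROM students WHERE marks > ? ORDER BY marks DESC",
    (80,),
).fetchall()
print(rows)
conn.close()
```

Because the schema is fixed in advance, filtering and sorting need no parsing step; the database already knows what each field means.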

2. Semi-structured data
Semi-structured data is information that does not reside in a relational
database but has some organizational properties that make it easier to
analyze. With some processing it can be stored in a relational database
(though this can be very hard for some kinds of semi-structured data), but
the semi-structured form is kept because it saves space. Example: XML data.
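
As a rough sketch of what "some organizational properties" means, the XML below is read with Python's standard xml.etree.ElementTree module. The element names and values are invented for the example. Note that the second record has no email element, which a fixed table schema would not allow without a placeholder.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; tags describe the data, but records may differ.
xml_text = """
<students>
  <student id="1"><name>Asha</name><email>asha@example.com</email></student>
  <student id="2"><name>Ravi</name></student>
</students>
""".strip()

root = ET.fromstring(xml_text)
records = []
for student in root.findall("student"):
    records.append({
        "id": student.get("id"),
        "name": student.findtext("name"),
        # findtext returns None when the element is missing
        "email": student.findtext("email"),
    })
print(records)
```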

3. Unstructured data
Unstructured data is data that is not organized in a predefined manner or does not
have a predefined data model; thus, it is not a good fit for a mainstream relational
database. Alternative platforms are therefore used for storing and managing it.
Unstructured data is increasingly prevalent in IT systems and is used by
organizations in a variety of business intelligence and analytics applications.
Examples: Word documents, PDFs, text files, media logs.
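
Since unstructured text has no predefined fields, any structure has to be extracted from it. The sketch below uses regular expressions from Python's standard library to pull dates and e-mail addresses out of a free-text note. The note and both patterns are assumptions chosen for illustration, and the patterns are deliberately simple rather than complete.

```python
import re

# Hypothetical free-text note with no predefined fields.
note = (
    "Call with the supplier on 2024-03-15. "
    "Follow up by 2024-04-01; contact ravi@example.com."
)

# Simple patterns: ISO-style dates and basic e-mail addresses.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", note)
emails = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", note)

print(dates)
print(emails)
```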

Differences between Structured, Semi-structured and Unstructured data:

Properties | Structured data | Semi-structured data | Unstructured data
-----------|-----------------|----------------------|------------------
Technology | It is based on relational database tables | It is based on XML/RDF (Resource Description Framework) | It is based on character and binary data
Transaction management | Matured transaction and various concurrency techniques | Transaction is adapted from DBMS, not matured | No transaction management and no concurrency
Version management | Versioning over tuples, rows, tables | Versioning over tuples or graph is possible | Versioned as a whole
Flexibility | It is schema dependent and less flexible | It is more flexible than structured data but less flexible than unstructured data | It is more flexible and there is absence of schema
Scalability | It is very difficult to scale the DB schema | Its scaling is simpler than structured data | It is more scalable
Robustness | Very robust | New technology, not very widespread | -


Data Types Based on Its Collection


Based on how data is collected, it can be divided into two categories - Primary and
Secondary data. Let’s review the key differences between these two types in the
following table -

Factor | Primary Data | Secondary Data
-------|--------------|---------------
Definition | Primary data refers to the first-hand data collected by the team. It is collected based on the researcher's needs. | Secondary data has been collected by other teams in the past. It does not necessarily need to be aligned with the researcher's requirements.
Data | Real-time data | Historical data
Process | Time consuming | Quick and easy
Collection Time | Long | Short
Available In | Raw and crude form | Refined form
Accuracy and Reliability | Very high | Relatively less
Examples | Personal interviews, surveys, observations, etc. | Websites, articles, research papers, historical data, etc.

Types of Data:
The data in statistics is classified into four categories:
• Nominal data
• Ordinal data
• Discrete data
• Continuous data

Statistics also uses a second, related classification known as the levels of
measurement: nominal, ordinal, interval, and ratio. Both classifications describe the
nature of the data being collected or analyzed, and they help determine the
appropriate statistical tests to use.

Qualitative Data (Categorical Data)


As the name suggests, qualitative data describes the qualities or features of what is
being observed. It is also called categorical data because it sorts observations into
categories. Qualitative data includes attributes such as the gender of people or
their family names in a sample of population data.
Qualitative data is further divided into two categories:
• Nominal Data
• Ordinal Data

Nominal Data
Nominal data is a type of data that consists of categories or names that cannot be
ordered or ranked. Nominal data is often used to categorize observations into groups,
and the groups are not comparable. In other words, nominal data has no inherent
order or ranking. Examples of nominal data include gender (Male or female), race
(White, Black, Asian), religion (Hinduism, Christianity, Islam, Judaism), and blood
type (A, B, AB, O).
Nominal data can be represented using frequency tables and bar charts, which
display the number or proportion of observations in each category. For example, a
frequency table for gender might show the number of males and females in a sample
of people.
Nominal data is analyzed using non-parametric tests, which do not make any
assumptions about the underlying distribution of the data. Common non-parametric
tests for nominal data include Chi-Squared Tests and Fisher’s Exact Tests. These
tests are used to compare the frequency or proportion of observations in different
categories.
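
A frequency table like the one described above can be sketched with collections.Counter from Python's standard library. The blood-type sample is made up for the example.

```python
from collections import Counter

# Hypothetical sample of blood types (nominal data: names only, no order).
blood_types = ["A", "O", "B", "O", "AB", "A", "O", "A"]

counts = Counter(blood_types)
total = len(blood_types)
proportions = {category: count / total for category, count in counts.items()}

# Print category, count, and proportion for each group.
for category, count in counts.most_common():
    print(f"{category:>2}  {count}  {proportions[category]:.3f}")
```

Only counting is meaningful here. Sorting the categories or averaging them would have no interpretation, which is why the tests named above compare frequencies.
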
Ordinal Data
Ordinal data is a type of data that consists of categories that can be ordered or ranked.
However, the distance between categories is not necessarily equal. Ordinal data is
often used to measure subjective attributes or opinions, where there is a natural order
to the responses. Examples of ordinal data include education level (Elementary,
Middle, High School, College), job position (Manager, Supervisor, Employee), etc.
Ordinal data can be represented using bar charts or line charts. These displays show
the order or ranking of the categories, but they do not imply that the distances
between categories are equal.
Ordinal data is analyzed using non-parametric tests, which make no assumptions
about the underlying distribution of the data. Common non-parametric tests for
ordinal data include the Wilcoxon Signed-Rank test and Mann-Whitney U test.
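
Because ordinal categories have an order but no fixed spacing, rank-based summaries such as the median are appropriate. A minimal sketch, assuming the education levels listed above run from lowest to highest and using invented responses:

```python
# Education levels in their natural order, lowest to highest.
levels = ["Elementary", "Middle", "High School", "College"]
rank = {level: position for position, level in enumerate(levels)}

# Hypothetical survey responses (odd count, so the median is a single value).
responses = ["College", "Middle", "High School", "Elementary", "High School"]

# Sort by rank rather than alphabetically, then take the middle response.
ordered = sorted(responses, key=rank.__getitem__)
median_level = ordered[len(ordered) // 2]
print(ordered)
print(median_level)
```

The ranks are used only for ordering. Averaging them would wrongly assume equal distances between levels.
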
Quantitative Data (Numerical Data)
Quantitative data is data expressed as numerical values. It is also called numerical
data. This data type is used to represent quantities such as height, weight, and
length. Quantitative data is further classified into two categories:
• Discrete Data
• Continuous Data

Discrete Data
Discrete data is quantitative data that can take only distinct, separate values.
These values can be counted, usually as whole numbers.
Examples of discrete data are,
• Number of students in a class
• Marks of the students in a class test
• Number of members in a family, etc.

Continuous Data
Continuous data is quantitative data that is measured on a continuous range. A
variable in the data set can take any value within that range. Examples of
continuous data are,
• Temperature
• Height of students in a class
• Weight of different members of a family
• Salary of workers in a factory, etc.

Difference between Quantitative and Qualitative Data


Quantitative and qualitative data differ in several basic ways, which are summarized
in the table below.

Quantitative data | Qualitative data
------------------|-----------------
Data is depicted in numerical terms. | Data is not depicted in numerical terms.
Can be shown in numbers and variables like ratio, percentage, and more. | Could be about the behavioral attributes of a person or thing.
Examples: 100%, 1:3, 123 | Examples: loud behavior, fair skin, soft quality, and more.

Difference between Discrete and Continuous Data


Discrete data and continuous data both come under quantitative data, and the
differences between them are summarized in the table below.

Discrete Data | Continuous Data
--------------|----------------
The type of data that has clear spaces between values is discrete data. | This information falls into a continuous series.
Discrete data is countable. | Continuous data is measurable.
There are distinct or different values in discrete data. | Every value within a range is included in continuous data.
Discrete data is depicted using bar graphs. | Continuous data is depicted using histograms.
An ungrouped frequency distribution of discrete data is tabulated against single values. | A grouped frequency distribution of continuous data is tabulated against groups (ranges) of values.
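
The last row of the table can be sketched in a few lines of Python. The sample values and the bin width of 10 cm are assumptions chosen for the example.

```python
from collections import Counter

# Discrete: number of siblings per student (hypothetical counts).
siblings = [0, 1, 2, 1, 3, 1, 0, 2]
ungrouped = Counter(siblings)  # one frequency per single value

# Continuous: heights in cm (hypothetical measurements).
heights = [150.2, 161.8, 158.4, 172.5, 165.0, 149.9, 169.3, 160.1]
width = 10  # assumed bin width in cm
# Key each height by the lower edge of its bin, e.g. 161.8 falls in 160-170.
grouped = Counter(int(h // width) * width for h in heights)

for value in sorted(ungrouped):
    print(f"{value} siblings: {ungrouped[value]}")
for lower in sorted(grouped):
    print(f"{lower}-{lower + width} cm: {grouped[lower]}")
```

The discrete values are tallied directly, while the continuous values must first be assigned to ranges, which is also what a histogram does.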

Data Sources:

A data source is the location where the data being used originates. It may be the
place where the data is first created or where physical information is first
digitized. However, even the most refined data may serve as a source, as long as
another process accesses and uses it.

Databases
A database is an organized collection of structured information, or data, typically
stored electronically in a computer system. A database is usually controlled by a
database management system (DBMS).
Types:
• Relational Database
• NoSQL Database

Files
Data can also be stored in files, which come in various formats such as text files,
CSV files, Excel spreadsheets, and more.
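
Reading a CSV file can be sketched with Python's standard csv module. To keep the example self-contained, the file contents are held in a string and wrapped with io.StringIO; with a real file, open("students.csv", newline="") would take its place (that file name is hypothetical). The column names and values are invented.

```python
import csv
import io

# Hypothetical CSV contents; the first line holds the column names.
csv_text = "name,city,marks\nAsha,Pune,82\nRavi,Delhi,74\n"

reader = csv.DictReader(io.StringIO(csv_text))
rows = [dict(row) for row in reader]

# CSV stores everything as text, so numeric columns need explicit conversion.
marks = [int(row["marks"]) for row in rows]
average = sum(marks) / len(marks)
print(rows)
print(average)
```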

APIs (Application Programming Interface)


API stands for Application Programming Interface. In the context of APIs, the word
Application refers to any software with a distinct function. Interface can be thought
of as a contract of service between two applications. This contract defines how the
two communicate with each other using requests and responses.
Types:
• Web APIs: Allow access to data over HTTP (e.g. RESTful APIs) and usually return
data in JSON or XML format.
• Library APIs: APIs provided by programming libraries to access specific functions
and data.
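
The request and response cycle of a web API can be sketched without touching the network. In the example below the endpoint URL, the query parameters, and the JSON body are all hypothetical. A real client would send the request with an HTTP library and receive the body from the server.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameters; no real service is implied.
base_url = "https://api.example.com/v1/weather"
params = {"city": "Pune", "units": "metric"}
request_url = f"{base_url}?{urlencode(params)}"

# Sample JSON body of the kind a server might send back (made up here).
response_body = '{"city": "Pune", "temperature": 31.5, "conditions": ["clear"]}'
data = json.loads(response_body)

print(request_url)
print(data["temperature"])
```

The "contract" mentioned above is visible in both directions: the client must use the parameter names the service expects, and the service must return fields the client knows how to read.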

Web Scraping
Web scraping is the process of using bots to extract content and data from a website.
Unlike screen scraping, which only copies pixels displayed on screen, web scraping
extracts underlying HTML code, and, with it, data stored in a database. The scraper
can then replicate entire website content elsewhere.
Usage: Extracting news articles, product information, reviews, and more from
websites.
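
Extracting data from the underlying HTML can be sketched with html.parser from Python's standard library. The page below is an invented string, and the assumption that headlines sit in h2 tags with class "headline" is purely illustrative. Real pages differ, and a scraper should respect a site's terms of use.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collects the text inside <h2 class="headline"> elements."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h2" and ("class", "headline") in attrs:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline and data.strip():
            self.headlines.append(data.strip())


# Hypothetical page content standing in for a downloaded web page.
html = """
<html><body>
  <h2 class="headline">Rain expected this week</h2>
  <p>Some body text.</p>
  <h2 class="headline">New library opens</h2>
</body></html>
"""

parser = HeadlineParser()
parser.feed(html)
print(parser.headlines)
```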

Sensors

A sensor is a device that detects and responds to some type of input from the physical
environment. The input can be light, heat, motion, moisture, pressure, or any number
of other environmental phenomena. Sensors collect data from the environment or
from devices. In data science, sensor data is valuable for IoT applications,
environmental monitoring, healthcare, manufacturing, and more.
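
Sensor readings usually arrive as a stream of numbers and are often smoothed before analysis. The sketch below applies a simple moving average to simulated temperature readings. The values, including the spike, and the window size of 3 are assumptions for illustration only.

```python
# Simulated temperature readings in degrees Celsius; 35.0 is an artificial spike.
readings = [21.0, 21.4, 21.2, 35.0, 21.3, 21.5]
window = 3  # assumed number of readings per average

# Average each run of `window` consecutive readings.
smoothed = [
    sum(readings[i - window + 1 : i + 1]) / window
    for i in range(window - 1, len(readings))
]
print([round(value, 2) for value in smoothed])
```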

Social Media

Social Media platforms generate vast amounts of data daily including text messages,
videos, and user engagement metrics.
Usage: Analyzing trends, sentiments, user behavior, and engagement patterns.
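
One very simple form of trend analysis is counting hashtags. The sketch below does this with Python's standard library. The posts are invented for the example, and real platform data would normally come through that platform's API.

```python
import re
from collections import Counter

# Hypothetical posts; not taken from any real platform.
posts = [
    "Loving the new phone! #tech #gadgets",
    "Monsoon is here #weather",
    "Best budget laptops of the year #Tech",
]

# Lowercase the tags so "#Tech" and "#tech" count as the same hashtag.
hashtags = Counter(
    tag.lower() for post in posts for tag in re.findall(r"#\w+", post)
)
top = hashtags.most_common(1)
print(top)
```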

Chapter Ends…
