The resulting plot with the additional lines of code has a narrower range of values along the x- and y-axes.
EXAMPLE 1.15
Problem
Using the iris dataset, draw a scatterplot of petal length versus petal width for the setosa iris. Set the title, x-axis label, and y-axis label properly as well.
Solution
PYTHON CODE
# draw a scatterplot
plt.scatter(setosa["petal_length"], setosa["petal_width"])
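# The problem also asks for a title and axis labels. Assuming the setosa DataFrame
# was created earlier in the chapter by filtering the iris data by species, the
# following lines (the label text is an assumption) complete the plot.
plt.title("Setosa: Petal Length vs. Petal Width")
plt.xlabel("petal_length")
plt.ylabel("petal_width")
plt.show()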
Datasets
Note: The primary datasets referenced in the chapter code may also be downloaded here
(https://openstax.org/r/spreadsheetsd1).
Key Terms
attribute characteristic or feature that defines an item in a dataset
categorical data data that are represented in different forms and do not indicate measurable quantities
cell a block or rectangle on a table that is specified with a combination of a row number and a column
number
comma-separated values (CSV) format of a dataset in which each item takes up a single line and its values
are separated by commas (“,”)
continuous data data whose value is chosen from an infinite set of numbers
data anything that we can analyze to compile some high-level insights
data analysis the process of examining and interpreting raw data to uncover patterns, discover meaningful
insights, and make informed decisions
data collection the systematic process of gathering information on variables of interest
data preparation (data processing) the second step within the data science cycle; converts the collected
data into an optimal form for analysis
data reporting the presentation of data in a way that will best convey the information learned from data
analysis
data science a field of study that investigates how to collect, manage, and analyze data in order to retrieve
meaningful information from some seemingly arbitrary data
data science cycle a process used when investigating data
data visualization the graphical representation of data to point out the patterns and trends involving the
use of visual elements such as charts, graphs, and maps
data warehousing the process of storing and managing large volumes of data from various sources in a
central location for easier access and analysis by businesses
DataFrame a data type that Pandas uses to store multi-column tabular data
dataset a collection of related and organized information or data points grouped together for reference or
analysis
discrete data data whose values are limited to a specific precision, such as whole numbers
Excel a spreadsheet program with a graphical user interface developed by Microsoft to help with the
manipulation and analysis of data
Extensible Markup Language (XML) format of a dataset that uses tags
Google Colaboratory (Colab) software for editing and running Jupyter Notebook files
Google Sheets a spreadsheet program with a graphical user interface developed by Google to help with the
manipulation and analysis of data
information some high-level insights that are compiled from data
Internet of Things (IoT) the network of multiple objects interacting with each other through the Internet
item an element that makes up a dataset; also referred to as an entry and an instance
JavaScript Object Notation (JSON) format of a dataset that follows the syntax of the JavaScript
programming language
Jupyter Notebook a web-based document that helps users run Python programs more interactively
nominal data data whose values do not include any ordering notion
numeric data data that are represented in numbers and indicate measurable quantities
ordinal data data whose values include an ordering notion
Pandas a Python library specialized for data manipulation and analysis
predictive analytics statistical techniques, algorithms, and machine learning that analyze historical data and
make predictions about future events, an approach often used in medicine to offer more accurate diagnosis
and treatment
programming language a formal language that consists of a set of instructions or commands used to
communicate with a computer and instruct it to perform specific tasks
Python a programming language that has extensive libraries and is commonly used for data analysis
qualitative data non-numerical data that generally describe subjective attributes or characteristics and are
analyzed using methods such as thematic analysis or content analysis
quantitative data data that can be measured by specific quantities and amounts and are often analyzed
using statistical methods
R an open-source programming language that is specifically designed for statistical computing and graphics
recommendation system a system that makes data-driven, personalized suggestions for users
sabermetrics a statistical approach to sports team management
sports analytics use of data and business analytics in sports
spreadsheet program a software application consisting of electronic worksheets with rows and columns
where data can be entered, manipulated, and calculated
structured data dataset whose individual items have the same structure
unstructured data dataset whose individual items have different structures
XML tag any block of text that consists of a pair of angle brackets (< >) with some text inside
Group Project
Project A: Data Source Quality
As a student of, or a new professional working in, data science, you will not always be collecting new primary
data. It’s just as important to be able to locate, critically evaluate, and properly clean existing sources of
secondary data. (Collecting and Preparing Data will cover the topic of data collection and cleaning in more
detail.)
Using the suggested sources or similar-quality sources that you research on the Internet, find two to three
datasets about the field or industry in which you intend to work. (You might also try to determine whether
similar data sets are available at the national, state/province, and local/city levels.) In a group, formulate a
specific, typical policy issue or business decision that managers in these organizations might make. For the
datasets you found, compare and contrast their size, collection methods, types of data, update frequency and
recency, and relevance to the decision question you have identified.
Review the following resources on concerns related to data collection:
a. Privacy concerns related to data collection (See the Protecting Personal Privacy (https://openstax.org/r/
gaog) website of the U.S. Government Accountability Office.)
b. Ethics concerns related to data collection, including fair use of copyrighted materials (See the U.S.
Copyright Office guidelines (https://openstax.org/r/fairuse).)
c. Bias concerns related to data collection (See the National Cancer Institute (NCI) article
(https://openstax.org/r/bias) on data bias.)
Suppose that you are part of a data science team working for an organization on data collection for a major
project or product. Discuss as a team how the issues of privacy, ethics, and equity (avoiding bias) could be
addressed, depending on your position in the organization and the type of project or product.
Chapter Review
1. Select the incorrect step and goal pair of the data science cycle.
a. Data collection: collect the data so that you have something for analysis.
b. Data preparation: have the collected data stored in a server as is so that you can start the analysis.
c. Data analysis: analyze the prepared data to retrieve some meaningful insights.
d. Data reporting: present the data in an effective way so that you can highlight the insights found from
the analysis.
2. Which of the following best describes the evolution of data management in the data science process?
a. Initially, data was stored locally on individual computers, but with the advent of cloud-based systems,
data is now stored on designated servers outside of local storage.
b. Data management has remained static over time, with most data scientists continuing to store and
process data locally on individual computers.
c. The need for data management arose as a result of structured data becoming unmanageable, leading
to the development of cloud-based systems for data storage.
d. Data management systems have primarily focused on analysis rather than processing, resulting in the
development of modern data warehousing solutions.
3. Which of the following best exemplifies the interdisciplinary nature of data science in various fields?
a. A historian traveling to Italy to study ancient manuscripts to uncover historical insights about the
Roman Empire
b. A mathematician solving complex equations to model physical phenomena
c. A biologist analyzing a large dataset of genetic sequences to gain insights about the genetic basis of
diseases
d. A chemist synthesizing new compounds in a laboratory
Critical Thinking
1. For each dataset (https://openstax.org/r/spreadsheetsd1), list the attributes.
a. Spotify dataset
b. CancerDoc dataset
2. For each dataset (https://openstax.org/r/spreadsheetsd1), define the type of the data based on the following
criteria and explain why:
• Numeric vs. categorical
• If it is numeric, continuous vs. discrete; if it is categorical, nominal vs. ordinal
3. For each dataset (https://openstax.org/r/spreadsheetsd1), identify the type of the dataset—structured vs.
unstructured. Explain why.
a. Spotify dataset
b. CancerDoc dataset
5. Open the WikiHow dataset (ch1-wikiHow.json (https://openstax.org/r/filed)) and list the attributes of the
dataset.
6. Draw a scatterplot between bpm (x-axis) and danceability (y-axis) of the Spotify dataset
(https://openstax.org/r/filed) using:
a. Python Matplotlib
b. A spreadsheet program such as MS Excel or Google Sheets (Hint: Search “Scatterplot” on Help.)
7. Regenerate the scatterplot of the Spotify dataset (https://openstax.org/r/filed), but with a custom title and
x-/y-axis label. The title should be “BPM vs. Danceability.” The x-axis label should be titled “bpm” and range
from the minimum to the maximum bpm value. The y-axis label should be titled “danceability” and range
from the minimum to the maximum Danceability value.
a. Python Matplotlib (Hint: DataFrame.min() and DataFrame.max() methods return min and max
values of the DataFrame. You can call these methods upon a specific column of a DataFrame as well.
For example, if a DataFrame is named df and has a column named “col1”, df[“col1”].min() will
return the minimum value of the “col1” column of df.)
b. A spreadsheet program such as MS Excel or Google Sheets (Hint: Calculate the minimum and
maximum value of each column somewhere else first, then simply use the value when editing the
scatterplot.)
8. Based on the Spotify dataset (https://openstax.org/r/spreadsheet4), filter the following using Python
Pandas:
a. Tracks whose artist is Taylor Swift
b. Tracks that were sung by Taylor Swift and released earlier than 2020
Quantitative Problems
1. Based on the Spotify dataset (https://openstax.org/r/spreadsheet4), calculate the average bpm of the
songs released in 2023 using:
a. Python Pandas
b. A spreadsheet program such as MS Excel or Google Sheets (Hint: The formula AVERAGE() computes the
average across the cells specified in the parentheses. For example, within Excel, typing in the
command “=AVERAGE(A1:A10)” in any empty cell will calculate the numeric average for the contents of
cells A1 through A10. Search “AVERAGE function” on Help as well.)
Figure 2.1 Periodic population surveys such as censuses help governments plan resources to supply public services. (credit:
modification of work “Census 2010 @ La Fuente” by Jenn Turner/Flickr, CC BY 2.0)
Chapter Outline
2.1 Overview of Data Collection Methods
2.2 Survey Design and Implementation
2.3 Web Scraping and Social Media Data Collection
2.4 Data Cleaning and Preprocessing
2.5 Handling Large Datasets
Introduction
Data collection and preparation are the first steps in the data science cycle. They involve systematically
gathering the necessary data to meet a project's objectives and ensuring its readiness for further analysis.
Well-executed data collection and preparation serve as a solid foundation for effective, data-driven decision-
making and aid in detecting patterns, trends, and insights that can drive business growth and efficiency.
With today’s ever-increasing volume of data, a robust approach to data collection is crucial for ensuring
accurate and meaningful results. This process requires following a comprehensive and systematic
methodology designed to ensure the quality, reliability, and validity of data gathered for analysis. It involves
identifying and sourcing relevant data from diverse sources, including internal databases, external
repositories, websites, and user-generated information. And it requires meticulous planning and execution to
guarantee the accuracy, comprehensiveness, and reliability of the collected data.
Preparing, or “wrangling,” the collected data adequately prior to analysis is equally important. Preparation
involves scrubbing, organizing, and transforming the data into a format suitable for analysis. Data preparation
plays a pivotal role in detecting and resolving any inconsistencies or errors present in the data, thereby
enabling accurate analysis. The rapidly advancing technology and widespread use of the internet have added
complexity to the data collection and preparation processes. As a result, data analysts and organizations face
many challenges, such as identifying relevant data sources, managing large data volumes, identifying outliers
or erroneous data, and handling unstructured data. By mastering the art and science of collecting and
preparing data, organizations can leverage valuable insights to drive informed decision-making and achieve
business success.
Data collection refers to the systematic and well-organized process of gathering and accurately conveying
important information and aspects related to a specific phenomenon or event. This involves using statistical
tools and techniques to collect data, identify its attributes, and capture relevant contextual information. The
gathered data is crucial for making sound interpretations and gaining meaningful insights. Additionally, it is
important to take note of the environment and geographic location from where the data was obtained, as it
can significantly influence the decision-making process and overall conclusions drawn from the data.
Data collection can be carried out through various methods, depending on the nature of the research or
project and the type of data being collected. Some common methods for data collection include experiments,
surveys, observation, focus groups, interviews, and document analysis.
This chapter will focus on the use of surveys and experiments to collect data. Social scientists, marketing
specialists, and political analysts regularly use surveys to gather data on topics such as public opinion,
customer satisfaction, and demographic information. Pharmaceutical companies heavily rely on experimental
data from clinical trials to test the safety and efficacy of new drugs. This data is then used by their legal teams
to gain regulatory approval and bring drugs to market.
Before collecting data, it is essential for a data scientist to have a clear understanding of the project's
objectives, which involves identifying the research question or problem and defining the target population or
sample. If a survey or experiment is used, the design of the survey/experiment is also a critical step, requiring
careful consideration of the type of questions, response options, and overall structure. A survey may be
conducted online, via phone, or in person, while experimental research requires a controlled environment to
ensure data validity and reliability.
Types of Data
Observational and transactional data play important roles in data analysis and related decision-making, each
offering unique insights into different aspects of real-world phenomena and business operations.
Observational data, often used in qualitative research, is collected by systematically observing and recording
behavior without the active participation of the researcher. Transactional data refers to any type of
information related to transactions or interactions between individuals, businesses, or systems, and it is more
often used in quantitative research.
Many fields of study use observational data for their research. Table 2.1 summarizes some examples of fields
that rely on observational data, the type of data they collect, and the purpose of their data collection.
Transactional data is collected by directly recording transactions that occur in a particular setting, such as a
retail store or an online platform, which allows for accurate and detailed information on actual consumer
behavior. It can include financial data, but it also includes data related to customer purchases, website clicks,
user interactions, or any other type of activity that is recorded and tracked.
Transactional data can be used to understand patterns and trends, make predictions and recommendations,
and identify potential opportunities or areas for improvement. For example, the health care industry may
focus on transactional data related to patient interactions with health care providers and facilities, such as
appointments, treatments, and medications prescribed. The retail industry may use transactional data on
customer purchases and product returns, while the transportation industry may analyze data related to ticket
sales and passenger traffic.
While observational data provides detailed descriptions of behavior, transactional data provides numerical
data for statistical analysis. There are strengths and limitations with each of these, and the examples in this
chapter will make use of both types.
EXAMPLE 2.1
Problem
Ashley loves setting up a bird feeder in her backyard and watching the different types of birds that come to
feed. She has always been curious about the typical number of birds that visit her feeder each day and has
estimated the number based on the amount of food consumed. However, she has to visit her
grandmother's house for three days and is worried about leaving the birds without enough food. In order
to prepare the right amount of bird food for her absence, Ashley has decided to measure the total amount
of feed eaten each day to determine the total amount of food needed for her three-day absence. Which
method of data collection is best suited for Ashley's research on determining the total amount of food
required for her three-day absence—observational or transactional? Provide a step-by-step explanation of
the chosen method.
Solution
Ashley wants to ensure that there is enough food for her local birds while she is away for three days. To do
this, she will carefully observe the feeder daily for two consecutive weeks. She will record the total amount
of feed eaten each day and make sure to refill the feeder each morning before the observation. This will
provide a consistent amount of food available for the birds. After two weeks, Ashley will use the total
amount of food consumed and divide it by the number of days observed to estimate the required daily
food. Then, she will multiply the daily food by three to determine the total amount of bird food needed for
her three-day absence. By directly observing and recording the bird food, as well as collecting data for two
weeks, Ashley will gather accurate and reliable information. This will help her confidently prepare the
necessary amount of bird food for her feathered friends while she is away, thus ensuring that the birds are
well-fed and taken care of during her absence.
EXAMPLE 2.2
Problem
A group of data scientists working for a large hospital have been tasked with analyzing their transactional
data to identify areas for improvement. In the past year, the hospital has seen an increase in patient
complaints about long wait times for appointments and difficulties scheduling follow-up visits. Samantha is
one of the data scientists tasked with collecting data in order to analyze these issues.
a. What methodology should be employed by Samantha to collect pertinent data for analyzing the recent
surge in patient complaints regarding extended appointment wait times and difficulties in scheduling
follow-up visits at the hospital?
b. What strategies could be used to analyze the data?
Solution
1. Electronic Health Records (EHRs): Samantha can gather data from the hospital's electronic health
records system. This data may include patients' appointment schedules, visit durations, and wait times.
This information can help identify patterns and trends in appointment scheduling and wait times.
2. Appointment Booking System: Samantha can gather data from the hospital's appointment booking
system. This data can include appointment wait times, appointment types (e.g., primary care,
specialist), and scheduling difficulties (e.g., appointment availability, cancellations). This information
can help identify areas where the booking system may be causing delays or challenges for patients.
3. Hospital Call Center: Samantha can gather data from the hospital's call center, which is responsible for
booking appointments over the phone. This data can include call wait times, call duration, and reasons
for call escalations. This information can help identify areas for improvement in the call center's
processes and procedures.
4. Historical Data: Samantha can analyze historical data, such as appointment wait times and scheduling
patterns, to identify any changes that may have contributed to the recent increase in complaints. This
data can also be compared to current data to track progress and improvements in wait times and
scheduling.
Consider this example: Scientist Sally aimed to investigate the impact of sunlight on plant growth. The
research inquiry was to determine whether increased exposure to sunlight enhances the growth of plants.
Sally experimented with two groups of plants wherein one group received eight hours of sunlight per day,
while the other only received four hours. The height of each plant was measured and documented every week
for four consecutive weeks. The main research objective was to determine the growth rate of plants exposed
to eight hours of sunlight compared to those with only four hours. A total of 20 identical potted plants were
used, with one group allocated to the "sunlight" condition and the other to the "limited sunlight" condition.
Both groups were maintained under identical environmental conditions, including temperature, humidity, and
soil moisture. Adequate watering was provided to ensure equal hydration of all plants. The measurements of
plant height were obtained and accurately recorded every week. This approach allowed for the collection of
precise and reliable data on the impact of sunlight on plant growth, which can serve as a valuable resource for
further research and understanding of this relationship.
Surveys are a common strategy for gathering data in a wide range of domains, including market research,
social sciences, and education. Surveys collect information from a sample of individuals and often use
questionnaires to collect data. Sampling is the process of selecting a subset of a larger population to
represent and analyze information about that population.
Constructing good surveys is hard. A survey should begin with simple and easy-to-answer questions and
progress to more complex or sensitive ones. This can help build a rapport with the respondents and increase
their willingness to answer more difficult questions. Additionally, the researcher may consider mixing up the
response options for multiple-choice questions to avoid response bias. To ensure the quality of the data
collected, the survey questionnaire should undergo a pilot test with a small group of individuals from the
target population. This allows the researcher to identify any potential issues or confusion with the questions
and make necessary adjustments before administering the survey to the larger population.
Open-ended questions allow for more in-depth responses and provide the opportunity for unexpected
insights. They also allow respondents to elaborate on their thoughts and provide detailed and personal
responses. Closed-ended questions have predetermined answer choices and are effective in gathering
quantitative data. They are quick and easy to answer, and their clear and structured format allows for
quantifiable results.
An example of a biased survey question in a survey conducted by a shampoo company might be "Do you
prefer our brand of shampoo over cheaper alternatives?" This question is biased because it assumes that the
respondent prefers the company's brand over others. A more unbiased and accurate question would be "What
factors do you consider when choosing a shampoo brand?" This allows for a more detailed and accurate
response. The biased question could have led to inflated results in favor of the company's brand.
Sampling
The next step in the data collection process is to choose a participant sample to ideally represent the
restaurant's customer base. Sampling could be achieved by randomly selecting customers, using customer
databases, or targeting specific demographics, such as age or location.
Sampling is necessary in a wide range of data science projects to make data collection more manageable and
cost-effective while still drawing meaningful conclusions. A variety of techniques can be employed to
determine a subset of data from a larger population to perform research or construct hypotheses about the
entire population. The choice of a sampling technique depends upon the nature and features of the
population being studied as well as the objectives of the research. When using a survey, researchers must also
consider the tool(s) that will be used for distributing the survey, such as through email, social media, or
physically distributing questionnaires at the restaurant. It's crucial to make the survey easily accessible to the
chosen sample to achieve a higher response rate.
A number of sampling techniques and their advantages are described below; a short Python sketch after the list illustrates several of them in code. The most frequently used among these are simple random selection, stratified sampling, cluster sampling, and convenience sampling.
1. Simple random selection. Simple random selection is a statistical technique used to pick a representative
sample from a larger population. This process involves randomly choosing individuals or items from the
population, ensuring that each member of the population has an equal chance of being included in the sample. The first step in simple random selection is to define the population of interest and assign a unique identification number to each member. The selection itself can then be done using a random number
generator, a computer program designed to generate a sequence of random numbers, or a random
number table, which lists numbers in a random sequence. The primary benefit of this technique is its
ability to minimize bias and deliver a fair representation of the population.
In the health care field, simple random sampling is utilized to select patients for medical trials or surveys,
allowing for a diverse and unbiased sample (Elfil & Negida, 2017). Similarly, in finance, simple random
sampling can be applied to gather data on consumer behavior and guide decision-making in financial
institutions. In engineering, this technique is used to select random samples of materials or components
for quality control testing. In the political arena, simple random sampling is commonly used to select
randomly registered voters for polls or surveys, ensuring equal representation and minimizing bias in the
data collected.
2. Stratified sampling. Stratified sampling involves splitting the population into subgroups based on
specified factors, such as age, area, income, or education level, and taking a random sample from each
stratum in proportion to its size in the population. Stratified sampling allows for a more accurate
representation of the population as it ensures that all subgroups are adequately represented in the
sample. This can be especially useful when the variables being studied vary significantly between the
stratified groups.
3. Cluster sampling. With cluster sampling, the population is divided into natural groups or clusters, such as
schools, communities, or cities, with a random sample of these clusters picked and all members within the
chosen clusters included in the sample. Cluster sampling is helpful when surveying the entire population directly would be difficult or time-consuming, although it brings its own challenges, such as identifying clusters, sourcing a list of clusters, traveling to different clusters, and communicating with them. Additionally, data analysis and sample size
calculation may be more complex, and there is a risk of bias in the sample. However, cluster sampling can
be more cost-effective.
An example of cluster sampling would be a study on the effectiveness of a new educational program in a
state. The state is divided into clusters based on school districts. The researcher uses a random selection
process to choose a sample of school districts and then collects data from all the schools within those
districts. This method allows the researcher to obtain a representative sample of the state's student
population without having to visit each individual school, saving time and resources.
4. Convenience sampling. Convenience sampling applies to selecting people or items for the sample based
on their availability and convenience to the data science research. For example, a researcher may choose
to survey students in their classroom or collect data from readily available social media users. Convenience sampling is easy to carry out, and it is useful for exploratory studies. However, it may not provide a
representative sample as it is prone to selection bias in that individuals who are more readily available or
willing to participate may be overrepresented.
An example of convenience sampling would be conducting a survey about a new grocery store in a busy
shopping mall. A researcher stands in front of the store and approaches people who are coming out of the
store to ask them about their shopping experience. The researcher only includes responses from those
who agreed to participate, resulting in a sample that is convenient but may not be representative of the
entire population of shoppers in the mall.
5. Systematic sampling. Systematic sampling is based on starting at a random location in the dataset and
then selecting every nth member from a population to be contained in the sample. This process is
straightforward to implement, and it provides a representative sample when the population is randomly
distributed. However, if there is a pattern in the sampling frame (the organizing structure that represents
the population from which a sample is drawn), it may lead to a biased sample.
Suppose a researcher wants to study the dietary habits of students in a high school. The researcher has a
list of all the students enrolled in the school, which is approximately 1,000 students. Instead of randomly
selecting a sample of students, the researcher decides to use systematic sampling. The researcher first
assigns a number to each student, going from 1 to 1,000. Then, the researcher randomly selects a number
from 1 to 10—let's say they select 4. This number will be the starting point for selecting the sample of
students. The researcher will then select every 10th student from the list starting at 4, which means every student with a number ending in 4 (4, 14, 24, 34, etc.) will be included in the sample. This way, the researcher will have a
representative sample of 100 students from the high school, which is 10% of the population. The sample
will consist of students from different grades, genders, and backgrounds, making it a diverse and
representative sample.
6. Purposive sampling. With purposive sampling, one or more specific criteria are used to select
participants who are likely to provide the most relevant and useful information for the research study. This
can involve selecting participants based on their expertise, characteristics, experiences, or behaviors that
are relevant to the research question.
For example, if a researcher is conducting a study on the effects of exercise on mental health, they may
use purposive sampling to select participants who have a strong interest or experience in physical fitness
and have a history of mental health issues. This sampling technique allows the researcher to target a
specific population that is most relevant to the research question, making the findings more applicable
and generalizable to that particular group. The main advantage of purposive sampling is that it can save
time and resources by focusing on individuals who are most likely to provide valuable insights and
information. However, researchers need to be transparent about their sampling strategy and potential
biases that may arise from purposely selecting certain individuals.
7. Snowball sampling. Snowball sampling is typically used in situations where it is difficult to access a
particular population; it relies on the assumption that people with similar characteristics or experiences
tend to associate with each other and can provide valuable referrals. This type of sampling can be useful in
studying hard-to-reach or sensitive populations, but it may also be biased and limit the generalizability of
findings.
8. Quota sampling. Quota sampling is a non-probability sampling technique in which experimenters select
participants based on predetermined quotas to guarantee that a certain number or percentage of the
population of interest is represented in the sample. These quotas are based on specific demographic
characteristics, such as age, gender, ethnicity, and occupation, which are believed to have a direct or
indirect relationship with the research topic. Quota sampling is generally used in market research and
opinion polls, as it allows for a fast and cost-effective way to gather data from a diverse range of
individuals. However, it is important to note that the results of quota sampling may not accurately
represent the entire population, as the sample is not randomly selected and may be biased toward certain
characteristics. Therefore, the findings from studies using quota sampling should be interpreted with
caution.
9. Volunteer sampling. In volunteer sampling, participants are not picked at random by the researcher but instead volunteer themselves to be a part of the study. This type of sampling is
commonly used in studies that involve recruiting participants from a specific population, such as a specific
community or organization. It is also often used in studies where convenience and accessibility are
important factors, as participants may be more likely to volunteer if the study is easily accessible to them.
Volunteer sampling is not considered a random or representative sampling technique, as the participants
may not accurately represent the larger population. Therefore, the results obtained from volunteer
sampling may not be generalizable to the entire population.
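The following minimal pandas sketch illustrates three of the techniques above on a toy dataset; the DataFrame, its column names, and the sample sizes are assumptions for illustration, not part of the chapter's own examples.

import pandas as pd

# toy population: 1,000 customers with an age_group column (assumed data)
df = pd.DataFrame({
    "customer_id": range(1, 1001),
    "age_group": ["18-34", "35-54", "55+"] * 333 + ["18-34"],
})

# 1. simple random sampling: every row has an equal chance of selection
simple_random = df.sample(n=100, random_state=1)

# 2. stratified sampling: draw 10% from each age_group stratum
stratified = df.groupby("age_group").sample(frac=0.10, random_state=1)

# 3. systematic sampling: random starting position, then every 10th row
start = 4                      # chosen at random between 0 and 9
systematic = df.iloc[start::10]

print(len(simple_random), len(stratified), len(systematic))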
Sampling Error
Sampling error is the difference between the results obtained from a sample and the true value of the
population parameter it is intended to represent. It is caused by chance and is inherent in any sampling
method. The goal of researchers is to minimize sampling errors and increase the accuracy of the results. To
avoid sampling error, researchers can increase sample size, use probability sampling methods, control for
extraneous variables, use multiple modes of data collection, and pay careful attention to question formulation.
Sampling Bias
Sampling bias occurs when the sample used in a study isn’t representative of the population it intends to
generalize to, leading to skewed or inaccurate conclusions. This bias can take many forms, such as selection
bias, where certain groups are systematically over- or underrepresented, or volunteer bias, where only a
specific subset of the population participates. Researchers use the sampling techniques summarized earlier to
avoid sampling bias and ensure that each member of the population has an equal chance of being included in
the sample. Additionally, careful consideration of the sampling frame should ideally encompass all members of
the target population and provide a clear and accessible way to identify and select individuals or units for
inclusion in the sample. Sampling bias can occur at various stages of the sampling process, and it can greatly
impact the accuracy and validity of research findings.
Measurement Error
Measurement errors are inaccuracies or discrepancies that surface during the process of collecting,
recording, or analyzing data. They may occur due to human error, environmental factors, or inherent
inconsistencies in the phenomena being studied. Random error, which arises unpredictably, can affect the
precision of measurements, and systematic error may consistently bias measurements in a particular
direction. In data analysis, addressing measurement error is crucial for ensuring the reliability and validity of
results. Techniques for mitigating measurement error include improving data collection methods, calibrating
instruments, conducting validation studies, and employing statistical methods like error modeling or
sensitivity analysis to account for and minimize the impact of measurement inaccuracies on the analysis
outcomes.
Types of sampling error that could occur in this study include the following:
1. Sampling bias. One potential source of bias in this study is self-selection bias. As the participants are all
college students, they may not be representative of the larger population, as college students tend to
have more access and motivation to exercise compared to the general population. This could limit the
generalizability of the study's findings. In addition, if the researchers only recruit participants from one
university, there may be under-coverage bias. This means that certain groups of individuals, such as
nonstudents or students from other universities, may be excluded from the study, potentially leading to
biased results.
2. Measurement error. Measurement errors could occur, particularly if the researchers are measuring the
participants' exercise and mental health outcomes through self-report measures. Participants may not
accurately report their exercise habits or mental health symptoms, leading to inaccurate data.
3. Non-response bias. Some participants in the study may choose not to participate or may drop out before
the study is completed. This could introduce non-response bias, as those who choose not to participate or
drop out may differ from those who remain in the study in terms of their exercise habits or mental health
outcomes.
4. Sampling variability. The sample of 100 participants is a relatively small subset of the larger population.
As a result, there may be sampling variability, meaning that the characteristics and outcomes of the
participants may differ from those of the larger population simply due to chance.
5. Sampling error in random assignment. In this study, the researchers randomly assign participants to
either the exercise group or the control group. However, there is always a possibility of sampling error in
the random assignment process, meaning that the groups may not be perfectly balanced in terms of their
exercise habits or other characteristics.
These types of sampling errors can affect the accuracy and generalizability of the study's findings.
Researchers need to be aware of these potential errors and take steps to minimize them when designing
and conducting their studies.
EXAMPLE 2.3
Problem
Mark is a data scientist who works for a marketing research company. He has been tasked to lead a study to
understand consumer behavior toward a new product that is about to be launched in the market. As a data scientist, he knows the importance of using the right sampling technique to collect accurate and reliable
data. Mark divided the population into different groups based on factors such as age, education, and
income. This ensures that he gets a representative sample from each group, providing a more accurate
understanding of consumer behavior. What is the name of the sampling technique used by Mark to ensure
a representative sample from different groups of consumers for his study on consumer behavior toward a
new product?
Solution
The sampling technique used by Mark is called stratified sampling. This involves dividing the population
into subgroups or strata based on certain characteristics and then randomly selecting participants from
each subgroup. This ensures that each subgroup is represented in the sample, providing a more accurate
representation of the entire population. This type of sampling is often used in market research studies to
get a more comprehensive understanding of consumer behavior and preferences. By using stratified
sampling, Mark can make more reliable conclusions and recommendations for the new product launch
based on the data he collects.
Web scraping and social media data collection are two approaches used to gather data from the internet. Web
scraping involves pulling information and data from websites using a web data extraction tool, often known
as a web scraper. One example would be a travel company looking to gather information about hotel prices
and availability from different booking websites. Web scraping can be used to automatically gather this data
from the various websites and create a comprehensive list for the company to use in its business strategy
without the need for manual work.
Social media data collection involves gathering information from various platforms like Twitter and Instagram
using application programming interfaces (APIs) or monitoring tools. An application programming interface (API) is a set of protocols, tools, and definitions for building software applications. APIs allow different software systems to communicate and interact with each other, enabling developers to access data and services from other applications, operating systems, or platforms. Both web scraping and social media data collection require
determining the data to be collected and analyzing it for accuracy and relevance.
Web Scraping
There are several techniques and approaches for scraping data from websites. See Table 2.2 for some of the
common techniques used. (Note: The techniques used for web scraping will vary depending on the website
and the type of data being collected. It may require a combination of different techniques to effectively scrape
data from a website.)
Regular Expressions
• Search for and extract specific patterns of text from a web page
• Useful for scraping data that follows a particular format, such as dates, phone numbers, or email addresses
HTML Parsing
• Analyzes the HTML (HyperText Markup Language) structure of a web page and identifies the specific tags and elements that contain the desired data
• Often used for simple scraping tasks
Application Programming Interfaces (APIs)
• Authorize developers to access and retrieve data instantly without the need for web scraping
• Often a more efficient and reliable method for data collection
XML API Subset
• XML (Extensible Markup Language) is another markup language used for exchanging data
• This method works similarly to using the HTML API subset by making HTTP requests to the website's API endpoints and then parsing the data received in XML format
JSON API Subset
• JSON (JavaScript Object Notation) is a lightweight data interchange format that is commonly used for sending and receiving data between servers and web applications
• Many websites provide APIs in the form of JSON, making it another efficient method for scraping data
Table 2.2 Techniques and Approaches for Scraping Data from Websites
An example of social media data collection is conducting a Twitter survey on customer satisfaction for a food
delivery company. Data scientists can use Twitter's API to collect tweets containing specific hashtags related to
the company and analyze them to understand customers' opinions and preferences. They can also use social
listening to monitor conversations and identify trends in customer behavior. Additionally, creating a social
media survey on Twitter can provide more targeted insights into customer satisfaction and preferences. This
data can then be analyzed using data science techniques to identify key areas for improvement and drive
informed business decisions.
To scrape data such as a table from a website using Python, we follow these steps:
1. Import the pandas library. The first step is to import the pandas library (https://openstax.org/r/pandas),
which is a popular Python library for data analysis and manipulation.
import pandas as pd
2. Read the tables from the web page. The pandas read_html() function downloads the page and parses any HTML tables it finds, returning a list of DataFrame objects.
df = pd.read_html("https://...")
3. Access the desired data. If the data on the web page is divided into different tables, we need to specify
which table we want to extract. We have used indexing to access the desired table (for example: index 4)
from the list of DataFrame objects returned by the read_html() function. The index here represents the
table order in the web page.
4. Store the data in a DataFrame. The result of the read_html() function is a list of DataFrame objects,
and each DataFrame represents a table from the web page. We can store the desired data in a DataFrame
variable for further analysis and manipulation.
5. Display the DataFrame. By accessing the DataFrame variable, we can see the extracted data in a tabular
format.
6. Convert strings to numbers. As noted in Chapter 1, a string is a data type used to represent a sequence
of characters, such as letters, numbers, and symbols that are enclosed by matching single (') or double (")
quotes. If the data in the table is in string format and we want to perform any numerical operations on it,
we need to convert the data to numerical format. We can use the to_numeric() function from pandas to
convert strings to numbers and then store the result in a new column in the DataFrame.
df['column_name'] = pd.to_numeric(df['column_name'])
This will create a new column in the DataFrame with the converted numerical values, which can then be used
for further analysis or visualization.
In computer programming, indexing usually starts from 0. This is because most programming languages use 0
as the initial index for arrays, matrices, or other data structures. This convention has been adopted to simplify
the implementation of some algorithms and to make it easier for programmers to access and manipulate
data. Additionally, it aligns with the way computers store and access data in memory. In the context of parsing
tables from HTML pages, using 0 as the initial index allows programmers to easily access and manipulate data
from different tables on the same web page, which enables efficient data processing and analysis.
EXAMPLE 2.4
Problem
Extract data table "Current Population Survey: Household Data: (Table A-13). Employed and unemployed
persons by occupation, Not seasonally adjusted" from the FRED (Federal Reserve Economic Data)
(https://openstax.org/r/fred) website in the link (https://fred.stlouisfed.org/release/
tables?rid=50&eid=3149#snid=4498 (https://openstax.org/r/stlouisfed)) using Python code. The data in this
table provides a representation of the overall employment and unemployment situation in the United
States. The table is organized into two main sections: employed persons and unemployed persons.
Solution
PYTHON CODE
# Import pandas
import pandas as pd
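# The remaining steps are sketched below; this assumes the release page can be
# parsed by read_html() (an HTML parser such as lxml must be installed and the
# page must allow automated requests) and that the desired table is the first
# one returned -- the index may need adjusting.

# Read every HTML table on the FRED release page into a list of DataFrames
url = "https://fred.stlouisfed.org/release/tables?rid=50&eid=3149#snid=4498"
tables = pd.read_html(url)

# Access the desired table by its position on the page (index is an assumption)
df = tables[0]

# Display the first few rows of the extracted table
print(df.head())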
In Python, there are several libraries and methods that can be used for parsing and extracting data from text, including regular expressions (the re module) and built-in string operations such as splitting and slicing.
Overall, the library or method used for parsing and extracting data will depend on the specific task and type of
data being analyzed. It is important to research and determine the best approach for a given project.
A regular expression defines a pattern of characters against which text can be searched and matched. Typical applications include data parsing, input validation, and extracting
targeted information from larger text sources. Common use cases in Python involve recognizing various types
of data, such as dates, email addresses, phone numbers, and URLs, within extensive text files. Moreover,
regular expressions are valuable for tasks like data cleaning and text processing. Despite their versatility,
regular expressions can be elaborate, allowing for advanced search patterns that use meta-characters like *, ?, and +. However, working with these expressions can present challenges, as they require a thorough understanding and careful debugging to ensure successful implementation. The short snippet after the following list illustrates these meta-characters in action.
• The * character is known as the “star” or “asterisk” and is used to match zero or more occurrences of
the preceding character or group in a regular expression. For example, the regular expression "a*" would match zero or more "a"s in a row, such as "" (an empty match), "a", "aa", "aaa", etc.
• The ? character is known as the "question mark" and is used to indicate that the preceding character or
group is optional. It matches either zero or one occurrences of the preceding character or group. For
example, the regular expression "a?b" would match either "ab" or "b".
• The + character is known as the "plus sign" and is used to match one or more occurrences of the
preceding character or group. For example, the regular expression "a+b" would match one or more
"a"s followed by a "b", such as "ab", "aab", "aaab", etc. If there are no "a"s, the match will fail. This
is different from the * character, which would match zero or more "a"s followed by a "b", allowing for
a possible match without any "a"s.
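A short snippet (with assumed sample strings) can be used to verify these meta-characters with Python's re module:

import re

# fullmatch() returns a match object only if the whole string fits the pattern
print(bool(re.fullmatch("a*", "")))       # True: * allows zero "a"s
print(bool(re.fullmatch("a*", "aaa")))    # True: * allows many "a"s
print(bool(re.fullmatch("a?b", "b")))     # True: ? makes the "a" optional
print(bool(re.fullmatch("a?b", "ab")))    # True
print(bool(re.fullmatch("a+b", "b")))     # False: + requires at least one "a"
print(bool(re.fullmatch("a+b", "aaab")))  # True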
EXAMPLE 2.5
Problem
Write Python code using regular expressions to search for a selected word “Python” in a given string and
print the number of times it appears.
Solution
PYTHON CODE
# use the "findall" to search for matching patterns in the data string
words = re.findall(pattern, story)
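Putting the pieces together, a complete, runnable version of this solution might look like the following; the sample string stored in story is an assumption for illustration.

import re

# sample text containing the word we want to count (assumed)
story = "Python is popular because Python code is easy to read. Many data scientists learn Python first."

# the word to search for
pattern = "Python"

# use "findall" to search for matching patterns in the data string
words = re.findall(pattern, story)

# len() returns the number of matches found
print(len(words))   # prints 3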
In Python, "len" is short for "length" and is used to determine the number of items in a collection. When applied to a string, it returns the total count of characters, including spaces and punctuation marks; when applied to a list such as the one returned by findall(), it returns the number of elements in the list.
Slicing a string refers to extracting a portion or section of a string based on a specified range of indices. An
index refers to the position of a character in a string, starting from 0 for the first character. The range specifies
the start and end indices for the slice, and the resulting substring includes all characters within that range. For
example, the string "Data Science" can be sliced to extract "Data" by specifying the range from index 0 to
4, which includes the first four characters. Slicing can also be used to manipulate strings by replacing, deleting,
or inserting new content into specific positions within the string.
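For instance, a brief snippet illustrating these slicing operations:

text = "Data Science"

# characters at indices 0, 1, 2, 3 (the end index 4 is excluded)
print(text[0:4])    # Data

# omitting the start index begins the slice at index 0
print(text[:4])     # Data

# slicing from index 5 to the end of the string
print(text[5:])     # Science

# strings are immutable, so "inserting" builds a new string from slices
new_text = text[:5] + "and " + text[5:]
print(new_text)     # Data and Science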
Parsing and extracting data involves the analysis of a given dataset or string to extract specific pieces of
information. This is accomplished using various techniques and functions, such as splitting and slicing strings,
which allow for the structured retrieval of data. This process is particularly valuable when working with large
and complex datasets, as it provides a more efficient means of locating desired data compared to traditional
search methods. Note that parsing and extracting data differs from the use of regular expressions, as regular
expressions serve as a specialized tool for pattern matching and text manipulation. In contrast, parsing and
data extraction offers a comprehensive approach to identifying and extracting specific data within a dataset.
Parsing and extracting data using Python involves using the programming language to locate and extract
specific information from a given text. This is achieved by utilizing the re library, which enables the use of
regular expressions to identify and retrieve data based on defined patterns. This process can be demonstrated
through an example of extracting data related to a person purchasing an iPhone at an Apple store.
The code in the following Python feature box uses regular expressions (regex) to match and extract specific
data from a string. The string is a paragraph containing information about a person purchasing a new phone
from the Apple store. The objective is to extract the product name, model, and price of the phone. First, the
code starts by importing the necessary library for using regular expressions. Then, the string data is defined as
a variable. Next, regex is used to search for specific patterns in the string. The first pattern searches for the
words "product: " and captures anything that comes after it until it reaches a comma. The result is then
stored in a variable named "product". Similarly, the second pattern looks for the words "model: " and
captures anything that comes after it until it reaches a comma. The result is saved in a variable named
"model". Finally, the third pattern searches for the words "price: " and captures any sequence of numbers
or symbols that follows it until the end of the string. The result is saved in a variable named "price". After all
the data is extracted, it is printed out to the screen, using concatenation to add appropriate labels before each
variable.
This application of Python code demonstrates the effective use of regex and the re library to parse and extract
specific data from a given string. By using this method, the desired information can be easily located and
retrieved for further analysis or use.
PYTHON CODE
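# A minimal sketch consistent with the description above; the exact sentence
# stored in "data" is an assumption for illustration.
import re

# sample paragraph describing the purchase
data = "Alex visited the Apple store and bought a new phone, product: iPhone, model: 15 Pro, price: $999"

# capture everything after "product: " up to the next comma
product = re.search(r"product: (.*?),", data).group(1)

# capture everything after "model: " up to the next comma
model = re.search(r"model: (.*?),", data).group(1)

# capture the numbers and symbols after "price: " up to the end of the string
price = re.search(r"price: (.*)$", data).group(1)

# print the extracted values with labels, using concatenation
print("product: " + product)
print("model: " + model)
print("price: " + price)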
The data cleaning process includes identifying and correcting any errors, inconsistencies, and missing values in a dataset and is essential for ensuring that the data is accurate, reliable, and usable for analysis or other purposes. Python is often utilized for this work due to its flexibility and ease of use, and it offers a wide range of tools and libraries specifically designed for data processing. Once the data is processed, it needs to be
stored for future use. (We will cover data storage in Data Cleaning and Preprocessing.) Python has several
libraries that allow for efficient storage and manipulation of data in the form of DataFrames.
One method of storing data with Python is to use the pandas library to create a DataFrame and then save it with the to_csv() function as a CSV (comma-separated values) file. This file can then be
easily opened and accessed for future analysis or visualization. For example, the code in the following Python
sidebar is a Python script that creates a dictionary with data about the presidents of the United States
(https://openstax.org/r/wikipedia), including their ordered number and state of birth. A Python dictionary is a data structure that stores data in key-value pairs, allowing for efficient retrieval of a value using its key. It then
uses the built-in CSV library (https://openstax.org/r/librarycsv) to create a CSV file and write the data to it. This
code is used to store the US presidents' data in a structured format for future use, analysis, or display.
PYTHON CODE
import csv
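# A minimal sketch of the rest of the script; only a few presidents are shown as
# sample entries, and the output file name is an assumption.

# dictionary mapping each president's name to (order, state of birth)
presidents = {
    "George Washington": (1, "Virginia"),
    "John Adams": (2, "Massachusetts"),
    "Thomas Jefferson": (3, "Virginia"),
    "Abraham Lincoln": (16, "Kentucky"),
}

# create the CSV file and write one row per president
with open("us_presidents.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["Number", "Name", "State of Birth"])  # header row
    for name, (number, state) in presidents.items():
        writer.writerow([number, name, state])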
Data cleaning and preprocessing is an important stage in any data science task. It refers to the process of
organizing and converting raw data into usable structures for further analysis. It involves removing irrelevant
or duplicate data, handling missing values, and correcting errors or inconsistencies. This ensures that the data
is accurate, comprehensive, and ready for analysis. Data cleaning and preprocessing typically involve the
following steps:
1. Data integration. Data integration refers to merging data from multiple sources into a single dataset.
2. Data cleaning. In this step, data is assessed for any errors or inconsistencies, and appropriate actions are
taken to correct them. This may include removing duplicate values, handling missing data, and correcting
formatting inconsistencies.
3. Data transformation. This step prepares the data for the next step by transforming the data into a
format that is suitable for further analysis. This may involve converting data types, scaling or normalizing
numerical data, or encoding categorical variables.
4. Data reduction. If the dataset contains a large number of columns or features, data reduction techniques
may be used to select only the most appropriate ones for analysis.
5. Data discretization. Data discretization involves grouping continuous data into categories or ranges,
which can help facilitate analysis.
6. Data sampling. In some cases, the data may be too large to analyze in its entirety. In such cases, a sample
of the data can be taken for analysis while still maintaining the overall characteristics of the original
dataset.
The goal of data cleaning and preprocessing is to guarantee that the data used for analysis is accurate,
consistent, and relevant. It helps to improve the quality of the results and increase the efficiency of the analysis
process. A well-prepared dataset can lead to more accurate insights and better decision-making.
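As a rough illustration of several of these steps, the following sketch applies cleaning, transformation,
discretization, and sampling to a small, assumed pandas DataFrame; the column names, values, and bin
boundaries are invented for the example.
PYTHON CODE
# illustrative preprocessing sketch; the data, column names, and bins are assumed
import numpy as np
import pandas as pd

# small, assumed dataset standing in for merged raw data (data integration)
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 32],
    "income": [48000, 61000, 75000, 52000, 61000],
    "city": ["Austin", "Boston", "Austin", "Denver", "Boston"],
})

# data cleaning: drop duplicate rows and fill missing ages with the median
df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].median())

# data transformation: scale income to the 0-1 range and encode the categorical column
df["income_scaled"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())
df = pd.get_dummies(df, columns=["city"])

# data discretization: group ages into labeled ranges
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 45, 100], labels=["under 30", "30-45", "over 45"])

# data sampling: keep a random subset of rows for analysis
sample = df.sample(frac=0.6, random_state=0)
print(sample)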
An outlier is a data point that differs significantly from other data points in a given dataset. This can be due to
human error, measurement error, or a true outlier value in the data. Outliers can skew statistical analysis and
bias results, which is why it is important to identify and handle them properly before analysis.
Missing data and outliers are common problems that can affect the accuracy and reliability of results. It is
important to identify and handle these issues properly to ensure the integrity of the data and the validity of
the analysis. You will find more details about outliers in Measures of Center, but here we summarize the
measures typically used to handle missing data and outliers in a data science project:
1. Identify the missing data and outliers. The first stage is to identify which data points are missing or
appear to be outliers. This can be done through visualization techniques, such as scatterplots, box plots,
or histograms, or through statistical methods, such as calculating the mean, median, standard deviation,
or IQR (interquartile range) (see Measures of Center and Measures of Variation as well as Encoding
Univariate Data).
It is important to distinguish between different types of missing data. MCAR (missing completely at
random) data is missing data not related to any other variables, with no underlying cause for its absence.
Consider data collection with a survey asking about driving habits. One of the demographic questions asks
for income level. Some respondents accidentally skip this question, and so there is missing data for
income, but this is not related to the variables being collected related to driving habits.
MAR (missing at random) data is missing data whose absence is related to other observed variables but
not to the unobserved values themselves. As an example, during data collection, a survey is sent to respondents and the
survey asks about loneliness. One of the questions asks about memory retention. Some older respondents
might skip this question since they may be unwilling to share this type of information. The likelihood of
missing data for loneliness factors is related to age (older respondents). Thus, the missing data is related
to an observed variable of age but not directly related to loneliness measurements.
MNAR (missing not at random) data refers to a situation in which the absence of data depends on the
unobserved data itself, that is, on the values that are missing. For example, during data collection, a survey is sent to
respondents and the survey asks about debt levels. One of the questions asks about outstanding debt that
the customers have such as credit card debt. Some respondents with high credit card debt are less likely
to respond to certain questions. Here the missing credit card information is related to the unobserved
debt levels.
2. Determine the reasons behind missing data and outliers. It is helpful to understand the reasons
behind the missing data and outliers. Some expected reasons for missing data include measurement
errors, human error, or data not being collected for a particular variable. Similarly, outliers can be caused
by incorrect data entry, measurement errors, or genuine extreme values in the data.
3. Determine how to solve missing data issues. Several approaches can be utilized to handle missing data.
One option is to remove the observations with missing values altogether, but this can lead to a loss of important information.
Other methods include imputation, where the absent values are replaced with estimated values based on
the remaining data or using predictive models to fill in the missing values.
4. Consider the influence of outliers. Outliers can greatly affect the results of the analysis, so it is important
to carefully consider their impact. One approach is to delete the outliers from the dataset, but this can also
lead to a loss of valuable information. Another option is to treat the outliers as a distinct group and
analyze them separately from the rest of the data.
5. Use robust statistical methods. When dealing with outliers, it is important to use statistical methods that
are not affected by extreme values. This includes using median instead of mean and using nonparametric
tests instead of parametric tests, as explained in Statistical Inference and Confidence Intervals.
6. Validate the results. After handling the missing data and outliers, it is important to validate the results to
ensure that they are robust and accurate. This can be done through various methods, such as cross-
validation or comparing the results to external data sources.
Handling missing data and outliers in a data science task requires careful consideration and appropriate
methods. It is important to understand the reasons behind these issues and to carefully document the process
to ensure the validity of the results.
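To make this workflow concrete, the short sketch below works on an assumed pandas Series of monthly
values: it flags outliers with the IQR rule, fills a missing entry with the median, and reports a robust summary
with the outlier set aside. The numbers are invented for illustration.
PYTHON CODE
# illustrative sketch; the values below are assumed for the example
import numpy as np
import pandas as pd

values = pd.Series([4974, 4933, 4989, np.nan, 9500, 5012, 4960])

# step 1: identify missing entries and IQR-based outliers
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
print("Missing entries:", values.isna().sum())
print("Outliers:", values[is_outlier].tolist())

# step 3: impute the missing value with the median, which is robust to the extreme value
filled = values.fillna(values.median())

# steps 4 and 5: set the outliers aside and summarize the rest with a robust statistic
print("Median without outliers:", filled[~is_outlier].median())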
EXAMPLE 2.6
Problem
Starting in 1939, the United States Bureau of Labor Statistics tracked employment on a monthly basis. The
number of employees in the construction field between 1939 and 2019 is presented in Figure 2.2.
a. Determine if there is any outlier in the dataset that deviates significantly from the overall trend.
b. In the event that the outlier is not a reflection of real employment numbers, how would you handle the
outliers?
Solution
a. Based on the visual evidence, it appears that an outlier is present in the employment data for March 28,
1990. The employment level during this month shows a significant jump from approximately 5,400 to
9,500, exceeding the overall maximum employment level recorded between 1939 and 2019.
b. A possible approach to addressing this outlier is to replace it with an imputed value, calculated by taking
the mean of the points before and after the outlier. This method can help to improve the smoothness and
realism of the employment curve as well as mitigate any potential impacts the outlier may have on
statistical analysis or modeling processes. (See Table 2.3.)
Date        Number of Employees × 1000
1/1/1990    4974
2/1/1990    4933
3/1/1990    4989