The resulting plot with the additional lines of code has a narrower range of values along the x- and y-axes.
EXAMPLE 1.15
Problem
Using the iris dataset, draw a scatterplot of petal length versus petal width for the setosa iris. Set the title, x-axis label, and y-axis label properly as well.
Solution
PYTHON CODE
# draw a scatterplot
plt.scatter(setosa["petal_length"], setosa["petal_width"])
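# The problem also asks for a title and axis labels. Assuming the setosa DataFrame
# was created earlier in the chapter by filtering the iris data by species, the
# following lines (the label text is an assumption) complete the plot.
plt.title("Setosa: Petal Length vs. Petal Width")
plt.xlabel("petal_length")
plt.ylabel("petal_width")
plt.show()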
Datasets
Note: The primary datasets referenced in the chapter code may also be downloaded here
(https://openstax.org/r/spreadsheetsd1).
Key Terms
attribute characteristic or feature that defines an item in a dataset
categorical data data that are represented in different forms and do not indicate measurable quantities
cell a block or rectangle on a table that is specified with a combination of a row number and a column
number
comma-separated values (CSV) format of a dataset in which each item takes up a single line and its values
are separated by commas (“,”)
continuous data data whose value is chosen from an infinite set of numbers
data anything that we can analyze to compile some high-level insights
data analysis the process of examining and interpreting raw data to uncover patterns, discover meaningful
insights, and make informed decisions
data collection the systematic process of gathering information on variables of interest
data preparation (data processing) the second step within the data science cycle; converts the collected
data into an optimal form for analysis
data reporting the presentation of data in a way that will best convey the information learned from data
analysis
data science a field of study that investigates how to collect, manage, and analyze data in order to retrieve
meaningful information from some seemingly arbitrary data
data science cycle a process used when investigating data
data visualization the graphical representation of data to point out the patterns and trends involving the
use of visual elements such as charts, graphs, and maps
data warehousing the process of storing and managing large volumes of data from various sources in a
central location for easier access and analysis by businesses
DataFrame a data type that Pandas uses to store multi-column tabular data
dataset a collection of related and organized information or data points grouped together for reference or
analysis
discrete data data whose values are limited to a specific precision, such as whole numbers
Excel a spreadsheet program with a graphical user interface developed by Microsoft to help with the
manipulation and analysis of data
Extensible Markup Language (XML) format of a dataset that uses tags
Google Colaboratory (Colab) software for editing and running Jupyter Notebook files
Google Sheets a spreadsheet program with a graphical user interface developed by Google to help with the
manipulation and analysis of data
information some high-level insights that are compiled from data
Internet of Things (IoT) the network of multiple objects interacting with each other through the Internet
item an element that makes up a dataset; also referred to as an entry and an instance
JavaScript Object Notation (JSON) format of a dataset that follows the syntax of the JavaScript
programming language
Jupyter Notebook a web-based document that helps users run Python programs more interactively
nominal data data whose values do not include any ordering notion
numeric data data that are represented in numbers and indicate measurable quantities
ordinal data data whose values include an ordering notion
Pandas a Python library specialized for data manipulation and analysis
predictive analytics statistical techniques, algorithms, and machine learning that analyze historical data and
make predictions about future events, an approach often used in medicine to offer more accurate diagnosis
and treatment
programming language a formal language that consists of a set of instructions or commands used to
communicate with a computer and instruct it to perform specific tasks
Python a programming language that has extensive libraries and is commonly used for data analysis
qualitative data non-numerical data that generally describe subjective attributes or characteristics and are
analyzed using methods such as thematic analysis or content analysis
quantitative data data that can be measured by specific quantities and amounts and are often analyzed
using statistical methods
R an open-source programming language that is specifically designed for statistical computing and graphics
recommendation system a system that makes data-driven, personalized suggestions for users
sabermetrics a statistical approach to sports team management
sports analytics use of data and business analytics in sports
spreadsheet program a software application consisting of electronic worksheets with rows and columns
where data can be entered, manipulated, and calculated
structured data dataset whose individual items have the same structure
unstructured data dataset whose individual items have different structures
XML tag any block of text that consists of a pair of angle brackets (< >) with some text inside
Group Project
Project A: Data Source Quality
As a student of, or a new professional working in, data science, you will not always be collecting new primary
data. It’s just as important to be able to locate, critically evaluate, and properly clean existing sources of
secondary data. (Collecting and Preparing Data will cover the topic of data collection and cleaning in more
detail.)
Using the suggested sources or similar-quality sources that you research on the Internet, find two to three
datasets about the field or industry in which you intend to work. (You might also try to determine whether
similar data sets are available at the national, state/province, and local/city levels.) In a group, formulate a
specific, typical policy issue or business decision that managers in these organizations might make. For the
datasets you found, compare and contrast their size, collection methods, types of data, update frequency and
recency, and relevance to the decision question you have identified.
Review the following resources on concerns related to data collection:
a. Privacy concerns related to data collection (See the Protecting Personal Privacy (https://openstax.org/r/
gaog) website of the U.S. Government Accountability Office.)
b. Ethics concerns related to data collection, including fair use of copyrighted materials (See the U.S.
Copyright Office guidelines (https://openstax.org/r/fairuse).)
c. Bias concerns related to data collection (See the National Cancer Institute (NCI) article
(https://openstax.org/r/bias) on data bias.)
Suppose that you are part of a data science team working for an organization on data collection for a major
project or product. Discuss as a team how the issues of privacy, ethics, and equity (avoiding bias) could be
addressed, depending on your position in the organization and the type of project or product.
Chapter Review
1. Select the incorrect step and goal pair of the data science cycle.
a. Data collection: collect the data so that you have something for analysis.
b. Data preparation: have the collected data stored in a server as is so that you can start the analysis.
c. Data analysis: analyze the prepared data to retrieve some meaningful insights.
d. Data reporting: present the data in an effective way so that you can highlight the insights found from
the analysis.
2. Which of the following best describes the evolution of data management in the data science process?
a. Initially, data was stored locally on individual computers, but with the advent of cloud-based systems,
data is now stored on designated servers outside of local storage.
b. Data management has remained static over time, with most data scientists continuing to store and
process data locally on individual computers.
c. The need for data management arose as a result of structured data becoming unmanageable, leading
to the development of cloud-based systems for data storage.
d. Data management systems have primarily focused on analysis rather than processing, resulting in the
development of modern data warehousing solutions.
3. Which of the following best exemplifies the interdisciplinary nature of data science in various fields?
a. A historian traveling to Italy to study ancient manuscripts to uncover historical insights about the
Roman Empire
b. A mathematician solving complex equations to model physical phenomena
c. A biologist analyzing a large dataset of genetic sequences to gain insights about the genetic basis of
diseases
d. A chemist synthesizing new compounds in a laboratory
Critical Thinking
1. For each dataset (https://openstax.org/r/spreadsheetsd1), list the attributes.
a. Spotify dataset
b. CancerDoc dataset
2. For each dataset (https://openstax.org/r/spreadsheetsd1), define the type of the data based on the following
criteria and explain why:
• Numeric vs. categorical
• If it is numeric, continuous vs. discrete; if it is categorical, nominal vs. ordinal
3. For each dataset (https://openstax.org/r/spreadsheetsd1), identify the type of the dataset—structured vs.
unstructured. Explain why.
a. Spotify dataset
b. CancerDoc dataset
5. Open the WikiHow dataset (ch1-wikiHow.json (https://openstax.org/r/filed)) and list the attributes of the
dataset.
6. Draw a scatterplot between bpm (x-axis) and danceability (y-axis) of the Spotify dataset
(https://openstax.org/r/filed) using:
a. Python Matplotlib
b. A spreadsheet program such as MS Excel or Google Sheets (Hint: Search “Scatterplot” on Help.)
7. Regenerate the scatterplot of the Spotify dataset (https://openstax.org/r/filed), but with a custom title and
x-/y-axis label. The title should be “BPM vs. Danceability.” The x-axis label should be titled “bpm” and range
from the minimum to the maximum bpm value. The y-axis label should be titled “danceability” and range
from the minimum to the maximum Danceability value.
a. Python Matplotlib (Hint: DataFrame.min() and DataFrame.max() methods return min and max
values of the DataFrame. You can call these methods upon a specific column of a DataFrame as well.
For example, if a DataFrame is named df and has a column named “col1”, df[“col1”].min() will
return the minimum value of the “col1” column of df.)
b. A spreadsheet program such as MS Excel or Google Sheets (Hint: Calculate the minimum and
maximum value of each column somewhere else first, then simply use the value when editing the
scatterplot.)
8. Based on the Spotify dataset (https://openstax.org/r/spreadsheet4), filter the following using Python
Pandas:
a. Tracks whose artist is Taylor Swift
b. Tracks that were sung by Taylor Swift and released earlier than 2020
Quantitative Problems
1. Based on the Spotify dataset (https://openstax.org/r/spreadsheet4), calculate the average bpm of the
songs released in 2023 using:
a. Python Pandas
b. A spreadsheet program such as MS Excel or Google Sheets (Hint: The formula AVERAGE() computes the
average across the cells specified in the parentheses. For example, within Excel, typing in the
command “=AVERAGE(A1:A10)” in any empty cell will calculate the numeric average for the contents of
cells A1 through A10. Search “AVERAGE function” on Help as well.)
Figure 2.1 Periodic population surveys such as censuses help governments plan resources to supply public services. (credit:
modification of work “Census 2010 @ La Fuente” by Jenn Turner/Flickr, CC BY 2.0)
Chapter Outline
2.1 Overview of Data Collection Methods
2.2 Survey Design and Implementation
2.3 Web Scraping and Social Media Data Collection
2.4 Data Cleaning and Preprocessing
2.5 Handling Large Datasets
Introduction
Data collection and preparation are the first steps in the data science cycle. They involve systematically
gathering the necessary data to meet a project's objectives and ensuring its readiness for further analysis.
Well-executed data collection and preparation serve as a solid foundation for effective, data-driven decision-
making and aid in detecting patterns, trends, and insights that can drive business growth and efficiency.
With today’s ever-increasing volume of data, a robust approach to data collection is crucial for ensuring
accurate and meaningful results. This process requires following a comprehensive and systematic
methodology designed to ensure the quality, reliability, and validity of data gathered for analysis. It involves
identifying and sourcing relevant data from diverse sources, including internal databases, external
repositories, websites, and user-generated information. And it requires meticulous planning and execution to
guarantee the accuracy, comprehensiveness, and reliability of the collected data.
Preparing, or “wrangling,” the collected data adequately prior to analysis is equally important. Preparation
involves scrubbing, organizing, and transforming the data into a format suitable for analysis. Data preparation
plays a pivotal role in detecting and resolving any inconsistencies or errors present in the data, thereby
enabling accurate analysis. The rapidly advancing technology and widespread use of the internet have added
complexity to the data collection and preparation processes. As a result, data analysts and organizations face
many challenges, such as identifying relevant data sources, managing large data volumes, identifying outliers
or erroneous data, and handling unstructured data. By mastering the art and science of collecting and
preparing data, organizations can leverage valuable insights to drive informed decision-making and achieve
business success.
Data collection refers to the systematic and well-organized process of gathering and accurately conveying
important information and aspects related to a specific phenomenon or event. This involves using statistical
tools and techniques to collect data, identify its attributes, and capture relevant contextual information. The
gathered data is crucial for making sound interpretations and gaining meaningful insights. Additionally, it is
important to take note of the environment and geographic location from where the data was obtained, as it
can significantly influence the decision-making process and overall conclusions drawn from the data.
Data collection can be carried out through various methods, depending on the nature of the research or
project and the type of data being collected. Some common methods for data collection include experiments,
surveys, observation, focus groups, interviews, and document analysis.
This chapter will focus on the use of surveys and experiments to collect data. Social scientists, marketing
specialists, and political analysts regularly use surveys to gather data on topics such as public opinion,
customer satisfaction, and demographic information. Pharmaceutical companies heavily rely on experimental
data from clinical trials to test the safety and efficacy of new drugs. This data is then used by their legal teams
to gain regulatory approval and bring drugs to market.
Before collecting data, it is essential for a data scientist to have a clear understanding of the project's
objectives, which involves identifying the research question or problem and defining the target population or
sample. If a survey or experiment is used, the design of the survey/experiment is also a critical step, requiring
careful consideration of the type of questions, response options, and overall structure. A survey may be
conducted online, via phone, or in person, while experimental research requires a controlled environment to
ensure data validity and reliability.
Types of Data
Observational and transactional data play important roles in data analysis and related decision-making, each
offering unique insights into different aspects of real-world phenomena and business operations.
Observational data, often used in qualitative research, is collected by systematically observing and recording
behavior without the active participation of the researcher. Transactional data refers to any type of
information related to transactions or interactions between individuals, businesses, or systems, and it is more
often used in quantitative research.
Many fields of study use observational data for their research. Table 2.1 summarizes some examples of fields
that rely on observational data, the type of data they collect, and the purpose of their data collection.
Transactional data is collected by directly recording transactions that occur in a particular setting, such as a
retail store or an online platform, which allows for accurate and detailed information on actual consumer
behavior. It can include financial data, but it also includes data related to customer purchases, website clicks,
user interactions, or any other type of activity that is recorded and tracked.
Transactional data can be used to understand patterns and trends, make predictions and recommendations,
and identify potential opportunities or areas for improvement. For example, the health care industry may
focus on transactional data related to patient interactions with health care providers and facilities, such as
appointments, treatments, and medications prescribed. The retail industry may use transactional data on
customer purchases and product returns, while the transportation industry may analyze data related to ticket
sales and passenger traffic.
While observational data provides detailed descriptions of behavior, transactional data provides numerical
data for statistical analysis. There are strengths and limitations with each of these, and the examples in this
chapter will make use of both types.
EXAMPLE 2.1
Problem
Ashley loves setting up a bird feeder in her backyard and watching the different types of birds that come to
feed. She has always been curious about the typical number of birds that visit her feeder each day and has
estimated the number based on the amount of food consumed. However, she has to visit her
grandmother's house for three days and is worried about leaving the birds without enough food. In order
to prepare the right amount of bird food for her absence, Ashley has decided to measure the total amount
of feed eaten each day to determine the total amount of food needed for her three-day absence. Which
method of data collection is best suited for Ashley's research on determining the total amount of food
required for her three-day absence—observational or transactional? Provide a step-by-step explanation of
the chosen method.
Solution
Ashley wants to ensure that there is enough food for her local birds while she is away for three days. To do
this, she will carefully observe the feeder daily for two consecutive weeks. She will record the total amount
of feed eaten each day and make sure to refill the feeder each morning before the observation. This will
provide a consistent amount of food available for the birds. After two weeks, Ashley will use the total
amount of food consumed and divide it by the number of days observed to estimate the required daily
food. Then, she will multiply the daily food by three to determine the total amount of bird food needed for
her three-day absence. By directly observing and recording the bird food, as well as collecting data for two
weeks, Ashley will gather accurate and reliable information. This will help her confidently prepare the
necessary amount of bird food for her feathered friends while she is away, thus ensuring that the birds are
well-fed and taken care of during her absence.
EXAMPLE 2.2
Problem
A group of data scientists working for a large hospital have been tasked with analyzing their transactional
data to identify areas for improvement. In the past year, the hospital has seen an increase in patient
complaints about long wait times for appointments and difficulties scheduling follow-up visits. Samantha is
one of the data scientists tasked with collecting data in order to analyze these issues.
a. What methodology should be employed by Samantha to collect pertinent data for analyzing the recent
surge in patient complaints regarding extended appointment wait times and difficulties in scheduling
follow-up visits at the hospital?
b. What strategies could be used to analyze the data?
Solution
1. Electronic Health Records (EHRs): Samantha can gather data from the hospital's electronic health
records system. This data may include patients' appointment schedules, visit durations, and wait times.
This information can help identify patterns and trends in appointment scheduling and wait times.
2. Appointment Booking System: Samantha can gather data from the hospital's appointment booking
system. This data can include appointment wait times, appointment types (e.g., primary care,
specialist), and scheduling difficulties (e.g., appointment availability, cancellations). This information
can help identify areas where the booking system may be causing delays or challenges for patients.
3. Hospital Call Center: Samantha can gather data from the hospital's call center, which is responsible for
booking appointments over the phone. This data can include call wait times, call duration, and reasons
for call escalations. This information can help identify areas for improvement in the call center's
processes and procedures.
4. Historical Data: Samantha can analyze historical data, such as appointment wait times and scheduling
patterns, to identify any changes that may have contributed to the recent increase in complaints. This
data can also be compared to current data to track progress and improvements in wait times and
scheduling.
Consider this example: Scientist Sally aimed to investigate the impact of sunlight on plant growth. The
research inquiry was to determine whether increased exposure to sunlight enhances the growth of plants.
Sally experimented with two groups of plants wherein one group received eight hours of sunlight per day,
while the other only received four hours. The height of each plant was measured and documented every week
for four consecutive weeks. The main research objective was to determine the growth rate of plants exposed
to eight hours of sunlight compared to those with only four hours. A total of 20 identical potted plants were
used, with one group allocated to the "sunlight" condition and the other to the "limited sunlight" condition.
Both groups were maintained under identical environmental conditions, including temperature, humidity, and
soil moisture. Adequate watering was provided to ensure equal hydration of all plants. The measurements of
plant height were obtained and accurately recorded every week. This approach allowed for the collection of
precise and reliable data on the impact of sunlight on plant growth, which can serve as a valuable resource for
further research and understanding of this relationship.
Surveys are a common strategy for gathering data in a wide range of domains, including market research,
social sciences, and education. Surveys collect information from a sample of individuals and often use
questionnaires to collect data. Sampling is the process of selecting a subset of a larger population to
represent and analyze information about that population.
Constructing good surveys is hard. A survey should begin with simple and easy-to-answer questions and
progress to more complex or sensitive ones. This can help build a rapport with the respondents and increase
their willingness to answer more difficult questions. Additionally, the researcher may consider mixing up the
response options for multiple-choice questions to avoid response bias. To ensure the quality of the data
collected, the survey questionnaire should undergo a pilot test with a small group of individuals from the
target population. This allows the researcher to identify any potential issues or confusion with the questions
and make necessary adjustments before administering the survey to the larger population.
Open-ended questions allow for more in-depth responses and provide the opportunity for unexpected
insights. They also allow respondents to elaborate on their thoughts and provide detailed and personal
responses. Closed-ended questions have predetermined answer choices and are effective in gathering
quantitative data. They are quick and easy to answer, and their clear and structured format allows for
quantifiable results.
An example of a biased survey question in a survey conducted by a shampoo company might be "Do you
prefer our brand of shampoo over cheaper alternatives?" This question is biased because it assumes that the
respondent prefers the company's brand over others. A more unbiased and accurate question would be "What
factors do you consider when choosing a shampoo brand?" This allows for a more detailed and accurate
response. The biased question could have led to inflated results in favor of the company's brand.
Sampling
The next step in the data collection process is to choose a participant sample to ideally represent the
restaurant's customer base. Sampling could be achieved by randomly selecting customers, using customer
databases, or targeting specific demographics, such as age or location.
Sampling is necessary in a wide range of data science projects to make data collection more manageable and
cost-effective while still drawing meaningful conclusions. A variety of techniques can be employed to
determine a subset of data from a larger population to perform research or construct hypotheses about the
entire population. The choice of a sampling technique depends upon the nature and features of the
population being studied as well as the objectives of the research. When using a survey, researchers must also
consider the tool(s) that will be used for distributing the survey, such as through email, social media, or
physically distributing questionnaires at the restaurant. It's crucial to make the survey easily accessible to the
chosen sample to achieve a higher response rate.
A number of sampling techniques and their advantages are described below; a short Python sketch after the list illustrates several of them in code. The most frequently used among these are simple random selection, stratified sampling, cluster sampling, and convenience sampling.
1. Simple random selection. Simple random selection is a statistical technique used to pick a representative
sample from a larger population. This process involves randomly choosing individuals or items from the
population, ensuring that each member of the population has an equal chance of being included in the sample. The first step in simple random selection is to define the population of interest and assign a unique identification number to each member. The selection itself can then be done using a random number
generator, a computer program designed to generate a sequence of random numbers, or a random
number table, which lists numbers in a random sequence. The primary benefit of this technique is its
ability to minimize bias and deliver a fair representation of the population.
In the health care field, simple random sampling is utilized to select patients for medical trials or surveys,
allowing for a diverse and unbiased sample (Elfil & Negida, 2017). Similarly, in finance, simple random
sampling can be applied to gather data on consumer behavior and guide decision-making in financial
institutions. In engineering, this technique is used to select random samples of materials or components
for quality control testing. In the political arena, simple random sampling is commonly used to select
randomly registered voters for polls or surveys, ensuring equal representation and minimizing bias in the
data collected.
2. Stratified sampling. Stratified sampling involves splitting the population into subgroups based on
specified factors, such as age, area, income, or education level, and taking a random sample from each
stratum in proportion to its size in the population. Stratified sampling allows for a more accurate
representation of the population as it ensures that all subgroups are adequately represented in the
sample. This can be especially useful when the variables being studied vary significantly between the
stratified groups.
3. Cluster sampling. With cluster sampling, the population is divided into natural groups or clusters, such as
schools, communities, or cities, with a random sample of these clusters picked and all members within the
chosen clusters included in the sample. Cluster sampling is helpful when surveying the entire population directly would be difficult or time-consuming, although it brings its own challenges, such as identifying clusters, sourcing a list of clusters, traveling to different clusters, and communicating with them. Additionally, data analysis and sample size
calculation may be more complex, and there is a risk of bias in the sample. However, cluster sampling can
be more cost-effective.
An example of cluster sampling would be a study on the effectiveness of a new educational program in a
state. The state is divided into clusters based on school districts. The researcher uses a random selection
process to choose a sample of school districts and then collects data from all the schools within those
districts. This method allows the researcher to obtain a representative sample of the state's student
population without having to visit each individual school, saving time and resources.
4. Convenience sampling. Convenience sampling applies to selecting people or items for the sample based
on their availability and convenience to the data science research. For example, a researcher may choose
to survey students in their classroom or collect data from readily available social media users. Convenience sampling is easy to carry out, and it is useful for exploratory studies. However, it may not provide a
representative sample as it is prone to selection bias in that individuals who are more readily available or
willing to participate may be overrepresented.
An example of convenience sampling would be conducting a survey about a new grocery store in a busy
shopping mall. A researcher stands in front of the store and approaches people who are coming out of the
store to ask them about their shopping experience. The researcher only includes responses from those
who agreed to participate, resulting in a sample that is convenient but may not be representative of the
entire population of shoppers in the mall.
5. Systematic sampling. Systematic sampling is based on starting at a random location in the dataset and
then selecting every nth member from a population to be contained in the sample. This process is
straightforward to implement, and it provides a representative sample when the population is randomly
distributed. However, if there is a pattern in the sampling frame (the organizing structure that represents
the population from which a sample is drawn), it may lead to a biased sample.
Suppose a researcher wants to study the dietary habits of students in a high school. The researcher has a
list of all the students enrolled in the school, which is approximately 1,000 students. Instead of randomly
selecting a sample of students, the researcher decides to use systematic sampling. The researcher first
assigns a number to each student, going from 1 to 1,000. Then, the researcher randomly selects a number
from 1 to 10—let's say they select 4. This number will be the starting point for selecting the sample of
students. The researcher will then select every 10th student from the list starting at 4, which means every student with a number ending in 4 (4, 14, 24, 34, etc.) will be included in the sample. This way, the researcher will have a
representative sample of 100 students from the high school, which is 10% of the population. The sample
will consist of students from different grades, genders, and backgrounds, making it a diverse and
representative sample.
6. Purposive sampling. With purposive sampling, one or more specific criteria are used to select
participants who are likely to provide the most relevant and useful information for the research study. This
can involve selecting participants based on their expertise, characteristics, experiences, or behaviors that
are relevant to the research question.
For example, if a researcher is conducting a study on the effects of exercise on mental health, they may
use purposive sampling to select participants who have a strong interest or experience in physical fitness
and have a history of mental health issues. This sampling technique allows the researcher to target a
specific population that is most relevant to the research question, making the findings more applicable
and generalizable to that particular group. The main advantage of purposive sampling is that it can save
time and resources by focusing on individuals who are most likely to provide valuable insights and
information. However, researchers need to be transparent about their sampling strategy and potential
biases that may arise from purposely selecting certain individuals.
7. Snowball sampling. Snowball sampling is typically used in situations where it is difficult to access a
particular population; it relies on the assumption that people with similar characteristics or experiences
tend to associate with each other and can provide valuable referrals. This type of sampling can be useful in
studying hard-to-reach or sensitive populations, but it may also be biased and limit the generalizability of
findings.
8. Quota sampling. Quota sampling is a non-probability sampling technique in which experimenters select
participants based on predetermined quotas to guarantee that a certain number or percentage of the
population of interest is represented in the sample. These quotas are based on specific demographic
characteristics, such as age, gender, ethnicity, and occupation, which are believed to have a direct or
indirect relationship with the research topic. Quota sampling is generally used in market research and
opinion polls, as it allows for a fast and cost-effective way to gather data from a diverse range of
individuals. However, it is important to note that the results of quota sampling may not accurately
represent the entire population, as the sample is not randomly selected and may be biased toward certain
characteristics. Therefore, the findings from studies using quota sampling should be interpreted with
caution.
9. Volunteer sampling. In volunteer sampling, participants are not picked at random by the researcher but instead volunteer themselves to be a part of the study. This type of sampling is
commonly used in studies that involve recruiting participants from a specific population, such as a specific
community or organization. It is also often used in studies where convenience and accessibility are
important factors, as participants may be more likely to volunteer if the study is easily accessible to them.
Volunteer sampling is not considered a random or representative sampling technique, as the participants
may not accurately represent the larger population. Therefore, the results obtained from volunteer
sampling may not be generalizable to the entire population.
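The following minimal pandas sketch illustrates three of the techniques above on a toy dataset; the DataFrame, its column names, and the sample sizes are assumptions for illustration, not part of the chapter's own examples.

import pandas as pd

# toy population: 1,000 customers with an age_group column (assumed data)
df = pd.DataFrame({
    "customer_id": range(1, 1001),
    "age_group": ["18-34", "35-54", "55+"] * 333 + ["18-34"],
})

# 1. simple random sampling: every row has an equal chance of selection
simple_random = df.sample(n=100, random_state=1)

# 2. stratified sampling: draw 10% from each age_group stratum
stratified = df.groupby("age_group").sample(frac=0.10, random_state=1)

# 3. systematic sampling: random starting position, then every 10th row
start = 4                      # chosen at random between 0 and 9
systematic = df.iloc[start::10]

print(len(simple_random), len(stratified), len(systematic))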
Sampling Error
Sampling error is the difference between the results obtained from a sample and the true value of the
population parameter it is intended to represent. It is caused by chance and is inherent in any sampling
method. The goal of researchers is to minimize sampling errors and increase the accuracy of the results. To
avoid sampling error, researchers can increase sample size, use probability sampling methods, control for
extraneous variables, use multiple modes of data collection, and pay careful attention to question formulation.
Sampling Bias
Sampling bias occurs when the sample used in a study isn’t representative of the population it intends to
generalize to, leading to skewed or inaccurate conclusions. This bias can take many forms, such as selection
bias, where certain groups are systematically over- or underrepresented, or volunteer bias, where only a
specific subset of the population participates. Researchers use the sampling techniques summarized earlier to
avoid sampling bias and ensure that each member of the population has an equal chance of being included in
the sample. Additionally, careful consideration of the sampling frame should ideally encompass all members of
the target population and provide a clear and accessible way to identify and select individuals or units for
inclusion in the sample. Sampling bias can occur at various stages of the sampling process, and it can greatly
impact the accuracy and validity of research findings.
Measurement Error
Measurement errors are inaccuracies or discrepancies that surface during the process of collecting,
recording, or analyzing data. They may occur due to human error, environmental factors, or inherent
inconsistencies in the phenomena being studied. Random error, which arises unpredictably, can affect the
precision of measurements, and systematic error may consistently bias measurements in a particular
direction. In data analysis, addressing measurement error is crucial for ensuring the reliability and validity of
results. Techniques for mitigating measurement error include improving data collection methods, calibrating
instruments, conducting validation studies, and employing statistical methods like error modeling or
sensitivity analysis to account for and minimize the impact of measurement inaccuracies on the analysis
outcomes.
Types of sampling error that could occur in this study include the following:
1. Sampling bias. One potential source of bias in this study is self-selection bias. As the participants are all
college students, they may not be representative of the larger population, as college students tend to
have more access and motivation to exercise compared to the general population. This could limit the
generalizability of the study's findings. In addition, if the researchers only recruit participants from one
university, there may be under-coverage bias. This means that certain groups of individuals, such as
nonstudents or students from other universities, may be excluded from the study, potentially leading to
biased results.
2. Measurement error. Measurement errors could occur, particularly if the researchers are measuring the
participants' exercise and mental health outcomes through self-report measures. Participants may not
accurately report their exercise habits or mental health symptoms, leading to inaccurate data.
3. Non-response bias. Some participants in the study may choose not to participate or may drop out before
the study is completed. This could introduce non-response bias, as those who choose not to participate or
drop out may differ from those who remain in the study in terms of their exercise habits or mental health
outcomes.
4. Sampling variability. The sample of 100 participants is a relatively small subset of the larger population.
As a result, there may be sampling variability, meaning that the characteristics and outcomes of the
participants may differ from those of the larger population simply due to chance.
5. Sampling error in random assignment. In this study, the researchers randomly assign participants to
either the exercise group or the control group. However, there is always a possibility of sampling error in
the random assignment process, meaning that the groups may not be perfectly balanced in terms of their
exercise habits or other characteristics.
These types of sampling errors can affect the accuracy and generalizability of the study's findings.
Researchers need to be aware of these potential errors and take steps to minimize them when designing
and conducting their studies.
EXAMPLE 2.3
Problem
Mark is a data scientist who works for a marketing research company. He has been tasked to lead a study to
understand consumer behavior toward a new product that is about to be launched in the market. As a data scientist, he knows the importance of using the right sampling technique to collect accurate and reliable
data. Mark divided the population into different groups based on factors such as age, education, and
income. This ensures that he gets a representative sample from each group, providing a more accurate
understanding of consumer behavior. What is the name of the sampling technique used by Mark to ensure
a representative sample from different groups of consumers for his study on consumer behavior toward a
new product?
Solution
The sampling technique used by Mark is called stratified sampling. This involves dividing the population
into subgroups or strata based on certain characteristics and then randomly selecting participants from
each subgroup. This ensures that each subgroup is represented in the sample, providing a more accurate
representation of the entire population. This type of sampling is often used in market research studies to
get a more comprehensive understanding of consumer behavior and preferences. By using stratified
sampling, Mark can make more reliable conclusions and recommendations for the new product launch
based on the data he collects.
Web scraping and social media data collection are two approaches used to gather data from the internet. Web
scraping involves pulling information and data from websites using a web data extraction tool, often known
as a web scraper. One example would be a travel company looking to gather information about hotel prices
and availability from different booking websites. Web scraping can be used to automatically gather this data
from the various websites and create a comprehensive list for the company to use in its business strategy
without the need for manual work.
Social media data collection involves gathering information from various platforms like Twitter and Instagram
using application programming interfaces (APIs) or monitoring tools. An application programming interface (API) is a set of protocols, tools, and definitions for building software applications. APIs allow different software systems to communicate and interact with each other, enabling developers to access data and services from other applications, operating systems, or platforms. Both web scraping and social media data collection require
determining the data to be collected and analyzing it for accuracy and relevance.
Web Scraping
There are several techniques and approaches for scraping data from websites. See Table 2.2 for some of the
common techniques used. (Note: The techniques used for web scraping will vary depending on the website
and the type of data being collected. It may require a combination of different techniques to effectively scrape
data from a website.)
Regular Expressions
• Search for and extract specific patterns of text from a web page
• Useful for scraping data that follows a particular format, such as dates, phone numbers, or email addresses
HTML Parsing
• Analyzes the HTML (HyperText Markup Language) structure of a web page and identifies the specific tags and elements that contain the desired data
• Often used for simple scraping tasks
Application Programming Interfaces (APIs)
• Authorize developers to access and retrieve data instantly without the need for web scraping
• Often a more efficient and reliable method for data collection
XML API Subset
• XML (Extensible Markup Language) is another markup language used for exchanging data
• This method works similarly to using the HTML API subset by making HTTP requests to the website's API endpoints and then parsing the data received in XML format
JSON API Subset
• JSON (JavaScript Object Notation) is a lightweight data interchange format that is commonly used for sending and receiving data between servers and web applications
• Many websites provide APIs in the form of JSON, making it another efficient method for scraping data
Table 2.2 Techniques and Approaches for Scraping Data from Websites
An example of social media data collection is conducting a Twitter survey on customer satisfaction for a food
delivery company. Data scientists can use Twitter's API to collect tweets containing specific hashtags related to
the company and analyze them to understand customers' opinions and preferences. They can also use social
listening to monitor conversations and identify trends in customer behavior. Additionally, creating a social
media survey on Twitter can provide more targeted insights into customer satisfaction and preferences. This
data can then be analyzed using data science techniques to identify key areas for improvement and drive
informed business decisions.
To scrape data such as a table from a website using Python, we follow these steps:
1. Import the pandas library. The first step is to import the pandas library (https://openstax.org/r/pandas),
which is a popular Python library for data analysis and manipulation.
import pandas as pd
2. Read the tables from the web page. The pandas read_html() function downloads the page and parses any HTML tables it finds, returning a list of DataFrame objects.
df = pd.read_html("https://...")
3. Access the desired data. If the data on the web page is divided into different tables, we need to specify
which table we want to extract. We have used indexing to access the desired table (for example: index 4)
from the list of DataFrame objects returned by the read_html() function. The index here represents the
table order in the web page.
4. Store the data in a DataFrame. The result of the read_html() function is a list of DataFrame objects,
and each DataFrame represents a table from the web page. We can store the desired data in a DataFrame
variable for further analysis and manipulation.
5. Display the DataFrame. By accessing the DataFrame variable, we can see the extracted data in a tabular
format.
6. Convert strings to numbers. As noted in Chapter 1, a string is a data type used to represent a sequence
of characters, such as letters, numbers, and symbols that are enclosed by matching single (') or double (")
quotes. If the data in the table is in string format and we want to perform any numerical operations on it,
we need to convert the data to numerical format. We can use the to_numeric() function from pandas to
convert strings to numbers and then store the result in a new column in the DataFrame.
df['column_name'] = pd.to_numeric(df['column_name'])
This will create a new column in the DataFrame with the converted numerical values, which can then be used
for further analysis or visualization.
In computer programming, indexing usually starts from 0. This is because most programming languages use 0
as the initial index for arrays, matrices, or other data structures. This convention has been adopted to simplify
the implementation of some algorithms and to make it easier for programmers to access and manipulate
data. Additionally, it aligns with the way computers store and access data in memory. In the context of parsing
tables from HTML pages, using 0 as the initial index allows programmers to easily access and manipulate data
from different tables on the same web page, which enables efficient data processing and analysis.
EXAMPLE 2.4
Problem
Extract data table "Current Population Survey: Household Data: (Table A-13). Employed and unemployed
persons by occupation, Not seasonally adjusted" from the FRED (Federal Reserve Economic Data)
(https://openstax.org/r/fred) website in the link (https://fred.stlouisfed.org/release/
tables?rid=50&eid=3149#snid=4498 (https://openstax.org/r/stlouisfed)) using Python code. The data in this
table provides a representation of the overall employment and unemployment situation in the United
States. The table is organized into two main sections: employed persons and unemployed persons.
Solution
PYTHON CODE
# Import pandas
import pandas as pd
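# The remaining steps are sketched below; this assumes the release page can be
# parsed by read_html() (an HTML parser such as lxml must be installed and the
# page must allow automated requests) and that the desired table is the first
# one returned -- the index may need adjusting.

# Read every HTML table on the FRED release page into a list of DataFrames
url = "https://fred.stlouisfed.org/release/tables?rid=50&eid=3149#snid=4498"
tables = pd.read_html(url)

# Access the desired table by its position on the page (index is an assumption)
df = tables[0]

# Display the first few rows of the extracted table
print(df.head())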
In Python, there are several libraries and methods that can be used for parsing and extracting data from text, including regular expressions (the re module) and built-in string operations such as splitting and slicing.
Overall, the library or method used for parsing and extracting data will depend on the specific task and type of
data being analyzed. It is important to research and determine the best approach for a given project.
A regular expression defines a pattern of characters against which text can be searched and matched. Typical applications include data parsing, input validation, and extracting
targeted information from larger text sources. Common use cases in Python involve recognizing various types
of data, such as dates, email addresses, phone numbers, and URLs, within extensive text files. Moreover,
regular expressions are valuable for tasks like data cleaning and text processing. Despite their versatility,
regular expressions can be elaborate, allowing for advanced search patterns that use meta-characters like *, ?, and +. However, working with these expressions can present challenges, as they require a thorough understanding and careful debugging to ensure successful implementation. The short snippet after the following list illustrates these meta-characters in action.
• The * character is known as the “star” or “asterisk” and is used to match zero or more occurrences of
the preceding character or group in a regular expression. For example, the regular expression "a*" would match zero or more "a"s in a row, such as "" (an empty match), "a", "aa", "aaa", etc.
• The ? character is known as the "question mark" and is used to indicate that the preceding character or
group is optional. It matches either zero or one occurrences of the preceding character or group. For
example, the regular expression "a?b" would match either "ab" or "b".
• The + character is known as the "plus sign" and is used to match one or more occurrences of the
preceding character or group. For example, the regular expression "a+b" would match one or more
"a"s followed by a "b", such as "ab", "aab", "aaab", etc. If there are no "a"s, the match will fail. This
is different from the * character, which would match zero or more "a"s followed by a "b", allowing for
a possible match without any "a"s.
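A short snippet (with assumed sample strings) can be used to verify these meta-characters with Python's re module:

import re

# fullmatch() returns a match object only if the whole string fits the pattern
print(bool(re.fullmatch("a*", "")))       # True: * allows zero "a"s
print(bool(re.fullmatch("a*", "aaa")))    # True: * allows many "a"s
print(bool(re.fullmatch("a?b", "b")))     # True: ? makes the "a" optional
print(bool(re.fullmatch("a?b", "ab")))    # True
print(bool(re.fullmatch("a+b", "b")))     # False: + requires at least one "a"
print(bool(re.fullmatch("a+b", "aaab")))  # True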
EXAMPLE 2.5
Problem
Write Python code using regular expressions to search for a selected word “Python” in a given string and
print the number of times it appears.
Solution
PYTHON CODE
# use the "findall" to search for matching patterns in the data string
words = re.findall(pattern, story)
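Putting the pieces together, a complete, runnable version of this solution might look like the following; the sample string stored in story is an assumption for illustration.

import re

# sample text containing the word we want to count (assumed)
story = "Python is popular because Python code is easy to read. Many data scientists learn Python first."

# the word to search for
pattern = "Python"

# use "findall" to search for matching patterns in the data string
words = re.findall(pattern, story)

# len() returns the number of matches found
print(len(words))   # prints 3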
In Python, "len" is short for "length" and is used to determine the number of items in a collection. When applied to a string, it returns the total count of characters, including spaces and punctuation marks; when applied to a list such as the one returned by findall(), it returns the number of elements in the list.
Slicing a string refers to extracting a portion or section of a string based on a specified range of indices. An
index refers to the position of a character in a string, starting from 0 for the first character. The range specifies
the start and end indices for the slice, and the resulting substring includes all characters within that range. For
example, the string "Data Science" can be sliced to extract "Data" by specifying the range from index 0 to
4, which includes the first four characters. Slicing can also be used to manipulate strings by replacing, deleting,
or inserting new content into specific positions within the string.
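For instance, a brief snippet illustrating these slicing operations:

text = "Data Science"

# characters at indices 0, 1, 2, 3 (the end index 4 is excluded)
print(text[0:4])    # Data

# omitting the start index begins the slice at index 0
print(text[:4])     # Data

# slicing from index 5 to the end of the string
print(text[5:])     # Science

# strings are immutable, so "inserting" builds a new string from slices
new_text = text[:5] + "and " + text[5:]
print(new_text)     # Data and Science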
Parsing and extracting data involves the analysis of a given dataset or string to extract specific pieces of
information. This is accomplished using various techniques and functions, such as splitting and slicing strings,
which allow for the structured retrieval of data. This process is particularly valuable when working with large
and complex datasets, as it provides a more efficient means of locating desired data compared to traditional
search methods. Note that parsing and extracting data differs from the use of regular expressions, as regular
expressions serve as a specialized tool for pattern matching and text manipulation. In contrast, parsing and
data extraction offers a comprehensive approach to identifying and extracting specific data within a dataset.
Parsing and extracting data using Python involves using the programming language to locate and extract
specific information from a given text. This is achieved by utilizing the re library, which enables the use of
regular expressions to identify and retrieve data based on defined patterns. This process can be demonstrated
through an example of extracting data related to a person purchasing an iPhone at an Apple store.
The code in the following Python feature box uses regular expressions (regex) to match and extract specific
data from a string. The string is a paragraph containing information about a person purchasing a new phone
from the Apple store. The objective is to extract the product name, model, and price of the phone. First, the
code starts by importing the necessary library for using regular expressions. Then, the string data is defined as
a variable. Next, regex is used to search for specific patterns in the string. The first pattern searches for the
words "product: " and captures anything that comes after it until it reaches a comma. The result is then
stored in a variable named "product". Similarly, the second pattern looks for the words "model: " and
captures anything that comes after it until it reaches a comma. The result is saved in a variable named
"model". Finally, the third pattern searches for the words "price: " and captures any sequence of numbers
or symbols that follows it until the end of the string. The result is saved in a variable named "price". After all
the data is extracted, it is printed out to the screen, using concatenation to add appropriate labels before each
variable.
This application of Python code demonstrates the effective use of regex and the re library to parse and extract
specific data from a given string. By using this method, the desired information can be easily located and
retrieved for further analysis or use.
PYTHON CODE
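# A minimal sketch consistent with the description above; the exact sentence
# stored in "data" is an assumption for illustration.
import re

# sample paragraph describing the purchase
data = "Alex visited the Apple store and bought a new phone, product: iPhone, model: 15 Pro, price: $999"

# capture everything after "product: " up to the next comma
product = re.search(r"product: (.*?),", data).group(1)

# capture everything after "model: " up to the next comma
model = re.search(r"model: (.*?),", data).group(1)

# capture the numbers and symbols after "price: " up to the end of the string
price = re.search(r"price: (.*)$", data).group(1)

# print the extracted values with labels, using concatenation
print("product: " + product)
print("model: " + model)
print("price: " + price)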
The data cleaning process includes identifying and correcting any errors, inconsistencies, and missing values in a dataset and is essential for ensuring that the data is accurate, reliable, and usable for analysis or other purposes. Python is often utilized for this work due to its flexibility and ease of use, and it offers a wide range of tools and libraries specifically designed for data processing. Once the data is processed, it needs to be
stored for future use. (We will cover data storage in Data Cleaning and Preprocessing.) Python has several
libraries that allow for efficient storage and manipulation of data in the form of DataFrames.
One method of storing data with Python is to use the pandas library to create a DataFrame and then save it with the to_csv() function as a CSV (comma-separated values) file. This file can then be
easily opened and accessed for future analysis or visualization. For example, the code in the following Python
sidebar is a Python script that creates a dictionary with data about the presidents of the United States
(https://openstax.org/r/wikipedia), including their ordered number and state of birth. A Python dictionary is a data structure that stores data in key-value pairs, allowing for efficient retrieval of a value using its key. It then
uses the built-in CSV library (https://openstax.org/r/librarycsv) to create a CSV file and write the data to it. This
code is used to store the US presidents' data in a structured format for future use, analysis, or display.
PYTHON CODE
import csv
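# A minimal sketch of the rest of the script; only a few presidents are shown as
# sample entries, and the output file name is an assumption.

# dictionary mapping each president's name to (order, state of birth)
presidents = {
    "George Washington": (1, "Virginia"),
    "John Adams": (2, "Massachusetts"),
    "Thomas Jefferson": (3, "Virginia"),
    "Abraham Lincoln": (16, "Kentucky"),
}

# create the CSV file and write one row per president
with open("us_presidents.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["Number", "Name", "State of Birth"])  # header row
    for name, (number, state) in presidents.items():
        writer.writerow([number, name, state])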
Data cleaning and preprocessing is an important stage in any data science task. It refers to the process of
organizing and converting raw data into usable structures for further analysis. It involves removing irrelevant
or duplicate data, handling missing values, and correcting errors or inconsistencies. This ensures that the data
is accurate, comprehensive, and ready for analysis. Data cleaning and preprocessing typically involve the
following steps:
1. Data integration. Data integration refers to merging data from multiple sources into a single dataset.
2. Data cleaning. In this step, data is assessed for any errors or inconsistencies, and appropriate actions are
taken to correct them. This may include removing duplicate values, handling missing data, and correcting
formatting inconsistencies.
3. Data transformation. This step prepares the data for the next step by transforming the data into a
format that is suitable for further analysis. This may involve converting data types, scaling or normalizing
numerical data, or encoding categorical variables.
4. Data reduction. If the dataset contains a large number of columns or features, data reduction techniques
may be used to select only the most appropriate ones for analysis.
5. Data discretization. Data discretization involves grouping continuous data into categories or ranges,
which can help facilitate analysis.
6. Data sampling. In some cases, the data may be too large to analyze in its entirety. In such cases, a sample
of the data can be taken for analysis while still maintaining the overall characteristics of the original
dataset.
The goal of data cleaning and preprocessing is to guarantee that the data used for analysis is accurate,
consistent, and relevant. It helps to improve the quality of the results and increase the efficiency of the analysis
process. A well-prepared dataset can lead to more accurate insights and better decision-making.
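As a rough illustration of several of these steps, the following sketch applies cleaning, transformation,
discretization, and sampling to a small, assumed pandas DataFrame; the column names, values, and bin
boundaries are invented for the example.
PYTHON CODE
# illustrative preprocessing sketch; the data, column names, and bins are assumed
import numpy as np
import pandas as pd

# small, assumed dataset standing in for merged raw data (data integration)
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 32],
    "income": [48000, 61000, 75000, 52000, 61000],
    "city": ["Austin", "Boston", "Austin", "Denver", "Boston"],
})

# data cleaning: drop duplicate rows and fill missing ages with the median
df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].median())

# data transformation: scale income to the 0-1 range and encode the categorical column
df["income_scaled"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())
df = pd.get_dummies(df, columns=["city"])

# data discretization: group ages into labeled ranges
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 45, 100], labels=["under 30", "30-45", "over 45"])

# data sampling: keep a random subset of rows for analysis
sample = df.sample(frac=0.6, random_state=0)
print(sample)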
An outlier is a data point that differs significantly from other data points in a given dataset. This can be due to
human error, measurement error, or a true outlier value in the data. Outliers can skew statistical analysis and
bias results, which is why it is important to identify and handle them properly before analysis.
Missing data and outliers are common problems that can affect the accuracy and reliability of results. It is
important to identify and handle these issues properly to ensure the integrity of the data and the validity of
the analysis. You will find more details about outliers in Measures of Center, but here we summarize the
measures typically used to handle missing data and outliers in a data science project:
1. Identify the missing data and outliers. The first stage is to identify which data points are missing or
appear to be outliers. This can be done through visualization techniques, such as scatterplots, box plots,
or histograms, or through statistical methods, such as calculating the mean, median, standard deviation,
or IQR (interquartile range) (see Measures of Center and Measures of Variation as well as Encoding
Univariate Data).
It is important to distinguish between different types of missing data. MCAR (missing completely at
random) data is missing data not related to any other variables, with no underlying cause for its absence.
Consider data collection with a survey asking about driving habits. One of the demographic questions asks
for income level. Some respondents accidentally skip this question, and so there is missing data for
income, but this is not related to the variables being collected related to driving habits.
MAR (missing at random) data is missing data whose absence is related to other observed variables but
not to the unobserved values themselves. As an example, during data collection, a survey is sent to respondents and the
survey asks about loneliness. One of the questions asks about memory retention. Some older respondents
might skip this question since they may be unwilling to share this type of information. The likelihood of
missing data for loneliness factors is related to age (older respondents). Thus, the missing data is related
to an observed variable of age but not directly related to loneliness measurements.
MNAR (missing not at random) data refers to a situation in which the absence of data depends on the
unobserved data itself, that is, on the values that are missing. For example, during data collection, a survey is sent to
respondents and the survey asks about debt levels. One of the questions asks about outstanding debt that
the customers have such as credit card debt. Some respondents with high credit card debt are less likely
to respond to certain questions. Here the missing credit card information is related to the unobserved
debt levels.
2. Determine the reasons behind missing data and outliers. It is helpful to understand the reasons
behind the missing data and outliers. Some expected reasons for missing data include measurement
errors, human error, or data not being collected for a particular variable. Similarly, outliers can be caused
by incorrect data entry, measurement errors, or genuine extreme values in the data.
3. Determine how to solve missing data issues. Several approaches can be utilized to handle missing data.
One option is to remove the observations with missing values altogether, but this can lead to a loss of important information.
Other methods include imputation, where the absent values are replaced with estimated values based on
the remaining data or using predictive models to fill in the missing values.
4. Consider the influence of outliers. Outliers can greatly affect the results of the analysis, so it is important
to carefully consider their impact. One approach is to delete the outliers from the dataset, but this can also
lead to a loss of valuable information. Another option is to treat the outliers as a distinct group and
analyze them separately from the rest of the data.
5. Use robust statistical methods. When dealing with outliers, it is important to use statistical methods that
are not affected by extreme values. This includes using median instead of mean and using nonparametric
tests instead of parametric tests, as explained in Statistical Inference and Confidence Intervals.
6. Validate the results. After handling the missing data and outliers, it is important to validate the results to
ensure that they are robust and accurate. This can be done through various methods, such as cross-
validation or comparing the results to external data sources.
Handling missing data and outliers in a data science task requires careful consideration and appropriate
methods. It is important to understand the reasons behind these issues and to carefully document the process
to ensure the validity of the results.
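To make this workflow concrete, the short sketch below works on an assumed pandas Series of monthly
values: it flags outliers with the IQR rule, fills a missing entry with the median, and reports a robust summary
with the outlier set aside. The numbers are invented for illustration.
PYTHON CODE
# illustrative sketch; the values below are assumed for the example
import numpy as np
import pandas as pd

values = pd.Series([4974, 4933, 4989, np.nan, 9500, 5012, 4960])

# step 1: identify missing entries and IQR-based outliers
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
print("Missing entries:", values.isna().sum())
print("Outliers:", values[is_outlier].tolist())

# step 3: impute the missing value with the median, which is robust to the extreme value
filled = values.fillna(values.median())

# steps 4 and 5: set the outliers aside and summarize the rest with a robust statistic
print("Median without outliers:", filled[~is_outlier].median())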
EXAMPLE 2.6
Problem
Starting in 1939, the United States Bureau of Labor Statistics tracked employment on a monthly basis. The
number of employees in the construction field between 1939 and 2019 is presented in Figure 2.2.
a. Determine if there is any outlier in the dataset that deviates significantly from the overall trend.
b. In the event that the outlier is not a reflection of real employment numbers, how would you handle the
outliers?
Solution
a. Based on the visual evidence, it appears that an outlier is present in the employment data for March 28,
1990. The employment level during this month shows a significant jump from approximately 5,400 to
9,500, exceeding the overall maximum employment level recorded between 1939 and 2019.
b. A possible approach to addressing this outlier is to replace it with an imputed value, calculated by taking
the mean of the points before and after the outlier. This method can help to improve the smoothness and
realism of the employment curve as well as mitigate any potential impacts the outlier may have on
statistical analysis or modeling processes. (See Table 2.3.)
Date        Number of Employees × 1000
1/1/1990    4974
2/1/1990    4933
3/1/1990    4989