module 2-3 fuba midterms
BIG DATA - is a term used to describe the massive volumes of digital data generated, collected, and processed.
- Big data describes data that is either moving too quickly, is simply too large, or is too complex to be stored, processed, or analyzed with traditional data storage and analytics applications.
- Some examples of big data include data generated by postings to social media accounts, such as Facebook and Twitter, and the ratings given to products on e-commerce sites like the Amazon marketplace.

Size is only one of the characteristics that define big data. Other criteria include the speed at which data is generated and the variety of data collected and stored.

BIG DATA CHARACTERISTICS

Big data's characteristics change how data is collected, transmitted, stored, and accessed. The four Vs of big data describe these characteristics and the challenges they create for data infrastructure engineers.

Volume (Scale of data)

Volume describes the amount of data transported and stored. According to International Data Corporation (IDC) experts, discovering ways to process the increasing amounts of data generated each day is a challenge. They predict data volume will increase at a compound annual growth rate of 23% over the next five years. While traditional data storage systems can, in theory, handle large amounts of data, they struggle to keep up with the high-volume demands of big data.
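As a rough illustration of that growth rate (a back-of-the-envelope calculation, not a figure from the notes): compounding 23% a year for five years multiplies today's volume by about 1.23^5 ≈ 2.8, so data volume would nearly triple over the period.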
A credit card processing company processes over 18 million transactions a day and reports the account numbers and purchase information to the card issuers. All of the transactions must be stored until confirmation is received from the card issuers and balance information is updated.
Volume

Variety (Forms of data)

Variety describes the many forms data can take, most of which are rarely in a ready state for processing and analysis. A significant contributor to big data is unstructured data, such as video, images, and text documents, which are estimated to represent 80 to 90% of the world's data. These formats are too complex for traditional data warehouse storage architectures. The unstructured data that makes up a significant portion of big data does not fit into the rows and columns of traditional relational data storage systems.

A movie studio is gathering feedback on a new movie that premiered in a traditional theater and on a streaming service during the same week. The feedback is collected through ratings and reviews, comments on social media, and magazine articles.
Variety

Velocity (Speed of data)

Velocity describes the rate at which data is generated. For example, the data generated by a billion shares sold on the New York Stock Exchange cannot simply be stored for later analysis; it must be analyzed and reported immediately. The data infrastructure must instantly respond to the demands of applications accessing and streaming the data. Big data scales instantaneously, and research often needs to occur in real time.

A manufacturer is installing one hundred new sensors to check for product defects during production. The sensors will take twenty to thirty readings a second, then must analyze the data immediately to determine if an issue with the equipment or process is causing a defect.
Velocity

Veracity (Uncertainty of Data)

Veracity is the process of preventing inaccurate data from spoiling your data sets. For example, when people sign up for an online account, they often use false contact information. Much of this inaccurate information must be "scrubbed" - cleaned or removed from the data set - before the data is used in analysis. Increased veracity in the collection of data can reduce the amount of data cleaning that is required.

An online retailer analyzes data from customer ratings and reviews. The retailer is concerned that people may be more likely to provide reviews if they have a bad experience than if they have a good experience with a product.
Veracity
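In practice, this "scrubbing" is usually done in code. Below is a minimal sketch, assuming the sign-up records sit in a pandas DataFrame; the column names and validity checks are illustrative assumptions, not from the notes.

import pandas as pd

# Hypothetical sign-up records; column names and values are made up for illustration.
signups = pd.DataFrame({
    "email": ["ana@example.com", "asdf", None, "lee@example.com"],
    "age":   [34, 250, 28, 41],
})

# "Scrubbing": drop rows with missing or obviously false information
# before the data set is used in analysis.
cleaned = signups.dropna(subset=["email"])
cleaned = cleaned[cleaned["email"].str.contains("@")]   # crude contact-info check
cleaned = cleaned[cleaned["age"].between(13, 120)]      # remove implausible ages

print(cleaned)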
The Potential Benefits of Data Growth

There are many drivers of this data growth, but the most predominant include:
- the proliferation of Internet of Things (IoT) devices,
- increased internet access and greater access to broadband,
- the use of smartphones, and
- the popularity of social media.

This data pool enables applications to take advantage of trends and comparisons uncovered through analytics to take action and make recommendations and reliable predictions.
HEALTH

Robotics, smart medical devices, integrated software systems, and virtual collaboration platforms are changing how patient care is delivered. Many of these data-driven technologies simplify the lives of patients, doctors, and healthcare administrators by performing tasks that humans typically do. Computers can detect cancers with remarkable accuracy using the data available from millions of medical tests. These systems, in turn, create more data to be analyzed and used to improve care.

RETAIL

Retailers increasingly depend on the data generated by digital technologies to improve their bottom line. Cisco's Connected Mobile Experiences (CMX) allows retailers to provide consumers with highly personalized content while gaining visibility into their behavior in the store.

EDUCATION

In education, instructors can use data to identify areas where students struggle or thrive, understand the individual needs of students, and develop strategies for personalized learning. Virtual schools give students access to textbooks, content, and assistance designed and customized to meet the students' requirements.

DATA PIPELINES

Using all of this data to achieve these potential benefits requires managing the data. Data engineers are the professionals who engage in this management. This process includes developing infrastructure and systems to ingest the data, clean and transform it, and finally store it in ways that make it easy for the rest of the people in their organization to access and query the data to answer business questions.

What is a data pipeline?

The best way to understand what data engineers do with data is to think of a data pipeline. You can picture it almost like water flowing through pipes: a simplified representation of the flow has three phases - ingestion, transformation, and storage.
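As a minimal sketch of the idea (the function names and records are illustrative assumptions, not part of the course material), the three phases can be chained like this:

# Minimal sketch of the three pipeline phases; names and data are illustrative only.
def ingest():
    # e.g. pull raw records from logs, an API, or a message queue
    return [{"product": "kettle", "sold_on": "2024-01-03"},
            {"product": "toaster", "sold_on": None}]

def transform(records):
    # clean the raw records so they are ready for analysis
    return [r for r in records if r["product"] and r["sold_on"]]

def store(records, destination):
    # persist the cleaned records where analysts can query them
    destination.extend(records)

warehouse = []                        # stand-in for a database or data warehouse
store(transform(ingest()), warehouse)
print(warehouse)                      # only the complete record survives the pipeline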
INGESTION

With batch ingestion, a system ingests related data on a daily or weekly basis. It doesn't need to access and analyze data immediately after a support ticket is closed or a subscription is renewed. An example of streaming ingestion is when you request a ride from a ride-share service. The company combines streams of data (e.g. historical data, real-time traffic data, and location tracking) to make sure you get a ride from the driver who is closest to you at the time.
TRANSFORMATION

After housing ingested data in temporary storage, we're ready to go, right? Well, not quite. Data nearly always needs to be transformed to be useful for later analyses. There are two main issues to deal with here. One, data often needs to be cleaned up: values can be missing, dates can be in the wrong format, and data quickly gets outdated - you might have gathered data on individuals who have since changed roles or companies.

The other major issue involves transforming your data so that its structure supports the analyses you need to run. For example, you might want to figure out your company's best-selling products every month, but the data may only contain each product's sale date. You would need to transform the data by creating, for example, a number-of-sales-per-month variable.
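A small sketch of that transformation, assuming the sales records live in a pandas DataFrame with hypothetical product and sale_date columns:

import pandas as pd

# Hypothetical raw sales records: one row per sale, only product and sale date.
sales = pd.DataFrame({
    "product":   ["kettle", "toaster", "kettle", "kettle", "toaster"],
    "sale_date": pd.to_datetime(["2024-01-03", "2024-01-05", "2024-01-20",
                                 "2024-02-02", "2024-02-11"]),
})

# Derive a sales-per-month variable, then pick the best seller for each month.
sales["month"] = sales["sale_date"].dt.to_period("M")
monthly_counts = sales.groupby(["month", "product"]).size().reset_index(name="units_sold")
best_sellers = monthly_counts.loc[monthly_counts.groupby("month")["units_sold"].idxmax()]
print(best_sellers)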
STORAGE

After transforming data, it needs to be stored in places and forms that make it easy for analysts to run reports on weekly sales and for data scientists to create predictive recommendation models. Data security also matters: managing data access so that the people who should be accessing the data can do so efficiently, while keeping out the people who shouldn't.

There are two primary locations for businesses to store their data: on-premises or in the cloud. Often, companies use a hybrid of both.

The term "on-premises" refers to hardware on an organization's own servers and infrastructure, usually physically on site. In the past, on-premises storage was the only option available for storing data. The organization would deploy more servers as storage needs increased. Over time, organizations had entire rooms or data centers with servers hosting the databases that stored all the data. This model had significant direct costs for the hardware and server licenses, and indirect costs for power, cooling, and off-site backup services. The company must also keep IT staff on hand to maintain and manage the servers.
SUPERVISED
Supervised machine learning algorithms are the most commonly used for predictive analytics. Supervised machine learning requires human interaction to label data so that it is ready for accurate supervised learning. In supervised learning, the model is taught by example using input and output data sets processed by human experts, usually data scientists. The model learns the relationships between input and output data and then uses that information to formulate predictions based on new datasets. For example, a classification model can learn to identify plants after being trained on a dataset of properly labeled images with the plant species and other identifying characteristics.

Supervised machine learning methods commonly solve regression and classification problems:

Regression problems involve estimating the mathematical relationship(s) between a continuous variable and one or more other variables. This mathematical relationship can then compute the values of one unknown variable given the known values of the others. Examples of problems that use regression include estimating a car's position and speed using GPS, predicting the trajectory of a tornado using weather data, or predicting the future value of a stock using historical and other data.

Classification problems involve a discrete unknown variable. Typically, the issue is estimating which of a set of pre-defined classes a specific sample belongs to. Examples of classification are filtering email into spam or non-spam, diagnosing pathologies from medical tests, or identifying faces in a picture.
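A compact sketch of both problem types using scikit-learn; the toy data and model choices here are assumptions for illustration, not from the course:

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Regression: estimate a continuous value from known inputs.
hours_studied = [[1], [2], [3], [4], [5]]
exam_scores   = [52, 58, 65, 70, 77]
reg = LinearRegression().fit(hours_studied, exam_scores)
print(reg.predict([[6]]))          # predicted score for 6 hours of study

# Classification: assign a sample to one of a set of pre-defined classes.
features = [[0, 1], [1, 1], [0, 0], [1, 0]]   # e.g. [contains_link, unknown_sender]
labels   = ["spam", "spam", "not spam", "not spam"]
clf = DecisionTreeClassifier().fit(features, labels)
print(clf.predict([[1, 1]]))       # predicted class for a new email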
REINFORCEMENT

... positive values to desired outcomes and negative values to undesired effects. The result is optimal solutions; the system learns to avoid adverse outcomes and seek the positive ones. Practical applications of reinforcement learning include building artificial intelligence for playing video games, and robotics and industrial automation.

The Machine Learning Process

Developing a machine learning solution is seldom a linear process. Several trial-and-error steps are necessary to fine-tune the solution. The details of each step performed by the Data Crunchers data scientists as they work on the new weed identification and eradication model are as follows (a short code sketch of these steps appears after the list):

Step 1. Data preparation - Perform data cleaning procedures such as transformation into a structured format and removing missing data and noisy/corrupted observations.

Step 2a. Learning data - Create a learning data set used to train the model.

Step 2b. Testing data - Create a test dataset used to evaluate the model's performance. Only perform this step in the case of supervised learning.

Step 3. Learning Process Loop - Selection. An algorithm is chosen based on the problem. Depending on the selected algorithm, additional pre-processing steps might be necessary.
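A minimal sketch of Steps 1-3 using pandas and scikit-learn; the weed-image details from the course are replaced here by a made-up tabular dataset, so the column names and model choice are assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Step 1. Data preparation: structure the data and drop missing/corrupted rows.
raw = pd.DataFrame({
    "leaf_width":  [1.2, 0.8, None, 2.5, 2.2, 0.9],
    "leaf_length": [3.1, 2.2, 2.0, 6.0, 5.5, 2.1],
    "is_weed":     [0, 0, 0, 1, 1, 0],
})
data = raw.dropna()

# Step 2a/2b. Split into a learning (training) set and a testing set.
X = data[["leaf_width", "leaf_length"]]
y = data["is_weed"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# Step 3. Selection: choose an algorithm suited to the problem, then train it.
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))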
AI AND ML SUMMARY