module 2-3 fuba midterms

Big data refers to the vast amounts of digital data generated, characterized by its volume, variety, velocity, and veracity, which complicate traditional data processing methods. Data engineers manage this data through a pipeline consisting of ingestion, transformation, and storage to facilitate analysis. The growth of big data is driven by factors such as IoT devices, increased internet access, and social media, enabling various applications across industries like healthcare, retail, and agriculture.


Defining BIG DATA

Big data is a term used to describe the massive volumes of digital data generated, collected, and processed.

- Big data describes data that is either moving too quickly, is simply too large, or is too complex to be stored, processed, or analyzed with traditional data storage and analytics applications.
- Some examples of big data include data generated by postings to social media accounts, such as Facebook and Twitter, and the ratings given to products on e-commerce sites like the Amazon marketplace.

Size is only one of the characteristics that define big data. Other criteria include the speed at which data is generated and the variety of data collected and stored.
BIG DATA CHARACTERISTICS

Big data's characteristics change how data is collected, transmitted, stored, and accessed. The four Vs of big data, and the challenges they create for data infrastructure engineers, are described below.
Volume (Scale of data)

Volume describes the amount of data transported and stored. According to International Data Corporation (IDC) experts, discovering ways to process the increasing amounts of data generated each day is a challenge. They predict data volume will increase at a compound annual growth rate of 23% over the next five years. While traditional data storage systems can, in theory, handle large amounts of data, they are struggling to keep up with the high-volume demands of big data.

Example: A credit card processing company processes over 18 million transactions a day and reports the account numbers and purchase information to the card issuers. All of the transactions must be stored until confirmation is received from the card issuers and balance information is updated. (Volume)
Variety (Forms of data)

Variety describes the many forms data can take, most of which are rarely in a ready state for processing and analysis. A significant contributor to big data is unstructured data, such as video, images, and text documents, which is estimated to represent 80 to 90% of the world's data. These formats are too complex for traditional data warehouse storage architectures, and the unstructured data that makes up a significant portion of big data does not fit into the rows and columns of traditional relational data storage systems.

Example: A movie studio is gathering feedback on a new movie that premiered in a traditional theater and on a streaming service during the same week. The feedback is collected through ratings and reviews, comments on social media, and magazine articles. (Variety)
Velocity (Analysis of data flow)

Velocity describes the rate at which data is generated. For example, the data generated by a billion shares sold on the New York Stock Exchange cannot just be stored for later analysis; it must be analyzed and reported immediately. The data infrastructure must instantly respond to the demands of applications accessing and streaming the data. Big data scales instantaneously, and analysis often needs to occur in real time.

Example: A manufacturer is installing one hundred new sensors to check for product defects during production. The sensors will take twenty to thirty readings a second, and the data must be analyzed immediately to determine if an issue with the equipment or process is causing a defect. (Velocity)

Veracity (Uncertainty of data)

Veracity is the process of preventing inaccurate data from spoiling your data sets. For example, when people sign up for an online account, they often use false contact information. Much of this inaccurate information must be "scrubbed", that is, cleaned or removed from the data set before it is used in analysis. Increased veracity in the collection of data can reduce the amount of data cleaning that is required.

Example: An online retailer analyzes data from customer ratings and reviews. The retailer is concerned that people may be more likely to provide reviews if they have a bad experience with a product than if they have a good one. (Veracity)
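The "scrubbing" step described under Veracity can be illustrated with a short sketch. This is a minimal, hypothetical example; the record fields and the validation rule are assumptions for illustration, not from the source:

```python
import re

def scrub(records):
    """Remove records with clearly inaccurate contact information."""
    email_ok = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
    clean = []
    for rec in records:
        # Drop records whose email fails a basic format check
        if email_ok.match(rec.get("email", "")):
            clean.append(rec)
    return clean

signups = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Fake", "email": "not-an-email"},   # false contact info
]
print(scrub(signups))  # only Ada's record survives
```

Real scrubbing pipelines apply many such rules (format checks, deduplication, cross-referencing), but the principle is the same: filter out records that cannot be trusted before analysis.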
The Potential Benefits of Data Growth

There are many drivers of this data growth, but the most predominant include:

- the proliferation of Internet of Things (IoT) devices,
- increased internet access and greater access to broadband,
- the use of smartphones, and
- the popularity of social media.

This data pool enables applications to take advantage of trends and comparisons uncovered through analytics to take action and make recommendations and reliable predictions.

HEALTH

Robotics, smart medical devices, integrated software systems, and virtual collaboration platforms are changing how patient care is delivered. Many of these data-driven technologies simplify the lives of patients, doctors, and healthcare administrators by performing tasks that humans typically do. Computers can detect cancers with remarkable accuracy using the data available from millions of medical tests. These systems, in turn, create more data to be analyzed and used to improve care.
RETAIL

Retailers increasingly depend on the data generated by digital technologies to improve their bottom line. Cisco's Connected Mobile Experiences (CMX) allows retailers to provide consumers with highly personalized content while gaining visibility into their behavior in the store.

EDUCATION

In education, instructors can use data to identify areas where students struggle or thrive, understand the individual needs of students, and develop strategies for personalized learning. Virtual schools give students access to textbooks, content, and assistance designed and customized to meet the students' requirements.
DATA PIPELINES

Using all of this data to achieve these potential benefits requires managing the data. Data engineers are the professionals who engage in this management. This process includes developing infrastructure and systems to ingest the data, clean and transform it, and finally store it in ways that make it easy for the rest of the people in their organization to access and query the data to answer business questions.

What is a data pipeline?

To understand what data engineers do with data, it helps to think of a data pipeline: data flows through it almost like water flowing through pipes. A data pipeline is a simplified representation of data flowing through three phases: ingestion, transformation, and storage.
INGESTION

Data engineers will want to ingest two primary sources of data: batches of data from servers or databases (batch ingestion) and real-time events streaming in from devices out in the world (streaming ingestion).

An example of batch ingestion is a game company that wants to examine the relationship between subscription renewals and customer support tickets. It could ingest all the related data on a daily or weekly basis; it doesn't need to access and analyze data immediately after a support ticket is closed or a subscription is renewed. An example of streaming ingestion is when you request a ride from a ride share service. The company combines streams of data (e.g., historical data, real-time traffic data, and location tracking) to make sure you get a ride from the driver who is closest to you at the time.
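The batch versus streaming distinction described under INGESTION can be sketched in a few lines. The data and function names here are illustrative assumptions, not part of any real service's API:

```python
import queue

def ingest_batch(records):
    """Batch ingestion: pull a whole day's records at once."""
    return list(records)

def ingest_stream(event_queue, handler):
    """Streaming ingestion: react to each event as it arrives."""
    while not event_queue.empty():
        handler(event_queue.get())

# Batch: a game company loads yesterday's support tickets in one go.
tickets = ingest_batch([{"id": 1}, {"id": 2}])

# Streaming: a ride-share service handles location updates immediately.
events = queue.Queue()
events.put({"driver": "A", "lat": 14.6})
seen = []
ingest_stream(events, seen.append)
print(len(tickets), len(seen))  # 2 1
```

The design difference is when the data is processed: batch ingestion waits and processes in bulk, while streaming ingestion invokes a handler the moment each event arrives.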
TRANSFORMATION

After housing ingested data in temporary storage, we're ready to go, right? Well, not quite. Data nearly always needs to be transformed to be useful for later analyses. There are two main issues to deal with here. One, data often needs to be cleaned up: values can be missing, dates can be in the wrong format, and data quickly gets outdated (you might have gathered data on individuals who have since changed roles or companies).

The other major issue involves transforming your data so that its structure aligns with the system needed to allow accurate analyses. For example, you might want to figure out your company's best-selling products every month, but the data may only contain each product's sale date. You would need to transform the data by creating, for example, a number-of-sales-per-month variable.
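The best-selling-products example under TRANSFORMATION can be sketched directly: given raw rows that only carry a sale date, derive a sales-per-month variable. The row layout is an assumption for illustration:

```python
from collections import Counter

sales = [
    {"product": "widget", "sale_date": "2024-03-05"},
    {"product": "widget", "sale_date": "2024-03-17"},
    {"product": "gadget", "sale_date": "2024-04-02"},
]

def sales_per_month(rows):
    """Transform raw sale dates into a (product, month) -> count table."""
    counts = Counter()
    for row in rows:
        month = row["sale_date"][:7]          # keep only "YYYY-MM"
        counts[(row["product"], month)] += 1
    return counts

print(sales_per_month(sales))
```

Once the derived variable exists, "best-selling product per month" is a simple lookup instead of a scan over raw dates.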
STORAGE

After transformation, data needs to be stored in places and forms that make it easy for analysts to run reports on weekly sales and for data scientists to create predictive recommendation models. Storage also involves data security: managing access so that people who should be accessing the data can do so efficiently, while keeping out people who shouldn't.

There are two primary locations for businesses to store their data: on-premises or in the cloud. Often, companies use a hybrid of both.

The term "on-premises" refers to hardware on an organization's servers and infrastructure, usually physically on site. In the past, on-premises storage was the only option available for storing data. The organization would deploy more servers as storage needs increased. Over time, organizations had entire rooms or data centers with servers hosting the databases that stored all the data. This model had significant direct costs for the hardware and licenses for the servers, and indirect costs for power, cooling, and off-site backup services. The company must also keep IT staff on hand to maintain and manage the servers.

Today, however, businesses are increasingly moving their data storage to the cloud. Cloud storage sounds mysterious, but it just means storing data on servers maintained by providers such as Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP), and Alibaba Cloud. The cloud service provider purchases, installs, and maintains all hardware, software, and supporting infrastructure in its data centers. Using cloud services, an organization avoids the enormous costs of building and supporting the infrastructure necessary to store the vast amounts of data they collect. Instead, the cloud service provider charges a "pay-as-you-go" (monthly) subscription fee.

SUMMARY

Big data is a term used to describe the massive volumes of digital data generated, collected, and processed. The characteristics associated with big data change how data is collected, transmitted, stored, and accessed. The four Vs of big data are volume, variety, velocity, and veracity.

Drivers of the increased data growth are the increase of IoT devices, internet access including broadband, the use of smartphones, and the popularity of social media.

Data engineers manage data through a data pipeline. The data pipeline has three stages: ingestion, transformation, and storage. These stages precede any analysis that is to be performed. Data engineers work with two primary sources of data: batch ingestion and streaming ingestion. Transformation filters out bad data and converts data types, if necessary, to allow for accurate calculations. Data is then stored in an on-premises infrastructure, cloud storage, or both.
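The three pipeline stages (ingestion, transformation, storage) can be sketched end to end. Every name below is illustrative; the "storage" list stands in for a real database or cloud bucket:

```python
def ingest():
    # Ingestion: raw records arrive, some with problems
    return [{"user": "a", "age": "34"}, {"user": "b", "age": None}]

def transform(rows):
    # Transformation: drop bad rows and convert data types
    return [{**r, "age": int(r["age"])} for r in rows if r["age"] is not None]

storage = []  # Storage: stand-in for a database or cloud bucket

def store(rows):
    storage.extend(rows)

store(transform(ingest()))
print(storage)  # [{'user': 'a', 'age': 34}]
```

Each stage only depends on the output of the previous one, which is why the water-through-pipes analogy fits.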
AI FACT FICTION

There are many applications of artificial intelligence (AI) that we see and use every day.
ENTERTAINMENT

Classification algorithms can help viewers find videos they will like. Based on a customer's profile, their video rental behavior, and the video rental behavior of other customers with similar demographics, the algorithm predicts which videos a customer is likely to enjoy and makes recommendations to the customer.
AGRICULTURE

Farmers use cell phones to provide researchers with images of plant diseases. These images are used in image recognition systems to diagnose plant diseases. Combined with environmental data, regression algorithms predict future disease outbreaks.
MEDICINE

Researchers have developed a machine learning model that uses probability to classify breast cancer by examining medical histopathology images. This approach may eventually be capable of detecting cancer subtypes and classifying benign and malignant tissue.
RETAIL

Artificial intelligence solutions in some retail stores improve customer engagement through interactive chat programs, or chatbots. Chatbots can be an effective way to communicate with customers. They can answer frequently asked questions, recommend products, address grievances, collect valuable customer data, and divert calls to a human telesales executive if needed. They can also be programmed to self-learn from past data to keep refining and personalizing subsequent customer interactions.
FITNESS

The fitness app on your smartphone, or your fitness tracker, collects data that is fed into an application that can provide you with valuable health information. Apps must build a model of your movements to identify what constitutes taking a step and the distance you cover with each one! Some fitness trackers are even using self-learning AI software that can recognize and adapt to a wide variety of movements and can learn new fitness activities based on repetitive, cyclical patterns.

BIG DATA AND MACHINE LEARNING

What is Machine Learning?

Machine learning has become increasingly popular among businesses and organizations across all industries as a way to improve efficiency and productivity. Use the internet to research machine learning use cases. Identify five use cases that span different industries or applications and use the space below to discuss your findings.
Types of Machine Learning Analysis

Machine learning encompasses different algorithms or models; some have broad applicability, while others are suited to specific applications. Machine learning is divided into three primary learning model approaches: supervised, unsupervised, and reinforcement. Each model differs in how it is trained; each has its strengths and is suited to different tasks or problems. When choosing a machine learning model to deploy, an organization needs to understand the available data and the problem to be solved.
SUPERVISED

Supervised machine learning algorithms are the most commonly used for predictive analytics. Supervised machine learning requires human interaction to label the data so that it is ready for accurate supervised learning. In supervised learning, the model is taught by example using input and output data sets processed by human experts, usually data scientists. The model learns the relationships between input and output data and then uses that information to formulate predictions based on new datasets. For example, a classification model can learn to identify plants after being trained on a dataset of properly labeled images with the plant species and other identifying characteristics.

Supervised machine learning methods commonly solve regression and classification problems:

Regression problems involve estimating the mathematical relationship(s) between a continuous variable and one or more other variables. This mathematical relationship can then compute the values of one unknown variable given the known values of the others. Examples of problems that use regression include estimating a car's position and speed using GPS, predicting the trajectory of a tornado using weather data, or predicting the future value of a stock using historical and other data.

Classification problems involve a discrete unknown variable. Typically, the issue is estimating which of a set of pre-defined classes a specific sample belongs to. Examples of classification are filtering email into spam or non-spam, diagnosing pathologies from medical tests, or identifying faces in a picture.
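The supervised workflow (human-labeled examples in, predictions out) can be sketched with a deliberately tiny nearest-neighbor classifier. The flower measurements below are invented for illustration, and real systems would use far richer features and models:

```python
def predict(train, x):
    """1-nearest-neighbor: label a new sample with the label of the
    closest labeled training example (supervised learning in miniature)."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    features, label = min(train, key=lambda pair: dist(pair[0], x))
    return label

# Human-labeled input/output pairs: (petal length, petal width) -> species
train = [((1.4, 0.2), "setosa"), ((4.7, 1.4), "versicolor"),
         ((6.0, 2.5), "virginica")]

print(predict(train, (1.5, 0.3)))  # setosa
print(predict(train, (5.8, 2.2)))  # virginica
```

The labeled pairs play the role of the "input and output data sets processed by human experts": the model never sees a rule for telling species apart, only examples.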

UNSUPERVISED

Unsupervised machine learning algorithms do not require human experts but autonomously discover patterns in data. Unsupervised learning mainly deals with unlabeled data; the model must work on its own to find patterns and information. Examples of problems solved with unsupervised methods are clustering and association:

Clustering methods - Clustering is the grouping of data that have similar characteristics. It helps segment data into groups so that each can be analyzed to find patterns. For example, clustering algorithms can identify groups of users based on their online purchasing history and then send each member targeted ads.

Association methods - Association consists of discovering groups of items frequently observed together. Online retailers use associations to suggest additional purchases to a user based on the content of their shopping cart.
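Clustering can be sketched with the assignment step at the heart of k-means: group each point with its nearest center. The spending figures and centers below are invented for illustration:

```python
def assign_clusters(points, centers):
    """Group each point with its nearest center (one k-means step)."""
    clusters = {c: [] for c in centers}
    for p in points:
        nearest = min(centers, key=lambda c: abs(p - c))
        clusters[nearest].append(p)
    return clusters

# Yearly online spend per user: two natural groups, light vs heavy buyers
spend = [10, 12, 15, 200, 220, 210]
print(assign_clusters(spend, centers=(12, 210)))
```

A full k-means algorithm would then recompute each center as the mean of its group and repeat; note that no labels were ever provided, which is what makes this unsupervised.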

REINFORCEMENT

Reinforcement learning teaches the machine through trial and error using feedback from its actions and experiences, also known as learning from mistakes. It involves assigning positive values to desired outcomes and negative values to undesired effects. The result is optimal solutions: the system learns to avoid adverse outcomes and seek the positive. Practical applications of reinforcement learning include building artificial intelligence for playing video games, and robotics and industrial automation.
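The reward-and-penalty idea can be sketched with a toy value-update loop, a much-simplified flavor of reinforcement learning. The actions and rewards are invented for illustration:

```python
import random

# Two possible actions; choosing "right" earns a reward (+1),
# choosing "left" is penalized (-1).
values = {a: 0.0 for a in ("left", "right")}
random.seed(0)

for _ in range(200):
    action = random.choice(["left", "right"])
    reward = 1.0 if action == "right" else -1.0
    # Nudge the action's estimated value toward the observed reward
    values[action] += 0.1 * (reward - values[action])

best = max(values, key=values.get)
print(best)  # the agent learns to prefer "right"
```

Real reinforcement learning adds states, exploration strategies, and discounted future rewards, but the core loop of act, observe a reward or penalty, and update value estimates is the same.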
The Machine Learning Process

Developing a machine learning solution is seldom a linear process. Several trial-and-error steps are necessary to fine-tune the solution. The details of each step performed by the Data Crunchers data scientists as they work on the new weed identification and eradication model are as follows:

Step 1. Data preparation - Perform data cleaning procedures such as transformation into a structured format and removal of missing data and noisy/corrupted observations.

Step 2a. Learning data - Create a learning data set used to train the model.

Step 2b. Testing data - Create a test dataset used to evaluate the model's performance. This step is only performed in the case of supervised learning.

Step 3. Learning Process Loop - Selection. An algorithm is chosen based on the problem. Depending on the selected algorithm, additional pre-processing steps might be necessary.

Step 4. Learning Process Loop - Evaluation. The selected algorithm's performance is evaluated on the learning data. If the algorithm and the model reach an acceptable performance on the learning data, the solution is validated on the test data. Otherwise, the learning process is repeated with a proposed new model and algorithm.

Step 5. Model evaluation - Test the solution on the test data. Performance on learning data is not necessarily transferrable to test data. The more complex and fine-tuned the model is, the higher the chances are that the model will become prone to overfitting, which means it cannot perform accurately against unseen data. Overfitting can result in going back to the model learning process.

Step 6. Model implementation - After the model achieves satisfactory performance on test data, implement the model. Implementing the model means performing the necessary tasks to scale the machine learning solution to big data.
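The steps of the machine learning process can be sketched as a compact loop. The data, the candidate models, and the idea of picking the best performer stand in for the real selection/evaluation loop; all names are illustrative:

```python
def train_test_split(data, test_fraction=0.2):
    """Steps 2a/2b: carve the prepared data into learning and test sets."""
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

# Step 1 (data preparation) is assumed done: clean (feature, label) pairs.
data = [(i, i % 2) for i in range(10)]
learn, test = train_test_split(data)

# Steps 3-4: try candidate models on the learning data, keep the best.
candidates = [lambda x: 0, lambda x: x % 2]
model = max(candidates, key=lambda m: accuracy(m, learn))

# Step 5: evaluate on held-out test data before implementing (Step 6).
print(accuracy(model, test))
```

Checking the chosen model on data it never saw during selection is exactly what guards against the overfitting described in Step 5.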
Training Machines to Recognize Patterns

Earlier, you learned that machine learning is a subset of artificial intelligence. Artificial intelligence is the concept that a system can learn from data, identify patterns, and make decisions with little or no human intervention. Machine learning has many valuable applications in the field of data analytics. One critical application is pattern recognition.

Pattern recognition utilizes machine learning algorithms to identify patterns in digital data. These patterns are then applied to different datasets with the goal of recognizing the same or similar patterns in the new data. The data can be contained in many different formats, such as text, photographs, or videos. For example, if referencing classes of birds, a description of a bird would be a pattern; the types could be sparrows, robins, or finches, among others. Using computer vision, image processing technologies, and pattern recognition, we can extract specific patterns from images of birds and compare them to pictures of birds stored in a database.

Pattern recognition uses the concept of learning to classify data based on statistical information gained from patterns and their representations. Learning enables pattern recognition systems to be "trained" and adaptable to provide more accurate results. When training a pattern recognition system, a portion of the dataset prepares the system, and the remaining amount tests the system's accuracy. The data set divides into two groups: one to train the model and one to test the model. The training data set is used to build the model and consists of about 80% of the data; it contains the set of images used to train the system. The testing data set consists of about 20% of the data and measures the model's accuracy. For example, if a system that identifies categories of birds can correctly identify seven out of ten birds, then the system's accuracy is 70%.

Pattern recognition algorithms can be applied to different types of digital data, including images, texts, or videos, and can be used to fully automate and solve complicated analytical problems. The applications and use cases for pattern recognition are virtually unlimited. Some examples include:

- Mobile Security - Identifying fingerprints or facial recognition to gain access to a smartphone.
- Engineering - Speech recognition by digital assistant systems such as Alexa, Google Assistant, and Siri.
- Geology - Detecting specific types of rocks and minerals and interpreting temporal patterns in seismic array recordings.
- Biomedical - Using biometric patterns to identify tumor and cancer cells in the body.
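The 80/20 split and the accuracy computation described for pattern recognition systems can be shown directly: fifty stand-in images split 40/10, and seven correct answers out of ten giving 70% accuracy, as in the bird example:

```python
def split_80_20(dataset):
    """Divide a dataset into ~80% training and ~20% testing groups."""
    cut = int(len(dataset) * 0.8)
    return dataset[:cut], dataset[cut:]

def accuracy_pct(predictions, truth):
    correct = sum(p == t for p, t in zip(predictions, truth))
    return 100 * correct / len(truth)

images = list(range(50))            # stand-ins for labeled bird images
train, test = split_80_20(images)
print(len(train), len(test))        # 40 10

# If the trained system labels 7 of 10 test birds correctly:
truth = ["robin"] * 10
preds = ["robin"] * 7 + ["finch"] * 3
print(accuracy_pct(preds, truth))   # 70.0
```

In practice the split is randomized so that both groups are representative, but the arithmetic of measuring accuracy on the held-out 20% is exactly this.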

AI AND ML SUMMARY

Objective: Explain the concept of AI.

AI uses computer systems to perform tasks that formerly required human intelligence. AI completes processes more efficiently and effectively. AI impacts numerous aspects of our lives, such as marketing, blogging, healthcare, agriculture, retail experiences, and fitness.

Objective: Explain how big data enables machine learning.

Machine learning is a subset of artificial intelligence based on the concept that a system can learn from data, identify patterns, and make decisions with little or no human intervention. Machine learning comprises both classifiers and algorithms. Classifiers categorize operations, while algorithms are the techniques that organize and orient classifiers.

Machine learning is divided into three primary learning model approaches: supervised, unsupervised, and reinforcement.

A machine learning solution includes:

- Step 1 - Data preparation
- Step 2a - Learning data
- Step 2b - Testing data
- Steps 3 and 4 - Learning Process Loop
- Step 5 - Model evaluation
- Step 6 - Model implementation

Reflection

Imagine a future where personal identification is no longer needed. Document owners can be identified strictly by handwriting, and individuals can be identified by their unique voices. The technology to accomplish this already exists. With machine learning, we already have image and speech recognition, statistical arbitrage, and predictive analytics, not to mention extraction of unstructured data to prevent, identify, and diagnose health disorders.

It seems that it will become increasingly hard to hide your identity and daily activities in the near future. Perhaps the medical field will make gains in identifying and treating childhood illnesses. While most machine learning is being used to improve our daily lives and future, there are concerns. It stands to ask: can machine learning and artificial intelligence make mistakes? What systems are in place to catch those mistakes before any damage has occurred?
