Big Data Class 27Feb

The document outlines the evolution of data management, highlighting the historical milestones from the 1940s to the present, including the rise of big data and the challenges faced in managing it. It discusses the importance of structuring data for analysis, the types of data (structured, unstructured, semi-structured), and the four V's of big data: volume, velocity, variety, and veracity. Additionally, it addresses the data analytics project life cycle and the common challenges organizations face in understanding and implementing data analytics.


History of Data Management - Evolution of Big Data

• In the early 60s, technology witnessed problems with velocity, or real-time data assimilation. This need inspired the evolution of databases.
• In the 90s, technology witnessed issues with variety (e-mails, documents, videos), leading to the emergence of non-SQL stores.
• Today, technology is facing issues related to huge volume, leading to new storage and processing solutions.
Evolution of Big Data

Year           Milestone
1940s          An American librarian speculated about a potential shortfall of shelves and cataloguing staff, recognizing the rapid increase in information and the limits of storage.
1960s          'Automatic Data Compression' was proposed as a fully automatic, fast, three-part compressor applicable to any kind of information, intended to reduce slow external storage requirements and increase the rate of transmission from a computer system.
1970s          In Japan, the Ministry of Posts and Telecommunications initiated a project to study information flow in order to track the volume of information circulating in the country.
1980s          A research project was started by the Hungarian Central Statistics Office to account for the country's information industry. It measured the volume of information in bits.
1990s          Digital storage systems became more economical than paper storage. Challenges related to the amount of data and the presence of obsolete data became apparent.
2000 onwards   Various methods were introduced to streamline information. Techniques for controlling the Volume, Velocity and Variety of data emerged, thus introducing 3D data management.
Structuring Big Data:
• Structuring data means arranging the available data in a manner such that it becomes:
  - easy to study
  - easy to analyze
  - easy to draw conclusions from
Why is structuring required?
In daily life, we come across questions like:
 How do I use the vast amount of data and information I encounter to my advantage?
 Which news articles should I read of the thousands I come across?
 How do I choose a book from the millions available on my favourite sites or stores?
 How do I keep myself updated about new events, sports, inventions and discoveries taking place across the globe?
Solutions to such questions are provided by information processing systems.
• Structuring data helps in understanding user behaviors, requirements
and preferences to make personalized recommendations for every
individual.

• Various sources generate a variety of data, such as images, text,


audios, etc.

• All such different types of data can be structured only if it is sorted


and organized in some logical pattern.

• The process of structuring data requires one to first understand the


various types of data available today.
Types of Data
• Data that comes from multiple sources, such as databases, Enterprise
Resource Planning(ERP) systems, weblogs, chat history and GPS maps
varies in its format.

• Different formats of data need to be made consistent and clear to be used for analysis.
• Data is primarily obtained from the following types of sources:
  - Internal sources, such as organizational or enterprise data
  - External sources, such as social data
Comparison between the Internal and External sources of data

Internal sources
- Definition: Provide structured or organized data that originates from within the enterprise and helps run the business.
- Examples of sources: Customer Relationship Management (CRM) and Enterprise Resource Planning (ERP) systems; customer details; product and sales data; generally OLTP and operational data.
- Application: This data (the current data in the operational system) is used to support the daily business operations of an organization.

External sources
- Definition: Provide unstructured or unorganized data that originates from the external environment of an organization.
- Examples of sources: Business partners, syndicate data suppliers, the Internet, government, market research organizations.
- Application: This data is often analyzed to understand entities external to the organization, such as customers, competitors, the market and the environment.
Concepts of Big Data
• In Big Data, we deal with the following interrelated areas:
  - Data storage
  - Data mining
  - Data analysis
  - Data science
  - Distributed systems
  - Parallel processing
  - Artificial intelligence
• On the basis of the data received from the sources mentioned in the
table, Big Data comprises:
- Structured data
- Unstructured data
- Semi-Structured data
• Unstructured data is larger in volume than structured and semi-structured data; approximately 70% to 80% of all data is in unstructured form.
Structured, unstructured and semi-structured data together constitute Big Data.
Structured Data:
• Structured data can be defined as the data that has a defined repeating pattern.
• This pattern makes it easier for any program to sort, read and process the data.
• Structured data:
- Is organized data in a predefined format
- Is stored in tabular form
- Is the data that resides in fixed fields within a record or file
- Is formatted data that has entities and their attributes mapped
- Is used to query and report against predetermined data types
Some sources of structured data include:
• Relational databases (in the form of tables)
• Flat files in the form of records (like comma-separated values (CSV) and tab-separated files)
• Multidimensional databases (majorly used in data warehouse technology)
• Legacy databases
• Example:
Customer ID   Name    Product ID   City        State
12365         Smith   241          Graz        Styria
23658         Jack    365          Wolfsberg   Carinthia
32456         Kady    421          Enns        Upper Austria
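
Because structured data resides in fixed fields with predetermined types, it can be queried and reported on directly. Below is a minimal sketch in Python using the standard sqlite3 module; the table and column names are illustrative, chosen to match the example above:

import sqlite3

# Build an in-memory SQLite database holding the example table above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER, name TEXT, "
    "product_id INTEGER, city TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [
        (12365, "Smith", 241, "Graz", "Styria"),
        (23658, "Jack", 365, "Wolfsberg", "Carinthia"),
        (32456, "Kady", 421, "Enns", "Upper Austria"),
    ],
)

# Fixed fields make querying against predetermined data types straightforward.
for row in conn.execute("SELECT name, city FROM customers WHERE product_id > 300"):
    print(row)  # ('Jack', 'Wolfsberg') then ('Kady', 'Enns')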


Unstructured Data:
• Unstructured data is a set of data that might or might not have any
logical or repeating patterns.
• Unstructured Data:
- Consists typically of metadata, i.e., the additional information
related to data.
- Comprises inconsistent data, such as data obtained from files,
social media websites, satellites etc.
- Consists of data in different formats such as e-mails, text, audio,
video or images.
• Some sources of unstructured data include :
- Text both internal and external to an organization
- Social media-Data obtained from social networking platforms.
- Mobile data- Data such as text messages and location information.
• About 80% of enterprise data consists of unstructured content.
Challenges associated with Unstructured data
- Identifying the unstructured data that can be processed.
- Sorting, organizing and arranging unstructured data in different
sets and formats.
- Combining and linking unstructured data in a more structured
format to derive any logical conclusions out of the available
information.
- Incurring costs in terms of storage space and human resources (data analysts and data scientists) needed to deal with the exponential growth of unstructured data.
Semi-Structured Data
• Semi-structured data, also described as schema-less or self-describing, refers to a form of structured data that contains tags or markup elements.
• Some sources of semi-structured data include:
  - File systems, such as Web data in the form of cookies
  - Data exchange formats, such as JavaScript Object Notation (JSON) data
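
As a short illustration, the sketch below parses a small JSON record with Python's standard json module; the record itself is made up for this example. The keys act as self-describing tags, but no fixed schema is enforced:

import json

# A made-up, self-describing record: keys act as tags, values vary freely in type.
raw = '{"name": "Smith", "city": "Graz", "interests": ["big data", "analytics"]}'

record = json.loads(raw)         # parse the JSON text into Python objects
print(record["name"])            # 'Smith'
print(len(record["interests"]))  # 2 -- fields can hold nested, variable-length data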
Elements of Big Data
• According to Gartner, data is growing at the rate of 59% per year. This growth can be described in terms of the following four V's:
- Volume
- Velocity
- Variety
- Veracity
Volume:
• Volume is the amount of data generated by organizations or
individuals.
• Today, the volume of data in most organizations is approaching exabytes. Some experts predict the volume of data to reach zettabytes in the coming years.
• For example, according to IBM, over 2.7 zettabytes of data are present in the digital world today.
• Every minute, over 571 new websites are created.
• The Internet alone generates a huge amount of data. The following figures give an idea of Internet traffic:
  - The Internet has around 14.3 trillion live web pages; 48 billion web pages are indexed by Google Inc. and 14 billion by Microsoft Bing.
  - The Internet has around 672 exabytes of accessible data.
  - The total data stored on the Internet is over 1 yottabyte.
• The exact size of the Internet will never be known.
Velocity
• Velocity describes the rate at which data is generated, captured and
shared.
• Enterprises can capitalize on data only if it is captured and shared in
real time.
• Information processing systems such as CRM and ERP face problems associated with data, which keeps adding up but cannot be processed quickly.
• These systems can only process data in batches every few hours; however, even this time lag causes the data to lose its importance, as new data is constantly being generated.
• For example, eBay analyzes around 5 million transactions per day in real time to detect and prevent frauds arising from the use of PayPal.
• The sources of high-velocity data include the following:
  - IT devices, including routers, switches, firewalls, etc., constantly generate valuable data.
  - Social media, including Facebook posts, tweets and other social media activities, create huge amounts of data.
  - Portable devices, including mobiles, PDAs, etc., also generate data at high speed.
Data Analytics Project Life Cycle:
• The data analytics project life cycle typically involves several phases,
each with its own set of tasks, goals, and deliverables.
• An analysis process contains all or some of the following phases:
1. Business Understanding:
- The first phase involves identifying and understanding the
business objectives. It deals with problems to be solved and
decisions to be made.
- The main goal is to enhance business profitability.
- Once the business objectives are determined, the analysts
evaluate the situation, and identify the data mining goals.
- According to the defined goals, a project plan is created
between the analytics team and the IT or development team.
2. Data Collection:
- The process of collecting data is an important task in executing
a project plan accurately.
- In this phase, data from different data sources is collected first and then described in terms of its application and the needs of the project.
- This process is also called Data Exploration. Exploration of data
is required to ensure the quality of the collected data.
3. Data Preparation:
- From the data thus collected, unnecessary or unwanted data is
to be removed in this phase.
- In other words, the data must be prepared for the purpose of
analysis.
4. Data Modeling:
- In this phase, a model is created by using a data modeling
technique.
- The data model is used to analyze the relationships between
different selected objects in the data.
- Test cases are created to assess the applicability of the model, and data is structured according to the model.
5. Data Evaluation:
- The results obtained from different test cases are evaluated and
reviewed for errors.
- After validating the results, analysis reports are created for
determining the next plan of action.
6. Deployment:
- In this phase, the plan is finalized for deployment.
- The deployed plan is constantly checked for errors and
maintenance.
- This process is also termed reviewing the project.

• The various phases of analysis and the typical tasks within each are summarized below.

Business Understanding: Determine Business Objectives → Assess Situation → Determine Data Mining Goals → Produce Project Plan
Data Collection:        Collect Initial Data → Describe Data → Explore Data → Verify Data Quality
Data Preparation:       Select Data → Clean Data → Construct Data → Integrate Data → Format Data
Data Modeling:          Select Modeling Technique → Generate Test Design → Build Model → Assess Model
Data Evaluation:        Evaluate Results → Review Process → Determine Next Steps
Deployment:             Plan Deployment → Plan Monitoring & Maintenance → Produce Final Report → Review Project
Problems and Challenges in understanding Data
Analytics
• Understanding data analytics can present several challenges and
problems, both for individuals and organizations. Here are some
common issues:
1. Complexity of Data: Understanding the intricacies of different data
types, formats, and structures can be challenging.
2. Data Quality Issues: Poor data quality, including inaccuracies,
inconsistencies, missing values, and duplications, can impact the
accuracy and reliability of analytics results. Cleaning and preprocessing
data to ensure quality can be time-consuming and resource-intensive.
3. Lack of Data Governance: Inadequate data governance practices,
including unclear data ownership, access controls, and data
management policies, can lead to data security breaches, privacy
concerns, and regulatory compliance issues.
4. Limited Skills and Expertise: Data analytics requires a diverse skill set
encompassing statistics, mathematics, programming, data visualization,
and domain-specific knowledge. A shortage of skilled professionals with
expertise in these areas can hinder effective data analysis efforts.
5. Technology and Infrastructure Constraints: Inadequate technology
infrastructure, including outdated tools, insufficient computing
resources, and incompatible systems, can impede data analytics
initiatives and limit the scalability and performance of analytics
solutions.
6. Integration Challenges: Integrating data from disparate sources and
systems to create a unified view for analysis can be challenging.
7. Interpretation and Communication: Analyzing data is only part of
the process; effectively interpreting the results and communicating
insights to stakeholders is equally important.
8. Ethical and Bias Concerns: Data analytics can raise ethical concerns
related to privacy, fairness, and bias.
9. Cost and Return on Investment (ROI): Implementing data analytics
initiatives can be costly, requiring investment in technology, talent, and
infrastructure.
10. Cultural Resistance and Organizational Change: Resistance to
change and cultural barriers within organizations can hinder the
adoption of data-driven decision-making processes.
Web Page Categorization:
• Web page categorization refers to the process of assigning a specific
category or classification to a webpage based on its content, purpose,
or theme.
• This categorization is often performed by algorithms or human
reviewers to help organize and index the vast amount of information
available on the internet.
• There are several approaches to web page categorization:
1. Keyword Analysis: This involves analyzing the keywords present in the webpage's content or metadata to determine its category. Keywords related to specific topics or themes can indicate the nature of the page (see the sketch after this list).
2. Natural Language Processing (NLP): Advanced NLP techniques can
be used to analyze the textual content of web pages and categorize
them based on semantic meaning. This involves techniques such as text
classification, topic modeling, and sentiment analysis.
3. Machine Learning: Machine learning algorithms can be trained on
labeled datasets to automatically categorize web pages. These
algorithms learn patterns and features from the content of web pages
and use them to predict the most appropriate category.
4. Website Metadata: Information such as meta tags, titles, and descriptions embedded in the HTML of web pages can provide valuable clues about their category.
5. Link Analysis: Web pages often link to other pages within similar topics or categories. Analyzing the links pointing to a page or originating from it can provide insights into its category.
6. User Behavior: Analyzing user interactions with web pages, such as click-through rates, time spent on page, and bounce rates, can also help determine their category.
7. Hybrid Approaches: Combining multiple methods such as keyword analysis, NLP, and machine learning algorithms can improve the accuracy of web page categorization.
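
As a minimal sketch of the keyword-analysis approach (item 1), the following scores a page against hand-picked keyword lists. The categories and keywords are invented for this illustration; a production system would typically combine this with the NLP or machine-learning methods above:

import re

# Hypothetical keyword lists per category; a real system would learn or curate these.
CATEGORY_KEYWORDS = {
    "sports":     {"match", "team", "score", "league", "tournament"},
    "technology": {"software", "data", "algorithm", "cloud", "startup"},
    "finance":    {"stock", "market", "investment", "revenue", "shares"},
}

def categorize(page_text):
    """Assign the category whose keywords appear most often in the page text."""
    words = re.findall(r"[a-z]+", page_text.lower())
    scores = {
        category: sum(word in keywords for word in words)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    return max(scores, key=scores.get)  # ties resolve arbitrarily

print(categorize("The startup released cloud software for data pipelines."))
# -> 'technology'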
• Web page categorization is used for various purposes, including
search engine optimization (SEO), content filtering, targeted
advertising, and organizing web directories.
• It helps users find relevant information more efficiently and enables
businesses to deliver more targeted content and advertisements to
their audience.
Computing the frequency of Stock Market Change:
• To compute the frequency of stock market change, you would typically follow these steps (a worked sketch follows the list):
1. Define a Time Interval: Determine the time interval over which you
want to compute the frequency of stock market change. This could
be daily, weekly, monthly, etc.
2. Collect Stock Market Data: Gather historical stock market data for
the selected time interval. This data typically includes the opening
and closing prices of the stock for each time period, although you
can use other metrics such as high, low, and volume depending on
your analysis requirements.
3. Calculate Changes: For each time interval, calculate the change in stock price. This can be done by subtracting the opening price from the closing price, or by calculating the percentage change.
4. Count Changes: Count the number of times the stock price changed
within your chosen time interval. You can set a threshold for what
constitutes a significant change, such as a certain percentage or
absolute value.
5. Compute Frequency: Divide the total number of changes by the total
number of time intervals to obtain the frequency of stock market
change.
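
A minimal sketch of these five steps, assuming daily (open, close) price pairs are already available and taking a 1% absolute move as the threshold for a significant change (both the data and the threshold are illustrative):

# Hypothetical daily (open, close) prices; real data would come from a market data feed.
daily_prices = [
    (100.0, 102.5),  # +2.5%
    (102.5, 102.6),  # +0.1% -- below threshold
    (102.6, 100.1),  # -2.4%
    (100.1, 100.9),  # +0.8% -- below threshold
]

THRESHOLD = 0.01  # step 4: a move of at least 1% counts as a significant change

# Step 3: percentage change per interval = (close - open) / open
changes = [(close - open_) / open_ for open_, close in daily_prices]

# Step 4: count significant changes; step 5: divide by the number of intervals.
significant = sum(abs(c) >= THRESHOLD for c in changes)
frequency = significant / len(daily_prices)

print(f"{significant} significant changes in {len(daily_prices)} intervals "
      f"-> frequency {frequency:.2f}")  # 2 in 4 -> 0.50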
Use of Big Data in Social Networking:
• A human being lives in a social environment and gains knowledge and
experience through communication.
• Today, communication is not restricted to meeting in person.
• The affordable and handy use of mobile phones and the internet has made communication and the sharing of data of all kinds possible across the globe.
• Some popular social networking sites are Twitter, Facebook and
LinkedIn.
• Let's first understand the meaning of social network data.
• Social network data refers to the data generated by people socializing on social media.
• On a social networking site, you will find different people constantly adding and updating comments, statuses, preferences, etc.
• All these activities generate large amounts of data.
• Analyzing and mining such large volumes of data reveals business trends with respect to the wants, preferences, likes and dislikes of a wide audience.
• This data can be segregated on the basis of different age groups, locations and genders for the purpose of analysis.
• Based on the information extracted, organizations design products and services specific to people's needs.
Social Network Data Generated Every Minute of the Day:
- YouTube users upload 72 hours of new video.
- Apple users download nearly 50,000 apps.
- Email users send 200 million messages.
- Amazon generates over $80,000 in online sales.
- Google receives over 2,000,000 search queries.
- Twitter users send over 300,000 tweets.
- Facebook users share 2.5 million pieces of content.


• The following are the areas in which decision-making processes are
influenced by social network data:
- Business Intelligence
- Marketing
- Product design and development
• Business Intelligence: A data analysis process that converts raw datasets into meaningful information by using different techniques and tools to boost business performance. This system allows a company to collect, store, access and analyze data to add value to decision making.
• Marketing: Consumers can now make their preferences clear and select the marketing messages they wish to receive: when, where and from whom. Marketers aim to deliver what consumers want by using interactive communication across digital channels such as e-mail, mobile, social and the web. Affiliate marketing is a reward-based marketing structure.
• Product Design and Development: By listening to what customers
want, by understanding where the gap in the offering is and so on,
organizations can make the right decisions in the direction of their
product development and offerings.
Use of Big Data in Preventing Fraudulent activities:
• A fraud can be defined as the false representation of facts, leading to
concealment or distortion of the truth.
• The following are some of the most common types of financial frauds:

- Credit card fraud: In online shopping transactions, the online retailer cannot see the authentic user of the card and therefore, the valid owner of the card cannot be verified. In spite of security checks, such as address verification or the card security code, fraudsters manage to exploit loopholes in the system.

- Exchange or return policy fraud: An online retailer usually has a policy allowing the exchange and return of goods, and sometimes people take advantage of this policy. Such fraud can be averted by charging a restocking fee on returned goods and getting the customer's signature on delivery of the product.
• Personal information fraud: In this type of fraud, people obtain the login information of a customer, log in to the customer's account, purchase a product online and then change the delivery address to a different location.
• All these frauds can be prevented only by studying the customer’s
ordering patterns and keeping track of out-of-line orders.
• Other aspects should also be taken into consideration such as any
change in the shipping address, rush orders, sudden huge orders and
suspicious billing addresses.
• By observing such precautions, the frequency of the occurrence of such frauds can be reduced to a certain extent, but they cannot be completely eliminated.
Preventing Fraud Using Big Data Analytics
• One of the ways to prevent financial frauds is to study the customer’s
ordering pattern and other related data. This method works only
when the data to be analyzed is small in size.
• In order to deal with huge amounts of data and gain meaningful
business insights, organizations need to apply Big Data Analytics.
• Analyzing Big Data allows organizations to:
- Keep track of and process huge volumes of data.
- Differentiate between real and fraudulent entries.
- Identify new methods of fraud and add them to the list of fraud-
prevention checks.
- Verify whether a product has actually been delivered to the valid
recipient.
- Determine the location of the customer and the time when the
product was actually delivered.
- Check the listings of popular retail sites, such as eBay, to find whether the product is up for sale somewhere else.

Fraud Detection in Real Time:


• Big Data also helps to detect frauds in real time.
• It compares live transactions with different data sources to validate
the authenticity of online transactions.
• For example, in an online transaction Big Data would compare the
incoming IP address with the geo-data received from the customer’s
smartphone apps. A valid match between the two confirms the
authenticity of the transaction.
• Big Data also examines the entire historical data to track suspicious
patterns of the customer order.
• These patterns are then used to create checks for avoiding real-time
fraud.
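
A toy sketch of the IP-versus-smartphone geolocation check described above; the coordinates, the 50 km threshold and the haversine distance calculation are illustrative choices, not a prescribed method:

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 6371 * 2 * asin(sqrt(a))

def looks_authentic(ip_location, phone_location, max_km=50):
    """Flag a transaction as suspicious if the IP geolocation and the
    customer's smartphone geo-data are too far apart (threshold is illustrative)."""
    return haversine_km(*ip_location, *phone_location) <= max_km

# Made-up locations: the IP resolves near Graz, the phone reports Vienna (~145 km apart).
print(looks_authentic((47.07, 15.44), (48.21, 16.37)))  # False -> review the transaction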
Visually Analyzing Fraud:
• Image analytics is another emerging field that can help detect frauds.
• It refers to the process of analyzing image data with the help of digital processing of the image. Examples: bar codes and QR codes.
• Some other examples include complex solutions such as facial
recognition and position-and-movement analysis.
Use of Big Data in Retail Industry
• Considering the immense number of transactions and their correlation,
the retail industry offers a promising space for Big Data to operate.
• Business insights into customer behavior and company health can be obtained by relating the organization's in-store sales to its online sales.
• It could be very difficult for a marketing analyst to understand the
health and strength of different types of products and campaigns and
reconcile the data obtained from these systems.
• Many times extracting data in real time is not feasible as systems are
affected because of scaling issues.
• Raw transactional data can only help a company understand its sales
but does not provide any relationships, patterns, or other clues for
deeper analysis.
• Some information in a Big Data feed can have a long-term strategic
value while some information will be used immediately and some
information will not be used at all.
• The main part of taming Big Data is to identify which portions fall into which category.

Use of RFID Data in Retail:


• The introduction of Radio Frequency Identification (RFID) technology automated the process of labeling and tracking products, thereby saving significant time, cost and effort.
• Walmart was one of the first retailers to implement RFID in its
merchandise.
• RFID technology enables better item tracking by differentiating between items that are out of stock and those available on shelves.
• Various types of RFID tags are available for various environments such
as cardboard boxes, wooden, glass or metal containers.
• Tags also come in various sizes and are of varied capabilities, including
read and write capability, memory and power requirements.
• They also have a wide range of durability.
• Some varieties are paper-thin and are typically for one-time use and
are called ‘Smart Labels’.
• RFID tags can also be customized and withstand heat, moisture, acids
and other extreme conditions. Some RFID tags are also reusable.
• The use of RFIDs saves time, reduces labor, enhances the visibility of products throughout the production-delivery cycle and saves costs.
Advantages of using RFID:
Some common benefits of using RFID include:
- Asset Management
- Inventory Control
- Shipping and Receiving
- Regulatory Compliance
- Service and Warranty Authorizations

• Asset Management: Organizations can tag all their capital assets, such as
pallets, vehicles and tools in order to trace them anytime and from any
location.
• Inventory control: One of the primary benefits of using RFID is inventory
tracking, especially in areas where tracking has not been done or was
not possible before.
• Using an RFID tracking system can result in an optimized inventory level,
and thus reduce the overall cost of stocking and labor.
• Shipping and Receiving: RFID tags can also be used to trigger automated
shipping tracking applications.
• Serial Shipping Container Code(SSCC) is widely used in shipping labels.
• The data contained in the RFID tags can be combined with the shipment information, which can easily be read by the receiving organization to simplify the receiving process and eliminate processing delays.
• Regulatory Compliance: The entire custody trail can be produced
before regulatory bodies such as the Food and Drug
Administration(FDA), Department of transportation(DOT), and
Occupational safety and Health Administration(OSHA) along with
other regulatory requirements, provided the RFID tag that travels with
the material has been updated with all the handling data.
• Service and Warranty Authorizations: A warranty card or document to request warranty service would no longer be necessary, because an RFID tag can hold all this information.
• Once the repair or service has been completed, the information can
be fed into the RFID tag as the maintenance history. This is something
that will always remain on the product.
• If future repairs are required the technician can access this
information without accessing any external database, which helps in
reducing calls and time-expensive enquiries into documents.
