Big Data Class 27Feb
1960s ‘Automatic Data Compression’ acts as a complete automatic and fast three-part compressor
that can be used for any kind of information in order to reduce the slow external storage
requirements and increase the rate of transmission from a computer system.
1970s In Japan, the Ministry of Posts and Telecommunications initiated a project to study
information flow in order to track the volume of information circulating in the country.
1980s A research project was started by the Hungarian Central Statistics office to account for the
country’s information industry. It measured the volume of information in bits.
1990s Digital storage systems became more economical than paper storage. Challenges related to
the amount of data and the presence of obsolete data became apparent.
2000 onwards Various methods were introduced to streamline information. Techniques for controlling the
Volume, Velocity and Variety of data emerged, introducing three-dimensional (3V) data management.
Structuring Big Data:
• Structuring data means arranging the available data in a manner
that makes it
- easy to study
- easy to analyze
- easy to derive conclusions from
Why is structuring required?
In daily life, we come across questions like:
How do I use the vast amount of data and information I come
across to my advantage?
Which news articles should I read of the thousands I come across?
How do I choose a book of the millions available on my favourite sites
or stores?
How do I keep myself updated about new events, sports, inventions
and discoveries taking place across the globe?
Solutions to such questions can be found in information processing
systems.
• Structuring data helps in understanding user behaviors, requirements
and preferences to make personalized recommendations for every
individual.
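As a small illustration of the point above, here is a sketch in Python of recommending a topic from a user's structured activity data. The user names and reading histories are made-up assumptions, not real data.

```python
# Minimal sketch: a personalized recommendation from structured user data.
from collections import Counter

# Topics of the articles each (hypothetical) user has read.
user_reads = {
    "user1": ["sports", "sports", "tech", "sports"],
    "user2": ["finance", "tech", "tech"],
}

def recommend_topic(user):
    """Recommend the topic the user reads most often."""
    return Counter(user_reads[user]).most_common(1)[0][0]

print(recommend_topic("user1"))  # sports
print(recommend_topic("user2"))  # tech
```

Real recommendation systems use far richer signals (ratings, co-occurrence, collaborative filtering), but the principle is the same: structured behavioral data makes user preferences queryable.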
[Figure: Big Data shown at the intersection of Distributed Systems, Data Science, Parallel Processing and Artificial Intelligence]
• On the basis of the data received from the sources mentioned in the
table, Big Data comprises:
- Structured data
- Unstructured data
- Semi-Structured data
• Unstructured data is larger in volume than structured and
semi-structured data; approximately 70% to 80% of all data is in
unstructured form.
[Figure: Big Data comprising unstructured, semi-structured and structured data]
Structured Data:
• Structured data can be defined as the data that has a defined repeating pattern.
• This pattern makes it easier for any program to sort, read and process the data.
• Structured data:
- Is organized data in a predefined format
- Is stored in tabular form
- Is the data that resides in fixed fields within a record or file
- Is formatted data that has entities and their attributes mapped
- Is used to query and report against predetermined data types
Some sources of structured data include:
• Relational databases (in the form of tables)
• Flat files in the form of records (like comma-separated values (CSV)
and tab-separated files)
• Multidimensional databases (majorly used in data warehouse
technology)
• Legacy databases.
• Example: a table of customer data with fixed fields:

  Customer ID | Name | Product ID | City | State
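The fixed-field nature of structured data can be shown with a short Python sketch. The customer rows below are made-up illustrative data, following the example fields above.

```python
# Minimal sketch: structured data as fixed fields in CSV records.
import csv
import io

csv_text = """CustomerID,Name,ProductID,City,State
101,Asha,P-20,Mumbai,Maharashtra
102,Ravi,P-31,Pune,Maharashtra
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Because every record has the same predefined fields, programs can
# sort, filter and query the data directly.
pune_customers = [r["Name"] for r in rows if r["City"] == "Pune"]
print(pune_customers)  # ['Ravi']
```

This repeating pattern is exactly what makes structured data "easy to sort, read and process": any record can be addressed by field name.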
Elements of Big Data
• According to Gartner, data is growing at the rate of 59% per year.
This growth can be depicted in terms of the following four V's:
- Volume
- Velocity
- Variety
- Veracity
Volume:
• Volume is the amount of data generated by organizations or
individuals.
• Today, the volume of data in most organizations is approaching
exabytes. Some experts predict the volume of data to reach zettabytes
in the coming years.
• For example, according to IBM, over 2.7 zettabytes of data is present
in the digital world today.
• Every minute, over 571 new websites are being created.
• The internet alone generates a huge amount of data. The following
figures give an idea of the internet traffic:
- The internet has around 14.3 trillion live web pages; 48 billion web
pages are indexed by Google, and 14 billion web pages are indexed by
Microsoft Bing.
- Internet has around 672 exabytes of accessible data.
- Total data stored on the Internet is over 1 yottabyte.
• The exact size of the internet will never be known.
Velocity
• Velocity describes the rate at which data is generated, captured and
shared.
• Enterprises can capitalize on data only if it is captured and shared in
real time.
• Information processing systems such as CRM and ERP face problems
associated with data, which keeps accumulating but cannot be processed
quickly.
• These systems can only process data in batches every few hours;
however, even this time lag causes the data to lose its importance, as
new data is constantly being generated.
• For example, eBay analyzes around 5 million transactions per day in
real time to detect and prevent fraud arising from the use of PayPal.
• The sources of high velocity data include the following:
• IT devices, including routers, switches, firewalls, etc., constantly
generate valuable data.
• Social media, including Facebook posts, tweets and other social media
activities, creates huge amounts of data.
• Portable devices, including mobile phones, PDAs, etc., also generate
data at a high speed.
Data Analytics Project Life Cycle:
• The data analytics project life cycle typically involves several phases,
each with its own set of tasks, goals, and deliverables.
• An analysis process contains all or some of the following phases:
1. Business Understanding:
- The first phase involves identifying and understanding the
business objectives. It deals with problems to be solved and
decisions to be made.
- The main goal is to enhance business profitability.
- Once the business objectives are determined, the analysts
evaluate the situation, and identify the data mining goals.
- According to the defined goals, a project plan is created
between the analytics team and the IT or development team.
2. Data Collection:
- The process of collecting data is an important task in executing
a project plan accurately.
- In this phase, data from different data sources is collected first
and then described in terms of its application and the needs of the
project.
- This process is also called Data Exploration. Exploration of data
is required to ensure the quality of the collected data.
3. Data Preparation:
- From the data thus collected, unnecessary or unwanted data is
to be removed in this phase.
- In other words, the data must be prepared for the purpose of
analysis.
4. Data Modeling:
- In this phase, a model is created by using a data modeling
technique.
- The data model is used to analyze the relationships between
different selected objects in the data.
- Test cases are created to assess the applicability of the model,
and data is structured according to the model.
5. Data Evaluation:
- The results obtained from different test cases are evaluated and
reviewed for errors.
- After validating the results, analysis reports are created for
determining the next plan of action.
6. Deployment:
- In this phase, the plan is finalized for deployment.
- The deployed plan is constantly checked for errors and
maintenance.
- This process is also termed reviewing the project.
[Figure: tasks within each phase of the data analytics project life cycle]
- Business Understanding: assess situation, determine data mining goals, produce project plan
- Data Collection: describe data, explore data, verify data quality
- Data Preparation: clean data, construct data, integrate data, format data
- Data Modeling: generate test design, build model, assess model
- Data Evaluation: review process, determine next steps
- Deployment: plan monitoring & maintenance, produce final report, review project
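The six phases above can be sketched as a chain of plain Python functions. All function bodies here are illustrative placeholders (the sample records and the "model" are made-up assumptions), intended only to show how each phase's output feeds the next.

```python
# Minimal sketch of the six-phase analytics life cycle as plain functions.

def business_understanding(objective):
    # Phase 1: identify objectives and data mining goals.
    return {"objective": objective, "goals": ["reduce churn"]}

def data_collection(plan):
    # Phase 2: in practice this would pull from databases, files or APIs.
    return [{"customer": "A", "orders": 5}, {"customer": "B", "orders": None}]

def data_preparation(raw):
    # Phase 3: remove unwanted records (here, rows with missing values).
    return [row for row in raw if row["orders"] is not None]

def data_modeling(clean):
    # Phase 4: stand-in "model" -- average number of orders per customer.
    return sum(row["orders"] for row in clean) / len(clean)

def data_evaluation(model_output):
    # Phase 5: review results and validate them.
    return {"avg_orders": model_output, "valid": model_output >= 0}

def deployment(report):
    # Phase 6: finalize the plan for deployment.
    return f"deployed: {report}"

plan = business_understanding("improve profitability")
raw = data_collection(plan)
clean = data_preparation(raw)
model = data_modeling(clean)
report = data_evaluation(model)
print(deployment(report))
```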
Problems and Challenges in understanding Data
Analytics
• Understanding data analytics can present several challenges and
problems, both for individuals and organizations. Here are some
common issues:
1. Complexity of Data: Understanding the intricacies of different data
types, formats, and structures can be challenging.
2. Data Quality Issues: Poor data quality, including inaccuracies,
inconsistencies, missing values, and duplications, can impact the
accuracy and reliability of analytics results. Cleaning and preprocessing
data to ensure quality can be time-consuming and resource-intensive.
3. Lack of Data Governance: Inadequate data governance practices,
including unclear data ownership, access controls, and data
management policies, can lead to data security breaches, privacy
concerns, and regulatory compliance issues.
4. Limited Skills and Expertise: Data analytics requires a diverse skill set
encompassing statistics, mathematics, programming, data visualization,
and domain-specific knowledge. A shortage of skilled professionals with
expertise in these areas can hinder effective data analysis efforts.
5. Technology and Infrastructure Constraints: Inadequate technology
infrastructure, including outdated tools, insufficient computing
resources, and incompatible systems, can impede data analytics
initiatives and limit the scalability and performance of analytics
solutions.
6. Integration Challenges: Integrating data from disparate sources and
systems to create a unified view for analysis can be challenging.
7. Interpretation and Communication: Analyzing data is only part of
the process; effectively interpreting the results and communicating
insights to stakeholders is equally important.
8. Ethical and Bias Concerns: Data analytics can raise ethical concerns
related to privacy, fairness, and bias.
9. Cost and Return on Investment (ROI): Implementing data analytics
initiatives can be costly, requiring investment in technology, talent, and
infrastructure.
10. Cultural Resistance and Organizational Change: Resistance to
change and cultural barriers within organizations can hinder the
adoption of data-driven decision-making processes.
Web Page Categorization:
• Web page categorization refers to the process of assigning a specific
category or classification to a webpage based on its content, purpose,
or theme.
• This categorization is often performed by algorithms or human
reviewers to help organize and index the vast amount of information
available on the internet.
• There are several approaches to web page categorization:
1. Keyword Analysis: This involves analyzing the keywords present in
the webpage's content or metadata to determine its category.
Keywords related to specific topics or themes can indicate the
nature of the page.
2. Natural Language Processing (NLP): Advanced NLP techniques can
be used to analyze the textual content of web pages and categorize
them based on semantic meaning. This involves techniques such as text
classification, topic modeling, and sentiment analysis.
3. Machine Learning: Machine learning algorithms can be trained on
labeled datasets to automatically categorize web pages. These
algorithms learn patterns and features from the content of web pages
and use them to predict the most appropriate category.
4. Website Metadata: Information such as meta tags, titles, and
descriptions embedded in the HTML of web pages can provide valuable
clues about their category.
5. Link Analysis: Web pages often link to other pages within similar
topics or categories. Analyzing the links pointing to a page or
originating from it can provide insights into its category.
6. User Behavior: Analyzing user interactions with web pages, such as
click-through rates, time spent on page, and bounce rates, can also help
determine their category.
7. Hybrid Approaches: Combining multiple methods such as keyword
analysis, NLP, and machine learning algorithms can improve the
accuracy of web page categorization.
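The keyword-analysis approach (item 1 above) can be sketched in a few lines of Python. The category keyword lists here are illustrative assumptions, not a real taxonomy.

```python
# Minimal sketch of keyword-based web page categorization.
CATEGORY_KEYWORDS = {
    "sports": {"match", "team", "score", "tournament"},
    "technology": {"software", "algorithm", "hardware", "startup"},
    "finance": {"stock", "market", "investment", "bank"},
}

def categorize(text):
    """Return the category whose keywords appear most often in the text."""
    words = text.lower().split()
    scores = {
        category: sum(words.count(kw) for kw in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to "unknown" when no keyword matches at all.
    return best if scores[best] > 0 else "unknown"

print(categorize("the team won the match with a record score"))  # sports
```

Production systems would replace the hand-written keyword sets with a trained text classifier (the machine-learning approach in item 3), but the scoring-and-argmax structure is the same.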
• Web page categorization is used for various purposes, including
search engine optimization (SEO), content filtering, targeted
advertising, and organizing web directories.
• It helps users find relevant information more efficiently and enables
businesses to deliver more targeted content and advertisements to
their audience.
Computing the frequency of Stock Market Change:
• To compute the frequency of stock market change, you would
typically follow these steps:
1. Define a Time Interval: Determine the time interval over which you
want to compute the frequency of stock market change. This could
be daily, weekly, monthly, etc.
2. Collect Stock Market Data: Gather historical stock market data for
the selected time interval. This data typically includes the opening
and closing prices of the stock for each time period, although you
can use other metrics such as high, low, and volume depending on
your analysis requirements.
3. Calculate Changes: For each time interval, calculate the change in
stock price. This can be done by subtracting the opening price from
the closing price or by calculating the percentage change.
4. Count Changes: Count the number of times the stock price changed
within your chosen time interval. You can set a threshold for what
constitutes a significant change, such as a certain percentage or
absolute value.
5. Compute Frequency: Divide the total number of changes by the total
number of time intervals to obtain the frequency of stock market
change.
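The steps above can be sketched in Python. The price series and the 1% significance threshold below are made-up illustrative assumptions.

```python
# Minimal sketch: frequency of significant stock price changes.
daily_close = [100.0, 101.5, 101.4, 99.8, 100.2, 103.0, 102.9, 101.0]
threshold = 0.01  # flag moves of 1% or more as "significant" (step 4)

changes = 0
for prev, curr in zip(daily_close, daily_close[1:]):
    pct_change = abs(curr - prev) / prev   # step 3: percentage change
    if pct_change >= threshold:
        changes += 1

intervals = len(daily_close) - 1
frequency = changes / intervals            # step 5
print(f"{changes} significant changes over {intervals} intervals "
      f"(frequency = {frequency:.2f})")
```

With this series, 4 of the 7 daily moves exceed 1%, giving a frequency of about 0.57.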
Use of Big Data in Social Networking:
• A human being lives in a social environment and gains knowledge and
experience through communication.
• Today, communication is not restricted to meeting in person.
• The affordable and handy use of mobile phones and the internet has
made communication and sharing data of all kinds possible across the
globe.
• Some popular social networking sites are Twitter, Facebook and
LinkedIn.
• Let's first understand the meaning of social network data.
• Social network data refers to the data generated from people
socializing on social media.
• On a social networking site, you will find different people constantly
adding and updating comments, statuses, preferences, etc.
• All these activities generate large amounts of data.
• Analyzing and mining such large volumes of data reveals business
trends with respect to the wants, preferences, likes and dislikes of
a wide audience.
• This data can be segregated on the basis of different age groups,
locations and genders for the purpose of analysis.
• Based on the information extracted, organizations design products
and services specific to people's needs.
Every minute of the day:
- YouTube users upload 72 hours of new video
- Apple users download nearly 50,000 apps
- Email users send 200 million messages
- Amazon generates over $80,000 in online sales
- Google receives over 2,000,000 search queries
- Twitter users send over 300,000 tweets
- Facebook users share 2.5 million pieces of content
Common types of online retail fraud include:
- Credit card fraud: In an online shopping transaction, the online retailer cannot
see the person using the card and therefore, the valid owner of the card
cannot be verified. In spite of security checks, such as address verification or
the card security code, fraudsters manage to exploit loopholes in the system.
- Exchange or return policy fraud: An online retailer usually has a policy allowing
the exchange and return of goods and sometimes, people take advantage of this
policy. Such fraud can be averted by charging a restocking fee on returned
goods and obtaining the customer's signature on delivery of the product.
• Personal information fraud: In this type of fraud, people obtain the
login information of a customer, log in to the customer's
account, purchase a product online and then change the delivery
address to a different location.
• All these frauds can be prevented only by studying the customer’s
ordering patterns and keeping track of out-of-line orders.
• Other aspects should also be taken into consideration such as any
change in the shipping address, rush orders, sudden huge orders and
suspicious billing addresses.
• By taking such precautions, the frequency of the occurrence of
such frauds can be reduced to a certain extent, but they cannot be
completely eliminated.
Preventing Fraud Using Big Data Analytics
• One of the ways to prevent financial frauds is to study the customer’s
ordering pattern and other related data. This method works only
when the data to be analyzed is small in size.
• In order to deal with huge amounts of data and gain meaningful
business insights, organizations need to apply Big Data Analytics.
• Analyzing Big Data allows organizations to:
- Keep track of and process huge volumes of data.
- Differentiate between real and fraudulent entries.
- Identify new methods of fraud and add them to the list of fraud-
prevention checks.
- Verify whether a product has actually been delivered to the valid
recipient.
- Determine the location of the customer and the time when the
product was actually delivered.
- Check the listings of popular retail sites, such as eBay, to find
whether the product is up for sale somewhere else.
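The idea of studying a customer's ordering pattern to spot out-of-line orders can be sketched in Python. The order history, the z-score threshold of 3, and the shipping-address check are illustrative assumptions, not a production fraud rule.

```python
# Minimal sketch: flagging "out-of-line" orders against a customer's
# historical ordering pattern.
import statistics

def flag_suspicious(history, new_order, z_threshold=3.0):
    """Flag an order whose amount deviates strongly from past orders,
    or whose shipping address differs from the one on record."""
    amounts = [o["amount"] for o in history]
    mean = statistics.mean(amounts)
    stdev = statistics.pstdev(amounts) or 1.0  # avoid division by zero
    z = abs(new_order["amount"] - mean) / stdev
    address_changed = new_order["ship_to"] != history[-1]["ship_to"]
    return z > z_threshold or address_changed

history = [
    {"amount": 40.0, "ship_to": "Pune"},
    {"amount": 55.0, "ship_to": "Pune"},
    {"amount": 45.0, "ship_to": "Pune"},
]
# A sudden huge order is out of line with the pattern:
print(flag_suspicious(history, {"amount": 900.0, "ship_to": "Pune"}))   # True
# A typical order with the usual address is not:
print(flag_suspicious(history, {"amount": 50.0, "ship_to": "Pune"}))    # False
```

At Big Data scale, the same kind of per-customer check would run over streaming transactions rather than an in-memory list, which is why organizations turn to Big Data analytics platforms for it.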
Inventory Control
Regulatory Compliance