
Module-3

Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Statistical concepts: Sampling distributions, resampling,
statistical inference, prediction error.

Introduction to Big Data Platform


Big data refers to data sets that are so voluminous and complex that traditional data-processing
application software is inadequate to deal with them.
Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer,
visualization, querying, updating, information privacy and data sources.

Big data can be described by the following characteristics:

• Volume: The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data or not.
• Variety: The type and nature of the data. This helps people who analyze it to use the resulting insight effectively. Big data draws from text, images, audio and video; plus it completes missing pieces through data fusion.
• Velocity: The speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real time.
• Variability: Inconsistency of the data set can hamper processes to handle and manage it.
• Veracity: The quality of captured data can vary greatly, affecting accurate analysis.

Classification of Data
Data can be classified as
• structured
• semi-structured
• multi-structured
• unstructured.

Structured Data
Structured data conform to data schemas and data models. Structured data are found in
tables (rows and columns). Nearly 15-20% of data is in structured or semi-structured form.
Structured data enables the following:
• data insert, delete, update and append operations
• indexing, which enables faster data retrieval
• scalability, which enables increasing or decreasing capacities and data processing operations such as storing, processing and analytics
• transaction processing, which follows ACID rules (Atomicity, Consistency, Isolation and Durability)
• encryption and decryption for data security.
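As an illustration of these capabilities (an addition to the notes, not part of the original text), the following minimal Python sketch uses the standard sqlite3 module; the table and column names are invented for the example. It shows inserts, an index for faster retrieval, and a transactional (ACID-style) update.

import sqlite3

# In-memory relational table: structured data with a fixed schema.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# Insert (append) rows.
cur.executemany("INSERT INTO sales (region, amount) VALUES (?, ?)",
                [("North", 120.0), ("South", 95.5), ("North", 80.0)])

# Index to speed up retrieval by region.
cur.execute("CREATE INDEX idx_region ON sales (region)")

# Transactional update: either all changes commit or none do.
try:
    cur.execute("UPDATE sales SET amount = amount * 1.1 WHERE region = 'North'")
    conn.commit()
except sqlite3.Error:
    conn.rollback()

# Query the structured data.
for row in cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(row)
conn.close()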

Semi Structured Data


Examples of semi-structured data are XML and JSON documents. Semi-structured data
contain tags or other markers, which separate semantic elements and enforce hierarchies
of records and fields within the data. However, semi-structured data do not conform to
formal data model structures such as the relational database and table models.
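A minimal sketch (added here for illustration; the keys and values are made up) showing how the tags and nesting of a JSON document carry structure without a fixed relational schema, using Python's standard json module:

import json

# A small semi-structured record: keys act as tags, nesting gives hierarchy.
doc = '''
{
  "customer": "C1001",
  "orders": [
    {"id": 1, "items": ["laptop", "mouse"], "total": 850.0},
    {"id": 2, "items": ["monitor"], "total": 230.0}
  ]
}
'''

record = json.loads(doc)
# Navigate the hierarchy without any predefined table schema.
for order in record["orders"]:
    print(record["customer"], order["id"], order["total"])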

Multi Structured Data


• Multi-structured data refers to data consisting of multiple formats of data, viz.
structured, semi-structured and/or unstructured data.
• Multi-structured data sets can have many formats.
• They are found in non-transactional systems.
• For example, streaming data on customer interactions, data from multiple sensors, data at a web or enterprise server, or data-warehouse data in multiple formats.

Unstructured Data
• Data does not possess features such as a table or a database schema.
• Unstructured data are found in file types such as .TXT and .CSV.
• Data may be in the form of key-value pairs, such as hash key-value pairs.
• Data may have internal structures, such as in e-mails.
• The data do not reveal relationships or hierarchies.
• The relationships, schema and features need to be established separately.

Examples of unstructured Data


• Mobile data: Text messages, chat messages, tweets, blogs and comments
• Website content data: YouTube videos, browsing data, e-payments, web store data,
user-generated maps
• Social media data: For exchanging data in various forms
• Texts and documents
• Personal documents and e-mails
• Text internal to an organization: Text within documents, logs, survey results
• Satellite images, atmospheric data, surveillance, traffic videos, images from
Instagram, Flickr (upload, access, organize, edit and share photos from any device
from anywhere in the world).
Challenges of conventional systems in big data
The main challenges that big data poses for conventional systems are:
• Volume of data
• Processing
• Management
• Security

Volume of Data
The volume of data, especially machine-generated data, is exploding, and it is growing rapidly
every year as new sources of data emerge. For example, in the year 2000, 800,000 petabytes (PB)
of data were stored in the world, and this was expected to reach 35 zettabytes (ZB) by 2020
(according to IBM).

It's in the name—big data is big. Most companies are increasing the amount of data they
collect daily. Eventually, the storage capacity a traditional data center can provide will be
inadequate, which worries many business leaders. Forty-three percent of IT decision-makers
in the technology sector worry about this data influx overwhelming their infrastructure[2].

To handle this challenge, companies are migrating their IT infrastructure to the cloud. Cloud
storage solutions can scale dynamically as more storage is needed. Big data software is
designed to store large volumes of data that can be accessed and queried quickly.

Processing
More than 80% of today’s information is unstructured and it is typically too big to manage
effectively. Today, companies are looking to leverage a lot more data from a wider variety of
sources both inside and outside the organization. Things like documents, contracts, machine
data, sensor data, social media, health records, emails, etc. The list is endless really.
Processing big data refers to the reading, transforming, extraction, and formatting of useful
information from raw information. The input and output of information in unified formats
continue to present difficulties.
Management
A lot of this data is unstructured, or has a complex structure that’s hard to represent in rows
and columns. Database sharding, memory caching, moving to the cloud and separating read-
only and write-active databases are all effective scaling methods. While each one of those
approaches is fantastic on its own, combining them will lead you to the next level. Human
decision-making and machine learning require ample and reliable data, but larger datasets are
more likely to contain inaccuracies, incomplete records, errors, and duplicates. Not correcting
quality issues leads to ill-informed decisions and lost revenue.

Before analyzing big data, it must be run through automated cleansing tools that check for
and correct duplicates, anomalies, missing information, and other errors. Setting specific data
quality standards and measuring these benchmarks regularly will also help by highlighting
where data collection and cleansing techniques must change.

Security
Security is one of the most significant risks of big data. Cybercriminals are more likely to
target businesses that store sensitive information, and each data breach can cost time, money,
and reputation. Similarly, privacy laws like the European Union’s General Data Protection
Regulation (GDPR) make collecting vast amounts of data while upholding user privacy
standards difficult. Non-encrypted information is at risk of theft or damage by cyber-
criminals. Therefore, data security professionals must balance access to data against
maintaining strict security protocols.
Web data
Web analytics is the measurement, collection, analysis and reporting of web data for
purposes of understanding and optimizing web usage. However, Web analytics is not just a
process for measuring web traffic but can be used as a tool for business and market research,
and to assess and improve the effectiveness of a website. Web analytics applications can also
help companies measure the results of traditional print or broadcast advertising campaigns

Collection of data: This stage is the collection of the basic, elementary data. Usually, these data are
counts of things. The objective of this stage is to gather the data.

Processing of data into information: This stage usually takes counts and makes them ratios, although
there may still be some counts. The objective of this stage is to take the data and turn it into
information, specifically metrics.

Developing KPI: This stage focuses on using the ratios (and counts) and infusing them with
business strategies, referred to as Key Performance Indicators (KPI). Many times, KPIs deal
with conversion aspects, but not always. It depends on the organization.
Formulating online strategy: This stage is concerned with the online goals, objectives, and
standards for the organization or business. These strategies are usually related to making
money, saving money, or increasing market share. Another essential function developed by
analysts for the optimization of websites is experimentation.

Experiments and testing: A/B testing is a controlled experiment with two variants in online
settings, such as web development. The goal of A/B testing is to identify changes to web pages that
increase or maximize a statistically tested result of interest.
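To make the A/B test concrete, here is a small illustrative Python sketch (not from the original notes; the visitor counts and conversions are invented) that compares the conversion rates of two page variants with a two-proportion z-test:

from math import sqrt
from scipy.stats import norm

# Hypothetical results: variant A vs variant B of a landing page.
visitors_a, conversions_a = 5000, 400   # 8.0% conversion
visitors_b, conversions_b = 5000, 460   # 9.2% conversion

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b

# Pooled proportion under the null hypothesis of no difference.
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))

z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))          # two-sided test

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
# A small p-value (e.g. < 0.05) suggests the change in conversion rate is statistically significant.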

Each stage impacts or can impact (i.e., drives) the stage preceding or following it. So, sometimes the
data that is available for collection impacts the online strategy. Other times, the online strategy affects
the data collected.

Web analytics technologies

There are at least two categories of web analytics; off-site and on-site web analytics.

• Off-site web analytics refers to web measurement and analysis regardless of whether you
own or maintain a website. It includes the measurement of a website's potential audience
(opportunity), share of voice (visibility), and buzz (comments) that is happening on the Internet as a
whole.

• On-site web analytics, the most common, measures a visitor's behavior once on your website. This
includes its drivers and conversions; for example, the degree to which different landing pages are
associated with online purchases. On-site web analytics measures the performance of your website in
a commercial context. This data is typically compared against key performance indicators, and used
to improve a website's or marketing campaign's audience response.

A performance indicator or key performance indicator (KPI) is a type of performance measurement.


KPIs evaluate the success of an organization or of a particular activity (such as projects, programs,
products and other initiatives) in which it engages.
Analytical Scalability

Vertical scalability means scaling up the given system's resources and increasing the
system's analytics, reporting and visualization capabilities. This is an additional way to
solve problems of greater complexity. Scaling up means designing the algorithm according
to an architecture that uses resources efficiently. If processing x terabytes of data takes time t,
and the code size grows with increasing complexity by a factor n, then scaling up means that
processing takes a time equal to, less than, or much less than (n * t).

Horizontal scalability means increasing the number of systems working in coherence and
scaling out the workload. Processing different datasets of a large dataset deploys horizontal
scalability. Scaling out means using more resources and distributing the processing and
storage tasks in parallel. The easiest way to scale up the execution of analytics software is
to implement it on a bigger machine with more CPUs to handle greater volume, velocity,
variety and complexity of data; the software will generally perform better on a bigger machine.
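As a toy illustration of scaling out (an addition, assuming Python's standard multiprocessing module; the chunking scheme and the work function are arbitrary), the sketch below splits a dataset into partitions and processes them in parallel worker processes before combining the partial results:

from multiprocessing import Pool

def partial_sum(chunk):
    # Work done independently on one partition of the data.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Split the large dataset into smaller partitions.
    n_workers = 4
    chunks = [data[i::n_workers] for i in range(n_workers)]

    # Distribute the partitions across worker processes (scaling out),
    # then combine the partial results.
    with Pool(processes=n_workers) as pool:
        total = sum(pool.map(partial_sum, chunks))

    print("Sum of squares:", total)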

Analytic Process
The collection, transformation, and organization of data to draw conclusions, make
predictions for the future, and make informed data-driven decisions is called Data
Analysis. A professional who performs data analysis is called a Data Analyst.
1. Define the Problem or Research Question
2. Collect Data
3. Data Cleaning
4. Analyzing the Data
5. Data Visualization
6. Presenting Data

1. Define the Problem or Research Question


In the first step of the process, the data analyst is given a problem or business task. The analyst has
to understand the task and the stakeholders' expectations for the solution. A stakeholder is a
person who has invested money and resources in a project. The analyst must be able to
ask different questions in order to find the right solution to the problem, and has to find
the root cause of the problem in order to fully understand it. The analyst must make sure
that he/she doesn't have any distractions while analyzing the problem.
Communicate effectively with the stakeholders and other colleagues to completely
understand what the underlying problem is. Questions to ask yourself in the Ask phase
are:
• What are the problems that are being mentioned by my stakeholders?
• What are their expectations for the solutions?

2. Collect Data
The second step is to Prepare or Collect the Data. This step includes collecting data and
storing it for further analysis. The analyst has to collect the data based on the task given
from multiple sources. The data has to be collected from various sources, internal or
external sources. Internal data is the data available in the organization that you work for
while external data is the data available in sources other than your organization. The data
that is collected by an individual from their own resources is called first-party data. The
data that is collected and sold is called second-party data. Data that is collected from
outside sources is called third-party data. The common sources from where the data is
collected are Interviews, Surveys, Feedback, Questionnaires. The collected data can be
stored in a spreadsheet or SQL database. A spreadsheet is a digital worksheet that contains
rows and columns while a database contains tables that have functions to manipulate the
data. Spreadsheets are used to store a few thousand to tens of thousands of records, while databases
are used when there are too many rows to store in a spreadsheet. The best tools to store the data are
MS Excel or Google Sheets in the case of spreadsheets, and there are many databases, such as
Oracle and Microsoft SQL Server, to store the data.

3. Data Cleaning
The third step is Clean and Process Data. After the data is collected from multiple
sources, it is time to clean the data. Clean data means data that is free from misspellings,
redundancies, and irrelevance. Clean data largely depends on data integrity. There might be
duplicate data, or the data might not be in a consistent format; therefore the unnecessary data is
removed and cleaned. There are different functions provided by SQL and Excel to clean the
data. This is one of the most important steps in Data Analysis as clean and formatted data
helps in finding trends and solutions. The most important part of the Process phase is to
check whether your data is biased or not. Bias is an act of favoring a particular
group/community while ignoring the rest. Biasing is a big no-no as it might affect the
overall data analysis. The data analyst must make sure to include every group while the
data is being collected.
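As an illustrative sketch only (assuming the pandas library and an invented file name and columns), the snippet below shows the kind of cleaning this step describes: removing duplicates, standardizing formats, handling missing values, and a quick bias check on group representation:

import pandas as pd

# Hypothetical raw survey data collected in the previous step.
df = pd.read_csv("survey_responses.csv")

# Remove exact duplicate rows.
df = df.drop_duplicates()

# Standardize an inconsistent text column (e.g. 'Male', ' male ', 'MALE').
df["gender"] = df["gender"].str.strip().str.lower()

# Handle missing values: drop rows missing the key field, fill a numeric gap.
df = df.dropna(subset=["customer_id"])
df["age"] = df["age"].fillna(df["age"].median())

# A quick check that the groups of interest are all represented (bias check).
print(df["region"].value_counts())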

4. Analyzing the Data


The fourth step is to Analyze. The cleaned data is used for analyzing and identifying
trends. The analyst also performs calculations and combines data for better results. The tools used for
performing calculations are Excel or SQL. These tools provide in-built functions to perform
calculations or sample code is written in SQL to perform calculations. Using Excel, we can
create pivot tables and perform calculations while SQL creates temporary tables to perform
calculations. Programming languages are another way of solving problems. They make it
much easier to solve problems by providing packages. The most widely used programming
languages for data analysis are R and Python.
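The short Python sketch below (an added illustration using pandas and made-up sales data) mirrors the pivot-table style of analysis mentioned above: grouping cleaned data and computing summary statistics:

import pandas as pd

# Made-up cleaned data from the previous step.
df = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "sales":   [120, 80, 95, 60, 150],
})

# Equivalent of an Excel pivot table: total sales per region and product.
pivot = pd.pivot_table(df, values="sales", index="region",
                       columns="product", aggfunc="sum", fill_value=0)
print(pivot)

# A simple trend-style summary: average sales per region.
print(df.groupby("region")["sales"].mean())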

5. Data Visualization
The fifth step is visualizing the data. Nothing is more compelling than a visualization. The
data now transformed has to be made into a visual (chart, graph). The reason for making
data visualizations is that there might be people, mostly stakeholders that are non-technical.
Visualizations are made for a simple understanding of complex data. Tableau and Looker
are the two popular tools used for compelling data visualizations. Tableau is a simple drag
and drop tool that helps in creating compelling visualizations. Looker is a data viz tool that
directly connects to the database and creates visualizations. Tableau and Looker are both
equally used by data analysts for creating a visualization. R and Python have some
packages that provide beautiful data visualizations. R has a package named ggplot2 which
has a variety of data visualizations. A presentation is given based on the data findings.
Sharing the insights with the team members and stakeholders will help in making better
decisions. It helps in making more informed decisions and it leads to better outcomes.
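As a minimal sketch (added here, assuming matplotlib and the same kind of made-up summary figures), this shows how analyzed data might be turned into a simple chart for non-technical stakeholders:

import matplotlib.pyplot as plt

# Made-up summary produced by the analysis step.
regions = ["North", "South", "East", "West"]
avg_sales = [100, 102, 87, 125]

plt.figure(figsize=(6, 4))
plt.bar(regions, avg_sales, color="steelblue")
plt.title("Average Sales by Region")
plt.xlabel("Region")
plt.ylabel("Average sales")
plt.tight_layout()
plt.savefig("avg_sales_by_region.png")  # or plt.show() for interactive use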

6. Presenting the Data


Presenting the data involves transforming raw information into a format that is easily
comprehensible and meaningful for various stakeholders. This process encompasses the
creation of visual representations, such as charts, graphs, and tables, to effectively
communicate patterns, trends, and insights gleaned from the data analysis. The goal is to
facilitate a clear understanding of complex information, making it accessible to both
technical and non-technical audiences. Effective data presentation involves thoughtful
selection of visualization techniques based on the nature of the data and the specific
message intended. It goes beyond mere display to storytelling, where the presenter
interprets the findings, emphasizes key points, and guides the audience through the
narrative that the data unfolds. Whether through reports, presentations, or interactive
dashboards, the art of presenting data involves balancing simplicity with depth, ensuring
that the audience can easily grasp the significance of the information presented and use it
for informed decision-making.

Data Analytics Tools

1) Microsoft Excel
Probably not the first thing that comes to mind, but Excel is one of the most widely used
analytics tools in the world given its massive installed base. You won’t use it for advanced
analytics to be sure, but Excel is a great way to start learning the basics of analytics not to
mention a useful tool for basic grunt work. It supports all the important features like
summarizing data, visualizing data, and basic data manipulation. It has a huge user
community with plenty of support, tutorials and free resources.
2) IBM Cognos Analytics (https://www.ibm.com/products/cognos-analytics)
IBM’s Cognos Analytics is an upgrade to Cognos Business Intelligence (Cognos BI). Cognos
Analytics has a Web-based interface and offers data visualization features not found in the BI
product. It provides self-service analytics with enterprise security, data governance and
management features. Data can be sourced from multiple sources to create visualizations and
reports.

3) The R language
R has been around for more than 20 years as a free and open source project, making it quite
popular, and it was designed to do one thing: analytics. There are numerous add-on packages
and Microsoft supports it as part of its Big Data efforts. Extra packages include Big Data
support, connecting to external databases, visualizing data, mapping data geographically and
performing advanced statistical functions. On the down side, R has been criticized for being
single threaded in an era where parallel processing is imperative.

4) Sage Live
Sage Live is a cloud-based accounting platform for small and mid-sized businesses, with
features like the ability to create and send invoices, accept payments, pay bills, record receipts
and record sales, all from within a mobile-capable platform. It supports multiple companies,
currencies and banks and integrates with Salesforce CRM for no additional charge.

5) Sisense
Sisense’s self-titled product is a BI solution that provides advanced analytical tools for
analysis, visualization and reporting. Sisense allows businesses to merge data from many
sources and merge it into a single database where it does the analysis. It can be deployed on-
premises or hosted in the cloud as a SaaS application.

6) Chart.io
Chart.io is a drag and drop chart creation tool that works on a tablet or laptop to build
connections to databases, ranging from MySQL to Oracle, and then creates scripts for data
analysis. Data can be blended from multiple sources with a single click before executing
analysis. It makes a variety of charts, such as bar graphs, pie charts, scatter plots, and more.

7) SAP BusinessObjects (https://www.sap.com/products/bi-platform.html)
SAP's BusinessObjects provides a set of centralized tools to perform a wide
variety of BI and analytics, from ETL to data cleansing to predictive dashboards and reports.
It’s modular so customers can start small with just the functions they need and grow the app
with their business. It supports everything from SMBs to large enterprises and can be
configured for a number of vertical industries. It also supports Microsoft Office and
Salesforce SaaS.

8) Netlink Business Analytics
Netlink's Business Analytics platform is a comprehensive on-demand
solution, meaning no Capex investment. It can be accessed via a Web browser from any
device and scale from a department to a full enterprise. Dashboards can be shared among
teams via the collaboration features. The features are geared toward sales, with advanced
analytic capabilities around sales & inventory forecasting, voice and text analytics, fraud
detection, buying propensity, sentiment, and customer churn analysis.

9) Domo
Domo is another cloud-based business management suite that is browser-accessible and scales
from a small business to a giant enterprise. It provides analysis on all business-level activity,
like top selling products, forecasting, marketing return on investment and cash balances. It
offers interactive visualization tools and instant access to company-wide data via customized
dashboards.

10) InetSoft Style Intelligence


Style Intelligence is a business intelligence software platform that allows users to create
dashboards, visual analyses and reports via a data engine that integrates data from multiple
sources such as OLAP servers, ERP apps, relational databases and more. InetSoft’s
proprietary Data Block technology enables the data mashups to take place in real time. Data
and reports can be accessed via dashboards, enterprise reports, scorecards and exception
alerts.

11) Dataiku
Dataiku develops Dataiku Data Science Studio (DSS), a data analysis and collaboration
platform that helps data analysts work together with data scientists to build more meaningful
data applications. It helps prototype and build data-driven models and extract data from a
variety of sources, from databases to Big Data repositories.

12) Python
Python is already a popular language because it’s powerful and easy to learn. Over the years,
analytics features have been added, making it increasingly popular with developers looking to
do analytics apps but wanting more power than the R language. R is built for one thing,
statistical analysis, but Python can do analytics plus many other functions and types of apps,
including machine learning.

13) Apache Spark


Spark is a Big Data analytics engine designed to run in-memory. Early Big Data systems like Hadoop
were batch processes that ran during low utilization (at night) and were disk-based. Spark is
meant to run in real time and entirely in memory, thus allowing for much faster real-time
analytics. Spark has easy integration with the Hadoop ecosystem and its own machine
learning library. And it’s open source, which means it’s free.

14) SAS Institute
SAS is a long-time BI vendor, so its move into analytics was only natural, and its tools continue
to be widely used in the industry. Two of its major apps are SAS Enterprise Miner and SAS Visual
Analytics. Enterprise Miner is good for core statistical analysis, data analytics and machine
learning. It's mature and has been around a while, with a lot of macros and code for specific
uses. Visual Analytics is newer and designed to run in distributed memory on top of Hadoop.

15) Tableau
Tableau is a data visualization software package and one of the most popular on the market.
It’s a fast visualization software which lets you explore data and make all kinds of analysis
and observations by drag and drop interfaces. Its intelligent algorithms figure out the type of
data and the best method available to process it. You can easily build dashboards with the
GUI and connect to a host of analytical apps, including R.

16) Splunk
Splunk Enterprise started out as a log-analysis tool, but has grown to become a broad based
platform for searching, monitoring, and analyzing machine-generated Big Data. The software
can import data from a variety of sources, from logs to data collected by Big Data
applications such as Hadoop or sensors. It then generates reports a non-IT business person can
easily read and understand.
Analysis versus reporting

Living in the era of digital technology and big data has made organizations dependent on the wealth
of information data can bring. You might have seen how reporting and analysis are used
interchangeably, especially in the manner in which outsourcing companies market their services. While
both areas are part of web analytics (note that analytics is not the same as analysis), there is a vast
difference between them, and it's more than just spelling. It's important that we differentiate the two,
because some organizations might be selling themselves short in one area and not reap the benefits
which web analytics can bring to the table. The first core component of web analytics, reporting, is
merely organizing data into summaries. On the other hand, analysis is the process of inspecting,
cleaning, transforming, and modeling these summaries (reports) with the goal of highlighting useful
information. Simply put, reporting translates data into information, while analysis turns information
into insights. Also, reporting should enable users to ask "What?" questions about the information,
whereas analysis should answer "Why?" and "What can we do about it?"

Here are five differences between reporting and analysis:

1. Purpose
Reporting has helped companies monitor their data since even before digital technology boomed.
Various organizations have depended on the information it brings to their business, as
reporting extracts that information and makes it easier to understand. Analysis interprets data at a deeper
level. While reporting can link data across channels, provide comparisons, and make
information easier to understand (think of a dashboard, charts, and graphs, which are reporting
tools and not analysis reports), analysis interprets this information and provides
recommendations on actions.

2. Tasks
As reporting and analysis have a very fine line dividing them, sometimes it’s easy to confuse
tasks that have analysis labeled on top of them when all it does is reporting. Hence, ensure
that your analytics team has a healthy balance doing both. Here’s a great differentiator to keep
in mind if what you’re doing is reporting or analysis: Reporting includes building,
configuring, consolidating, organizing, formatting, and summarizing. It's very similar to the
examples mentioned above, like turning data into charts and graphs and linking data across multiple
channels. Analysis consists of questioning, examining, interpreting, comparing, and
confirming. With big data, predicting is possible as well.
3. Outputs
Reporting and analysis have a push and pull effect on their users through their outputs.
Reporting has a push approach, as it pushes information to users, and outputs come in the
forms of canned reports, dashboards, and alerts. Analysis has a pull approach, where a data
analyst draws information to probe further and to answer business questions. Outputs from
analysis can be in the form of ad hoc responses and analysis presentations. Analysis presentations
are comprised of insights, recommended actions, and a forecast of their impact on the
company, all in a language that's easy to understand at the level of the user who'll be
reading and deciding on it. This is important for organizations to truly realize the value of
data, such that a standard report is not the same as a meaningful analysis.

4. Delivery
Considering that reporting involves repetitive tasks, often with truckloads of data,
automation has been a lifesaver, especially now with big data. It's not surprising that the first
things outsourced are data entry services, since outsourcing companies are perceived as data
reporting experts. Analysis requires a more custom approach, with human minds doing
superior reasoning and analytical thinking to extract insights, and technical skills to provide
efficient steps towards accomplishing a specific goal. This is why data analysts and scientists
are in demand these days, as organizations depend on them to come up with recommendations
that help leaders or business executives make decisions about their businesses.

5. Value
This isn’t about identifying which one brings more value, rather understanding that both are
indispensable when looking at the big picture. It should help businesses grow, expand, move
forward, and make more profit or increase their value. This Path to Value diagram illustrates
how data converts into value by reporting and analysis such that it’s not achievable without
the other.

Data — Reporting — Analysis — Decision-making — Action — VALUE

Data alone is useless, and action without data is baseless. Both reporting and analysis are vital to
bringing value to your data and operations.
Modern Analytical Tools
1. Tableau Public
What is Tableau Public?
It is a simple and intuitive tool which offers intriguing insights through data
visualization. Tableau Public has a million-row limit and is easy to use, and it fares better
than most of the other players in the Data Analytics market. With Tableau's visuals,
you can investigate a hypothesis, explore the data, and cross-check your insights.

Uses of Tableau Public


• You can publish interactive data visualizations to the web for free.
• No programming skills required.
• Visualizations published to Tableau Public can be embedded into blogs and web
pages and be shared through email or social media. The shared content can be made
available for download.

Limitations of Tableau Public


• All data is public and offers very little scope for restricted access.
• Data size limitation.
• Cannot be connected to R.
• The only ways to read data in are via OData sources, Excel or .txt files.

2. OpenRefine
What is OpenRefine ?
Formerly known as Google Refine, OpenRefine is data cleaning software that helps you clean up
data for analysis. It operates on rows of data which have cells under columns, quite
similar to relational database tables.

Uses of OpenRefine
• Cleaning messy data.
• Transformation of data.
• Parsing data from websites.
• Adding data to data set by fetching it from web services. For instance, OpenRefine
could be used for geocoding addresses to geographic coordinates.
Limitations of OpenRefine
• Open Refine is unsuitable for large datasets.
• Refine does not work very well with Big Data.

3. KNIME
What is KNIME?
KNIME helps you to manipulate, analyze, and model data through visual
programming. It is used to integrate various components for data mining and Machine
Learning via its modular data pipelining concept.

Uses of KNIME
• Rather than writing blocks of code, you just have to drop and drag connection points
between activities.
• This data analysis tool supports multiple programming languages. In fact, analysis tools like
these can be extended to handle chemistry data, text mining, Python, and R.

Limitation of KNIME
• Poor data visualization.

4. RapidMiner
What is RapidMiner?
RapidMiner provides Machine Learning procedures and Data Mining including Data
Visualization, processing, statistical modeling, deployment, evaluation, and predictive
analytics. RapidMiner, written in Java, is fast gaining acceptance as a Big Data
Analytics tool.

Uses of RapidMiner
• It provides an integrated environment for business analytics, predictive
analysis, text mining, Data Mining, and Machine Learning.
• Along with commercial and business applications, RapidMiner is also used for
application development, rapid prototyping, training, education, and research.
Limitations of RapidMiner
• RapidMiner has size constraints with respect to the number of rows.
• For RapidMiner, you need more hardware resources than ODM and SAS.

5. Google Fusion Tables


What is Google Fusion Tables?
When talking about Data Analytics tools for free, here comes a much cooler, larger,
and nerdier version of Google Spreadsheets. An incredible tool for data analysis,
mapping, and large dataset visualization, Google Fusion Tables can be added to
business analytics tools list.

Uses of Google Fusion Tables


• Visualize bigger table data online.
• Filter and summarize across hundreds of thousands of rows.
• Combine tables with other data on web.

You can merge two or three tables to generate a single visualization that includes sets
of data. With Google Fusion Tables, you can combine public data with your own for a
better visualization. You can create a map in minutes!

Limitations of Google Fusion Tables


• Only the first 100,000 rows of data in a table are included in query results or mapped.
• The total size of the data sent in one API call cannot be more than 1 MB.

6. NodeXL
What is NodeXL?
It is visualization and analysis software for relationships and networks. NodeXL
provides exact calculations. It is a free (not the pro one) and open-source network
analysis and visualization software. NodeXL is one of the best statistical tools for data
analytics which includes advanced network metrics, access to social media network
data importers, and automation.
Uses of NodeXL
This is one of the data analysis tools in Excel that helps in the following areas:
• Data Import
• Graph Visualization
• Graph Analysis
• Data Representation
• This software integrates into Microsoft Excel 2007, 2010, 2013, and 2016. It opens
as a workbook with a variety of worksheets containing the elements of a graph
structure like nodes and edges.
• This software can import various graph formats like adjacency matrices, Pajek .net,
UCINet .dl, GraphML, and edge lists.

Limitations of NodeXL
• You need to use multiple seeding terms for a particular problem.
• Data extractions may need to be run at slightly different times.

7. Wolfram Alpha
What is Wolfram Alpha?
It is a computational knowledge engine or answering engine founded by Stephen
Wolfram. With Wolfram Alpha, you get answers to factual queries directly by
computing the answer from externally sourced ‘curated data’ instead of providing a
list of documents or web pages.

Uses of Wolfram Alpha


• Is an add-on for Apple’s Siri.
• Provides detailed responses to technical searches and solves calculus problems.
• Helps business users with information charts and graphs, and helps in creating topic
overviews, commodity information, and high-level pricing history.

Limitations of Wolfram Alpha


• Wolfram Alpha can only deal with publicly known numbers and facts, not with viewpoints.
• It limits the computation time for each query.
8. Google Search Operators
What are Google Search Operators?
They are powerful operators which help you filter Google results instantly to get the most
relevant and useful information.

Uses of Google Search Operators


• Faster filtering of Google search results.
• Google’s powerful data analysis tool can help discover new information or market
research.

9. Solver
What is Excel Solver?
The Solver Add-in is a Microsoft Office Excel add-in program that is available when
you install Microsoft Excel or Office. It is a linear programming and optimization tool
in excel. This allows you to set constraints. It is an advanced optimization tool that
helps in quick problem solving.

Uses of Solver
• The final values found by Solver are a solution to an interrelated set of decision variables.
• It uses a variety of methods, from nonlinear optimization and linear
programming to evolutionary and genetic algorithms, to find solutions.

Limitations of Solver
• Poor scaling is one of the areas where Excel Solver lacks.
• Poor scaling can affect solution time and quality.
• Poor scaling can also affect the intrinsic solvability of your model.
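As an analogous sketch in Python rather than Excel (an addition; the objective and constraints are invented), the snippet below solves a small linear program of the kind Solver handles, using scipy.optimize.linprog:

from scipy.optimize import linprog

# Maximize profit 3x + 5y  ->  minimize -(3x + 5y)
c = [-3, -5]

# Constraints: x + 2y <= 14,  3x >= y (i.e. -3x + y <= 0),  x - y <= 2
A_ub = [[1, 2], [-3, 1], [1, -1]]
b_ub = [14, 0, 2]

res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, None), (0, None)], method="highs")
print("Optimal x, y:", res.x)
print("Maximum objective value:", -res.fun)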

10. Dataiku DSS


What is Dataiku DSS?
This is a collaborative data science software platform that helps team build, prototype,
explore, and deliver their own data products more efficiently.

Uses of Dataiku DSS


• It provides an interactive visual interface where users can point, click, and build, or use languages like SQL.
• This data analytics tool lets you draft data preparation and modeling in seconds.
• It helps you coordinate development and operations by handling workflow automation, creating predictive web services, and monitoring model health and data daily.

Limitation of Dataiku DSS


• Limited visualization capabilities
• UI hurdles: Reloading of code/datasets
• Inability to easily compile entire code into a single document/notebook
• Still needs to be integrated with Spark

Statistical Concepts

Sampling distribution
In statistics, a sampling distribution or finite-sample distribution is the probability
distribution of a given statistic based on a random sample. Sampling distributions are
important in statistics because they provide a major simplification en route to
statistical inference.

For example, consider a normal population with mean μ and variance σ². Assume we repeatedly
take samples of a given size n from this population and calculate the arithmetic mean x̄ for each
sample; this statistic is called the sample mean.

The distribution of these means, or averages, is called the "sampling distribution of the sample
mean". This distribution is normal, N(μ, σ²/n), where n is the sample size, since the underlying
population is normal, although sampling distributions may also often be close to normal even when
the population distribution is not (see the central limit theorem). An alternative to the sample mean
is the sample median. When calculated from the same population, it has a different sampling
distribution to that of the mean and is generally not normal (but it may be close for large sample
sizes).

Standard Deviation
The standard deviation of the sampling distribution of a statistic is referred to as the
standard error of that quantity. For the case where the statistic is the sample mean, and
samples are uncorrelated, the standard error is:

σ_x̄ = σ / √n

where σ is the standard deviation of the population distribution of that quantity and n is the
sample size.
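To make this concrete, here is a small simulation sketch in Python (an added illustration using numpy, with arbitrary values μ = 10, σ = 2 and n = 25): it draws many samples, computes their means, and checks that the spread of those means is close to σ/√n.

import numpy as np

rng = np.random.default_rng(42)

mu, sigma, n = 10.0, 2.0, 25       # population parameters and sample size
n_samples = 10_000                 # number of repeated samples

# Draw many samples and compute the mean of each one.
samples = rng.normal(mu, sigma, size=(n_samples, n))
sample_means = samples.mean(axis=1)

print("Mean of sample means:", sample_means.mean())        # close to mu
print("Std of sample means: ", sample_means.std(ddof=1))   # close to sigma / sqrt(n)
print("Theoretical standard error:", sigma / np.sqrt(n))   # = 0.4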

Resampling
In statistics, resampling is any of a variety of methods for doing one of the following:
• Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping)
• Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests)
• Validating models by using random subsets (bootstrapping, cross-validation)

Bootstrapping is a statistical method for estimating the sampling distribution of an
estimator by sampling with replacement from the original sample, most often with the
purpose of deriving robust estimates of standard errors and confidence intervals of a
population parameter like a mean, median, proportion, odds ratio, correlation
coefficient or regression coefficient. It may also be used for constructing hypothesis
tests. It is often used as a robust alternative to inference based on parametric
assumptions when those assumptions are in doubt, or where parametric inference is
impossible or requires very complicated formulas for the calculation of standard
errors. Bootstrapping techniques are also used in the updating-selection transitions of
particle filters, genetic-type algorithms and related resample/reconfiguration Monte
Carlo methods used in computational physics and molecular chemistry. In this context,
the bootstrap is used to replace sequentially empirical weighted probability measures
by empirical measures. The bootstrap allows one to replace the samples with low weights
by copies of the samples with high weights.
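A minimal bootstrap sketch in Python (added for illustration; the data values are invented), estimating a 95% confidence interval for the mean by resampling with replacement:

import numpy as np

rng = np.random.default_rng(0)

# Original sample (hypothetical measurements).
data = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4, 5.8, 4.7])

n_boot = 10_000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Resample with replacement, same size as the original sample.
    resample = rng.choice(data, size=data.size, replace=True)
    boot_means[i] = resample.mean()

# Percentile bootstrap confidence interval for the population mean.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"Sample mean: {data.mean():.2f}")
print(f"95% bootstrap CI: ({lower:.2f}, {upper:.2f})")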

Statistical inference
Statistical inference is the process of using data analysis to deduce properties of an
underlying probability distribution. Inferential statistical analysis infers properties of a
population, for example by testing hypotheses and deriving estimates.

Statistical inference makes propositions about a population, using data drawn from
the population with some form of sampling. Given a hypothesis about a population, for
which we wish to draw inferences, statistical inference consists of (first) selecting a
statistical model of the process that generates the data and (second) deducing propositions
from the model. The conclusion of such an inference may be:
• a point estimate, i.e. a particular value that best approximates some parameter of interest;
• an interval estimate, e.g. a confidence interval (or set estimate), i.e. an interval constructed using a dataset drawn from a population so that, under repeated sampling of such datasets, such intervals would contain the true parameter value with the probability at the stated confidence level;
• a credible interval, i.e. a set of values containing, for example, 95% of posterior belief;
• rejection of a hypothesis;
• clustering or classification of data points into groups.
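As a short illustrative sketch (added; made-up data, assuming scipy), the snippet below performs two common inferences on one sample: a 95% confidence interval for the mean and a one-sample t-test of the hypothesis that the population mean equals 5:

import numpy as np
from scipy import stats

data = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4, 5.8, 4.7])

# Point estimate of the population mean.
mean = data.mean()

# 95% confidence interval for the mean (t distribution, unknown variance).
sem = stats.sem(data)                      # standard error of the mean
ci = stats.t.interval(0.95, data.size - 1, loc=mean, scale=sem)

# Hypothesis test: is the population mean equal to 5?
t_stat, p_value = stats.ttest_1samp(data, popmean=5.0)

print(f"Point estimate: {mean:.2f}")
print(f"95% confidence interval: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"t = {t_stat:.2f}, p-value = {p_value:.3f}")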

Prediction error
In statistics, the mean squared prediction error (MSPE) of a smoothing or curve-fitting
procedure is the expected value of the squared difference between the fitted values implied
by the predictive function ĝ and the values of the (unobservable) true function g:

MSPE(ĝ) = E[(g(x) − ĝ(x))²]
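A small Python sketch (added; synthetic data and a simple straight-line fit, using numpy) estimating the mean squared prediction error of a fitted curve on held-out points:

import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: a true underlying line plus noise.
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1.5, size=x.size)

# Randomly split into training and held-out (test) points.
idx = rng.permutation(x.size)
train_idx, test_idx = idx[:70], idx[70:]

# Fit a straight line to the training data.
coeffs = np.polyfit(x[train_idx], y[train_idx], deg=1)
y_pred = np.polyval(coeffs, x[test_idx])

# Estimated mean squared prediction error on the held-out data.
mspe = np.mean((y[test_idx] - y_pred) ** 2)
print(f"Fitted line: y = {coeffs[0]:.2f}x + {coeffs[1]:.2f}")
print(f"Estimated MSPE: {mspe:.2f}")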
