Big Data Analytics Notes
Big Data: -
Data refers to the quantities, characters, or symbols on which operations are
performed by a computer, which may be stored and transmitted in the form of
electrical signals and recorded on magnetic, optical, or mechanical recording
media.
Big data refers to a collection of data sets so large and complex that it becomes
difficult to process using on-hand database management tools or traditional data
processing applications.
Diagnostic Analytics
As the name suggests, diagnostic analytics diagnoses a problem. It gives
detailed, in-depth insight into the root cause of the problem.
Predictive Analytics
As the name suggests, predictive analytics is concerned with predicting future
incidents. These incidents can be market trends, consumer trends, and similar
market-related events.
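As a minimal sketch of predicting a trend, the snippet below fits a straight line to past sales and extrapolates one month ahead. The sales figures and the names fit_line and forecast_month_5 are hypothetical, and a least-squares line is only one simple technique among many that predictive analytics may use.

```python
# Hypothetical monthly sales figures (month number -> units sold).
months = [1, 2, 3, 4]
sales = [10, 20, 30, 40]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(months, sales)

# Extrapolate the fitted trend to the next, unseen month.
forecast_month_5 = slope * 5 + intercept
```

Real forecasts would normally be checked against held-out data before anyone relies on them.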
Prescriptive Analytics
Prescriptive analytics combines data with various business rules. The data can
be both internal (organizational inputs) and external (social media insights).
Prescriptive analytics allows businesses to determine the best possible
solution to a problem. When combined with predictive analytics, it adds the
benefit of influencing a future occurrence, for example mitigating a future risk.
4.0 Data Warehouse: -
A data warehouse is a type of data management system that is designed to
enable and support business intelligence (BI) activities, especially analytics.
Data warehouses are solely intended to perform queries and analysis and often
contain large amounts of historical data. The data within a data warehouse is
usually derived from a wide range of sources such as application log files and
transaction applications.
A data mart usually draws data from only a few sources compared to a data
warehouse. Data marts are small in size and are more flexible than a data
warehouse.
Examples –:
Spotify analyses the songs its users play to build a personalized homepage of
songs and playlists. The Netflix movie recommendation system is another example.
Examples –:
An ATM centre is an OLTP application.
OLTP maintains the ACID properties during data transactions via the application.
It is also used for online banking, online airline ticket booking, sending a
text message, and adding a book to a shopping cart.
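The atomicity part of ACID can be sketched with Python's built-in sqlite3 module. The accounts table, the balances, and the helper names transfer and balances are hypothetical; the point is only that a transfer either applies both updates or neither.

```python
import sqlite3

# Hypothetical in-memory bank database used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move money between two accounts as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so neither update is kept.
        return False


def balances(conn):
    """Return the current balance of every account as a dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "alice", "bob", 30)         # fits within alice's balance
rejected = transfer(conn, "alice", "bob", 500)  # would overdraw alice
```

After the rejected transfer, bob's credit is rolled back together with alice's failed debit, so the balances stay as they were after the first transfer.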
A data warehouse extracts information from multiple data sources and formats,
such as text files, Excel sheets, and multimedia files.
The extracted data is cleaned and transformed. The data is then loaded into an
OLAP server (or OLAP cube), where information is pre-calculated for further
analysis.
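The extract, clean and transform, and load steps can be sketched in a few lines of Python. The record layout, the cleaning rules, and the function names below are assumptions made for illustration; a real OLAP server stores and indexes its cube very differently.

```python
from collections import defaultdict

# Hypothetical raw records as they might arrive from different sources.
raw_rows = [
    {"quarter": "Q1", "city": " delhi ", "item": "Phone", "sales": "100"},
    {"quarter": "Q1", "city": "Delhi", "item": "Phone", "sales": "50"},
    {"quarter": "Q1", "city": "Mumbai", "item": "Laptop", "sales": "200"},
    {"quarter": "Q2", "city": "Delhi", "item": "Laptop", "sales": None},
]


def extract_transform(rows):
    """Clean the rows: skip missing sales, normalise city names, cast numbers."""
    for row in rows:
        if row["sales"] is None:
            continue
        city = row["city"].strip().title()
        yield row["quarter"], city, row["item"], int(row["sales"])


def load_cube(clean_rows):
    """Pre-calculate total sales for every (quarter, city, item) cell."""
    cube = defaultdict(int)
    for quarter, city, item, sales in clean_rows:
        cube[(quarter, city, item)] += sales
    return dict(cube)


cube = load_cube(extract_transform(raw_rows))
```

Because the totals are computed at load time, a later query for a cell is a dictionary lookup rather than a scan of the raw records.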
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
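1) Roll-up:
Roll-up is the opposite of drill-down: data is aggregated, either by moving up
the concept hierarchy or by reducing a dimension.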
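A roll-up can be sketched as summing fine-grained cells into coarser ones. The cube contents, the month-to-quarter hierarchy, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical concept hierarchy for the time dimension: month -> quarter.
MONTH_TO_QUARTER = {
    "January": "Q1",
    "February": "Q1",
    "March": "Q1",
    "April": "Q2",
}

# Hypothetical sales cube keyed by (month, city).
monthly_sales = {
    ("January", "Delhi"): 10,
    ("February", "Delhi"): 20,
    ("March", "Delhi"): 30,
    ("April", "Delhi"): 40,
    ("January", "Mumbai"): 5,
}


def roll_up_to_quarter(cube):
    """Climb the time hierarchy: sum monthly cells into quarterly cells."""
    rolled = defaultdict(int)
    for (month, city), sales in cube.items():
        rolled[(MONTH_TO_QUARTER[month], city)] += sales
    return dict(rolled)


def roll_up_drop_city(cube):
    """Reduce a dimension: sum over every city, keeping only the month."""
    rolled = defaultdict(int)
    for (month, _city), sales in cube.items():
        rolled[month] += sales
    return dict(rolled)


quarterly = roll_up_to_quarter(monthly_sales)
by_month = roll_up_drop_city(monthly_sales)
```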
2) Drill-down:
In drill-down, data is broken into smaller, more detailed parts. It is the
opposite of the roll-up process. It can be done by
1. Moving down the concept hierarchy
2. Increasing a dimension
For example:
1. Quarter Q1 is drilled down to the months January, February, and March.
The corresponding sales are also recorded.
2. In this example, the dimension Month is added.
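The Q1 example can be sketched as follows. The sales numbers and the names QUARTER_TO_MONTHS, quarter_total, and drill_down are hypothetical; note that drill-down is only possible when the finer-grained data has been stored.

```python
# Hypothetical detail records stored at the month level.
monthly_sales = {"January": 120, "February": 90, "March": 150, "April": 80}

# Hypothetical concept hierarchy for the time dimension: quarter -> months.
QUARTER_TO_MONTHS = {
    "Q1": ["January", "February", "March"],
    "Q2": ["April", "May", "June"],
}


def quarter_total(quarter):
    """The summarised view: a single sales figure for the whole quarter."""
    return sum(monthly_sales.get(m, 0) for m in QUARTER_TO_MONTHS[quarter])


def drill_down(quarter):
    """Move down the hierarchy: replace the quarter cell with its month cells."""
    return {
        month: monthly_sales[month]
        for month in QUARTER_TO_MONTHS[quarter]
        if month in monthly_sales
    }


q1_total = quarter_total("Q1")
q1_detail = drill_down("Q1")
```

The month-level figures always add back up to the quarter total, which is what makes drill-down and roll-up inverse views of the same data.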
3) Slice:
Here, one dimension is selected, and a new sub-cube is created.
The following diagram explains how the slice operation is performed:
1. The dimension Time is sliced with Q1 as the filter.
2. A new cube is created altogether.
Dice:
This operation is similar to a slice. The difference is that in dice you select
two or more dimensions, which results in the creation of a sub-cube.
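Slice and dice can be compared side by side on a small cube. The cube contents and the helper names slice_cube and dice are hypothetical. In this sketch a slice fixes one dimension to a single value, while a dice filters on two or more dimensions at once.

```python
# Hypothetical sales cube keyed by (quarter, city, item).
DIMENSIONS = ("quarter", "city", "item")
cube = {
    ("Q1", "Delhi", "Phone"): 100,
    ("Q1", "Mumbai", "Laptop"): 200,
    ("Q2", "Delhi", "Phone"): 150,
    ("Q2", "Mumbai", "Phone"): 75,
}


def slice_cube(cube, dimension, value):
    """Fix one dimension to a single value and drop it from the key."""
    pos = DIMENSIONS.index(dimension)
    return {
        key[:pos] + key[pos + 1:]: sales
        for key, sales in cube.items()
        if key[pos] == value
    }


def dice(cube, **allowed):
    """Keep cells whose value on each named dimension is in the allowed list."""
    positions = {
        DIMENSIONS.index(name): set(values) for name, values in allowed.items()
    }
    return {
        key: sales
        for key, sales in cube.items()
        if all(key[pos] in values for pos, values in positions.items())
    }


q1_slice = slice_cube(cube, "quarter", "Q1")
diced = dice(cube, quarter=["Q1", "Q2"], city=["Delhi"])
```

The slice result has one dimension fewer than the original cube, whereas the diced sub-cube keeps all three dimensions but fewer cells.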
4) Pivot:
In pivot, you rotate the data axes to provide an alternative presentation of
the data.
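For a two-dimensional view, rotating the axes amounts to swapping rows and columns. The table below and the pivot helper are hypothetical; the values themselves do not change, only their arrangement.

```python
# Hypothetical 2-D view of sales: rows are cities, columns are quarters.
header = ["city", "Q1", "Q2"]
rows = [
    ["Delhi", 100, 150],
    ["Mumbai", 200, 75],
]


def pivot(header, rows, new_label):
    """Swap the row and column axes of a table; the values are unchanged."""
    table = [header] + rows
    rotated = [list(column) for column in zip(*table)]
    rotated[0][0] = new_label  # the corner cell now names the new row axis
    return rotated[0], rotated[1:]


new_header, new_rows = pivot(header, rows, "quarter")
```

After the pivot, each row describes a quarter and each column a city.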
Parametric Test: -
If the mean more accurately represents the centre of the distribution of the data,
and the sample size is large enough, use a parametric test.
If the median more accurately represents the centre of the distribution of the
data, use a nonparametric test even if you have a large sample size.
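The difference between the two cases can be seen by comparing the mean and the median of a sample. The two samples below are invented for illustration, and the helper name centre is hypothetical.

```python
import statistics

# Hypothetical samples: one roughly symmetric, one with an extreme value.
symmetric = [48, 49, 50, 50, 51, 52]
skewed = [20, 22, 25, 27, 30, 400]


def centre(sample):
    """Return (mean, median) so the two measures can be compared."""
    return statistics.mean(sample), statistics.median(sample)


sym_mean, sym_median = centre(symmetric)
skew_mean, skew_median = centre(skewed)
```

For the symmetric sample the mean and the median agree, so the mean describes the centre well. For the skewed sample the single extreme value pulls the mean far above most observations, while the median stays near the bulk of the data, which is the situation where the notes recommend a nonparametric test.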
RDBMS: -
An RDBMS is a database management system (DBMS) that incorporates the
relational data model, normally including a Structured Query Language (SQL)
application programming interface. It is a DBMS in which the database is organized and
accessed according to the relationships between data items. In a relational
database, relationships between data items are expressed by means of tables.
Interdependencies among these tables are expressed by data values rather than
by pointers. This allows a high degree of data independence.
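A small sketch with Python's built-in sqlite3 module shows relationships expressed through data values. The customers and orders tables and their contents are hypothetical.

```python
import sqlite3

# Hypothetical in-memory database with two related tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        item TEXT NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO orders VALUES (10, 1, 'Book'), (11, 1, 'Pen'), (12, 2, 'Lamp');
    """
)

# The join matches rows by comparing values
# (orders.customer_id = customers.id), not by following stored pointers.
rows = conn.execute(
    """
    SELECT customers.name, orders.item
    FROM orders
    JOIN customers ON orders.customer_id = customers.id
    ORDER BY orders.id
    """
).fetchall()
```

Because an order refers to its customer only through the value in customer_id, the physical storage of either table can change without affecting the query.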