Big Data Analytics Notes


1.0 Big Data: -
Data refers to the quantities, characters, or symbols on which operations are
performed by a computer, which may be stored and transmitted in the form of
electrical signals and recorded on magnetic, optical, or mechanical recording
media.

Big data refers to a collection of data sets so large and complex that it becomes
difficult to process using on-hand database management tools or traditional data
processing applications.

2.0 Types of Analytics: -


Descriptive Analytics
Provides insight into what has occurred in the past and highlights trends that
can be examined in more detail. This helps in creating reports on a company’s
revenue, profits, sales, and so on.

Diagnostic Analytics
As the name suggests, diagnostic analytics diagnoses a problem. It gives
detailed, in-depth insight into the root cause of a problem.

Predictive Analytics
As the name suggests, predictive analytics is concerned with predicting future
incidents. These future incidents can be market trends, consumer trends, and
many other market-related events.

Prescriptive Analytics
Combines data with various business rules. The data used in prescriptive
analytics can be both internal (organizational inputs) and external (social media
insights). Prescriptive analytics allows businesses to determine the best possible
solution to a problem. When combined with predictive analytics, it adds the
benefit of influencing a future occurrence, for example by mitigating future risk.

Big Data Characteristics: -

The three Vs of Big Data are Velocity, Volume, and Variety.


3.0 STRUCTURED & UNSTRUCTURED DATA: -
Structured data is normally found in traditional databases (SQL or others) where
data is organized into tables based on defined business rules. Structured data
usually proves to be the easiest type of data to work with, simply because the
data is defined and indexed, making access and filtering easier.

Ex: - Employee database

Unstructured data, in contrast, normally has no BI behind it. Unstructured
data is not organized into tables and cannot be natively used by applications or
interpreted by a database. A good example of unstructured data would be a
collection of binary image files.

Ex: - Search results returned by Google
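The contrast can be sketched in a few lines of Python. This is only an
illustration: the employee records and the byte string are made-up sample
values, not data from these notes.

```python
# Structured data: every record has the same named fields, so it can be
# filtered directly on a field. The records are hypothetical sample rows.
employees = [
    {"id": 1, "name": "Asha", "department": "Sales", "salary": 52000},
    {"id": 2, "name": "Ravi", "department": "HR", "salary": 48000},
    {"id": 3, "name": "Meena", "department": "Sales", "salary": 61000},
]
sales_staff = [e["name"] for e in employees if e["department"] == "Sales"]

# Unstructured data: an opaque run of bytes (here, the 8-byte PNG file
# signature as a stand-in for an image file). There are no named fields
# to filter on; the content must be interpreted before it can be queried.
image_blob = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])
has_fields = hasattr(image_blob, "keys")

print(sales_staff)  # names of employees in the Sales department
print(has_fields)   # False: bytes expose no field names
```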

4.0 Data Warehouse: -
A data warehouse is a type of data management system that is designed to
enable and support business intelligence (BI) activities, especially analytics.
Data warehouses are solely intended to perform queries and analysis and often
contain large amounts of historical data. The data within a data warehouse is
usually derived from a wide range of sources such as application log files and
transaction applications.

What is Data Mart?


A Data Mart is focused on a single functional area of an organization and
contains a subset of data stored in a Data Warehouse. A Data Mart is a
condensed version of a Data Warehouse and is designed for use by a specific
department, unit, or set of users in an organization, e.g., Marketing, Sales, HR,
or Finance. It is often controlled by a single department in an organization.

A Data Mart usually draws data from only a few sources compared to a Data
Warehouse. Data Marts are small in size and more flexible than a Data
Warehouse.
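One way to picture a Data Mart as a subset of a Data Warehouse is a view over a
warehouse table. The sketch below uses Python's built-in sqlite3 module with an
in-memory database. The table name warehouse_facts, the view name sales_mart,
and all the rows are hypothetical, and real data marts are often separate
physical stores rather than views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A tiny stand-in for a warehouse table holding data for several departments.
conn.execute(
    "CREATE TABLE warehouse_facts (department TEXT, metric TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_facts VALUES (?, ?, ?)",
    [
        ("Sales", "revenue", 1200.0),
        ("Sales", "orders", 35.0),
        ("HR", "headcount", 48.0),
        ("Finance", "expenses", 900.0),
    ],
)

# The "data mart": only the subset that the Sales department needs.
conn.execute(
    "CREATE VIEW sales_mart AS "
    "SELECT metric, value FROM warehouse_facts WHERE department = 'Sales'"
)

sales_rows = conn.execute(
    "SELECT metric, value FROM sales_mart ORDER BY metric"
).fetchall()
print(sales_rows)
```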

5.0 OLAP & OLTP: -


Online Analytical Processing (OLAP) –
Online Analytical Processing is a category of software tools used to analyse
data for business decisions. OLAP provides an environment for gaining insights
from data retrieved from multiple database systems at one time.

Examples: -
Spotify analyses the songs its users play to build a personalized homepage of
songs and playlists. The Netflix movie recommendation system is another example.

Online transaction processing (OLTP) –


Online transaction processing supports transaction-oriented applications in a
3-tier architecture. OLTP administers the day-to-day transactions of an
organization.

Examples: -
An ATM centre is an OLTP application.
OLTP maintains the ACID properties of each data transaction made through the
application. It is also used for online banking, online airline ticket booking,
sending a text message, and adding a book to a shopping cart.
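The atomicity part of ACID can be sketched with Python's built-in sqlite3
module. Everything here is an assumption made for illustration: the accounts
table, the transfer function, and the balances are hypothetical, and a real
banking system would involve far more than this.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint stands in for a business rule: no negative balances.
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit would make the balance negative, so the credit made
        # earlier in the same transaction has been rolled back as well.
        return False


ok = transfer(conn, 1, 2, 30)       # succeeds
failed = transfer(conn, 1, 2, 500)  # violates the CHECK, nothing changes
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(ok, failed, balances)
```

The second transfer credits account 2 first and then fails on the debit; the
rollback undoes the credit, so the database never shows a half-finished
transfer.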

6.0 OLAP CUBE: -


At the core of the OLAP concept is the OLAP Cube. The OLAP Cube is a data
structure optimized for very quick data analysis. It consists of numeric facts
called measures, which are categorized by dimensions. The OLAP Cube is also
called a hypercube. Usually, data operations and analysis are performed using a
simple spreadsheet, where data values are arranged in row and column format.
This is ideal for two-dimensional data. However, OLAP contains multidimensional
data, usually obtained from different and unrelated sources, so a spreadsheet
is not an optimal option. The cube can store and analyse multidimensional data
in a logical and orderly manner.
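As a rough mental model, a cube can be sketched as a mapping from a tuple of
dimension values to a measure. The dimensions (quarter, city, product), the
sales measure, and every figure below are hypothetical, and real OLAP servers
use far more elaborate storage than a Python dictionary.

```python
from collections import defaultdict

# Hypothetical fact rows: quarter, city and product are the dimensions,
# and the final number is the sales measure.
facts = [
    ("Q1", "New Jersey", "Mobile", 100),
    ("Q1", "Los Angeles", "Mobile", 300),
    ("Q2", "New Jersey", "Modem", 50),
    ("Q1", "New Jersey", "Mobile", 20),
]

# Each cell of the cube is addressed by one value per dimension and holds
# the aggregated measure for that combination.
cube = defaultdict(int)
for quarter, city, product, sales in facts:
    cube[(quarter, city, product)] += sales

cell = cube[("Q1", "New Jersey", "Mobile")]
print(cell)  # the two matching fact rows are summed into one cell
```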

How does it work?

A Data Warehouse extracts information from multiple data sources and
formats such as text files, Excel sheets, multimedia files, etc.

The extracted data is cleaned and transformed. Data is then loaded into an OLAP
server (or OLAP cube), where information is pre-calculated for further
analysis.

Basic analytical operations of OLAP


Four types of analytical OLAP operations are:

Roll-up
Drill-down
Slice and dice
Pivot (rotate)

1) Roll-up:

Roll-up is also known as “consolidation” or “aggregation.” The roll-up
operation can be performed in 2 ways:
1. Reducing dimensions
2. Climbing up the concept hierarchy. A concept hierarchy is a system of
grouping things based on their order or level.

In this example:
1. The cities New Jersey and Los Angeles are rolled up into the country USA.
2. The sales figures of New Jersey and Los Angeles are 440 and 1560
respectively. They become 2000 after roll-up.
3. In this aggregation process, the location hierarchy moves up from city to
country.
4. In the roll-up process, one or more dimensions need to be removed. In
this example, the Cities dimension is removed.
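The roll-up example can be reproduced as a small sketch. The sales figures are
the ones quoted in these notes; the dictionary layout and the roll_up function
name are assumptions made for illustration.

```python
from collections import defaultdict

# Sales per city, using the figures from the notes.
city_sales = {"New Jersey": 440, "Los Angeles": 1560}

# One step of the location concept hierarchy: city -> country.
city_to_country = {"New Jersey": "USA", "Los Angeles": "USA"}


def roll_up(sales_by_city, hierarchy):
    """Aggregate city-level sales to the next level up in the hierarchy."""
    totals = defaultdict(int)
    for city, sales in sales_by_city.items():
        totals[hierarchy[city]] += sales
    return dict(totals)


country_sales = roll_up(city_sales, city_to_country)
print(country_sales)  # {'USA': 2000}
```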

2) Drill-down:
In drill-down, data is fragmented into smaller parts. It is the opposite of the
roll-up process. It can be done by:
1. Moving down the concept hierarchy
2. Increasing a dimension

In this example:
1. Quarter Q1 is drilled down to the months January, February, and March. The
corresponding sales are also registered.
2. The Months dimension is added.
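Drill-down can be sketched in the same style. The monthly figures below are
invented for illustration (the notes do not give them), and the function names
are assumptions.

```python
# Hypothetical monthly sales keyed by (quarter, month).
monthly_sales = {
    ("Q1", "January"): 150,
    ("Q1", "February"): 100,
    ("Q1", "March"): 200,
    ("Q2", "April"): 120,
}


def quarter_total(data, quarter):
    """The coarse view: a single figure for the whole quarter."""
    return sum(v for (q, _month), v in data.items() if q == quarter)


def drill_down(data, quarter):
    """The finer view: the same quarter broken out by month."""
    return {month: v for (q, month), v in data.items() if q == quarter}


q1_total = quarter_total(monthly_sales, "Q1")
q1_by_month = drill_down(monthly_sales, "Q1")
print(q1_total)
print(q1_by_month)
```

The monthly values add back up to the quarterly total, which is what makes
drill-down the inverse of roll-up.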
3) Slice:
Here, one dimension is selected, and a new sub-cube is created. In this
example:
1. The Time dimension is sliced with Q1 as the filter.
2. A new cube is created altogether.

Dice:
This operation is similar to a slice. The difference is that in dice you select
2 or more dimensions, which results in the creation of a sub-cube.

4) Pivot:

In Pivot, you rotate the data axes to provide an alternative presentation of
the data.
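Slice, dice, and pivot can be sketched on a small cube stored as a dictionary
keyed by (quarter, city, product). All figures and function names are
hypothetical and only illustrate the idea of each operation.

```python
# Hypothetical cube: (quarter, city, product) -> sales.
cube = {
    ("Q1", "New Jersey", "Mobile"): 120,
    ("Q1", "Los Angeles", "Mobile"): 300,
    ("Q1", "Los Angeles", "Modem"): 80,
    ("Q2", "New Jersey", "Modem"): 50,
    ("Q2", "Los Angeles", "Mobile"): 210,
}


def slice_cube(cube, quarter):
    """Slice: fix the time dimension to one value, leaving a 2-D sub-cube."""
    return {
        (city, product): v
        for (q, city, product), v in cube.items()
        if q == quarter
    }


def dice_cube(cube, quarters, cities):
    """Dice: filter on two dimensions at once; the result keeps all three."""
    return {
        key: v
        for key, v in cube.items()
        if key[0] in quarters and key[1] in cities
    }


def pivot(table):
    """Pivot: swap the two axes of a 2-D view, (row, col) -> (col, row)."""
    return {(col, row): v for (row, col), v in table.items()}


q1 = slice_cube(cube, "Q1")
diced = dice_cube(cube, {"Q1", "Q2"}, {"Los Angeles"})
rotated = pivot(q1)
print(q1)
print(diced)
print(rotated)
```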

7.0 Creating a database: -


Step1: Create a Database:
We can create a Database using the command:

Syntax: CREATE DATABASE DATABASE_NAME;


CREATE DATABASE studentmarks;

Step2: Adding table into Database:


To add a table into the database we use the below command:

Syntax: CREATE TABLE table_name (Attribute_name datatype...);


So, let’s create a students table within the studentmarks database as shown
below:
CREATE TABLE Students (
Id int,
Name varchar(20),
Total_Marks int);

Step3: Inserting values into Tables:

For inserting records into the table, we can use the below command:
Syntax: INSERT INTO table_name (column1, column2, column3, ...)
VALUES (value1, value2, value3, ...);
So, let’s add some records to the Students table:
INSERT INTO Students VALUES (1,'Neha',90);
INSERT INTO Students VALUES (2,'Sahil',50);
INSERT INTO Students VALUES (3,'Rohan',70);
INSERT INTO Students VALUES (4,'Ankita',80);
INSERT INTO Students VALUES (5,'Rahul',65);
INSERT INTO Students VALUES (6,'Swati',55);
INSERT INTO Students VALUES (7,'Alka',75);

Step4: Querying the data:

Use the below syntax to query for all students with marks greater than the
class average:
Syntax:
SELECT column1 FROM table_name
WHERE column2 > (SELECT AVG(column2) FROM table_name);
Now use the above syntax to make the query on our Students table as shown
below:
SELECT Name FROM Students
WHERE Total_Marks > (SELECT AVG(Total_Marks) FROM Students);
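To try the steps end to end without installing a database server, the same
statements can be run through Python's built-in sqlite3 module. This is a
sketch under two assumptions: an in-memory SQLite database stands in for the
studentmarks database (SQLite has no CREATE DATABASE statement), and the marks
column is named Total_Marks, since an unquoted column name cannot contain a
space.

```python
import sqlite3

# An in-memory database plays the role of the studentmarks database.
conn = sqlite3.connect(":memory:")

conn.execute("CREATE TABLE Students (Id int, Name varchar(20), Total_Marks int)")

# The same seven records as in Step3.
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?)",
    [
        (1, "Neha", 90),
        (2, "Sahil", 50),
        (3, "Rohan", 70),
        (4, "Ankita", 80),
        (5, "Rahul", 65),
        (6, "Swati", 55),
        (7, "Alka", 75),
    ],
)

# The class average, computed on its own for reference.
average = conn.execute("SELECT AVG(Total_Marks) FROM Students").fetchone()[0]

# Students whose marks exceed the class average (subquery as in Step4).
names = [
    row[0]
    for row in conn.execute(
        "SELECT Name FROM Students "
        "WHERE Total_Marks > (SELECT AVG(Total_Marks) FROM Students) "
        "ORDER BY Id"
    )
]
print(round(average, 2))  # 69.29
print(names)              # ['Neha', 'Rohan', 'Ankita', 'Alka']
```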

Parametric Test: -
If the mean more accurately represents the centre of the distribution of the data,
and the sample size is large enough, use a parametric test.
If the median more accurately represents the centre of the distribution of the
data, use a nonparametric test even if you have a large sample size.
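The difference between the mean and the median as a measure of centre can be
checked with Python's statistics module. Both samples below are invented for
illustration. Comparing the two values is only an informal check and not a
substitute for examining the distribution properly.

```python
import statistics

# Hypothetical samples: one roughly symmetric, one with a single large outlier.
symmetric = [48, 50, 52, 49, 51, 50]
skewed = [20, 22, 21, 23, 22, 250]

for label, sample in (("symmetric", symmetric), ("skewed", skewed)):
    mean = statistics.mean(sample)
    median = statistics.median(sample)
    # When mean and median are close, the mean describes the centre well.
    # When an outlier drags the mean away, the median is the better summary.
    print(label, round(mean, 2), median)
```

For the symmetric sample the mean and median coincide, while in the skewed
sample the outlier pulls the mean far above the median, which is the situation
in which the notes suggest a nonparametric test.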
RDBMS: -
An RDBMS is a database management system (DBMS) that incorporates the
relational data model, normally including a Structured Query Language (SQL)
application programming interface. It is a DBMS in which the database is
organized and accessed according to the relationships between data items. In a
relational database, relationships between data items are expressed by means of
tables. Interdependencies among these tables are expressed by data values
rather than by pointers. This allows a high degree of data independence.
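The idea that tables are linked by data values rather than by pointers can be
sketched with Python's built-in sqlite3 module. The departments and employees
tables and all their rows are hypothetical examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT)"
)
# employees.dept_id holds a plain value that matches departments.dept_id.
# That shared value, not a pointer or memory address, is the relationship.
conn.execute(
    "CREATE TABLE employees ("
    "emp_id INTEGER PRIMARY KEY, emp_name TEXT, "
    "dept_id INTEGER REFERENCES departments(dept_id))"
)
conn.executemany(
    "INSERT INTO departments VALUES (?, ?)", [(10, "Sales"), (20, "HR")]
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Asha", 10), (2, "Ravi", 20), (3, "Meena", 10)],
)

# A join follows the relationship by comparing the values in both tables.
rows = conn.execute(
    "SELECT e.emp_name, d.dept_name "
    "FROM employees e JOIN departments d ON e.dept_id = d.dept_id "
    "ORDER BY e.emp_id"
).fetchall()
print(rows)
```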

Data mining: - This is a process in which data is analysed from different
perspectives and then turned into a summary of data that is deemed useful.
Data mining is normally used with data at rest or with archival data. Data
mining techniques focus on modelling and knowledge discovery for predictive,
rather than purely descriptive, purposes, which makes it an ideal process for
uncovering new patterns from large data sets.
