
bdc notes

The document discusses different types of data: structured, semi-structured, and unstructured, highlighting their definitions, characteristics, and use cases. It also covers the history and architecture of Hadoop, a framework for processing large datasets, including its components like Namenode and Datanode. Additionally, it touches on big data analytics, various analytical techniques, and the future scope of data management in sectors like healthcare and finance.

Uploaded by

hakurap112005

TYPES OF DATA
1) Structured
2) Semi-structured
3) Unstructured

Examples of structured data:
- Databases such as Oracle, DB2, MySQL, etc.
- Spreadsheets
- Transaction processing systems
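1. STRUCTURED DATA

Structured data can be defined as the data that resides in a fixed field within a record. It is the type of data that is most familiar to us in everyday life, e.g. birthday, address. Structured data is stored in tables and queried with Structured Query Language (SQL).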

DRAWBACKS

- Structured data can be used only in cases of predefined functionalities. This means that structured data has limited flexibility and is suitable for certain specific use cases only.
- Structured data is stored in a data warehouse with rigid constraints and a defined schema.

2. SEMI-STRUCTURED DATA

Semi-structured data is not bound by any rigid schema for data storage. There are some features, like key-value pairs, which are used to help in differentiating entities from each other.

In semi-structured data, NoSQL is used. A data serialization language is used for it, and the content is often used to store metadata about the business process.

This type of information typically comes from external sources such as social media platforms or other web-based data feeds.

Examples of semi-structured data: email, JSON, data serialization and other markup languages, key-value stores, NoSQL.
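The difference between a fixed schema and key-value records can be sketched in a few lines. This is only an illustration: the table name, the column names and the sample records are made up for the example, and SQLite and JSON stand in for "an SQL database" and "a serialization format".

```python
import json
import sqlite3

# Structured: every row must fit the schema declared up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, birthday TEXT, address TEXT)")
conn.execute(
    "INSERT INTO person VALUES (?, ?, ?)",
    ("Asha", "2001-05-14", "Pune"),  # hypothetical sample row
)
rows = conn.execute("SELECT name, address FROM person").fetchall()

# Semi-structured: key-value pairs, and records need not share the same keys.
feed = (
    '[{"user": "asha", "text": "hello"},'
    ' {"user": "ravi", "likes": 3, "tags": ["hdfs"]}]'
)
records = json.loads(feed)
keys_per_record = [sorted(record.keys()) for record in records]

print(rows)
print(keys_per_record)
```

The SQL table rejects a row with the wrong number of fields, while the JSON records carry their own keys, which is what lets each entity be told apart without a rigid schema.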

3. UNSTRUCTURED DATA

It is the kind of data that doesn't have any defined schema or set of rules. Its arrangement is unplanned, e.g. texts, photos, log files.

It is additionally known as dark data because it cannot be analyzed without the proper software tools.

Examples of unstructured data: audio, videos, text, messages, chats, social media data, free-form text.

HISTORY OF HADOOP

Hadoop is an open-source framework overseen by the Apache Software Foundation. It is written in Java and is used for processing huge datasets on commodity hardware.

Hadoop was started by Doug Cutting and Mike Cafarella in the year 2002, when they both started to work on the Apache Nutch project. After a lot of research on Nutch, they concluded that such a system would cost around half a million dollars in hardware, along with a monthly running cost of approximately $30,000, which is very expensive.

In 2003, they came across a paper that described the architecture of Google's distributed file system, called GFS (Google File System). It was found to be half of the solution to their problem.

In 2004, Google published one more paper, on the technique MapReduce. This paper was the other half of the solution to the problem. Doug Cutting and Mike Cafarella then started implementing these techniques (GFS and MapReduce) in their Nutch project.

In 2005, Cutting found that Nutch was limited to clusters of 20 to 40 nodes, because there were only two engineers working on the project.

In 2006, Cutting joined Yahoo with the Nutch project, and the work was renamed Hadoop.

Why the name Hadoop? It comes from a yellow toy elephant that belonged to Doug Cutting's son.

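RDBMS vs HADOOP

1) Definition
   RDBMS: Traditional row-column based database, basically used for data storage, manipulation and retrieval.
   Hadoop: An open-source software used for storing data and running processes concurrently.

2) Data processed
   RDBMS: Structured data is mostly processed.
   Hadoop: Both structured and unstructured data are processed.

3) Best suited for
   RDBMS: OLTP (online transaction processing).
   Hadoop: Big data.

4) Scalability
   RDBMS: Less scalable than Hadoop.
   Hadoop: Highly scalable.

5) Data normalization
   RDBMS: Required.
   Hadoop: Not required.

6) Data stored
   RDBMS: Stores transformed and aggregated data.
   Hadoop: Stores a huge volume of data.

7) Schema
   RDBMS: Static type.
   Hadoop: Dynamic type.

8) Data integrity
   RDBMS: High data integrity available.
   Hadoop: Lower data integrity available than RDBMS.

9) Cost
   RDBMS: Cost is applicable for licensed software.
   Hadoop: No cost, as it is open source.

10) Data analysis (IMP)
   RDBMS: Queried with SQL.
   Hadoop: Uses MapReduce.

11) ACID properties
   RDBMS: Follows the ACID properties.
   Hadoop: Does not follow the ACID properties.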
BIG DATA ANALYTICS

In this new digital world, data is being generated in an enormous amount, which opens new paradigms. As we have high computing power and a large amount of data, we can use this data to help us make data-driven decisions.

Types of analytics:
1) Predictive (forecasting)
2) Descriptive
3) Prescriptive (optimization, simulation)
4) Diagnostic

Descriptive Analytics: deals with what happened in the past.
Diagnostic Analytics: deals with why it happened.
Predictive Analytics: deals with what will happen in the future.
Prescriptive Analytics: how can we make it happen.

1. Predictive Analytics

Uses data to determine the probable outcome of an event.

Techniques that are used for predictive analytics:
- Linear regression
- Time series analysis and forecasting
- Data mining
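As a rough illustration of the first technique, linear regression, the sketch below fits a straight line by ordinary least squares and uses it to forecast the next value. The monthly sales figures and the names `fit_line`, `months` and `sales` are assumptions made up for the example, not data from these notes.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]  # hypothetical past values

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # probable outcome for month 6

print(slope, intercept, forecast)
```

Real data will not lie exactly on a line, so the forecast is only the most probable outcome under the assumption of a linear trend.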
2. Descriptive Analytics

Looks at data and analyzes past events for insight as to how to approach future events.

Common examples:
- Data queries
- Reports
- Descriptive statistics
- Data dashboards
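A small sketch of descriptive statistics, using Python's standard `statistics` module. The `daily_orders` values are hypothetical; the point is only that descriptive analytics summarizes what has already happened.

```python
import statistics

daily_orders = [4, 8, 6, 5, 7]  # hypothetical past observations

# A tiny "report": summary numbers that describe past events.
summary = {
    "count": len(daily_orders),
    "mean": statistics.mean(daily_orders),
    "median": statistics.median(daily_orders),
    "min": min(daily_orders),
    "max": max(daily_orders),
}

for name, value in summary.items():
    print(name, value)
```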

3. Prescriptive Analytics

Synthesizes big data, mathematical science, business rules and machine learning to make a prediction, and then suggests a decision option to take advantage of the prediction.

Ex: healthcare strategic planning by using analytics.



4. Diagnostic Analytics

We generally use historical data over other data to answer any question or for the solution of any problem.

Common techniques:
- Data discovery
- Data mining
- Correlations
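Correlation, the last technique in the list, can be sketched with the Pearson coefficient. The function name `pearson` and the sample series are assumptions for illustration; a value near +1 or -1 suggests two historical series move together, which is a hint about "why" but not proof of a cause.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


ad_spend = [1, 2, 3, 4, 5]        # hypothetical historical data
visits = [2, 4, 6, 8, 10]         # rises with ad_spend
complaints = [10, 8, 6, 4, 2]     # falls as ad_spend rises

r_visits = pearson(ad_spend, visits)
r_complaints = pearson(ad_spend, complaints)
print(r_visits, r_complaints)
```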

STEPS IN DATA ANALYSIS

Step 1: Define data requirements
Step 2: Data collection
Step 3: Data organization
Step 4: Data cleaning

FUTURE SCOPE OF DATA ANALYTICS (DA)

- Retail
- Healthcare
- Finance
- Transportation

Each of these needs to process and store big data.

HDFS (HADOOP DISTRIBUTED FILE SYSTEM)

CHARACTERISTICS
- Can store petabytes of data.
- Highly scalable.
- Runs on commodity x86 servers and open-source software.
- Spreads computation to each server.
- Treats failure as inevitable.
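ARCHITECTURE

[Figure: HDFS architecture. Master side: NameNode, JobTracker and Secondary NameNode. The NameNode holds the metadata (name, replicas, ...) and serves metadata operations from the client; the client reads and writes blocks on the DataNodes, and blocks are replicated between DataNodes across racks.]

Replication factor: the number of copies of each block.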

HDFS IMPORTANT ASPECTS

Client
- Interface between the user and the file system (NameNode and DataNodes).
- Communicates with the NameNode (NN) for namespace operations and with the DataNodes (DN) for data access.

NameNode
- Master server.
- Only one NameNode.
- Manages the file system namespace (unique identification).
- Regulates access to files by users and programs (clients).
- Handles opening, closing and renaming of files and directories.
- Maps blocks to DataNodes (assigns blocks).

DataNode
- One per node in the cluster.
- Manages the storage attached to its node.
- Handles block creation, deletion and replication.
- Serves read and write requests from clients.

Secondary NameNode
- Helper to the NameNode.
- Saves the metadata in case of failure.
- Keeps a replica of the metadata storage.
- Status: DataNode heartbeats show which DataNodes are alive.
DATA STORAGE AND REPLICATION

- Files are stored as a sequence of blocks.
- All blocks are of the same size except the last block.
- DataNodes report their blocks; if a DataNode fails, the replication factor of its blocks changes and the blocks are replicated again.
- Block size: 128 MB in Hadoop 2.

[Figure: Hadoop cluster layout. A switch connects the top-of-rack (ToR) switches; each rack holds DataNodes storing blocks (block 1, block 2, block 3). Nodes shown: NameNode, JobTracker and DataNodes.]

FILE BLOCKS AND REPLICATION FACTORS

Whenever you import any file into the Hadoop Distributed File System, that file gets divided into blocks of some size, and then these blocks are stored on various slave (data) nodes. By default these blocks are 64 MB in size in Hadoop 1 and 128 MB in size in Hadoop 2.

Ex: Suppose we have uploaded a file of 400 MB. This file gets divided into blocks of 128 + 128 + 128 + 16 MB.

By default the replication factor is 3, and it is configurable.

In the above example we have 4 file blocks, which means that 3 replicas (copies) of each block are made, so in total 4 x 3 = 12 blocks are made.
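The arithmetic of the 400 MB example can be checked with a short sketch. The function name `split_into_blocks` is made up for illustration; it only mimics the sizes of the blocks and does not touch a real HDFS cluster.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file would be divided into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds what is left over.
        blocks.append(remainder)
    return blocks


replication_factor = 3  # default value stated in the notes

blocks = split_into_blocks(400)             # Hadoop 2 block size (128 MB)
total_stored = len(blocks) * replication_factor

blocks_hadoop1 = split_into_blocks(400, 64)  # Hadoop 1 block size (64 MB)

print(blocks, total_stored, len(blocks_hadoop1))
```

With the 64 MB block size of Hadoop 1 the same file needs more blocks, so the number of stored replicas grows as well.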

ADVANTAGE
- Fault tolerance: we can make copies of the blocks for backup purposes.
RACK AWARENESS (reduces network traffic)

A rack is a physical collection of DataNodes in a Hadoop cluster. A Hadoop cluster has many racks.

With the help of this rack information, the NameNode chooses the closest DataNode to achieve maximum performance while performing the read or write, which reduces the network traffic.

[Figure: Rack awareness. File blocks B1, B2 and B3 are replicated on DataNodes (slaves) spread over racks R1, R2 and R3; the NameNode (master) keeps the rack information.]

Hadoop has some rack awareness policies:

1) There should not be more than 1 replica on the same DataNode.
2) More than two replicas of a single block are not allowed on the same rack.
3) The number of racks used inside the Hadoop cluster must be smaller than the number of replicas.
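The three policies above can be written as a small check on where the replicas of one block are placed. This is a sketch under assumptions: `placement_ok` is a made-up helper, a placement is a list of (rack, datanode) pairs, and policy 3 is read as "racks used by this block must be fewer than its replicas". It is not how the NameNode itself is implemented.

```python
from collections import Counter


def placement_ok(replicas):
    """Check one block's replica placement against the three policies.

    replicas: list of (rack, datanode) pairs, one pair per replica.
    """
    per_node = Counter(replicas)
    per_rack = Counter(rack for rack, _node in replicas)

    one_replica_per_node = all(count <= 1 for count in per_node.values())
    at_most_two_per_rack = all(count <= 2 for count in per_rack.values())
    fewer_racks_than_replicas = len(per_rack) < len(replicas)

    return (one_replica_per_node
            and at_most_two_per_rack
            and fewer_racks_than_replicas)


good = [("R1", "dn1"), ("R2", "dn3"), ("R2", "dn4")]
same_node_twice = [("R1", "dn1"), ("R1", "dn1"), ("R2", "dn3")]
three_on_one_rack = [("R1", "dn1"), ("R1", "dn2"), ("R1", "dn3")]
one_rack_each = [("R1", "dn1"), ("R2", "dn3"), ("R3", "dn5")]

for placement in (good, same_node_twice, three_on_one_rack, one_rack_each):
    print(placement_ok(placement))
```

The first placement (one replica on one rack, two on another) is the only one of the four that satisfies all three rules.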
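READ OPERATION IN HDFS

[Figure: HDFS read. Inside the client JVM on the client node, the HDFS client (1) opens the file through the Distributed FileSystem, which (2) gets the block locations from the NameNode; the client then (3) reads from the FS data input stream, which (4, 5) reads the blocks from the DataNodes, and finally (6) closes the stream.]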
HDFS works on the streaming data access pattern, which means it supports write-once and read-many features.

Whenever a client sends a request to HDFS to read something, access to the data, or to the DataNode where the actual data is stored, is not directly granted to the client. This is because the client doesn't have information about the data, i.e. on which DataNode the data is stored or where the replica of the data is made on the DataNodes.

So, that's why the client first sends a request to the NameNode, since the NameNode contains the metadata.

Once the request is received by the NameNode, it responds and sends all the information to the client: the number of DataNodes, the location where the replica is made, the number of data blocks and their location.

Now the client can read the data with all this information. The client reads the data in parallel, since replicas of the same data are available on the cluster. Once the whole data is read, it combines all the blocks into the original file.
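The read path described above can be imitated with plain dictionaries. Everything here is a toy model: `namenode_metadata`, `datanodes`, `get_block_locations`, `read_block` and `read_file` are hypothetical names, and no real Hadoop API is used. It shows the order of events: ask for the metadata first, read the blocks in parallel, then combine them.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy DataNodes: each one stores some block replicas.
datanodes = {
    "dn1": {"b1": b"big ", "b2": b"data "},
    "dn2": {"b2": b"data ", "b3": b"notes"},
    "dn3": {"b1": b"big ", "b3": b"notes"},
}

# Toy NameNode metadata: file -> ordered blocks and where their replicas live.
namenode_metadata = {
    "/notes.txt": [
        ("b1", ["dn1", "dn3"]),
        ("b2", ["dn1", "dn2"]),
        ("b3", ["dn2", "dn3"]),
    ],
}


def get_block_locations(path):
    """Step 1: the client asks the NameNode, which only returns metadata."""
    return namenode_metadata[path]


def read_block(block_id, locations):
    """Step 2: read one block from the first DataNode that still has it."""
    for node in locations:
        store = datanodes.get(node, {})
        if block_id in store:
            return store[block_id]
    raise IOError(f"no live replica for block {block_id}")


def read_file(path):
    """Read all blocks in parallel, then combine them in the original order."""
    block_list = get_block_locations(path)
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda item: read_block(*item), block_list)
        return b"".join(parts)


content = read_file("/notes.txt")

# If one DataNode goes away, the other replicas still serve the read.
del datanodes["dn1"]
content_after_failure = read_file("/notes.txt")

print(content, content_after_failure)
```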

HDFS WRITE OPERATION

[Figure: HDFS write. In the client JVM on the client node, the HDFS client calls create on the Distributed FileSystem, which creates the file on the NameNode; the client then writes, packets move through a data queue and an ack queue, and write packets and acknowledgements flow along the pipeline of DataNodes.]

1. The client interacts with the HDFS NameNode.
- To write a file inside HDFS, the client first interacts with the NameNode.
- The NameNode first checks for the client privileges to write a file.
- If the client has sufficient privileges and there is no file existing with the same name, the NameNode creates a record of a new file.
- The NameNode then provides the addresses of all DataNodes and a security token.
- If the file already exists, then file creation fails and the client receives an IOException.

2. The client interacts with the DataNodes.
- After receiving the list of the DataNodes and file write permission, the client starts writing data directly to the first DataNode in the list.
- After finishing writing of the data, the DataNode starts making replicas of the block on other DataNodes, depending upon the replication factor.
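The two phases of the write can be sketched as a toy model. The class `ToyNameNode`, the function `write_block` and all user, path and node names are made up for illustration. The security token is left out, and Python's `FileExistsError` stands in for the IOException mentioned above.

```python
class ToyNameNode:
    """Keeps file records and hands out target DataNodes (toy model)."""

    def __init__(self, datanodes, writers):
        self.datanodes = datanodes   # names of known DataNodes
        self.writers = writers       # users allowed to write
        self.files = {}              # path -> DataNodes chosen for it

    def create(self, user, path, replication=3):
        # Phase 1: privilege check, then "no file with the same name" check.
        if user not in self.writers:
            raise PermissionError(f"{user} may not write {path}")
        if path in self.files:
            raise FileExistsError(f"{path} already exists")
        targets = self.datanodes[:replication]
        self.files[path] = targets   # record of the new file
        return targets


def write_block(storage, targets, block_id, data):
    """Phase 2: the client writes to the first DataNode; the block is then
    copied along the list until the replication factor is reached."""
    for node in targets:
        storage.setdefault(node, {})[block_id] = data


namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"], writers={"asha"})
storage = {}

targets = namenode.create("asha", "/notes.txt")
write_block(storage, targets, "b1", b"hello hdfs")

try:
    namenode.create("asha", "/notes.txt")
    second_create_failed = False
except FileExistsError:
    second_create_failed = True

print(targets, sorted(storage), second_create_failed)
```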

NOTE (IMP)

What happens if a DataNode fails while writing a file in HDFS?

Ans:
- The pipeline gets closed, and the packets in the ack queue are added to the front of the data queue, so that the DataNodes downstream from the failed node do not miss any packet.
- Then the current block on the alive DataNodes gets a new identity.
- The failed DataNode gets removed from the pipeline, and a new pipeline gets constructed from the two alive DataNodes.
- The NameNode observes that the block is under-replicated and arranges for a further copy on another DataNode.
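The queue handling in the answer above can be sketched with two deques. The function `recover_pipeline` and the packet and node names are hypothetical; the sketch only models the bookkeeping (requeue unacknowledged packets, drop the failed node, notice under-replication), not the real client.

```python
from collections import deque


def recover_pipeline(pipeline, failed_node, data_queue, ack_queue):
    """Rebuild the pipeline after a DataNode failure (toy model)."""
    # Unacknowledged packets go back to the FRONT of the data queue,
    # in their original order, so no downstream node misses a packet.
    data_queue.extendleft(reversed(ack_queue))
    ack_queue.clear()
    # The failed node is removed; the alive nodes form the new pipeline.
    return [node for node in pipeline if node != failed_node]


replication_factor = 3
pipeline = ["dn1", "dn2", "dn3"]
data_queue = deque(["p3", "p4"])   # packets not yet sent
ack_queue = deque(["p1", "p2"])    # packets sent but not yet acknowledged

new_pipeline = recover_pipeline(pipeline, "dn2", data_queue, ack_queue)

# The NameNode would notice this and arrange a further copy elsewhere.
under_replicated = len(new_pipeline) < replication_factor

print(new_pipeline, list(data_queue), under_replicated)
```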
