File Organization


FILE ORGANISATION

• A database consists of a huge amount of data. In an RDBMS, the data is
grouped into tables, and each table holds related records. A user sees
the data as stored in the form of tables, but this huge amount of data
is actually stored in physical memory in the form of files.

• File – A file is a named collection of related information that is recorded
on secondary storage such as magnetic disks and optical disks.
• Storing the files (records) in a certain order is called file organization.
Types of File Organizations –
• Some common types of file organization are:
• Sequential File Organization
• Heap File Organization
• Hash File Organization
• Clustered File Organization
Sequential File Organization –
• The simplest method of file organization is the sequential method. In this
method the records are stored one after another in a sequential manner.
There are two ways to implement this method:
• Pile File Method – This method is quite simple: we store the records in a
sequence, i.e. one after another, in the order in which they are inserted
into the table.
1. Insertion of new record –
Let R1, R3, R5 and R4 be four records in the sequence (a record here is
simply a row in a table). Suppose a new record R2 has to be inserted;
it is simply placed at the end of the file.
• Sorted File Method – In this method, as the name suggests, whenever a
new record has to be inserted, it is always inserted in a sorted
(ascending or descending) manner. Sorting of records may be based on
the primary key or on any other key.
• Insertion of new record –
Let us assume that there is a pre-existing sorted sequence of four
records R1, R3, R7 and R8. Suppose a new record R2 has to be inserted;
it is first placed at the end of the file and the sequence is then re-sorted
(a short sketch of both methods follows below).
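A minimal Python sketch of the two insertion strategies, assuming a record is just an (id, data) tuple and the sort key is the record id (names and structures here are illustrative, not from the slides):

```python
# Minimal sketch: pile vs. sorted sequential files.
# Assumption: a record is an (id, data) tuple and the sort key is the id.

def pile_insert(records, new_record):
    """Pile file method: simply append the new record at the end of the file."""
    records.append(new_record)
    return records

def sorted_insert(records, new_record):
    """Sorted file method: append, then re-sort the file on the key (the id)."""
    records.append(new_record)
    records.sort(key=lambda r: r[0])
    return records

pile = [(1, "R1"), (3, "R3"), (5, "R5"), (4, "R4")]
print(pile_insert(pile, (2, "R2")))           # R2 ends up at the end of the file

sorted_file = [(1, "R1"), (3, "R3"), (7, "R7"), (8, "R8")]
print(sorted_insert(sorted_file, (2, "R2")))  # R2 ends up between R1 and R3
```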
 
• Pros and Cons of Sequential File Organization –
Pros –
• Simple design.
• Files can easily be stored on magnetic tape, i.e. a cheaper storage
mechanism.
Cons –
• Time is wasted because we cannot jump directly to the required record;
we have to move through the file sequentially.
• The sorted file method is inefficient, as sorting the records costs extra
time and space.
Heap File Organization –
• Heap file organization works with data blocks. In this method records
are inserted at the end of the file, into the data blocks. No sorting or
ordering is required. If a data block is full, the new record is stored in
some other block; this other data block need not be the very next one,
it can be any block in memory.
• Insertion of new record –
Suppose we have five records in the heap, R1, R5, R6, R4 and R3, and a
new record R2 has to be inserted. Since the last data block, data block 3,
is full, R2 is inserted into any data block selected by the DBMS, let us say
data block 1 (a minimal sketch follows below).
• If we want to search, delete or update data in a heap file organization,
we have to traverse the file from the beginning until we find the
requested record. Thus, if the database is very large, searching, deleting
or updating a record takes a lot of time.
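A minimal sketch of heap file insertion and linear search, assuming fixed-capacity data blocks and letting the "DBMS" simply pick the first block with free space (block size and names are illustrative assumptions):

```python
# Minimal sketch: heap file organization with fixed-capacity data blocks.
# Assumption: each block holds at most 2 records; on overflow the block chosen
# is simply the first one with free space.

BLOCK_CAPACITY = 2

def heap_insert(blocks, record):
    """Insert at the end; if the last block is full, use any block with space."""
    if blocks and len(blocks[-1]) < BLOCK_CAPACITY:
        blocks[-1].append(record)
        return
    for block in blocks:                 # any other block with free space will do
        if len(block) < BLOCK_CAPACITY:
            block.append(record)
            return
    blocks.append([record])              # otherwise allocate a new block

def heap_search(blocks, record):
    """Searching requires scanning the file from the beginning."""
    for block in blocks:
        if record in block:
            return True
    return False

blocks = [["R1"], ["R5", "R6"], ["R4", "R3"]]   # data block 3 is full
heap_insert(blocks, "R2")                       # R2 lands in data block 1
print(blocks, heap_search(blocks, "R2"))
```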
Hashing
• Hashing is an efficient technique to directly compute the location of the
desired data on the disk. Data is stored in data blocks whose addresses
are generated by using a hash function.
Hash File Organization:
• Data bucket – Data buckets are the memory locations where the
records are stored. These buckets are also considered the unit of
storage. A bucket typically stores one complete disk block, which in
turn can store one or more records.
• Hash Function – A hash function is a mapping function that maps the
set of all search keys to actual record addresses. A hash function can
range from a simple mathematical function to a complex one.
• Hashing is further divided into two subcategories:
• Static Hashing –
• In static hashing, when a search-key value is provided, the hash
function always computes the same address. For example, if we
generate the address for STUDENT_ID = 76 using a mod 5 hash
function, it always results in the same bucket address, 1. The bucket
address never changes here.
• Operations –
• Insertion – When a new record is inserted into the table, the hash function h generates a bucket
address for the new record based on its hash key K:
Bucket address = h(K)
• Searching – When a record needs to be searched, the same hash function is used to retrieve
the bucket address for the record. For example, if we want to retrieve the whole record for ID 76,
and the hash function is mod 5 on that ID, the bucket address generated is 1. We then go
directly to address 1 and retrieve the whole record for ID 76. Here the ID acts as the hash
key.
• Deletion – If we want to delete a record, we first fetch the record to be deleted using the hash
function, and then we remove the record from that address in memory.
• Updation – The data record that needs to be updated is first located using the hash function,
and then the data record is updated (see the sketch below).
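A minimal sketch of these static-hashing operations using the mod 5 hash function from the example; the buckets here are plain Python lists and bucket overflow is ignored for now (these representation choices are assumptions, not part of the slides):

```python
# Minimal sketch: static hashing with h(K) = K mod 5.
# Assumption: buckets are unbounded lists and records are (id, data) tuples;
# bucket overflow is ignored here (see open/closed hashing below).

NUM_BUCKETS = 5
buckets = [[] for _ in range(NUM_BUCKETS)]

def h(key):
    return key % NUM_BUCKETS             # the same key always maps to the same bucket

def insert(record):
    buckets[h(record[0])].append(record)

def search(key):
    for record in buckets[h(key)]:       # go straight to bucket h(key)
        if record[0] == key:
            return record
    return None

def delete(key):
    record = search(key)                 # first locate the record via the hash function
    if record is not None:
        buckets[h(key)].remove(record)

def update(key, new_data):
    if search(key) is not None:          # locate, then replace the record
        delete(key)
        insert((key, new_data))

insert((76, "student record"))
print(h(76), search(76))                 # bucket address 1, then the full record
```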
• Now, suppose we want to insert a new record into the file, but the data
bucket address generated by the hash function is not empty, or the
data already exists at that address. This is a critical situation to
handle, and in static hashing it is called bucket overflow.
How do we insert data in this case?
Several methods exist to overcome this situation. Some commonly
used methods are discussed below:
• Open Hashing –
In the open hashing method, the next available data bucket is used to
store the new record instead of overwriting the older one. This method
is also called linear probing. For example, if D3 is a new record that
needs to be inserted and the hash function generates address 105, but
that bucket is already full, the system searches for the next available
data bucket, 123, and assigns D3 to it.
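A minimal sketch of open hashing (linear probing), assuming one record per bucket and a small fixed bucket array; the concrete addresses 105 and 123 from the example are not modeled literally:

```python
# Minimal sketch: open hashing / linear probing with one record per bucket.
# Assumption: a fixed array of 7 buckets; None marks an empty bucket.

NUM_BUCKETS = 7
buckets = [None] * NUM_BUCKETS

def insert_linear_probing(key, record):
    addr = key % NUM_BUCKETS
    for i in range(NUM_BUCKETS):
        probe = (addr + i) % NUM_BUCKETS          # try the next available bucket
        if buckets[probe] is None:
            buckets[probe] = (key, record)
            return probe
    raise RuntimeError("hash table is full")

print(insert_linear_probing(3, "D1"))    # goes to bucket 3
print(insert_linear_probing(10, "D2"))   # also hashes to 3, probes on to bucket 4
```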
Handling of Bucket Overflows
• Although the probability of bucket overflow can be reduced, it cannot be
eliminated; it is handled by using overflow buckets.
• Overflow chaining – the overflow buckets of a given bucket
are chained together in a linked list.
• The above scheme is called closed hashing.
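A minimal sketch of closed hashing with overflow chaining, where each primary bucket keeps a linked list of overflow buckets; the bucket capacity and record format are illustrative assumptions:

```python
# Minimal sketch: closed hashing with overflow chaining.
# Assumption: each bucket holds at most 2 records; extra records go into
# overflow buckets chained in a linked list off the primary bucket.

BUCKET_CAPACITY = 2
NUM_BUCKETS = 5

class Bucket:
    def __init__(self):
        self.records = []
        self.overflow = None             # next overflow bucket in the chain

buckets = [Bucket() for _ in range(NUM_BUCKETS)]

def insert_chained(key, record):
    bucket = buckets[key % NUM_BUCKETS]
    while len(bucket.records) >= BUCKET_CAPACITY:   # walk the overflow chain
        if bucket.overflow is None:
            bucket.overflow = Bucket()
        bucket = bucket.overflow
    bucket.records.append((key, record))

for k in (5, 10, 15, 20):                # all hash to bucket 0, forcing overflow
    insert_chained(k, f"D{k}")
print(buckets[0].records, buckets[0].overflow.records)
```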
Dynamic Hashing –
• The drawback of static hashing is that it does not expand or shrink dynamically as
the size of the database grows or shrinks. In dynamic hashing, data buckets grow
or shrink (are added or removed dynamically) as the number of records increases or
decreases. Dynamic hashing is also known as extendible hashing.
• In dynamic hashing, the hash function is made to produce a large number of bit
values, of which only a part is used at any time. For example, suppose there are three
data records D1, D2 and D3, and the hash function generates the addresses 1001,
0101 and 1110 respectively. This method initially considers only part of each
address, say only the first bit, to store the data. So it tries to place the three
records at addresses 0 and 1.
But then no bucket address remains for D3. The buckets have to grow dynamically to
accommodate D3, so the addresses are expanded to use 2 bits rather than 1 bit, the existing data is
rehashed to 2-bit addresses, and D3 is then accommodated (a sketch of this idea follows below).
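A minimal sketch of this "grow the address from 1 bit to 2 bits" idea, using the example records D1, D2 and D3. This is a simplified toy that only doubles the number of address bits, not a full extendible-hashing implementation:

```python
# Minimal sketch: growing from 1-bit to 2-bit bucket addresses in dynamic
# (extendible) hashing. Assumption: records carry fixed 4-bit hash strings
# and each bucket holds at most 1 record.

records = {"D1": "1001", "D2": "0101", "D3": "1110"}
BUCKET_CAPACITY = 1

def build_buckets(bits):
    """Place every record in the bucket addressed by its first `bits` hash bits."""
    buckets = {}
    for name, hash_bits in records.items():
        buckets.setdefault(hash_bits[:bits], []).append(name)
    return buckets

bits = 1
buckets = build_buckets(bits)            # D1 and D3 collide at address '1'
while any(len(b) > BUCKET_CAPACITY for b in buckets.values()):
    bits += 1                            # grow to 2-bit addresses and rehash
    buckets = build_buckets(bits)

print(bits, buckets)                     # 2 {'10': ['D1'], '01': ['D2'], '11': ['D3']}
```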
• Clustered File Organization
• A cluster is formed when two or more related tables are stored within
the same file. The related column of the two or more tables in the
cluster is called the cluster key, and this cluster key is used to map
the tables together.
• This method minimizes the cost of accessing and searching the various
records, because related records are combined and available in a single cluster.
• Example:
• Suppose we have two tables named Student and Subject. The two
tables are related to each other as shown below.

Student
Student_ID   Student_Name   Student_Age   Subject_ID
101          John           20            C01
102          Robert         20            C04
103          Anik           21            C01
104          James          22            C02
105          Trump          21            C03
106          Charles        20            C04
107          Deny           20            C03
108          Varun          21            C04

Subject
Subject_ID   Subject_Name
C01          Math
C02          Java
C03          C
C04          DBMS

Therefore, the two tables ‘Student’ and ‘Subject’ can be combined using a
join operation and stored in the cluster file as follows, with Subject_ID
as the cluster key:

Subject_ID   Subject_Name   Student_ID   Student_Name   Student_Age
C01          Math           101          John           20
                            103          Anik           21
C02          Java           104          James          22
C03          C              105          Trump          21
                            107          Deny           20
C04          DBMS           102          Robert         20
                            106          Charles        20
                            108          Varun          21
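A minimal sketch of how the cluster groups Student rows under their Subject rows by the cluster key Subject_ID, using the tables above represented as lists of tuples (a representation assumption; it does not show the physical block layout):

```python
# Minimal sketch: grouping Student records under Subject records by the
# cluster key Subject_ID, as in the cluster file shown above.

subjects = [("C01", "Math"), ("C02", "Java"), ("C03", "C"), ("C04", "DBMS")]
students = [
    (101, "John", 20, "C01"), (102, "Robert", 20, "C04"),
    (103, "Anik", 21, "C01"), (104, "James", 22, "C02"),
    (105, "Trump", 21, "C03"), (106, "Charles", 20, "C04"),
    (107, "Deny", 20, "C03"), (108, "Varun", 21, "C04"),
]

# Build the cluster: one entry per cluster-key value, holding the subject row
# and every student row that shares that Subject_ID.
cluster = {}
for subject_id, subject_name in subjects:
    cluster[subject_id] = {"subject": subject_name, "students": []}
for student_id, name, age, subject_id in students:
    cluster[subject_id]["students"].append((student_id, name, age))

for key, entry in cluster.items():       # records for one subject sit together
    print(key, entry["subject"], entry["students"])
```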
INDEXING
• Indexing is used to optimize the performance of a database by
minimizing the number of disk accesses required when a query is
processed.
• The index is a type of data structure. It is used to locate and access
the data in a database table quickly.
• Index structure:
• Indexes can be created using one or more database columns.
• The first column of the index is the search key, which contains a copy
of the primary key or candidate key of the table. These key values are
stored in sorted order so that the corresponding data can be accessed
easily.
• The second column of the index is the data reference. It contains a
set of pointers holding the address of the disk block where the value
of that particular key can be found.
Indexing methods
Ordered indices

• The indices are usually sorted to make searching faster. Indices which
are sorted are known as ordered indices.
• Example: Suppose we have an employee table with thousands of records,
each of which is 10 bytes long. If their IDs start with 1, 2, 3, ... and so on,
and we have to search for the employee with ID 543:
• In the case of a database with no index, we have to scan the disk blocks
from the start until we reach record 543. The DBMS will find the record
after reading 543 × 10 = 5430 bytes.
• In the case of an index, we search the index instead, and the DBMS will find
the record after reading 542 × 2 = 1084 bytes, which is far less than in
the previous case.
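The byte counts in this example can be checked with a line of arithmetic, assuming 10-byte data records and 2-byte index entries (the 2-byte entry size is an assumption implied by the figures, not stated on the slide):

```python
# Worked check of the example's numbers.
# Assumptions: 10-byte data records, 2-byte index entries, target ID 543.

record_size, index_entry_size, target = 10, 2, 543

bytes_without_index = target * record_size          # scan records 1..543
bytes_with_index = (target - 1) * index_entry_size  # scan 542 index entries
print(bytes_without_index, bytes_with_index)        # 5430 1084
```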
Primary Index
• If the index is created on the basis of the primary key of the table,
then it is known as primary indexing.
• These primary keys are unique to each record, so there is a 1:1 relation
between index entries and records.
• As primary keys are stored in sorted order, the performance of the
searching operation is quite efficient.
• The primary index can be classified into two types:
• Dense index and Sparse index.
Dense index
• The dense index contains an index record for every search key value in
the data file. It makes searching faster.
• In this, the number of records in the index table is the same as the
number of records in the main table.
• It needs more space to store the index records themselves. The index
records hold the search key and a pointer to the actual record on disk.
Sparse index
• In a sparse index, index records appear for only some of the items in
the data file. Each index entry points to a block.
• Instead of pointing to every record in the main table, the index points
to records in the main table at intervals (a combined sketch of dense
and sparse indexes follows below).
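A minimal sketch contrasting a dense index (one entry per record) with a sparse index (one entry per block); the keys, block size of 3 and pointer format are all illustrative assumptions:

```python
# Minimal sketch: dense vs. sparse index over a sorted data file.
# Assumption: the data file is a sorted list of (key, row) pairs split into
# blocks of 3 records; pointers are modeled as (block_no, offset) pairs.

data = [(k, f"row-{k}") for k in (10, 20, 30, 40, 50, 60, 70)]
BLOCK_SIZE = 3
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

# Dense index: one entry for every search-key value in the file.
dense_index = {key: (b, o) for b, block in enumerate(blocks)
               for o, (key, _) in enumerate(block)}

# Sparse index: one entry per block, pointing at the block's first key.
sparse_index = [(block[0][0], b) for b, block in enumerate(blocks)]

def sparse_lookup(key):
    """Find the last indexed block whose first key <= key, then scan that block."""
    candidate = 0
    for first_key, block_no in sparse_index:
        if first_key <= key:
            candidate = block_no
    for k, row in blocks[candidate]:
        if k == key:
            return row
    return None

print(dense_index[40])       # direct pointer from the dense index
print(sparse_lookup(40))     # one index probe plus a scan of a single block
```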
Clustering Index
• A clustered index can be defined as an ordered data file. Sometimes
the index is created on non-primary key columns which may not be
unique for each record.
• In this case, to identify the records faster, we group two or more
columns to get a unique value and create an index out of them. This
method is called a clustering index.
• Records which have similar characteristics are grouped together, and
indexes are created for these groups.
BIG DATA
• Big data is a collection of massive and complex data sets of very large
data volume, spanning huge quantities of data, data management
capabilities, social media analytics and real-time data.
• Big data is about data volume and large data sets measured in terms
of terabytes or petabytes.
• This phenomenon is called big data.
• The challenges include capturing, analysis, storage, searching, sharing,
visualization, transferring and privacy violations.
• It can neither be worked on using traditional SQL queries, nor
can a relational database management system (RDBMS) be used for
storage.
• However, a wide variety of scalable database tools and techniques has
evolved.
• Hadoop, an open-source distributed data processing framework, is one
of the prominent and well-known solutions.
• NoSQL offers non-relational databases, with the likes of MongoDB and
Apache CouchDB.
Types of Big Data
• Structured – Any data that can be stored, accessed and processed in a
fixed format is termed ‘structured’ data.
• E.g. a table with column names.
• Unstructured – Any data with an unknown form or structure is classified
as unstructured data.
• E.g. Google search results.
• Semi-structured – Semi-structured data can contain both forms of
data. Semi-structured data looks structured in form, but it is not
actually defined by, for example, a table definition as in a relational DBMS.
• E.g. data represented in an XML file (a small example follows below).
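A small illustration of semi-structured data: an XML fragment parsed with Python's standard library. The element names are invented for the example; the point is that the data carries tags but no fixed relational schema:

```python
# Minimal sketch: semi-structured data as XML, read without any table schema.
import xml.etree.ElementTree as ET

xml_data = """
<students>
  <student id="101"><name>John</name><age>20</age></student>
  <student id="102"><name>Robert</name></student>  <!-- fields can vary per element -->
</students>
"""

root = ET.fromstring(xml_data)
for student in root.findall("student"):
    name = student.findtext("name")
    age = student.findtext("age")        # may be missing: there is no rigid schema
    print(student.get("id"), name, age)
```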
NOSQL
• NoSQL (originally referring to “non-SQL”) is a class of databases that
provides a mechanism for the storage and retrieval of data.
• This data is modeled in means other than the tabular relations used in
relational databases.
• NoSQL databases are used in real-time web applications and big data,
and their use is increasing over time.
• NoSQL systems are also sometimes called “Not only SQL” to emphasize
the fact that they may support SQL-like query languages.
Types of NoSQL databases:
• MongoDB falls in the category of NoSQL document-based databases.
• Key-value store: Memcached, Redis, Coherence
• Tabular: HBase, BigTable, Accumulo
• Document-based: MongoDB, CouchDB, Cloudant (a small document
example follows below)
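A small sketch of what documents in a document-based store such as MongoDB look like, represented here simply as Python dictionaries / JSON rather than through any driver API (the field names are illustrative):

```python
# Minimal sketch: documents in a document-based NoSQL store, modeled as plain
# Python dicts / JSON. No driver API is used; this only shows the data model.
import json

student_documents = [
    {"_id": 101, "name": "John", "age": 20, "subjects": ["Math"]},
    {"_id": 102, "name": "Robert", "subjects": ["DBMS", "Java"]},  # fields can differ per document
]

# A simple "query": find documents whose subjects contain "Java".
matches = [doc for doc in student_documents if "Java" in doc.get("subjects", [])]
print(json.dumps(matches, indent=2))
```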
Advantages of NOSQL
• High scalability 
• High availability 
Disadvantages of NOSQL
• Narrow focus – mainly focused on storage; not much other functionality
is provided
• Open-source
• Management challenges
• GUI is not available
• Backup
• Large document size
