File Processing and Organization
Dr. Nehal Nabil Hassan Mostafa
References
1- Introduction to File Processing
Data Structures vs. File Structures
►Both involve: Representation of Data + Operations for accessing data
►Difference:
–Data Structures deal with data in main memory
–File Structures deal with data in secondary storage device (File).
Computer Architecture: Memory hierarchy
CPU
Register, Cache: Fast, Small, Expensive, Volatile
RAM (Main Memory)
HDD, SSD, CD (Secondary storage): Slow, Large, Cheap, Stable
Moving down the hierarchy: increase in capacity & access time
Moving up the hierarchy: increase in cost per byte
On systems with 32-bit addressing, only 2^32 bytes can be
directly referenced in main memory.
The number of data objects may exceed this number!
Data must be maintained across program executions. This
requires storage devices that retain information when the
computer is restarted.
- We call such storage nonvolatile.
-Primary storage is usually volatile, whereas secondary and
tertiary storage are nonvolatile.
How Fast?
►Typical times for getting info
– Main memory: ~120 nanoseconds = 120 x 10^-9 seconds
– Magnetic Disks: ~30 milliseconds = 30 x 10^-3 seconds
►An analogy keeping same time proportion as above (a factor of 250,000)
– Looking at the index of a book: 20 seconds
VS
– Going to the library: 58 days
Comparison

Main Memory                          Secondary Storage
Fast (since electronic)              Slow (since electronic and mechanical)
Small (since expensive)              Large (since cheap)
Volatile (information is lost        Stable, persistent (information is
when power failure occurs)           preserved longer)
The goal of the Course
Minimize the number of trips to the disk in order to get desired
information. Ideally get what we need in one disk access, or get it with as
few disk accesses as possible.

Grouping related information so that we are likely to get everything we
need with only one trip to the disk (e.g. name, address, phone number,
account balance).

Locality of Reference in Time and Space

Good file organization and processing
►Fast access to great capacity
►Reduce the number of disk accesses by collecting data into buffers, blocks or buckets
►Manage growth by splitting these collections
History of File Processing Design
1- In the beginning… it was the tape
–Sequential access
–Access cost proportional to size of file
[Analogy to sequential access to array data structure]
2- Disks became more common
–Direct access
[Analogy to access to position in array]
–Indexes were invented
•list of keys and pointers stored in small file
•allows direct access to a large primary file
Great if index fits into main memory.
As file grows we have the same problem we had with a
large primary file
3- Tree structures emerged for main memory (1960's)
–Binary search trees (BSTs)
–Balanced, self adjusting BSTs: e.g. AVL trees (1963)
4- A tree structure suitable for files was invented:
B trees (early 1970's, in widespread use by 1979) and B+ trees
good for accessing millions of records with 3 or 4 disk accesses.
5- What about getting info with a single request?
Hashing Tables (Theory developed over 60's and 70's but
still a research topic)
good when files do not change too much in time.
Expandable, dynamic hashing (late 70's and 80's): one or
two disk accesses even if file grows dramatically
2- Fundamentals of File Processing Operations
What is a File?
A collection of data placed under permanent or non-volatile storage.

Examples: anything that you can store on a disk, hard drive, tape,
optical media, and any other medium which doesn't lose the
information when the power is turned off.

Notice that this is only an informal definition!
Where do File Structures fit in CS?
[Figure: layers, top to bottom: Applications; File Processing and DBMS; Operating systems; Hardware]


Physical File vs Logical File
Physical file: physically exists on secondary storage; known by the
operating system; appears in its file directory.
Logical file: what your program actually uses; a 'pipe' through which
information can be extracted, or sent.

The program (application) sends (or receives) bytes to (from) a file through the
logical file. The program knows nothing about where the bytes go (came from).
The operating system is responsible for associating a logical file in a program
with a physical file on disk or tape. Writing to or reading from a file in a program
is done through the operating system.
Note that from the program's point of view, input devices (keyboard) and output
devices (console, printer, etc.) are treated as files: places where bytes come
from or are sent to.
There may be thousands of physical files on a disk, but a program can have only
about 20 logical files open at the same time.
The physical file has a name, for instance myfile.txt
The logical file has a logical name used for referring to the file inside the
program. This logical name is a variable inside the program, for instance
outfile.
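The naming distinction can be sketched in C++ as follows. The names myfile.txt and outfile come from the text above; the helper functions write_line and read_first_line are hypothetical and exist only for this illustration.

```cpp
#include <fstream>
#include <string>

// Writes one line to the physical file named by `physical_name`.
// `outfile` is the logical file: the program only ever talks to this variable.
bool write_line(const std::string& physical_name, const std::string& text) {
    std::ofstream outfile(physical_name);  // the OS links logical to physical here
    if (!outfile) return false;            // the link could not be established
    outfile << text << '\n';               // bytes are sent through the logical file
    return true;                           // the destructor closes the file
}

// Reads the first line back through a second logical file, `infile`.
std::string read_first_line(const std::string& physical_name) {
    std::ifstream infile(physical_name);
    std::string line;
    std::getline(infile, line);
    return line;
}
```

Calling write_line("myfile.txt", "hello") and then read_first_line("myfile.txt") uses two different logical files for the same physical file.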
Basic file operations: Opening a file
Opening a file, basically, links a logical file to a physical file.
–On open, the O/S performs a series of operations that
end in the program that is trying to open the file being
assigned a file descriptor.
–Additionally, the O/S will perform particular
operations on the file at the request of the calling
program; these operations are intended to 'initialize'
the file for use by the program.
►Two options for opening a file:
–Open an existing file
–Create a new file
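The mode
The second argument of open() gives the mode; ios::in opens the file for input.
Example
    #include <fstream>
    #include <iostream>
    using namespace std;

    int main() {
        char c;
        fstream infile;
        infile.open("account.txt", ios::in);  // open an existing file for reading
        if (!infile.is_open()) {              // the file could not be opened
            cerr << "cannot open account.txt" << endl;
            return 1;
        }
        infile.unsetf(ios::skipws);           // do not skip whitespace characters
        infile >> c;                          // read the first character
        while (!infile.fail()) {              // stop at end of file or on error
            cout << c;                        // echo the character
            infile >> c;                      // read the next character
        }
        infile.close();
        return 0;
    }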
Basic file operations: Closing a file
–Cuts the link between physical and logical files
–Upon closing, the OS takes care of
'synchronizing' the contents of the file, e.g.
often a buffer is used, and the buffer content
needs to be written to the file.
In general, files are automatically closed when
the program ends.
Basic file operations: Reading and Writing
–Basic I/O operations.
–Usually require three parameters: a logical
file, an address, and the amount of data that
is to be read or written.
Additional file operations
►Seeking: source file, offset.
►Detecting the end of a file
►Detecting I/O error
Seeking with C++ Stream Classes
A fstream has 2 file pointers: get pointer (for input) & put pointer (for output)
file1.seekg(byte_offset, origin); //moves get pointer
file1.seekp(byte_offset, origin); //moves put pointer
origin can be ios::beg (beginning of file)
ios::cur (current position)
ios::end (end of file)
3- Managing Files of Records
File types
A file can be treated as
a stream of bytes
a collection of records with fields
Stream
Every stream has an associated file position
When we open a file, the file position is set to the beginning
The first fread() will read 8 into c and increment the file position
The 38th fread() will read the newline character (referred to as
'\n' in C/C++) into c and increment the file position.
The 39th fread() will read 0 into c and increment the file
position, and so on.
Field and Record Organization
►Field: a data value, smallest unit of data with logical meaning
►Record: A group of fields that forms a logical unit
► Key: a subset of the fields in a record used to uniquely identify the
record
In our example, “example.txt” contains information about books:
Each line of the file is a record.
Fields in each record:
–ISBN Number
–Author Name
–Book Title
In order to manage fields in a file, we need to include
information to identify where one field ends and the next one
begins.
In this case, you might use capital letters to mark field
separators, but that would not work for names like O’Leary or
MacAllen. There are four common methods to delimit fields in
a file.
Methods for Organizing Fields
Fixed length
Begin each field with its Length indicator
Delimiters to separate fields
“keyword=value” pair identifies each field and its content
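The four methods can be compared side by side on the same record. The sketch below is only illustrative: the function names, the '|' delimiter, and the two-digit decimal length prefix are assumptions, not choices made by the slides.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Fields = std::vector<std::string>;

// 1. Fixed length: pad (or truncate) every field to exactly `width` bytes.
std::string fixed_length(const Fields& fields, std::size_t width) {
    std::string out;
    for (const auto& f : fields) {
        std::string cell = f.substr(0, width);
        cell.resize(width, ' ');
        out += cell;
    }
    return out;
}

// 2. Length indicator: a two-digit length precedes each field (fields under 100 bytes).
std::string length_prefixed(const Fields& fields) {
    std::string out;
    for (const auto& f : fields) {
        out += (f.size() < 10 ? "0" : "") + std::to_string(f.size());
        out += f;
    }
    return out;
}

// 3. Delimiters: a separator character follows every field.
std::string delimited(const Fields& fields, char delim = '|') {
    std::string out;
    for (const auto& f : fields) out += f + delim;
    return out;
}

// 4. keyword=value: every field carries its own name.
std::string keyword_value(const std::vector<std::pair<std::string, std::string>>& fields,
                          char delim = '|') {
    std::string out;
    for (const auto& kv : fields) out += kv.first + "=" + kv.second + delim;
    return out;
}
```

For the fields "Ames" and "Mary", the four encodings are "Ames  Mary  " (width 6), "04Ames04Mary", "Ames|Mary|", and "last=Ames|first=Mary|".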
Identifying records
Once records are positioned in a file, a related question arises: when
we are searching for a target record, how can we identify it? That
is, how do we distinguish the record we want from the other records in
the file?
Primary keys are used to uniquely identify a record in a file. Data fields
such as names make poor primary keys: they cannot guarantee uniqueness
between records, and changing them can cause costly system updates.
A better approach is to generate a non-data field for
each record, like a student ID.
Primary and Secondary Keys
Primary Key: A key that uniquely identifies a record.
Secondary key: Other keys that may be used for search.
►Note that in general not every field is a key. Keys correspond to fields, or
combinations of fields, that may be used in a search.
FILE ACCESS METHODS
Search for a record matching a given key
1.Sequential Search
Look at records sequentially until matching record is found. Time is
in O(n) for n records.
Appropriate for Pattern matching, file with few records
2. Direct Access
We might prefer to jump directly to the location of a target record, then
read its contents.
Time is in O(1) for n records.
One example of direct access you will immediately recognize is array
indexing.
Direct access
First, we need fixed-length records, since we need
to know how far to offset from the front of the file
to find the i-th record.
Second, we need some way to convert a record’s
key into an offset location.
Finding Information Fast
If we have a sorted file, we can perform a binary search
to locate information; this is much faster than sequentially
looking at each record! (Recall that sequential search is O(n),
while binary search is O(lg n).)
but....
►The file must be sorted, and maintaining this
property is very expensive.
►Records must be fixed length, otherwise we cannot
jump directly to the i-th record in the file.
►Binary search still requires more than one or two
seeks to find a record, even on moderately sized
files.
3. An index: a list of pairs (key, reference), sorted by key
It provides an efficient way to access all the data blocks or records
within a large file without having to search the entire file for the
data.
An index:
►Allows direct fast access to files
►Eliminates the need to re-organize or sort the file (files can be entry sequenced)
►Provides direct access for files with variable length records
►Provides multiple access paths to the file
►Imposes an order on a file without rearranging the file
Index of a File of Books
Primary Index
Contains a primary key in canonical form,
and a pointer to a record in the file
Each entry in the primary index identifies
uniquely a record in the file
Designed to support binary search on the
primary key
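A primary index held in memory can be sketched as a sorted vector of (key, offset) pairs searched with std::lower_bound. The canonical form chosen here (upper case, spaces removed) is an assumption, as are the names IndexEntry, canonical, sort_index and lookup.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// One index entry: a primary key and a reference (byte offset) into the data file.
struct IndexEntry {
    std::string key;
    long offset;
};

// Canonical form assumed for this sketch: upper case, with spaces removed.
std::string canonical(const std::string& raw) {
    std::string out;
    for (unsigned char ch : raw)
        if (ch != ' ') out += static_cast<char>(std::toupper(ch));
    return out;
}

// Index creation step: sort the entries by key so binary search applies.
void sort_index(std::vector<IndexEntry>& index) {
    std::sort(index.begin(), index.end(),
              [](const IndexEntry& a, const IndexEntry& b) { return a.key < b.key; });
}

// Binary search on the primary key. Returns the byte offset, or -1 if absent.
long lookup(const std::vector<IndexEntry>& index, const std::string& raw_key) {
    const std::string key = canonical(raw_key);
    auto it = std::lower_bound(index.begin(), index.end(), key,
                               [](const IndexEntry& e, const std::string& k) { return e.key < k; });
    return (it != index.end() && it->key == key) ? it->offset : -1;
}
```

The data file itself stays in entry order; only the small index is sorted, and the returned offset is then used for a single seek into the data file.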
Basic Operations
on Indexes
Index creation
Index loading
Updating of index files
Record additions /
deletions / updates
Use of Multiple Indexes
►Provides multiple views of a data file
►Allows us to search for particular values within fields
that are not primary keys
►Allows us to search using combinations of secondary
/ primary keys
►Each entry in a secondary index contains a key value
and a primary key (or list of primary keys).
Secondary Key

►Does not identify records uniquely


►It is not dataless
► Has a canonical form (i.e.there are
restrictions on the values that the key must
take)
Secondary Index Structure
►List of secondary keys, sorted first by value of the
secondary key, and then by the value of the primary key
►Updates to the file must now be applied on the
secondary indexes as well.
►The fact that we store primary keys instead of pointers
into the file minimizes the impact of file updates on the
secondary index.
Deletion of a Record
►Change only data file and primary index
►A later search by secondary key finds the primary key, then
searches for the p.k. in the primary index
---> record-not-found
►So we are saved from reading wrong data
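The interaction between the two indexes on deletion can be sketched in memory. Here the secondary index is a set of (secondary key, primary key) pairs, which keeps it sorted first by secondary key and then by primary key; the struct and function names are hypothetical, and the data file itself is not modelled.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Indexes {
    // Primary index: primary key -> byte offset in the data file.
    std::map<std::string, long> primary;
    // Secondary index: (secondary key, primary key), sorted by both.
    std::set<std::pair<std::string, std::string>> secondary;
};

// Search by secondary key: the secondary index yields primary keys, and the
// primary index turns each one into an offset. A primary key that is no
// longer in the primary index means record-not-found, so it is skipped.
std::vector<long> find_by_secondary(const Indexes& ix, const std::string& sk) {
    std::vector<long> offsets;
    auto it = ix.secondary.lower_bound(std::make_pair(sk, std::string()));
    for (; it != ix.secondary.end() && it->first == sk; ++it) {
        auto p = ix.primary.find(it->second);
        if (p != ix.primary.end()) offsets.push_back(p->second);
    }
    return offsets;
}

// Deletion touches only the primary index (and the data file).
void delete_record(Indexes& ix, const std::string& pk) {
    ix.primary.erase(pk);
}
```

After delete_record, the stale secondary entry is still present, but a search through it ends in record-not-found instead of returning an offset that now holds other data.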
Update a Record
►Change secondary key: rearrange secondary index
►Change primary key: rearrange primary index and
rewrite reference fields of secondary index (no
rearrangement)
►Change other fields: no effect on secondary index
Improving Secondary Indexes
►We can store several primary keys per row in the
secondary index
—This, however, wastes space for some records, and is
not sufficient for other secondary keys.
►We can store a pointer to a linked list of primary keys
—We want these lists to be stored in a file, and to be easy
to manage; hence, the inverted list
Inverted Lists
►Solve the problems associated with the
variability in the number of references a
secondary key can have
►Greatly reduce the need to reorganize /
sort the secondary index
►Store primary keys in the order they are
entered; they do not need to be sorted
►The downside is that references for one
secondary key are spread across the
inverted list
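An in-memory sketch of the idea: the secondary index stores one head position per secondary key, and the inverted list is a single array of nodes appended in entry order and chained by position. The names and the choice to link new references at the head of each chain are assumptions made for this illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One node of the inverted list: a primary key and the position of the next
// node for the same secondary key (-1 marks the end of the chain).
struct ListNode {
    std::string primary_key;
    int next;
};

struct InvertedIndex {
    std::map<std::string, int> heads;  // secondary key -> position of first node
    std::vector<ListNode> nodes;       // the inverted list, in entry order

    // Adds a reference: append at the end, then make it the head of its chain.
    void add(const std::string& secondary_key, const std::string& primary_key) {
        auto it = heads.find(secondary_key);
        int old_head = (it == heads.end()) ? -1 : it->second;
        nodes.push_back({primary_key, old_head});
        heads[secondary_key] = static_cast<int>(nodes.size()) - 1;
    }

    // Follows the chain for one secondary key and collects its primary keys.
    std::vector<std::string> lookup(const std::string& secondary_key) const {
        std::vector<std::string> out;
        auto it = heads.find(secondary_key);
        for (int i = (it == heads.end()) ? -1 : it->second; i != -1;
             i = nodes[static_cast<std::size_t>(i)].next)
            out.push_back(nodes[static_cast<std::size_t>(i)].primary_key);
        return out;
    }
};
```

Adding a reference never moves existing nodes, which is why no re-sorting is needed; the cost is that following one chain may jump around the list, as the last bullet notes.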
Some Notes
►Even though it is preferred to store lists of primary
keys, under certain circumstances it could be better to store
pointers into the file.
-When access speed is critical
-When the file is static (does not suffer updates, or
updates are very rare)
►Consider also that there is a safety issue related to having to
propagate updates to the file to several indexes; the updating
algorithm must be robust to different types of failure.
