Database System Concepts and Architecture
DATABASE
A database is an integrated collection of well-defined data and information, centrally controlled in all its aspects, created and stored in a structure typical for an organization. Typically a database is made up of many linked tables of rows and columns, each containing specific data (e.g. name, address, location, size, weight, dates).
Field: each column represents a field. Record: each row represents one record. Table: a collection of records.
The database is a collection of data on which we can perform operations such as: storing data (save), manipulating data (update, delete, insert, sort), and retrieving data (whenever necessary, we can see the data in a particular table form, or use selected data, as in a computer program).
Why Database?
Conventionally, in an information system, the information is obtained by developing the systems and integrating them. This calls for breaking the system into various subsystems and developing the information systems independently. In this approach, each system has its own master files and transaction files. The file layouts and access methods may differ from system to system. This approach degrades the quality of information across the systems for various reasons. A typical file processing system has the following disadvantages:
1) Data redundancy and inconsistency: Since files are created for each application separately, the files are likely to have different formats and designs. Hence the same data record may be present in more than one file, its creation, update and deletion managed by different programmers. Over a period of time, the data becomes redundant and inconsistent because changes are not incorporated simultaneously in all copies.
2) Difficulty in accessing data: In conventional system design, the file structure is tailored to specific information needs. If the information needs change, gaining access to the data present in different files requires writing new application programs each time. This is difficult and time consuming.
3) Concurrent access anomalies: File systems are incapable of supervising and coordinating the changes arising from concurrent access to a record. Two users may access the same record within seconds of each other, so the information retrieved may not be current.
4) Security problems: File systems have limited means of controlling access to records, leaving the information insecure. Since application programs are written again and again, it is difficult to enforce a uniform discipline of security constraints across all the applications.
5) Integrity of the data: Integrity rules are added when the programs are written. If the rules change, it is very difficult to ensure that the change is enforced across the files in all the applications.
Database - Advantages & Disadvantages
Advantages
Reduced data redundancy
Reduced updating errors and increased consistency
Greater data integrity and independence from application programs
Improved data access for users through use of host and query languages
Improved data security
Reduced data entry, storage, and retrieval costs
Facilitated development of new application programs
Disadvantages
Database systems are complex, difficult, and time-consuming to design
Substantial hardware and software start-up costs
Damage to the database affects virtually all application programs
Extensive conversion costs in moving from a file-based system to a database system
Initial training required for all programmers and users
DBMS: a set of computer programs that controls the creation, maintenance, and use of a database.
It allows organizations to place control of database development in the hands of database administrators (DBAs) and other specialists. A DBMS is a system software package that facilitates the use of an integrated collection of data records and files known as a database. It allows different user application programs to easily access the same database.
Data Models, Schemas and Instances: different views of the database.
Data Abstraction: the suppression of details of data organization and storage, and the highlighting of essential features, for better understanding.
Data Model: a set of concepts to describe the structure of a database, together with certain constraints that the database should obey. A data model helps to achieve data abstraction: it is a collection of concepts that can be used to describe the structure of a database, and it provides the necessary means to achieve that abstraction. The structure of a database is characterized by data types, relationships, and constraints that hold for the data. Models also include a set of operations for specifying retrievals and updates. Data models are evolving to include concepts that specify the behaviour of the database application, allowing designers to specify a set of user-defined operations on the data.
Logical schema: describes the semantics of the data, as represented by a particular data manipulation technology. This consists of descriptions of tables and columns, object-oriented classes, and XML tags, among other things.
Physical schema: describes the physical means by which data are stored. This is concerned with partitions, CPUs, tablespaces, and the like.
Database State: the data in the database at a particular moment in time.
Initial Database State: refers to the database state when the database is first loaded into the system.
Valid State: a state that satisfies the structure and constraints of the database.
Schema and State Distinction: the database schema changes very infrequently, whereas the database state changes every time the database is updated. The schema is also called the intension, whereas a state is also called an extension.
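A minimal sketch of the distinction, using Python's built-in sqlite3 module (the student table and its rows are invented for illustration): the schema is defined once, while the state changes with every update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema (intension): defined once, changes very infrequently.
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT)")

# The state (extension) starts empty and changes with every update.
conn.execute("INSERT INTO student VALUES (1, 'Asha')")
conn.execute("INSERT INTO student VALUES (2, 'Ravi')")

# The current state: the set of tuples in the database right now.
print(conn.execute("SELECT * FROM student").fetchall())
# [(1, 'Asha'), (2, 'Ravi')]
```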
Database Models: A database model is the theoretical foundation of a database and fundamentally
determines the manner in which data can be stored, organized and manipulated in a database system. The following database models are common in the industry.
Hierarchical Model: a data model in which the data is organized into a tree-like structure. The structure allows repeating information using parent/child relationships: each parent can have many children, but each child has only one parent (also known as a 1:many relationship). All attributes of a specific record are listed under an entity type.
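A minimal sketch of the idea in Python (the company/department/employee hierarchy is invented for illustration): each child record holds a reference to exactly one parent, so the data forms a tree.

```python
# Hierarchical model: each child has exactly one parent (1:many).
class Node:
    def __init__(self, record, parent=None):
        self.record = record           # the record's attributes
        self.parent = parent           # exactly one parent (or None for the root)
        self.children = []             # a parent may have many children
        if parent is not None:
            parent.children.append(self)

company = Node({"name": "Acme"})
sales = Node({"name": "Sales"}, parent=company)
emp = Node({"name": "Asha"}, parent=sales)

# Navigation follows the tree: from a child up to its single parent.
print(emp.parent.record["name"])   # Sales
```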
(Figure: example of a hierarchical model.) In a database, an entity type is the equivalent of a table; each individual record is represented as a row and an attribute as a column.
Network Model: interconnects the entities of an organization into a network. The network model organizes data using two fundamental constructs, called records and sets. Records contain fields (which may be organized hierarchically, as in the programming language COBOL). Sets define one-to-many relationships between records: one owner, many members. A record may be an owner in any number of sets, and a member in any number of sets. The operations of the network model are navigational in style: a program maintains a current position, and navigates from one record to another by following the relationships in which the record participates. Records can also be located by supplying key values. Although it is not an essential feature of the model, network databases generally implement the set relationships by means of pointers that directly address the location of a record on disk. This gives excellent retrieval performance, at the expense of operations such as database loading and reorganization.
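A minimal sketch of the network model in Python (the department/customer/order records and set names are invented for illustration): sets are named owner-to-member links, and a program navigates from record to record rather than issuing declarative queries.

```python
# Network model: records linked by named sets (one owner, many members).
class Record:
    def __init__(self, **fields):
        self.fields = fields
        self.sets = {}                  # set name -> list of member records

    def connect(self, set_name, member):
        self.sets.setdefault(set_name, []).append(member)

dept = Record(name="Sales")
cust = Record(name="Acme Corp")
order = Record(number=42)

dept.connect("DEPT-ORDERS", order)      # the same order record is a member of
cust.connect("CUST-ORDERS", order)      # two sets, owned by two different records

# Navigational access: move from a current record along one of its sets.
current = dept
for member in current.sets["DEPT-ORDERS"]:
    print(member.fields["number"])      # 42
```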
Relational Model: in a relational DBMS, the concept of a two-dimensional table is used to represent relations. Three key terms are used extensively in relational database models: relations, attributes, and domains. A relation is a table with columns and rows. The named columns of the relation are called attributes, and the domain is the set of values the attributes are allowed to take. The basic data structure of the relational model is the table, where information about a particular entity (say, an employee) is represented in rows (also called tuples) and columns. Thus, the "relation" in "relational database" refers to the various tables in the database; a relation is a set of tuples. The columns enumerate the various attributes of the entity (the employee's name, address or phone number, for example), and a row is an actual instance of the entity (a specific employee) that is represented by the relation. As a result, each tuple of the employee table represents various attributes of a single employee.
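A minimal sketch with Python's built-in sqlite3 module (the employee relation and its attributes are invented for illustration): the relation is a set of tuples, and each attribute draws its values from a declared domain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A relation (table) with named attributes; the type is the attribute's domain.
conn.execute("""CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    name    TEXT,
    phone   TEXT)""")

# Each row is a tuple: one instance of the employee entity.
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [(1, "Asha", "555-0101"), (2, "Ravi", "555-0102")])

for tup in conn.execute("SELECT emp_id, name, phone FROM employee"):
    print(tup)
```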
(Figure: the three-schema architecture: external views at the external level, a conceptual schema at the conceptual level, and an internal schema at the internal level, mapped down to the stored database.)
The goal of the three-schema architecture is to separate the user applications from the physical database. The schemas can be defined at the following levels:
1. The internal level has an internal schema which describes the physical storage structure of the database. It uses a physical data model and describes the complete details of data storage and access paths for the database.
2. The conceptual level has a conceptual schema which describes the structure of the database for users. It hides the details of the physical storage structures and concentrates on describing entities, data types, relationships, user operations and constraints. Usually a representational data model is used to describe the conceptual schema.
3. The external or view level includes external schemas or user views. Each external schema describes the part of the database that a particular user group is interested in and hides the rest of the database from that user group. It is also represented using a representational data model.
The three-schema architecture is used to visualize the schema levels in a database. The three schemas are only descriptions of data; the only data that actually exists is at the physical level. Each user group refers only to its own external schema. The DBMS must transform a request specified on an external schema into a request against the conceptual schema, and then into a request on the internal schema for processing over the stored database. The process of transforming requests and results between levels is called mapping.
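A minimal sketch of the external level with sqlite3 (the employee table and the view are invented for illustration): a view exposes only the part of the conceptual schema a user group needs, and the DBMS maps queries on the view to the underlying tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Conceptual schema: the full logical structure of the data.
conn.execute("CREATE TABLE employee (emp_id INTEGER, name TEXT, salary REAL)")
conn.execute("INSERT INTO employee VALUES (1, 'Asha', 52000)")

# External schema: a user view that hides the salary attribute.
conn.execute("CREATE VIEW emp_directory AS SELECT emp_id, name FROM employee")

# A query on the external schema is mapped by the DBMS to the conceptual schema.
print(conn.execute("SELECT * FROM emp_directory").fetchall())  # [(1, 'Asha')]
```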
Data Independence
The three-schema architecture further explains the concept of data independence: the capacity to change the schema at one level without having to change the schema at the next higher level. There are two types of data independence:
1. Logical data independence: the ability to change the conceptual schema without having to change the external schemas or application programs. When data is added or removed, only the view definitions and the mappings need to be changed in a DBMS that supports logical data independence. If the conceptual schema undergoes a logical reorganization, application programs that reference the external schema constructs must work as before.
2. Physical data independence: the ability to change the internal schema without having to change the conceptual schema. By extension, the external schemas should not change either. Physical file reorganization to improve performance (such as creating access structures) results in a change to the internal schema. If the same data as before remains in the database, the conceptual schema should not change. For example, providing an access path to improve the retrieval speed of section records by semester and year should not require the query to be changed, although the query should execute more efficiently by utilizing the access path.
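A minimal sketch of physical data independence with sqlite3 (the section table and the index are invented for illustration): adding an access structure changes only the internal schema, and the query text is unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE section (course TEXT, semester TEXT, year INTEGER)")
conn.execute("INSERT INTO section VALUES ('DB101', 'Fall', 2023)")

query = "SELECT course FROM section WHERE semester = ? AND year = ?"
print(conn.execute(query, ("Fall", 2023)).fetchall())

# Internal-schema change: an access path to speed up retrieval by semester/year.
conn.execute("CREATE INDEX idx_sec ON section (semester, year)")

# The same query still works unchanged; it can simply run faster now.
print(conn.execute(query, ("Fall", 2023)).fetchall())
```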
With a multi-level DBMS, the catalogue must be expanded to include information on how to map requests and data among the levels. The DBMS uses additional software to accomplish the mappings. Data independence occurs because, when the schema is changed at some level, the schema at the next higher level remains unchanged; only the mapping between the levels is changed.
DBMS Languages
DDL: the data definition language, used by the DBA and database designers to define the conceptual and internal schemas. The DBMS has a DDL compiler to process DDL statements, identify the schema constructs, and store the description in the catalogue. In databases where there is a separation between the conceptual and internal schemas, DDL is used to specify the conceptual schema, and SDL, the storage definition language, is used to specify the internal schema. For a true three-schema architecture, VDL, the view definition language, is used to specify the user views and their mappings to the conceptual schema; but in most DBMSs, the DDL is used to specify both the conceptual schema and the external schemas.
Once the schemas are compiled and the database is populated with data, users need to manipulate the database. Manipulations include retrieval, insertion, deletion and modification. The DBMS provides these operations through the DML, the data manipulation language. In most DBMSs, the DDL, VDL and DML are not considered separate languages, but one comprehensive integrated language for conceptual schema definition, view definition and data manipulation. Storage definition is kept separate to fine-tune performance, and is usually handled by the DBA staff. An example of such a comprehensive language is SQL, which combines the roles of DDL, VDL and DML, along with statements for constraint specification and more.
A low-level or procedural DML must be embedded in a general-purpose (host) programming language and typically retrieves individual records from the database, processing each one separately. Therefore it needs to use programming language constructs such as loops; low-level DMLs are also called record-at-a-time DMLs for this reason. High-level DMLs, such as SQL, can specify and retrieve many records in a single DML statement, and are called set-at-a-time or set-oriented DMLs. High-level languages are often called declarative, because the DML specifies what to retrieve rather than how to retrieve it.
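A minimal sketch of the contrast with sqlite3 (the emp table and data are invented for illustration): a record-at-a-time loop in the host language versus a single set-at-a-time statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, salary REAL)")
conn.executemany("INSERT INTO emp VALUES (?, ?)",
                 [("Asha", 52000), ("Ravi", 61000), ("Meera", 58000)])

# Record-at-a-time: the host-language loop processes one record per iteration.
total = 0
for (salary,) in conn.execute("SELECT salary FROM emp"):
    total += salary
print(total)

# Set-at-a-time: one declarative statement operates on the whole set at once.
print(conn.execute("SELECT SUM(salary) FROM emp").fetchone()[0])
```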
DDL Compiler: processes the schema definitions and stores the descriptions (meta-data) in the catalogue.
Runtime Database Processor: handles database access at runtime. It receives retrieval or update operations and carries them out on the database. Access to the disk goes through the stored data manager.
Query Compiler: handles high-level queries entered interactively. It parses, analyzes and interprets a query, then generates calls to the runtime processor for execution.
Precompiler: extracts DML commands from an application program written in a host language. The commands are sent to the DML compiler for compilation into code for database access; the rest of the program is sent to the host language compiler.
Client Program: accesses the DBMS running on a separate computer from the computer on which the database resides. The former is called the client computer, and the latter the database server. In some cases a middle level, called the application server, is added.
The main difference between centralized and distributed databases is that distributed databases are typically geographically separated, separately administered, and have slower interconnections. In distributed databases we also differentiate between local and global transactions. A local transaction is one that accesses data only from the site where the transaction originated. A global transaction, on the other hand, is one that either accesses data at a site different from the one at which the transaction was initiated, or accesses data at several different sites.
Advantages of distributed databases:
1. Management of distributed data with different levels of transparency.
2. Increased reliability and availability.
3. Easier expansion.
4. Reflects organizational structure: database fragments are located in the departments they relate to.
5. Local autonomy: a department can control its own data (as they are the ones familiar with it).
6. Protection of valuable data: if there were ever a catastrophic event such as a fire, the data would not all be in one place, but distributed across multiple locations.
7. Improved performance: data is located near the site of greatest demand, and the database systems themselves are parallelized, allowing load to be balanced among servers. (A high load on one module of the database won't affect the other modules in a distributed database.)
8. Economics: it costs less to create a network of smaller computers with the power of a single large computer.
9. Modularity: systems can be modified, added and removed from the distributed database without affecting other modules (systems).
10. Reliable transactions, due to replication of the database.
11. Hardware, operating system, network, fragmentation, DBMS, replication and location independence.
12. Continuous operation.
13. A single-site failure does not affect the performance of the system.
Disadvantages of distributed databases:
1. Complexity: extra work must be done by the DBAs to ensure that the distributed nature of the system is transparent. Extra work must also be done to maintain multiple disparate systems, instead of one big one. Extra database design work is needed to account for the disconnected nature of the database; for example, joins become prohibitively expensive when performed across multiple systems.
2. Economics: increased complexity and a more extensive infrastructure mean extra labour costs.
3. Security: remote database fragments must be secured, and since they are not centralized the remote sites must be secured as well. The infrastructure must also be secured (e.g., by encrypting the network links between remote sites).
4. Difficulty of maintaining integrity: in a distributed database, enforcing integrity over a network may require too much of the network's resources to be feasible.
5. Inexperience: distributed databases are difficult to work with, and as a young field there is not much readily available experience on proper practice.
6. Lack of standards: there are few tools or methodologies to help users convert a centralized DBMS into a distributed DBMS.
7. More complex database design: besides the normal difficulties, the design of a distributed database has to consider fragmentation of data, allocation of fragments to specific sites, and data replication.
8. Additional software is required.
9. The operating system should support a distributed environment.
10. Concurrency control is a major issue; it is solved by locking and timestamping.
(Figure: a centralized system, with terminals/PCs and display monitors connected over a network to a mainframe running the software (application programs, DBMS, text editors, compilers, etc.) on shared hardware (CPU, controller, memory, disk, I/O devices).)
DATA WAREHOUSE: The concept of the data warehouse (DW) emerges from the several sets of information which users need. The need has arisen from a change in the management style of different classes of users, who now need an organization-wide view of information. These needs are critical to the success of the business. Decision makers are required to react quickly to mission-critical needs due to rapidly changing, volatile and competitive markets. They need multidimensional support of information. They need information for strategic decisions. They need both internal and external information, which gives a larger view of a problem scenario. Such needs are fundamentally about patterns and trends, and require an enterprise view as against a functional, localized view of the subject. The DW is designed to meet these needs and delivers on them effectively.
There are three kinds of end users of information: the management, knowledge workers, and operations staff. The management needs a holistic view of a situation, including what is expected or predicted for the future. It helps to see what critical changes have taken place in the business, showing any patterns and factors affecting the change, and to use them to business advantage. The knowledge workers belong to the middle management level in the organizational hierarchy; their needs are multidimensional, depending on their role and position. The needs of operations staff are fulfilled through transaction processing systems, where the decision-making process is automated by embedding the rules in the system.
The data warehouse is defined by Bill Inmon as a collection of nonvolatile data on different business subjects and objects, which is time-variant, integrated from various sources and applications, and stored in a manner that makes quick analysis of business situations possible. Hence, data warehousing is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process.
Subject-oriented: the WH is organized around the major subjects of the enterprise... rather than the major application areas... This is reflected in the need to store decision-support data rather than application-oriented data.
Integrated: because the source data come together from different enterprise-wide application systems. The source data is often inconsistent... The integrated data source must be made consistent to present a unified view of the data to the users.
Time-variant: the source data in the WH is only accurate and valid at some point in time or over some time interval. The time-variance of the data warehouse is also shown in the extended time that the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots.
Non-volatile: data is not updated in real time but is refreshed from the operational systems on a regular basis. New data is always added as a supplement to the database, rather than as a replacement. The database continually absorbs this new data, incrementally integrating it with the previous data.
The first step in building a DW is to extract data from different sources. After this, the data needs to be validated for coding structures, names and formats. It is rationalized to a common unit of measure through a transformation or conversion process. Such data is then consolidated to a common reference level, such as end of month, region, zone, etc. The data so processed is then moved to the DW. All these processes are handled by middleware written to construct the DW. Middleware is a set of programs and routines which pulls data from various sources, checks and validates it, moves it from one platform to another, transforms it to the DW design specifications, and then loads it into the DW. Since the data in the DW is used directly for decision making, it must be delivered to the DW only after instituting QA measures on the data.
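A minimal sketch of such a middleware pipeline in Python (the source rows, field names and rules are invented for illustration): extract, validate, transform to a common unit, consolidate, then load.

```python
import sqlite3

# Extract: rows pulled from two sources with inconsistent formats.
source_rows = [
    {"region": "north", "amount": "1200", "unit": "INR"},
    {"region": "NORTH", "amount": "15",   "unit": "kINR"},  # thousands
]

def validate(row):
    # Check coding structures, names and formats before accepting a row.
    return row["region"].strip() != "" and row["amount"].isdigit()

def transform(row):
    # Rationalize to a common unit of measure and a common coding scheme.
    factor = 1000 if row["unit"] == "kINR" else 1
    return {"region": row["region"].title(),
            "amount": int(row["amount"]) * factor}

clean = [transform(r) for r in source_rows if validate(r)]

# Consolidate to a common reference level (here, per region) and load into the DW.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales_by_region (region TEXT, total INTEGER)")
totals = {}
for r in clean:
    totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]
dw.executemany("INSERT INTO sales_by_region VALUES (?, ?)", totals.items())
print(dw.execute("SELECT * FROM sales_by_region").fetchall())  # [('North', 16200)]
```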
Benefits of data warehousing include:
One consistent data store for reporting, forecasting, and analysis
Easier and timely access to data
Improved end-user productivity
Improved IS productivity
Reduced costs
Scalability
Flexibility
Reliability
Competitive advantage
Trend analysis and detection
Key ratio indicator measurement and tracking
Drill-down analysis
Problem monitoring
Executive analysis
There are three types of data in the data warehouse: base-level data, summary-level data, and metadata. Business data in the data warehouse can be stored in atomic form or in summary form. For example, sales data could be stored by product (that is, in atomic form) or summarized by product family.
Base-Level Data: contains historical data that is normalized. It is at the atomic level and is used to create summary-level data. Base-level data is also used to reconcile the data contained in the summary level with the operational data.
Summary-Level Data: contains historical data that is derived (i.e., summarized and aggregated) to support end-user reports and queries. It is accessed by the end user to perform decision-making activities.
The three currency features for business data are listed below (a small sketch of the three views follows the list):
Current data: a view of the business at the present time.
Point-in-time data: a snapshot of business data at a particular moment.
Periodic data: business data represented by periods, such as the last three years or the last 12 quarters, etc.
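A minimal sketch in Python with sqlite3 (the snapshot table, account and dates are invented for illustration): the same business fact viewed as current, point-in-time, and periodic data.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE balance_snapshot (as_of TEXT, account TEXT, balance REAL)")
dw.executemany("INSERT INTO balance_snapshot VALUES (?, ?, ?)", [
    ("2023-10-31", "A-1", 900.0),
    ("2023-11-30", "A-1", 950.0),
    ("2023-12-31", "A-1", 1000.0),
])

# Current data: the view at the present time (latest snapshot).
print(dw.execute("SELECT balance FROM balance_snapshot "
                 "ORDER BY as_of DESC LIMIT 1").fetchone())

# Point-in-time data: the snapshot at one particular moment.
print(dw.execute("SELECT balance FROM balance_snapshot "
                 "WHERE as_of = '2023-11-30'").fetchone())

# Periodic data: the series over a period (here, the last quarter of 2023).
print(dw.execute("SELECT as_of, balance FROM balance_snapshot "
                 "WHERE as_of BETWEEN '2023-10-01' AND '2023-12-31'").fetchall())
```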
(Figure: example of business data.)
Components of a Data Warehouse
LOAD MANAGEMENT relates to the collection of information from disparate internal or external sources. In most cases the loading process includes summarizing, manipulating and changing the data structures into a format that lends itself to analytical processing. Actual raw data should be kept alongside, or within, the data warehouse itself, thereby enabling the construction of new and different representations. A worst-case scenario, if the raw data is not stored, would be to reassemble the data from the various disparate sources around the organization simply to facilitate a different analysis.
WAREHOUSE MANAGEMENT relates to the day-to-day management of the data warehouse. The management tasks associated with the warehouse include ensuring its availability, the effective backup of its contents, and its security.
QUERY MANAGEMENT relates to the provision of access to the contents of the warehouse, and may include the partitioning of information into different areas with different privileges for different users. Access may be provided through custom-built applications or ad hoc query tools.
The architecture
(Figure: the data warehouse architecture: operational data sources feeding the warehouse manager, with detailed data held in the DBMS.)
Load manager: also called the front-end component, it performs all the operations associated with the extraction and loading of data into the warehouse. These operations include simple transformations of the data to prepare it for entry into the warehouse.
Warehouse manager: performs all the operations associated with the management of the data in the warehouse. These include analysis of data to ensure consistency, transformation and merging of source data, creation of indexes and views, generation of denormalizations and aggregations, and archiving and backing up data.
Query manager: also called the back-end component, it performs all the operations associated with the management of user queries. These include directing queries to the appropriate tables and scheduling the execution of queries.
Data stores: detailed data; lightly and highly summarized data; archive/backup data; meta-data.
End-user access tools: can be categorized into five main groups: data reporting and query tools, application development tools, executive information system (EIS) tools, online analytical processing (OLAP) tools, and data mining tools.
Data Mart
Data mart: a subset of a data warehouse that supports the requirements of a particular department or business function. The characteristics that differentiate data marts and data warehouses include:
A data mart focuses only on the requirements of users associated with one department or business function.
Data marts do not normally contain detailed operational data, unlike data warehouses.
As data marts contain less data than data warehouses, they are more easily understood and navigated.
A data mart includes only the data and functions needed to support a single organizational unit (e.g., a department). The data mart typically contains lightly and highly summarized data.
Reasons for creating a data mart (a small sketch follows this list):
To give users access to the data they need to analyze most often
To provide data in a form that matches the collective view of the data held by a group of users in a department or business function
To improve end-user response time, due to the reduction in the volume of data to be accessed
To provide appropriately structured data, as dictated by the requirements of end-user access tools
Data marts normally use less data, so tasks such as data cleansing, loading, transformation, and integration are far easier; hence implementing and setting up a data mart is simpler than establishing a corporate data warehouse
The cost of implementing a data mart is normally less than that required to establish a data warehouse
The potential users of a data mart are more clearly defined and can be more easily targeted to obtain support for a data mart project than for a corporate data warehouse project
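A minimal sketch with sqlite3 (the warehouse table and the department are invented for illustration): a data mart built as a departmental, lightly summarized subset of the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The corporate warehouse: detailed data for all departments.
conn.execute("CREATE TABLE dw_sales (dept TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO dw_sales VALUES (?, ?, ?)", [
    ("Sales", "A", 100.0), ("Sales", "B", 250.0), ("Finance", "A", 75.0),
])

# The data mart: only one department's data, lightly summarized by product.
conn.execute("""CREATE TABLE mart_sales AS
                SELECT product, SUM(amount) AS total
                FROM dw_sales WHERE dept = 'Sales'
                GROUP BY product""")
print(conn.execute("SELECT * FROM mart_sales").fetchall())
# [('A', 100.0), ('B', 250.0)]
```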
Data Mining
Data mining is the process of extracting patterns from data. Data mining is seen as an increasingly important tool by modern business to transform data into business intelligence, giving an informational advantage. It is currently used in a wide range of profiling practices, such as marketing, surveillance, fraud detection, and scientific discovery.
The related terms data dredging, data fishing and data snooping refer to the use of data mining techniques to sample portions of the larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered (see also data-snooping bias). These techniques can, however, be used in the creation of new hypotheses to test against the larger data populations.
Data mining commonly involves four classes of tasks:
Clustering: the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
Classification: the task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an email as legitimate or spam. Common algorithms include decision tree learning, nearest neighbor, naive Bayesian classification, neural networks and support vector machines.
Regression: attempts to find a function which models the data with the least error.
Association rule learning: searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
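A minimal sketch of market basket analysis in pure Python (the transactions are invented for illustration): count how often pairs of products are bought together and report the frequent ones.

```python
from itertools import combinations
from collections import Counter

# Each transaction is the set of products in one customer's basket.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

# Count co-occurrences of every product pair across all baskets.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs appearing in at least 3/4 of baskets suggest an association rule.
for pair, count in pair_counts.items():
    support = count / len(transactions)
    if support >= 0.75:
        print(pair, f"support={support:.2f}")   # ('bread', 'butter') support=0.75
```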
MULTIMEDIA APPROACH TO INFORMATION PROCESSING
Information processing is the change (processing) of information in any manner detectable by an observer. As such, it is a process which describes everything which happens (changes) in the universe, from the falling of a rock (a change in position) to the printing of a text file from a digital computer system. In the latter case, an information processor is changing the form of presentation of that text file.
Multimedia is media and content that uses a combination of different content forms. The term can be used as a noun (a medium with multiple content forms) or as an adjective describing a medium as having multiple content forms. The term is used in contrast to media which use only traditional forms of printed or hand-produced material. Multimedia includes a combination of text, audio, still images, animation, video, and interactive content forms. Multimedia is media that uses multiple forms of information content and information processing (text, audio, graphics, animation, video, interactivity) to inform or entertain the audience. Multimedia also refers to the use of electronic media to store and experience multimedia content. Multimedia is similar to traditional mixed media in fine art, but with a broader scope. The term "rich media" is synonymous with interactive multimedia. Multimedia means that computer information can be represented through audio, graphics, image, video and animation in addition to traditional media (text and graphics). Hypermedia can be considered one particular multimedia application.
Multimedia processing is becoming increasingly important, with a wide variety of applications ranging from multimedia cellphones to high-definition interactive television. Media processing involves the capture, storage, manipulation and transmission of multimedia objects such as text, handwritten data, audio objects, still images, 2D/3D graphics, animation, and full-motion video. A number of implementation strategies have been proposed for processing multimedia data. These approaches can be broadly classified based on the evolution of processing architectures and on the functionality of the processors. In order to provide media processing solutions to different consumer markets, designers have combined some of the classical features from both the functional and evolution-based classifications, resulting in many hybrid solutions.
Multiprocessing: one method for processing multimedia data in a multiprocessor system enables communication between a plurality of processors on receipt of multimedia data, provides portions of the multimedia data selectively to the processors, has the processors process their portions, and synchronizes the portions. Based on this synchronization, portions of the multimedia data are queued to be played, played, or skipped. Such a system includes one or more sub-processors for selectively processing portions of the multimedia data, and a master processor for synchronizing the portions.
Parallel Processing: based on a matrix architecture, performs multimedia processing flexibly, with high speed and low power consumption.
Cognitive processing of multimedia information: the performance of some composite cognitive activity; an operation that affects mental contents; "the process of thinking"; "the cognitive operation of remembering". It draws on the mental faculty of knowing, which includes perceiving, recognizing, conceiving, judging, reasoning, and imagining.
Multimedia information processing draws on several techniques:
Neural networks: an Artificial Neural Network (ANN) is an information processing paradigm that is inspired by the way biological nervous systems, such as the brain, process information. The key element of this paradigm is the novel structure of the information processing system. It is composed of a large number of highly interconnected processing elements (neurones) working in unison to solve specific problems.
Pattern recognition: in machine learning, pattern recognition is the assignment of some sort of output value (or label) to a given input value (or instance), according to some specific algorithm. An example of pattern recognition is classification, which attempts to assign each input value to one of a given set of classes (for example, determine whether a given email is "spam" or "non-spam").
Expert systems: an expert system is software that attempts to provide an answer to a problem, or clarify uncertainties, where normally one or more human experts would need to be consulted. Expert systems are most common in specific problem domains, and are a traditional application and/or subfield of artificial intelligence (AI).
Image analysis: the extraction of meaningful information from images, mainly from digital images by means of digital image processing techniques. Image analysis tasks can be as simple as reading bar-coded tags or as sophisticated as identifying a person from their face.
Speech recognition: speech recognition (also known as automatic speech recognition or computer speech recognition) converts spoken words to text. The term "voice recognition" is sometimes used to refer to recognition systems that must be trained to a particular speaker, as is the case for most desktop recognition software. Recognizing the speaker can simplify the task of translating speech.
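A minimal sketch of the classification idea in pure Python (the features, weights and training data are invented for illustration): a single perceptron, the simplest neural-network building block, learns to label inputs as spam or non-spam.

```python
# A single perceptron: weighted sum of features, thresholded to a label.
# Features per email: (number of suspicious words, number of links).
training = [((5, 4), 1), ((4, 5), 1), ((0, 1), 0), ((1, 0), 0)]  # 1 = spam

w = [0.0, 0.0]
bias = 0.0
lr = 0.1                                  # learning rate

def predict(x):
    s = w[0] * x[0] + w[1] * x[1] + bias
    return 1 if s > 0 else 0

# Perceptron learning rule: nudge the weights toward each misclassified example.
for _ in range(20):
    for x, label in training:
        err = label - predict(x)
        w[0] += lr * err * x[0]
        w[1] += lr * err * x[1]
        bias += lr * err

print(predict((6, 3)))   # 1 (classified as spam)
print(predict((0, 0)))   # 0 (classified as non-spam)
```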