Andreas Meier
Michael Kaufmann
Springer Vieweg
© Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2019
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage
and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or
hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does
not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective
laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors
or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
This Springer Vieweg imprint is published by the registered company Springer Fachmedien Wiesbaden GmbH
part of Springer Nature
The registered company address is: Abraham-Lincoln-Str. 46, 65189 Wiesbaden, Germany
Foreword
The term “database” has long since become part of people’s everyday vocabulary, for
managers and clerks as well as students of most subjects. They use it to describe a logi-
cally organized collection of electronically stored data that can be directly searched and
viewed. However, they are generally more than happy to leave the whys and hows of its
inner workings to the experts.
Users of databases are rarely aware of the immaterial and concrete business values
contained in any individual database. This applies as much to a car importer’s spare parts
inventory as to the IT solution containing all customer depots at a bank or the patient
information system of a hospital. Yet failure of these systems, or even cumulative errors,
can threaten the very existence of the respective company or institution. For that rea-
son, it is important for a much larger audience than just the “database specialists” to be
well-informed about what is going on. Anyone involved with databases should under-
stand what these tools are effectively able to do and which conditions must be created
and maintained for them to do so.
Probably the most important aspect concerning databases involves (a) the distinction
between their administration and the data stored in them (user data) and (b) the economic
magnitude of these two areas. Database administration consists of various technical and
administrative factors, from computers, database systems, and additional storage to the
experts setting up and maintaining all these components—the aforementioned database
specialists. It is crucial to keep in mind that the administration is by far the smaller part
of standard database operation, constituting only about a quarter of the entire effort.
Most of the work and expenses concerning databases lie in gathering, maintaining,
and utilizing the user data. This includes the labor costs for all employees who enter data
into the database, revise it, retrieve information from the database, or create files using
this information. In the above examples, this means warehouse employees, bank tellers,
or hospital personnel in a wide variety of fields—usually for several years.
In order to be able to properly evaluate the importance of the tasks connected with
data maintenance and utilization on the one hand and database administration on the
other hand, it is vital to understand and internalize this difference in the effort required
for each of them. Database administration starts with the design of the database, which
already touches on many specialized topics such as determining the consistency checks
for data manipulation or regulating data redundancies, which are as undesirable on the
logical level as they are essential on the storage level. The development of database solu-
tions is always targeted at their later use, so ill-considered decisions in the development
process may have a permanent impact on everyday operations. Finding ideal solutions,
such as the golden mean between too strict and too flexible when determining consist-
ency conditions, may require some experience. Unduly strict conditions will interfere
with regular operations, while excessively lax rules will entail a need for repeated expen-
sive data repairs.
To avoid such issues, it is invaluable that anyone concerned with database develop-
ment and operation, whether in management or as a database specialist, gain systematic
insight into this field of computer science. The table of contents gives an overview of
the wide variety of topics covered in this book. The title already shows that, in addition
to an in-depth explanation of the field of conventional databases (relational model, SQL),
the book also provides highly educational information about current advancements and
related fields, the keywords being “NoSQL” or “post-relational” and “Big Data.” I am
confident that the newest edition of this book will, once again, be well received by both
students and professionals—its authors are quite familiar with both groups.
Preface
It is remarkable how stable some concepts are in the field of databases. Information
technology is generally known to be subject to rapid development, bringing forth new
technologies at an unbelievable pace. However, this is only superficially the case. Many
aspects of computer science do not essentially change at all. This includes not only the
basics, such as the functional principles of universal computing machines, processors,
compilers, operating systems, databases and information systems, and distributed systems,
but also computer language technologies such as C, TCP/IP, or HTML, which are
decades old but in many ways provide a stable foundation for the global, earth-spanning
information system known as the World Wide Web. Likewise, the SQL language has
been in use for over four decades and will remain so for the foreseeable future. The theory
of relational database systems was initiated in the 1970s by Codd (relational model
and normal forms), Chen (entity-relationship model), and Chamberlin and Boyce
(SEQUEL). These technologies still have a major impact on the practice of data
management today. With the Big Data revolution and the widespread use of
data science methods for decision support, relational databases and the use of SQL for
data analysis are actually becoming more important. Even though sophisticated statistics
and machine learning are enhancing the possibilities for knowledge extraction from data,
many if not most data analyses for decision support rely on descriptive statistics using
SQL for grouped aggregation. In that sense, although SQL database technology is quite
mature, it is more relevant today than ever.
Nevertheless, a lot has changed in the area of database systems over the years.
Developments in the Big Data ecosystem in particular have brought new technologies into
the world of databases, to which we pay due attention. The nonrelational database
technologies, which are finding more and more fields of application under the generic
term NoSQL, differ not only superficially from the classical relational databases, but
also in the underlying principles. Relational databases were developed in the twentieth
century with the purpose of enabling tightly organized, operational forms of data management,
which provided stability but limited flexibility. In contrast, the NoSQL database
movement emerged at the beginning of the current century, focusing on horizontal
partitioning and schema flexibility, and with the goal of solving the Big Data problems
of volume, variety, and velocity, especially in Web-scale data systems. This has far-
reaching consequences and has led to a new approach in data management, which devi-
ates significantly from the previous theories on the basic concept of databases: the way
data is modeled, how data is queried and manipulated, how data consistency is handled,
and the system architecture. This is why we compare these two worlds, SQL and NoSQL
databases, from different perspectives in all chapters.
We have also launched a website called sql-nosql.org, where we share teaching and
tutoring materials such as slides, tutorials for SQL and Cypher, case studies, and a workbench
for MySQL and Neo4j, so that language training can be done either with SQL or
with Cypher, the graph-oriented query language of the NoSQL database Neo4j.
At this point, we would like to thank Anja Kreutel for her great effort and success
in translating the eighth edition of the German textbook to English. We also thank
Alexander Denzler and Marcel Wehrle for the development of the workbench for rela-
tional and graph-oriented databases. For the redesign of the graphics, we were able to
win Thomas Riediker and we thank him for his tireless efforts. He has succeeded in giv-
ing the pictures a modern style and an individual touch. For the further development
of the tutorials and case studies, which are available on the website sql-nosql.org, we
thank the computer science students Andreas Waldis, Bettina Willi, Markus Ineichen,
and Simon Studer for their contributions to the tutorial in Cypher and to the case study
Travelblitz with OpenOffice Base and with Neo4j. For the feedback on the manuscript
we thank Alexander Denzler, Daniel Fasel, Konrad Marfurt, and Thomas Olnhoff, for
their willingness to contribute to the quality of our work with their hints. A big thank you
goes to Sybille Thelen, Dorothea Glaunsinger, and Hermann Engesser of Springer, who
have supported us with patience and expertise.
Contents
1 Data Management
1.1 Information Systems and Databases
1.2 SQL Databases
1.2.1 Relational Model
1.2.2 Structured Query Language (SQL)
1.2.3 Relational Database Management System
1.3 Big Data
1.4 NoSQL Databases
1.4.1 Graph-based Model
1.4.2 Graph Query Language Cypher
1.4.3 NoSQL Database Management System
1.5 Organization of Data Management
1.6 Further Reading
References
2 Data Modeling
2.1 From Data Analysis to Database
2.2 The Entity-Relationship Model
2.2.1 Entities and Relationships
2.2.2 Association Types
2.2.3 Generalization and Aggregation
2.3 Implementation in the Relational Model
2.3.1 Dependencies and Normal Forms
2.3.2 Mapping Rules for Relational Databases
2.3.3 Structural Integrity Constraints
2.4 Implementation in the Graph Model
2.4.1 Graph Properties
2.4.2 Mapping Rules for Graph Databases
2.4.3 Structural Integrity Constraints
2.5 Enterprise-Wide Data Architecture
Glossary
References
Index
List of Figures
Fig. 6.15 Classification matrix with the attributes Revenue and Loyalty
Fig. 6.16 Fuzzy partitioning of domains with membership functions
Fig. 7.1 Massively distributed key-value store with sharding and hash-based key distribution
Fig. 7.2 Storing data in the Bigtable model
Fig. 7.3 Example of a document store
Fig. 7.4 Illustration of an XML document represented by tables
Fig. 7.5 Schema of a native XML database
Fig. 7.6 Example of a graph database with user data of a website
1 Data Management
1.1 Information Systems and Databases
These properties clearly show that digital goods (information, software, multime-
dia, etc.), i.e., data, are vastly different from material goods in both handling and eco-
nomic or legal evaluation. A good example is the loss in value that physical products
often experience when they are used—the shared use of information, on the other hand,
may increase its value. Another difference lies in the potentially high production costs
for material goods, while information can be multiplied easily and at significantly lower
costs (with only computing power and a storage medium). This causes difficulties in
determining property rights and ownership, even though digital watermarks and other
privacy and security measures are available.
Considering data as the basis of information as a production factor in a company has
significant consequences:
• Basis for decision-making: Data allows well-informed decisions, making it vital for
all organizational functions.
• Quality level: Data can be available from different sources; information quality
depends on the availability, correctness, and completeness of the data.
• Need for investments: Data gathering, storage, and processing cause work and
expenses.
• Degree of integration: Fields and holders of duties within any organization are con-
nected by informational relations, meaning that the fulfillment of said duties largely
depends on the degree of data integration.
Fig. 1.1 (diagram): An information system consists of a software system with a knowledge base, a method database, and a database. It offers user guidance, dialog design, a query language, a manipulation language, research help, access permissions, and data protection. Users send requests and receive responses via a communication network or the WWW.
information systems and online platforms in the World Wide Web that use search engines
to process arbitrary queries.
The computer-based information system in Fig. 1.1 is connected to a communication
network/the World Wide Web in order to allow for online research and global informa-
tion exchange in addition to company-specific analyses. Any information system of a
certain size uses database technologies to avoid the necessity to redevelop data manage-
ment and analysis every time it is used.
A database management system is software for describing, storing, and querying data
independently of any particular application. Every database management system contains
a storage and a management component. The storage component includes all data stored in
an organized form plus their description. The management component contains a query and data
manipulation language for evaluating and editing the data and information. This component
not only serves as the user interface but also manages access and editing permissions
for users.
SQL databases (SQL = Structured Query Language, cf., Sect. 1.2) are the most com-
mon in practical use. However, providing real-time web-based services referencing het-
erogeneous data sets is especially challenging (Sect. 1.3 on Big Data) and calls for new
solutions such as NoSQL approaches (Sect. 1.4). When deciding whether to use rela-
tional or nonrelational technologies, the pros and cons have to be considered carefully—
in some use cases, it may even be ideal to combine different technologies (cf., operating
a web shop in Sect. 5.6). Depending on the database architecture of choice, data manage-
ment within the company must be established and developed with the support of quali-
fied experts (Sect. 1.5). References for further reading are listed in Sect. 1.6.
1.2 SQL Databases
1.2.1 Relational Model
One of the simplest and most intuitive ways to collect and present data is in a table. Most
tabular data sets can be read and understood without additional explanations.
To collect information about employees, a table structure as shown in Fig. 1.2 can
be used. The all-capitalized table name EMPLOYEE refers to the entire table, while the
individual columns are given the desired attribute names as headers; in this example, the
employee number “E#,” the employeeʼs name “Name,” and their city of residence “City.”
An attribute assigns a specific data value from a predefined value range called domain
as a property to each entry in the table. In the EMPLOYEE table, the attribute E# allows
individual employees to be uniquely identified, making it the key of the table. To mark
key attributes more clearly, they are italicized in the table headers throughout this book.
The attribute City is used to label the respective places of residence and the attribute
Name for the names of the respective employees (Fig. 1.3).
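Fig. 1.2 (diagram): Table structure for the EMPLOYEE table. The table name is EMPLOYEE; the columns carry the attributes E#, Name, and City; E# is the key attribute.

Fig. 1.3 (table): The EMPLOYEE table with sample records.

EMPLOYEE
E#   Name     City
E19  Stewart  Stow
E4   Bell     Kent
E1   Murphy   Kent
E7   Howard   Cleveland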
The required information of the employees can now easily be entered row by row.
In the columns, values may appear more than once. In our example, Kent is listed as
the place of residence of two employees. This is an important fact, telling us that both
employee Murphy and employee Bell live in Kent. In our EMPLOYEE table, not only
cities, but also employee names may exist multiple times. For this reason, the aforemen-
tioned key attribute E# is required to uniquely identify each employee in the table.
Identification key: The identification key, or just key, of a table is one attribute or a
minimal combination of attributes whose values uniquely identify the records (called rows or
tuples) within the table.
• Uniqueness: Each key value uniquely identifies one record within the table, i.e., dif-
ferent tuples must not have identical keys.
• Minimality: If the key is a combination of attributes, this combination must be mini-
mal, i.e., no attribute can be removed from the combination without eliminating the
unique identification.
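The two key properties can be checked against a concrete table instance. The following is a minimal sketch in Python, assuming the sample EMPLOYEE records from the text; the function names are hypothetical and chosen for illustration only. Note that a check on sample data can only refute a key candidate: in this small sample the names happen to be unique, although the text points out that names may repeat in general.

```python
from itertools import combinations

# Sample rows of the EMPLOYEE table from the text.
EMPLOYEE = [
    {"E#": "E19", "Name": "Stewart", "City": "Stow"},
    {"E#": "E4", "Name": "Bell", "City": "Kent"},
    {"E#": "E1", "Name": "Murphy", "City": "Kent"},
    {"E#": "E7", "Name": "Howard", "City": "Cleveland"},
]


def is_unique(rows, attrs):
    """Uniqueness: no two rows agree on all attributes in attrs."""
    seen = {tuple(row[a] for a in attrs) for row in rows}
    return len(seen) == len(rows)


def is_minimal(rows, attrs):
    """Minimality: no non-empty proper subset of attrs is already unique."""
    return not any(
        is_unique(rows, subset)
        for size in range(1, len(attrs))
        for subset in combinations(attrs, size)
    )


def satisfies_key_properties(rows, attrs):
    """Both properties hold for these rows (hypothetical helper)."""
    return is_unique(rows, attrs) and is_minimal(rows, attrs)
```

For the sample data, `("E#",)` satisfies both properties, `("City",)` fails uniqueness because Kent occurs twice, and `("E#", "Name")` is unique but not minimal, since E# alone already identifies each record.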
According to this definition, the relational model considers each table as a set of unor-
dered tuples.
Relational model: The relational model represents both data and relationships
between data as tables. Mathematically speaking, any relation R is simply a subset of a
Cartesian product of domains: R ⊆ D1 × D2 × ... × Dn, where Di is the domain of the i-th attribute.
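In the standard mathematical reading, a relation is a subset of the Cartesian product of its attribute domains, and because it is a set, the order of its tuples carries no meaning. The sketch below illustrates this in Python with deliberately tiny, hypothetical domains; the variable names are assumptions made for the example.

```python
from itertools import product

# Hypothetical, deliberately tiny domains for the three attributes.
D_ENO = {"E1", "E4"}
D_NAME = {"Bell", "Murphy"}
D_CITY = {"Kent", "Stow"}

# The Cartesian product contains every combination: 2 * 2 * 2 = 8 tuples.
cartesian = set(product(D_ENO, D_NAME, D_CITY))

# A relation is any subset of that product; here it holds two tuples.
R = {("E4", "Bell", "Kent"), ("E1", "Murphy", "Kent")}

# Listing the same tuples in another order yields the same relation.
same_relation = {("E1", "Murphy", "Kent"), ("E4", "Bell", "Kent")}

is_subset = R <= cartesian
order_irrelevant = R == same_relation
```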
The relational model is based on the work of Edgar Frank Codd from the early 1970s.
This was the foundation for the first relational database systems, created in research
facilities and supporting SQL or similar database languages. Today, their sophisticated
successors are firmly established in many practical uses.
1.2.2 Structured Query Language (SQL)
As explained, the relational model presents information in tabular form, where each table
is a set of tuples (or records) of the same type. Seeing all the data as sets makes it pos-
sible to offer query and manipulation options based on sets.
The result of a selection operation, for example, is a set, i.e., each search result is
returned by the database management system as a table. If no tuples of the scanned table
show the respective properties, the user gets an empty results table. Manipulation operations
similarly target sets and affect an entire table or individual table sections.
The primary query and data manipulation language for tables is called Structured
Query Language, usually shortened to SQL (see Fig. 1.4). It was standardized by
ANSI (American National Standards Institute) and ISO (International Organization for
Standardization) [1].
SQL is a descriptive language, as the statements describe the desired result instead of
the necessary computing steps. SQL queries follow a basic pattern as illustrated by the
query in Fig. 1.4:
“SELECT the attribute Name FROM the EMPLOYEE table WHERE the city is
Kent.”
A SELECT-FROM-WHERE query can apply to one or several tables and always gen-
erates a table as a result. In our example, the query would yield a results table with the
names Bell and Murphy, as desired.
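The query can be tried out with any SQL database. The following sketch uses Python's built-in sqlite3 module with an in-memory database as a stand-in; the helper name names_in and the city "Akron" (which does not occur in the table) are assumptions made for the illustration. The attribute name E# is quoted because # is not allowed in an unquoted SQL identifier.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a relational DBMS.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE EMPLOYEE ("E#" TEXT PRIMARY KEY, Name TEXT, City TEXT)')
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [
        ("E19", "Stewart", "Stow"),
        ("E4", "Bell", "Kent"),
        ("E1", "Murphy", "Kent"),
        ("E7", "Howard", "Cleveland"),
    ],
)


def names_in(city):
    """Run the SELECT-FROM-WHERE query from the text for a given city."""
    rows = conn.execute(
        "SELECT Name FROM EMPLOYEE WHERE City = ?", (city,)
    ).fetchall()
    return [name for (name,) in rows]


kent = names_in("Kent")     # the names Bell and Murphy, in no guaranteed order
nobody = names_in("Akron")  # no matching tuples: an empty result
```

The result is again a set of rows. Without an ORDER BY clause the order of the returned names is not guaranteed, which matches the view of a table as an unordered set of tuples.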
The set-based method offers users a major advantage, since a single SQL query can
trigger multiple actions within the database management system. It is not necessary for
users to program all searches themselves.
Relational query and data manipulation languages are descriptive. Users get the
desired results by merely setting the requested properties in the SELECT expression.
They do not have to provide the procedure for computing the required records.
The database management system takes on this task, processes the query or manipulation
with its own search and access methods, and generates the results table.

Fig. 1.4 (table and example query): The EMPLOYEE table and a query against it.

EMPLOYEE
E#   Name     City
E19  Stewart  Stow
E4   Bell     Kent
E1   Murphy   Kent
E7   Howard   Cleveland

Example query:
“Select the names of the employees living in Kent.”

[1] ANSI is the national standards organization of the US. The national standardization organizations
are part of ISO.
With procedural database languages, on the other hand, the methods for retrieving the
requested information must be programmed by the user. In this case, each query yields
only one record, not a set of tuples.
With its descriptive query formula, SQL requires only the specification of the desired
selection conditions in the WHERE clause, while procedural languages require the user
to specify an algorithm for finding the individual records. As an example, let us take a
look at a query language for hierarchical databases (see Fig. 1.5): For our initial operation,
we use GET_FIRST to search for the first record that meets our search criteria.
Next, we access all other corresponding records individually with the command GET_NEXT
until we reach the end of the file or a new hierarchy level within the database.
Overall, we can conclude that procedural database management languages use record-based
or navigating commands to manipulate collections of data, requiring some experience
and knowledge of the databaseʼs inner structure from the users. Occasional users
can therefore hardly access and use the contents of such a database on their own. Unlike
procedural languages, relational query and manipulation languages do not require the
specification of access paths, processing procedures, or navigational routes, which significantly
reduces the development effort for database utilization.
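The contrast between the two styles can be sketched in Python. The RecordCursor class below is a hypothetical record-at-a-time interface modeled on the GET_FIRST and GET_NEXT commands mentioned in the text; it is not the API of any real hierarchical database system. The descriptive variant states only the selection condition.

```python
class RecordCursor:
    """Hypothetical record-at-a-time interface (not a real DBMS API)."""

    def __init__(self, records):
        self._records = list(records)
        self._pos = -1

    def get_first(self, predicate):
        """Position on the first record that meets the search criteria."""
        self._pos = -1
        return self.get_next(predicate)

    def get_next(self, predicate):
        """Advance to the next matching record, or return None at the end."""
        for i in range(self._pos + 1, len(self._records)):
            if predicate(self._records[i]):
                self._pos = i
                return self._records[i]
        self._pos = len(self._records)
        return None


EMPLOYEE = [
    ("E19", "Stewart", "Stow"),
    ("E4", "Bell", "Kent"),
    ("E1", "Murphy", "Kent"),
    ("E7", "Howard", "Cleveland"),
]


def lives_in_kent(record):
    return record[2] == "Kent"


# Procedural style: the user navigates from record to record.
cursor = RecordCursor(EMPLOYEE)
procedural = []
record = cursor.get_first(lives_in_kent)
while record is not None:
    procedural.append(record[1])
    record = cursor.get_next(lives_in_kent)

# Descriptive style: only the desired property is stated.
descriptive = [name for (_, name, city) in EMPLOYEE if city == "Kent"]
```

Both variants collect the same names, but the procedural one exposes the loop, the cursor position, and the end-of-data handling to the user.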
If database queries and analyses are to be done by company departments and end
users instead of IT, the descriptive approach is extremely useful. Research on descriptive
database interfaces has shown that even occasional users have a high probability of
successfully executing the desired analyses using descriptive language elements. Figure 1.5
also illustrates the similarities between SQL and natural language. In fact, there are
modern relational database management systems that can be accessed with natural language.

Fig. 1.5 (diagram): The same query formulated in natural language, in a descriptive language, and in a procedural language. The descriptive formulation reads:

SELECT Name
FROM EMPLOYEE
WHERE City = 'Kent'
1.2.3 Relational Database Management System
Databases are used in the development and operation of information systems in order to
store data centrally, permanently, and in a structured manner.
As shown in Fig. 1.6, relational database management systems are integrated sys-
tems for the consistent management of tables. They offer service functionalities and the
descriptive language SQL for data description, selection, and manipulation.
Every relational database management system consists of a storage and a manage-
ment component. The storage component stores both data and the relationships between
pieces of information in tables. In addition to tables with user data from various applica-
tions, it contains the predefined system tables necessary for database operation. These
contain descriptive information and can be queried but not manipulated by users.
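As an illustration of system tables, the sketch below queries the built-in catalog of SQLite through Python's sqlite3 module. SQLite is used here only as a convenient stand-in; its catalog table is named sqlite_master, and other relational systems name and structure their system tables differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE EMPLOYEE ("E#" TEXT PRIMARY KEY, Name TEXT, City TEXT)')

# sqlite_master is SQLite's predefined catalog; it stores one row of
# descriptive information per table, index, view, and trigger.
catalog = conn.execute(
    "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
).fetchall()

table_names = [name for (name, _definition) in catalog]
definition = catalog[0][1]  # the stored CREATE TABLE statement
```

The catalog is queried with the same SELECT statement as user tables, which is what makes the descriptive information accessible to users and tools.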
The management componentʼs most important part is the relational data definition,
selection, and manipulation language SQL. This component also contains service func-
tions for data restoration after errors, for data protection, and for backup.
Relational database management systems are common bases for businessesʼ informa-
tion systems and can be defined as follows:
• Model: The database model follows the relational model, i.e., all data and data
relations are represented in tables. Dependencies between attribute values of tuples
or multiple instances of data can be discovered (cf., normal forms in Sect. 2.3.1).
• Schema: The definitions of tables and attributes are stored in the relational data-
base schema. The schema further contains the definition of the identification keys
and rules for integrity assurance.
• Language: The database system includes SQL for data definition, selection, and
manipulation. The language component is descriptive and facilitates analyses and
programming tasks for users.
• Architecture: The system ensures extensive data independence, i.e., data and
applications are mostly segregated. This independence is reached by separating the
actual storage component from the user side using the management component.
Ideally, physical changes to relational databases are possible without the need to
adjust related applications.
• Multi-user operation: The system supports multi-user operation (Sect. 4.1), i.e.,
several users can query or manipulate the same database at the same time. The
RDBMS ensures that parallel transactions in one database do not interfere with
each other or, worse, with the correctness of data (Sect. 4.2).
• Consistency assurance: The database management system provides tools for
ensuring data integrity, i.e., the correct and uncompromised storage of data.
• Data security and data protection: The database management system pro-
vides mechanisms to protect data from destruction, loss, or unauthorized access
(Sect. 3.8).
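Two of these criteria, the identification key stored in the schema and consistency assurance, can be observed in a small experiment. The following sketch again assumes SQLite via Python's sqlite3 module as a stand-in; the employee values 'E20', 'Hypothetical', and 'Duplicate' are made up for the example. A transaction tries to insert two records, the second of which violates the key, and the system rejects the transaction as a whole.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE EMPLOYEE ("E#" TEXT PRIMARY KEY, Name TEXT, City TEXT)')

# "with conn" commits on success and rolls back if an exception occurs.
with conn:
    conn.executemany(
        "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
        [
            ("E19", "Stewart", "Stow"),
            ("E4", "Bell", "Kent"),
            ("E1", "Murphy", "Kent"),
            ("E7", "Howard", "Cleveland"),
        ],
    )


def count_rows():
    return conn.execute("SELECT COUNT(*) FROM EMPLOYEE").fetchone()[0]


rejected = False
try:
    with conn:
        # First insert is valid on its own (made-up sample values).
        conn.execute("INSERT INTO EMPLOYEE VALUES ('E20', 'Hypothetical', 'Kent')")
        # Second insert reuses the key E19 and violates uniqueness.
        conn.execute("INSERT INTO EMPLOYEE VALUES ('E19', 'Duplicate', 'Stow')")
except sqlite3.IntegrityError:
    rejected = True

rows_after = count_rows()
```

After the failed transaction the table still holds the original four records, so the valid first insert has been undone together with the invalid one.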
NoSQL database management systems meet these criteria only partially (Sect. 1.4.3 and
Chaps. 4 and 7). For this reason, most corporations, organizations, and especially SMEs
(small and medium enterprises) rely heavily on relational database management systems.
However, for distributed web applications or applications handling Big Data, relational
database technology must be augmented with NoSQL technology in order to ensure
uninterrupted global access to these services.
1.3 Big Data
The term Big Data is used to label large volumes of data that push the limits of conven-
tional software. This data is usually unstructured (Sect. 5.1) and may originate from a
wide variety of sources: social media postings, e-mails, electronic archives with multi-
media content, search engine queries, document repositories of content management sys-
tems, sensor data of various kinds, rate developments at stock exchanges, traffic flow
data and satellite images, smart meters in household appliances, order, purchase, and
payment processes in online stores, e-health applications, monitoring systems, etc.
There is no binding definition for Big Data yet, but most data specialists will agree
on three v’s: volume (extensive amounts of data), variety (multiple formats, structured,
semi-structured, and unstructured data, Fig. 1.7), and velocity (high-speed and real-time
processing). Gartner Group’s IT glossary offers the following definition:
‘Big data is high-volume, high-velocity and high-variety information assets that
demand cost-effective, innovative forms of information processing for enhanced
insight and decision making.’2
With this definition, Gartner Group positions Big Data as information assets for com-
panies. It is, indeed, vital for companies and organizations to generate decision-relevant
knowledge in order to survive. In addition to internal information systems, they increasingly
utilize the numerous resources available online to better anticipate economic, ecological,
and social developments on the markets.
Big Data is a challenge faced by not only for-profit companies in digital markets, but
also governments, public authorities, NGOs (nongovernmental organizations), and NPOs
(nonprofit organizations).
Good examples are programs to create smart cities or ubiquitous cities, i.e., to use
Big Data technologies in cities and urban agglomerations for the sustainable development
of social and ecological aspects of human living spaces. They include projects facilitating
mobility, the use of intelligent systems for water and energy supply, the promotion of
social networks, expansion of political participation, encouragement of entrepreneurship,
protection of the environment, and an increase of security and quality of life.
All use of Big Data applications requires successful management of the three v's men-
tioned above:
• Volume: There are massive amounts of data involved, ranging from terabytes to zettabytes
(megabyte = 10^6 bytes, gigabyte = 10^9 bytes, terabyte = 10^12 bytes,
petabyte = 10^15 bytes, exabyte = 10^18 bytes, zettabyte = 10^21 bytes).
• Variety: Big Data involves storing structured, semi-structured, and unstructured mul-
timedia data (text, graphics, images, audio, and video; Fig. 1.7).
• Velocity: Applications must be able to process and analyze data streams (Sect. 5.1) in
real-time as the data is gathered.
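The decimal units listed under Volume can be checked with a few lines of Python. This is only an illustrative sketch; the names UNITS and to_bytes are assumptions made for the example and do not come from the text.

```python
# Decimal (SI) byte units as listed in the text: each step is a factor of 1000.
UNITS = {
    "megabyte": 10**6,
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
}

def to_bytes(amount, unit):
    """Convert an amount given in one of the units above to bytes."""
    return amount * UNITS[unit]

# One zettabyte expressed in terabytes.
terabytes_per_zettabyte = to_bytes(1, "zettabyte") // to_bytes(1, "terabyte")
print(terabytes_per_zettabyte)
```

The range "from terabytes to zettabytes" therefore spans nine orders of magnitude.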
As in the Gartner Groupʼs definition, Big Data can be considered an information
asset, which is why sometimes another V is added:
• Value: Big Data applications are meant to increase the enterprise value, so investments
in personnel and technical infrastructure are made where they provide leverage
or generate added value.
There are numerous open source solutions for NoSQL databases, and the technologies
do not require expensive hardware, while they also offer good scalability. Specialized
personnel, however, are in short supply, since the data scientist profession (Sect. 1.5) is
only just emerging, and professional education in this sector is still in its pilot phase or
only under discussion.
To complete our consideration of the concept of Big Data we will look at another V:
• Veracity: Since much data is vague or inaccurate, specific algorithms evaluating the
validity and assessing result quality are needed. Large amounts of data do not auto-
matically mean better analyses.
Veracity is an important factor in Big Data, where the available data is of variable
quality, which has to be taken into consideration in analyses. Aside from statistical
methods, there are fuzzy methods of soft computing which assign a truth value
between 0 (false) and 1 (true) to any result or statement (fuzzy databases in Sect. 6.8).
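As a minimal sketch of such a graded truth value, the following Python function rates the statement "this record is of high quality" on a scale from 0 to 1. The function name, the completeness measure, and the thresholds 0.5 and 0.9 are assumptions chosen for illustration only; they are not taken from the text.

```python
def truth_of_high_quality(completeness):
    """Degree (0..1) to which a record counts as 'high quality'.

    Hypothetical piecewise-linear membership function: records that are
    at most 50% complete are not high quality (0.0), records that are at
    least 90% complete fully are (1.0), with a linear ramp in between.
    """
    low, high = 0.5, 0.9
    if completeness <= low:
        return 0.0
    if completeness >= high:
        return 1.0
    return (completeness - low) / (high - low)

for c in (0.4, 0.7, 0.95):
    print(c, round(truth_of_high_quality(c), 2))
```

A record that is 70% complete is thus neither simply accepted nor rejected, but enters the analysis with a truth value of about 0.5.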
NoSQL databases support various database models (Sect. 1.4.3 and Fig. 1.11). We
picked out graph databases as an example to look at and discuss their characteristics.
u Property graph A property graph consists of nodes (concepts, objects) and directed
edges (relationships) connecting the nodes. Both nodes and edges are given a label and
can have properties. Properties are given as attribute-value pairs following the pattern
(attribute: value) with the names of attributes and the respective values.
A graph abstractly presents the nodes and edges with their properties. Figure 1.8 shows
part of a movie collection as an example. It contains the nodes MOVIE with attributes
Title and Year (of release), GENRE with the respective Type (e.g., crime, mystery, com-
edy, drama, thriller, Western, science fiction, documentary, etc.), ACTOR with Name and
Year of Birth, and DIRECTOR with Name and Nationality.
The example uses three directed edges: The edge ACTED_IN shows which artist
from the ACTOR node starred in which film from the MOVIE node. This edge also has a
property, the Role of the actor in the movie. The other two edges, HAS and DIRECTED_
BY, go from the MOVIE node to the GENRE and DIRECTOR node, respectively.
At the manifestation level, i.e., in the graph database, the property graph contains the
concrete values (Fig. 1.9 in Sect. 1.4.2).
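To make the structure concrete, a property graph can be sketched with plain Python dictionaries and tuples. The values below are those of Fig. 1.9; the node identifiers (a1, m1, m2), the helper outgoing, and the direction chosen for the DIRECTED_BY edge (from the movie to the person) are assumptions made for this illustration.

```python
# Nodes: id -> (label, properties). Values taken from Fig. 1.9.
nodes = {
    "a1": ("ACTOR", {"Name": "Keanu Reeves", "YearOfBirth": 1964}),
    "m1": ("MOVIE", {"Title": "The Matrix", "Year": 1999}),
    "m2": ("MOVIE", {"Title": "Man of Tai Chi", "Year": 2013}),
}

# Directed edges: (source id, label, target id, properties).
edges = [
    ("a1", "ACTED_IN", "m1", {"Role": "Neo"}),
    ("a1", "ACTED_IN", "m2", {"Role": "Donaka Mark"}),
    ("m2", "DIRECTED_BY", "a1", {}),  # assumed direction: movie -> person
]

def outgoing(node_id, label):
    """Return (target id, edge properties) for edges of one label leaving a node."""
    return [(dst, props) for src, lbl, dst, props in edges
            if src == node_id and lbl == label]

# Map each movie title to the role played in it.
roles = {nodes[dst][1]["Title"]: props["Role"]
         for dst, props in outgoing("a1", "ACTED_IN")}
print(roles)
```

Both nodes and edges carry a label and a set of attribute-value pairs, which is exactly what distinguishes a property graph from a bare graph of nodes and edges.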
The property graph model for databases is formally based on graph theory. Depending
on their maturity, relevant software products may offer algorithms to calculate the fol-
lowing traits:
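[Fig. 1.8: property graph with the nodes MOVIE (Title, Year), GENRE (Type), ACTOR (Name, Year of birth), and DIRECTOR (Name, Nationality) and the edges ACTED_IN (Role), HAS, and DIRECTED_BY]
[Fig. 1.9: graph database segment with the node ACTOR (Name: Keanu Reeves, Year of Birth: 1964) and the nodes MOVIE (Title: Man of Tai Chi, Year: 2013) and MOVIE (Title: The Matrix, Year: 1999), connected by the edges ACTED_IN (Role: Neo), ACTED_IN (Role: Donaka Mark), and DIRECTED_BY]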
• Nearest neighbor: In graphs with weighted edges (e.g., by distance or time in a trans-
port network), the nearest neighbors of a node can be determined by finding the mini-
mal intervals (shortest route in terms of distance or time).
• Matching: Matching in graph theory means finding a set of edges that have no com-
mon nodes.
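The two traits above can be sketched in Python. The nearest-neighbor search below uses Dijkstra's algorithm, which is one common choice and not necessarily what a given graph database product implements; the transport network and its travel times are hypothetical.

```python
import heapq

def distances_from(graph, start):
    """Dijkstra's algorithm: shortest distance from start to every reachable node.

    graph maps a node to a dict {neighbor: non-negative edge weight}.
    """
    dist = {start: 0}
    queue = [(0, start)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist[node]:
            continue  # stale queue entry
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return dist

def nearest_neighbors(graph, start, k):
    """The k nodes closest to start, ordered by shortest-path distance."""
    dist = distances_from(graph, start)
    others = sorted((d, n) for n, d in dist.items() if n != start)
    return [n for _, n in others[:k]]

def is_matching(edge_set):
    """True if no two edges in edge_set share a node."""
    seen = set()
    for u, v in edge_set:
        if u in seen or v in seen:
            return False
        seen.update((u, v))
    return True

# Hypothetical transport network; weights are travel times in minutes.
network = {
    "A": {"B": 7, "C": 3},
    "B": {"A": 7, "D": 2},
    "C": {"A": 3, "B": 2, "D": 8},
    "D": {"B": 2, "C": 8},
}
print(nearest_neighbors(network, "A", 2))
print(is_matching([("A", "C"), ("B", "D")]))
```

Note that B is nearer to A via C (3 + 2 minutes) than via the direct edge (7 minutes), which is why the search has to consider paths rather than single edges.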
14 1 Data Management
These graph characteristics are significant in many kinds of applications. Finding the
shortest path or the nearest neighbor, for example, is of great importance in calculating
travel or transport routes. The algorithms listed can also sort and analyze relationships in
social networks by path length (Sect. 2.4).
Cypher is a declarative query language for extracting patterns from graph databases.
Users define their query by specifying nodes and edges. The database management sys-
tem then calculates all patterns meeting the criteria by analyzing the possible paths (con-
nections between nodes via edges). In other words, the user declares the structure of the
desired pattern, and the database management systemʼs algorithms traverse all necessary
connections (paths) and assemble the results.
As described in Sect. 1.4.1, the data model of a graph database consists of nodes
(concepts, objects) and directed edges (relationships between nodes). In addition to
their name, both nodes and edges can have a set of properties (see the Property Graph in
Sect. 1.4.1). These properties are represented by attribute-value pairs.
Figure 1.9 shows a segment of a graph database on movies and actors. To keep things
simple, only two types of node are shown: ACTOR and MOVIE. ACTOR nodes contain
two attribute-value pairs, specifically (Name: FirstName LastName) and (YearOfBirth:
Year).
The segment in Fig. 1.9 includes different types of edges: The ACTED_IN relation-
ship represents which actors starred in which movies. Edges can also have properties if
attribute-value pairs are added to them. For the ACTED_IN relationship, the respective
roles of the actors in the movies are listed. For example, Keanu Reeves was cast as the
hacker Neo in ‘The Matrix.’
Nodes can be connected by multiple relationship edges. The movie ‘Man of Tai Chi’
and actor Keanu Reeves are linked not only by the actorʼs role (ACTED_IN), but also by
the director position (DIRECTED_BY). The diagram therefore shows that Keanu Reeves
both directed the movie ‘Man of Tai Chi’ and starred in it as Donaka Mark.
If we want to analyze this graph database on movies, we can use Cypher. It uses the
following basic query elements:
• MATCH: Specification of nodes and edges, as well as declaration of search patterns.
• WHERE: Conditions for filtering results.
• RETURN: Specification of the desired search result, aggregated if necessary
For instance, the Cypher query for the year the movie ‘The Matrix’ was released would be:
MATCH (m: Movie {Title: "The Matrix"})
RETURN m.Year
1.4 NoSQL Databases 15
The query binds the variable m to the movie ‘The Matrix’ and returns the movieʼs year
of release as m.Year. In Cypher, parentheses always indicate nodes, i.e., (m: Movie)
declares the control variable m for the MOVIE node. In addition to control variables,
individual attribute-value pairs can be included in curly brackets. Since we are specifically
interested in the movie ‘The Matrix’, we can add {Title: "The Matrix"} to the node
(m: Movie).
Queries regarding the relationships within the graph database are a bit more complicated.
Relationships between two arbitrary nodes (a) and (b) are expressed in Cypher
by the arrow symbol “->”, i.e., the path from (a) to (b) is declared as “(a)->(b)”. If the
specific relationship between (a) and (b) is of importance, the edge [r] can be inserted
in the middle of the arrow, giving “(a)-[r]->(b)”. The square brackets represent edges,
and r is our variable for relationships.
Now, if we want to find out who played Neo in ‘The Matrix’, we use the following
query to analyze the ACTED_IN path between ACTOR and MOVIE:
MATCH (a: Actor)-[: Acted_In {Role: "Neo"}]->
(: Movie {Title: "The Matrix"})
RETURN a.Name
Cypher will return the result Keanu Reeves.
For a list of movie titles (m), actor names (a), and respective roles (r), the query
would have to be:
MATCH (a: Actor)-[r: Acted_In]->(m: Movie)
RETURN m.Title, a.Name, r.Role
Since our example graph database only contains one actor and two movies, the result
would be the movie ‘Man of Tai Chi’ with actor Keanu Reeves in the role of Donaka
Mark and the movie ‘The Matrix’ with Keanu Reeves as Neo.
In real life, however, such a graph database of actors, movies, and roles has count-
less entries. A manageable query would, therefore, have to remain limited, e.g., to actor
Keanu Reeves, and would then look like this:
MATCH (a: Actor)-[r: Acted_In]->(m: Movie)
WHERE (a.Name = "Keanu Reeves")
RETURN m.Title, a.Name, r.Role
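To illustrate what the declared pattern asks the database management system to do, the following Python sketch evaluates the same pattern naively over an in-memory copy of the segment from Fig. 1.9. It is not how a graph database is implemented; the numeric node ids and the helper match_acted_in are assumptions made for the example.

```python
# Tiny in-memory copy of the graph segment from Fig. 1.9.
nodes = {
    1: {"label": "Actor", "Name": "Keanu Reeves", "YearOfBirth": 1964},
    2: {"label": "Movie", "Title": "The Matrix", "Year": 1999},
    3: {"label": "Movie", "Title": "Man of Tai Chi", "Year": 2013},
}
edges = [
    {"from": 1, "type": "Acted_In", "to": 2, "Role": "Neo"},
    {"from": 1, "type": "Acted_In", "to": 3, "Role": "Donaka Mark"},
]

def match_acted_in(where=lambda a, r, m: True):
    """Naive evaluation of (a: Actor)-[r: Acted_In]->(m: Movie) with a WHERE filter."""
    rows = []
    for r in edges:
        a, m = nodes[r["from"]], nodes[r["to"]]
        if (r["type"] == "Acted_In" and a["label"] == "Actor"
                and m["label"] == "Movie" and where(a, r, m)):
            rows.append((m["Title"], a["Name"], r["Role"]))  # RETURN clause
    return rows

# Who played Neo in 'The Matrix'?
neo = match_acted_in(lambda a, r, m: r["Role"] == "Neo" and m["Title"] == "The Matrix")
# All roles of Keanu Reeves.
keanu = match_acted_in(lambda a, r, m: a["Name"] == "Keanu Reeves")
print(neo)
print(keanu)
```

The user of Cypher writes only the pattern and the filter; the loop over candidate paths is the part that the database management system takes over.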
Similar to SQL, Cypher uses declarative queries where the user specifies the desired
properties of the result pattern (Cypher) or results table (SQL), and the respective data-
base management system then calculates the results. However, analyzing relationship
networks, using recursive search strategies, or analyzing graph properties are hardly pos-
sible with SQL.
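For comparison, the last Cypher query can be expressed in SQL once the graph is stored in tables. The sketch below uses Python's built-in sqlite3 module; the table and column names (actor, movie, acted_in, a_id, m_id) are assumptions made for this example and not a schema given in the text.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE actor (a_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE movie (m_id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
    CREATE TABLE acted_in (
        a_id INTEGER REFERENCES actor(a_id),
        m_id INTEGER REFERENCES movie(m_id),
        role TEXT,
        PRIMARY KEY (a_id, m_id)
    );
    INSERT INTO actor VALUES (1, 'Keanu Reeves');
    INSERT INTO movie VALUES (1, 'The Matrix', 1999), (2, 'Man of Tai Chi', 2013);
    INSERT INTO acted_in VALUES (1, 1, 'Neo'), (1, 2, 'Donaka Mark');
""")

# Each edge of the pattern becomes one join in the relational formulation.
rows = con.execute("""
    SELECT m.title, a.name, r.role
    FROM actor a
    JOIN acted_in r ON r.a_id = a.a_id
    JOIN movie m ON m.m_id = r.m_id
    WHERE a.name = 'Keanu Reeves'
    ORDER BY m.year
""").fetchall()
print(rows)
```

A single hop is still easy to write this way. Every additional edge in a path adds another join, which is why longer or variable-length paths quickly become unwieldy in this formulation.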
Before Ted Coddʼs introduction of the relational model, nonrelational databases such as
hierarchical or network-like databases existed. After the development of relational data-
base management systems, nonrelational models were still used in technical or scientific
applications. For instance, running CAD (computer-aided design) systems for structural
or machine components on relational technology is rather difficult. Splitting technical
objects across a multitude of tables proved problematic, as geometric, topological, and
graphical manipulations all had to be executed in real time.
The advent of the internet and numerous web-based applications has provided quite a
boost to the relevance of nonrelational data concepts versus relational ones, as managing
Big Data applications with relational database technology is difficult to impossible.
While ‘nonrelational’ would be a better description than NoSQL, the latter has become
established with database researchers and providers on the market over the last few years.
u NoSQL The term NoSQL is now used for any nonrelational data management
approaches meeting two criteria:
NoSQL is also sometimes interpreted as ‘Not only SQL’ to express that other technolo-
gies besides relational data technology are used in massively distributed web applica-
tions. NoSQL technologies are especially necessary if the web service requires high
availability. Section 5.6 discusses the example of an online shop that uses various
NoSQL technologies in addition to a relational database (Fig. 5.13).
The basic structure of an NoSQL database management system is shown in Fig. 1.10.
NoSQL database management systems mostly use a massively distributed storage archi-
tecture. The actual data is stored in key-value pairs, columns or column families, docu-
ment stores, or graphs (Fig. 1.11 and Chap. 7). In order to ensure high availability and
avoid outages in NoSQL database systems, various redundancy concepts (cf., “consistent
hashing” in Sect. 5.2.3) are supported.
The massively distributed and replicated architecture also enables parallel analyses
(“MapReduce” in Sect. 5.4). Analyses of large volumes of data and searches for specific
information, in particular, can be significantly accelerated with distributed computing
processes. In the map/reduce method, subtasks are delegated to various computer nodes and
simple key-value pairs are extracted (map), then the partial results are aggregated and
returned (reduce).
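The two phases can be sketched in a few lines of Python using a word count, a customary introductory example. The text chunks are hypothetical, and the sketch runs sequentially on one machine; in a real cluster each map call would run on a different computer node.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (key, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reduce: aggregate the partial results by summing the values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Each chunk stands in for the data held by one computer node (hypothetical data).
chunks = ["big data needs big storage", "data streams need fast processing"]

# Sequential stand-in for the parallel map step.
intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(intermediate)
print(counts["big"], counts["data"])
```

Because each map call only looks at its own chunk, the map step needs no coordination between nodes; only the reduce step brings the partial results together.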
There are also multiple consistency models for massively distributed computing
networks (Sect. 4.3). Strong consistency means that the NoSQL database management
system ensures full consistency at all times. Systems with weak consistency
tolerate that changes will be copied to replicated nodes with a delay, resulting in
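[Fig. 1.10: basic structure of a NoSQL database management system; recoverable labels: parallel execution, weak to strong consistency]
[Fig. 1.11: three NoSQL databases: a key-value store (shopping cart with Item 1, Item 2, and Item 3; <Value = Order>), a document store (Documents A, C, D, and E), and a graph database (nodes MOVIE, DIRECTOR, and ACTOR; edges DIRECTED_BY and ACTED_IN)]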
u NoSQL database system Web-based storage systems are considered NoSQL database
systems if they meet the following requirements:
The researchers and operators of the NoSQL Archive list more than 225 NoSQL database
products on their website, most of them open source. However, the multitude of
solutions indicates that the market for NoSQL products has not yet fully consolidated.
Moreover, implementation of suitable NoSQL technologies requires specialists who
know not only the underlying concepts, but also the various architectural approaches and
tools.
Figure 1.11 shows three different NoSQL database management systems.
Key-value stores (Sect. 7.2) are the simplest version. Data is stored as an identification
key <key = “key”> and a list of values <value = “value 1”, “value 2”, …>. A good
example is an online store with session management and shopping basket. The session
ID is the identification key; the individual items from the basket are stored as values in
addition to the customer profile.
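The shopping basket example can be sketched with a Python dictionary standing in for the key-value store. The session ID, the profile entry, and the helper names put and get are hypothetical and serve only to illustrate the access pattern.

```python
# A dict stands in for the key-value store; values are opaque to the store.
store = {}

def put(key, values):
    """Store a list of values under an identification key."""
    store[key] = list(values)

def get(key):
    """Fetch the values for a key; an unknown key yields an empty list."""
    return store.get(key, [])

# Hypothetical session ID as identification key; customer profile and
# basket items are stored as values.
session_id = "session-4711"
put(session_id, ["profile:customer-17", "Item 1", "Item 2"])
store[session_id].append("Item 3")

print(get(session_id))
print(get("unknown-session"))
```

All access goes through the key. The store offers no way to ask, for instance, which sessions contain Item 2 without scanning every entry.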
In document stores (Sect. 7.4), records are managed as documents within the NoSQL
database. These documents are structured text files, e.g., in JSON or XML format, which
can be searched for by a unique key or attributes within the documents. Unlike key-value
stores, documents have some structure; however, it is schema free, i.e., the structures of
individual records (documents) can vary.
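A minimal sketch of this idea in Python, using JSON documents of differing structure. The document keys, attribute names, and values are hypothetical; a real document store would add indexing and persistence.

```python
import json

# Two documents with different structure: the store does not enforce a schema.
raw_documents = {
    "doc-A": '{"type": "order", "customer": "Murphy", "items": ["Item 1", "Item 2"]}',
    "doc-B": '{"type": "order", "customer": "Smith", "total": 99.5, "gift": true}',
}
documents = {key: json.loads(text) for key, text in raw_documents.items()}

def find_by_attribute(attribute, value):
    """Return the keys of all documents in which attribute equals value."""
    return sorted(key for key, doc in documents.items() if doc.get(attribute) == value)

print(documents["doc-A"]["items"])              # access by unique key
print(find_by_attribute("customer", "Smith"))   # search by attribute
print(find_by_attribute("gift", True))          # attribute present in only one document
```

Unlike the key-value store, the documents can be searched by their content, and a document that lacks an attribute is simply not matched.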
The third example revisits the graph database on movies and actors discussed in the
previous sections (for more details on graph databases, see also Sect. 7.6).
Many companies and organizations view their data as a vital resource, increasingly
joining in public information gathering in addition to maintaining their own data. The
continuous global increase and growth of information providers and their 24/7 services
reinforce the importance of web-based data pools.
1.5 Organization of Data Management 19
The need for current, real-world information has a direct impact on the
conception of the field of IT. In many places, specific positions for data management have
been created for a more targeted approach to data-related tasks and obligations. Proactive
data management deals both strategically with information gathering and utilization and
operatively with the efficient provision and analysis of current and consistent data.
Development and operation of data management incur high costs, while the return is
initially hard to measure. Flexible data architecture, noncontradictory and easy-to-under-
stand data description, clean and consistent databases, effective security concepts, cur-
rent information readiness, and other factors involved are hard to assess and include in
profitability considerations. Only the realization of the dataʼs importance and longevity
makes the necessary investments worthwhile for the company.
For better comprehension of the term data management, we will look at the four
subfields: data architecture, data governance, data technology, and data utiliza-
tion. Figure 1.12 illustrates the objectives and tools of these four fields within data
management.
Employees in data architecture analyze, categorize, and structure the company data
with a sophisticated methodology. In addition to the assessment of data and informa-
tion requirements, the major data classes and their relationships with each other must be
documented in data models of varying specificity. These models, created from abstrac-
tion of reality and matched to each other, form the foundation of the data architecture.
Data governance aims for a unified coverage of data descriptions and formats as well
as the respective responsibilities in order to ensure a cross-application use of the long-
lived company data. Todayʼs tendency towards decentralized data storage on intelligent
workplace computers or distributed department machines is leading to a growing respon-
sibility of data governance experts for maintaining data and assigning permissions.
Data technology specialists install, monitor, and reorganize databases and are in
charge of their multilayer security. Their field, also known as database technology or
database administration, further includes technology management and the need for the
integration of new extensions and constant updates and improvements of existing tools
and methods.
The fourth pillar of data management, data utilization, enables the actual, profitable
application of business data. A specialized team of data scientists (see job profile below)
conducts business analytics, providing and reporting on data analyses to management.
They also support individual departments, e.g., marketing, sales, and customer service,
in generating specific relevant insights from Big Data.
Based on the characterization of data-related tasks and obligations, data management
can be defined as:
Over the past years, new specializations have evolved in the data management field, most
importantly:
• Data architects: Data architects are in charge of a companyʼs entire data architecture.
They decide where and how data has to be accessible in the respective business
model and collaborate with database specialists on questions of data distribution,
replication, or fragmentation.
• Database specialists: Database specialists are experts on database and system tech-
nology and manage the physical implementation of the data architecture. They decide
which database management systems (SQL and/or NoSQL) to use for which applica-
tion architecture components. Moreover, they are responsible for designing a distribu-
tion concept and for archiving, reorganizing, and restoring existing data.
• Data scientists: Data scientists are business analytics experts. They handle data anal-
ysis and interpretation, extracting previously unknown facts from data (knowledge
generation) and providing prognoses for future business developments. Data scientists
use methods and tools from data mining (pattern recognition), statistics, and visuali-
zation of multidimensional connections between data.
1.6 Further Reading 21
The conceptualization proposed above for both data management and the occupational
profiles involved contains technical, organizational, and operational aspects. However,
that does not mean that all roles within data architecture, data governance, data technol-
ogy, and data utilization must be consolidated into one unit within the structure of a com-
pany or organization.
The wide range of literature on the subject of databases shows the importance of this
field of IT. Some books describe not only relational, but also distributed, object-ori-
ented, or knowledge-based database management systems. Well-known works include
Connolly and Begg (2014), Hoffer et al. (2012), and Date (2004). The textbook by
Ullman (1982) follows a rather theoretical approach, Silberschatz et al. (2010) is quite
informative, and for a comprehensive overview, turn to Elmasri and Navathe (2015) or
Garcia-Molina et al. (2014). Gardarin and Valduriez (1989) offer an introduction to rela-
tional database technology and knowledge bases.
German works of note in the field of databases include Lang and Lockemann (1995),
Schlageter and Stucky (1983), Wedekind (1981), and Zehnder (2005). The textbooks by
Saake et al. (2013), Kemper and Eickler (2013), and Vossen (2008) explore the founda-
tions and extensions of database systems. Our definition of databases is based on the
work of Claus and Schwill (2001).
As for Big Data, the market has been flooded with books in recent years; however,
most of them merely give a superficial description of the subject. Two short
introductions by Celko (2014) and Sadalage and Fowler (2013) explain the terminology and
present the most influential NoSQL database approaches. The work of Redmond and
Wilson (2012) provides concrete descriptions of seven database management systems for
a more in-depth technical understanding.
There are also some German publications on the topic of Big Data. The book by Edlich
et al. (2011) offers an introduction to NoSQL database technologies and presents various
products for key-value store, document store, column store, and graph databases, respec-
tively. Freiknecht (2014) describes Hadoop, a popular framework for scalable and distributed
systems, including its components for data storage (HBase) and data warehousing (Hive).
The volume compiled by Fasel and Meier (2016) provides an overview of the development
of Big Data in business environments, introducing the major NoSQL databases,
presenting use cases, discussing legal aspects, and giving practical implementation advice.
For technical information on operational aspects of data management, we recommend
Dippold et al. (2005). Biethahn et al. (2000) dedicate several chapters of their volume on
data and development management to data architecture and governance. Heinrich and
Lehner (2005) and Österle et al. (1991) touch on some facets of data management in
their books on information management, while Meierʼs (1994) article defines the goals
and tasks of data management from a practical perspective. Meier and Johner (1991) and
Ortner et al. (1990) also handle some aspects of data governance.
References
Biethahn, J., Mucksch, H., Ruf, W.: Ganzheitliches Informationsmanagement (Bd. II: Daten- und
Entwicklungsmanagement). Oldenbourg, München (2000)
Celko, J.: Joe Celkoʼs Complete Guide to NoSQL—What Every SQL Professional Needs to Know
About Nonrelational Databases. Morgan Kaufmann, Amsterdam (2014)
Claus, V., Schwill, A.: Duden Informatik. Ein Fachlexikon für Studium und Praxis. 3. Edition.
Duden, Mannheim (2001)
Connolly, T., Begg, C.: Database Systems—A Practical Approach to Design, Implementation and
Management. Addison-Wesley, Boston (2014)
Date, C.J.: An Introduction to Database Systems. Addison-Wesley, Boston (2004)
Dippold, R., Meier, A., Schnider, W., Schwinn, K.: Unternehmensweites Datenmanagement—Von
der Datenbankadministration bis zum Informationsmanagement. Vieweg, Wiesbaden (2005)
Edlich, S., Friedland, A., Hampe, J., Brauer, B., Brückner, M.: NoSQL—Einstieg in die Welt
nichtrelationaler Web 2.0 Datenbanken. Hanser, München (2011)
Elmasri, R., Navathe, S.B.: Fundamentals of Database Systems. Addison-Wesley, Boston (2015)
Fasel, D., Meier, A. (eds.): Big Data—Grundlagen, Systeme und Nutzungspotenziale. Edition
HMD. Springer, Wiesbaden (2016)
Freiknecht, J.: Big Data in der Praxis—Lösungen mit Hadoop, HBase und Hive—Daten speichern,
aufbereiten und visualisieren. Hanser, München (2014)
Garcia-Molina, H., Ullman, J.D., Widom, J.: Database Systems: The Complete Book. Pearson
Education Limited, Harlow (2014)
Gardarin, G., Valduriez, P.: Relational Databases and Knowledge Bases. Addison Wesley, Mass.
(1989)
Heinrich, L.J., Lehner, F.: Informationsmanagement—Planung, Überwachung und Steuerung der
Informationsinfrastruktur. Oldenbourg, München (2005)
Hoffer, J.A., Prescott, M.B., Topi, H.: Modern Database Management. Prentice Hall, Upper
Saddle River (2012)
Kemper, A., Eickler, A.: Datenbanksysteme—Eine Einführung. Oldenbourg, München (2013)
Lang, S.M., Lockemann, P.C.: Datenbankeinsatz. Springer, Berlin (1995)
Meier, A.: Ziele und Aufgaben im Datenmanagement aus der Sicht des Praktikers.
Wirtschaftsinformatik 36(5), 455–464 (1994)
Meier, A., Johner, W.: Ziele und Pflichten der Datenadministration. Theorie und Praxis der
Wirtschaftsinformatik 28(161), 117–131 (1991)
Ortner, E., Rössner, J., Söllner, B.: Entwicklung und Verwaltung standardisierter Datenelemente.
Informatik-Spektrum 13(1), 17–30 (1990)
Österle, H., Brenner, W., Hilbers, K.: Unternehmensführung und Informationssystem—Der Ansatz
des St. Galler Informationssystem-Managements. Teubner, Wiesbaden (1991)
Redmond, E., Wilson, J.R.: Seven Databases in Seven Weeks—A Guide to Modern Databases and
The NoSQL Movement. The Pragmatic Bookshelf, Dallas (2012)
Saake, G., Sattler, K.-H., Heuer, A.: Datenbanken: Konzepte und Sprachen. mitp, Frechen (2013)
Sadalage, P.J., Fowler, M.: NoSQL Distilled—A Brief Guide to the Emerging World of Polyglot
Persistence. Addison-Wesley, Upper Saddle River (2013)
Schlageter, G., Stucky, W.: Datenbanksysteme—Konzepte und Modelle. Teubner, Stuttgart (1983)
Silberschatz, A., Korth, H.F., Sudarshan, S.: Database System Concepts. McGraw-Hill, New York
(2010)
Ullman, J.: Principles of Database Systems. Computer Science Press, Rockville (1982)
Data models provide a structured and formal description of the data and data relation-
ships required for an information system. When data is needed for IT projects, such as
the information about employees, departments, and projects in Fig. 2.1, the necessary
data categories and their relationships with each other can be defined. The definition
of those data categories, called entity sets, and the determination of relationship sets is
at this point done without considering the kind of database management system (SQL
or NoSQL) to be used for entering, storing, and maintaining the data later. This is to
ensure that the data and data relationships will remain stable from the usersʼ perspective
throughout the development and expansion of information systems.
It takes three steps to set up a database for describing a section of the real world: data
analysis, designing a conceptual data model (here: entity-relationship model), and con-
verting it into a relational or nonrelational database schema.
The goal of data analysis (see point 1 in Fig. 2.1) is to find, in cooperation with the
user, the data required for the information system and their relationships to each other
including the quantity structure. This is vital for an early determination of the system
boundaries. The requirement analysis is prepared in an iterative process, based on inter-
views, demand analyses, questionnaires, form compilations, etc. It contains at least a ver-
bal task description with clearly formulated objectives and a list of relevant pieces of
information (see the example in Fig. 2.1). The written description of data connections
can be complemented by graphical illustrations or a summarizing example. It is impera-
tive that the data analysis puts the facts necessary for the later development of a database
in the language of the users.
Step 2 in Fig. 2.1 shows the conception of the entity-relationship model, which contains
both the required entity sets and the relevant relationship sets. Our model depicts
the entity sets as rectangles and the relationship sets as rhombi. Based on the data
analysis from step 1, the main entity sets are DEPARTMENT, EMPLOYEE, and PROJECT.¹
To record which departments the employees are working in and which projects they
are part of, the relationship sets SUBORDINATE and INVOLVED are established and
graphically connected to the respective entity sets. The entity-relationship model, there-
fore, allows for the structuring and graphic representation of the facts gathered during
data analysis. However, it should be noted that the identification of entity and
relationship sets, as well as the definition of the relevant attributes, is not always a simple,
clear-cut process. Rather, this design step requires considerable experience and practice
on the part of the data architect.
Next, the entity-relationship model is converted into a relational database schema
(Fig. 2.1, 3a) or a graph-oriented database schema (Fig. 2.1, 3b). A database schema is
the formal description of the database objects using either tables or nodes and edges.
Since relational database management systems allow only tables as objects, both the
entity and the relationship sets must be expressed as tables. For this reason, there is one
entity set table each for the entity sets DEPARTMENT, EMPLOYEE, and PROJECT in
Fig. 2.1, step 3a. In order to represent the relationships in tables as well, separate tables
have to be defined for each relationship set. In our example, this results in the tables
SUBORDINATE and INVOLVED. Such relationship set tables always contain the keys
of the entity sets affected by the relationship as foreign keys and potentially additional
attributes of the relationship.
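The mapping into tables can be sketched with Python's built-in sqlite3 module. The column names (d_no, e_no, p_no, and so on) and the sample rows are assumptions made for this example; only the five table names and the idea of foreign keys in the relationship set tables come from the text.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
con.executescript("""
    -- Entity set tables
    CREATE TABLE department (d_no TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE employee   (e_no TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE project    (p_no TEXT PRIMARY KEY, title TEXT);
    -- Relationship set tables: keys of the related entity sets as foreign keys
    CREATE TABLE subordinate (
        e_no TEXT REFERENCES employee(e_no),
        d_no TEXT REFERENCES department(d_no),
        PRIMARY KEY (e_no, d_no)
    );
    CREATE TABLE involved (
        e_no TEXT REFERENCES employee(e_no),
        p_no TEXT REFERENCES project(p_no),
        percentage INTEGER,          -- additional attribute of the relationship
        PRIMARY KEY (e_no, p_no)
    );
    INSERT INTO employee VALUES ('E1', 'Murphy');
    INSERT INTO project  VALUES ('P17', 'Example project');
    INSERT INTO involved VALUES ('E1', 'P17', 70);
""")

# A participation that refers to a nonexistent employee is rejected.
try:
    con.execute("INSERT INTO involved VALUES ('E9', 'P17', 50)")
    violation_detected = False
except sqlite3.IntegrityError:
    violation_detected = True
print(violation_detected)
```

The foreign keys are what tie a row in a relationship set table back to the entities it connects; without them, the table would be an unrelated list of numbers.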
In step 3b of Fig. 2.1, we see the depiction of an equivalent graph database. Each
entity set corresponds to a node in the graph, so we have the nodes DEPARTMENT,
EMPLOYEE, and PROJECT. The relationship sets SUBORDINATE and INVOLVED
from the entity-relationship model are converted into edges in the graph-based model.
The relationship set SUBORDINATE becomes a directed edge from the DEPARTMENT
node to the EMPLOYEE node and is named HAS_AS_SUBORDINATE. Similarly, a
directed edge with the name IS_INVOLVED is drawn from the EMPLOYEE node to the
PROJECT node.
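The graph-oriented schema described above is small enough to write down directly. The following Python sketch records the node labels and directed edge types and follows the edges from a starting node; the helper reachable_labels is an assumption added for illustration.

```python
# Node labels correspond to the entity sets.
node_labels = {"DEPARTMENT", "EMPLOYEE", "PROJECT"}

# Directed edge types: (start node label, edge name, end node label).
edge_types = [
    ("DEPARTMENT", "HAS_AS_SUBORDINATE", "EMPLOYEE"),
    ("EMPLOYEE", "IS_INVOLVED", "PROJECT"),
]

def reachable_labels(start):
    """Labels reachable from start by following directed edge types."""
    found, frontier = set(), [start]
    while frontier:
        current = frontier.pop()
        for src, _, dst in edge_types:
            if src == current and dst not in found:
                found.add(dst)
                frontier.append(dst)
    return found

print(reachable_labels("DEPARTMENT"))
print(reachable_labels("PROJECT"))
```

Because the edges are directed, projects can be reached from departments via employees, but nothing can be reached from PROJECT in this schema.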
This is only a rough sketch of the process of data analysis, development of an entity-
relationship model, and definition of a relational or graph-based database schema. The
core insight is that a database design should be developed based on an entity-relationship
model. This allows for the gathering and discussion of data modeling factors with the
users, independently of any specific database system. Only in the next design step is the
most suitable database schema determined and mapped out. Both for relational and for
graph-oriented databases, there are clearly defined mapping rules (Sects. 2.3.2 and 2.4.2,
respectively).
¹ The names of entity and relationship sets are spelled in capital letters, analogous to table, node,
and edge names.
2.1 From Data Analysis to Database 27
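[Fig. 2.1: steps from data analysis to database; recoverable labels: 2. entity-relationship model with the entity sets DEPARTMENT, EMPLOYEE, and PROJECT and the relationship sets SUBORDINATE and INVOLVED; tables DEPARTMENT, EMPLOYEE, PROJECT, SUBORDINATE, and INVOLVED; graph with the nodes DEPARTMENT, EMPLOYEE, and PROJECT and the edges HAS_AS_SUBORDINATE and IS_INVOLVED]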
Section 2.2 explores the entity-relationship model in more detail, including the
methods of generalization and aggregation. Section 2.3 discusses modeling aspects for
relational databases, and Sect. 2.4 for graph databases; both explain the respective map-
ping rules for entity and relationship sets, as well as generalization and aggregation.
Section 2.5 illustrates the necessity to develop a ubiquitous data architecture within an
organization. A formula for the analysis, modeling, and database steps is provided in
Sect. 2.6, and a short literature review can be found in Sect. 2.7.
An entity is a specific object in the real world or our imagination that is distinct from all
others. This can be an individual, an item, an abstract concept, or an event. Entities of the
same type are combined into entity sets and are further characterized by attributes. These
attributes are property categories of the entity and/or the entity set, such as size, name,
weight, etc.
For each entity set, an identification key, i.e., one attribute or a specific combina-
tion of attributes, is set as unique. In addition to uniqueness, it also has to meet the cri-
terion of the minimal combination of attributes for identification keys as described in
Sect. 1.2.1.
In Fig. 2.2, an individual employee is characterized as an entity by their concrete
attributes. If, in the course of internal project monitoring, all employees are to be listed
with their names and address data, an entity set EMPLOYEE is created. An artificial
employee number in addition to the attributes Name, Street, and City allows for the
unique identification of the individual employees (entities) within the staff (entity set).
[Fig. 2.2: entity set EMPLOYEE with the attributes E#, Name, Street, and City]
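The role of the identification key can be sketched in Python. The attribute names follow the EMPLOYEE entity set of Fig. 2.2; the class names, the employee numbers, and the address values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Employee:
    e_no: str     # artificial employee number (E#), the identification key
    name: str
    street: str
    city: str

class EntitySet:
    """Collection of entities of one type; the key value must be unique."""

    def __init__(self):
        self._by_key = {}

    def add(self, entity):
        if entity.e_no in self._by_key:
            raise ValueError(f"duplicate identification key: {entity.e_no}")
        self._by_key[entity.e_no] = entity

    def get(self, key):
        return self._by_key[key]

employees = EntitySet()
employees.add(Employee("E7", "Murphy", "Main Street", "Kent"))
# Same name and address, different key: still a distinct entity.
employees.add(Employee("E8", "Murphy", "Main Street", "Kent"))
print(employees.get("E8"))
```

The example shows why an artificial number is used: name and address alone could not tell the two employees apart.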
Besides the entity sets themselves, the relationships between them are of interest and
can form sets of their own. Similar to entity sets, relationship sets can be characterized
by attributes.
Figure 2.3 presents the statement “Employee Murphy does 70% of her work on pro-
ject P17” as a concrete example of an employee-project relationship. The respective rela-
tionship set INVOLVED is to list all project participations of the employees. It contains a
concatenated key constructed from the foreign keys employee number and project num-
ber. This combination of attributes ensures the unique identification of each project par-
ticipation by an employee. Along with the concatenated key, the relationship set receives
its own attribute named “Percentage” specifying the percentage of working hours that
employees allot to each project they are involved in.
In general, relationships can be understood as associations in two directions: The
relationship set INVOLVED can be interpreted from the perspective of the EMPLOYEE
entity set as ‘one employee can participate in multiple projects’; from the entity set
PROJECT as ‘one project is handled by multiple employees’.
The association of an entity set ES_1 to another entity set ES_2 is the meaning of the
relationship in that direction. As an example, the relationship DEPARTMENT_HEAD
in Fig. 2.4 has two associations: On the one hand, each department has one employee in the
role of department head; on the other hand, some employees could fill the role of
department head for a specific department.
[Fig. 2.3: relationship set INVOLVED with the attributes E#, P#, and Percentage]
[Fig. 2.4: relationship sets DEPARTMENT_HEAD (edge labels c, 1) and SUBORDINATE (edge labels 1, m) attached to DEPARTMENT. Association types: type 1 = "exactly one"; type c = "none or one"; type m = "one or multiple"; type mc = "none, one, or multiple"]
Each association from an entity set ES_1 to an entity set ES_2 can be weighted by an
association type. The association type from ES_1 to ES_2 indicates how many entities of
the associated entity set ES_2 can be assigned to a specific entity from ES_1.² The main
distinction is between single, conditional, multiple, and multiple-conditional association
types.
• Unique association (type 1): In unique, or type 1, associations, each entity from the
entity set ES_1 is assigned exactly one entity from the entity set ES_2. For example, our
data analysis showed that each employee is subordinate to exactly one department, i.e.,
matrix management is not permitted. The SUBORDINATE relationship from employees
to departments in Fig. 2.4 is, therefore, a unique/type 1 association.
• Conditional association (type c): A type c association means that each entity from the
entity set ES_1 is assigned zero or one, i.e., maximum one, entity from the entity set
ES_2. The relationship is optional, so the association type is conditional. An example of
a conditional association is the relationship DEPARTMENT_HEAD (Fig. 2.4), since not
every employee can have the role of a department head.
² It is common in database literature to note the association type from ES_1 to ES_2 next to the
associated entity set, i.e., ES_2.
2.2 The Entity-Relationship Model 31
• Multiple-conditional association (type mc): Each entity from the entity set ES_1 is
assigned zero, one, or multiple entities from the entity set ES_2. Multiple-conditional
associations differ from multiple associations in that not every entity from ES_1 must
have a relationship to any entity in ES_2. In analogy to that type, they are also called
conditional-complex. We will exemplify this with the INVOLVED relationship in Fig. 2.4
as well, but this time from the employeesʼ perspective: While not every employee has to
participate in projects, there are some employees involved in multiple projects.
The association types provide information about the cardinality of the relationship. As
we have seen, each relationship contains two association types. The cardinality of a
relationship between the entity sets ES_1 and ES_2 is, therefore, a pair of association
types of the form:
Cardinality := (association type from ES_1 to ES_2, association type from ES_2 to ES_1)³
For example, the pair (mc,m) of association types between EMPLOYEE and PROJECT
indicates that the INVOLVED relationship is (multiple-conditional, multiple).
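As a rough sketch of how a cardinality pair relates to concrete data, the following Python functions derive the tightest association type compatible with a sample of relationship pairs. The function names and the sample data are assumptions made for illustration. In practice the association type is a modeling decision taken during data analysis, not something inferred from a sample.

```python
from collections import Counter

def association_type(counts):
    """Classify an association from the number of partners of each source entity.

    Returns "1", "c", "m", or "mc": the tightest type that fits the counts.
    """
    lowest, highest = min(counts), max(counts)
    if highest <= 1:
        return "1" if lowest == 1 else "c"
    return "m" if lowest >= 1 else "mc"

def cardinality(source, target, pairs):
    """Return (type from source to target, type from target to source)."""
    forward = Counter(s for s, _ in pairs)
    backward = Counter(t for _, t in pairs)
    return (
        association_type([forward[s] for s in source]),
        association_type([backward[t] for t in target]),
    )

# Hypothetical sample data: employee E3 is not involved in any project.
employees = ["E1", "E2", "E3"]
projects = ["P1", "P2"]
participations = [("E1", "P1"), ("E1", "P2"), ("E2", "P1")]

involved_cardinality = cardinality(employees, projects, participations)
print(involved_cardinality)
```

With this sample, the pair matches the (mc,m) cardinality given for INVOLVED: some employees have no project, while every project has at least one employee.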
Figure 2.5 shows all 16 possible combinations of association types. The first quad-
rant contains four options of unique-unique relationships (case B1 in Fig. 2.5). They are
characterized by the cardinalities (1,1), (1,c), (c,1), and (c,c). For case B2, the unique-
complex relationships, also called hierarchical relationships, there are eight possible
combinations. The complex-complex or network-like relationships (case B3) comprise
the four cases (m,m), (m,mc), (mc,m), and (mc,mc).
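The grouping of the 16 combinations into the cases B1, B2, and B3 can be reproduced with a few lines of Python. The names `TYPES`, `SIMPLE`, and `relationship_class` are chosen here for illustration only.

```python
from itertools import product

TYPES = ("1", "c", "m", "mc")
SIMPLE = {"1", "c"}  # at most one associated entity; "m" and "mc" are complex

def relationship_class(a1, a2):
    """Assign a pair of association types to case B1, B2, or B3."""
    simple_sides = (a1 in SIMPLE) + (a2 in SIMPLE)
    return {
        2: "B1 unique-unique",
        1: "B2 unique-complex (hierarchical)",
        0: "B3 complex-complex (network-like)",
    }[simple_sides]

groups = {}
for a1, a2 in product(TYPES, repeat=2):
    groups.setdefault(relationship_class(a1, a2), []).append((a1, a2))

for case, pairs in groups.items():
    print(case, len(pairs), pairs)
```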
Instead of the association types, minimum and maximum thresholds can be set if
deemed more practical. For instance, instead of the multiple association type from
projects to employees, a range of (MIN, MAX) := (3, 8) could be set. The lower threshold
defines that at least three employees must be involved in a project, while the maximum
threshold limits the number of participating employees to eight.
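A (MIN, MAX) range of this kind is straightforward to check against data. The sketch below assumes a hypothetical mapping from project numbers to sets of employee numbers; the function name and the sample staffing are illustrative only.

```python
def check_range(participants_per_project, minimum=3, maximum=8):
    """Return the projects whose number of employees violates (MIN, MAX)."""
    return {
        project: len(employees)
        for project, employees in participants_per_project.items()
        if not minimum <= len(employees) <= maximum
    }

# Hypothetical staffing: P2 has only two employees and violates MIN = 3.
staffing = {
    "P1": {"E1", "E2", "E3"},
    "P2": {"E1", "E4"},
}
violations = check_range(staffing)
print(violations)
```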
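[Fig. 2.5: overview of the possible cardinalities of relationships, with the association types 1, c, m, and mc for A1 and A2]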
• Overlapping entity subsets: The specialized entity sets overlap with each other.
As an example, if the entity set EMPLOYEE has two subsets PHOTO_CLUB and
SPORTS_CLUB, the club members are consequently considered employees.
However, employees can be active in both the companyʼs photography and sports
club, i.e., the entity subsets PHOTO_CLUB and SPORTS_CLUB overlap.
• Overlapping-complete entity subsets: The specialization entity sets overlap with
each other and completely cover the generalized entity set. If we add a CHESS_CLUB
entity subset to the PHOTO_CLUB and SPORTS_CLUB and assume that
every employee joins at least one of these clubs when starting work at the company,
we obtain an overlapping-complete constellation. Every employee is a member of at
least one of the three clubs, but they can also be in two or all three clubs.
• Disjoint entity subsets: The entity sets in the specialization are disjoint, i.e., mutually
exclusive. To illustrate this, we will once again use the EMPLOYEE entity set, but
this time with the specializations MANAGEMENT_POSITION and SPECIALIST.
Since employees cannot at the same time hold a leading position and pursue a spe-
cialization, the two entity subsets are disjoint.
• Disjoint-complete entity subsets: The specialization entity sets are disjoint, but
together completely cover the generalized entity set. As a result, there must be a
subentity in the specialization for each entity in the superordinate entity set and
vice versa. For example, take the entity set EMPLOYEE with a third specialization
TRAINEE in addition to the MANAGEMENT_POSITION and SPECIALIST sub-
sets, where every employee is either part of management, a technical specialist, or a
trainee.
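The four kinds of specialization listed above differ in only two properties: whether the subsets overlap and whether together they cover the generalized entity set. A minimal Python sketch, with a hypothetical function name and made-up employee numbers, makes this explicit:

```python
def classify_specialization(superset, subsets):
    """Return 'overlapping', 'overlapping-complete', 'disjoint', or 'disjoint-complete'."""
    members = [entity for subset in subsets.values() for entity in subset]
    if not set(members) <= superset:
        raise ValueError("every subentity must belong to the generalized entity set")
    disjoint = len(members) == len(set(members))  # no entity appears in two subsets
    complete = set(members) == superset           # every entity appears in a subset
    kind = "disjoint" if disjoint else "overlapping"
    return f"{kind}-complete" if complete else kind

# Hypothetical employee numbers for the examples from the text.
employee = {"E1", "E2", "E3", "E4"}
clubs = {"PHOTO_CLUB": {"E1", "E2"}, "SPORTS_CLUB": {"E2", "E3"}}
roles = {
    "MANAGEMENT_POSITION": {"E1"},
    "SPECIALIST": {"E2", "E3"},
    "TRAINEE": {"E4"},
}

print(classify_specialization(employee, clubs))
print(classify_specialization(employee, roles))
```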
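[Figure: generalization of EMPLOYEE into the disjoint-complete subsets MANAGEMENT_POSITION, SPECIALIST, and TRAINEE, each with association type c]
[Fig. 2.7: relationship set CORPORATION_STRUCTURE on the entity set COMPANY, with the associations "Company consists of..." (mc) and "Subsidiary is dependent on..." (mc)]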
of the entity set COMPANY with itself. Each company ID from the COMPANY entity
set is used in CORPORATION_STRUCTURE as a foreign key twice, once as ID for
superordinate and once for subordinate company holdings (Figs. 2.21 and 2.36).
CORPORATION_STRUCTURE can also contain additional relationship attributes such
as shares.
In general, aggregation describes the structured merging of entities in what is called
a part-of structure. In CORPORATION_STRUCTURE, each company can be part of
a corporate group. Since CORPORATION_STRUCTURE in our example is defined as
a network, the association types of both superordinate and subordinate parts must be
multiple-conditional.
The two abstraction processes of generalization and aggregation are major structuring
elements⁴ in data modeling. In the entity-relationship model, they can be represented
by specific graphic symbols or as special boxes. For instance, the aggregation
in Fig. 2.7 could also be represented by a generalized entity set CORPORATION
implicitly encompassing the entity set COMPANY and the relationship set
CORPORATION_STRUCTURE.
Part-of structures do not have to be networks, but can also be hierarchical. Figure 2.8
shows an ITEM_LIST as an illustration: Each item can be composed of multiple subitems,
while on the other hand, each subitem points to exactly one superordinate item
(Figs. 2.22 and 2.37).
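The difference between a network-like and a hierarchical part-of structure can be sketched in Python by storing each relationship as a pair (superordinate, subordinate) and counting the superordinate elements per part. The company and item identifiers and the share values are hypothetical; only the structure follows the text.

```python
from collections import defaultdict

def superordinates(part_of):
    """Map each subordinate element to the set of its superordinate elements."""
    result = defaultdict(set)
    for superordinate, subordinate in part_of:
        result[subordinate].add(superordinate)
    return result

def is_hierarchy(part_of):
    """True if no element has more than one superordinate element."""
    return all(len(parents) <= 1 for parents in superordinates(part_of).values())

# CORPORATION_STRUCTURE: the company ID is used twice as a foreign key, and the
# relationship carries its own attribute (shares in percent). Hypothetical values.
corporation_structure = {("C1", "C3"): 60, ("C2", "C3"): 40}

# ITEM_LIST: each subitem points to a single superordinate item. Hypothetical values.
item_list = {("I1", "I2"), ("I1", "I3"), ("I3", "I4")}

print(is_hierarchy(corporation_structure))  # C3 belongs to C1 and C2: a network
print(is_hierarchy(item_list))
```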
The entity-relationship model is very important for computer-based data modeling
tools, as it is supported to some extent by many CASE (computer-aided software
engineering) tools. Depending on the quality of these tools, both generalization and
aggregation can be described in separate design steps, on top of entity and relationship
sets. Only then can the entity-relationship model be converted, in part automatically, into
a database schema. Since this is not always a one-to-one mapping, it is up to the data
architect to make the appropriate decisions. Sections 2.3.2 and 2.4.2 provide some simple
mapping rules to help convert an entity-relationship model into a relational or graph
database.
[Fig. 2.8: hierarchical relationship set ITEM_LIST on the entity set ITEM, with the association types c and mc]
The study of the relational model has spawned a new database theory that describes for-
mal aspects precisely. One of the major fields within this theory is the normal forms,
which are used to discover and study dependencies within tables in order to avoid redun-
dant information and resulting anomalies.
[Fig. 2.9: table DEPARTMENT_EMPLOYEE]
is redundant, since the same value is listed in the table multiple times. It would be prefer-
able to store the name going with each department number in a separate table for future
reference instead of redundantly carrying it along for each employee.
Tables with redundant information can lead to database anomalies, which can take
one of three forms: If, for organizational reasons, a new department A9, labeled market-
ing, is to be defined in the DEPARTMENT_EMPLOYEE table from Fig. 2.9, but there
are no employees assigned to that department yet, there is no way of adding it. This is
an insertion anomaly—no new table rows can be inserted without a unique employee
number.
Deletion anomalies occur if the removal of some data results in the inadvertent loss of
other data. For example, if we were to delete all employees from the DEPARTMENT_
EMPLOYEE table, we would also lose the department numbers and names.
The last kind are update anomalies (or modification anomalies): If the name of
department A3 were to be changed from IT to Data Processing, each of the departmentʼs
employees would have to be edited individually, meaning that although only one detail is
changed, the DEPARTMENT_EMPLOYEE table has to be adjusted in multiple places.
This inconvenient situation is what we call an update anomaly.
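The three anomalies can be reproduced with a small in-memory SQLite table. This is a sketch under assumptions: the column names, the employee rows, and the department A6 "Finance" are invented for illustration; only the departments A3 (IT) and A9 (marketing) and the renaming to Data Processing come from the text.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE department_employee (
        e_no TEXT PRIMARY KEY NOT NULL,
        name TEXT, d_no TEXT, department_name TEXT);
    INSERT INTO department_employee VALUES
        ('E1', 'Murphy',  'A3', 'IT'),
        ('E2', 'Stewart', 'A3', 'IT'),
        ('E3', 'Howard',  'A6', 'Finance');
""")

# Insertion anomaly: department A9 cannot be stored without an employee number.
try:
    con.execute(
        "INSERT INTO department_employee VALUES (NULL, NULL, 'A9', 'Marketing')"
    )
    insert_rejected = False
except sqlite3.IntegrityError:
    insert_rejected = True

# Update anomaly: one department is renamed, but several rows must change.
rows_changed = con.execute(
    "UPDATE department_employee SET department_name = 'Data Processing' "
    "WHERE d_no = 'A3'"
).rowcount

# Deletion anomaly: removing all employees also removes every department.
con.execute("DELETE FROM department_employee")
departments_left = con.execute(
    "SELECT COUNT(DISTINCT d_no) FROM department_employee"
).fetchone()[0]

print(insert_rejected, rows_changed, departments_left)
```

Keeping the department names in a separate table keyed by department number, as suggested above, would let A9 be inserted on its own, reduce the renaming to a single row, and preserve departments when employees are deleted.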
The following paragraphs discuss normal forms, which help to avoid redundancies
and anomalies. Figure 2.10 gives an overview of the various normal forms and their
definitions. Below, we will take a closer look at different kinds of dependencies and give
some practical examples.
As can be seen in Fig. 2.10, the normal forms progressively limit acceptable tables.
For instance, a table or entire database schema in the third normal form must meet all