Rough Set Theory
1. Introduction
Rough Set Theory, proposed in 1982 by Zdzislaw Pawlak, is in a state of constant
development. Its methodology is concerned with the classification and analysis of imprecise,
uncertain or incomplete information and knowledge, and it is considered one of the first
non-statistical approaches in data analysis (Pawlak, 1982).
The fundamental concept behind Rough Set Theory is the approximation of a set by lower
and upper approximation spaces, these approximation spaces being the formal classification
of knowledge regarding the domain of interest.
The subset generated by the lower approximation is characterized by objects that definitely
form part of the subset of interest, whereas the upper approximation is characterized by
objects that possibly form part of the subset of interest. Every subset defined through these
upper and lower approximations is known as a Rough Set.
Over the years Rough Set Theory has become a valuable tool in the resolution of various
problems, such as the representation of uncertain or imprecise knowledge and knowledge
analysis.
The chapter is divided into the four following topics:
• Fundamental concepts;
• Rough set with tools for data mining;
• Applications of rough set theory;
• Case – rough set with tools in dengue diagnosis.
2. Fundamental concepts
Source: Data Mining and Knowledge Discovery in Real Life Applications, edited by Julio Ponce and Adem Karahoca, ISBN 978-3-902613-53-0, pp. 438, February 2009, I-Tech, Vienna, Austria.

Rough Set Theory has been under continuous development for many years, and a growing
number of researchers have become interested in its methodology. It is a formal theory
derived from fundamental research on logical properties of information systems. From the
outset, rough set theory has been a methodology of database mining or knowledge
discovery in relational databases. This section presents the concepts of Rough Set Theory,
which coincide in part with the concepts of other theories that treat uncertain and vague
information. Among the most traditional existing approaches for the modeling and
treatment of uncertainty are Dempster-Shafer theory and fuzzy set theory (Pawlak et al.,
1995). The main concepts related to Rough Set Theory are presented as follows:
2.1 Set
A set of objects that possess similar characteristics is a fundamental part of mathematics.
All mathematical objects, such as relations, functions and numbers, can be considered as
sets. The classical concept of a set even admits an apparent contradiction: a "grouping"
from which all elements are absent is still a set, known as the empty set (Stoll, 1979). The
various components of a set are known as elements, and the relationship between an
element and a set is called a pertinence (membership) relation. Cardinality is the way of
measuring the number of elements of a set. Examples of specific sets that treat vague and
imprecise data are described below:
a. Fuzzy Set
Proposed by the mathematician Lotfi Zadeh in the second half of the sixties, it has as its
objective the mathematical treatment of vague and approximate concepts, for subsequent
programming and storage on computers.
In order to obtain a mathematical formalism for fuzzy sets, Zadeh used classical set theory,
where any set can be characterized by a function. In the case of a fuzzy set, the characteristic
function is generalized so that the values designated to elements of the universe set U
belong to the interval of real numbers [0,1].
The fuzzy characteristic function is µA: U → [0,1], where the value indicates the degree of
pertinence of an element of the set U in relation to the set A, that is, how possible it is for an
element x of U to belong to A. This function is known as the pertinence (membership)
function, and the set A is the fuzzy set (Zadeh, 1965).
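The membership-function idea can be sketched briefly in Python; the concept "high temperature" and the 37–39 breakpoints below are illustrative assumptions, not values taken from the chapter.

```python
def fuzzy_high_temp(t):
    """Illustrative fuzzy membership for the concept 'high temperature'.
    Temperatures at or below 37.0 do not belong (0.0), at or above 39.0
    fully belong (1.0), and membership rises linearly in between."""
    if t <= 37.0:
        return 0.0
    if t >= 39.0:
        return 1.0
    return (t - 37.0) / (39.0 - 37.0)

# A classical (crisp) set would force 0 or 1; the fuzzy set admits degrees.
print(fuzzy_high_temp(36.5))  # 0.0
print(fuzzy_high_temp(38.0))  # 0.5
print(fuzzy_high_temp(39.5))  # 1.0
```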
b. Rough Set
An approach first put forward by the mathematician Zdzislaw Pawlak at the beginning of
the eighties, it is used as a mathematical tool to treat the vague and the imprecise. Rough Set
Theory is similar to Fuzzy Set Theory; however, uncertainty and imprecision in this
approach are expressed by a boundary region of a set, and not by a partial membership as in
Fuzzy Set Theory. The rough set concept can be defined quite generally by means of interior
and closure topological operations known as approximations (Pawlak, 1982).
Observation:
It is interesting to compare definitions of classical sets, fuzzy sets and rough sets. Classical
set is a primitive notion and is defined intuitively or axiomatically. Fuzzy set is defined by
employing the fuzzy membership function, which involves advanced mathematical
structures, numbers and functions. Rough set is defined by topological operations called
approximations, thus this definition also requires advanced mathematical concepts.
These objects are described in accordance with the format of the data table, in which rows
are considered objects for analysis and columns as attributes (Wu et al., 2004). Below is
shown an example of an information table (Table 1).

Patient   Headache   Vomiting   Temperature   Viral illness
#1        No         Yes        High          Yes
#2        Yes        No         High          Yes
#3        Yes        Yes        Very high     Yes
#4        No         Yes        Normal        No
#5        Yes        No         High          No
#6        No         Yes        Very high     Yes

Table 1. Example of an information table
2.4 Approximations
The starting point of rough set theory is the indiscernibility relation, generated by
information concerning the objects of interest. The indiscernibility relation is intended to
express the fact that, due to lack of knowledge, one is unable to discern some objects from
others by employing the available information. Approximation is another important concept
in Rough Set Theory, being associated with the topological operations of approximation
(Wu et al., 2004). The lower and the upper approximations of a set are interior and closure
operations in a topology generated by the indiscernibility relation. The types of
approximations used in Rough Set Theory are presented and described below.
a. Lower Approximation (B”)
The lower approximation is a description of the domain objects that are known with
certainty to belong to the subset of interest.
The lower approximation set of a set X with regard to R is the set of all objects that can
certainly be classified as belonging to X with regard to R, that is, the set B”.
b. Upper Approximation (B*)
The upper approximation is a description of the objects that possibly belong to the subset of
interest. The upper approximation set of a set X with regard to R is the set of all objects that
can possibly be classified as belonging to X with regard to R, that is, the set B*.
The boundary region BR of a set X is the difference between the upper and the lower
approximation, BR = B* − B”. If the boundary region of X is the empty set (BR = ∅), then the
set X is considered "crisp", that is, exact in relation to R; otherwise, if the boundary region is
not empty (BR ≠ ∅), the set X is considered "rough", that is, inexact in relation to R.

[Figure: regions of an approximated set — the positive region POS(B) = B”, the boundary
region BR(B) = B* − B”, and the negative region NEG(B) outside B*.]
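The lower and upper approximations can be sketched for Table 1; the Python fragment below is an illustrative implementation of the standard definitions, not code from the chapter.

```python
# Table 1 from the chapter: (headache, vomiting, temperature) -> viral illness.
TABLE1 = {
    "#1": (("no", "yes", "high"), "yes"),
    "#2": (("yes", "no", "high"), "yes"),
    "#3": (("yes", "yes", "very high"), "yes"),
    "#4": (("no", "yes", "normal"), "no"),
    "#5": (("yes", "no", "high"), "no"),
    "#6": (("no", "yes", "very high"), "yes"),
}

def approximations(table, decision_value):
    """Return (lower, upper) approximations of the set of objects whose
    decision equals decision_value, using the indiscernibility classes
    induced by the condition attributes."""
    classes = {}
    for obj, (conds, _) in table.items():
        classes.setdefault(conds, set()).add(obj)
    target = {o for o, (_, d) in table.items() if d == decision_value}
    lower, upper = set(), set()
    for cls in classes.values():
        if cls <= target:   # class entirely inside the target set
            lower |= cls
        if cls & target:    # class intersects the target set
            upper |= cls
    return lower, upper

low, up = approximations(TABLE1, "yes")
print(sorted(low))       # ['#1', '#3', '#6'] -> certainly viral illness
print(sorted(up))        # ['#1', '#2', '#3', '#5', '#6'] -> possibly viral illness
print(sorted(up - low))  # ['#2', '#5'] -> boundary region
```

Since the boundary region is non-empty, the set of patients with viral illness in Table 1 is rough.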
The coefficient used to measure the quality of an approximation is the imprecision
coefficient αB(X). It is obtained numerically from the elements of the lower and upper
approximations:

αB(X) = card(B”(X)) / card(B*(X))     (1)

If αB(X) = 1, X is precise (crisp) with regard to the attributes B; if αB(X) < 1, X is rough with
regard to the attributes B. Applying this formula to Table 1 gives αB(X) = 3/5 for the
patients that possibly have viral illness.
The quality coefficients of the upper and lower approximation measure the proportion of
the universe covered by each approximation:

αB(B*(X)) = card(B*(X)) / card(U)     (2)
αB(B”(X)) = card(B”(X)) / card(U)     (3)
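The value αB(X) = 3/5 for Table 1 can be checked with a short script; this is an illustrative sketch that recomputes the approximations from the standard definitions.

```python
from fractions import Fraction

# Table 1: (headache, vomiting, temperature) -> viral illness.
rows = [
    ("#1", ("no", "yes", "high"), "yes"),
    ("#2", ("yes", "no", "high"), "yes"),
    ("#3", ("yes", "yes", "very high"), "yes"),
    ("#4", ("no", "yes", "normal"), "no"),
    ("#5", ("yes", "no", "high"), "no"),
    ("#6", ("no", "yes", "very high"), "yes"),
]
classes = {}
for name, conds, _ in rows:
    classes.setdefault(conds, set()).add(name)
target = {name for name, _, d in rows if d == "yes"}

lower, upper = set(), set()
for cls in classes.values():
    if cls <= target:
        lower |= cls
    if cls & target:
        upper |= cls

# Imprecision coefficient, Eq. (1): card(lower) / card(upper).
alpha = Fraction(len(lower), len(upper))
print(alpha)  # 3/5
```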
The number of consistent rules contained in a decision table is known as the factor of
consistency, denoted γ(C, D), where C is the condition and D the decision. If γ(C, D) = 1, the
decision table is consistent; if γ(C, D) ≠ 1, the decision table is inconsistent.
For Table 1, γ(C, D) = 4/6; that is, Table 1 possesses two inconsistent rules (patient 2,
patient 5) and four consistent rules (patients 1, 3, 4 and 6) within a universe of six rules
(Ziarko & Shan, 1995). Decision rules are frequently shown as implications of the form
"if... then...". One rule for the implication viral illness is shown below:
If
Headache = no and
Vomiting = yes and
Temperature = high
Then
Viral Illness = yes
www.intechopen.com
40 Data Mining and Knowledge Discovery in Real Life Applications
A set of decision rules is designated a decision algorithm, because with each decision table a
decision algorithm can be associated, consisting of all the decision rules that occur in the
respective decision table. A distinction may be made between a decision algorithm and a
decision table: a decision table is a data set, whereas a decision algorithm is a collection of
implications, that is, logical expressions (Pawlak, 1991).
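The consistency factor γ(C, D) = 4/6 for Table 1 can be computed by grouping indiscernible objects and counting those whose class carries a single decision value; the fragment below is an illustrative sketch.

```python
from fractions import Fraction

# Table 1: condition tuple (headache, vomiting, temperature) and decision.
rows = [
    (("no", "yes", "high"), "yes"),
    (("yes", "no", "high"), "yes"),
    (("yes", "yes", "very high"), "yes"),
    (("no", "yes", "normal"), "no"),
    (("yes", "no", "high"), "no"),
    (("no", "yes", "very high"), "yes"),
]
classes = {}
for conds, dec in rows:
    classes.setdefault(conds, []).append(dec)

# An object is consistent when every object indiscernible from it
# carries the same decision value.
consistent = sum(len(v) for v in classes.values() if len(set(v)) == 1)
gamma = Fraction(consistent, len(rows))
print(gamma)  # 2/3, i.e. the 4/6 of the text in lowest terms
```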
If all values of attributes from D are uniquely determined by values of attributes from C,
then D depends totally on C; that is, there exists a functional dependency between the
values of D and C. For example, in Table 1 there are no total dependencies whatsoever. If,
however, the value of the attribute Temperature for patient #5 were "normal" instead of
"high", there would be a total dependency {Temperature} ⇒ {Viral illness}, because to each
value of the attribute Temperature there would correspond a unique value of the attribute
Viral illness.
A more general concept, designated partial dependency of attributes, is also needed. In
Table 1 the attribute Temperature determines some values of the attribute Viral illness
uniquely: (Temperature, very high) implies (Viral illness, yes), and (Temperature, normal)
implies (Viral illness, no), but (Temperature, high) does not always imply (Viral illness, yes).
Thus partial dependency means that only some values of D are determined by values of C.
Formally, the dependency among attributes can be defined in the following way: if D and C
are subsets of A, D depends on C in degree k (0 ≤ k ≤ 1), denoted C ⇒k D, where k = γ(C, D).
If k = 1, D depends totally on C; if k < 1, D depends only partially on C. In Table 1,
{Temperature} ⇒ {Viral illness} has k = 3/6, and {Vomiting} ⇒ {Viral illness} has k = 0.
It can easily be seen that if D depends totally on C then I(C) ⊆ I(D). This means that the
partition generated by C is finer than the partition generated by D, and that the concept of
dependency presented in this section corresponds to that considered in relational databases.
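The dependency degree k = γ(C, D) for single condition attributes of Table 1 can be sketched as follows; the attribute indexing is an illustrative choice.

```python
from fractions import Fraction

# Table 1: (headache, vomiting, temperature) -> viral illness.
data = [
    (("no", "yes", "high"), "yes"),
    (("yes", "no", "high"), "yes"),
    (("yes", "yes", "very high"), "yes"),
    (("no", "yes", "normal"), "no"),
    (("yes", "no", "high"), "no"),
    (("no", "yes", "very high"), "yes"),
]
ATTR = {"headache": 0, "vomiting": 1, "temperature": 2}

def degree(attr):
    """k = gamma(C, D): fraction of objects whose indiscernibility class
    on the single attribute `attr` determines the decision uniquely."""
    classes = {}
    for conds, dec in data:
        classes.setdefault(conds[ATTR[attr]], []).append(dec)
    pos = sum(len(v) for v in classes.values() if len(set(v)) == 1)
    return Fraction(pos, len(data))

print(degree("temperature"))  # 1/2, i.e. 3/6: only very high and normal decide
print(degree("vomiting"))     # 0: no value of Vomiting determines the decision
```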
properties and, more importantly, the consistency of the system (Cerchiari et al., 2006). If the
data relative to headache and vomiting are subtracted, the resultant data set is equivalent to
the original data in relation to approximation and dependency, as it has the same
approximation precision and the same dependency degree as the original set of attributes,
with, however, one fundamental difference: the set of attributes to be considered is smaller.
The process of reducing an information system such that the set of attributes of the reduced
information system is independent, and no attribute can be eliminated further without
losing some information from the system, yields what is known as a reduct. If an attribute
from the subset B ⊆ A preserves the indiscernibility relation RA, then the attributes A − B
are dispensable. Reducts are the minimal subsets, i.e., those that do not contain any
dispensable attributes. Therefore, the reduction should preserve the capacity to classify
objects without altering the form of representing the knowledge (Geng & Zhu, 2006).
Applying the definitions to an information system such as the one presented in Table 1,
with B a subset of A and a an attribute belonging to B:
• a is dispensable in B if I(B) = I(B − {a}); otherwise a is indispensable in B;
• the set B is independent if all its attributes are indispensable;
• a subset B' of B is a reduct of B if B' is independent and I(B') = I(B).
A reduct is a set of attributes that preserves the basic characteristics of the original data set;
therefore, the attributes that do not belong to a reduct are superfluous with regard to the
classification of elements of the universe.
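The definitions above can be sketched as a brute-force reduct search over the condition attributes of Table 1; the exhaustive enumeration is an illustrative choice suitable only for small attribute sets.

```python
from itertools import combinations

# Table 1 condition attributes; index 0=headache, 1=vomiting, 2=temperature.
OBJECTS = {
    "#1": ("no", "yes", "high"),
    "#2": ("yes", "no", "high"),
    "#3": ("yes", "yes", "very high"),
    "#4": ("no", "yes", "normal"),
    "#5": ("yes", "no", "high"),
    "#6": ("no", "yes", "very high"),
}
NAMES = ("headache", "vomiting", "temperature")

def partition(attrs):
    """Indiscernibility partition I(B) induced by the attribute indices in B."""
    classes = {}
    for obj, vals in OBJECTS.items():
        classes.setdefault(tuple(vals[i] for i in attrs), set()).add(obj)
    return set(frozenset(c) for c in classes.values())

full = partition(range(3))
reducts = []
for r in range(1, 4):
    for combo in combinations(range(3), r):
        # A reduct preserves the full partition and contains no smaller reduct.
        if partition(combo) == full and not any(set(x) <= set(combo) for x in reducts):
            reducts.append(combo)

print([[NAMES[i] for i in c] for c in reducts])  # [['headache', 'temperature']]
```

For Table 1 this finds {Headache, Temperature} as a reduct: Vomiting is dispensable, since dropping it leaves the indiscernibility partition unchanged.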
[Table 2. Methods of Data Mining that can be applied in the tasks of KDD]
During the data mining stage, much useful knowledge is gained in respect of the
application. Many authors consider data mining synonymous with KDD; in the context of
this stage, the KDD process is often known as data mining, and in this research data mining
will be intended as KDD (Piatetsky-Shapiro & Matheus, 1995; Mitchell, 1999; Wei, 2003).
Data mining has become an area of research of increasing importance, also referred to as
knowledge discovery in databases (KDD); it consists of a process of non-trivial extraction of
implicit, previously unknown and potentially useful information, such as knowledge rules,
constraints and regularities, from data in databases.
c. Post Processing Stage
The post-processing stage treats the knowledge obtained during the data mining stage. This
stage is not always necessary; however, it allows validation of the usefulness of the
discovered knowledge.
Rough set techniques can be employed as an approach to the problem of data mining and
knowledge extraction.
5.3 Approximation
The lower and the upper approximations of a set are interior and closure operations in a
topology generated by an indiscernibility relation. The approximation concepts described
earlier are now applied to the data of Table 3, organized as shown below:
Patient   blotched_red_skin   muscular_pain_articulations   Temperature   Dengue
P1        No                  No                            Normal        No
P2        No                  No                            High          No
P8        No                  No                            High          No
P10       Yes                 No                            High          No
P11       Yes                 No                            Very High     No
P12       No                  Yes                           Normal        No
P14       No                  Yes                           Normal        No
P15       Yes                 Yes                           Normal        No
P16       Yes                 No                            Normal        No
P17       Yes                 No                            High          No
P19       Yes                 No                            Normal        No
P20       No                  Yes                           Normal        No
P3        No                  No                            Very High     Yes
P4        No                  Yes                           High          Yes
P5        No                  Yes                           Very High     Yes
P6        Yes                 Yes                           High          Yes
P7        Yes                 Yes                           Very High     Yes
P9        Yes                 No                            Very High     Yes
P13       No                  Yes                           High          Yes
P18       Yes                 Yes                           Very High     Yes

Table 8. Table 3 organized in relation to the decision attribute (Dengue, last column); the
first three columns are the condition attributes
a. Lower Approximation set B”
- The lower approximation set (B”) of the patients that definitely have dengue is
identified as B” = {P3, P4, P5, P6, P7, P13, P18}.
- The lower approximation set (B”) of the patients that certainly do not have dengue is
identified as B” = {P1, P2, P8, P10, P12, P14, P15, P16, P17, P19, P20}.
b. Upper Approximation set B*
- The upper approximation set (B*) of the patients that possibly have dengue is identified
as B* = {P3, P4, P5, P6, P7, P9, P13, P18}.
- The upper approximation set (B*) of the patients that possibly do not have dengue is
identified as B* = {P1, P2, P8, P10, P11, P12, P14, P15, P16, P17, P19, P20}.
c. Boundary Region (BR)
- The boundary region (BR) of the patients that do not have dengue is identified as:
BR = {P1, P2, P8, P10, P11, P12, P14, P15, P16, P17, P19, P20} −
{P1, P2, P8, P10, P12, P14, P15, P16, P17, P19, P20} = {P11};
- The boundary region (BR) of the patients that have dengue is identified as:
BR = {P3, P4, P5, P6, P7, P9, P13, P18} − {P3, P4, P5, P6, P7, P13, P18} = {P9}.
Observation: the boundary region (BR) is thus constituted by the elements P9 and P11,
which cannot be classified, since they possess the same values for all condition attributes
but differ in the decision attribute.
- Imprecision coefficient, using Eq. (1):
• αB(X) = 7/8 for the patients that possibly have dengue;
• αB(X) = 11/12 for the patients that possibly do not have dengue.
- Quality coefficients of the upper and lower approximation, using Eqs. (2) and (3):
• αB(B*(X)) = 8/20 for the patients that possibly have dengue;
• αB(B*(X)) = 12/20 for the patients that possibly do not have dengue;
• αB(B”(X)) = 7/20 for the patients that certainly have dengue;
• αB(B”(X)) = 11/20 for the patients that certainly do not have dengue.
Observations:
1. Patients with dengue: αB(B”(X)) = 7/20, that is, 35% of the patients certainly have
dengue.
2. Patients that do not have dengue: αB(B”(X)) = 11/20, that is, 55% of the patients
certainly do not have dengue.
3. 10% of the patients (P9 and P11) can be classified neither with dengue nor without
dengue, since the values of all their condition attributes are the same while the decision
attribute (dengue) differs, which generates an inconclusive diagnosis for dengue.
Patients P9 and P11 possess the same values for all condition attributes, with only the
decision attribute different. Therefore, the data of patients P9 and P11 will be excluded from
Table 3.
b. Verification of equivalent information
Step 2 – Analysis of the data contained in Table 3 shows that it possesses equivalent
information; for example, patients P2 and P8 carry identical rows:

P2   No   No   High   No
P8   No   No   High   No
Patient   blotched_red_skin   muscular_pain_articulations   Temperature   Dengue
P1        No                  No                            Normal        No
P2        No                  No                            High          No
P3        No                  No                            Very High     Yes
P4        No                  Yes                           High          Yes
P5        No                  Yes                           Very High     Yes
P6        Yes                 Yes                           High          Yes
P7        Yes                 Yes                           Very High     Yes
P8        No                  No                            High          No
P10       Yes                 No                            High          No
P12       No                  Yes                           Normal        No
P15       Yes                 Yes                           Normal        No
P16       Yes                 No                            Normal        No
P19       Yes                 No                            Normal        No

Table 9. Reduct of the information of Table 3
Step 3 – Analysis of each condition attribute against the set of attributes.
6. Decision rules
From the information reduct shown above, the decision rules necessary to aid dengue
diagnosis can be generated. The rules are presented below:
Rule-1
R1: If patient
blotched_red_skin = No and
muscular_pain_articulations = No and
temperature = Normal
Then dengue = No.
Rule-2
R2: If patient
blotched_red_skin = No and
muscular_pain_articulations = No and
temperature = Very High
Then dengue = Yes.
Rule-3
R3: If patient
blotched_red_skin = No and
muscular_pain_articulations = Yes and
temperature = High
Then dengue = Yes.
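The three rules above can be encoded as a simple lookup table; the function name and the None fallback for condition combinations not covered by any rule are illustrative choices, not from the chapter.

```python
# Rules R1-R3, encoded as
# (blotched_red_skin, muscular_pain_articulations, temperature) -> dengue.
RULES = {
    ("no", "no", "normal"): "no",        # R1
    ("no", "no", "very high"): "yes",    # R2
    ("no", "yes", "high"): "yes",        # R3
}

def diagnose(blotched, pain, temperature):
    """Return the matching rule's decision, or None when no rule fires."""
    return RULES.get((blotched, pain, temperature))

print(diagnose("no", "yes", "high"))       # yes (rule R3)
print(diagnose("yes", "no", "very high"))  # None: the inconsistent P9/P11 case
```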
7. Conclusion
This study has discussed Rough Set Theory, proposed in 1982 by Z. Pawlak as an approach
to knowledge discovery from incomplete, vague and uncertain data. The rough set
approach to processing incomplete data is based on the lower and the upper
approximation, and a rough set is defined as a pair of two crisp sets corresponding to these
approximations.
The main advantage of rough set theory in data analysis is that it does not need any
preliminary or additional information concerning the data, such as the basic probability
assignment in Dempster-Shafer theory or the grade of membership or value of possibility in
fuzzy set theory. The rough set approach to analysis has many important advantages
(Pawlak, 1997): finding hidden patterns in data; finding minimal sets of data (data
reduction); evaluating the significance of data; generating sets of decision rules from data;
and facilitating the interpretation of obtained results.
Different problems can be addressed through Rough Set Theory; during the last few years,
however, this formalism has mainly been approached as a tool used in different areas of
research. There has been research concerning the relationship between Rough Set Theory
and Dempster-Shafer Theory, and between rough sets and fuzzy sets. Rough set theory has
also provided the necessary formalism and ideas for the development of some propositional
machine learning systems.
Rough set has also been used for knowledge representation; data mining; dealing with
imperfect data; reducing knowledge representation and for analyzing attribute
dependencies.
Rough Set Theory has found many applications, such as power system security analysis,
medical data analysis, finance, voice recognition and image processing; one of the research
areas that has successfully used rough sets is knowledge discovery, or data mining, in
databases.
8. References
Cerchiari, S.C.; Teurya, A.; Pinto, J.O.P.; Lambert-Torres, G.; Sauer, L. & Zorzate, E.H. (2006).
Data Mining in Distribution Consumer Database using Rough Sets and Self-
Organizing Maps, Proceedings of the 2006 IEEE Power Systems Conference and
Exposition, pp. 38-43, ISBN 1-4244-0177-1, Atlanta–USA, Oct. 29–Nov. 1, 2006, IEEE
Press, New Jersey-USA.
Fayyad, U.; Piatetsky-Shapiro, G. & Smyth, P. (1996a). From Data Mining to Knowledge
Discovery: An Overview, In : Advances in Knowledge Discovery & Data Mining,
Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P. & Uthurusamy, R. (Ed.), pp. 1–34.
AAAI Press, ISBN 978-0-262-56097-9, Menlo Park-USA.
Fayyad, U., G. Piatetsky-Shapiro, & P. Smyth. (1996b). Knowledge Discovery and Data
Mining: Towards an Unifying Framework, The Proceedings of the Second International
Conference on Knowledge Discovery and Data Mining, pp. 82–88, ISBN 978-1-57735-
004-0, Portland–USA, Aug. 2–4, 1996, AAAI Press, Menlo Park-USA.
Geng, Z. & Qunxiong, Z. (2006). A New Rough Set-Based Heuristic Algorithm for Attribute
Reduct, Proceedings of the 6th World Congress on Intelligent Control and Automation,
pp. 3085-3089, ISBN 1-4244-0332-4, Dalian-China, Jun. 21-23, 2006, Wuhan
University of Technology Press, Wuhan-China.
He, Y.; Chen, D.; Zhao, W. (2007). Integration Method of Ant Colony Algorithm and Rough
Set Theory for Simultaneous Real Value Attribute Discretization and Attribute
Reduction, In: Swarm Intelligence: Focus on Ant and Particle Swarm Optimization,
Chan, F.T. S. & Tiwari, M.K. (Ed.), pp. 15–36, I-TECH Education and Publishing.
ISBN 978-3-902613-09-7, Budapest-Hungary.
Komorowski, J.; Pawlak, Z.; Polkowski, L. & Skowron, A. (1999). Rough Sets Perspective on
Data and Knowledge, In: The Handbook of Data Mining and Knowledge Discovery,
Klosgrn, W. & Zylkon, J. (Ed.), pp. 134–149, Oxford University Press, ISBN 0-19-
511831-6, New York-USA.
Kostek, B. (1999). Assessment of Concert Hall Acoustics using Rough Set and Fuzzy Set
Approach, In: Rough Fuzzy Hybridization: A New Trend in Decision-Making, Pal, S. &
Skowron, A. (Ed.), pp. 381-396, Springer-Verlag Co., ISBN 981-4021-00-8, Secaucus-
USA.
Lambert-Torres, G.; Rossi, R.; Jardini, J.A.; Alves da Silva, A.P. & Quintana, V.H. (1999).
Power System Security Analysis based on Rough Classification, In: Rough Fuzzy
Hybridization: A New Trend in Decision-Making, Pal, S. & Skowron, A. (Ed.), pp. 263-
300, Springer-Verlag Co., ISBN 981-4021-00-8, Secaucus-USA.
Lin, T. Y. (1997). An Overview of Rough Set Theory from the Point of View of Relational
Databases, Bulletin of International Rough Set Society, Vol. 1, No. 1, Mar. 1997, pp. 30-
34, ISSN 1346-0013.
Mitchell, T. M. (1999). Machine learning and data mining, Communications of the ACM, Vol.
42, No. 11, Nov. 1999, pp. 30-36, ISSN 0001-0782.
Mrozek, A. & Cyran, K. (2001). Rough Set in Hybrid Methods for Pattern Recognition,
International Journal of Intelligence Systems, Vol. 16. No. 2, Feb. 2001, pp.149-168,
ISSN 0884-8173.
Nguyen, S.H.; Nguyen, T.T. & Nguyen, H.S. (2005). Rough Set Approach to Sunspot
Classification Problem, Proceedings of the 2005 International Conference on Rough Sets,
Fuzzy Sets, Data Mining and Granular Computing - Lecture Notes in Artificial
Intelligence 3642, pp. 263–272, ISBN 978-3-540-28653-0, Regina-Canada, Aug. 31-
Sept. 3, 2005, Springer, Secaucus-USA.
Piatetsky-Shapiro, G. & Matheus, C. J. (1995). The Interestingness of Deviations, Proceedings
of the International Conference on Knowledge Discovery and Data Mining, pp. 23–36.
How to reference
Silvia Rissino and Germano Lambert-Torres (2009). Rough Set Theory — Fundamental
Concepts, Principals, Data Extraction, and Applications, In: Data Mining and Knowledge
Discovery in Real Life Applications, Julio Ponce and Adem Karahoca (Ed.), ISBN
978-3-902613-53-0, InTech. Available from:
http://www.intechopen.com/books/data_mining_and_knowledge_discovery_in_real_life_applications/rough_set_theory_-_fundamental_concepts__principals__data_extraction__and_applications