BDA-Lec3
● The CEO and the directors are eager to see Big Data in action.
○ In response, the IT team, in partnership with the business
personnel, takes on ETI’s first Big Data project.
● The team then follows a step-by-step approach, as set forth by the Big
Data Analytics Lifecycle, in pursuit of this objective.
Stage 1: Business Case Evaluation
● Also called “define the problem”
● To measure the success of the Big Data solution for fraud detection, one
of the KPIs set is a 15% reduction in fraudulent claims.
Stage 1: Business Case Evaluation (Case Study)
● Taking their budget into account,
○ the team decides that their largest expense will be the procurement of
new infrastructure appropriate for building a Big Data solution environment.
○ They realize that they will be leveraging open source technologies to support batch
processing and therefore do not believe that a large, initial up-front investment is
required for tooling.
○ However, when they consider the broader Big Data analytics lifecycle, the team
members realize that they should budget for the acquisition of additional data
quality and cleansing tools and newer data visualization technologies.
○ After accounting for these expenses, a cost-benefit analysis reveals that the
investment in the Big Data solution can pay for itself several times over if the
targeted fraud-detection KPIs are attained.
○ As a result of this analysis, the team believes that a strong business case exists for
using Big Data for enhanced data analysis.
Stage 2: Data Identification
● This stage is dedicated to identifying the
datasets required for the analysis project
and their sources.
Stage 3: Data Acquisition and Filtering (Case Study)
● A quick check of the data quality of the Twitter feeds and weather reports
suggests that around four to five percent of their records are corrupt.
○ Consequently, two batch data filtering jobs are established to remove
the corrupt records, as sketched below.
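The snippet below is a minimal sketch of such a batch filtering job in Python. The notion of a "corrupt" record (unparsable JSON or missing required fields) and the field names are assumptions for illustration; the real jobs would encode ETI's own corruption checks.

import json

REQUIRED_FIELDS = ("user_id", "timestamp", "text")  # assumed required fields

def filter_corrupt(lines):
    """Yield only records that parse as JSON and contain all required fields."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # corrupt: not valid JSON
        if all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS):
            yield record  # keep only complete records

# Example usage: roughly four to five percent of real records would be dropped.
raw = [
    '{"user_id": 1, "timestamp": "2024-01-01T10:00", "text": "storm ahead"}',
    '{"user_id": 2, "timestamp": ""}',  # corrupt: missing/empty fields
    'not json at all',                  # corrupt: unparsable
]
print(list(filter_corrupt(raw)))        # only the first record survives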
Stage 4: Data Extraction
● Some of the data identified as input for the
analysis may arrive in a format incompatible
with the Big Data solution.
Further transformation is needed to split the data into two separate fields, as required
by the Big Data solution.
Stage 4: Data Extraction (Case Study)
● The IT team observes that some of the datasets will need to be pre-processed
in order to extract the required fields.
● For example,
○ the tweets dataset is in JSON format.
■ In order to be able to analyze the tweets, the user id,
timestamp and the tweet text need to be extracted and
converted to tabular form.
○ the weather dataset arrives in a hierarchical (XML) format,
■ fields such as timestamp, temperature forecast, wind speed
forecast, wind direction forecast, snow forecast and flood
forecast are likewise extracted and saved in tabular form (see the
extraction sketch below).
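Below is a minimal extraction sketch in Python covering both cases. The JSON field names and XML tag names are assumptions for illustration; the real tweet and weather feeds would dictate the exact schema.

import csv
import json
import xml.etree.ElementTree as ET

def tweets_to_rows(json_text):
    """Extract (user id, timestamp, tweet text) from a JSON array of tweets."""
    return [(t["user_id"], t["timestamp"], t["text"]) for t in json.loads(json_text)]

def weather_to_rows(xml_text):
    """Extract forecast fields from a hierarchical XML weather report."""
    root = ET.fromstring(xml_text)
    return [(f.findtext("timestamp"), f.findtext("temperature"),
             f.findtext("wind_speed"), f.findtext("wind_direction"),
             f.findtext("snow"), f.findtext("flood"))
            for f in root.iter("forecast")]

def write_csv(path, header, rows):
    """Save the extracted fields in tabular (CSV) form."""
    with open(path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows)

# Example usage with tiny in-memory samples in place of the real datasets:
tweets_json = '[{"user_id": 7, "timestamp": "2024-01-01T10:00", "text": "flooded street"}]'
weather_xml = ("<report><forecast><timestamp>2024-01-01</timestamp>"
               "<temperature>4</temperature><wind_speed>30</wind_speed>"
               "<wind_direction>NW</wind_direction><snow>no</snow>"
               "<flood>yes</flood></forecast></report>")
write_csv("tweets.csv", ["user_id", "timestamp", "text"], tweets_to_rows(tweets_json))
write_csv("weather.csv", ["timestamp", "temperature", "wind_speed",
                          "wind_direction", "snow", "flood"], weather_to_rows(weather_xml))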
Stage 5: Data Validation and Cleansing
● It is dedicated to establishing often complex
validation rules and removing any known
invalid data.
● Invalid data can skew and falsify analysis
results.
● Unlike traditional enterprise data, where the
data structure is pre-defined and data is
pre-validated, data input into Big Data
analyses can be unstructured without any
indication of validity.
○ Its complexity can further make it
difficult to arrive at a set of suitable
validation constraints.
Stage 5: Data Validation and Cleansing
● Big Data solutions often receive redundant data
across different datasets.
○ This redundancy can be exploited to explore
interconnected datasets in order to
assemble validation parameters and fill in
missing valid data.
○ For example, as illustrated in the Figure:
■ The first value in Dataset B is validated against its
corresponding value in Dataset A.
■ The second value in Dataset B is not validated against its
corresponding value in Dataset A.
■ If a value is missing, it is inserted from Dataset A (a minimal
cross-dataset sketch follows this list).
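A minimal sketch of this cross-dataset check, assuming for illustration that both datasets can be modeled as Python dictionaries keyed by a shared record id:

def validate_and_fill(dataset_a, dataset_b):
    """Validate Dataset B against Dataset A and fill in missing values from A."""
    cleansed = {}
    for record_id, value_a in dataset_a.items():
        value_b = dataset_b.get(record_id)
        if value_b is None:
            cleansed[record_id] = value_a  # missing: insert from Dataset A
        elif value_b == value_a:
            cleansed[record_id] = value_b  # validated against Dataset A
        else:
            cleansed[record_id] = None     # mismatch: flag for review
    return cleansed

# Example usage
a = {1: "ACME Corp", 2: "Globex", 3: "Initech"}
b = {1: "ACME Corp", 2: "Globexx"}             # record 2 mismatches; record 3 is missing
print(validate_and_fill(a, b))                 # {1: 'ACME Corp', 2: None, 3: 'Initech'}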
Stage 5: Data Validation and Cleansing
● For batch analytics,
○ data validation and cleansing can be
achieved via an offline ETL operation.
● For real-time analytics,
○ a more complex in-memory system is
required to validate and cleanse the
data as it arrives from the source.
Stage 6: Data Aggregation and Representation
● The large volumes processed by Big Data solutions can make data
aggregation a time- and effort-intensive operation.
A simple example of data aggregation: two datasets are joined together using
the Id field, as sketched below.
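A minimal sketch of this kind of Id-based aggregation, using pandas purely as an illustrative tool; the dataset contents are made up.

import pandas as pd

policies = pd.DataFrame({"Id": [101, 102, 103], "policy_type": ["auto", "home", "auto"]})
claims   = pd.DataFrame({"Id": [101, 103], "claim_value": [2500.0, 900.0]})

# Join the two datasets on the Id field; a left join keeps policies without claims.
combined = policies.merge(claims, on="Id", how="left")
print(combined)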
Stage 6: Data Aggregation and Representation
● This Figure shows the same piece of data stored in two different formats.
○ Dataset A contains the desired piece of data, but it is part of a BLOB that is
not readily accessible for querying.
○ Dataset B contains the same piece of data organized in column-based
storage, enabling each field to be queried individually.
Datasets A and B can be combined to create a standardized data structure within a Big
Data solution, as sketched below.
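The sketch below illustrates this difference, modeling the BLOB in Dataset A as a JSON string and flattening it into queryable columns (Dataset B style). The column names are assumptions for illustration.

import json
import pandas as pd

# Dataset A: the data of interest is buried inside a BLOB, so its fields
# cannot be queried directly.
dataset_a = pd.DataFrame({
    "Id": [101, 102],
    "blob": ['{"customer": "C-17", "claim_value": 2500.0}',
             '{"customer": "C-42", "claim_value": 900.0}'],
})

# Flatten each BLOB into separate columns so every field becomes queryable.
fields = pd.json_normalize(dataset_a["blob"].apply(json.loads).tolist())
dataset_b = pd.concat([dataset_a[["Id"]], fields], axis=1)

print(dataset_b)                                   # columns: Id, customer, claim_value
print(dataset_b[dataset_b["claim_value"] > 1000])  # each field is now directly queryable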
Stage 6: Data Aggregation and Representation (Case Study)
● For meaningful analysis of data,
○ it is decided to join together policy data, claim data and call center
agent notes in a single dataset that is tabular in nature where each
field can be referenced via a data query.
○ It is thought that this will not only help with the current data analysis
task of detecting fraudulent claims but will also help with other data
analysis tasks, such as risk evaluation and speedy settlement of
claims.
○ The resulting dataset is stored in a NoSQL database (a minimal sketch of
this join-and-store step follows).
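A minimal sketch of this join-and-store step, using pandas for the join and MongoDB (via pymongo) purely as an example of a NoSQL database; the dataset and field names are assumptions.

import pandas as pd
from pymongo import MongoClient

policies = pd.DataFrame({"policy_id": [1, 2], "policy_type": ["auto", "home"]})
claims = pd.DataFrame({"claim_id": [10, 11], "policy_id": [1, 2],
                       "claim_value": [2500.0, 900.0]})
agent_notes = pd.DataFrame({"claim_id": [10], "note": ["caller unsure of incident date"]})

# Build a single tabular dataset where each field can be referenced in a query.
combined = (claims.merge(policies, on="policy_id", how="left")
                  .merge(agent_notes, on="claim_id", how="left"))

# Persist the combined records in a NoSQL (document) database.
client = MongoClient("mongodb://localhost:27017")  # assumed local instance
client["eti"]["combined_claims"].insert_many(combined.to_dict("records"))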
Stage 7: Data Analysis
● The Data Analysis stage is dedicated to carrying out the
actual analysis task, which typically involves one or more
types of analytics.
● This stage can be iterative in nature,
○ especially if the data analysis is predictive analytics,
in which case analysis is repeated until the
appropriate pattern or correlation is uncovered.
● Depending on the type of analytic result required,
○ this stage can be as simple as querying a dataset to
compute an aggregation for comparison (see the sketch after this list).
○ On the other hand, it can be as challenging as
combining data mining and complex statistical
analysis techniques to discover patterns and
anomalies or to generate a statistical or
mathematical model to depict relationships between
variables.
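A minimal sketch of the simple end of the spectrum, querying a small claims table to compute an aggregation for comparison; the table and its columns are illustrative assumptions.

import pandas as pd

claims = pd.DataFrame({
    "claim_value": [2500.0, 900.0, 12000.0, 300.0],
    "fraudulent":  [False, False, True, False],
})

# Compare the number and average value of fraudulent vs. legitimate claims.
comparison = claims.groupby("fraudulent")["claim_value"].agg(["count", "mean"])
print(comparison)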
Stage 7: Data Analysis (Case Study)
● The IT team involves the data analysts at this stage as it does not have the right
skillset for analyzing data in support of detecting fraudulent claims.
● In order to be able to detect fraudulent transactions,
○ first the nature of fraudulent claims needs to be analyzed in order to find
which characteristics differentiate a fraudulent claim from a legitimate claim.
○ For this, the predictive data analysis approach is taken. As part of this
analysis, a range of analysis techniques are applied.
● This stage is repeated a number of times as the results generated after the first
pass are not conclusive enough to comprehend what makes a fraudulent claim
different from a legitimate claim.
● As part of this exercise, attributes that are less indicative of a fraudulent claim are
dropped, while attributes that carry a direct relationship are kept or added (a minimal
modeling sketch follows).
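The sketch below illustrates one way such an iterative pass could look: fit a simple classifier, inspect which attributes are indicative of fraud, then drop the weak ones before the next pass. scikit-learn is an illustrative choice, and the attribute names and values are made up rather than ETI's data.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

claims = pd.DataFrame({
    "customer_age": [25, 47, 33, 61, 29, 52, 38, 44],
    "policy_age":   [1, 10, 2, 15, 1, 8, 3, 12],
    "num_claims":   [4, 1, 3, 0, 5, 1, 2, 0],
    "claim_value":  [9000, 1200, 7000, 800, 11000, 1500, 6500, 900],
    "fraudulent":   [1, 0, 1, 0, 1, 0, 1, 0],
})

features = ["customer_age", "policy_age", "num_claims", "claim_value"]
model = DecisionTreeClassifier(random_state=0).fit(claims[features], claims["fraudulent"])

# Inspect attribute importance; attributes with little predictive value would be
# dropped before repeating the analysis.
for name, importance in zip(features, model.feature_importances_):
    print(f"{name}: {importance:.2f}")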
Stage 8: Data Visualization
● The ability to analyze massive amounts of data and
find useful insights carries little value if the only
ones who can interpret the results are the analysts.
● The Data Visualization stage is dedicated to using
data visualization techniques and tools to
graphically communicate the analysis results for
effective interpretation by business users.
● The results of completing the Data Visualization
stage provide users with the ability to perform
visual analysis, allowing for the discovery of
answers to questions that users have not yet even
formulated.
Stage 8: Data Visualization (Case Study)
● The team has discovered some interesting findings and now needs to
convey the results to the actuaries, underwriters and claim adjusters.
● Different visualization methods are used including bar and line graphs
and scatter plots.
○ Scatter plots are used to analyze groups of fraudulent and
legitimate claims in light of different factors, such as customer
age, age of policy, number of claims made and value of claim (see
the plotting sketch below).
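A minimal sketch of one such scatter plot (claim value against customer age, colored by fraud label), using matplotlib as an illustrative tool with made-up values.

import matplotlib.pyplot as plt

customer_age = [25, 47, 33, 61, 29, 52, 38, 44]
claim_value  = [9000, 1200, 7000, 800, 11000, 1500, 6500, 900]
fraudulent   = [1, 0, 1, 0, 1, 0, 1, 0]

# Plot fraudulent and legitimate claims as separate groups.
for label, color in ((1, "red"), (0, "green")):
    xs = [a for a, f in zip(customer_age, fraudulent) if f == label]
    ys = [v for v, f in zip(claim_value, fraudulent) if f == label]
    plt.scatter(xs, ys, c=color, label="fraudulent" if label else "legitimate")

plt.xlabel("Customer age")
plt.ylabel("Claim value")
plt.legend()
plt.show()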
Stage 9: Utilization of Analysis Results
● Subsequent to analysis results being made available to
business users to support business decision-making, such
as via dashboards, there may be further opportunities to
utilize the analysis results.
● This stage is dedicated to determining how and where
processed analysis data can be further leveraged.
● Depending on the nature of the analysis problems being
addressed, it is possible for the analysis results to produce
“models” that encapsulate new insights and understandings
about the nature of the patterns and relationships that exist
within the data that was analyzed.
○ A model may look like a mathematical equation or a set of rules.
○ Models can be used to improve business process logic and
application system logic, and they can form the basis of a new
system or software program (a minimal rule-based sketch follows).
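A minimal sketch of how a rule-set model could be folded into claims-processing logic, for example to route suspicious claims to manual review; the thresholds and rules are illustrative assumptions, not rules actually derived from ETI's analysis.

def flag_for_review(claim):
    """Return True if the claim matches any rule produced by the analysis."""
    rules = (
        claim["claim_value"] > 10_000 and claim["policy_age_years"] < 1,  # assumed rule
        claim["claims_in_last_year"] >= 3,                                # assumed rule
    )
    return any(rules)

# Example usage inside an existing claims-settlement workflow:
claim = {"claim_value": 12_500, "policy_age_years": 0.5, "claims_in_last_year": 1}
print(flag_for_review(claim))  # True: routed to a claim adjuster for manual review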
Stage 9: Utilization of Analysis Results (Case Study)
● Based on the data analysis results, the underwriting and the claims
settlement users have now developed an understanding of the nature of
fraudulent claims.
○ Steps: https://medium.com/@sauravkarki10.12/insurance-claim-fraud-detection-project-c700e31c7602
○ https://www.kaggle.com/datasets/mykeysid10/insurance-claims-fraud-detection/data
Thanks!
Do you have any questions?