Chapter 1 - Introduction: Agile Big Data Analytics
Chapter 1 – Introduction
A general model for data science is a lifecycle of data collection, curation,
analysis, and action to achieve mission goals. Depending on the data under
observation, the desired goal, and metrics for success, the amount of time
allocated across these steps will vary. Advanced analytics and machine learning
are fast-growing areas of practice and research, so there is a need to develop a
modernized, standard process that addresses current challenges in machine
learning application development and deployment cycles – and encompasses
changing techniques and technologies.
Much of the thinking about data science has been driven by large internet
companies, which are fundamentally data driven. The focus has been to “fail
fast”, that is, to learn quickly if an analytics direction has promise. The goal is a
business outcome rather than improving a measure of accuracy on a specific
analytic model. In some cases, a surrogate hypothesis is developed that can lead
to the same business outcome by inference from an entirely different
hypothesis. This process is typically ad hoc for a data science team and strongly
depends on the skills of the leading data scientist. While the specifics of the
analytics are discussed, the process is not standardized to allow repeatability
across the broader community.
Typically, data science is discussed only in the context of the pre-analytics
cleansing and transformations and the analytics step, relying on a traditional
data analytics lifecycle. Operational Big Data Analytics (BDA) systems
introduce a number of complexities due to the distribution of the data, the novel
data storage techniques, and the involvement of multiple organizations when,
for example, using cloud services.
Software development lifecycles (SDLC) have changed dramatically with
the introduction of agile methodologies, to reduce development times, to
provide earlier value, and to lower the risk of failure. The data science lifecycle
for BDA systems is currently in the same situation as SDLC-driven development
prior to the introduction of agile methods.
We present in this paper our approach to a modern process model for
BDA that aligns with new technological changes and implements agility in the
B. Outcomes-Focused Requirements
Despite these issues (and the fact that alternative, iterative processes
were in use throughout this period), the waterfall method persisted as the
dominant methodology in software development through at least the early
2000s. Factors that contributed to its longevity include its ease of
understanding, implied promise of predictability, perception of management
control from stage gate approvals and extensive documentation, and well-
defined roles for various specialists.
B. The Emergence of Agile
C. Drivers of Agile
There are three main drivers for the move to agile software development.
1) Human factors in software development
While this may appear obvious, it is not something on which the waterfall
method placed much importance. The waterfall process evolved from the
practices of mass manufacturing physical products and valued standardization,
automation, and repetition. Individuals performed limited functions, and
provided they produced a result within certain tolerances, their output could be
reliably consumed by the next stage of the process. The end result is a single
big-bang delivery, with little growth in knowledge until the very end, as
illustrated by Cockburn in Fig. 3.
Individuals could (with minimal training) be substituted at any particular
stage of the overall process with no change to the final product or initial design.
This model seemed to apply well enough when computing resources were
scarce and software development involved producing large quantities of
physical cards to input data for calculation. As computing power increased and
developers gained access to their own personal computing resources, the
human factors began to have significant impact.
Agile strives to address these issues relating to human factors by placing
the focus on individuals and interactions over processes and tools. Small, cross-
functional teams of five to nine members allow for a mix of complementary skills
and efficient team communication. Teams are empowered to make decisions on
the software design and to organize their own work. Product owners and
stakeholders (representing the interests of the consumers of the end value) and
developers collaborate frequently.
Teams are kept stable so that group dynamics are optimized and
performance increases over time. Teams deliver work incrementally in time-
boxed iterations of 2 to 4 weeks. Each iteration contains elements of analysis,
design, coding, testing, and deployment to create an increment of working
software. Iterative delivery facilitates frequent feedback from users and the
ability to respond to changing requirements, and it mitigates the risk of
mistakes through limited work-in-process and continuous integration and testing.
In modern software development, the primary constraint is the rate of
learning of the development team. Agile accelerates learning through its use of
frequent integration, iterative releases, and user feedback. Prototyping is
relatively fast and inexpensive in software, and early feedback can replace big
upfront design with better end results. Sometimes multiple prototypes are
employed, using A/B testing to rapidly converge on the best solution. The
learning curve from Cockburn in Fig. 4 illustrates the rapid learning by the team
early on in the project.
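The A/B comparison of competing prototypes can be sketched as a simple two-proportion z-test; the conversion counts and the 1.96 threshold below are illustrative assumptions, not figures from the text:

```python
import math

def ab_z_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: is variant B's conversion rate
    significantly different from variant A's?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    p = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical prototype results: A converts 120/1000, B converts 165/1000.
z = ab_z_test(120, 1000, 165, 1000)
# |z| > 1.96 corresponds to p < 0.05 (two-sided), favouring variant B.
print(f"z = {z:.2f}")
```

In practice such a test lets the team discard a weaker prototype after one iteration rather than committing to a big upfront design.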
Or, if the business innovation cycle is sufficiently long, perhaps the organization
can tolerate the cost of delay when the development project delivers no working
software for several years. In either scenario, there may not be any motivation
to look for an alternative to the waterfall method. Currently, it is unlikely that
either of those scenarios applies in any industry or organization.
and then reinforcing the experiments that work and discontinuing those that do
not. Here, practices emerge. Examples of systems from the Complex domain are
an ant colony and a stock market.
SAIC has developed a BDA process model extending the earlier CRISP-DM
data mining model to incorporate the new technologies of big data and cloud.
This BDA process model is called Data Science Edge™ (DSE), shown in Fig. 6,
which serves as our process model for Knowledge Discovery in Data Science
(KDDS) [13].
It is formed around the National Institute of Standards and Technology
(NIST) big data reference architecture (RA), explicitly showing the frameworks
that the application draws from. DSE provides a number of enhancements over
CRISP-DM.
Fig 4.1. Data Science Edge™—a big data analytics process model
DSE guides the data processing and analytics experiments that go beyond a
software development lifecycle for BDA systems, providing step-by-step
guidance for the additional needs of computational experiments in the
development of machine learning analytics.
DSE is a five-step model—plan, collect, curate, analyse, act. It is
organized around the maturity of the data to address the mission needs – going
from the original raw data to contextualized information, to synthesized
knowledge.
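As a rough illustration, not SAIC's implementation, the five DSE steps can be pictured as a chain of functions that progressively mature the data; every name and value below is hypothetical:

```python
# Hypothetical sketch of the five DSE steps as a linear chain; the real
# model iterates and evaluates within every step.
def plan(mission):
    return {"mission": mission, "sources": ["sensor_feed"]}  # made-up source

def collect(ctx):
    ctx["raw"] = [3.1, 2.9, 45.0, 3.0]  # stand-in raw readings
    return ctx

def curate(ctx):
    ctx["clean"] = [x for x in ctx["raw"] if x < 10]  # drop an implausible value
    return ctx

def analyse(ctx):
    ctx["model"] = sum(ctx["clean"]) / len(ctx["clean"])  # trivial "analytic"
    return ctx

def act(ctx):
    return f"report mean={ctx['model']:.2f} for {ctx['mission']}"

result = act(analyse(curate(collect(plan("demo mission")))))
print(result)
```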
The steps roughly align with the major steps from CRISP-DM but
explicitly consider the collection from external sources and the data storage.
CRISP-DM subtasks are re-grouped in different ways to accommodate big data
techniques which change the ways data is managed. The sixth step in CRISP-DM
is evaluation, which essentially was applied only to the accuracy of the modelling
step.
Evaluation has been pulled explicitly into each step rather than solely
referring to the accuracy of models. This process model contains all the lower
level activities of CRISP-DM—albeit with a different grouping—while adding in
BDA, automation, and security and privacy activities.
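One way to picture evaluation embedded in every step, offered purely as a sketch rather than DSE's actual mechanism, is a wrapper that records a quality metric each time a step runs:

```python
# Hypothetical sketch: evaluation attached to every step rather than only
# after modelling, via a decorator that records a per-step quality metric.
evaluations = {}

def evaluated(metric):
    def wrap(step):
        def run(data):
            out = step(data)
            evaluations[step.__name__] = metric(out)  # score this step's output
            return out
        return run
    return wrap

# Example metric: fraction of non-missing rows.
completeness = lambda rows: sum(r is not None for r in rows) / len(rows)

@evaluated(completeness)
def curate(rows):
    # Pass data through unchanged; real curation would cleanse it.
    return rows

curate([1.0, None, 3.0, 4.0])
print(evaluations)
```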
each of the steps could be performed in a given sprint. The second issue was the
difficulty in determining how long specific steps will take.
Advanced analytic systems have always operated in the Complex domain
of Fig. 5. Datasets are typically not as described in the documentation; over time
changes have been made and not documented. Datasets are typically not as
clean as expected. Data cleansing is typically an open-ended task unless the data
are well-governed and managed in the organization.
This led to the introduction of an initial assessment phase, described in
agile parlance as a spike. This initial effort clarified the issues in the data sources
to guide the initial sprint planning.
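A minimal sketch of such an assessment spike might profile each column for missing values and gross outliers before sprint planning; the data and the robust median/MAD rule here are illustrative assumptions:

```python
import statistics

def spike_profile(column):
    """Lightweight data-quality check of the kind an initial assessment
    spike might run: count missing values and flag gross outliers using
    a robust median/MAD rule rather than a full-scan distribution fit."""
    values = [v for v in column if v is not None]
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    outliers = [v for v in values if abs(v - med) > 5 * mad]
    return {"rows": len(column),
            "missing": len(column) - len(values),
            "median": med,
            "outliers": outliers}

# Hypothetical sensor column with gaps and one suspicious reading.
profile = spike_profile([10.2, 9.8, None, 10.1, 250.0, None, 9.9])
print(profile)
```

A few such profiles per data source are usually enough to size the cleansing work for the first sprints.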
E. Analytics Improvement
As in many data science process models, the author emphasizes the
iterative nature of the task of creating, testing, and training learning
algorithms, including any manual learning done during the stages of data
preparation and exploration.
accommodation for the differing storage paradigms; the discussion assumes the
use of traditional relational databases.
BDA methods are no longer separable from the choices for
distributing the data across nodes and running parallel analytics. Their data
science process is in good alignment with DSE except that they separate out
individual steps within the analyse process.
Table 5.1. A comparison of the two models presented in [16] with the
related SAIC DSE steps
In an interview this year [17], Jennifer Prendki, former senior data science
manager and principal data scientist at Walmart Labs, discusses how she, as
the interviewer described it, “built agile processes and teams for machine
Through a series of three research cycles spanning ten case studies, the
authors explored various architectural design methods in which changes were
adopted after identifying the strengths and limitations of each. Some key
lessons learned include the following:
• “Architecture-supported agile spikes are necessary to address rapid
technology changes and emerging requirements.”
• “The use of reference architectures (RA) increases agility.”
• “A key part of architecture agility is the ability to change
technologies within [an RA’s] block.”
• “An architectural approach to DevOps was critical to achieving
strategic control over continuous delivery goals.”
• Technical and business “feedback loops need to be open.”
The lessons learned from the authors’ work show that, in order to support
agile advanced analytics, the architecture also needs to embrace agility.
Specifically, the RA must consist of flexible technology families so new
technologies can be adopted easily and accommodate changing requirements
during the analytics development lifecycle.
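The idea of changing technologies within an RA block can be illustrated, purely as a sketch with hypothetical class and method names, by coding the analytics against an abstract storage interface so a concrete engine can be swapped without touching the analytic code:

```python
from abc import ABC, abstractmethod

class DataStore(ABC):
    """Abstract RA block: any storage technology implementing this
    interface can be dropped in without changing the analytics."""
    @abstractmethod
    def put(self, key, value): ...
    @abstractmethod
    def get(self, key): ...

class InMemoryStore(DataStore):
    """One concrete technology choice; a NoSQL-backed store could
    replace it behind the same interface."""
    def __init__(self):
        self._d = {}
    def put(self, key, value):
        self._d[key] = value
    def get(self, key):
        return self._d[key]

def run_analytic(store: DataStore):
    # The analytic depends only on the interface, not the engine.
    store.put("reading", 42)
    return store.get("reading") * 2

print(run_analytic(InMemoryStore()))
```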
The authors provide interesting lessons learned in continuing to increase
big data platform capacity to meet the ever-increasing demands of analytics.
Big data engineering is a separate discipline from BDA done through data
science. While data scientists need to be conceptually aware of the techniques
and issues of big data platforms, they need not be the ones installing and
maintaining such platforms.
For agile analytics, there is a significant need for analytics platforms that
insulate the data scientist from the underlying storage implementation details.
Jurney described an open source platform using Avro, Pig, MongoDB,
Elasticsearch, and Bootstrap, which could be used to address a number of data science
problems. SAIC has internally developed the Machine Augmented Analytics
Platform (MAAP), which is an enhancement of Zeppelin notebooks that can run
across a Mesos cluster using Spark and Docker containers.
The Cloudera Data Workbench (CDW) enables data scientists to develop
notebooks that invoke distributed compute jobs. The Griffon Data Science
Virtual Environment runs on Ubuntu MATE and includes a number of tools for
programming languages, editors and notebooks, and machine learning.
Traditional relational databases continue to evolve, pushing analytics into
the database as functions. Likewise, BDA platforms will continue to evolve, but
data science needs BDA platforms that allow the rapid development of scalable
agile analytics without the hindrance of continual software and platform
installation.
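Pushing analytics into the database as functions can be illustrated with SQLite's user-defined scalar functions; the table, column, and scoring function below are hypothetical examples, not from the text:

```python
import math
import sqlite3

def sigmoid_score(x):
    """Toy analytic to run inside the database."""
    return 1.0 / (1.0 + math.exp(-x))

con = sqlite3.connect(":memory:")
# Register the Python function so SQL queries can score rows in place,
# next to the data, instead of exporting them to an external tool.
con.create_function("score", 1, sigmoid_score)
con.execute("CREATE TABLE readings (id INTEGER, value REAL)")
con.executemany("INSERT INTO readings VALUES (?, ?)",
                [(1, -2.0), (2, 0.0), (3, 2.0)])
rows = con.execute("SELECT id, score(value) FROM readings ORDER BY id").fetchall()
print(rows)
```

Commercial relational engines expose the same idea through stored procedures and in-database machine learning extensions.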
Chapter 6 – Discussion
As discussed in Section III, agile methodologies have made a significant
impact on reducing development and rework time and increasing the success
rate of software development projects. Following this paradigm, a number of
fundamental assumptions must change for the BDA lifecycle.
The fundamental trade-off for agile BDA is that results must be presented
before they are considered done. The data quality may be low, but the results
should be presented to the customer for feedback anyway.
The data science community has proposed this approach, arguing that
large data volumes allow the curation assumption that most errors will cancel
each other out. This shortcut to data quality improvement bypasses the usual
full scan statistics to determine distributions and outliers—a task that is often
impractical. However, this assumption cannot hold when there are systemic
errors in the data.
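A quick simulation illustrates why the assumption fails under systemic error: random noise averages out as the data volume grows, while a constant bias does not. The numbers below are chosen purely for illustration:

```python
import random

random.seed(42)
true_mean = 100.0

# Random measurement noise: errors tend to cancel as volume grows,
# so the estimated mean stays close to the truth.
noisy = [true_mean + random.gauss(0, 10) for _ in range(100_000)]
random_err = abs(sum(noisy) / len(noisy) - true_mean)

# Systemic error: a constant +5 bias never cancels, however large the data.
biased = [x + 5.0 for x in noisy]
systemic_err = abs(sum(biased) / len(biased) - true_mean)

print(f"random-only error: {random_err:.3f}, with systemic bias: {systemic_err:.3f}")
```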
A second assumption often made is that more data beats better
algorithms. For large enough datasets, it is assumed that many errors will cancel.
Thus, it is important to get an overview of any proposed analytic, so a quick first
look at the analysis is good enough to evaluate whether to continue. It is
assumed that more data will result in better outcomes than continual small
refinements to analytic models.
These two assumptions are truly fundamental changes in typical systems
development, especially for data warehouses or for those used to sample data
for their analytics. A large part of the effort in building an enterprise data
warehouse has been concerned with ensuring the absolute quality of the data.
Likewise, data miners would never want to present results that have not been
evaluated as accurate.
While there is always the risk of inaccuracies, they should be resolved in
subsequent agile iterations.
B. Agile Effects
Chapter 7 - Conclusion
Agile practices and philosophy have transformed software development,
solving some of the issues inherent in the highly linear approach of waterfall
methodologies. There is a similarity between waterfall development and
outcomes-driven analytics development, where final results are expected to
fulfil an initial set of well-defined goals.
However, because of the experimental nature of analytics development,
detailed requirements cannot be set with complete confidence. It is only when
the results are meeting the needs of the organization that the details of the end-
state analytics models become clear.
There is a critical need for BDA to follow the software development
lessons learned and adopt agile methodologies. By adopting an agile philosophy
for analytics development, results are shared more frequently, forming a
feedback loop in which stakeholder needs validate the current state and
influence its evolution toward an agreeable end state.
We have presented our process model for BDA and discussed how it is an
improvement over the current industry standard, CRISP-DM, because of the
alignment with software development, the incorporation of agile practices, and
the adoption of a modern, flexible architecture based on NIST’s big data RA.
This new process model provides the added agility to adapt to the
differences in architecture depending on the data characteristics of volume,
velocity, variety, or variability by explicitly considering the architecture and the
choice of NoSQL methods.
References
P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, and R.
Wirth. “CRISP-DM 1.0: Step-by-step data mining guide”, The CRISP-DM
Consortium, 2000.
G. Piatetsky-Shapiro, “CRISP-DM, still the top methodology for analytics,
data mining, or data science projects”,
https://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html,
2014, accessed online October 9, 2017.
J. Taylor, “Four problems in using CRISP-DM and how to fix them”,
http://www.kdnuggets.com/2017/01/four-problems-crisp-dm-fix.html,
accessed October 9, 2017.
“NIST Big Data Interoperability Framework: Volume 1, Definitions”, NIST Special
Publication 1500-2, N. Grady and W. Chang, Eds.
https://bigdatawg.nist.gov/V2_output_docs.php, accessed October 6, 2017.
W. Royce, “Managing the Development of Large Software Systems”,
Proceedings of IEEE WESCON, Aug. 26, 1970.
K. Beck, M. Beedle, A. van Bennekum, A. Cockburn, W. Cunningham, M. Fowler,
D. Thomas, et al., “Manifesto for Agile Software Development”, 2001. Retrieved
October 9, 2017, from http://agilemanifesto.org
A. Cockburn (2008). “Design as Knowledge Acquisition”, Retrieved October 9,
2017, from http://a.cockburn.us/1973.
J. Pelrine, “On Understanding Software Agility—A Social Complexity Point
of View”, E:CO, May 11, 2011, pp. 27-37.
M. M. Waldrop, “Complexity: The Emerging Science at the Edge of Order and
Chaos”, Simon & Schuster, 1992.
D. Ivanov and M. Ivanov, “A Framework of Adaptive Control for Complex
Production and Logistics”, in Dynamics in Logistics: First International
Conference (LDIC 2007), Springer-Verlag, Berlin Heidelberg, pp. 151-159, doi:
10.1007/978-3-540-76862-3_14.