Data Science Solutions with Python
Fast and Scalable Models Using Keras, PySpark MLlib, H2O, XGBoost, and Scikit-Learn
Tshepo Chris Nokeri
Table of Contents
Introduction ... xv
ML Frameworks ... 13
    Scikit-Learn ... 13
    H2O ... 13
    XGBoost ... 14
DL Frameworks ... 14
    Keras ... 14
Chapter 10: Automating the Machine Learning Process with H2O ... 111
    Exploring Automated Machine Learning ... 111
    Preprocessing Features ... 112
    H2O AutoML in Action ... 112
    Conclusion ... 116
Index ... 117
About the Author
Tshepo Chris Nokeri harnesses advanced analytics and
artificial intelligence to foster innovation and optimize
business performance. He has delivered complex
solutions to companies in the mining, petroleum, and
manufacturing industries. He earned a Bachelor's degree
in Information Management and then graduated with an
honours degree in Business Science from the University of
the Witwatersrand, on a TATA Prestigious Scholarship and
a Wits Postgraduate Merit Award. He was also unanimously
awarded the Oxford University Press Prize. He is the author of Data Science Revealed,
Implementing Machine Learning in Finance, and Econometrics and Data Science, all
published by Apress.
About the Technical Reviewer
Joos Korstanje is a data scientist with over five years of industry experience in
developing machine learning tools, a large part of which has been forecasting models.
He currently works at Disneyland Paris, where he develops machine learning models for a
variety of tools. His experience in writing and teaching has motivated him to contribute
to this book.
Acknowledgments
Writing a single-authored book is demanding, but I received firm support and active
encouragement from my family and dear friends. Many heartfelt thanks to the Apress
Publishing team for all their support throughout the writing and editing processes.
Lastly, my humble thanks to all of you for reading this; I earnestly hope you find it
helpful.
Introduction
This book covers the in-memory, distributed cluster computing framework called
PySpark, the machine learning framework platforms called Scikit-Learn, PySpark MLlib,
H2O, and XGBoost, and the deep learning framework known as Keras. After reading this
book, you will be able to apply supervised and unsupervised learning to solve practical
and real-world data problems. In this book, you will learn how to engineer features,
optimize hyperparameters, train and test models, develop pipelines, and automate the
machine learning process.
To begin, the book carefully presents supervised and unsupervised ML and DL
models and examines big data frameworks and machine learning and deep learning
frameworks. It also discusses the parametric model called Generalized Linear Model
and a survival regression model known as the Cox Proportional Hazards model and
Accelerated Failure Time (AFT). It presents a binary classification model called Logistic
Regression and an ensemble model called Gradient Boost Trees. It also introduces DL
and an artificial neural network, the Multilayer Perceptron (MLP) classifier. It describes
a way of performing cluster analysis using the k-means model. It explores dimension
reduction techniques like Principal Components Analysis and Linear Discriminant
Analysis and concludes by unpacking automated machine learning.
The book targets intermediate data scientists and machine learning engineers who
want to learn how to apply key big data frameworks, as well as ML and DL frameworks.
Before exploring the contents of this book, be sure that you understand basic statistics,
Python programming, probability theories, and predictive analytics.
The book uses Anaconda (an open source distribution of Python) for
the examples. The following list highlights some of the Python libraries that this book
covers.
CHAPTER 1
Exploring Machine Learning
This chapter introduces the key machine learning methods and specifies the main
differences between supervised and unsupervised machine learning. It also discusses
various applications of both.
Machine learning has been around for a long time; however, it has recently gained
widespread recognition. This is because of the increased computational power of
modern computer systems and the ease of access to open source platforms and
frameworks. Machine learning involves endowing computer systems with intelligence by
applying various programming and statistical techniques. It draws from fields such
as statistics, computational linguistics, and neuroscience, among others, and applies
modern statistics and basic programming. It enables developers to build and deploy
intelligent computer systems and create practical and reliable applications.
Linear regression method: Applied when there is one dependent feature (a continuous feature) and an independent feature (a continuous or categorical feature). The main linear regression methods are GLM, Ridge, Lasso, Elastic Net, etc.
Survival regression method: Applied to time-to-event censored data, where the dependent feature is categorical and the independent feature is continuous.
Time series analysis method: Applied to uncover patterns in sequential data and forecast future instances. Principal time series models include the ARIMA, SARIMA, and Additive models.
Binary classification method: Applied when the categorical feature has only two possible outcomes. The popular binary classifier is the logistic regression model.
Multiclass classification method: Applied when the categorical feature has more than two possible outcomes. The main multiclass classifier is the Linear Discriminant Analysis model (which can also be used for dimension reduction).
Survival classification method: Applied when computing the probabilities of an event occurring using a categorical feature.
Centroid clustering: Applied to determine the center of the data and draw data points toward the center. The main centroid clustering method is the k-means method.
Density clustering: Applied to determine where the data is concentrated. The main density clustering model is the DBSCAN method.
Distribution clustering: Identifies the probability of data points belonging to a cluster based on some distribution. The main distribution clustering method is the Gaussian Mixture method.
Factor analysis: Applied to determine the extent to which latent factors explain related changes of features in the data.
Principal component analysis: Applied to reduce the dimensionality of the data by transforming the features into a smaller set of uncorrelated components that retain most of the variance.
Restricted Boltzmann Machine (RBM): The most common neural network that contains only the hidden and visible layers.
Multilayer Perceptron (MLP): A neural network that extends a restricted Boltzmann machine with input, hidden, and output layers.
Recurrent Neural Network (RNN): Serves as a sequential modeler.
Convolutional Neural Network (CNN): Serves as a dimension reducer and classifier.
Conclusion
This chapter covered two ways in which machines learn—via supervised and
unsupervised learning. It began by explaining supervised machine learning and
discussing the three types of supervised learning methods and their applications. It then
covered unsupervised learning techniques, dimension reduction, and cluster analysis.
CHAPTER 2
Big Data, Machine Learning, and Deep Learning Frameworks
Big Data
Big data means different things to different people. In this book, we define big data as
large amounts of data that we cannot adequately handle and manipulate using classic
methods. We must undoubtedly use scalable frameworks and modern technologies to
process and draw insight from this data. We typically consider data “big” when it cannot
fit within the current in-memory storage space. For instance, if you have a personal
computer and the data at your disposal exceeds your computer’s storage capacity, it’s big
data. This equally applies to large corporations with large clusters of storage space. We
often speak about big data when we use a stack with Hadoop/Spark.
Velocity: Modern technologies and improved connectivity enable you to generate data at an unprecedented speed. Characteristics of velocity include batch data, near or real-time data, and streams.
Volume: The scale at which data increases. The nature of the data sources and infrastructure influences the volume of data. Characteristics of volume include exabytes, zettabytes, etc.
Variety: Data can come from unique sources. Modern technological devices leave digital footprints here and there, which increase the number of sources from which businesses and people can get data. Characteristics of variety include the structure and complexity of the data.
Veracity: Data must come from reliable sources. Also, it must be of high quality, consistent, and complete.
Improved Decision-Making
When a business has big data, it can use it to uncover complex patterns of a
phenomenon to influence strategy. This approach helps management make well-
informed decisions based on evidence, rather than on subjective reasoning. Data-driven
organizations foster a culture of evidence-based management.
We also use big data in fields like life sciences, physics, economics, and medicine.
There are many ways in which big data affects the world. This chapter does not consider
all factors. The next sections explain big data warehousing and ETL activities.
To perform ETL activities, you must use a query language. The most popular query
language is SQL (Structured Query Language). Other query languages emerged with
the open source movement, such as HiveQL and BigQuery. The Python programming
language supports SQL: Python programs can connect to databases through libraries
such as SQLAlchemy, pyodbc, SQLite, SparkSQL, and pandas, among others.
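For instance, a minimal sketch of querying a relational table into a pandas dataframe, assuming a hypothetical SQLite database file named warehouse.db and a table named customers (neither comes from the text):

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical SQLite database file and table, used purely for illustration
engine = create_engine("sqlite:///warehouse.db")
customers = pd.read_sql("SELECT * FROM customers", con=engine)
print(customers.head())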
Apache Spark
Apache Spark executes in-memory cluster computing. It enables developers to
build scalable applications using Java, Scala, Python, R, and SQL. It includes cluster
components like the driver, cluster manager, and executor. You can run it as a standalone
cluster manager or on top of Mesos, Hadoop YARN, or Kubernetes. You can use it to access
data in the Hadoop Distributed File System (HDFS), Cassandra, HBase, and Hive, among
other data sources. Spark's core data structure is the resilient distributed dataset (RDD).
This book introduces a framework that integrates Python and Apache Spark (PySpark)
and uses it to operate Spark MLlib. To understand this framework, you first need to
grasp the idea behind resilient distributed data sets.
Spark Configuration
Areas of Spark configuration include Spark properties, environment variables, and
logging. The default configuration directory is SPARK_HOME/conf.
You can install the findspark library in your environment using pip install
findspark and install the pyspark library using pip install pyspark.
Listing 2-1 prepares the PySpark framework using the findspark framework.
Listing 2-2 stipulates the PySpark app using the SparkConf() method.
Listing 2-3 prepares the PySpark session with the SparkSession() method.
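A minimal sketch of these three steps, with the application name and local master as assumptions rather than the book's exact settings:

import findspark
findspark.init()  # locate the local Spark installation

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Stipulate the app name and master (local mode is an assumption here)
spark_configuration = SparkConf().setAppName("data_science_app").setMaster("local[*]")

# Prepare (or retrieve) the Spark session from the configuration
pyspark_session = SparkSession.builder.config(conf=spark_configuration).getOrCreate()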
Spark Frameworks
Spark frameworks extend the core of the Spark API. There are four main Spark
frameworks—SparkSQL, Spark Streaming, Spark MLlib, and GraphX.
SparkSQL
SparkSQL enables you to use relational query languages like SQL, HiveQL, and Scala. It
includes a SchemaRDD (row objects plus a schema) that you can create from an existing
RDD, a Parquet file, or a JSON data set. You create a SQL context from the Spark context.
Spark Streaming
Spark Streaming is a scalable streaming framework that supports Apache Kafka, Apache
Flume, HDFS, and Kinesis, among others. It processes input data in small batches using
DStreams and can push the results to HDFS, databases, and dashboards. Recent versions
of Python are not supported by Spark Streaming, so this book does not cover the
framework. You can use a Spark Streaming application to read input from any data source
and store a copy of the data in HDFS. This allows you to build and launch a Spark Streaming
application that processes incoming data and runs an algorithm on it.
Spark MLlib
MLlib is an ML framework that allows you to develop and test ML and DL models.
In Python, the framework works hand-in-hand with NumPy. Spark
MLlib can be used with several Hadoop data sources and incorporated into
Hadoop workflows. Common algorithms include regression, classification, clustering,
collaborative filtering, and dimension reduction. Key workflow utilities include feature
transformation, standardization and normalization, pipeline development, model
evaluation, and hyperparameter optimization.
GraphX
GraphX is a scalable and fault-tolerant framework for iterative and fast graph parallel
computing, social networks, and language modeling. It includes graph algorithms
such as PageRank for estimating the importance of each vertex in a graph, Connected
Components for labeling connected components of the graph with the ID of its lowest-
numbered vertex, and Triangle Counting for finding the number of triangles that pass
through each vertex.
ML Frameworks
To solve ML problems, you need a framework that supports building and scaling
ML models. There is no shortage of options. Subsequent chapters cover
frameworks like Scikit-Learn, Spark MLlib, H2O, and XGBoost.
Scikit-Learn
The Scikit-Learn framework includes ML algorithms like regression, classification, and
clustering, among others. You can use it with other frameworks such as NumPy and
SciPy. It can perform most of the tasks required for ML projects like data processing,
transformation, data splitting, normalization, hyperparameter optimization, model
development, and evaluation. Scikit-Learn comes with most distribution packages that
support Python. Use pip install scikit-learn to install it in your Python environment.
H2O
H2O is an ML framework that uses a driverless technology. It enables you to accelerate
the adoption of AI solutions. It is very easy to use, and it does not require any technical
expertise. Not only that, but it supports numerical and categorical data, including text.
Before you train the ML model, you must first load the data into the H2O cluster. It
supports CSV, Excel, and Parquet files. Default data sources include local file systems,
remote files, Amazon S3, HDFS, etc. It has ML algorithms like regression, classification,
cluster analysis, and dimension reduction. It can also perform most tasks required
for ML projects like data processing, transformation, data splitting, normalization,
hyperparameter optimization, model development, checkpointing, evaluation, and
productionizing. Use pip install h2o to install the package in your environment.
Listing 2-4 prepares the H2O framework.
import h2o
h2o.init()
XGBoost
XGBoost is an ML framework that supports several programming languages, including
Python. It executes scalable gradient-boosted models and supports fast parallel and
distributed computing without sacrificing memory efficiency. It is also an ensemble
learner; as mentioned earlier, ensemble learners can solve both regression and
classification problems. XGBoost uses boosting to learn from the errors committed by the
preceding trees, which is useful when tree-based models overfit. Use
pip install xgboost to install it in your Python environment.
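As a quick illustration (not taken from the text), a gradient-boosted regressor on a synthetic data set might look like this; the data and hyperparameters are assumptions chosen purely for the example:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Synthetic regression data used purely for illustration
x, y = make_regression(n_samples=500, n_features=10, noise=0.3, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Each boosting round fits a new tree to the errors of the preceding trees
xgb_model = XGBRegressor(n_estimators=100, learning_rate=0.1)
xgb_model.fit(x_train, y_train)
print(xgb_model.score(x_test, y_test))  # R^2 on the held-out data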
DL Frameworks
DL frameworks provide a structure that supports scaling artificial neural networks. You
can use them standalone or alongside other models; they typically include programs and
code frameworks. Primary DL frameworks include TensorFlow, PyTorch, Deeplearning4j,
Microsoft Cognitive Toolkit (CNTK), and Keras.
Keras
Keras is a high-level DL framework written using Python; it runs on top of an ML
platform known as TensorFlow. It is effective for rapid prototyping of DL models. You
can run Keras on Tensor Processing Units (TPUs) or on massive Graphics Processing Units
(GPUs). The main Keras APIs include models, layers, and callbacks. Chapter 7 covers this
framework. Execute pip install keras and pip install tensorflow to use the Keras framework.
CHAPTER 3
Linear Modeling with Scikit-Learn, PySpark, and H2O
This introductory chapter explains the ordinary least-squares method and executes it
with the main Python frameworks (i.e., Scikit-Learn, Spark MLlib, and H2O). It begins by
explaining the underlying concept behind the method.
import pandas as pd
df = pd.read_csv(r"filepath\WA_Fn-UseC_-Marketing_Customer_Value_Analysis.csv")
Listing 3-2 stipulates the names of columns to drop and then executes the drop()
method. It then stipulates axes as columns in order to drop the unnecessary columns in
the data.
Listing 3-3 attains the dummy values for the categorical features in this data.
import numpy as np
int_x = initial_data.iloc[::,0:19]
fin_x = initial_data.iloc[::,19:21]
x_combined = pd.concat([int_x, fin_x], axis=1)
x = np.array(x_combined)
y = np.array(initial_data.iloc[::,19])
Scikit-Learn in Action
Listing 3-5 randomly divides the dataframe.
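A minimal sketch of the split, the scaling, and the model object tuned by the grid search below, under assumed settings (an 80/20 split and default model parameters):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Randomly divide the arrays into training and test sets (an 80/20 split is assumed)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Standardize the independent features
sk_standard_scaler = StandardScaler()
sk_standard_scaled_x_train = sk_standard_scaler.fit_transform(x_train)
sk_standard_scaled_x_test = sk_standard_scaler.transform(x_test)

# The ordinary least-squares model tuned by the grid search in Listing 3-8
sk_linear_model = LinearRegression()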
Listing 3-8 determines the best hyperparameters for the Scikit-Learn ordinary least-
squares regression method.
Listing 3-8. Determine the Best Hyperparameters for the Scikit-Learn Ordinary
Least-Squares Regression Method
from sklearn.model_selection import GridSearchCV
sk_linear_model_param = {'fit_intercept': [True, False]}
sk_linear_model_param_mod = GridSearchCV(estimator=sk_linear_model, param_grid=sk_linear_model_param, n_jobs=-1)
sk_linear_model_param_mod.fit(sk_standard_scaled_x_train, y_train)
Listing 3-13 assesses the Scikit-Learn ordinary least-squares method (see Table 3-1).
Table 3-1 shows that the Scikit-Learn ordinary least-squares method explains the
entire variability.
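A sketch of the assessment, assuming the scaled test set from the split sketched earlier and the metrics reported in Table 3-1:

from sklearn import metrics

sk_linear_model.fit(sk_standard_scaled_x_train, y_train)
sk_yhat = sk_linear_model.predict(sk_standard_scaled_x_test)
print("MAE: ", metrics.mean_absolute_error(y_test, sk_yhat))
print("MSE: ", metrics.mean_squared_error(y_test, sk_yhat))
print("R^2: ", metrics.r2_score(y_test, sk_yhat))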
PySpark in Action
This section executes and assesses the ordinary least-squares method with the PySpark
framework. Listing 3-14 prepares the PySpark framework with the findspark framework.
Listing 3-15 stipulates the PySpark app with the SparkConf() method.
Listing 3-16 prepares the PySpark session with the SparkSession() method.
Listing 3-17 changes the pandas dataframe created earlier in this chapter to a
PySpark dataframe using the createDataFrame() method.
pyspark_initial_data = pyspark_session.createDataFrame(initial_data)
Listing 3-18 creates a list for independent features and a string for the dependent
feature. It converts data using the VectorAssembler() method for modeling with the
PySpark framework.
pyspark_data_columns = x_list
pyspark_vector_assembler = VectorAssembler(inputCols=pyspark_data_columns, outputCol="variables")
pyspark_data = pyspark_vector_assembler.transform(pyspark_initial_data)
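A minimal sketch of the fitting step, assuming x_list and y_list were created in Listing 3-18 as the feature-name list and the dependent column name:

from pyspark.ml.regression import LinearRegression

# Fit the ordinary least-squares method on the assembled "variables" column
pyspark_linear_model = LinearRegression(featuresCol="variables", labelCol=y_list)
pyspark_fitted_linear_model = pyspark_linear_model.fit(pyspark_data)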
pyspark_linear_model_assessment = pyspark_fitted_linear_model.summary
print("PySpark root mean squared error", pyspark_linear_model_assessment.rootMeanSquaredError)
print("PySpark determinant coefficient", pyspark_linear_model_assessment.r2)
H2O in Action
This section executes and assesses the ordinary least-squares method with the H2O
framework.
Listing 3-23 prepares the H2O framework.
h2o_data = initialize_h2o.H2OFrame(initial_data)
y = y_list
x = h2o_data.col_names
x.remove(y_list)
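A minimal sketch of splitting the H2O frame and fitting a Gaussian GLM; the split ratios here are assumptions:

from h2o.estimators.glm import H2OGeneralizedLinearEstimator

# Split the frame into training, validation, and test sets (80/10/10 is an assumption)
h2o_training_data, h2o_validation_data, h2o_test_data = h2o_data.split_frame(ratios=[0.8, 0.1])

# Train the ordinary least-squares (Gaussian GLM) method
h2o_linear_model = H2OGeneralizedLinearEstimator(family="gaussian")
h2o_linear_model.train(x=x, y=y, training_frame=h2o_training_data, validation_frame=h2o_validation_data)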
h2o_yhat = h2o_linear_model.predict(h2o_test_data)
h2o_linear_model_std_coefficients = h2o_linear_model.std_coef_plot()
h2o_linear_model_std_coefficients
Listing 3-30 computes the H2O ordinary least-squares method’s partial dependency
(see Figure 3-2).
h2o_linear_model_dependency_plot = h2o_linear_model.partial_plot(data=h2o_data, cols=list(initial_data.columns[[0, 19]]), server=False, plot=True)
h2o_linear_model_dependency_plot
Listing 3-31 arranges the features that are the most important to the H2O ordinary
least-squares method in ascending order (see Figure 3-3).
h2o_linear_model_feature_importance = h2o_linear_model.varimp_plot()
h2o_linear_model_feature_importance
h2o_linear_model_assessment = h2o_linear_model.model_performance()
print(h2o_linear_model_assessment)
ModelMetricsRegressionGLM: glm
** Reported on train data. **
MSE: 24844.712331260016
RMSE: 157.6220553452467
MAE: 101.79904883889066
RMSLE: NaN
R^2: 0.7004468136072375
Mean Residual Deviance: 24844.712331260016
Null degrees of freedom: 7325
Residual degrees of freedom: 7304
Null deviance: 607612840.7465751
Residual deviance: 182012362.53881088
AIC: 94978.33944003603
Listing 3-33 improves the performance of the H2O ordinary least-squares method by
specifying remove_collinear_columns as True.
h2o_linear_model_collinear_removed = H2OGeneralizedLinearEstimator(family="gaussian", lambda_=0, remove_collinear_columns=True)
h2o_linear_model_collinear_removed.train(x=x, y=y, training_frame=h2o_training_data, validation_frame=h2o_validation_data)
h2o_linear_model_collinear_removed_assessment = h2o_linear_model_collinear_removed.model_performance()
print(h2o_linear_model_collinear_removed)
MSE: 23380.71864337616
RMSE: 152.9075493341521
MAE: 102.53007935777588
RMSLE: NaN
R^2: 0.7180982143647627
Mean Residual Deviance: 23380.71864337616
Null degrees of freedom: 7325
Residual degrees of freedom: 7304
Null deviance: 607612840.7465751
ModelMetricsRegressionGLM: glm
** Reported on validation data. **
MSE: 25795.936313899092
RMSE: 160.6111338416459
MAE: 103.18677222520363
RMSLE: NaN
R^2: 0.7310558588001701
Mean Residual Deviance: 25795.936313899092
Null degrees of freedom: 875
Residual degrees of freedom: 854
Null deviance: 84181020.04623385
Residual deviance: 22597240.210975606
AIC: 11430.364002305443
Conclusion
This chapter executed three key machine learning frameworks (Scikit-Learn, PySpark,
and H2O) to model data and spawn a continuous output feature using a linear method.
It also explored ways of assessing that method.
CHAPTER 4
Survival Analysis with PySpark and Lifelines
$h(t \mid x_1, x_2, \ldots, x_p) = h_0(t)\,\exp(b_1 X_1 + b_2 X_2 + \cdots + b_p X_p)$ (Equation 4-1)
Listing 4-1 attains the necessary data from a Microsoft Excel file.
import pandas as pd
initial_data = pd.read_excel(r"filepath\survival_data.xlsx", index_col=[0])
int(initial_data.shape[0]) * 0.8
345.6
lifeline_training_data = initial_data.loc[:346]
lifeline_test_data = initial_data.loc[346:]
Lifeline in Action
This section executes and assesses the Cox Proportional Hazards method with the
Lifeline framework. Listing 4-4 executes the Lifeline Cox Proportional Hazards method.
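A minimal sketch of that step, assuming "week" is the duration column and "arrest" the event column of the recidivism data loaded above:

from lifelines import CoxPHFitter

# Fit the Cox Proportional Hazards method on the training portion of the data
lifeline_cox_method = CoxPHFitter()
lifeline_cox_method.fit(lifeline_training_data, duration_col="week", event_col="arrest")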
Listing 4-5 computes the test statistics (see Table 4-1) and assesses the Lifeline
Cox Proportional Hazards method with a scaled Schoenfeld, which helps disclose any
abnormalities in the residuals (see Figure 4-1).
Listing 4-5. Compute the Lifeline Cox Proportional Hazards Method’s Test
Statistics and Residuals
lifeline_cox_method_test_statistics_schoenfeld = lifeline_cox_method.check_assumptions(lifeline_training_data, show_plots=True)
lifeline_cox_method_test_statistics_schoenfeld
Table 4-1. Test Statistics for the Lifeline Cox Proportional Hazards Method
Test Statistic p
Listing 4-6 determines the Lifeline Cox Proportional Hazards method’s assessment
summary (see Table 4-2).
lifeline_cox_method_assessment_summary = lifeline_cox_method.print_summary()
lifeline_cox_method_assessment_summary
Covariate  coef  exp(coef)  se(coef)  coef lower 95%  coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%  z  p  -log2(p)
Fin   -0.71  0.49  0.23  -1.16  -0.27  0.31  0.77  -3.13  <0.005  9.14
Age   -0.03  0.97  0.02  -0.08   0.01  0.93  1.01  -1.38   0.17   2.57
Race   0.39  1.48  0.37  -0.34   1.13  0.71  3.09   1.05   0.30   1.76
Wexp  -0.11  0.90  0.24  -0.59   0.37  0.56  1.44  -0.45   0.65   0.62
Mar   -1.15  0.32  0.61  -2.34   0.04  0.10  1.04  -1.90   0.06   4.11
Paro   0.07  1.07  0.23  -0.37   0.51  0.69  1.67   0.31   0.76   0.40
Prio   0.10  1.11  0.03   0.04   0.16  1.04  1.17   3.24  <0.005  9.73
Listing 4-7 determines the log test confidence interval for each feature in the data
(see Figure 4-2).
lifeline_cox_log_test_ci = lifeline_cox_method.plot()
lifeline_cox_log_test_ci
PySpark in Action
This section executes the accelerated failure time method with the PySpark framework.
Listing 4-8 runs the PySpark framework with the findspark framework.
Listing 4-9 stipulates the PySpark app using the SparkConf() method.
Listing 4-10 prepares the PySpark session using the SparkSession() method.
Listing 4-11 changes the pandas dataframe created earlier in this chapter to a
PySpark dataframe using the createDataFrame() method.
pyspark_initial_data = pyspark_session.createDataFrame(initial_data)
Listing 4-12 creates a list for independent features and a string for the dependent
feature. It the converts the data using the VectorAssembler() method for modeling with
the PySpark framework.
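A minimal sketch of the assembly and the model fit, assuming x_list holds the independent column names and that "week" and "arrest" are the duration and censoring columns:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import AFTSurvivalRegression

# Assemble the independent features into a single vector column
pyspark_vector_assembler = VectorAssembler(inputCols=x_list, outputCol="features")
pyspark_data = pyspark_vector_assembler.transform(pyspark_initial_data)

# Fit the accelerated failure time method on the assembled data
pyspark_aft_method = AFTSurvivalRegression(featuresCol="features", labelCol="week", censorCol="arrest")
pyspark_aft_method_fitted = pyspark_aft_method.fit(pyspark_data)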
Listing 4-14 computes the PySpark accelerated failure time method’s predictions.
+------+------------------+
|arrest| prediction|
+------+------------------+
| 1|18.883982665910125|
| 1| 16.88228128814963|
| 1|22.631360777172517|
| 0|373.13041474613107|
| 0| 377.2238319806288|
| 0| 375.8326538406928|
| 1| 20.9780526816987|
| 0| 374.6420738270714|
| 0| 379.7483494080467|
| 0| 376.1601473382181|
| 0| 377.1412349521787|
| 0| 373.7536844216336|
| 1| 36.36443059383637|
| 0|374.14261327949384|
| 1| 22.98494042401171|
| 1| 50.61463874375869|
| 1| 25.56399364288275|
| 0|379.61997114629696|
| 0| 384.3322960430372|
| 0|376.37634062210844|
+------+------------------+
Listing 4-15 computes the PySpark accelerated failure time method’s coefficients.
Conclusion
This chapter executed two key machine learning frameworks (Lifeline and PySpark) to
model censored data with the Cox Proportional Hazards and accelerated failure time
methods.
CHAPTER 5
Nonlinear Modeling with Scikit-Learn, PySpark, and H2O
This chapter executes and appraises a nonlinear method for binary classification (called
logistic regression) using a diverse set of comprehensive Python frameworks (i.e., Scikit-
Learn, Spark MLlib, and H2O). To begin, it clarifies the underlying concept behind the
sigmoid function.
$S(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1}$ (Equation 5-1)
Both Equation 5-1 and Figure 5-1 suggest that the function squeezes its output into the
range between 0 and 1, which can then be thresholded to produce binary classes.
Listing 5-1 attains the necessary data from a CSV file using the pandas
framework.
Listing 5-2 stipulates the names of columns to drop and then executes the drop()
method. It stipulates axes as columns in order to drop the unnecessary columns in the
data.
Listing 5-3 attains dummy values for categorical features in the data.
initial_data = initial_data.dropna()
Scikit-Learn in Action
This section executes and assesses the logistic regression method with the Scikit-Learn
framework. Listing 5-5 outlines the independent and dependent features.
import numpy as np
x = np.array(initial_data.iloc[::, 0:17])
y = np.array(initial_data.iloc[::,-1])
Listing 5-9 determines the best hyperparameters for the Scikit-Learn logistic
regression method.
Listing 5-9. Determine the Best Hyperparameters for the Scikit-Learn Logistic
Regression Method
from sklearn.model_selection import GridSearchCV
sk_logistic_regression_method_param = {"penalty": ("l1", "l2")}
sk_logistic_regression_method_param_mod = GridSearchCV(estimator=sk_logistic_regression_method, param_grid=sk_logistic_regression_method_param, n_jobs=-1)
sk_logistic_regression_method_param_mod.fit(sk_standard_scaled_x_train, y_train)
print("Best logistic regression score: ", sk_logistic_regression_method_param_mod.best_score_)
print("Best logistic regression parameter: ", sk_logistic_regression_method_param_mod.best_params_)
Best logistic regression score: 0.8986039453717755
Best logistic regression parameter: {'penalty': 'l2'}
Listing 5-10 executes the logistic regression method with the Scikit-Learn framework.
sk_logistic_regression_method = LogisticRegression(penalty="l2")
sk_logistic_regression_method.fit(sk_standard_scaled_x_train, y_train)
print(sk_logistic_regression_method.intercept_)
[-2.4596243]
print(sk_logistic_regression_method.coef_)
[[ 0.03374725 0.04330667 -0.01305369 -0.02709009 0.13508899 0.01735913
0.00816758 0.42948983 -0.12670658 -0.25784955 -0.04025993 -0.14622466
-1.14143485 0.70803518 0.23256046 -0.02295578 -0.02857435]]
Listing 5-14 computes the appropriate classification report (see Table 5-2).
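A minimal sketch of that report, assuming the scaled test features and test labels created by the earlier split:

from sklearn import metrics

# Predict the test classes and print the classification report
sk_yhat = sk_logistic_regression_method.predict(sk_standard_scaled_x_test)
print(metrics.classification_report(y_test, sk_yhat))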
Listing 5-15 arranges the Scikit-Learn logistic regression method's receiver operating
characteristics curve. The goal is to summarize the relationship between the true positive
rate (the proclivity of the method to correctly identify positive classes) and the false
positive rate (the proclivity of the method to incorrectly flag negative classes as positive).
See Figure 5-2.
Figure 5-2. Receiver operating characteristics curve for the Scikit-Learn logistic
regression method
Listing 5-17 arranges the learning curve for the Scikit-Learn logistic regression
method to disclose the variations in weighted training and cross-validation accuracy
(see Figure 5-4).
Listing 5-17. Learning Curve for the Logistic Regression Method Executed by
Scikit-Learn
from sklearn.model_selection import learning_curve
train_port_sk_logistic_regression_method, trainscoresk_logistic_regression_method, testscoresk_logistic_regression_method = learning_curve(sk_logistic_regression_method, x, y, cv=3, n_jobs=-5, train_sizes=np.linspace(0.1, 1.0, 50))
trainscoresk_logistic_regression_method_mean = np.mean(trainscoresk_logistic_regression_method, axis=1)
testscoresk_logistic_regression_method_mean = np.mean(testscoresk_logistic_regression_method, axis=1)
plt.plot(train_port_sk_logistic_regression_method, trainscoresk_logistic_regression_method_mean, label="Weighted training accuracy")
plt.plot(train_port_sk_logistic_regression_method, testscoresk_logistic_regression_method_mean, label="Weighted cv accuracy score")
plt.xlabel("Training values")
plt.ylabel("Weighted accuracy score")
plt.legend(loc="best")
plt.show()
Figure 5-4. Learning curve for the logistic regression method executed by
Scikit-Learn
PySpark in Action
This section executes and assesses the logistic regression method with the PySpark
framework.
Listing 5-18 prepares the PySpark framework using the findspark framework.
Listing 5-19 stipulates the PySpark app using the SparkConf() method.
Listing 5-20 prepares the PySpark session using the SparkSession() method.
Listing 5-21 changes the pandas dataframe created earlier in this chapter to a
PySpark dataframe using the createDataFrame() method.
pyspark_initial_data = pyspark_session.createDataFrame(initial_data)
Listing 5-22 creates a list of independent features and a string for the dependent
feature. It then converts the data using the VectorAssembler() method for modeling
with the PySpark framework.
pyspark_data_columns = x_list
pyspark_vector_assembler = VectorAssembler(inputCols=pyspark_data_columns, outputCol="features")
pyspark_data = pyspark_vector_assembler.transform(pyspark_initial_data)
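A minimal sketch of the fitting step, assuming "y" is the name of the dependent column in the banking data:

from pyspark.ml.classification import LogisticRegression

# Fit the logistic regression method on the assembled "features" column
pyspark_logistic_regression_method = LogisticRegression(featuresCol="features", labelCol="y")
pyspark_logistic_regression_method_fitted = pyspark_logistic_regression_method.fit(pyspark_data)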
Listing 5-26 arranges the PySpark logistic regression method's receiver operating
characteristics curve to summarize the relationship between the true positive rate and the
false positive rate (see Figure 5-5).
Listing 5-26. Receiver Operating Characteristics Curve for the PySpark Logistic
Regression Method
pyspark_logistic_regression_method_assessment = pyspark_logistic_regression_method_fitted.summary
pyspark_logistic_regression_method_roc = pyspark_logistic_regression_method_assessment.roc.toPandas()
pyspark_logistic_regression_method_auroc = pyspark_logistic_regression_method_assessment.areaUnderROC
plt.plot(pyspark_logistic_regression_method_roc["FPR"], pyspark_logistic_regression_method_roc["TPR"], label="AUC= "+str(pyspark_logistic_regression_method_auroc))
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.legend(loc=4)
plt.show()
Figure 5-5. Receiver operating characteristics curve for the PySpark logistic
regression method
Listing 5-27 arranges the PySpark logistic regression method's precision-recall curve
to summarize the trade-off between precision and recall (see Figure 5-6).
Listing 5-27. Precision-Recall Curve for the PySpark Logistic Regression Method
pyspark_logistic_regression_method_assessment = pyspark_logistic_regression_method_fitted.summary
pyspark_logistic_regression_method_assessment_pr = pyspark_logistic_regression_method_assessment.pr.toPandas()
pyspark_logistic_regression_method_assessment_wpr = pyspark_logistic_regression_method_assessment.weightedPrecision
plt.plot(pyspark_logistic_regression_method_assessment_pr["precision"], pyspark_logistic_regression_method_assessment_pr["recall"], label="WPR: "+str(pyspark_logistic_regression_method_assessment_wpr))
plt.xlabel("Precision")
plt.ylabel("Recall")
plt.legend(loc="best")
plt.show()
Figure 5-6. Precision-recall curve for the PySpark logistic regression method
H2O in Action
This section executes and assesses the logistic regression method using the H2O
framework. Listing 5-28 prepares the H2O framework.
h2o_data = initialize_h2o.H2OFrame(initial_data)
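A minimal sketch of splitting the frame and training a binomial GLM, assuming "y" is the dependent column name, x and y are the H2O column lists created as in Chapter 3, and an 80/20 split:

from h2o.estimators.glm import H2OGeneralizedLinearEstimator

# Treat the dependent feature as categorical and split the frame
h2o_data["y"] = h2o_data["y"].asfactor()
h2o_training_data, h2o_test_data = h2o_data.split_frame(ratios=[0.8])

# Train the logistic regression (binomial GLM) method
h2o_logistic_regression_method = H2OGeneralizedLinearEstimator(family="binomial")
h2o_logistic_regression_method.train(x=x, y=y, training_frame=h2o_training_data)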
h2o_yhat = h2o_logistic_regression_method.predict(h2o_test_data)
Listing 5-34 computes the H2O logistic regression method’s predictions (see
Figure 5-7).
Listing 5-35 computes the H2O logistic regression method’s partial dependence
(see Figure 5-8).
Listing 5-36 arranges the features that are most important to the H2O logistic
regression method in ascending order (see Figure 5-9).
Listing 5-37 arranges the H2O logistic regression method's receiver operating
characteristics curve to summarize the relationship between the true positive rate and the
false positive rate (see Figure 5-10).
Listing 5-37. Receiver Operating Characteristics Curve for the H2O Logistic
Regression Method
h2o_logistic_regression_method_assessment = h2o_logistic_regression_method.model_performance()
Figure 5-10. Receiver operating characteristics curve for the H2O logistic
regression method
Conclusion
This chapter executed three key machine learning frameworks (Scikit-Learn, PySpark,
and H2O) to model data and spawn a categorical output feature with two classes using
the logistic regression method.
CHAPTER 6
Tree Modeling and Gradient Boosting with Scikit-Learn, XGBoost, PySpark, and H2O
Decision Trees
The decision tree is an elementary, non-parametric method suitable for linear and
nonlinear modeling. It executes decision rules and develops a tree-like structure that
divides values into varying groups with the least depth (see Figure 6-1).
Preprocessing Features
This chapter manipulates the data from Chapter 5, so it does not sequentially cover the
preprocessing tasks. Listing 6-1 executes all the preprocessing tasks.
Scikit-Learn in Action
This section executes and assesses the decision tree method with the Scikit-Learn
framework. Listing 6-2 executes the decision tree method with the Scikit-Learn
framework.
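A minimal sketch of that step, assuming default hyperparameters before the grid search below:

from sklearn.tree import DecisionTreeClassifier

# The decision tree classifier tuned by the grid search in Listing 6-3
sk_decision_tree_method = DecisionTreeClassifier(random_state=0)
sk_decision_tree_method.fit(sk_standard_scaled_x_train, y_train)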
Listing 6-3 determines the best hyperparameters for the Scikit-Learn decision tree
method.
Listing 6-3. Determine the Best Hyperparameters for the Scikit-Learn Decision
Tree Method
from sklearn.model_selection import GridSearchCV
sk_decision_tree_method_parameters = {"criterion": ("gini", "entropy"), "max_depth": [1, 2, 3, 4, 5, 6]}
sk_decision_tree_method_g_search = GridSearchCV(estimator=sk_decision_tree_method, param_grid=sk_decision_tree_method_parameters)
sk_decision_tree_method_g_search.fit(sk_standard_scaled_x_train, y_train)
print("Best decision tree score: ", sk_decision_tree_method_g_search.best_score_)
print("Best decision tree parameter: ", sk_decision_tree_method_g_search.best_params_)
Listing 6-4 executes the decision tree method using the Scikit-Learn framework.
Listing 6-5 arranges the Scikit-Learn decision tree method’s classification report (see
Table 6-1).
Listing 6-6 arranges the decision tree method's receiver operating characteristics
curve to summarize the relationship between the true positive rate and the false positive
rate (see Figure 6-2).
Listing 6-6. Receiver Operating Characteristics Curve for the Decision Tree
Method (Executed by the Scikit-Learn Framework)
import matplotlib.pyplot as plt
%matplotlib inline
sk_yhat_proba = sk_decision_tree_method.predict_proba(sk_standard_scaled_x_test)[::, 1]
fpr_sk_decision_tree_method, tprr_sk_decision_tree_method, _ = metrics.roc_curve(y_test, sk_yhat_proba)
area_under_curve_sk_decision_tree_method = metrics.roc_auc_score(y_test, sk_yhat_proba)
plt.plot(fpr_sk_decision_tree_method, tprr_sk_decision_tree_method, label="AUC= " + str(area_under_curve_sk_decision_tree_method))
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.legend(loc="best")
plt.show()
Figure 6-2. Receiver operating characteristics curve for the Scikit-Learn decision tree method
Listing 6-7 arranges the Scikit-Learn decision tree method's precision-recall curve
to summarize the trade-off between precision and recall (see Figure 6-3).
Listing 6-7. Precision-Recall Curve for the Scikit-Learn Decision Tree Method
Figure 6-3. Precision-recall curve for the Scikit-Learn framework decision tree
method
Listing 6-8 arranges the Scikit-Learn decision tree method’s learning curve (see
Figure 6-4).
Listing 6-8. Learning Curve for the Decision Tree Method Executed by the
Scikit-Learn Framework
from sklearn.model_selection import learning_curve
train_port_sk_decision_tree_method, trainscore_sk_decision_tree_method, testscore_sk_decision_tree_method = learning_curve(sk_decision_tree_method, x, y, cv=3, n_jobs=-5, train_sizes=np.linspace(0.1, 1.0, 50))
trainscoresk_decision_tree_method_mean = np.mean(trainscore_sk_decision_tree_method, axis=1)
testscoresk_decision_tree_method_mean = np.mean(testscore_sk_decision_tree_method, axis=1)
plt.plot(train_port_sk_decision_tree_method, trainscoresk_decision_tree_method_mean, label="Weighted training accuracy")
plt.plot(train_port_sk_decision_tree_method, testscoresk_decision_tree_method_mean, label="Weighted cv accuracy score")
plt.xlabel("Training values")
plt.ylabel("Weighted accuracy score")
plt.legend(loc="best")
plt.show()
Figure 6-4. Learning curve for the decision tree method executed by Scikit-Learn
Gradient Boosting
Gradient boosting methods take the input features and fit a sequence of tree models to
minimize a loss function (for example, the mean absolute error in regression problems).
They do this by combining weak learners: the data is reweighted incrementally and
iteratively, and at each step the weak learner that best reduces the remaining error is
added to the ensemble.
XGBoost in Action
This section executes and assesses the decision tree method with the XGBoost
framework. Listing 6-9 executes the XGBoost gradient boosting method.
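A minimal sketch of that step, assuming default hyperparameters:

from xgboost import XGBClassifier

# Fit the gradient boosting classifier on the scaled training data
xgb_gradient_boosting_method = XGBClassifier()
xgb_gradient_boosting_method.fit(sk_standard_scaled_x_train, y_train)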
Listing 6-10 arranges the XGBoost gradient boosting method’s classification report
(see Table 6-2).
Listing 6-11 arranges the XGBoost gradient boosting method's receiver operating
characteristics curve to summarize the relationship between the true positive rate and the
false positive rate (see Figure 6-5).
Listing 6-11. Receiver Operating Characteristics Curve for the XGBoost Gradient
Boosting Method
yhat_proba_xgb_gradient_boosting_method = xgb_gradient_boosting_method.predict_proba(sk_standard_scaled_x_test)[::, 1]
fpr_xgb_gradient_boosting_method, tprr_xgb_gradient_boosting_method, _ = metrics.roc_curve(y_test, yhat_proba_xgb_gradient_boosting_method)
area_under_curve_xgb_gradient_boosting_method = metrics.roc_auc_score(y_test, yhat_proba_xgb_gradient_boosting_method)
plt.plot(fpr_xgb_gradient_boosting_method, tprr_xgb_gradient_boosting_method, label="AUC= " + str(area_under_curve_xgb_gradient_boosting_method))
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.legend(loc="best")
plt.show()
Figure 6-5. Receiver operating characteristics curve for the XGBoost gradient
boosting method
Listing 6-12 arranges the XGBoost gradient boosting method's precision-recall curve
to summarize the trade-off between precision and recall (see Figure 6-6).
Listing 6-12. Precision-Recall Curve for the XGBoost Gradient Boosting Method
p_xgb_gradient_boosting_method, r__xgb_gradient_boosting_method, _ = metrics.precision_recall_curve(y_test, sk_yhat_xgb_gradient_boosting_method)
weighted_ps_xgb_gradient_boosting_method = metrics.roc_auc_score(y_test, sk_yhat_xgb_gradient_boosting_method)
plt.plot(p_xgb_gradient_boosting_method, r__xgb_gradient_boosting_method, label="WPR= " + str(weighted_ps_xgb_gradient_boosting_method))
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend(loc="best")
plt.show()
Figure 6-6. Precision-recall curve for the XGBoost gradient boosting method
PySpark in Action
This section executes and assesses the gradient boosting method with the PySpark
framework. Listing 6-13 prepares the PySpark framework using the findspark framework.
Listing 6-14 stipulates the PySpark app using the SparkConf() method.
Listing 6-15 prepares the PySpark session using the SparkSession() method.
Listing 6-16 changes the pandas dataframe created earlier in this chapter to a
PySpark dataframe using the createDataFrame() method.
pyspark_initial_data = pyspark_session.createDataFrame(initial_data)
Listing 6-17 creates a list for independent features and a string for the dependent
feature. It then converts the data using the VectorAssembler() method to model with
the PySpark framework.
Listing 6-20 computes the gradient boosting method's predictions using
the PySpark framework.
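A minimal sketch of fitting the gradient boosting classifier and computing predictions, assuming the assembled vector column is named "features" and the dependent column is named "y":

from pyspark.ml.classification import GBTClassifier

# Fit the gradient-boosted trees classifier on the assembled data
pyspark_gradient_boosting_method = GBTClassifier(featuresCol="features", labelCol="y")
pyspark_gradient_boosting_method_fitted = pyspark_gradient_boosting_method.fit(pyspark_data)

# Compute and display the predictions
pyspark_gradient_boosting_method_fitted.transform(pyspark_data).select("y", "prediction").show(5)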
H2O in Action
This section executes and assesses the gradient boosting method with the H2O
framework. Listing 6-21 prepares the H2O framework.
h2o_data = initialize_h2o.H2OFrame(initial_data)
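A minimal sketch of training the method, assuming "y" is the dependent column name, x and y are the H2O column lists created as in the earlier chapters, and an 80/10/10 split:

from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Treat the dependent feature as categorical and split the frame
h2o_data["y"] = h2o_data["y"].asfactor()
h2o_training_data, h2o_validation_data, h2o_test_data = h2o_data.split_frame(ratios=[0.8, 0.1])

# Train the gradient boosting method
h2o_gradient_boosting_method = H2OGradientBoostingEstimator()
h2o_gradient_boosting_method.train(x=x, y=y, training_frame=h2o_training_data, validation_frame=h2o_validation_data)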
Listing 6-26 assesses the H2O gradient boosting method (see Table 6-3).
h2o_gradient_boosting_method_history = h2o_gradient_boosting_method.scoring_history()
print(h2o_gradient_boosting_method_history.head(5))
Conclusion
This chapter executed four key machine learning frameworks (Scikit-Learn, XGBoost,
PySpark, and H2O) to model data and spawn a categorical output feature with two
classes using the decision tree and gradient boosting methods.
CHAPTER 7
Neural Networks with Scikit-Learn, Keras, and H2O
Figure 7-1 shows the multilayer perceptron neural network that this chapter uses.
[Figure 7-1 depicts the input features (age, job, marital, education, default, housing, loan, contact, campaign, pdays, previous, poutcome, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, and nr_employed) feeding the hidden layers, with "deposit" and "no deposit" as the output classes.]
Preprocessing Features
This chapter manipulates the data attained in Chapter 6, so it does not describe the
preprocessing tasks in detail. Listing 7-1 executes all the preprocessing tasks.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
df = pd.read_csv(r"C:\Users\i5 lenov\Downloads\banking.csv")
drop_column_names = df.columns[[8, 9, 10]]
initial_data = df.drop(drop_column_names, axis="columns")
initial_data.iloc[::, 1] = pd.get_dummies(initial_data.iloc[::, 1])
initial_data.iloc[::, 2] = pd.get_dummies(initial_data.iloc[::, 2])
initial_data.iloc[::, 3] = pd.get_dummies(initial_data.iloc[::, 3])
initial_data.iloc[::, 4] = pd.get_dummies(initial_data.iloc[::, 4])
initial_data.iloc[::, 5] = pd.get_dummies(initial_data.iloc[::, 5])
initial_data.iloc[::, 6] = pd.get_dummies(initial_data.iloc[::, 6])
initial_data.iloc[::, 7] = pd.get_dummies(initial_data.iloc[::, 7])
initial_data.iloc[::, 11] = pd.get_dummies(initial_data.iloc[::, 11])
initial_data = initial_data.dropna()
x = np.array(initial_data.iloc[::,0:17])
y = np.array(initial_data.iloc[::,-1])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
sk_standard_scaler = StandardScaler()
sk_standard_scaled_x_train = sk_standard_scaler.fit_transform(x_train)
sk_standard_scaled_x_test = sk_standard_scaler.transform(x_test)
Scikit-Learn in Action
This section executes and assesses a multilayer perceptron method using the Scikit-Learn
framework. Listing 7-2 executes the Scikit-Learn multilayer perceptron neural network.
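A minimal sketch of that step, with the hidden-layer sizes and iteration budget as assumptions:

from sklearn.neural_network import MLPClassifier

# Two hidden layers of 17 neurons each (sizes assumed to mirror the 17 input features)
sk_multilayer_perceptron_net = MLPClassifier(hidden_layer_sizes=(17, 17), max_iter=500, random_state=0)
sk_multilayer_perceptron_net.fit(sk_standard_scaled_x_train, y_train)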
Listing 7-4 arranges the multilayer perceptron neural network's receiver operating
characteristics curve to summarize the relationship between the true positive rate and the
false positive rate (see Figure 7-2).
fpr_sk_multilayer_perceptron_net, tprr_sk_multilayer_perceptron_net, _ = metrics.roc_curve(y_test, yhat_proba_sk_multilayer_perceptron_net)
area_under_curve_sk_multilayer_perceptron_net = metrics.roc_auc_score(y_test, yhat_proba_sk_multilayer_perceptron_net)
plt.plot(fpr_sk_multilayer_perceptron_net, tprr_sk_multilayer_perceptron_net, label="AUC= " + str(area_under_curve_sk_multilayer_perceptron_net))
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.legend(loc="best")
plt.show()
Figure 7-2. Receiver operating characteristics curve for the multilayer perceptron
network executed with the Scikit-Learn framework
Listing 7-6 arranges the Scikit-Learn multilayer perceptron neural network’s learning
curve (see Figure 7-4).
Listing 7-6. Arrange a Learning Curve for the Multilayer Perceptron Network
Executed with the Scikit-Learn Framework
from sklearn.model_selection import learning_curve
train_port_sk_multilayer_perceptron_net, trainscore_sk_multilayer_perceptron_net, testscore_sk_multilayer_perceptron_net = learning_curve(sk_multilayer_perceptron_net, x, y, cv=3, n_jobs=-5, train_sizes=np.linspace(0.1, 1.0, 50))
trainscoresk_multilayer_perceptron_net_mean = np.mean(trainscore_sk_multilayer_perceptron_net, axis=1)
testscoresk_multilayer_perceptron_net_mean = np.mean(testscore_sk_multilayer_perceptron_net, axis=1)
plt.plot(train_port_sk_multilayer_perceptron_net, trainscoresk_multilayer_perceptron_net_mean, label="Weighted training accuracy")
plt.plot(train_port_sk_multilayer_perceptron_net, testscoresk_multilayer_perceptron_net_mean, label="Weighted cv accuracy score")
plt.xlabel("Training values")
plt.ylabel("Weighted accuracy score")
plt.legend(loc="best")
plt.show()
Figure 7-4. Learning curve for the multilayer perceptron network executed with
the Scikit-Learn framework
Keras in Action
This section executes and assesses a deep belief neural network using the Keras
framework. Listing 7-7 preprocesses the features.
Listing 7-8 employs the TensorFlow framework as a backend and imports the Keras
framework.
Listing 7-8. Employ the TensorFlow Framework as Backend and Import the Keras
Framework
import tensorflow as tf
from tensorflow.keras import Sequential, regularizers
from tensorflow.keras.layers import Dense
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
Listing 7-9 structures a multilayer perceptron neural network with two hidden layers
of 17 neurons each. The input layer applies an l1 regularizer with a 0.001 rate and the
sigmoid activation function, the hidden layers use the relu activation function, and the
model is compiled with the binary_crossentropy loss and an accuracy metric.
def keras_multilayer_perceptron_net(optimizer="adam"):
    keras_multilayer_perceptron_net_model = Sequential()
    keras_multilayer_perceptron_net_model.add(Dense(17, input_dim=17, activation="sigmoid", kernel_regularizer=regularizers.l1(0.001), bias_regularizer=regularizers.l1(0.01)))
    keras_multilayer_perceptron_net_model.add(Dense(17, activation="relu"))
    keras_multilayer_perceptron_net_model.add(Dense(17, activation="relu"))
    # A sigmoid output keeps predictions in [0, 1], as the binary_crossentropy loss expects
    keras_multilayer_perceptron_net_model.add(Dense(1, activation="sigmoid"))
    keras_multilayer_perceptron_net_model.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])
    return keras_multilayer_perceptron_net_model
keras_multilayer_perceptron_net_model = KerasClassifier(build_fn=keras_multilayer_perceptron_net)
Listing 7-11 executes the Keras multilayer perceptron neural network with 56 epochs
and a batch_size of 14.
Listing 7-11. Execute the Multilayer Perceptron Neural Network with the Keras
Framework
keras_multilayer_perceptron_net_model_history = keras_multilayer_perceptron_net_model.fit(sk_standard_scaled_x_train, y_train, validation_data=(x_val, y_val), batch_size=14, epochs=56)
print(keras_multilayer_perceptron_net_model_history)
Listing 7-12 arranges a classification report for the multilayer perceptron network
executed with the Keras framework (see Table 7-2).
Table 7-2. Classification Report for the Multilayer Perceptron Network Executed
with the Keras Framework
Precision Recall F1-score Support
Listing 7-13 arranges the training and CV loss for a multilayer perceptron network
executed with the Keras framework (see Figure 7-5).
Listing 7-13. Training and CV Loss for Multilayer Perceptron Network Executed
by the Keras Framework
plt.plot(keras_multilayer_perceptron_net_model_history.history["loss"], label="Training loss")
plt.plot(keras_multilayer_perceptron_net_model_history.history["val_loss"], label="CV loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend(loc="best")
plt.show()
Figure 7-5. Training and CV loss for multilayer perceptron network executed with
the Keras framework
Listing 7-14 arranges the training and CV accuracy for the multilayer perceptron
network executed with the Keras framework (see Figure 7-6).
Listing 7-14. Arrange the Training and CV Accuracy for Multilayer Perceptron
Network Executed with the Keras Framework
plt.plot(keras_multilayer_perceptron_net_model_history.history["accuracy"], label="Training accuracy score")
plt.plot(keras_multilayer_perceptron_net_model_history.history["val_accuracy"], label="CV accuracy score")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend(loc="best")
plt.show()
Figure 7-6. Training and CV accuracy for the multilayer perceptron network
executed with the Keras framework
H2O in Action
This section executes a deep belief neural network using the H2O framework.
Listing 7-15 prepares the H2O framework.
h2o_data = initialize_h2o.H2OFrame(initial_data)
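A minimal sketch of the H2O deep learning estimator, with the hidden-layer sizes, epochs, split ratio, and "y" as the dependent column name all as assumptions:

from h2o.estimators.deeplearning import H2ODeepLearningEstimator

# Treat the dependent feature as categorical and split the frame
h2o_data["y"] = h2o_data["y"].asfactor()
h2o_training_data, h2o_test_data = h2o_data.split_frame(ratios=[0.8])

# Train a deep feed-forward network (two hidden layers of 17 neurons, mirroring the Keras model)
h2o_deep_learning_method = H2ODeepLearningEstimator(hidden=[17, 17], epochs=56)
h2o_deep_learning_method.train(x=x, y=y, training_frame=h2o_training_data)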
Conclusion
This chapter executed three key machine learning frameworks (Scikit-Learn, Keras, and
H2O) to model the data. It produced a binary outcome using the multilayer perceptron
and deep belief network. It also applied the binary_crossentropy loss function (to
assess the neural networks) and the adam optimizer (to enhance the neural network’s
performance).
CHAPTER 8
Cluster Analysis with Scikit-Learn, PySpark, and H2O
This chapter explains the k-means cluster method by implementing a diverse set of
Python frameworks (i.e., Scikit-Learn, PySpark, and H2O). To begin, it clarifies how the
method apportions values to clusters.
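As a rough illustration of the assignment step (not the book's code), each observation joins the cluster whose centroid lies nearest in Euclidean distance:

# Hedged illustration of the k-means assignment step.
import numpy as np
def assign_to_clusters(values, centroids):
    # Distance from every observation to every centroid; pick the nearest one.
    distances = np.linalg.norm(values[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)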
import pandas as pd
df = pd.read_csv(r"filepath\Mall_Customers.csv")
Listing 8-4 executes the principal component method using the Scikit-Learn
framework.
Listing 8-4. Execute the Principal Component Method with the Scikit-Learn
Framework
from sklearn.decomposition import PCA
sk_principal_component_method = PCA(n_components=3)
sk_principal_component_method.fit(sk_standard_scaled_data)
sk_principal_component_method_transformed_data = sk_principal_component_method.transform(sk_standard_scaled_data)
Listing 8-5 discloses the number of k using an elbow curve (see Figure 8-1).
plt.xlabel("Number of k")
plt.ylabel("Eigenvalues")
plt.show()
Figure 8-1 shows that the curve bends sharply at three, which indicates that the k-means
method should contain three clusters.
Scikit-Learn in Action
This section executes the k-means method using the Scikit-Learn framework;
see Listing 8-6.
Listing 8-6. Execute the K-Means Method Using the Scikit-Learn Framework
from sklearn.cluster import KMeans
sk_kmeans_method = KMeans(n_clusters=3)
sk_kmeans_method_fitted = sk_kmeans_method.fit(sk_standard_scaled_data)
sk_kmeans_method_labels = pd.DataFrame(sk_kmeans_method_fitted.labels_,
columns = ["Labels"])
Listing 8-8 computes the cluster centers and determines their mean and standard
deviation (see Table 8-1).
Listing 8-8. Compute Cluster Centers and Their Mean and Standard Deviation
sk_kmeans_method = KMeans(n_clusters=3)
sk_kmeans_method_centers = sk_kmeans_method_fitted.cluster_centers_
sk_kmeans_method_centers = pd.DataFrame(sk_kmeans_method_centers,
    columns=["1st center", "2nd center", "3rd center"])
sk_kmeans_method_centers_mean = pd.DataFrame(sk_kmeans_method_centers.mean())
sk_kmeans_method_centers_std = pd.DataFrame(sk_kmeans_method_centers.std())
sk_kmeans_method_centers_mean_std = pd.concat([sk_kmeans_method_centers_mean,
    sk_kmeans_method_centers_std], axis=1)
sk_kmeans_method_centers_mean_std.columns = ["Center mean", "Center standard deviation"]
print(sk_kmeans_method_centers_mean_std)
Listing 8-9 plots the reduced data and the labels that the Scikit-Learn k-means
method computed (see Figure 8-2).
plt.scatter(sk_principal_component_method_transformed_data[:,0],
sk_principal_component_method_transformed_data[:, 1],
c=sk_kmeans_method_fitted.labels_,cmap="coolwarm", s=120)
plt.xlabel("y")
plt.show()
PySpark in Action
This section executes and assesses the k-means method using the PySpark framework.
Listing 8-10 prepares the PySpark framework using the findspark framework.
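A minimal sketch of that preparation follows, assuming the findspark package is installed and can locate the Spark installation; it is an illustration rather than the book's exact listing.

# Hedged sketch: locate Spark and make PySpark importable.
import findspark
findspark.init()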
Listing 8-11 stipulates the PySpark app using the SparkConf() method.
Listing 8-12 prepares the PySpark session using the SparkSession() method.
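A minimal sketch covering both steps follows; the app name and local master setting are assumptions rather than the book's exact configuration.

# Hedged sketch: configure and start a local Spark session.
from pyspark import SparkConf
from pyspark.sql import SparkSession
pyspark_configuration = SparkConf().setAppName("kmeans_app").setMaster("local[*]")
pyspark_session = SparkSession.builder.config(conf=pyspark_configuration).getOrCreate()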
Listing 8-13 changes the pandas dataframe created earlier in this chapter to a
PySpark dataframe using the createDataFrame() method.
pyspark_initial_data = pyspark_session.createDataFrame(initial_data)
Listing 8-14 creates a list for the independent features and a string for the dependent
feature. It then converts the data using the VectorAssembler() method for modeling
with the PySpark framework.
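A minimal sketch of the assembly step follows, assuming x_list holds the independent feature names and that the assembled column is named "features"; both names are assumptions rather than the book's exact code.

# Hedged sketch: assemble the independent features into one vector column.
from pyspark.ml.feature import VectorAssembler
pyspark_assembler = VectorAssembler(inputCols=x_list, outputCol="features")
pyspark_data = pyspark_assembler.transform(pyspark_initial_data)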
Listing 8-15 executes the k-means method using the PySpark framework.
Listing 8-15. Execute the K-Means Method Using the PySpark Framework
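A minimal sketch of what this listing might contain follows, assuming the assembled "features" column and three clusters; it is an illustration rather than the book's exact code.

# Hedged sketch: fit a three-cluster PySpark k-means model.
from pyspark.ml.clustering import KMeans
pyspark_kmeans_method = KMeans(featuresCol="features", k=3, seed=1)
pyspark_kmeans_method_fitted = pyspark_kmeans_method.fit(pyspark_data)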
Listing 8-16 computes the k-means method’s labels and passes them to a pandas
dataframe.
pyspark_yhat = pyspark_kmeans_method_fitted.transform(pyspark_data)
pyspark_yhat_pandas_df = pyspark_yhat.toPandas()
Listing 8-17 plots the reduced data and the labels that the PySpark k-means method
computed (see Figure 8-3).
pyspark_kmeans_method_centers = pyspark_kmeans_method_fitted.clusterCenters()
print("K-Means method cluster centers")
for pyspark_cluster_center in pyspark_kmeans_method_centers:
    print(pyspark_cluster_center)
K-Means method cluster centers
[26.5952381 33.14285714 65.66666667]
[45.30769231 60.52991453 34.47008547]
[32.97560976 88.73170732 79.24390244]
Listing 8-19 assesses the PySpark k-means method using the Silhouette method.
Listing 8-19. Assess the PySpark K-Means Method Using the Silhouette Method
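A minimal sketch of the assessment follows, assuming PySpark's ClusteringEvaluator and the pyspark_yhat predictions from Listing 8-16; it is an illustration rather than the book's exact code.

# Hedged sketch: silhouette score for the PySpark k-means predictions.
from pyspark.ml.evaluation import ClusteringEvaluator
pyspark_evaluator = ClusteringEvaluator(featuresCol="features", metricName="silhouette")
print("Silhouette score:", pyspark_evaluator.evaluate(pyspark_yhat))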
H2O in Action
This section executes and assesses the k-means method using the H2O framework.
Listing 8-20 prepares the H2O framework.
h2o_data = initialize_h2o.H2OFrame(initial_data)
y = y_list
x = h2o_data.col_names
x.remove(y_list)
Listing 8-23 reserves data for training and validating the H2O k-means method.
Listing 8-24. Execute the K-Means Method Using the H2O Framework
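A minimal sketch of the split and the model fit follows, assuming H2O's H2OKMeansEstimator, an 80/20 split, and three clusters; it is an illustration rather than the book's exact code.

# Hedged sketch: split the H2OFrame and fit a three-cluster k-means model.
from h2o.estimators import H2OKMeansEstimator
h2o_training_data, h2o_validation_data = h2o_data.split_frame(ratios=[0.8])
h2o_kmeans_method = H2OKMeansEstimator(k=3, estimate_k=False, seed=1)
h2o_kmeans_method.train(x=x, training_frame=h2o_training_data,
                        validation_frame=h2o_validation_data)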
h2o_yhat = h2o_kmeans_method.predict(h2o_validation_data)
Listing 8-26 assesses the H2O k-means method (see Table 8-2).
Listing 8-26. Assess the K-Means Method Executed with the H2O Framework
ModelMetricsClustering: kmeans
** Reported on train data. **
MSE: NaN
RMSE: NaN
Total Within Cluster Sum of Square Error: 238.6306804611187
Total Sum of Square Error to Grand Mean: 479.99999503427534
Between Cluster Sum of Square Error: 241.36931457315666
Conclusion
This chapter executed three key machine learning frameworks (Scikit-Learn, PySpark,
and H2O) in order to group data into three clusters. You saw how to identify the number
of k using the elbow curve. To assess the method, you employed the Silhouette method
and looked at the sum of squared errors.
CHAPTER 9
Principal Component Analysis with Scikit-Learn, PySpark, and H2O
This chapter executes a simple dimension reducer (a principal component method) by
implementing a diverse set of Python frameworks (Scikit-Learn, PySpark, and H2O). To
begin, it clarifies how the method computes components.
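As a rough illustration of that computation (not the book's code), the components can be derived from the eigendecomposition of the covariance matrix of the standardized data:

# Hedged illustration: principal components via eigendecomposition.
import numpy as np
def principal_components(standardized_data, n_components=3):
    covariance_matrix = np.cov(standardized_data, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)
    order = np.argsort(eigenvalues)[::-1]            # largest variance first
    top_vectors = eigenvectors[:, order[:n_components]]
    return standardized_data @ top_vectors           # projected (reduced) data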
import pandas as pd
initial_data = df.drop(drop_column_names, axis="columns")
Scikit-Learn in Action
Listing 9-3 scales the whole data set with a standard scaler.
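A minimal sketch of that listing follows, assuming Scikit-Learn's StandardScaler; it is an illustration rather than the book's exact code.

# Hedged sketch: standardize the full data set for the principal component method.
from sklearn.preprocessing import StandardScaler
sk_standard_scaler = StandardScaler()
sk_standard_scaled_data = sk_standard_scaler.fit_transform(initial_data)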
Listing 9-4 executes the principal component method with the Scikit-Learn
framework. It then discloses the variance that each component clarifies (see Figure 9-1).
Listing 9-4. Arrange the Explained Variance from the Scikit-Learn Principal
Components Method
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.decomposition import PCA
sk_principal_component_method = PCA(n_components=3)
sk_principal_component_method.fit_transform(sk_standard_scaled_data)
sk_principal_component_method_variance = sk_principal_component_method.explained_variance_
plt.bar(range(3), sk_principal_component_method_variance, label="Variance")
plt.legend()
plt.ylabel("Variance ratio")
plt.xlabel("Components")
plt.show()
Figure 9-1 shows that all the components disclose over 0.5 of the variance. Therefore,
all the principal component methods in this chapter use three components.
Listing 9-5 displays the variance from the Scikit-Learn principal component method.
Listing 9-5. Print the Variance that the Scikit-Learn Principal Component
Method Discloses
print("SciKit-Learn PCA explained variance ratio", sk_principal_component_
method_variance)
Listing 9-6 executes the principal component method with the Scikit-Learn
framework and shows fewer dimensions (see Figure 9-2).
Listing 9-6. Execute the Principal Component Method with the Scikit-Learn
Framework
sk_principal_component_method_ = PCA(n_components=3)
sk_principal_component_method_.fit(sk_standard_scaled_data)
sk_principal_component_method_reduced_data = sk_principal_component_method_.transform(sk_standard_scaled_data)
plt.scatter(sk_principal_component_method_reduced_data[:, 0],
    sk_principal_component_method_reduced_data[:, 2],
    c=initial_data.iloc[::, -1], cmap="coolwarm")
plt.xlabel("y")
plt.show()
Figure 9-2 shows the dimensions found by the Scikit-Learn principal component
method.
PySpark in Action
This section executes the principal component method using the PySpark framework.
Listing 9-7 prepares the PySpark framework using the findspark framework.
Listing 9-8 stipulates the PySpark app using the SparkConf() method.
Listing 9-9 prepares the PySpark session using the SparkSession() method.
Listing 9-10 changes the pandas dataframe created earlier in this chapter to a
PySpark dataframe using the createDataFrame() method.
pyspark_initial_data = pyspark_session.createDataFrame(initial_data)
Listing 9-11 creates a list for the independent features and a string for the dependent
feature. It then converts the data using the VectorAssembler() method for modeling
with the PySpark framework.
Listing 9-12 scales the whole data set with the PySpark framework.
Listing 9-12. Scale the Whole Data Set with the PySpark Framework
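A minimal sketch of the scaling and the subsequent principal component fit follows, assuming PySpark's StandardScaler and PCA transformers and the assembled "features" column; the column names are assumptions, apart from pyspark_scaled_features, which Listing 9-16 references.

# Hedged sketch: scale the assembled features, then fit a three-component PCA.
from pyspark.ml.feature import StandardScaler, PCA
pyspark_scaler = StandardScaler(inputCol="features", outputCol="pyspark_scaled_features")
pyspark_scaled_data = pyspark_scaler.fit(pyspark_data).transform(pyspark_data)
pyspark_principal_components_method = PCA(k=3, inputCol="pyspark_scaled_features",
    outputCol="pyspark_components").fit(pyspark_scaled_data)
pyspark_principal_components_method_data = pyspark_principal_components_method.transform(pyspark_scaled_data)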
Listing 9-14 discloses the variance that each component clarifies (see Figure 9-3).
# The first line of this listing is reconstructed (assumed): bar chart of the
# explained variance per component.
plt.bar(range(3), pyspark_principal_components_method.explainedVariance.toArray(), label="Variance")
plt.legend()
plt.ylabel("Variance ratio")
plt.xlabel("Components")
plt.show()
Figure 9-3. Explained variance from the PySpark principal components method
Listing 9-15 prints the variance that the PySpark principal component method
reveals.
Listing 9-15. Print the Variance that the PySpark Principal Component Method
Reveals
print('Explained variance ratio', pyspark_principal_components_method.explainedVariance.toArray())
PySpark PCA variance ratio [0.44266167 0.33308378 0.22425454]
The results show that the first component discloses 0.44 of the related changes in
features in the data, the second component discloses 0.33, and the third component
discloses 0.22.
Listing 9-16 reduces the data using the PySpark principal component method.
Listing 9-16. Reduce the Data with the PySpark Principal Component Method
import numpy as np
pyspark_principal_components_method_reduced_data = pyspark_principal_components_method_data.rdd.map(
    lambda row: row.pyspark_scaled_features).collect()
pyspark_principal_components_method_reduced_data = np.array(pyspark_principal_components_method_reduced_data)
Listing 9-17 discloses the data that the PySpark principal component method
reduced (see Figure 9-4).
Listing 9-17. Disclose the Data that the PySpark Principal Component Method
Reduced
plt.scatter(pyspark_principal_components_method_reduced_data[:, 0],
    pyspark_principal_components_method_reduced_data[:, 1],
    c=initial_data.iloc[::, -1], cmap="coolwarm")
plt.xlabel("y")
plt.show()
Figure 9-4 shows the dimensions found by the PySpark principal component
method.
H2O in Action
This section executes and assesses the principal component method using the H2O
framework.
Listing 9-18 prepares the H2O framework.
h2o_data = initialize_h2o.H2OFrame(initial_data)
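The listings that split the data and fit the method are sketched below, assuming H2O's H2OPrincipalComponentAnalysisEstimator and an 80/20 split; treat it as an illustration rather than the book's exact code.

# Hedged sketch: split the H2OFrame and fit a three-component PCA with H2O.
from h2o.estimators import H2OPrincipalComponentAnalysisEstimator
h2o_training_data, h2o_validation_data = h2o_data.split_frame(ratios=[0.8])
h2o_principal_components_method = H2OPrincipalComponentAnalysisEstimator(k=3, transform="STANDARDIZE")
h2o_principal_components_method.train(training_frame=h2o_training_data)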
Listing 9-22 reduces the data with the H2O principal component method.
Listing 9-22. Reduce the Data with the H2O Principal Component Method
h2o_yhat = h2o_principal_components_method.predict(h2o_validation_data)
Conclusion
This chapter executed three key machine learning frameworks (Scikit-Learn, PySpark,
and H2O) in order to condense data into a few dimensions by employing the principal
component method. As mentioned at the beginning of this chapter, the chapter did
not execute the method to test plausible hypotheses, thus there are no key metrics for
objectively assessing the method.
CHAPTER 10
Automating the Machine Learning Process with H2O
Preprocessing Features
This chapter manipulates the data attained in Chapter 3, so it does not describe the
preprocessing tasks in detail. Listing 10-1 executes all the preprocessing tasks.
import pandas as pd
df = pd.read_csv(r"filepath\WA_Fn-UseC_-Marketing_Customer_Value_Analysis.csv")
drop_column_names = df.columns[[0, 6]]
initial_data = df.drop(drop_column_names, axis="columns")
initial_data.iloc[::, 0] = pd.get_dummies(initial_data.iloc[::, 0])
initial_data.iloc[::, 2] = pd.get_dummies(initial_data.iloc[::, 2])
initial_data.iloc[::, 3] = pd.get_dummies(initial_data.iloc[::, 3])
initial_data.iloc[::, 4] = pd.get_dummies(initial_data.iloc[::, 4])
initial_data.iloc[::, 5] = pd.get_dummies(initial_data.iloc[::, 5])
initial_data.iloc[::, 6] = pd.get_dummies(initial_data.iloc[::, 6])
initial_data.iloc[::, 7] = pd.get_dummies(initial_data.iloc[::, 7])
initial_data.iloc[::, 8] = pd.get_dummies(initial_data.iloc[::, 8])
initial_data.iloc[::, 9] = pd.get_dummies(initial_data.iloc[::, 9])
initial_data.iloc[::, 15] = pd.get_dummies(initial_data.iloc[::, 15])
initial_data.iloc[::, 16] = pd.get_dummies(initial_data.iloc[::, 16])
initial_data.iloc[::, 17] = pd.get_dummies(initial_data.iloc[::, 17])
initial_data.iloc[::, 18] = pd.get_dummies(initial_data.iloc[::, 18])
initial_data.iloc[::, 20] = pd.get_dummies(initial_data.iloc[::, 20])
initial_data.iloc[::, 21] = pd.get_dummies(initial_data.iloc[::, 21])
h2o_data = initialize_h2o.H2OFrame(initial_data)
int_x = initial_data.iloc[::,0:19]
fin_x = initial_data.iloc[::,19:21]
x_combined = pd.concat([int_x, fin_x], axis=1)
x_list = list(x_combined.columns)
y_list = initial_data.columns[19]
y = y_list
x = h2o_data.col_names
x.remove(y_list)
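The listings that split the data and run H2O AutoML are sketched below, assuming an 80/20 split and a cap of 20 models; both values are assumptions rather than the book's exact settings.

# Hedged sketch: split the H2OFrame and run H2O AutoML on the regression target.
from h2o.automl import H2OAutoML
h2o_training_data, h2o_validation_data = h2o_data.split_frame(ratios=[0.8])
h2o_automatic_ml = H2OAutoML(max_models=20, seed=1)
h2o_automatic_ml.train(x=x, y=y, training_frame=h2o_training_data,
                       validation_frame=h2o_validation_data)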
Listing 10-7 ranks H2O’s methods in ascending order (see Table 10-1).
h2o_method_ranking = h2o_automatic_ml.leaderboard
print(h2o_method_ranking)
highest_ranking_method = h2o_automatic_ml.leader
print(highest_ranking_method)
ModelMetricsRegressionGLM: stackedensemble
** Reported on train data. **
MSE: 4900.131929292816
RMSE: 70.00094234574857
MAE: 45.791336342731775
RMSLE: 0.3107457400431741
R^2: 0.9421217261180531
Mean Residual Deviance: 4900.131929292816
Null degrees of freedom: 7288
Residual degrees of freedom: 7276
Null deviance: 617106545.1167994
Residual deviance: 35717061.632615335
AIC: 82648.0458245655
ModelMetricsRegressionGLM: stackedensemble
** Reported on validation data. **
MSE: 17967.946605287365
RMSE: 134.04456947331872
MAE: 83.66742154264892
RMSLE: 0.43883929226519075
R^2: 0.7718029372846945
Mean Residual Deviance: 17967.946605287365
Null degrees of freedom: 930
ModelMetricsRegressionGLM: stackedensemble
** Reported on cross-validation data. **
MSE: 18371.339394332357
RMSE: 135.5409140973026
MAE: 86.62574587341507
RMSLE: 0.4820432203973515
R^2: 0.7830055540572105
Mean Residual Deviance: 18371.339394332357
Null degrees of freedom: 7288
Residual degrees of freedom: 7275
Null deviance: 617269595.4214188
Residual deviance: 133908692.84528854
AIC: 92282.67565857261
Conclusion
The highest-ranking H2O AutoML method (StackedEnsemble_AllModels_
AutoML_20210824_115721) pales in comparison to the ordinary least-squares methods
executed in Chapter 3. The H2O method disclosed 77% of the related changes in the test
data, whereas the methods executed in Chapter 3 (with the Scikit-Learn and PySpark
frameworks) explained 100% of the related changes in the test data. Unfortunately,
the StackedEnsemble_AllModels_AutoML_20210824_115721 method does not have a
summary function. This concludes the book.
Index

A
Accelerated failure time method, 34
Apache Spark, 7, 10
AutoML method, 111
    H2O, 112, 113, 115, 116
    preprocessing tasks, 112

B
Big data, 7
    business, 8
    customer relationships, 8
    decision making, 9
    ETL, 10
    features, 8
    product development, 9
    warehousing, 9

C
Categories, 2
Cluster centers, 92
Cluster methods, 3, 4
Cox Proportional Hazards method, 29, 30

D
Data-driven organizations, 9
Decision tree, 59, 60, 62
Deep belief networks, 87
Deep learning (DL), 4, 7, 75
Dimension reducers, 4
DL frameworks, Keras, 14

E
Elbow curve, 89, 91
Ensemble methods, 3

F
findspark framework, 69

G
Gaussian distribution, 1
Gradient boosting methods, 66
    H2O framework, 71, 72
    PySpark, 69, 71
    XGBoost, 66–68
GraphX, 12

H, I, J
H2OAutoML, 112
H2O framework, 22, 23, 25–27, 52, 71
H2O logistic regression method, 54, 55, 57
Hadoop Distributed File System (HDF), 11

T
TensorFlow, 14

U
Unsupervised learning, 3

V, W
VectorAssembler() method, 35

X, Y, Z
XGBoost, 14