Regression: PySpark - SQL

This document summarizes the steps for performing linear regression on a dataset with PySpark. It loads data from a CSV file, prepares the data for modeling, splits it into training and test sets, fits a linear regression model inside a pipeline, evaluates the model on the test data, and reports performance metrics (RMSE and the R² score) along with the model coefficients and their statistics.

Uploaded by

Ali Abdi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
123 views5 pages

Regression: Pyspark - SQL

This document summarizes the steps taken to perform linear regression on a dataset using PySpark. It loads data from a CSV file, cleans and prepares the data for modeling, trains a linear regression model in a pipeline, evaluates the model on test data to calculate the RMSE, and computes the r2 score. Key steps include splitting the data into training and test sets, fitting a linear regression model in a pipeline, calculating performance metrics like RMSE and r2 score on test data, and extracting the model coefficients and statistics.

Uploaded by

Ali Abdi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 5

regression

December 27, 2021

[2]: from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Python Spark regression example") \
    .config("spark.some.config.option", "some-value").getOrCreate()

[3]: df = spark.read.format('csv').options(header='true', inferschema='true') \
    .load("data.csv")
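
Schema inference works here, but it costs an extra pass over the file. A minimal sketch of the same load with an explicit schema, taking the column names and types from the printSchema() output further down:

from pyspark.sql.types import StructType, StructField, DoubleType

# All four columns are doubles, per the schema printed below.
schema = StructType([StructField(c, DoubleType(), True)
                     for c in ["TV", "Radio", "Newspaper", "Sales"]])
df = spark.read.csv("data.csv", header=True, schema=schema)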

[4]: import pandas as pd


pd.DataFrame(df.take(3), columns=df.columns)

[4]:      TV  Radio  Newspaper  Sales
     0  230.1   37.8       69.2   22.1
     1   44.5   39.3       45.1   10.4
     2   17.2   45.9       69.3    9.3

[5]: df.describe().toPandas()

[5]:   summary                 TV               Radio           Newspaper               Sales
     0   count                200                 200                 200                 200
     1    mean           147.0425  23.264000000000024  30.553999999999995  14.022500000000003
     2  stddev  85.85423631490805  14.846809176168728   21.77862083852283   5.217456565710477
     3     min                0.7                 0.0                 0.3                 1.6
     4     max              296.4                49.6               114.0                27.0

[6]: df.printSchema()

root

|-- TV: double (nullable = true)
|-- Radio: double (nullable = true)
|-- Newspaper: double (nullable = true)
|-- Sales: double (nullable = true)

[7]: df.show(5)

+-----+-----+---------+-----+
| TV|Radio|Newspaper|Sales|
+-----+-----+---------+-----+
|230.1| 37.8| 69.2| 22.1|
| 44.5| 39.3| 45.1| 10.4|
| 17.2| 45.9| 69.3| 9.3|
|151.5| 41.3| 58.5| 18.5|
|180.8| 10.8| 58.4| 12.9|
+-----+-----+---------+-----+
only showing top 5 rows
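
The cell defining transData did not survive this export. A minimal sketch that reproduces the (features, label) frame shown in the next cell, assuming the last column of df (Sales) is the label and everything before it is a feature:

from pyspark.ml.linalg import Vectors

def transData(data):
    # Pack all columns except the last into a dense feature vector and
    # keep the last column (Sales) as the label.
    return data.rdd.map(lambda r: [Vectors.dense(r[:-1]), r[-1]]) \
                   .toDF(['features', 'label'])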

[32]: transformed = transData(df)

transformed.show(5)

+-----------------+-----+
| features|label|
+-----------------+-----+
|[230.1,37.8,69.2]| 22.1|
| [44.5,39.3,45.1]| 10.4|
| [17.2,45.9,69.3]| 9.3|
|[151.5,41.3,58.5]| 18.5|
|[180.8,10.8,58.4]| 12.9|
+-----------------+-----+
only showing top 5 rows

[34]: from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as
# continuous.
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures",
                               maxCategories=4).fit(transformed)

data = featureIndexer.transform(transformed)
data.show(5, True)

+-----------------+-----+-----------------+
| features|label| indexedFeatures|
+-----------------+-----+-----------------+
|[230.1,37.8,69.2]| 22.1|[230.1,37.8,69.2]|
| [44.5,39.3,45.1]| 10.4| [44.5,39.3,45.1]|
| [17.2,45.9,69.3]| 9.3| [17.2,45.9,69.3]|
|[151.5,41.3,58.5]| 18.5|[151.5,41.3,58.5]|
|[180.8,10.8,58.4]| 12.9|[180.8,10.8,58.4]|
+-----------------+-----+-----------------+
only showing top 5 rows
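
Nothing changes here because TV, Radio, and Newspaper each take far more than 4 distinct values, so VectorIndexer treats them all as continuous and passes them through untouched. One way to confirm this, using the fitted model's categoryMaps attribute:

# No feature had <= 4 distinct values, so none were treated as categorical:
print(featureIndexer.categoryMaps)  # expected: {}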

[35]: # Split the data into training and test sets (20% held out for testing)
(trainingData, testData) = transformed.randomSplit([0.8, 0.2])
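
randomSplit is non-deterministic across runs, so the metrics below will vary slightly from run to run. A sketch of a reproducible variant (the seed value is arbitrary):

# Reproducible variant of the split above:
(trainingData, testData) = transformed.randomSplit([0.8, 0.2], seed=42)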

[36]: # Import LinearRegression class
from pyspark.ml.regression import LinearRegression

# Define LinearRegression algorithm
lr = LinearRegression()
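
Note that LinearRegression() defaults to featuresCol="features" and labelCol="label", so in the pipeline below the model actually trains on the raw features column rather than on the indexer's indexedFeatures output. Here the two columns are identical, so the results are unaffected, but a sketch of how to genuinely consume the indexer's output:

# Sketch: train on the indexer's output column instead of the raw one
lr = LinearRegression(featuresCol="indexedFeatures", labelCol="label")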

[38]: import warnings

warnings.filterwarnings('ignore')
# Chain the indexer and the linear regression in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, lr])

model = pipeline.fit(trainingData)

21/12/14 14:37:13 WARN Instrumentation: [8d638690] regParam is zero, which might cause numerical instability and overfitting.

[39]: def modelsummary(model):
    import numpy as np
    print("Note: the last rows are the information for Intercept")
    print("##", "-------------------------------------------------")
    print("##", " Estimate | Std.Error | t Values | P-value")
    coef = np.append(list(model.coefficients), model.intercept)
    Summary = model.summary

    for i in range(len(Summary.pValues)):
        print("##", '{:10.6f}'.format(coef[i]),
              '{:10.6f}'.format(Summary.coefficientStandardErrors[i]),
              '{:8.3f}'.format(Summary.tValues[i]),
              '{:10.6f}'.format(Summary.pValues[i]))

    print("##", '---')
    print("##", "Mean squared error: % .6f" % Summary.meanSquaredError,
          ", RMSE: % .6f" % Summary.rootMeanSquaredError)
    print("##", "Multiple R-squared: %f" % Summary.r2,
          ", Total iterations: %i" % Summary.totalIterations)

[40]: modelsummary(model.stages[-1])

Note: the last rows are the information for Intercept


## -------------------------------------------------
## Estimate | Std.Error | t Values | P-value
## 0.044758 0.001555 28.783 0.000000
## 0.186763 0.009541 19.575 0.000000
## 0.006556 0.007003 0.936 0.350575
## 2.921133 0.343975 8.492 0.000000
## ---
## Mean squared error: 2.828389 , RMSE: 1.681782
## Multiple R-squared: 0.897012 , Total iterations: 0
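
The "Total iterations: 0" line is expected: with the default solver and no L1 penalty, Spark solves this small least-squares problem in closed form via the normal equations rather than iteratively, so no optimizer iterations are recorded.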

[41]: # Make predictions.
predictions = model.transform(testData)

[42]: from pyspark.ml.evaluation import RegressionEvaluator

# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction",
                                metricName="rmse")

rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

Root Mean Squared Error (RMSE) on test data = 1.66064

[43]: y_true = predictions.select("label").toPandas()
y_pred = predictions.select("prediction").toPandas()

import sklearn.metrics
r2_score = sklearn.metrics.r2_score(y_true, y_pred)

print('r2_score: {0}'.format(r2_score))

r2_score: 0.8900904334948799
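
The round-trip through pandas and scikit-learn is not strictly necessary: the same RegressionEvaluator class used above can compute R² directly on the predictions DataFrame. A sketch:

r2_evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction",
                                   metricName="r2")
print("r2 on test data = %g" % r2_evaluator.evaluate(predictions))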
