Building Data Report

The report explores energy usage data for over 11,000 buildings to analyze factors that predict a building's Energy Star Score. It finds issues with how the Energy Star Score is distributed and suggests the Energy Use Intensity metric may be more objective. Different building types are also found to have varying distributions of Energy Star Scores, with office buildings tending to score higher on average.

Uploaded by

uri20

Building Energy Star Score Prediction Report

Will Koehrsen
March 1, 2018

Contents
1 Introduction
1.1 Objectives

2 Data Exploration
2.1 Distribution of Energy Star Scores
2.1.1 Potential Issue with Energy Star Scores
2.2 Energy Star Score by Building Type
2.3 Site EUI by Building Type
2.4 Energy Star Score versus Site EUI
2.5 Linear Correlations with Energy Star Score
2.6 Remove Collinear Features

3 Machine Learning Approach
3.1 Feature Preprocessing

4 Regression of Energy Star Scores
4.1 Establish A Regression Baseline
4.2 Linear Regression
4.3 Random Forest Regression
4.4 Interpret Regression Results
4.4.1 Predicted Scores of New Buildings
4.4.2 Interpret Feature Importances

5 Classification of Energy Star Scores
5.1 Random Forest Classifier
5.2 Interpret Classification Results
5.2.1 Predicted Classifications of New Data

6 Conclusions

1 Introduction
In this report I present findings from an exploration of the NYC benchmarking dataset which
measures 60 variables related to energy use for over 11,000 buildings. Of primary interest is
the Energy Star Score, which is often used as an aggregate measure of the overall efficiency of
a building. The Energy Star Score is a percentile measure of a building’s energy performance
calculated from self-reported energy usage.

1.1 Objectives
• Identify predictors within the dataset for the Energy Star Score
• Build Regression/Classification models that can predict the Energy Star Score of a building
given the building’s energy data
• Interpret the results of the model and use the trained model to infer Energy Star Scores of
new buildings
Documentation of the column names can be found in this document. Preliminary data cleaning
and exploration was completed in another Jupyter Notebook.
In [1]: # Pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# Matplotlib and seaborn for visualization

import matplotlib.pyplot as plt
%matplotlib inline

from IPython.core.pylabtools import figsize

import seaborn as sns

sns.set_context(font_scale=2)

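# Imputing missing values
# (sklearn.preprocessing.Imputer was removed in scikit-learn 0.22;
# SimpleImputer accepts the same `strategy` argument used below)

from sklearn.impute import SimpleImputer as Imputer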

# Machine Learning Models

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier

# Evaluating Models
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix

# Helpers
import itertools

In [2]: # Read in cleaned data and display shape
data = pd.read_csv('data/cleaned_data.csv')
print(data.shape)

(11746, 60)

There are over 11,000 buildings in the dataset with 60 energy-related features each. Many
of the columns contain a significant portion of missing values which were initially encoded as
’Not Available’. In a previous notebook, I filled in the missing values with nan and converted the
appropriate columns to numerical values.

2 Data Exploration
As we are mainly concerned with the Energy Star Score, the first chart shows the distribution
of this measure across the 9,642 buildings in the dataset that have a score.

2.1 Distribution of Energy Star Scores

In [3]: plt.style.use('fivethirtyeight')
figsize(8, 8)

# Histogram of Energy Star Scores with 100 bins

plt.hist(data['ENERGY STAR Score'].dropna(), bins = 100);
plt.title('Energy Star Score Distribution');
plt.xlabel('Score'); plt.ylabel('Count');

2.1.1 Potential Issue with Energy Star Scores
The Energy Star Score is a measure based on the self-reported energy usage of a building (see
this documentation for a thorough discussion) which is supposed to represent the percentile energy
performance rank of a building. Given the above distribution, either this dataset has a high proportion
of extraordinarily efficient buildings, or the measure is not objective. Roughly 7% of the buildings
have a score of exactly 100, but we would expect the distribution to be uniform across scores
because this is a percentile measure.
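
To illustrate why a percentile measure should be roughly uniform, here is a minimal sketch on synthetic data (not the NYC dataset). It converts a heavily skewed sample to percentile ranks and counts how many values fall into each block of 10 points. The sample, the seed, and all variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic, heavily skewed "energy use" values (assumed for illustration only)
rng = np.random.default_rng(0)
energy_use = pd.Series(rng.lognormal(mean=4.0, sigma=0.8, size=10_000))

# Percentile rank of each value within the sample, on a 1-100 scale
percentile = np.ceil(energy_use.rank() * 100 / len(energy_use)).astype(int)

# Count how many values fall into each block of 10 percentile points
decile_counts = pd.cut(percentile, bins=list(range(0, 101, 10))).value_counts(sort=False)
print(decile_counts)
```

However skewed the underlying values are, the counts per block come out nearly equal, which is why a spike at exactly 100 is surprising for a true percentile rank.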
In contrast to the Energy Star Score, the Energy Use Intensity (EUI) is based on actual energy
use as determined by the utility. It is straightforward to calculate: take the total annual energy
use (in kBtu) and divide by the square footage of the building. The EUI is meant for normalized
energy use comparisons between buildings of the same building type. This measure is likely more
objective because it uses actual measured consumption. In the graph below we can see that the
distribution of Energy Use Intensity is approximately normal, a more plausible shape for a
measured quantity than the spike at 100 in the score distribution.
In [4]: figsize(8, 8)

# Histograms of Site EUI across all buildings in dataset

plt.hist(data['Weather Normalized Site EUI (kBtu/ft2 )'].dropna(),
bins = 20, range = (0, 300));
plt.title('Site EUI Distribution'); plt.xlabel('Site EUI (kBtu/ft^2)');
plt.ylabel('Count');
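
The EUI calculation described above (annual energy use divided by floor area) can be sketched in a few lines. The three buildings and the column names `annual_energy_kbtu`, `floor_area_ft2`, and `site_eui` are hypothetical and are not columns of the benchmarking dataset.

```python
import pandas as pd

# Hypothetical buildings: annual energy use in kBtu and gross floor area in ft^2
buildings = pd.DataFrame({
    'annual_energy_kbtu': [5_000_000, 12_000_000, 800_000],
    'floor_area_ft2': [50_000, 100_000, 20_000],
})

# EUI = total annual energy use / floor area, in kBtu/ft^2
buildings['site_eui'] = buildings['annual_energy_kbtu'] / buildings['floor_area_ft2']
print(buildings)
```

Because the result is normalized by area, a large building and a small one can be compared directly, although the comparison is still only meaningful within the same building type.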

Even though there are issues with the Energy Star Score, this analysis will focus on that
measure. However, further investigative work should be carried out on these scores to determine why
the distribution is unusual. Given the suspect distribution of this value, it would be worthwhile to
find a better overall measure of a building’s energy performance. We can still try to predict the
Energy Star Score, but just because we can predict a value does not mean it is useful!

2.2 Energy Star Score by Building Type
To determine whether certain building types tend to score better or worse on the Energy Star
Score, we can plot the distribution of Energy Star Scores by building type. The following density
plot shows the distribution of scores for building types with more than 80 measurements in the
data.
The kernel density plot x-limits do extend beyond the range of actual scores because of the
kernel density estimation method used to make the probability density distribution. The actual
values in a density plot can be difficult to interpret, and it is more instructive to focus on the
distribution/shape of the figure.
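
The reason a density curve spills past the range of possible scores can be seen in a hand-rolled Gaussian kernel density estimate. This is a minimal sketch: the scores, the bandwidth of 10, and the helper `gaussian_kde_at` are assumptions for illustration, and seaborn chooses its own bandwidth automatically.

```python
import numpy as np

# Hypothetical scores bounded to the range 1-100, with a cluster at 100
scores = np.array([1, 5, 40, 60, 85, 95, 100, 100, 100, 100], dtype=float)

def gaussian_kde_at(x, samples, bandwidth):
    """Average of Gaussian bumps centered on each sample, evaluated at x."""
    z = (x - samples) / bandwidth
    return np.mean(np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi)))

# Density estimates at points outside the range of possible scores
density_below = gaussian_kde_at(-5.0, scores, bandwidth=10.0)
density_above = gaussian_kde_at(105.0, scores, bandwidth=10.0)
print(density_below, density_above)
```

Each observation contributes a smooth bump that extends on both sides of it, so observations at the boundary (such as the scores of 100) push some estimated density beyond the boundary.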

In [5]: # Create a list of buildings with more than 80 measurements

types = data.dropna(subset=['ENERGY STAR Score'])
types = types['Primary Property Type - Self Selected'].value_counts()
types = list(types[types.values > 80].index)

figsize(14, 10)

# Plot each building

for b_type in types:
    # Select the building type
    subset = data[data['Primary Property Type - Self Selected'] == b_type]

    # Density plot of Energy Star scores
    sns.kdeplot(subset['ENERGY STAR Score'].dropna(),
                label = b_type);

plt.xlabel('Energy Star Score'); plt.ylabel('Density');

plt.title('Density Plot of Energy Star Scores');

This gives us our first conclusion: office buildings tend to have much higher Energy Star Scores
than hotels and senior care communities. Residence halls and non-refrigerated warehouses also
have higher scores while multifamily housing has a bi-modal distribution of scores similar to the
overall distribution. It would be useful to see if this holds across a larger sample size, and to get
more data to figure out why some building types tend to do better.
In the exploration notebook, I identified a negative correlation between the Site
EUI and the Energy Star Score. We can plot the Site EUI by building type to see if there is also a
difference in Site EUI between building types.

2.3 Site EUI by Building Type

In [6]: figsize(14, 10)

# Plot the site EUI density plot for each building type
for b_type in types:
    # Remove outliers before plotting
    subset = data[(data['Weather Normalized Site EUI (kBtu/ft2 )'] < 300) &
                  (data['Primary Property Type - Self Selected'] == b_type)]

    # Plot the site EUI on a density plot
    sns.kdeplot(subset['Weather Normalized Site EUI (kBtu/ft2 )'].dropna(),
                label = b_type);

plt.xlabel('Site EUI (kBtu/ft^2)'); plt.ylabel('Density');
plt.title('Density Plot of Site EUI');

This plot provides us with another conclusion: there appears to be a negative correlation between
the Site EUI and the Energy Star Score, based on comparing the two sets of distributions across
building types. Building types with lower Site EUIs tend to have higher Energy Star Scores. The
higher the energy use intensity (which is energy use / area), the "worse" the building’s energy
efficiency performance. We can visualize the relationship between the Energy Star Score and the Site
EUI in a scatterplot.

2.4 Energy Star Score versus Site EUI

In [7]: figsize(14, 10)

# Subset to the buildings with most measurements and remove outliers

subset = data[(data['Weather Normalized Site EUI (kBtu/ft2 )'] < 300) &
(data['Primary Property Type - Self Selected'].isin(types))]

# Drop the buildings without a value

subset = subset.dropna(subset=['ENERGY STAR Score',
'Weather Normalized Site EUI (kBtu/ft2 )'])

subset = subset.rename(columns={'Primary Property Type - Self Selected': 'Proper

# Linear Plot of Energy Star Score vs EUI

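# seaborn 0.9+ uses `height` (formerly `size`); recent releases also
# require x and y to be passed as keywords
sns.lmplot(x = 'Weather Normalized Site EUI (kBtu/ft2 )', y = 'ENERGY STAR Score',
           data = subset, hue = 'Property Type',
           scatter_kws={'alpha': 0.8, 's': 32}, fit_reg=False,
           height = 12, aspect = 1.2);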

plt.title('Energy Star Score vs Site EUI', size = 24);

The plot shows the expected negative relationship between Energy Star Score and Site EUI.
This relationship looks like it holds across building types. To quantify the relationship, we can
calculate the Pearson Correlation Coefficient between the two variables. This is a measure of
linear correlation which shows both the strength and the direction of the relationship. We will look
at the correlation coefficient between Energy Star Scores and several measures.
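
As a reminder of what the Pearson coefficient measures, the sketch below computes it from its definition and checks the result against `np.corrcoef`. The six Site EUI and score values are made up for illustration.

```python
import numpy as np

# Hypothetical Site EUI and score values for a handful of buildings
site_eui = np.array([40.0, 60.0, 80.0, 100.0, 120.0, 150.0])
score = np.array([95.0, 80.0, 70.0, 55.0, 30.0, 10.0])

# Pearson r = sum of products of deviations / sqrt(product of sums of squared deviations)
eui_dev = site_eui - site_eui.mean()
score_dev = score - score.mean()
r_manual = (eui_dev * score_dev).sum() / np.sqrt((eui_dev ** 2).sum() * (score_dev ** 2).sum())

# np.corrcoef returns the full 2x2 correlation matrix; [0, 1] is the pair of interest
r_numpy = np.corrcoef(site_eui, score)[0, 1]
print(r_manual, r_numpy)
```

The sign gives the direction of the relationship (negative here, since scores fall as EUI rises) and the magnitude gives the strength of the linear association.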

2.5 Linear Correlations with Energy Star Score

In [8]: # List of Variables to find correlation coefficients
features = ['Primary Property Type - Self Selected',

'Weather Normalized Site EUI (kBtu/ft2 )',
'Weather Normalized Site Electricity Intensity (kWh/ft2 )',
'Largest Property Use Type - Gross Floor Area (ft2 )',
'Natural Gas Use (kBtu)',
'ENERGY STAR Score']

subset = data[features].dropna()
subset = subset[subset['Primary Property Type - Self Selected'].isin(types)]

# Rename the columns

subset.columns = ['Property Type', 'Site EUI',
'Electricity Intensity', 'Floor Area',
'Natural Gas', 'Energy Star Score']

# Remove outliers
subset = subset[subset['Site EUI'] < 300]

In [9]: # Group by the building type and calculate correlations

corrs = pd.DataFrame(subset.groupby('Property Type').corr())
corrs = pd.DataFrame(corrs['Energy Star Score'])

# Format the dataframe for display

corrs = corrs.reset_index()
corrs.columns = ['Property Type', 'Variable', 'Correlation with Score']
corrs = corrs[corrs['Variable'] != 'Energy Star Score']
corrs

Out[9]:
    Property Type               Variable               Correlation with Score
0   Hotel                       Electricity Intensity  -0.553256
2   Hotel                       Floor Area              0.042204
3   Hotel                       Natural Gas            -0.097727
4   Hotel                       Site EUI               -0.598284
5   Multifamily Housing         Electricity Intensity  -0.602274
7   Multifamily Housing         Floor Area             -0.007159
8   Multifamily Housing         Natural Gas            -0.168528
9   Multifamily Housing         Site EUI               -0.743034
10  Non-Refrigerated Warehouse  Electricity Intensity  -0.719545
12  Non-Refrigerated Warehouse  Floor Area              0.125143
13  Non-Refrigerated Warehouse  Natural Gas            -0.221939
14  Non-Refrigerated Warehouse  Site EUI               -0.726648
15  Office                      Electricity Intensity  -0.705143
17  Office                      Floor Area              0.041398
18  Office                      Natural Gas            -0.141701
19  Office                      Site EUI               -0.784825
20  Residence Hall/Dormitory    Electricity Intensity  -0.727305
22  Residence Hall/Dormitory    Floor Area             -0.071806
23  Residence Hall/Dormitory    Natural Gas            -0.391847
24  Residence Hall/Dormitory    Site EUI               -0.876173
25  Senior Care Community       Electricity Intensity  -0.572131
27  Senior Care Community       Floor Area              0.105373
28  Senior Care Community       Natural Gas            -0.283634
29  Senior Care Community       Site EUI               -0.566617

This shows the correlation between Energy Star Score and four different measures by building
type. For all building types we see the following relationships: Energy Star Score is strongly negatively
correlated with the Electricity Intensity and the Site EUI. The strength of the natural gas correlation
depends on the building type, but natural gas usage tends to be negatively correlated with the Energy Star
Score as well. The relationship between floor area and the Energy Star Score is weak for all building
types.

2.6 Remove Collinear Features

Before we build a machine learning model, we want to remove features that are highly collinear
with one another (not with the Energy Star Score). The objective is to determine if there are
independent variables that are predictive of the Energy Star Score, so we do not want to use
multiple collinear variables in our model. For example, Site EUI and Weather Normalized
Site EUI are very highly correlated (nearly 1.0) and therefore we only need to include one of
these measures in our model.
A simple method for removing collinear features is to calculate the correlation coefficient between
every pair of variables and, for each pair whose correlation magnitude exceeds a specified threshold,
remove one of the two variables. This procedure is carried out here, using a threshold of 0.5 for the
magnitude of the correlation coefficient.

In [10]: # Remove correlations from the dataframe that are above corr_val
def corr_df(x, corr_val):
    # Dont want to remove correlations between Energy Star Score
    y = x['ENERGY STAR Score']
    x = x.drop(columns = ['ENERGY STAR Score'])

    # Matrix of all correlations
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []

    # Iterate through all correlations
    for i in iters:
        for j in range(i):
            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
            col = item.columns
            row = item.index
            val = abs(item.values)

            # If correlation is above the threshold, add to list to remove
            if val >= corr_val:
                drop_cols.append(col.values[0])

    # Remove collinear variables
    drops = set(drop_cols)
    x = x.drop(columns = drops)
    x = x.drop(columns = ['Site EUI (kBtu/ft2 )'])

    # Make sure to add the label back in to the data
    x['ENERGY STAR Score'] = y

    return x

In [11]: new_data = corr_df(data, corr_val = 0.5)

print('Old Data Shape:', data.shape)
print('New Data Shape with correlated features removed', new_data.shape)

Old Data Shape: (11746, 60)

New Data Shape with correlated features removed (11746, 42)

The procedure removed 18 highly collinear columns from the data. The next step is to
determine if we can build machine learning models that can use these features to predict the target,
Energy Star Score.
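
The same pairwise-threshold idea can be sketched more compactly with the upper triangle of the correlation matrix. This is an alternative formulation under stated assumptions, not the procedure used for the results above: the helper `drop_collinear` and the toy columns `a`, `b`, `c` are hypothetical. Note that the inner loop `for j in range(i)` in `corr_df` appears to compare column i+1 only with rows 0 to i-1, so directly adjacent columns are never compared with each other; the sketch below checks every pair and therefore would not necessarily reproduce the 42-column result.

```python
import numpy as np
import pandas as pd

def drop_collinear(features, threshold):
    """Drop the later column of every pair whose |correlation| >= threshold."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return features.drop(columns=to_drop)

# Hypothetical data: 'b' is nearly a copy of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'b': a + rng.normal(scale=0.01, size=200),
                    'c': rng.normal(size=200)})
reduced = drop_collinear(toy, threshold=0.5)
print(list(reduced.columns))
```

Dropping the later column of each pair is an arbitrary tie-break; which member of a collinear pair to keep is a judgment call that depends on which variable is easier to interpret.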

3 Machine Learning Approach

The objective is to predict the Energy Star Score of a building given the variables available in
the data. First, we will separate out the buildings that have a score and those without for later
inference. Of the buildings with an Energy Star Score, we will use 75% for training the model and 25%
for evaluating our model. We will try two different methods:

1. Regression: Predict the numerical value of the Energy Star Score

• Metrics: Mean Absolute Error, the average absolute deviation between predictions and true
values, and R^2, the proportion of variance in the response explained by the model
2. Classification: Divide the buildings into 5 groupings based on Energy Star Score intervals of
20 points. Predict the class for a building.
• Metrics: Accuracy, f1_score (the harmonic mean of precision and recall), and the
confusion matrix

We could also try clustering on the buildings, but for now we will stick to supervised machine
learning methods. Unsupervised methods would be an interesting avenue for further study.
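
The two regression metrics named above can be computed directly from their definitions. The sketch below uses five made-up true scores and predictions; the values and variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical true scores and model predictions
y_true = np.array([90.0, 75.0, 40.0, 10.0, 100.0])
y_pred = np.array([85.0, 70.0, 55.0, 20.0, 95.0])

# Mean Absolute Error: average size of the prediction error, in score points
mae = np.mean(np.abs(y_pred - y_true))

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(mae, r2)
```

The error is in the same units as the score, which makes it easy to communicate, while R^2 is unitless and compares the model against always predicting the mean.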

3.1 Feature Preprocessing
To prepare the data for machine learning, we will use a few standard preprocessing steps:

1. Subset to numerical columns and 3 categorical columns. I primarily want to focus on using
the numeric values to predict the score, but I will also include categorical columns including
the building type because we saw that is related to the score.
2. One-hot encoding of categorical variables.
3. Split into training and testing sets (after extracting the buildings with no score). We will be
using 75% training and 25% testing.
4. Impute the missing values. There are many methods for imputation, but I will use the
straightforward median imputation. We need to train the imputer only on the training data
and make transformations of both the training data and the testing data.
5. Return the values for training and testing the model.

4 Regression of Energy Star Scores

Regression Preprocessing

In [12]: # Takes in a dataframe with observations for each building

# Returns all the data needed for training and testing a regression model
def train_test_reg(df):

    # Select the numerical columns
    X = df.select_dtypes('number')

    # Add the selected categorical columns
    X['Largest Property Use Type'] = df['Largest Property Use Type']
    X['Metered Areas (Energy)'] = df['Metered Areas (Energy)']
    X['DOF Benchmarking Submission Status'] = df['DOF Benchmarking Submission S

    # One-hot encoding of categorical values
    X = pd.get_dummies(X)

    # Extract the buildings with no score
    missing_scores = X[X['ENERGY STAR Score'].isnull()]

    # Drop the missing scores from the data
    X = X.dropna(subset = ['ENERGY STAR Score'])

    # Remove the labels from the features
    y = X['ENERGY STAR Score']
    X = X.drop(columns = ['Order', 'Property Id', 'ENERGY STAR Score'])

    missing_scores = missing_scores.drop(columns = ['Order',
                                                    'Property Id',
                                                    'ENERGY STAR Score'])

    # Feature names will be used for later interpretation
    feature_names = list(X.columns)

    # Split into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size = 0.25,
                                                        random_state = 42)

    # Impute missing values using a median strategy
    imputer = Imputer(strategy = 'median')
    X_train = imputer.fit_transform(X_train)
    X_test = imputer.transform(X_test)
    missing_scores = imputer.transform(missing_scores)

    # Return all data
    return X_train, X_test, y_train, y_test, missing_scores, feature_names

In [13]: X_train, X_test, y_train, y_test, missing_scores, feature_names = train_test_re

print('Training Data Shape:', X_train.shape)
print('Testing Data Shape:', X_test.shape)

Training Data Shape: (7231, 77)

Testing Data Shape: (2411, 77)

We will be training the model with over 7000 buildings using 77 features (one-hot encoding
expands the number of features substantially because it creates one feature for every value of a
categorical variable). The training and testing data have the same number of features, which is a
good sanity check!
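
The feature expansion from one-hot encoding, and the rule that the imputation value is learned from the training rows only, can both be seen on a toy frame. This is a minimal sketch: the four rows and the 3/1 split are hypothetical, and plain `fillna` is used in place of the scikit-learn imputer to keep the mechanics visible.

```python
import pandas as pd

# Hypothetical mini-dataset with one categorical and one numeric column
toy = pd.DataFrame({
    'Largest Property Use Type': ['Office', 'Hotel', 'Office', 'Multifamily Housing'],
    'Year Built': [1990, 1965, None, 2005],
})

# One-hot encoding creates one indicator column per category value
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())

# Median imputation: the fill value is learned from the training rows only
train, test = encoded.iloc[:3], encoded.iloc[3:]
train_median = train['Year Built'].median()
train_filled = train.fillna({'Year Built': train_median})
test_filled = test.fillna({'Year Built': train_median})
```

One categorical column with three distinct values becomes three indicator columns, which is how a handful of categorical variables can grow into dozens of features. Reusing the training median on the test rows avoids leaking information from the test set into the model.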

4.1 Establish A Regression Baseline

Before we implement a machine learning model, we need to calculate a common-sense baseline.
If our model cannot beat this mark, then machine learning may not be appropriate for the task. For
regression tasks, a simple baseline is to predict the mean value of the target in the training data for
all the testing examples.

In [14]: # Baseline is mean of training label

baseline = np.mean(y_train)
base_error = np.mean(abs(baseline - y_test))

print('Baseline Error: {:0.4f}.'.format(base_error))

Baseline Error: 25.9698.

If our model cannot beat an average error of 26, then we will need to rethink our approach.
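
Because the metric here is mean absolute error, another candidate constant baseline is the median of the training labels, which is the constant that minimizes absolute error on the sample it is computed from. The sketch below compares the two constants on made-up labels; `constant_mae` and `y_train_toy` are hypothetical names, and on held-out data the ordering is not guaranteed.

```python
import numpy as np

# Hypothetical training labels, skewed toward high scores
y_train_toy = np.array([100, 100, 98, 90, 85, 70, 40, 15, 5, 1], dtype=float)

def constant_mae(constant, y):
    """Mean absolute error of predicting the same constant for every example."""
    return np.mean(np.abs(y - constant))

mean_baseline = constant_mae(y_train_toy.mean(), y_train_toy)
median_baseline = constant_mae(np.median(y_train_toy), y_train_toy)
print(mean_baseline, median_baseline)
```

Either constant gives a reference point; a model is only worth its added complexity if it clearly beats the better of the two.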

4.2 Linear Regression

A linear regression is extremely straightforward and produces explainable results. It is a good
method to start with in machine learning because we can interpret the results. However, if the
problem is not linear, then our model will not be very useful.

In [15]: # Create and train the model

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Make predictions and evaluate

lin_pred = lin_reg.predict(X_test)

print('Linear Regression Error: {:0.4f}.'.format(np.mean(abs(lin_pred - y_test))))

Linear Regression Error: 27.9440.

That is worse than the baseline error of 25.97, which suggests the relationship is not well
captured by a linear model. We will move on to a more flexible machine learning model. The trade-off
as we increase complexity is that while accuracy typically increases, interpretability
decreases.

4.3 Random Forest Regression

Next I implement a random forest regression. The random forest is somewhat interpretable because
it returns feature importances which we can use to compare the most helpful variables for
making predictions. The random forest is generally very accurate and performs well on non-linear
problems with many features because it does implicit feature selection.
I am using all the scikit-learn defaults for the random forest except I increased the number of
trees in the forest from 10 (the default at the time of writing; scikit-learn 0.22 and later default
to 100) to 200. Generally the default hyperparameters work well but can be
tuned using cross-validation.

In [16]: # Create and train random forest

rf_reg = RandomForestRegressor(n_estimators=200, n_jobs=-1)
rf_reg.fit(X_train, y_train)

# Make predictions and evaluate

rf_reg_pred = rf_reg.predict(X_test)

print('Random Forest Error: {:0.4f}.'.format(np.mean(abs(rf_reg_pred - y_test))))

Random Forest Error: 10.6761.

The error is substantially lower than the baseline, so we can conclude that using machine learning
to predict the Energy Star Score is possible given the dataset. The test error shows that our
model can predict the Energy Star Score within about 11 points of the true value on average.
As another evaluation of our model, we can calculate the R^2 score between the predictions
and the true values. This gives us an indication of the proportion of variance explained by the model.
A higher value (the max is 1) indicates that our model is better able to explain the relationship
between inputs and outputs.

In [17]: print('Random Forest R^2: {:0.4f}.'.format(np.corrcoef(rf_reg_pred, y_test)[0][

Random Forest R^2: 0.8610.

The interpretation of the R^2 value is that our model explains 86% of the variance in the Energy
Star Score. The rest of the variation is due to noise or latent (hidden) variables that we do not have
access to in the data.
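
The `np.corrcoef` call in the cell above is cut off in this copy, so it is not clear whether the Pearson coefficient was squared before printing. The sketch below, on made-up data, shows that the Pearson coefficient, its square, and the coefficient of determination returned by scikit-learn's `r2_score` are three different quantities, which is worth keeping in mind when reading the 0.8610 figure.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical predictions that rank buildings perfectly but compress the range
y_true = np.array([10.0, 30.0, 50.0, 70.0, 90.0])
y_pred = 0.5 * y_true + 25.0

pearson_r = np.corrcoef(y_pred, y_true)[0][1]
pearson_r_squared = pearson_r ** 2
coef_determination = r2_score(y_true, y_pred)
print(pearson_r, pearson_r_squared, coef_determination)
```

Here the correlation is perfect because the predictions are a linear function of the truth, yet the coefficient of determination is lower because it penalizes the systematic error in the predicted values themselves.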

4.4 Interpret Regression Results

Let’s first look at the distribution of predicted scores and true scores.

In [18]: figsize(14, 8)

# Plot predictions
ax = plt.subplot(121)
ax.hist(rf_reg_pred, bins = 100)
ax.set_xlabel('Score'); ax.set_ylabel('Count');
ax.set_title('Predicted Distribution')

# Plot true values

ax2 = plt.subplot(122)
ax2.hist(y_test, bins = 100)
ax2.set_xlabel('Score'); ax2.set_ylabel('Count');
ax2.set_title('Actual Distribution');

The model can predict the mode of scores at 100, but it does not seem to be able to predict the low
scores very well. Fixing this imbalance could be an area of future work. We could also try different
models to reduce the error, but for now we have shown that the problem is amenable to machine
learning.

4.4.1 Predicted Scores of New Buildings

Now we can use the model to generate new predictions for the buildings with missing scores.

In [19]: # Predict for the buildings missing energy star scores

missing_preds = rf_reg.predict(missing_scores)

figsize(8, 8)

# Distribution of predicted scores

plt.hist(missing_preds, bins = 100);
plt.xlabel('Score'); plt.ylabel('Count');
plt.title('Predicted Scores for Missing Buildings');

Our model can now make a prediction for any new building as long as it has the measurements
recorded in the original data. We can use this trained model to infer the score, and based on the
test error we expect the estimate to be within about 11 points of the true value on average.

4.4.2 Interpret Feature Importances

The main way we can get an understanding of the random forest model is to look at which features
it considers most important. The absolute values of the feature importances are not as crucial as
the relative ranking of variables. These tell us which variables are most useful for predicting the
Energy Star Score.
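
To see why the ranking matters more than the absolute values, the sketch below fits a small forest on synthetic data where the true influence of each feature is known. The data, the coefficients, and the names `X_toy`, `y_toy`, and `forest` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends strongly on feature 0, weakly on feature 1,
# and not at all on feature 2 (all assumed for illustration)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 3))
y_toy = 5.0 * X_toy[:, 0] + 1.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=500)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_toy, y_toy)

# Impurity-based importances are normalized to sum to 1, so only the ranking
# and relative sizes carry meaning
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]
print(importances, ranking)
```

Because the importances are shares of a fixed total, adding or removing a feature changes every value, while the ordering of the strongest features tends to be more stable.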

In [20]: def interpret_features(feature_list, importances):

    # Make a dataframe of features and importances
    feature_df = pd.DataFrame({'feature': feature_list,
                               'importance': importances})

    # Sort the features from most to least important
    feature_df = feature_df.sort_values('importance', ascending = False)
    figsize(8, 8)

    # Plot the most important features
    plt.bar(feature_df[feature_df['importance'] > 0.01]['feature'],
            feature_df[feature_df['importance'] > 0.01]['importance'])

    # Plot formatting
    plt.xticks(rotation = 90)
    plt.xlabel('feature'); plt.ylabel('importance');
    plt.title('Top Feature Importances')
    plt.show();

    return feature_df

In [21]: reg_features = interpret_features(feature_names, rf_reg.feature_importances_)

In [22]: reg_features.head(10)

Out[22]:
    feature                                             importance
7   Weather Normalized Site EUI (kBtu/ft2 )             0.534404
8   Weather Normalized Site Electricity Intensity ...   0.159948
40  Largest Property Use Type_Office                    0.054748
12  Natural Gas Use (kBtu)                              0.034163
1   Largest Property Use Type - Gross Floor Area (...   0.028412
0   DOF Gross Floor Area                                0.024430
13  Water Use (All Water Sources) (kgal)                0.023072
14  Longitude                                           0.022914
4   Year Built                                          0.022421
37  Largest Property Use Type_Multifamily Housing       0.021825

According to the random forest model, the most important features for predicting the Energy
Star Score of a building are the Site EUI, the Electricity Intensity, the Property Use Type, the
Natural Gas Use, and the Area. These are in line with the variables that we saw are the most
correlated with the Energy Star Score.
Let’s take a look at the relationship between the Energy Star Score, the Site EUI, the electricity
intensity, the Natural Gas Use, and the area of the building. In the exploration, we noticed that
these relationships are closer to linear when the log transform of the variable is taken. We will
apply that transformation and look at the resulting relationships.

In [23]: # List of Features to Plot

features = ['Weather Normalized Site EUI (kBtu/ft2 )',
'Weather Normalized Site Electricity Intensity (kWh/ft2 )',
'Largest Property Use Type - Gross Floor Area (ft2 )',
'Natural Gas Use (kBtu)',
'ENERGY STAR Score']

subset = new_data[features].dropna()

# Rename the columns

subset.columns = ['Site EUI', 'Electricity Intensity',
'Floor Area', 'Natural Gas', 'Score']

# Remove outliers and problem values

subset = subset[subset['Site EUI'] < 300]

# Log Transforms
subset['log Site EUI'] = np.log(subset['Site EUI'])
subset['log Electricity Intensity'] = np.log(subset['Electricity Intensity'])

subset['log Floor Area'] = np.log(subset['Floor Area'])
subset['log Natural Gas'] = np.log(subset['Natural Gas'])

subset = subset.replace({-np.inf: np.nan})
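
The final `replace` step in the cell above matters because the logarithm of zero is negative infinity. The sketch below shows the effect on a made-up series that includes a building reporting zero natural gas use; the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical natural gas readings, including a building that reports zero use
gas = pd.Series([0.0, 1_000.0, 50_000.0, 2_000_000.0])

# log(0) evaluates to -inf (NumPy would normally emit a RuntimeWarning)
with np.errstate(divide='ignore'):
    log_gas = np.log(gas)

# Replace -inf with NaN so a later dropna() removes those rows
log_gas = log_gas.replace({-np.inf: np.nan})
print(log_gas)
```

Dropping these rows is simple but discards the zero-use buildings entirely; an alternative such as `np.log1p` would keep them, at the cost of a slightly different transform.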

In [24]: sns.pairplot(subset.dropna()[['log Site EUI', 'log Electricity Intensity',
'log Floor Area', 'log Natural Gas', 'Score']]);

This graph is not especially informative, but it does confirm that the Energy Star Score is negatively
correlated with the Site EUI, the Electricity Intensity, and the natural gas usage. There is no significant
relationship between the floor area and the Energy Star Score.
The overall results of regression show us that it is possible to accurately infer the Energy Star
Score of a building given the data available. The model results are not easily explained, but the
most important features are in line with those found to be correlated with the Energy Star Score.

5 Classification of Energy Star Scores
As an additional exercise, we can see if it is possible to place buildings into classes based on their
Energy Star Scores. We will use a simple grading scheme where every 20-point interval is a grade,
giving us 5 grades in total.
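
The grading scheme can be sketched with `pd.cut`. In this copy the `labels` argument of the cell that follows is cut off, so the label order `['F', 'D', 'C', 'B', 'A']` used here is an assumption, chosen to be consistent with the five grade names that appear in the output further down. The scores are made up.

```python
import pandas as pd

# Hypothetical scores; the label names and their order are an assumption
scores = pd.Series([1, 20, 21, 55, 80, 81, 100])

# Bin edges 0, 20, ..., 100 give right-closed intervals (0, 20], ..., (80, 100]
grades = pd.cut(scores, bins=range(0, 101, 20), labels=['F', 'D', 'C', 'B', 'A'])
print(grades.tolist())
```

Because the intervals are closed on the right, a score of exactly 20 falls in the lowest grade and a score of 21 in the next one up.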

Classification Preprocessing

In [25]: # Function to generate a training and testing dataset for classification

# Codes the energy star score in 5 unique classes
def train_test_clf(df):

    # Select numeric columns
    X = df.select_dtypes('number')

    # Add in specific categorical columns
    X['Largest Property Use Type'] = df['Largest Property Use Type']
    X['Metered Areas (Energy)'] = df['Metered Areas (Energy)']
    X['DOF Benchmarking Submission Status'] = df['DOF Benchmarking Submission S

    # One Hot Encoding of Categorical Variables
    X = pd.get_dummies(X)

    # Save the buildings with missing scores
    missing_scores = X[X['ENERGY STAR Score'].isnull()]
    X = X.dropna(subset = ['ENERGY STAR Score'])

    # Assign a grade based on energy star score
    X['grade'] = pd.cut(X['ENERGY STAR Score'], bins = range(0, 101, 20), label

    y = X['grade']

    # Drop labels from training data
    X = X.drop(columns = ['Order', 'Property Id', 'ENERGY STAR Score', 'grade']
    missing_scores = missing_scores.drop(columns = ['Order', 'Property Id', 'EN
    feature_names = list(X.columns)

    # Split into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25,

    # Impute the missing values using a median strategy
    imputer = Imputer(strategy = 'median')
    X_train = imputer.fit_transform(X_train)
    X_test = imputer.transform(X_test)
    missing_scores = imputer.transform(missing_scores)

    return X_train, X_test, y_train, y_test, missing_scores, feature_names

In [26]: X_train, X_test, y_train, y_test, missing_scores, feature_names = train_test_cl

# Look at value counts of grades

pd.Series(y_train).value_counts()

Out[26]: A 2254
B 1690
C 1280
F 1057
D 950
Name: grade, dtype: int64

The most common ’grade’ for a building is an A and the least common is a D. The skew
towards higher energy star scores is visible once again (self-reporting does not seem to be an
objective strategy).

5.1 Random Forest Classifier

In [27]: # Create and train the classifier
rf_clf = RandomForestClassifier(n_estimators=200)
rf_clf.fit(X_train, y_train)

# Make predictions
rf_clf_pred = rf_clf.predict(X_test)

5.2 Interpret Classification Results

There are many metrics to use for classification. Here we will use accuracy, the f1 score, and the
confusion matrix. We will use the weighted average parameter for the f1 score to account for the
class imbalances. The confusion matrix allows us to see what errors our classifier makes.
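
The weighted F1 score mentioned above is the per-class F1 averaged with weights equal to the number of true examples in each class. The sketch below checks that definition against scikit-learn on ten made-up labels; the labels and predictions are hypothetical.

```python
import numpy as np
from sklearn.metrics import f1_score, confusion_matrix

# Hypothetical true and predicted grades for ten buildings
y_true = np.array(['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'F'])
y_pred = np.array(['A', 'A', 'A', 'B', 'B', 'B', 'A', 'C', 'B', 'F'])

labels = ['A', 'B', 'C', 'F']
per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None)

# Weighted F1: per-class F1 averaged with weights equal to each class's true count
support = np.array([(y_true == label).sum() for label in labels])
weighted_manual = np.sum(per_class_f1 * support) / support.sum()
weighted_sklearn = f1_score(y_true, y_pred, average='weighted')

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(weighted_manual, weighted_sklearn)
print(cm)
```

Weighting by support keeps a rare class from dominating the summary, but it also means poor performance on a rare class can be hidden, which is one reason to inspect the confusion matrix as well.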

In [28]: # Function to plot a confusion matrix

# Includes the numbers of each classification on the plot
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Reds):

    # Normalize the counts
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    # Plot the confusion matrix as an image
    plt.figure(figsize=(8, 8))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, size = 22)

    # Tick marks show the labels
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes)
    plt.yticks(tick_marks, classes)

    # Add the numbers to the plot
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center", size = 18,
                 color="white" if cm[i, j] > thresh else "black")

    # Formatting
    plt.tight_layout()
    plt.xticks(size = 18)
    plt.yticks(size = 18)
    plt.ylabel('True label', size = 18)
    plt.xlabel('Predicted label', size = 18)

# Function to evaluate classifications from a model

def evaluate_classifications(true, predictions):
    # Metrics for classifications
    accuracy = np.mean(true == predictions)
    f1_value = f1_score(true, predictions, average='weighted')
    cm = confusion_matrix(true, predictions)

    print('Accuracy: {:0.2f}.'.format(accuracy))
    print('F1 score: {:0.2f}.'.format(f1_value))

    plot_confusion_matrix(cm, classes = sorted(set(true)))

In [29]: evaluate_classifications(y_test, rf_clf_pred)

Accuracy: 0.59.
F1 score: 0.58.

Accuracy is a straightforward metric and it shows that our model correctly chooses among
the five classes about 59% of the time. The confusion matrix is useful because it shows which
mistakes the model makes. The most common mistakes are predicting an 'A' when the true label
was a 'B', and predicting a 'B' when the true label was a 'C'. The random forest classifier
does well enough to give a useful estimate of a building's Energy Star grade from the building
data, although roughly four in ten test buildings are still assigned the wrong grade.
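
One way to put an accuracy figure like this in context is to compare it with a trivial baseline that always predicts the most common class. The sketch below does this on synthetic five-class data generated with `make_classification`, since the building data is not reproduced here; the variable names and the class weights are assumptions chosen only to mimic an imbalanced problem, and the numbers it prints say nothing about the real model.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced five-class data standing in for the building features (assumption)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, n_informative=6,
                                     n_classes=5,
                                     weights=[0.31, 0.23, 0.18, 0.15, 0.13],
                                     random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=42)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)

# Random forest trained on the same split
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# score() returns accuracy for classifiers
baseline_acc = baseline.score(X_te, y_te)
forest_acc = forest.score(X_te, y_te)

print('Baseline accuracy: {:0.2f}'.format(baseline_acc))
print('Random forest accuracy: {:0.2f}'.format(forest_acc))
```

The gap between the two numbers, rather than the raw accuracy alone, indicates how much the features contribute.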

5.2.1 Predicted Classifications of New Data

The classifier can also be used to assign grades to the buildings which were missing a score. We
do not have the true values for these buildings, but we can at least look at the distribution of
predictions.

In [30]: missing_classes = rf_clf.predict(missing_scores)

pd.Series(missing_classes).value_counts()

Out[30]: A 900
F 471
B 341
C 232
D 160
dtype: int64

Any new building can now be classified by the model as long as it is provided with the
appropriate data. We can be confident in the predictions of the model as shown by the high
performance on the testing data.
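
Since there are no true labels for these buildings, one hedged way to gauge individual predictions is to look at the class probabilities the forest reports. The sketch below uses synthetic data and hypothetical names (`X_unlabeled`, `confidence`), and the 0.5 cutoff is an arbitrary choice for illustration; in a random forest the probability is the average of the trees' class probabilities, so a low maximum means the trees disagree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for labeled and unlabeled buildings (assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)
X_labeled, y_labeled = X_demo[:500], y_demo[:500]
X_unlabeled = X_demo[500:]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_labeled, y_labeled)

# One row per building, one column per class (columns follow forest.classes_)
proba = forest.predict_proba(X_unlabeled)

# Predicted class and the probability assigned to it
predicted = forest.classes_[proba.argmax(axis=1)]
confidence = proba.max(axis=1)

# Flag predictions where no class receives at least half of the probability
low_confidence = confidence < 0.5
print('Share of low-confidence predictions: {:0.2f}'.format(low_confidence.mean()))
```

Predictions flagged this way could be reviewed separately instead of being treated with the same confidence as the rest.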

6 Conclusions
We set out to answer the question: Can we build a model to predict the Energy Star Score of a
building and what variables provide us the most information about the score?
Given the exploration in this notebook, I conclude that yes, we can create a model to accurately
infer the Energy Star Score of a building and we have determined that the Site Energy Use
Intensity, the Electricity Intensity, the floor area, the natural gas use, and the Building Type
are the most useful measures for determining the Energy Star Score.
The highlights from the report are:

1. Energy Star Scores are skewed high with a disproportionate global maximum at 100 and a
secondary maximum at 1.
2. Offices, residence halls, and non-refrigerated warehouses have higher energy star scores than
senior care communities and hotels with scores of multi-family housing falling in between.
3. The Site Energy Use Intensity, the Electricity Intensity, and the natural gas usage are all
negatively correlated with the Energy Star Score.
4. A random forest regressor trained on the training data was able to achieve an average absolute
error of 10 points on a hold-out testing set, which was significantly better than the baseline
measure.
5. If provided with data for a new building, a trained model can accurately infer the Energy
Star Score.

This report also identified areas for follow-up. These include finding an objective measurement
of overall building energy performance and determining why the Energy Star Score distributions
vary between building types. That’s all for now!
