Building Data Report
Will Koehrsen
March 1, 2018
Contents
1 Introduction
1.1 Objectives
2 Data Exploration
2.1 Distribution of Energy Star Scores
2.1.1 Potential Issue with Energy Star Scores
2.2 Energy Star Score by Building Type
2.3 Site EUI by Building Type
2.4 Energy Star Score versus Site EUI
2.5 Linear Correlations with Energy Star Score
2.6 Remove Collinear Features
6 Conclusions
1 Introduction
In this report I present findings from an exploration of the NYC benchmarking dataset which
measures 60 variables related to energy use for over 11,000 buildings. Of primary interest is
the Energy Star Score, which is often used as an aggregate measure of the overall efficiency of
a building. The Energy Star Score is a percentile measure of a building’s energy performance
calculated from self-reported energy usage.
1.1 Objectives
• Identify predictors within the dataset for the Energy Star Score
• Build Regression/Classification models that can predict the Energy Star Score of a building
given the building’s energy data
• Interpret the results of the model and use the trained model to infer Energy Star Scores of
new buildings
Documentation of the column names can be found in this document. Preliminary data cleaning
and exploration were completed in another Jupyter Notebook.
In [1]: # Pandas and numpy for data manipulation
import pandas as pd
import numpy as np
# Evaluating Models
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix
# Helpers
import itertools
In [2]: # Read in cleaned data and display shape
data = pd.read_csv('data/cleaned_data.csv')
print(data.shape)
(11746, 60)
There are over 11,000 buildings in the dataset with 60 energy-related features each. Many
of the columns contain a significant share of missing values, which were initially encoded as
’Not Available’. In a previous notebook, I replaced these entries with NaN and converted the
appropriate columns to numerical values.
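The cleaning step described above can be sketched as follows. This is a minimal illustration on a made-up frame, not the code from the earlier notebook; the column names and values are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical raw frame: missing entries arrive as the string 'Not Available'
raw = pd.DataFrame({
    'Site EUI': ['80.5', 'Not Available', '120.0'],
    'Property Name': ['A', 'B', 'Not Available'],
})

# Replace the placeholder string with NaN everywhere
cleaned = raw.replace({'Not Available': np.nan})

# Convert the column that should be numeric; unparseable entries become NaN
cleaned['Site EUI'] = pd.to_numeric(cleaned['Site EUI'], errors='coerce')

print(cleaned.dtypes)
print(cleaned.isna().sum())
```

Converting only after the placeholder has been replaced keeps a single stray string from forcing the whole column to remain text.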
2 Data Exploration
Because we are mainly concerned with the Energy Star Score, the first chart shows the distribution
of this measure across the 9,642 buildings in the dataset that have a score.
2.1.1 Potential Issue with Energy Star Scores
The Energy Star Score is a measure based on the self-reported energy usage of a building (see
this documentation for a thorough discussion) which is supposed to represent the percentile energy
performance rank of a building. Given the above distribution, either this dataset has a high portion
of extraordinarily efficient buildings, or the measure is not objective. Roughly 7% of the buildings
have a score of exactly 100, but we would expect the distribution to be uniform across scores
because this is a percentile measure.
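One way to check the uniformity argument is to compare the share of buildings at each score with the 1% that an evenly spread percentile rank would give. The sketch below uses synthetic scores as a stand-in for the 'ENERGY STAR Score' column, so its numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the score column (integer scores from 1 to 100)
rng = np.random.default_rng(0)
scores = pd.Series(rng.integers(1, 101, size=10_000))

# Share of buildings at each score; an evenly spread percentile gives about 1% each
share = scores.value_counts(normalize=True).sort_index()
expected = 1 / 100

print('Share at score 100: {:.3f}'.format(share.loc[100]))
print('Largest deviation from uniform: {:.3f}'.format((share - expected).abs().max()))
```

Applied to the real column, a share of roughly 7% at a score of 100 would stand out clearly against the 1% reference.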
In contrast to the Energy Star Score, the Energy Use Intensity (EUI) is based on actual energy
use as determined by the utility. It is straightforward to calculate: take the total annual energy
use (in kBtu) and divide by the square footage of the building. The EUI is meant for normalized
energy use comparisons between buildings of the same building type. This measure is likely more
objective because it uses actual measured consumption. In the graph below we can see that the
distribution of Energy Use Intensity is approximately normal, which is what we would expect from
a random variable.
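The EUI calculation described above is a single division. A minimal sketch, with a hypothetical helper name and made-up numbers:

```python
def site_eui(annual_energy_kbtu, floor_area_ft2):
    """Energy Use Intensity in kBtu per square foot."""
    if floor_area_ft2 <= 0:
        raise ValueError('floor area must be positive')
    return annual_energy_kbtu / floor_area_ft2

# Made-up building: 4,000,000 kBtu per year over 50,000 square feet
print(site_eui(4_000_000, 50_000))  # 80.0
```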
In [4]: figsize(8, 8)
Even though there are issues with the Energy Star Score, this analysis will focus on that
measure. However, further investigative work should be carried out on these scores to determine why
the distribution is unusual. Given the suspect distribution of this value, it would be worthwhile to
find a better overall measure of a building’s energy performance. We can still try to predict the
Energy Star Score, but just because we can predict a value does not mean it is useful!
2.2 Energy Star Score by Building Type
To determine whether certain building types tend to score better or worse on the Energy Star
Score, we can plot the distribution of Energy Star Scores by building type. The following density
plot shows the distribution of scores for building types with more than 80 measurements in the
data.
The kernel density plot x-limits do extend beyond the range of actual scores because of the
kernel density estimation method used to make the probability density distribution. The actual
values in a density plot can be difficult to interpret, and it is more instructive to focus on the
distribution/shape of the figure.
figsize(14, 10)
This gives us our first conclusion: office buildings tend to have much higher Energy Star Scores
than hotels and senior care communities. Residence halls and non-refrigerated warehouses also
have higher scores while multifamily housing has a bi-modal distribution of scores similar to the
overall distribution. It would be useful to see if this holds across a larger sample size, and to get
more data to figure out why some building types tend to do better.
In the exploration notebook, I identified a negative correlation between the Site
EUI and the Energy Star Score. We can plot the Site EUI by building type to see if there is also a
difference in Site EUI between building types.
# Plot the site EUI density plot for each building type
for b_type in types:
    # Remove outliers before plotting
    subset = data[(data['Weather Normalized Site EUI (kBtu/ft²)'] < 300) &
                  (data['Primary Property Type - Self Selected'] == b_type)]
plt.xlabel('Site EUI (kBtu/ft^2)'); plt.ylabel('Density');
plt.title('Density Plot of Site EUI');
This plot provides us with another conclusion: there appears to be a negative correlation
between the Site EUI and the Energy Star Score, based on comparing the two distributions between
building types. Building types with lower Site EUIs tend to have higher Energy Star Scores. The
higher the energy use intensity (which is energy use / area), the "worse" the building’s energy
efficiency performance. We can visualize the relationship between the Energy Star Score and the Site
EUI in a scatterplot.
subset = subset.rename(columns={'Primary Property Type - Self Selected': 'Proper
The plot shows the expected negative relationship between Energy Star Score and Site EUI.
This relationship looks like it holds across building types. To quantify the relationship, we can
calculate the Pearson Correlation Coefficient between the two variables. This is a measure of
linear correlation which shows both the strength and the direction of the relationship. We will look
at the correlation coefficient between Energy Star Scores and several measures.
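For reference, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it directly with NumPy on made-up values; the helper name pearson_r is illustrative, and in practice pandas or NumPy built-ins give the same result.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_c = x - x.mean()
    y_c = y - y.mean()
    return (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# Made-up values: scores fall as Site EUI rises
site_eui = [40, 60, 80, 100, 120]
score = [95, 80, 70, 45, 30]

r = pearson_r(site_eui, score)
print(round(r, 3))
```

A value near -1 indicates a strong negative linear relationship, a value near 0 indicates little linear relationship, and the sign gives the direction.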
'Weather Normalized Site EUI (kBtu/ft²)',
'Weather Normalized Site Electricity Intensity (kWh/ft²)',
'Largest Property Use Type - Gross Floor Area (ft²)',
'Natural Gas Use (kBtu)',
'ENERGY STAR Score']
subset = data[features].dropna()
subset = subset[subset['Primary Property Type - Self Selected'].isin(types)]
# Remove outliers
subset = subset[subset['Site EUI'] < 300]
22  Residence Hall/Dormitory  Floor Area             -0.071806
23  Residence Hall/Dormitory  Natural Gas            -0.391847
24  Residence Hall/Dormitory  Site EUI               -0.876173
25  Senior Care Community     Electricity Intensity  -0.572131
27  Senior Care Community     Floor Area              0.105373
28  Senior Care Community     Natural Gas            -0.283634
29  Senior Care Community     Site EUI               -0.566617
This shows the correlation between Energy Star Score and four different measures by building
type. For all buildings we see the following relationships: Energy Star Score is strongly negatively
correlated with the Electricity Intensity and the Site EUI. The strength of the natural gas correlation
varies by building type, but natural gas usage tends to be negatively correlated with the Energy Star
Score as well. The relationship between floor area and the Energy Star Score is weak for all building
types.
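A per-building-type correlation table of this kind can be produced by grouping on the building type and correlating each measure with the score inside each group. The sketch below uses made-up data and shortened, hypothetical column names; it is not the code that generated the table above.

```python
import pandas as pd

# Made-up data with hypothetical short column names
df = pd.DataFrame({
    'Property Type': ['Office'] * 4 + ['Hotel'] * 4,
    'Site EUI': [50, 70, 90, 110, 80, 100, 120, 140],
    'Floor Area': [10, 30, 20, 40, 25, 15, 35, 30],
    'ENERGY STAR Score': [90, 75, 60, 40, 70, 55, 35, 20],
})

# Pearson correlation of each measure with the score, within each building type
rows = []
for b_type, group in df.groupby('Property Type'):
    for measure in ['Site EUI', 'Floor Area']:
        rows.append({
            'Property Type': b_type,
            'measure': measure,
            'correlation': group[measure].corr(group['ENERGY STAR Score']),
        })

by_type = pd.DataFrame(rows)
print(by_type)
```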
In [10]: # Remove correlations from the dataframe that are above corr_val
def corr_df(x, corr_val):
    # Do not remove correlations with the Energy Star Score
    y = x['ENERGY STAR Score']
    x = x.drop(columns = ['ENERGY STAR Score'])
            val = abs(item.values)
            # If correlation is above the threshold, add to list to remove
            if val >= corr_val:
                drop_cols.append(col.values[0])
    return x
The procedure removed 18 highly collinear columns from the data. The next step is to
determine if we can build machine learning models that can use these features to predict the target,
Energy Star Score.
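The idea behind removing collinear features can be sketched compactly: compute the absolute correlation matrix, look at each pair once, and drop one column from every pair at or above a threshold. This is a simplified stand-in rather than the corr_df function used in this report; the name drop_collinear and the 0.6 threshold are assumptions for illustration, and the target column is assumed to have been set aside beforehand.

```python
import numpy as np
import pandas as pd

def drop_collinear(features, threshold):
    """Drop one column from every pair whose absolute correlation meets the threshold."""
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return features.drop(columns=to_drop), to_drop

# Made-up features: 'b' is nearly a copy of 'a', 'c' is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.01, size=200),
    'c': rng.normal(size=200),
})

reduced, dropped = drop_collinear(demo, threshold=0.6)
print(dropped)
```

Which member of a correlated pair is dropped depends on column order, so the order of the input columns determines which features survive.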
We could also try clustering on the buildings, but for now we will stick to supervised machine
learning methods. Unsupervised methods would be an interesting avenue for further study.
3.1 Feature Preprocessing
To prepare the data for machine learning, we will use a few standard preprocessing steps:
1. Subset to numerical columns and 3 categorical columns. I primarily want to focus on using
the numeric values to predict the score, but I will also include categorical columns including
the building type because we saw that is related to the score.
2. One-hot encoding of categorical variables.
3. Split into training and testing sets (after extracting the buildings with no score). We will be
using 75% training and 25% testing.
4. Impute the missing values. There are many methods for imputation, but I will use the
straightforward median imputation. We need to train the imputer only on the training data
and make transformations of both the training data and the testing data.
5. Return the values for training and testing the model.
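The steps listed above can be sketched as follows on a small made-up frame. The column names are hypothetical, and SimpleImputer from scikit-learn is used here as one way to do median imputation; the report's own preprocessing function may differ in its details.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up data with hypothetical column names
df = pd.DataFrame({
    'Site EUI': [80.0, np.nan, 120.0, 60.0, 95.0, np.nan, 70.0, 110.0],
    'Property Type': ['Office', 'Hotel'] * 4,
    'score': [75, 40, 30, np.nan, 55, 20, 85, np.nan],
})

# Step 2: one-hot encode the categorical column
encoded = pd.get_dummies(df, columns=['Property Type'], dtype=float)

# Step 3: set aside buildings with no score, then split 75% / 25%
missing_scores = encoded[encoded['score'].isna()].drop(columns='score')
scored = encoded[encoded['score'].notna()]
X = scored.drop(columns='score')
y = scored['score']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Step 4: fit the median imputer on the training data only, then transform both sets
imputer = SimpleImputer(strategy='median')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

print(X_train.shape, X_test.shape)
```

Fitting the imputer on the training split alone keeps information from the test set out of the medians used to fill gaps.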
'Property Id',
'ENERGY STAR Score'])
We will be training the model with over 7000 buildings using 77 features (one-hot encoding
expands the number of features substantially because it creates one feature for every value of a
categorical variable). The training and testing data have the same number of features, which is a
good sanity check!
Baseline Error: 25.9698.
If our model cannot beat an average error of 26, then we will need to rethink our approach.
That is not very good! We will move on to more serious machine learning models. The trade-off
as model complexity increases is that while accuracy tends to improve, interpretability
decreases.
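A baseline of this kind is typically a constant guess scored with the mean absolute error. The surviving text does not show how the baseline above was computed, so the sketch below assumes a guess equal to the training median; the scores are made up.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

# Made-up scores
y_train = np.array([10, 40, 70, 90, 100])
y_test = np.array([20, 60, 80, 100])

# Constant baseline (an assumption): always guess the training median
baseline_guess = np.median(y_train)
baseline_error = mean_absolute_error(y_test, np.full(len(y_test), baseline_guess))
print('Baseline Error: {:0.4f}.'.format(baseline_error))
```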
Random Forest Error: 10.6761.
The error is substantially lower than the baseline, so we can conclude that using machine
learning to predict the Energy Star Score is possible given the dataset. The test error shows that our
model can predict the Energy Star Score to within about 10.7 points of the true value on average.
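The comparison between a random forest and a constant baseline can be reproduced in miniature. The sketch below trains on synthetic data, and its hyperparameters are illustrative rather than the ones used in this report, so only the pattern carries over, not the numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: the target falls as the first feature rises, plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 300, size=(500, 3))
y = np.clip(100 - 0.3 * X[:, 0] + rng.normal(scale=5, size=500), 1, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Illustrative hyperparameters (assumption)
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
rf_reg_pred = rf_reg.predict(X_test)

# Mean absolute error of the model and of a constant median guess
rf_error = np.mean(np.abs(rf_reg_pred - y_test))
baseline_error = np.mean(np.abs(np.median(y_train) - y_test))
print('Random Forest Error: {:0.4f}.'.format(rf_error))
print('Baseline Error: {:0.4f}.'.format(baseline_error))
```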
As another evaluation of our model, we can calculate the R^2 score between the predictions
and the true values. This gives us the proportion of variance explained by the model.
A higher value (the max is 1) indicates that our model is better able to explain the relationship
between inputs and outputs.
The interpretation of the R^2 value is that our model explains 86% of the variance in the Energy
Star Score. The rest of the variation is due to noise or latent (hidden) variables that we do not have
access to in the data.
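The R^2 score is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. A minimal sketch with made-up values (the helper name r_squared is illustrative):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 minus the ratio of residual sum of squares to total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Made-up true scores and predictions
y_true = [20, 40, 60, 80, 100]
y_pred = [30, 35, 65, 70, 95]
print(round(r_squared(y_true, y_pred), 3))  # 0.931
```

Here the residual sum of squares is 275 and the total sum of squares is 4000, so the predictions account for about 93% of the variance in these made-up scores.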
In [18]: figsize(14, 8)
# Plot predictions
ax = plt.subplot(121)
ax.hist(rf_reg_pred, bins = 100)
ax.set_xlabel('Score'); ax.set_ylabel('Count');
ax.set_title('Predicted Distribution')
The model can predict the mode of scores at 100, but it does not seem to be able to predict the low
scores very well. Fixing this imbalance could be an area of future work. We could also try different
models to reduce the error, but for now we have shown that the problem is amenable to machine
learning.
figsize(8, 8)
Our model can now make a prediction for any new building as long as it has the measurements
recorded in the original data. We can use this trained model to infer the score and we know that
our estimate should be within 10 points of the true value.
'importance': importances})
# Plot formatting
plt.xticks(rotation = 90)
plt.xlabel('feature'); plt.ylabel('importance');
plt.title('Top Feature Importances')
plt.show();
return feature_df
In [22]: reg_features.head(10)
According to the random forest model, the most important features for predicting the Energy
Star Score of a building are the Site EUI, the Electricity Intensity, the Property Use Type, the
Natural Gas Use, and the Area. These are in line with the variables that we saw are the most
correlated with the Energy Star Score.
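Feature importances of this kind come from the fitted forest's feature_importances_ attribute, paired with the feature names and sorted. The sketch below uses synthetic data and hypothetical feature names in which only one feature drives the target, so the ranking is known in advance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with hypothetical feature names; only 'Site EUI' drives the target
rng = np.random.default_rng(0)
feature_names = ['Site EUI', 'Floor Area', 'noise']
X = rng.uniform(0, 1, size=(300, 3))
y = 100 - 80 * X[:, 0] + rng.normal(scale=2, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Pair each feature with its importance and sort from most to least important
feature_df = (
    pd.DataFrame({'feature': feature_names,
                  'importance': model.feature_importances_})
      .sort_values('importance', ascending=False)
      .reset_index(drop=True)
)
print(feature_df)
```

The importances sum to 1, so each value can be read as a feature's share of the total impurity reduction across the forest.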
Let’s take a look at the relationship between the Energy Star Score, the Site EUI, the electricity
intensity, the Natural Gas Use, and the area of the building. In the exploration, we noticed that
these relationships are closer to linear when the log transform of the variable is taken. We will
apply that transformation and look at the resulting relationships.
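To illustrate why a log transform can help, the sketch below builds synthetic data in which the score falls with the logarithm of the measure and compares the linear correlation before and after the transform. The data and coefficients are made up for illustration only.

```python
import numpy as np

# Synthetic data where the score falls with the logarithm of the measure
rng = np.random.default_rng(0)
site_eui = rng.uniform(10, 300, size=500)
score = 120 - 20 * np.log(site_eui) + rng.normal(scale=2, size=500)

# Linear correlation with the raw measure and with its logarithm
r_raw = np.corrcoef(site_eui, score)[0, 1]
r_log = np.corrcoef(np.log(site_eui), score)[0, 1]
print('raw: {:.3f}, log: {:.3f}'.format(r_raw, r_log))
```

Note that np.log is only defined for positive values, so zero or negative entries need to be handled before the transform.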
subset = new_data[features].dropna()
# Log Transforms
subset['log Site EUI'] = np.log(subset['Site EUI'])
subset['log Electricity Intensity'] = np.log(subset['Electricity Intensity'])
subset['log Floor Area'] = np.log(subset['Floor Area'])
subset['log Natural Gas'] = np.log(subset['Natural Gas'])
This graph is not that informative. We can see that the Energy Star Score is negatively
correlated with the Site EUI, the Electricity Intensity, and the natural gas usage. There is no significant
relationship between the floor area and the Energy Star Score.
The overall results of regression show us that it is possible to accurately infer the Energy Star
Score of a building given the data available. The model results are not easily explained, but the
most important features are in line with those found to be correlated with the Energy Star Score.
5 Classification of Energy Star Scores
As an additional exercise, we can see if it is possible to place buildings into classes based on their
Energy Star Scores. We will use a simple grading scheme where every 20-point interval is a grade,
giving us five grades in total.
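A grading scheme like this can be expressed with pandas.cut. The exact interval boundaries and the direction of the letters are assumptions here (1 to 20 as F up to 81 to 100 as A); the report's own preprocessing may place the edges differently.

```python
import pandas as pd

scores = pd.Series([1, 20, 21, 55, 80, 81, 100])

# Assumed boundaries: 1-20 F, 21-40 D, 41-60 C, 61-80 B, 81-100 A
grades = pd.cut(scores,
                bins=[0, 20, 40, 60, 80, 100],
                labels=['F', 'D', 'C', 'B', 'A'])
print(grades.tolist())  # ['F', 'F', 'D', 'C', 'B', 'A', 'A']
```

By default pandas.cut uses right-closed intervals, so a score of exactly 20 falls in the lowest bin.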
Classification Preprocessing
y = X['grade']
return X_train, X_test, y_train, y_test, missing_scores, feature_names
Out[26]: A 2254
B 1690
C 1280
F 1057
D 950
Name: grade, dtype: int64
The most common ’grade’ for a building is an A and the least common is a D. The skew
towards higher Energy Star Scores is visible once again (self-reporting does not seem to be an
objective strategy).
# Make predictions
rf_clf_pred = rf_clf.predict(X_test)
# Plot the confusion matrix as an image
plt.figure(figsize=(8, 8))
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title, size = 22)
# Formatting
plt.tight_layout()
plt.xticks(size = 18)
plt.yticks(size = 18)
plt.ylabel('True label', size = 18)
plt.xlabel('Predicted label', size = 18)
print('Accuracy: {:0.2f}.'.format(accuracy))
print('F1 score: {:0.2f}.'.format(f1_value))
Accuracy: 0.59.
F1 score: 0.58.
Accuracy is a straightforward metric, and it shows that our model chooses the correct class among
the five about 59% of the time. The confusion matrix is useful because it shows the model
mistakes. The most common mistakes are predicting an ’A’ when the true label was a ’B’, and
predicting a ’B’ when the true label was a ’C’. The random forest classifier appears to do very well
and can accurately infer the Energy Star Score of a building if provided with the building data.
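The accuracy, F1 score, and confusion matrix can all be computed with scikit-learn. The sketch below uses made-up labels; the 'weighted' averaging for the F1 score is an assumption, since the surviving code does not show which averaging was used.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

labels = ['A', 'B', 'C', 'D', 'F']

# Made-up true and predicted grades
y_true = ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'F', 'F', 'A']
y_pred = ['A', 'A', 'A', 'B', 'B', 'C', 'D', 'F', 'D', 'A']

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred, labels=labels)
accuracy = accuracy_score(y_true, y_pred)
# 'weighted' averaging is an assumption, not confirmed by the report
f1_value = f1_score(y_true, y_pred, average='weighted')

print(cm)
print('Accuracy: {:0.2f}.'.format(accuracy))
print('F1 score: {:0.2f}.'.format(f1_value))
```

Off-diagonal entries show the mistakes: in this toy example, the entry in row B, column A counts buildings whose true grade was B but which were predicted as A.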
In [30]: missing_classes = rf_clf.predict(missing_scores)
pd.Series(missing_classes).value_counts()
Out[30]: A 900
F 471
B 341
C 232
D 160
dtype: int64
Any new building can now be classified by the model as long as it is provided with the
appropriate data. We can be confident in the predictions of the model as shown by the high performance
on the testing data.
6 Conclusions
We set out to answer the question: Can we build a model to predict the Energy Star Score of a
building and what variables provide us the most information about the score?
Given the exploration in this notebook, I conclude that yes, we can create a model to accurately
infer the Energy Star Score of a building and we have determined that the Site Energy Use
Intensity, the Electricity Intensity, the floor area, the natural gas use, and the Building Type are
the most useful measures for determining the Energy Star Score.
The highlights from the report are:
1. Energy Star Scores are skewed high with a disproportionate global maximum at 100 and a
secondary maximum at 1.
2. Offices, residence halls, and non-refrigerated warehouses have higher energy star scores than
senior care communities and hotels with scores of multi-family housing falling in between.
3. The Site Energy Use Intensity, the Electricity Intensity, and the natural gas usage are all
negatively correlated with the Energy Star Score.
4. A random forest regressor trained on the training data was able to achieve an average absolute
error of 10 points on a hold-out testing set, which was significantly better than the baseline
measure.
5. If provided with data for a new building, a trained model can accurately infer the Energy
Star Score.
This report also identified areas for follow-up. These include finding an objective measurement
of overall building energy performance and determining why the Energy Star Score distributions
vary between building types. That’s all for now!