Building Energy Data Analysis Part Three
Machine Learning and Modeling
Author’s Note: Following is Part 3 of a comprehensive exploratory data analysis on a dataset of energy consumption for 8 office buildings. Part 1 laid out the background for the project, introduced the data, and posed a series of questions to guide the analysis and Part 2 examined temporal trends and correlations between weather conditions and energy consumption. Part 3 will focus on quantifying the relationships previously identified and delves into using machine learning methods for predicting future energy consumption. This work is part of my involvement on the EDIFES (Energy Diagnostic Investigator for Energy Efficiency Project) at Case Western Reserve University. Full code can be found on GitHub, and as always, I appreciate feedback (especially constructive criticism) at wjk68@case.edu!
Introduction
The overarching question for the office building energy EDA is:
What variables (weather and time) are most strongly correlated with energy consumption in office buildings, and is it possible to create a model to predict the energy use given the time and weather conditions?
The plotting and quantitative analysis completed in Part 2 indicated that temperature and global horizontal irradiance (a measure of the amount of sunlight striking a surface) are strongly positively correlated with energy consumption during the summer. As a reminder, the following plots show the Pearson correlation coefficients between each of the weather variables and energy consumption in all eight buildings by season. A positive value of the coefficient indicates a positive linear correlation (as the weather variable increases, the energy consumption does as well), and a negative value indicates a negative linear correlation.
The weather variables are the following:
ws = wind speed; pwat = precipitable water; rh = relative humidity; temp = temperature; gti = global tilt irradiance; dif = global diffuse irradiance; ghi = global horizontal irradiance
Based on the correlations, a good way to begin modeling would be to create a linear model between temperature and energy consumption during the summer. Because of the high value of the correlation coefficient, we would expect that knowing only the temperature would allow us to make decent predictions of the energy use. However, before we can actually get into the modeling, we should take a step back and look at the types of modeling available.
Modeling Approaches
The task of predicting energy consumption from the time and weather is what is known as supervised learning: we have a set of explanatory variables (called features) and known response variables (called targets) to predict. The features in this case are the weather and time conditions and the target is the energy consumption. This is a regression task because the targets are continuous quantities. There are dozens of algorithms for supervised regression that can be easily implemented in Python and the R statistical language. These range in complexity from linear models with simple equations and understandable parameters, to deep neural networks which are essentially black boxes that accept data as an input, perform a sequence of mathematical operations, and produce an (often correct) answer with no understandable justification. For this project I want to focus on both accuracy and interpretability of results in the problem domain, and I will evaluate three different modeling approaches covering the spectrum of complexity: Linear Regression, Random Forest Regression, and Support Vector Machine Regression.
Simple and Multivariate Linear Regression
The simplest model and therefore the place to start when using machine learning is linear regression which aims to predict the value of the target variable by a weighted linear addition of the explanatory variables. Linear Regression can use a single explanatory variable (simple), or it can use many (multivariate). The general equation for a linear model is
y = w0 + w1 * x1 + w2 * x2 + … + wn * xn
where w0 is the intercept, xn represents the explanatory variables, wn is the weight assigned to each explanatory variable, and y is the response variable. In the case of the Building Energy Dataset, y is the energy consumption, x is the weather or time features, and w is the weight (slope) assigned to each feature. The weight in a linear model represents the slope of the relationship, or how much a change in the x variable affects the y variable. One nice aspect of linear modeling is that the impact of a change in one independent
variable can be directly observed by measuring the change in the response.
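As a quick illustration (using simulated data, not the building dataset), here is a minimal sketch of fitting a simple linear model in R and reading its weight:
{r}
# Toy illustration on simulated data (not the building dataset):
# reading the weight of a simple linear model
set.seed(1)
toy <- data.frame(temp = runif(100, min = 10, max = 40))
toy$energy <- 500 + 25 * toy$temp + rnorm(100, sd = 50)

toy_model <- lm(energy ~ temp, data = toy)
coef(toy_model)
# The 'temp' weight (close to 25 by construction) is the expected change in
# energy for a one-degree increase in temperature; the intercept is the
# predicted energy at a temperature of 0.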
Random Forest Regression
To understand the powerful random forest, you first need to grasp the concept of a decision tree. The best way to describe a single decision tree is as a flowchart of questions about the explanatory variable values of a data point
that leads to a classification/prediction. Each question (known as a node) has a yes/no answer based on the value of a particular variable (see image below if I’ve already lost you). The two answers form branches leading away from the node. Eventually, the tree terminates in the final classification/prediction node called a leaf. A single decision tree can be arbitrarily large and deep depending on the number of features and the number of classes. They are adept at both classification and regression and can learn a non-linear decision boundary (they actually learn many small linear decision boundaries which overall are non-linear). However, a single decision tree is prone to overfitting, especially as the depth increases because the decision tree is flexible leading to a tendency to memorize the training data.
To solve this problem, ensembles of decision trees are combined into a powerful classifier known as a random forest. Each tree in the forest is trained on a randomly chosen subset of the training data (either with replacement, called bootstrapping, or without) and on a subset of the
features. This increases variability between trees making the overall forest more robust and less prone to overfitting. In order to make a prediction, the random forest passes the variables of an observation to all trees, and takes an average of the votes of each tree (this is known as bagging; the random forest can also weight the votes of each tree with respect to the confidence the tree has in its prediction). Overall, the random forest is fast to train (trees can be grown in parallel), intuitive, has a moderate level of interpretability, and performs well on both classification and regression tasks. The random forest should be one of the first models tried on any machine learning problem and is generally my second approach after a linear model. A model of a single decision tree is presented below:
This particular decision tree is used to make predictions about whether or not an individual will default on a loan based on personal features.
There are a number of hyperparameters that must be specified for the forest ahead of time with the most important the number of trees in the forest, the number of features considered by each tree, the depth of the tree, and the minimum number of observations permitted at each leaf of the tree. These can be selected by training many different models with varying hyperparameters and selecting the combination that performs best on cross-validation or a testing set. A random forest performs implicit feature selection and can return the relative importances of the features (we will take a look at these later in the report) so it can be used as a method to reduce dimensions for additional algorithms. Much as a single analyst might make a bad judgement that is biased in one direction but averaging the predictions of hundreds of analyst will reduce the individual biases, a forest of decision trees makes better predictions on average than any individual decision tree.
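For illustration only, here is a minimal sketch of fitting a random forest regressor in R with the randomForest package (the models later in this report are actually fit with scikit-learn in Python, and the data frame names below are hypothetical):
{r}
# Sketch of a random forest regressor using the randomForest package;
# 'train_features', 'train_labels', and 'test_features' are hypothetical
library(randomForest)

set.seed(42)
rf_sketch <- randomForest(x = train_features, y = train_labels$forecast,
                          ntree = 200,        # number of trees in the forest
                          mtry = 5,           # features considered at each split
                          nodesize = 4,       # minimum observations in a leaf
                          importance = TRUE)  # track feature importances

# Predictions are the average of the individual trees' predictions (bagging)
rf_preds <- predict(rf_sketch, newdata = test_features)
# Relative importance of each feature
importance(rf_sketch)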
Support Vector Machine Regression
A support vector machine (SVM) regressor is the most complicated and the least interpretable of the ML models explored in this report (this explanation is not crucial to understand the report, so feel free to skip the rest of this paragraph). SVMs can be used for both classification and regression and operate on the basis of finding a hyperplane to separate classes. The idea is that any decision boundary becomes linear in a high-dimensional space. For example, a linear decision boundary in three-dimensional space is a plane. The SVM projects the features of the observations into a higher dimensional space using a kernel, which is simply a transformation of the data. The model then finds the plane that best separates the data by maximizing the margin, the minimum distance between the nearest member of each class and the decision boundary. The support vectors in the name of the algorithm refer to the points closest to the decision boundary which are referred to as the support and have the greatest influence on the position of the hyperplane. SVM regressors work by fitting a non-parametric regression model to the data and trying to minimize the distance between the model and all training instances. SVM models are more complex than either a linear regressor or a random forest regressor and have almost no interpretability. The transformation of the features into a higher-dimensional space using a kernel removes all physical representation of the features. SVM models are truly black boxes, but have high accuracy on small datasets with limited amounts of noise. The RBF or Radial Basis Kernel, is the most popular kernel in use and is the default in many implementations. Examples of the classification ability of SVM using different kernels is shown below:
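As a rough sketch (again, the SVR model used later in this report is fit with scikit-learn), an RBF-kernel support vector regressor can be trained in R with the e1071 package; the data frame names here are hypothetical:
{r}
# Sketch of support vector regression with an RBF (radial) kernel using e1071;
# 'train_data' and 'test_data' are hypothetical data frames
library(e1071)

svr_sketch <- svm(forecast ~ ., data = train_data,
                  type = 'eps-regression',  # SVM used for regression
                  kernel = 'radial')        # the RBF kernel
svr_preds <- predict(svr_sketch, newdata = test_data)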
Linear Regression Modeling
Simple Linear Regression
To start with, we will try to predict the daily average energy consumption of a building from a single variable. (Although we have data in 15-minute intervals, average daily energy use could be an easier-to-parse metric.) Temperature is the best place to start because it had the greatest correlation with energy consumption. Therefore, the problem here is trying to predict average daily energy consumption during the summer and winter from average daily temperature. The following code creates a linear model between daily average energy use and daily average temperature for each building in both the summer and winter (for comparison purposes). ‘forecast’ in the code refers to the final cleaned energy column in each dataset. It was first named ‘forecast’ rather than ‘energy’ when the project started (not by me!), and tradition is difficult to break!
{r}
# A simple linear regression model explaining daily energy use by temperature
linear_temp_model <- function(df, return_model = FALSE) {
  # Filter out the summer months and find daily averages
  df_summer <- filter(df, lubridate::month(timestamp) %in% c(6, 7, 8)) %>%
    select(timestamp, forecast, temp) %>%
    mutate(day = lubridate::yday(timestamp)) %>%
    group_by(day) %>%
    summarize_all(funs(mean))
  # Create the model
  summer_model <- lm(96 * forecast ~ temp, data = df_summer)
  # Filter out the winter months and find daily averages
  df_winter <- filter(df, lubridate::month(timestamp) %in% c(12, 1, 2)) %>%
    select(timestamp, forecast, temp) %>%
    mutate(day = lubridate::yday(timestamp)) %>%
    group_by(day) %>%
    summarize_all(funs(mean))
  # Create the model
  winter_model <- lm(96 * forecast ~ temp, data = df_winter)
  if (return_model) {
    return(summer_model)
  }
  # Return the temp coefficient and the r-squared performance on the training data
  return(c(summer_model$coefficients[[2]], summary(summer_model)$r.squared,
           winter_model$coefficients[[2]], summary(winter_model)$r.squared))
}

# Create a dataframe to hold all the results
all_temp_slopes <- as.data.frame(matrix(ncol = 5))
names(all_temp_slopes) <- c('bldg', 'summer_slope', 'summer_r2', 'winter_slope', 'winter_r2')

# Iterate through the buildings and create a summer and winter model for each
filenames <- dir('../data/')
for (file in filenames) {
  df <- read_data(file)
  name <- unlist(strsplit(file, '-|_'))[2]
  results <- linear_temp_model(df)
  all_temp_slopes <- add_row(all_temp_slopes, bldg = name,
                             summer_slope = round(results[1], 2),
                             summer_r2 = round(results[2], 2),
                             winter_slope = round(results[3], 2),
                             winter_r2 = round(results[4], 2))
}

# Remove the NA row from the dataframe
all_temp_slopes <- all_temp_slopes[-1, ]
knitr::kable(all_temp_slopes, caption = 'Summer and Winter Temp Linear Model Stats')
The above table provides both the slope of the models and the R-squared value. R-squared represents the amount of variation in the response variable (energy) that can be explained by the explanatory variable (temperature) in the model. In more meaningful terms, a summer R-Squared of 0.15 means that variations in temperature can explain 15% of the observed variation in energy use. A higher value means the model is “better” in the sense that it will have more accurate predictions because it is capturing the relationships in the data. The slope is also interpretable because it represents the change in energy consumption due to a 1 degree Celsius change in temperature. For example, for the SRP building, located in Phoenix, Arizona, a 1 C increase in daily average temperature during the summer will result in 100.8 kWh of increased energy consumption. At the national average price of $0.104/kWh, this is about $10.48 in extra cooling costs per day for each degree of temperature increase!
We can dig further into the model results for this particular building to see if a linear model is appropriate.
{r}
# Look at the model for a single building
srp_df <- read_data("f-SRP_weather.csv")
summer_model <- linear_temp_model(srp_df, return_model = TRUE)
# Display the relevant characteristics
summary(summer_model)

Call:
lm(formula = 96 * forecast ~ temp, data = df_summer)

Residuals:
    Min      1Q  Median      3Q     Max
 -643.6  -206.6   131.6   266.9   456.6

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1864.20 713.75 2.612 0.0105 *
temp 100.83 21.52 4.685 9.77e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 327.5 on 91 degrees of freedom
Multiple R-squared: 0.1943, Adjusted R-squared: 0.1855
F-statistic: 21.95 on 1 and 91 DF, p-value: 9.772e-06
Examining the results, we can see that temp has a p-value of 9.77E-06 in the model (farthest column to the right in the temp row of the coefficients). This means that temperature is a significant variable in the model and therefore is related to energy consumption. However, the overall R-Squared is low at 0.19, meaning that there are other factors that contribute to energy consumption in addition to temperature. The intercept in this case represents the daily energy use for a day with an average temperature of 0 C, not a likely occurrence in Phoenix during the summer! In addition to quantitative stats, there are plenty of ways to diagnose a model graphically in R. However, most of these graphs get pretty involved so I will stick to one showing the Cook’s Distance versus observation number. The Cook’s Distance measures how much the fitted model changes when a single observation is removed, which tells us how influential (and potentially outlying) that data point is. Outliers are of interest because they can drastically affect a model. Sometimes outliers are the effect of bad data, but they may also be legitimate data points that are extreme for one reason or another and should be closely examined to determine why they occurred. The graph of Cook’s Distance is below:
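For reference, this diagnostic can be produced directly from the fitted model in base R; a minimal sketch:
{r}
# Cook's Distance for every observation in the summer temperature model
cooks_d <- cooks.distance(summer_model)
# Base R diagnostic plot 4 shows Cook's Distance versus observation number
plot(summer_model, which = 4)
# The most influential observations
head(sort(cooks_d, decreasing = TRUE), 3)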
This shows us that there are 3 data points with a relatively large Cook’s Distance that could be affecting the model of daily energy versus temperature. To determine how large this effect may be, we can create models with and without the data points. The following code creates three models: one with all the data points, one with only the high leverage points, and one with only the data points without high leverage.
{r}
# Create a new dataframe with daily averages during the summer
srp_df_summer <- filter(srp_df, lubridate::month(timestamp) %in% c(6, 7, 8)) %>%
select(timestamp, forecast, temp) %>% mutate(day = lubridate::yday(timestamp)) %>%
group_by(day) %>%
summarize(daily_temp = mean(temp, na.rm = TRUE),
daily_energy = 96 * mean(forecast, na.rm = TRUE))

# Highlight the high leverage points
srp_df_summer$leverage <- 0
srp_df_summer[c(13, 20, 29), 'leverage'] <- 1

# Plot the high leverage points
ggplot(srp_df_summer, aes(daily_temp, daily_energy, color = as.character(leverage))) +
geom_jitter(shape = 19, size = 3) + geom_smooth(method = "lm", se = FALSE) +
scale_color_manual(values = c("blue", "red"), labels = c('normal', 'high')) +
xlab('Temp (degrees C)') + ylab('Daily Energy (kWh)') + labs(color = 'leverage') +
ggtitle('Phoenix Energy vs Temp Linear Model') + theme_stata(12) +
geom_smooth(aes(group = 1 ), method = 'lm', se = FALSE, color = 'black')
The black line indicates the model with all data points, the blue line is only the normal leverage points, and the red line is only the high leverage points. Although the red-point model is far off from the other models, the difference between the black and blue lines is minuscule, which shows us the three high-leverage points do not greatly affect the model. Combining this with the fact we have no evidence to suggest these points are erroneous means that we should not remove these outliers from the data.
Conclusion: From the R-Squared values, temperature alone is not a good predictor of average daily energy consumption in office buildings. The next step is to create a linear model that includes all weather variables to see if combining many features can yield a more accurate model for predicting daily energy use.
Multivariate Linear Modeling
The next model will use all the available weather variables to explain the variation in energy consumption. Again, the daily averages for each variable are used here rather than data every 15 minutes. The models are created in much the same way as the simple linear ones, with the model call changed to
{r}
# Create models between energy use and weather variables
summer_model <- lm(96 * forecast ~ ghi + dif + gti + temp + rh + pwat + ws, data = df_summer)
winter_model <- lm(96 * forecast ~ ghi + dif + gti + temp + rh + pwat + ws, data = df_winter)
We can examine the R-Squared values to determine if including all the weather variables achieves a significant improvement over only using temperature.
Well, it looks like using all the weather variables does not result in a significant improvement in performance. We can run an ANOVA (analysis of variance) test to quantify the difference between models.
{r}
# Create the two linear models for the SRP building
srp_temp_model <- linear_temp_model(srp_df, return_model = TRUE)
srp_weather_model <- linear_weather_model(srp_df, return_model = TRUE)
# Perform an analysis of variance test to determine if the model has improved
anova(srp_temp_model, srp_weather_model)

Analysis of Variance Table

Model 1: 96 * forecast ~ temp
Model 2: 96 * forecast ~ ghi + dif + gti + temp + rh + pwat + ws
Res.Df RSS Df Sum of Sq F Pr(>F)
1 91 9758698
2 85 8870406 6 888292 1.4187 0.2169
The test shows the Residual Sum of Squares (RSS) of the expanded model is lower, but the improvement is not significant because the p-value is 0.217. Generally, a p-value less than 0.05 is considered significant. A p-value represents the probability that an improvement at least this large would occur by chance if the extra variables had no real effect. The chance our improvement is due to randomness is therefore 21.7%, far above the threshold for significance.
As a last step with the linear modeling, we can examine one building model in detail to see the magnitude of the slopes and determine which variables are significant in the model.
{r}
summary(srp_weather_model)

Call:
lm(formula = 96 * forecast ~ ghi + dif + gti + temp + rh + pwat +
    ws, data = df_summer)

Residuals:
    Min      1Q  Median      3Q     Max
 -719.4  -218.1   109.5   239.5   584.3

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2251.332   1378.505    1.633  0.10613
ghi           14.627      5.305    2.757  0.00714 **
dif           -8.781      4.229   -2.076  0.04088 *
gti          -11.340      4.203   -2.698  0.00841 **
temp         110.570     35.656    3.101  0.00262 **
rh             7.320     13.376    0.547  0.58564
pwat          -3.450     13.511   -0.255  0.79907
ws           -63.905    116.159   -0.550  0.58366
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 323 on 85 degrees of freedom
Multiple R-squared: 0.2677, Adjusted R-squared: 0.2073
F-statistic: 4.438 on 7 and 85 DF, p-value: 0.0003031
There are 3 important weather variables: global horizontal irradiance, global tilt irradiance, and temperature. The rest of the variables are not useful in the model and could be removed without a decrease in predictive ability. Overall, the model for this building can only explain about 27% of the variation in daily energy consumption (adjusted R-squared of 0.21) based on daily weather conditions.
Conclusion: Even a multivariate linear model cannot accurately predict daily energy consumption. Different modeling approaches will be required! Moreover, predicting the average daily energy consumption requires averaging 96 observations. This method essentially reduces the amount of data available by a factor of 96 (the number of observations per day) because it averages out the variables over each day. In order to get more accurate predictions, I can look at each individual 15 minute interval of the day and include the temporal variables. The resulting predictions will be energy consumption for each fifteen minute interval as opposed to the daily average. The energy over each interval can then be summed over the course of a day to derive the daily energy consumption.
Machine Learning Model Selection
In order to develop an optimal model to predict the energy from the weather and time conditions for each 15-minute interval, it is best to try out several approaches. The three models that will be looked at are Linear Regression, Support Vector Machine Regression, and Random Forest Regression. Previous analysis with this dataset suggests the relationship between energy use and weather and time is non-linear and will not be explained by a linear regression, but there is no way to know for certain ahead of time.
The only way to choose the best model is to try them out on the data! The basic procedure is to split the data into training and testing sets, fit the models on the training data, make predictions on the test data, and evaluate the models using appropriate metrics. The metrics selected for evaluation are explained below:
- rmse: root mean square error; the square root of the average of the squared errors. This is a useful metric because it has the same units (kWh) as the true value, so it serves as a measure of how many kWh the average estimate is from the known true target.
- r-squared: the percentage of the variation in the dependent variable explained by the independent variables in the model. Measures how much of the relationships within the data are captured by the model.
- MAPE: mean absolute percentage error; the absolute value of the error for each prediction divided by the true value and multiplied by 100 to convert to a percentage. This metric indicates the percentage error in the prediction and is better than the rmse because it takes into account the relative magnitude of the target variable. (A short R sketch of all three metrics follows this list.)
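As a reference, here is a minimal sketch of how these three metrics can be computed from a vector of true values and a vector of predictions (the analysis below uses the Metrics package for rmse and mape, and a squared correlation for r-squared):
{r}
# Minimal sketch of the three evaluation metrics
rmse <- function(actual, predicted) {
  # Root mean square error, in the same units as the target (kWh)
  sqrt(mean((actual - predicted) ^ 2))
}

r_squared <- function(actual, predicted) {
  # Squared correlation between the true values and the predictions
  cor(actual, predicted) ^ 2
}

mape <- function(actual, predicted) {
  # Mean absolute percentage error
  mean(abs((actual - predicted) / actual)) * 100
}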
Data Preparation
Before comparing models for energy prediction, I need to prepare the data for machine learning. This will require several steps:
• One-hot encoding of categorical variables
• Transformation of cyclical variables
• Normalization of variables to zero mean and unit variance
• Separation into training and testing features and labels
One-Hot Encoding
One hot encoding is a tricky concept to understand at first that is best illustrated with an example. The goal is to convert categorical variables into numeric variables without creating an arbitrary ordering. The process takes this:
and converts it into this:
Each value for each categorical variable is represented as either a 0 or a 1. One-hot encoding is required because machine learning models cannot handle categorical data directly, and simply mapping the values to numbers imposes an arbitrary ordering on the feature values.
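A minimal sketch with the dummy.data.frame function (the same function used in the data preparation code later in this report) and a hypothetical categorical column:
{r}
# One-hot encode a hypothetical categorical column
library(dummies)

example <- data.frame(day_type = c('weekday', 'weekend', 'weekday'))
dummy.data.frame(example, names = 'day_type', sep = '_')
# Produces one 0/1 column per category: day_type_weekday and day_type_weekend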
Cyclical Variable Transformation
This is required because temporal variables, such as months in a year, are not properly represented by a 1–12 encoding. Month 12 is closer to Month 1 in terms of weather than Month 5, but using integer numbers places Months 5 and 12 closer together. Transforming the time variables into cosine and sine components creates the desired representation. The general equation for the sinusoidal components is
x_sin = sin(2 * pi * f * t), x_cos = cos(2 * pi * f * t)
where f is the frequency and t is the time. For months, the frequency is 1/12 because the pattern repeats every 12 months (for daily time in hours, the frequency is 1/24). The transformation of months into sinusoidal components is therefore
month_sin = sin(2 * pi * month / 12), month_cos = cos(2 * pi * month / 12)
Month 12 will now be represented as occurring closest to Months 1 and 11. The same procedure is required for days of the year and hours of the day.
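A quick sketch confirms that the sine/cosine encoding places Month 12 next to Month 1 and far from Month 5:
{r}
# Encode months 1, 5, and 12 as sine/cosine components
months <- c(1, 5, 12)
encoded <- data.frame(month_sin = sin(2 * pi * months / 12),
                      month_cos = cos(2 * pi * months / 12))
# Pairwise distances between the encoded months: 12 is now close to 1
# and far from 5
dist(encoded)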
Normalization
This involves either subtracting the mean and dividing by the standard deviation or subtracting the minimum value and dividing by the maximum minus the minimum. In the first approach, each feature column will have 0 mean and unit (1) variance. The second approach scales each feature value to between 0 and 1. This step is necessary to remove any disproportionate representations because of the varying units used in variables. In other words, a variable with units of millimeters might have a larger weight attached to
it than the same quantity measured in meters, simply because of the difference in scale. Normalization overcomes the problem of differing unit scales, and the data in this report is normalized to have a mean of 0 and a variance of 1.
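A minimal sketch of both approaches on a single hypothetical feature column:
{r}
# A hypothetical feature column on an arbitrary unit scale
x <- c(1200, 3400, 2100, 5000)

# Standardization: zero mean and unit variance (the approach used in this report)
standardized <- (x - mean(x)) / sd(x)

# Min-max scaling: each value mapped to the range [0, 1]
min_max <- (x - min(x)) / (max(x) - min(x))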
Training and Testing Split
Author’s Note: After completing this analysis, I realized that evaluating algorithms for predictive capability based on a random training/testing split of the data was not the best implementation. However, in the model selection phase of the report, I was more concerned with relative as opposed to absolute results. Therefore, comparing algorithms based on a random sampling of the data is valid because if a model cannot accurately predict data points drawn at random, it will not be able to predict electricity consumption into the future. When actually evaluating the best model, I used the last six months as the testing set and the rest of the data for training.
Finally, the dataset must be split into training and testing sets. During training, the model is allowed to see the answers (known targets) in order to learn the relationships between the explanatory and response variables. When testing, the model is asked to make predictions for a set of features it has not seen before. The targets for these features are known and therefore the performance metrics can be computed based on the discrepancy between the known target values and the predictions. We will be using a 0.7/0.3 random training/testing split. Each algorithm will be evaluated against the same training and testing data for a fair comparison.
The following code takes a dataset and prepares it for machine learning by implementing the steps outlined above. The output is a set of training and testing data.
{r}
# Take in a dataframe in standard format and create training and testing sets
get_features_labels <- function(df) {
# Enforce column specifications
# Create columns for day of the year, month, and year
df2 <- df %>% mutate(day = yday(timestamp), month = month(timestamp),
year = year(timestamp)) %>%
# Transform cyclical features (month, day, num_time)
mutate(day_sin = sin(2 * day * pi / 365),
day_cos = cos(2 * day * pi / 365),
month_sin = sin(2 * month * pi / 12),
month_cos = cos(2 * month * pi / 12),
num_time_sin = sin(2 * num_time * pi / 24),
num_time_cos = cos(2 * num_time * pi / 24))
# Convert day of week to a numeric value
df2$day_of_week <- as.numeric(unclass(df2$day_of_week))
# One hot encoding of categorical variables
df2 <- dummy.data.frame(df2, names = c('week_day_end', 'sun_rise_set'), sep = '_')
# Extract the known values, and timestamps for graphing
labels <- df2[c('timestamp', 'cleaned_energy', 'forecast')]
cols_to_remove <- c('elec_cons', 'elec_cons_imp', 'pow_dem', 'cleaned_energy',
'anom_flag', 'forecast', 'anom_missed_flag',
'sun_rise_set_NA')
# Change the timestamp to a numeric
df2$timestamp <- as.numeric(df2$timestamp)
# Remove the columns from the features
df2 <- df2[, -which(names(df2) %in% cols_to_remove)]
# Scale all the features to have 0 mean and 1 sd
df2 <- as.data.frame(scale(df2))
# Set a seed for reproducible results
set.seed(40)
# Split data into training and testing sets with 70% training
train_indices <- (createDataPartition(labels$forecast, p = 0.7))$Resample1
# Return the complete features and labels as testing and training sets
train_features <- df2[train_indices, ]
test_features <- df2[-train_indices, ]
train_labels <- labels[train_indices, ]
test_labels <- labels[-train_indices, ]
return(list('X_train' = train_features, 'X_test' = test_features,
'y_train' = train_labels, 'y_test' = test_labels))
}
Utilizing Python
A critical aspect of data science is knowing the right tools to use for
an application. In the case of machine learning implementations, Python handily beats R for ease of use and speed. Scikit-learn is an incredible machine learning library built for Python and contains all three models for this report. While it is not yet possible to pass R dataframes directly to Python,
the feather library makes the transfer possible with only two additional steps. Feather uses binary files to save dataframes, and reading and writing times
are greatly reduced compared to using csv files. The entire structure of a dataframe including column names is preserved using feather files.
The following code creates the training and testing datasets as R dataframes and saves them to feather files that can be read directly into Python as Pandas dataframes.
{r}
# Save training and testing dataframes to feather files for use in Python
for (file in filenames) {
# Create the training/testing features/labels
df <- read_data(file)
name <- unlist(strsplit(file, '-|_'))[2]
features_labels <- get_features_labels(df)
X_train <- features_labels$X_train
X_test <- features_labels$X_test
y_train <- features_labels$y_train
y_test <- features_labels$y_test
# Save the data
feather::write_feather(X_train, sprintf('../feather/%s_X_train.feather', name))
feather::write_feather(X_test, sprintf('../feather/%s_X_test.feather', name))
feather::write_feather(y_train, sprintf('../feather/%s_y_train.feather', name))
feather::write_feather(y_test, sprintf('../feather/%s_y_test.feather', name))
}
Python Code for Machine Learning Evaluation
Following is the Python code used to train the model on the training data and make predictions on the test data for all three algorithms. The prediction results are saved back to feather files for analysis/visualization in R.
{python}
# Import libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
import feather

# Directory with feather files
root_dir = 'feather/train_test/'
building_names = ['APS', 'CoServ', 'Kansas', 'NVE', 'PGE1', 'SDGE', 'SMUD', 'SRP']

# Create the three models with minimal hyperparameter selection
rf = RandomForestRegressor(n_estimators = 200, max_depth = 40, criterion = 'mse',
                           n_jobs = -1, verbose = 1)
lr = LinearRegression()
svr = SVR(kernel = 'rbf', verbose = True)

# Need to store RF feature importances
feature_importances = pd.DataFrame(columns = building_names)

# Iterate through all the buildings
for building in building_names:
    # Create a dataframe to hold the predictions
    predictions = pd.DataFrame(columns = ['rf', 'lr', 'svr'])
    # Read in the training/testing sets
    X_train = np.array(feather.read_dataframe(root_dir +
                                              '%s_X_train.feather' % building))
    X_test = np.array(feather.read_dataframe(root_dir +
                                             '%s_X_test.feather' % building))
    y_train = np.array(feather.read_dataframe(root_dir +
                                              '%s_y_train.feather' % building)['forecast'])
    y_test = np.array(feather.read_dataframe(root_dir +
                                             '%s_y_test.feather' % building)['forecast'])
    # Fit (train) the models and make predictions on the test data
    rf.fit(X_train, y_train)
    rf_predictions = rf.predict(X_test)
    # Save feature importances of the random forest
    building_feature_importances = rf.feature_importances_
    feature_importances[building] = building_feature_importances
    lr.fit(X_train, y_train)
    lr_predictions = lr.predict(X_test)
    svr.fit(X_train, y_train)
    svr_predictions = svr.predict(X_test)
    # Save the predictions to a dataframe
    predictions['rf'] = rf_predictions
    predictions['lr'] = lr_predictions
    predictions['svr'] = svr_predictions
    # Write the predictions
    feather.write_dataframe(predictions, 'feather/predictions/' +
                            '%s_predictions.feather' % building)

# Write the feature importances to a feather file for analysis
feather.write_dataframe(feature_importances,
                        'feather/feature_importances/feature_importances.feather')
The results are then read back into R for quantitative analysis and plotting!
{r}
# Read results back in from Python
feather_dir <- '../feather/predictions/'
metrics <- as.data.frame(matrix(ncol = 4))
names(metrics) <- c('building', 'metric', 'model', 'value')

# Iterate through all the predictions for all the buildings
for (file in filenames) {
name <- unlist(strsplit(file, '-|_'))[2]
# Dataframe to hold results for each building one at a time
results <- as.data.frame(matrix(nrow = 9, ncol = 4))
names(results) <- c('building', 'metric', 'model', 'value')
# Compare the true values and the predictions
true_values <- read_feather(sprintf('../feather/train_test/%s_y_test.feather',
name))$forecast
predictions <- read_feather(paste0(feather_dir, name, '_predictions.feather'))
predictions$true <- true_values
# Performance metrics
rf_rmse <- Metrics::rmse(predictions$true, predictions$rf)
lr_rmse <- Metrics::rmse(predictions$true, predictions$lr)
svr_rmse <- Metrics::rmse(predictions$true, predictions$svr)
rf_r2 <- cor(predictions$true, predictions$rf) ^ 2
lr_r2 <- cor(predictions$true, predictions$lr) ^ 2
svr_r2 <- cor(predictions$true, predictions$svr) ^ 2
rf_mape <- Metrics::mape(predictions$true, predictions$rf)
lr_mape <- Metrics::mape(predictions$true, predictions$lr)
svr_mape <- Metrics::mape(predictions$true, predictions$svr)
rmse <- c(rf_rmse, lr_rmse, svr_rmse)
r2 <- c(rf_r2, lr_r2, svr_r2)
mape <- c(rf_mape, lr_mape, svr_mape)
all_metrics <- c(rmse, r2, mape)
# Add results to building dataframe
results$building <- name
results$metric <- c(rep('rmse', 3), rep('r2', 3), rep('mape', 3))
results$model <- rep(c('rf', 'lr', 'svr'), 3)
results$value <- all_metrics
# Store all results in a single dataframe
metrics <- rbind(metrics, results)
}

# Remove the first row
metrics <- metrics[-1, ]
Visualization of Evaluation Metrics
The performance of each model could be viewed in a table, but it is easier to make relative comparisons using charts, and besides, graphs are more pleasing to look at!
The first plot is the root mean squared error of the three algorithms on six of the buildings.
{r}
metrics <- merge(metrics, metadata[, c('Name', 'location')], by.x = 'building',
by.y = 'Name', all.x = TRUE)

# RMSE Comparison Plot
ggplot(filter(metrics, metric == 'rmse' & building != "CoServ" & building != "SMUD"), aes(factor(model, levels = c('rf', 'svr', 'lr')))) +
geom_bar(aes(y = value, fill = model),
color = "black", width = 0.9, position = 'dodge', stat = 'identity') +
facet_wrap(~location) + xlab('Model') +
ylab("rmse (kWh)") + theme_hc(12) +
theme(axis.text = element_text(size = 12, color = 'black'),
plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(breaks = seq(0, 6, 1)) +
ggtitle("Model Test Set RMSE Comparison") +
scale_fill_manual(values = c("blue", "red", "green"))
That’s one point for the random forest! On each building (I am showing 6 of the 8), the random forest significantly outperforms the other two models. The linear regression has considerably more error than the random forest which indicates this problem might not be linear.
The next measure is the r squared value. This can be interpreted as the percentage of the variation in the target variable explained by the independent variables in the model. For this situation, it is the variability in the energy consumption for each 15-minute interval explained by the corresponding weather and time information. The code is very similar to the rmse code, so I will just show the graph.
The second metric is another clear win for the random forest. On the test set, the random forest achieves over 0.9 R-Squared. All of the models can explain a majority of the variation in the energy consumption except for the linear model on the Kansas building. However, even with the best models, there is variation in the response variable due to latent (hidden) variables, or inherent randomness (irreducible error) that cannot be captured.
The last metric is the mean absolute percentage error. To get an accuracy metric, we can subtract the mape from 100%.
The random forest significantly outperforms the other two models by
achieving less than 10% error on all of the buildings. The linear model is
much less accurate with the support vector machine in the middle in terms of performance. It is interesting that all of the models performed much better on the SRP and APS buildings, which are both located in Phoenix. Perhaps the buildings in warmer climates have clearer relationships between weather and energy usage.
Conclusion: based on the model evaluation results, the random forest is the most capable algorithm for prediction. I will select this model and optimize it for our specific problem.
Hyperparameter Optimization of Random Forest
After model selection, the next step in the machine learning pipeline is model optimization. All models have a number of hyperparameters (best thought of as settings) that can be adjusted by the developer to optimize the model for a problem. Hyperparameters differ from model parameters in that the latter are learned by the model during training while hyperparameters must be explicitly defined by the developer before training. Scikit-learn implements a set of sensible default hyperparameters, but there is no guarantee these will be the best settings for a particular problem.
Hyperparameters are best optimized by creating a number of different models with different settings and evaluating each one on the same validation set. Scikit-learn in Python has a number of handy tools for automatic hyperparameter optimization including a grid search which exhaustively tries every model in a specified grid and evaluates each one using cross-validation. The function returns the performance of all models and the best performing model can be selected for application. The following Python code performs this grid search over a pre-defined grid of hyperparameters. I used a single building for the hyperparameter tuning, which in hindsight was not the best choice. Settings that work best with one building might not be optimized for all the buildings. However, as shown in a little bit, the optimized model did outperform the baseline model for all buildings.
{python}
# Perform grid search to find best parameters
# Feature selection, depth, number of trees, and minimum size of leaves
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_features': [None, 0.5],
    'max_depth': [None, 40],
    'n_estimators': [200, 400],
    'min_samples_leaf': [1, 4]
}

# Testing on a dataset with a high mape to improve performance
nve_X_train = feather.read_dataframe('feather/train_test/NVE_X_train.feather')
nve_X_train = np.array(nve_X_train)
nve_X_test = feather.read_dataframe('feather/train_test/NVE_X_test.feather')
nve_X_test = np.array(nve_X_test)
nve_y_train = feather.read_dataframe('feather/train_test/NVE_y_train.feather')
nve_y_train = np.array(nve_y_train['forecast'])
nve_y_test = feather.read_dataframe('feather/train_test/NVE_y_test.feather')
nve_y_test = np.array(nve_y_test['forecast'])

# Base model and exhaustive search over the grid with cross-validation
rf_test = RandomForestRegressor(bootstrap = True, verbose = 2, n_jobs = -1)
rf_grid = GridSearchCV(rf_test, param_grid, scoring = 'neg_mean_squared_error',
                       n_jobs = -1, verbose = 2)
rf_grid.fit(nve_X_train, nve_y_train)
rf_grid.best_params_
The best parameters are as follows:
* 200 decision trees in the forest
* No maximum depth for each tree
* A minimum of 1 sample per leaf node
* A maximum fraction of 50% of the features evaluated for each tree
* A minimum number of 2 samples to split a node
* A minimum impurity decrease of 0.0 to split a node
Once the optimal hyperparameters have been selected, I can evaluate them against the same training and testing set as the baseline set of hyperparameters to assess the difference in performance. The optimized hyperparameters were selected based on a single building
and therefore I need to make sure that these settings work better for all buildings. The procedure to assess performance relative to the baseline is: read in the optimized predictions and true values, calculate the performance metrics, and graph the baseline and optimized results for comparison. If the optimized model performs better on average than the baseline with the same training and testing set, then I can conclude that these hyperparameters are a better fit for the problem.
Optimized to Baseline Comparison
The following code reads in the optimized results and produces a graph showing the optimized versus baseline performance:
{r}
# Iterate through all the predictions for all the buildings
for (file in filenames) {
name <- unlist(strsplit(file, '-|_'))[2]
# Dataframe to hold results for each building one at a time
results <- as.data.frame(matrix(nrow = 3, ncol = 3))
names(results) <- c('building', 'metric', 'value')
# Compare the true values and the predictions
true_values <- read_feather(sprintf('../feather/train_test/%s_y_test.feather',
name))$forecast
predictions <- read_feather(paste0(feather_dir, name, '_optimized_predictions.feather'))
predictions$true <- true_values
# Performance metrics
rf_rmse <- Metrics::rmse(predictions$true, predictions$rf)
rf_r2 <- cor(predictions$true, predictions$rf) ^ 2
rf_mape <- Metrics::mape(predictions$true, predictions$rf)
all_metrics <- c(rf_rmse, rf_r2, rf_mape)
# Add results to building dataframe
results$building <- name
results$metric <- c('rmse', 'r2', 'mape')
results$value <- all_metrics
# Store all results in a single dataframe
opt_metrics <- rbind(opt_metrics, results)
}

# Remove the first row
opt_metrics <- opt_metrics[-1, ]
opt_metrics <- merge(opt_metrics, metrics[, c('building', 'location')], by = 'building', all.x = TRUE)

# Create a dataframe of the optimized rf results
rf_metrics <- rf_metrics
rf_metrics$model <- 'base'
opt_metrics$model <- 'opt'
rf_metrics <- rbind(rf_metrics, opt_metrics)

# Plot the rmse of optimized versus base model
ggplot(dplyr::filter(rf_metrics, metric == 'rmse'), aes(location, value,
fill = model)) +
geom_bar(stat = 'identity', position = 'dodge', color = 'black') +
scale_fill_stata() +
xlab('Building') + ylab('RMSE (kWh)') +
ggtitle('Baseline vs Optimized RF RMSE Comparison') + theme_hc(12) +
geom_text(data = dplyr::filter(rf_metrics, metric == 'rmse' & model == 'opt'),
aes(label = round(value, 2)), vjust = -0.4, hjust = -.2, size = 3) +
geom_text(data = dplyr::filter(rf_metrics, metric == 'rmse' & model == 'base'),
aes(label = round(value, 2)), vjust = -0.4, hjust = 1, size = 3) +
theme(plot.title = element_text(hjust = 0.5),
axis.text.y = element_text(color = 'black'),
axis.text.x = element_text(color = 'black', angle = 45, vjust = 0.5))
The optimized random forest has a lower error on all the buildings. Therefore, we can conclude that the settings selected during hyperparameter tuning improve the performance of the Random Forest model. Moreover, the optimized model does not take significantly longer to run than the baseline model, and its use is justified moving forward in the project.
Now that the final model has been selected, we use it to predict six months’ worth of energy consumption. We can also try to interpret the feature importances from the model and use the model to carry out experiments by altering one or more explanatory variables and observing the change in the response variable, as will be shown later in the report.
Model Implementation
One of the primary objectives of the EDIFES research project is to develop a model that can predict six months of future energy consumption with 85% accuracy. Now that we have a fully developed machine learning model, we can test it out on this task and see if the Random Forest is up to the job!
Six Month Energy Prediction
In order to test the model, we will try to predict the last six months of energy
consumption from the weather and time variables for that time period.
The random forest requires the same number of features during testing as during training, and so will always need the full weather and time information. When we need to make forecasts into the future without known weather data, we will have to use average weather from the past few years, which is the best indicator of future weather. The six month prediction using known weather is meant to validate the accuracy of the model when the weather is known. (Prediction is done on historical data where all the features and targets are known so we can make predictions and compare them to the actual values. Forecasting is done into the future where neither the features nor the targets are known, and hence we cannot actually tell how accurate our model is until the future plays out. The project requires us to reach a predictive accuracy which is an estimate of the forecasting capability of a model.)
The predictions on the testing data can be compared to the actual answers (the true energy consumption) in order to determine the prediction ability of the random forest. Because EDIFES will always be able
to retrieve historical weather data for a location, this approach could be implemented to forecast into the future if it proves successful. We would need to train a model using the existing weather and electricity consumption, match the weather data from the previous year (or an average of multiple previous years) to the future dates we want to forecast, and then have the model make predictions for those dates. Eventually, we will develop a method that could not only make forecasts under current operating conditions but also with efficiency improvements taken into account. These modifications could include improving insulation, altering an HVAC schedule, or using timed lights, and quantifying economic benefits is crucial to getting building owners to implement these improvements.
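As a rough sketch of that idea (not part of the current EDIFES pipeline; the data frame and weather column names are hypothetical), matching last year's weather to the future dates we want to forecast could look like this:
{r}
# Sketch: build forecasting features by matching historical weather to future
# dates; 'history' and the weather column names are hypothetical
library(dplyr)
library(lubridate)

make_future_features <- function(history, future_timestamps) {
  # Average the historical weather by month, day, and time of day
  weather_lookup <- history %>%
    mutate(month = month(timestamp), day = day(timestamp),
           num_time = hour(timestamp) + minute(timestamp) / 60) %>%
    group_by(month, day, num_time) %>%
    summarize_at(vars(temp, ghi, gti, dif, rh, pwat, ws), mean, na.rm = TRUE) %>%
    ungroup()

  # Attach the matched weather to the future dates we want to forecast
  data.frame(timestamp = future_timestamps) %>%
    mutate(month = month(timestamp), day = day(timestamp),
           num_time = hour(timestamp) + minute(timestamp) / 60) %>%
    left_join(weather_lookup, by = c('month', 'day', 'num_time'))
}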
The first step for assessing prediction abilities is to create training/testing sets with the final 6 months used for testing. The code below creates training and testing features and labels for the final six months of each dataset. The variables here are also pre-processed using the caret package by centering, scaling, and removing near-zero variance variables.
{r}
# Take in a dataframe in standard format and create training and testing sets
# of final six months
prepare_six_months <- function(df) {
# Enforce dataframe
df <- as.data.frame(df)
# Enforce column specifications
# Create columns for day of the year, month, and year
df2 <- df %>% mutate(day = yday(timestamp), month = month(timestamp),
year = year(timestamp)) %>%
# Transform cyclical features (month, day, num_time)
mutate(day_sin = sin(2 * day * pi / 365),
day_cos = cos(2 * day * pi / 365),
month_sin = sin(2 * month * pi / 12),
month_cos = cos(2 * month * pi / 12),
num_time_sin = sin(2 * num_time * pi / 24),
num_time_cos = cos(2 * num_time * pi / 24))

# Enforce factors
df2$day_of_week <- as.factor(df2$day_of_week)
df2$week_day_end <- as.factor(df2$week_day_end)
df2$sun_rise_set <- as.factor(df2$sun_rise_set)
# One hot encoding of categorical variables
df2 <- dummy.data.frame(df2, names = c('day_of_week',
'week_day_end', 'sun_rise_set'), sep = '_')
# Extract the known values, and timestamps for graphing
labels <- df2[c('timestamp', 'cleaned_energy', 'forecast')]
cols_to_remove <- c('elec_cons', 'elec_cons_imp', 'pow_dem', 'cleaned_energy',
'anom_flag', 'forecast', 'anom_missed_flag')
# Change the timestamp to a numeric
df2$timestamp <- as.numeric(df2$timestamp)
# Remove the columns from the features
df2 <- df2[, -which(names(df2) %in% cols_to_remove)]
# Index of dates for final six months
six_months_index <- nrow(df2) - (60 / as.numeric(df$timestamp[5] -
df$timestamp[4])) * 24 * 180
# Return the complete features and labels as testing and training sets
train_features <- df2[1:(six_months_index - 1), ]
test_features <- df2[six_months_index:nrow(df2), ]
# Scale features
# Train the preprocessor on training data and then transform
# both training and testing data
pre_processor <- caret::preProcess(train_features,
method = c('center', 'scale', 'nzv'))
train_features <- predict(pre_processor, train_features)
test_features <- predict(pre_processor, test_features)
# Extract labels
train_labels <- labels[1:(six_months_index - 1), ]
test_labels <- labels[six_months_index:nrow(df2), ]
return(list('X_train' = train_features, 'X_test' = test_features,
'y_train' = train_labels, 'y_test' = test_labels))
}
The prepared data sets are again saved to feather files for transport to Python. The model is trained and tested on all the datasets with a Scikit-learn implementation. The code is nearly exactly the same as that for a random training/testing split of data, except the names of the files have changed (and the optimized hyperparameters are used instead of the baselines). I will leave out the code (find it on my GitHub) and show the results.
The moment of truth is visualizing the prediction metrics!
The final averages across all buildings are:
- MAPE = 21.24%
- R-Squared = 0.771
- RMSE = 3.83 kWh
Four of the eight buildings pass the accuracy bar of 85% as measured by MAPE accuracy. Clearly, there remains significant work to be done on the predictive model, but the random forest regression is a good start. There are many tools in the machine learning engineer’s toolbox for improving performance including: feature engineering, altering the variables fed into the model to provide it with more information to learn mappings; creating a more complex, deeper model; and using ensemble methods that combine diverse algorithms to make a more robust overall model. Work on an ensemble of different predictive models is ongoing, and it is clear that the random forest developed in this report will play an integral role in the EDIFES project!
Interpreting Results
In machine learning implementations, emphasis is often placed on accuracy at the expense of understandable models. However, I like to know not only that my model can make predictions that reflect real-world results, but also how a model makes a given classification/prediction. We have already seen that the slopes in a linear regression can be interpreted as the change in the response variable due to a change in the input variable. The random forest is not quite this intuitive, but there are several options for getting under the hood of the algorithm. Two that we will examine here are feature importances: the relative predictive capability of explanatory variables in the model, and visualizing a single decision tree in the random forest.
Random Forest Feature Importances
The random forest is capable of returning feature importances, which are a quantitative measure of how much including a variable improves the performance of the model. The technical details are a little much for this report, so I like to look at the relative feature importances to make comparisons between variables. The feature importances were calculated during training of the random forest regression model in Python. We can visualize them in R with the following code.
{r}
# Feature Importances
features_import <- read_feather('../feather/feature_importances/feature_importances.feather')
features <- names(read_feather('../feather/train_test/APS_X_test.feather'))
features_import$feature <- features
features_import <- reshape2::melt(features_import, id.vars = c('feature'))
names(features_import) <- c('feature', 'building', 'importance')
features_import <- merge(features_import, metadata[, c('Name', 'location')],
                         by.x = 'building', by.y = 'Name', all.x = TRUE)

# Plot the feature importances for all buildings
ggplot(features_import, aes(feature, importance, col = location)) +
geom_point(size = 3) + xlab('feature') + ylab('RF Importance') +
ggtitle('Random Forest Feature Importances') +
theme(axis.text.x = element_text(angle = 90, vjust = 0)) +
theme_hc(12) + scale_color_tableau() +
theme(plot.title = element_text(hjust = 0.5),
legend.position = 'right')
The feature importances for all buildings are presented in the graph below:
As the linear model demonstrated, the most helpful weather variables are temperature and ghi. In terms of all the variables, the time of day in hours (num_time) as well as the cyclical transformation of time of day in hours were most significant in predicting energy consumption. The weekday indicator
also was significant, which is not unexpected because office buildings show a significant difference in energy consumption on the weekends compared to during the week. Finally, we can also see that many features might not be useful and could be excluded from future models, such as month, and sun rise or sun set. These importances could be used for feature selection with other algorithms. Including irrelevant variables can decrease the accuracy of models and increase the run-time, so reducing the set of features to those that are most essential is a major concentration in machine learning.
To get a better sense of the average importances, we can use the following code to take the mean feature importances across all buildings:
{r}
# Show the average importance of features across all buildings
average_import <- features_import %>% group_by(feature) %>%
summarize(average = mean(importance)) %>%
arrange(desc(average))

# Plot as a horizontal barchart
ggplot(average_import, aes(reorder(feature, average), average)) +
geom_bar(stat = 'identity', fill = 'firebrick', width = 0.9,
color = 'black', lwd = 1) +
coord_flip() + xlab('') + ylab('Importance') +
ggtitle('Average RF Feature Importance') + theme_hc(12) +
scale_y_continuous(lim = c(0, 0.3), breaks = seq(0, 0.3, 0.05)) +
theme(axis.text = element_text(color = 'black')) +
geom_text(aes(y = average + 0.015, label = round(average, 3)))
This chart is pretty definitive! The temperature is clearly the most significant
variable, followed by whether or not the day is a weekday or weekend, the time of day in hours, and whether or not the day is a business day (which accounts for holidays as well as weekends). Overall, the most significant weather variables are temperature, dif, gti, and ghi. There are a number of unimportant variables, such as wind speed, month (including all cyclical transformations), year, and sun rise or set. This last one is not surprising because the vast majority of times are not sunrise or sunset, and these have no noticeable connection with energy usage. These are low variance variables
that do not have much explanatory power. They should most likely be removed from any model to avoid including irrelevant features.
Removing unnecessary features can improve the performance of a model as well as decreasing run time. The optimized random forest only considered 50% of the features for each tree and achieved slightly better results than the base random forest. Clearly, not all temporal and weather variables are necessary for predicting the energy consumption at a 15-minute interval.
Visualizing a Single Decision Tree
One of the coolest parts of using this model is looking at individual trees in the forest. The actual trees in the forest are an average of 46 layers deep and the image is far too massive to fit on one screen. Therefore, we can limit the depth of trees in the forest to create an image that is interpretable. Following is a single decision tree limited to three layers for clarity.
There is a lot of information here, but if you can make sense of it, then you understand a decision tree and the random forest! Starting at the top of the tree (the root), we take a data point and move down the tree to a prediction. Say that our data point is a Tuesday at 8:00 am. Therefore, the answer to the first question is False because the time is greater than 6.90 hours. We move down the right branch and come to the next question, which we answer False because this is a business day (the value of biz_day is 1). This leads down the branch to the far right and we reach a leaf node with the value of 142.7 kWh. Therefore, for a Tuesday at 8 am, we would predict an energy consumption of 142.7 kWh for the 15-minute interval based on the random forest. The real decision trees in the forest are much more complex with 40+ layers in order to make more accurate predictions (the leaf node we ended up in contains 50% of the data points and makes the same prediction for each one, which does not reflect reality where each data point likely has a distinct value). However, this simplistic tree gives the basic idea of how the model works. Furthermore, we can see that the variables we are splitting on, the time of day (num_time) and business day (biz_day), are among the most important as evidenced both by the feature importances and because they are near the root of the tree. More important variables will be closer to the root because they are more informative for making splits of the data. Moving forward, understanding a model will become more important in machine learning, and the random forest is a great blend of intuitive methods and high accuracy.
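The image above comes from the scikit-learn forest, but the same kind of shallow, readable tree can be grown directly in R as a rough illustration using the rpart and rpart.plot packages (the data frame name and the exact features here are hypothetical):
{r}
# Illustration only: grow and plot a single regression tree limited to 3 levels
library(rpart)
library(rpart.plot)

tree_sketch <- rpart(forecast ~ num_time + biz_day + temp + ghi,
                     data = train_data,
                     control = rpart.control(maxdepth = 3))
# Each node shows the predicted energy and the share of observations it holds
rpart.plot(tree_sketch)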
Quantity of Data versus Performance
One interesting graph we can quickly make is the performance of the model versus the number of data points. Generally, we expect the performance of any model to increase with the amount of data, a phenomenon described in "The Unreasonable Effectiveness of Data," a paper by Peter Norvig and colleagues at Google. The following graph shows the MAPE (mean absolute percentage error) as a function of the number of data points for all buildings:
As expected, the error decreases as the model has access to more data. This trend might not always hold for building energy data, because recent data could be more useful than observations from 2–3 years ago, which means we might actually want to limit training to recent observations. Generally, though, as long as the data is valid, increasing its quantity allows the model to better learn the relationships (mappings) between the explanatory variables and the response variable. The best way to improve any machine learning model is simple: more (high-quality) data!
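A rough sketch of how such a curve can be generated follows: train the forest on progressively larger slices of the training set and record the MAPE on a fixed test set. The split fractions, hyperparameters, and the assumption that no test targets are zero are all illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hold out a fixed test set, then train on growing fractions of the rest
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

results = []
for frac in np.linspace(0.1, 1.0, 10):
    n = int(frac * len(X_train))
    rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
    rf.fit(X_train[:n], y_train[:n])
    preds = rf.predict(X_test)
    # MAPE in percent; assumes no zero-valued targets in the test set
    mape = np.mean(np.abs((y_test - preds) / y_test)) * 100
    results.append((n, mape))
```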
Experimenting with Random Forest
For the final part of this report, I want to do a simple experiment that shows one valuable use of the random forest: alter one or more explanatory variables and observe the effect on the response variable. In this case, I will look at the ramifications of global warming on building energy consumption by increasing the temperature and observing the change in energy use.
Simulating Global Warming
Current predictions indicate the global surface temperature will increase between 1 and 4 degrees Celsius by 2100 as a result of human-caused climate change. To assess the effect on future energy consumption, an increase of 2 degrees Celsius can be modeled using the random forest.
The training data for this experiment is the full historical record for each building. The testing data is exactly the same as the training data, except with the temperature increased by 2 degrees Celsius at every interval. In effect, we are "replaying" the past under different conditions. There are no right answers in this situation because it is hypothetical, but it gives us an idea of what the energy consumption could be under a global increase in temperature.
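On the Python side, the core of this "replay" might look roughly like the sketch below; the column name `temp` and the variable names are assumptions about the prepared feature matrix, not the project's exact code.

```python
# Copy the historical features and raise the temperature column by 2 degrees
X_warm = X.copy()
X_warm['temp'] = X_warm['temp'] + 2.0   # +2 C at every 15-minute interval

# Predict with the forest trained on the original data; the baseline could
# instead be the recorded consumption, as in the R analysis that follows
baseline_pred = model.predict(X)
warmed_pred = model.predict(X_warm)

pct_change = (warmed_pred.sum() - baseline_pred.sum()) / baseline_pred.sum() * 100
print(f'Predicted change in total consumption: {pct_change:+.2f}%')
```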
Data is again prepared in R, transported to Python for training and prediction, and sent back to R for analysis. The following code reads in the results from the predictions and plots the results:
```{r}
# Create a data frame to hold the results
diff_df <- as.data.frame(matrix(ncol = 4))
names(diff_df) <- c('building', 'before', 'after', 'diff_pct')

# Number of 15-minute intervals in one year
yearpoints <- 96 * 365

# Iterate through each results file and find the differences
for (file in filenames) {
name <- unlist(strsplit(file, '-|_'))[2]
# Find the total consumption before, total consumption after, and change
predictions <- read_feather(sprintf('../feather/predictions/%s_increased_predictions.feather', name))
true <- read_feather(sprintf('../feather/increased/%s_labels.feather', name))
# Use only the most recent year of data when more than a year is available
# (assumes the predictions feather file holds a single column of predicted values)
if (nrow(predictions) >= yearpoints) {
last_index <- nrow(predictions) - yearpoints
annual_before <- sum(true$forecast[last_index:nrow(true)])
annual_after <- sum(predictions[[1]][last_index:nrow(predictions)])
} else {
annual_before <- sum(true$forecast)
annual_after <- sum(predictions[[1]])
}
difference_pt <- (annual_after - annual_before) / annual_before * 100
# Add the row to a dataframe for record keeping
diff_df <- add_row(diff_df, building = name, before = annual_before,
after = annual_after, diff_pct = difference_pt)
}

# Only complete rows retained
diff_df <- diff_df[complete.cases(diff_df), ]

# Record increase for coloring
diff_df$increase <- ifelse(diff_df$diff_pct > 0, 'red', 'green')
diff_df <- merge(diff_df, metadata[, c('Name', 'location')],
by.x = 'building', by.y = 'Name', all.x = TRUE)

# Plot the results
ggplot(diff_df, aes(x = location, y = diff_pct, fill = increase)) +
geom_bar(stat = 'identity', color = 'black') +
xlab('') + ylab('% Difference') +
ggtitle('Effect of 2 C Increase in Temp') +
scale_y_continuous(lim = c(-2, 5), breaks = seq(-2, 5, 1)) +
theme_economist(12) +
scale_fill_identity() + theme(legend.position = 'none') +
theme(axis.text.x = element_text(angle = 60, vjust = 0.5),
plot.title = element_text(hjust = 0.5))
```
The percentage changes are shown in the figure below (using the Economist ggtheme from the ggthemes package in R):
This shows that 6 of the 8 buildings will see an increase in energy consumption due to global warming. Of course, what every building owner actually cares about is the cost of that energy. Using the current national average price of electricity ($0.104 per kWh), we can quantify the monetary effects:
```{r}
# Calculate and graph the costs (or savings) of the temperature increase
kwh_cost <- 0.104

diff_df <- diff_df %>% mutate(cost = (after - before) * kwh_cost)

# Plot the costs (using the Economist theme of course)
ggplot(diff_df, aes(location, cost, fill = increase)) +
geom_bar(stat = 'identity') +
scale_fill_identity() + xlab('') + ylab('Cost $') +
ggtitle('Annual Cost of 2 C Increase in Temp') + theme_economist(12) +
scale_y_continuous(breaks = seq(0, 30000, 5000), limits = c(-500, 35000)) +
geom_text(label = round(diff_df$cost), vjust = -1) +
theme(axis.text.x = element_text(angle = 60, vjust = 0.5),
plot.title = element_text(hjust = 0.5))
```
Economic results are shown below:
One aspect of global warming that often goes unmentioned is that it will affect different areas of the globe unevenly, with some locations enjoying longer growing seasons and others experiencing more frequent severe weather events. The unequal distribution of the effects of climate change is reflected in the figure, which shows that, according to the random forest model, some areas will see a reduction in energy usage (and hence cost) while others will see a significant increase. According to this analysis, a 50,000 ft² office building could see an increase of nearly $30,000 in annual energy costs due to increased temperatures. Many arguments have been made for taking steps to reduce carbon emissions and transition to sustainable energy sources, but perhaps none is more persuasive than simple economics!
Additional experiments are possible by modifying any of the weather or time variables. Eventually, in order to quantify the effects of a proposed building efficiency improvement, a method will be developed to alter a building characteristic and gauge the effect. This is an active area of research, and quantifying the savings from a recommendation is critical to the value of a virtual energy audit. The random forest experiment provides a framework for altering one (or several) variables, "re-running" past data under the new conditions, and observing the effects on energy consumption.
Conclusions
In this report, we evaluated a number of different modeling techniques, implemented the most successful method for prediction, interpreted the model, and simulated global warming. Modeling is a crucial part of the data analysis pipeline because it allows one to determine the strength of relationships between response and explanatory variables.
This can then be translated into real-world applications by concentrating on those variables that have the largest effect on the response. Experiments can be run wherein these features are slightly altered in order to test the model or try to achieve a desired response. For the building energy problem, an ideal model would allow the EDIFES team to alter factors such as the effective thermal resistance of a building (a function of the amount of insulation and quality of seals) or the HVAC schedule and observe the effects on building energy use.
Three models were evaluated in this report for their ability to predict electricity consumption from all weather and time variables: linear regression (both simple and multivariate); a support vector machine using a radial basis function kernel; and a random forest composed of 200 decision trees. On a random training/testing split, the random forest significantly outperformed the other two models in terms of RMSE, MAPE, and R-squared. Therefore, the random forest was selected for further exploration. This exploration included simulating a 2 degree Celsius increase in temperature and an attempt to predict energy consumption for six months using known weather data. The explorations revealed both the strengths of the random forest model and its limitations. Moving forward in the EDIFES project, I expect a random forest model will play a vital role in helping us understand and improve building energy consumption.
Following are the primary conclusions from this report:
- Temperature is the most highly correlated variable with daily energy consumption. However, a univariate linear model attempting to explain daily energy use from temperature alone has an R-squared value below 0.20 for all buildings during the summer.
- A 1 degree Celsius increase in temperature can increase electricity consumption by up to 300 kWh over the course of a day in an office building.
- Including all of the weather variables in a linear model with energy as the response yielded slightly better predictions, although the model was not significantly better than the simple model with only temperature.
- When trying to predict energy consumption for each 15-minute interval from all of the time and weather variables, the random forest significantly outperformed a support vector machine regression and a linear regression.
- The most important variables for predicting energy consumption, as ranked by the random forest, are: temperature, weekday or weekend, time of day in hours, business day or non-business day, timestamp, and diffuse horizontal irradiance. Several variables, including month and sunrise or sunset, were not significant for predicting energy consumption.
- A 2 degree Celsius increase in average temperature due to global warming results in an average increase in energy consumption of 1.568% for the eight buildings studied, with 6 of 8 buildings seeing an increase in energy use. This percentage increase could cost a building owner up to $30,000 every year.
- The random forest model might be capable of meeting the EDIFES project requirement of 0.85 accuracy when predicting six months of energy consumption.

Overall, the problem of predicting energy consumption from weather and time variables in 15-minute increments is a highly non-linear task. There are significant differences in energy consumption depending not only on the weather conditions, but also on the time of day, the day of the week, the season, and the building characteristics.
Well, I appreciate anyone who stuck it out to the end! A final report will summarize the major conclusions and include only results (no code) for a complete overview of the project. Moreover, I will complete a write-up focused solely on machine learning that includes work done since this report (anyone looking for neural networks should check that out). That's it for this report; as always, I want to know what I did wrong and what I can improve for next time, so contact me at wjk68@case.edu with any comments!