Building Energy Data Analysis Part Two

Will Koehrsen
Dec 26, 2018 · 24 min read

Plots, plots, and more plots!

Author’s Note:

This is part two of a comprehensive exploratory data analysis (EDA) of eight office buildings as part of the EDIFES Project at Case Western Reserve University. Part One, where I explain the background, take a look at the data, and make some exploratory plots, can be found here. For those who need a refresher, EDIFES (the Energy Diagnostic Investigator for Efficiency Savings) is a US Department of Energy project (hence the convoluted acronym) to develop a virtual energy audit platform. Energy audits identify inefficiencies in a building’s electricity usage and are typically done at the physical building by a trained team. There are a number of drawbacks to this approach, and a virtual audit that can be done without ever setting foot in a building will reduce the time and cost of the procedure while increasing its accuracy. This will lead to improved economic and environmental benefits, making it easier for companies to do well by doing good.

I have included some of the code used to produce the plots and analyses, and the full code can be found on GitHub. As always, I welcome feedback (especially constructive criticism) and can be reached at wjk68@case.edu.

Introduction

Before we can get started with any plots, we have to know what data we are dealing with. There are two categories of data available for this project: metadata, which is information about the buildings themselves, and individual building data, which lists the electricity usage and weather data for each building in 15-minute intervals for a period of at least one year. The metadata is presented below. Annual consumption is in kWh and the climate zones use the Köppen-Geiger classification.

Building Metadata

The two buildings with no name could not be analyzed because of poor data quality. This left us with eight office buildings in six climate zones to explore. Each building has its own individual energy and weather file. The rows of each file are the observations taken every fifteen minutes, with the variables forming the columns. The full set of variables for each building is listed below:

Variables for Data

Not all of these variables turn out to be useful and the most relevant are:

• timestamp: gives the date and time for the energy measurement in fifteen-minute intervals

• elec_cons: the raw electricity consumption data (kWh)

• biz_day: business day or not (accounts for holidays)

• day_of_week: relatively self-explanatory

• week_day_end: weekday or weekend

• num_time: number of hours since the start of day

• ghi: global horizontal irradiance, generally positively correlated with energy consumption during the summer

• temp: temperature, generally positively correlated with energy consumption during the summer

• rh: relative humidity

• forecast: final cleaned energy consumption with anomalies removed and missing data imputed

The forecast column (so named because the energy use has been “forecasted” using imputation methods) is the electricity consumption we use for analysis. The raw electricity consumption has anomalies and missing points, and these are corrected before any summaries/plots can be made. I won’t go into details here, but the process involves setting outliers/anomalies to zero and then imputing the zero values based on the surrounding values (using either a linear method or diffusion indices). For example, if we have a 0 value between a 5 and a 7, the corrected value would be 6. To make things easier, I have renamed the forecast column to energy in the plots.
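To make the linear imputation idea concrete, here is a minimal sketch (not the project’s actual cleaning pipeline) using zoo::na.approx to interpolate readings that have been flagged as anomalous and set to zero:

library(zoo)

# Toy series: the 0 is an anomalous reading that was zeroed out
elec <- c(5, 0, 7, 6, 8)

# Treat flagged zeros as missing, then linearly interpolate
elec[elec == 0] <- NA
zoo::na.approx(elec)
# [1] 5 6 7 6 8   -- the 0 between the 5 and the 7 becomes 6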

Data Visualizations

Now that the tedious (but extremely necessary) business of outlining our variables and making sure the data is clean has been completed, we can move on to the enjoyable part: making graphs and exploring trends! The main question I wanted to explore in this report was:

What weather and time conditions are most correlated with energy consumption, and will it be possible to create a model to predict the energy use based on the conditions?

The second part of that question is mostly saved for part three, but we can definitely take care of the first with some plots!

In the preliminary charts from part one, I noticed patterns that repeated on daily, weekly, and seasonal timescales, as well as overall trends, and it is worth examining each of these in turn.

Overall Trends

First, we need to get an overview of an entire dataset. Plotting every single 15-minute observation is a little overwhelming, so it’s better to take daily averages.

library(readr)     # read_csv
library(ggplot2)
library(ggthemes)  # theme_hc

# Function takes in a dataframe and an optional building name
# and plots average daily energy consumption
plot_daily_average <- function(A, building_location = '') {
  # Create a column with only the day
  A$date <- as.Date(format(A$timestamp, "%Y-%m-%d"))

  # Group by the day and compute averages; aggregate is used
  # because the frame mixes numeric and character columns
  daily.df <- aggregate(A[, c("forecast", "biz_day")],
                        by = list(A$date), FUN = mean)
  daily.df <- plyr::rename(daily.df, replace = c("Group.1" = "date"))

  # Create the plot; multiply the 15-minute average by 96 to get daily consumption
  print(ggplot(daily.df, aes(x = date,
                             y = forecast * 96, color = factor(biz_day))) +
          geom_point() + labs(color = '') + geom_line() + xlab('') +
          scale_color_manual(values = c("firebrick", "darkgreen"),
                             labels = c('Business Day', 'Non-Business Day')) +
          theme_hc(12) + ylab("kWh") +
          scale_x_date(date_labels = ("%m-%Y"), date_breaks = '4 months') +
          ggtitle(sprintf('%s Average Daily Energy Consumption', building_location)))
}

# Iterate through files and create plots
for (file in c("data/f-APS_weather.csv", "data/f-NVE_weather.csv",
               "data/f-Kansas_weather.csv")) {
  name <- unlist(strsplit(file, '-|_'))[2]
  location <- metadata[which(metadata$Name == name), ]$City
  df <- suppressMessages(as.data.frame(read_csv(file)))
  suppressMessages(plot_daily_average(df, location))
}

Examples of running this function on three buildings (the Christmas colors in these plots were not intentional!):

There are a number of noticeable trends in these plots. For all the buildings, energy consumption increases during the summer, which agrees with intuition. The largest source of office building energy consumption tends to be the heating, ventilation, and air conditioning (HVAC) system, which sees more use during the summer months in southern climates. Kansas City, Las Vegas, and Phoenix experience hot summers and hence see an increase in energy consumption in the summer due to air conditioning use. The buildings in Las Vegas and Phoenix show a decrease in energy consumption during the winter because these cities experience mild winters that do not necessitate energy-intensive heating. Meanwhile, the building in Kansas City does not exhibit the same large decrease during the winter. Energy consumption for this building has two peaks, one during late summer and the other during late winter, probably because of the need for cooling during the summer and heating during the winter.

Monthly Trends

Now that we can see what is happening on the long term, how does each month look? The next plots show a boxplot of average daily energy consumption for each month of the year. Boxplots are a great way to visualize data because they show the median (middle value), the spread of the values (the box spans the interquartile range), and outliers. We can see both the location and spread of the data.

This code creates the boxplots where each box is computed based on the daily averages for the month.

library(dplyr)  # filter

# Data from the PGE1 building
pge1 <- as.data.frame(suppressMessages(read_csv('data/f-PGE1_weather.csv')))
# Data from the SRP building
srp <- as.data.frame(suppressMessages(read_csv('data/f-SRP_weather.csv')))
nve <- as.data.frame(suppressMessages(read_csv('data/f-NVE_weather.csv')))
kansas <- as.data.frame(suppressMessages(read_csv('data/f-Kansas_weather.csv')))

# Create a month column
srp$month <- lubridate::month(srp$timestamp, label = TRUE, abbr = TRUE)
pge1$month <- lubridate::month(pge1$timestamp, label = TRUE, abbr = TRUE)
nve$month <- lubridate::month(nve$timestamp, label = TRUE, abbr = TRUE)
kansas$month <- lubridate::month(kansas$timestamp, label = TRUE, abbr = TRUE)

# Plot the year with each month represented as a boxplot of daily consumption
# Multiply the 15-minute average by 96 to get daily consumption

# PGE1 building
# ggplot(filter(pge1, biz_day == 1),
#        aes(x = month, y = 96*forecast, group = month)) +
#   geom_boxplot(outlier.color = 'red', fill = 'dodgerblue3') + theme_hc(12) +
#   xlab('Month') + ylab('kWh') + scale_fill_stata() +
#   ggtitle('Portland Building Monthly Consumption Boxplots')

# SRP building
ggplot(filter(srp, biz_day == 1),
       aes(x = month, y = 96*forecast, group = month)) +
  geom_boxplot(outlier.color = 'red', fill = 'dodgerblue3') + theme_hc(12) +
  xlab('Month') + ylab('kWh') + scale_fill_stata() +
  ggtitle('Phoenix Building Monthly Consumption Boxplots')

ggplot(filter(nve, biz_day == 1),
       aes(x = month, y = 96*forecast, group = month)) +
  geom_boxplot(outlier.color = 'red', fill = 'dodgerblue3') + theme_hc(12) +
  xlab('Month') + ylab('kWh') + scale_fill_stata() +
  ggtitle('Las Vegas Building Monthly Consumption Boxplots')

ggplot(filter(kansas, biz_day == 1),
       aes(x = month, y = 96*forecast, group = month)) +
  geom_boxplot(outlier.color = 'red', fill = 'dodgerblue3') + theme_hc(12) +
  xlab('Month') + ylab('kWh') + scale_fill_stata() +
  ggtitle('Kansas Building Monthly Consumption Boxplots')

The same yearly behavior is observed as before, with an increase in energy use during the summer for the Las Vegas and Phoenix buildings, and two yearly peaks in consumption for the Kansas building. The Kansas building has significantly more outlying points (shown in red) during the spring months, indicating that daily energy consumption is more variable during these months. This is not unexpected, as some springs may be much colder than others, necessitating heating.

Weekly and Daily Trends

The last major repeating patterns observed in the data occur at the daily and weekly level, with the exact shape depending on the season. There are many different ways to visualize these trends, but I found the best way was to show all of the seasons on the same plot.

The following code shows average energy consumption over the course of a day for each day of the week with the lines colored by season.

library(lubridate)  # years()

# Plot energy consumption for each day of the week, colored by season
plot_week <- function(df, location) {

  # Create the necessary columns
  df$day <- lubridate::ymd(as.Date(df$timestamp, tz = 'EST'))
  df$month <- lubridate::month(as.Date(df$timestamp, tz = 'EST'))
  df$year <- lubridate::year(as.Date(df$timestamp, tz = 'EST'))

  # December of the preceding year is associated with winter of the next year
  df[which(df$month == 12), ]$timestamp <-
    df[which(df$month == 12), ]$timestamp + years(1)
  df[which(df$month == 12), ]$year <-
    df[which(df$month == 12), ]$year + 1

  # Season mapping
  df$season <- ifelse(df$month %in% c(12, 1, 2), 'winter',
                 ifelse(df$month %in% c(3, 4, 5), 'spring',
                   ifelse(df$month %in% c(6, 7, 8), 'summer', 'fall')))

  # Coloring for scale_color_manual
  df$season_color <- ifelse(df$season == 'winter', 'blue',
                       ifelse(df$season == 'spring', 'darkgreen',
                         ifelse(df$season == 'fall', 'orange', 'firebrick')))

  # Day of week as a factor for correct ordering
  df$day_of_week <- factor(df$day_of_week,
                           levels = c('Mon', 'Tue', 'Wed', 'Thu', 'Fri',
                                      'Sat', 'Sun'))

  df <- arrange(df, season)

  # Plot weekly consumption colored by season and faceted by day of week
  p1 <- ggplot(df, aes(x = num_time, y = forecast)) +
    stat_summary(fun.y = mean, geom = 'line', aes(col = season_color), lwd = 1.1) +
    theme_hc(12) + xlab('Time of Day in Hours') + ylab('Energy (kWh)') +
    ggtitle(sprintf('%s Daily and Weekly Consumption Patterns', location)) +
    scale_x_continuous(breaks = seq(0, 24, 6)) +
    scale_color_identity(labels = unique(df$season),
                         guide = 'legend') + labs(color = 'season') +
    facet_wrap(~ day_of_week, ncol = 7)

  print(p1)
}

# file.names is the vector of building file paths defined in the full code
for (file in file.names[c(1, 3, 4)]) {
  name <- unlist(strsplit(file, '-|_'))[2]
  location <- metadata[which(metadata$Name == name), ]$City
  df <- suppressMessages(as.data.frame(read_csv(file)))
  suppressMessages(plot_week(df, location))
}

There is a ton of information compressed into these single images! All buildings exhibit the highest energy consumption during the summer and the buildings in Phoenix and Las Vegas have the lowest energy consumption during the winter. However, the Kansas City building’s minimum energy consumption is during the spring. We can also see all these buildings have a typical schedule of high electricity use during the workweek, and a significant drop on the weekends, indicative of the normal office building occupancy schedule. The final aspect to notice is energy consumption generally peaks during the mid-afternoon. There is a noticeable increase in the morning, as building HVAC systems are turned on in preparation for the workday, and then the energy use falls before rising into the afternoon. The systems are typically shut down at the end of the working day, and we can infer the working hours by examining the significant increase (start of work) and significant decrease (end of work) in energy.
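As a rough sketch of that inference (illustrative only, not the EDIFES pipeline), the largest jump and drop in the average daily profile can be located with base R. The toy profile here stands in for one building’s averaged data:

# Toy 15-minute profile: low baseload overnight, HVAC start at 7:00,
# shutdown at 18:00 (num_time is hours since the start of the day)
num_time <- seq(0, 23.75, by = 0.25)
set.seed(1)
profile <- ifelse(num_time >= 7 & num_time < 18, 12, 3) +
  rnorm(length(num_time), sd = 0.2)

# The biggest increase marks the start of work, the biggest decrease the end
jumps <- diff(profile)
start_hour <- num_time[which.max(jumps) + 1]
end_hour <- num_time[which.min(jumps) + 1]
c(start = start_hour, end = end_hour)
# start = 7, end = 18 (up to noise)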

Finally, I wanted to convey this information in a more dynamic fashion. After searching for a method to do this, I saw that R can be used to create animations with gganimate. I created an animated version of the same plots as above, where each line is a different season and the frames are days of the week.

library(gganimate)  # the 2018-era API: frame aesthetic + gganimate()

# Read in a local file
read_data <- function(filename) {
  df <- as.data.frame(suppressMessages(read_csv(filename)))
  df$day_of_week <- as.factor(df$day_of_week)
  df$week_day_end <- as.factor(df$week_day_end)
  df$sun_rise_set <- as.factor(df$sun_rise_set)
  return(df)
}

# Example dataframe
energy_df <- read_data('../data/f-SRP_weather.csv')

# Create day, month, year, season, and year-season columns
# Seasons are defined based on meteorological seasons
energy_df$day <- lubridate::ymd(as.Date(energy_df$timestamp, tz = 'EST'))
energy_df$month <- lubridate::month(as.Date(energy_df$timestamp, tz = 'EST'))
energy_df$year <- lubridate::year(as.Date(energy_df$timestamp, tz = 'EST'))

# December of the preceding year is associated with winter of the next year
energy_df[which(energy_df$month == 12), ]$timestamp <-
  energy_df[which(energy_df$month == 12), ]$timestamp + years(1)
energy_df[which(energy_df$month == 12), ]$year <-
  energy_df[which(energy_df$month == 12), ]$year + 1

energy_df$season <- ifelse(energy_df$month %in% c(12, 1, 2), 'winter',
                      ifelse(energy_df$month %in% c(3, 4, 5), 'spring',
                        ifelse(energy_df$month %in% c(6, 7, 8), 'summer',
                               'fall')))
energy_df$season_color <- ifelse(energy_df$season == 'winter', 'blue',
                            ifelse(energy_df$season == 'spring', 'darkgreen',
                              ifelse(energy_df$season == 'fall', 'orange',
                                     'firebrick')))
energy_df$ys <- paste(energy_df$year, energy_df$season, sep = '-')

energy_df2 <- dplyr::select(energy_df, timestamp, forecast, temp, ghi, num_time,
                            day, month, year, season, ys, day_of_week, season_color)
energy_df2$day_ys <- paste(energy_df2$day_of_week, energy_df2$ys, sep = '-')
energy_df2$day_of_week <- factor(energy_df2$day_of_week,
                                 levels = c('Mon', 'Tue', 'Wed', 'Thu',
                                            'Fri', 'Sat', 'Sun'))

year_num <- 2016
plot_df <- dplyr::filter(energy_df2, lubridate::year(timestamp) == year_num)
plot_df$week <- lubridate::week(plot_df$timestamp)
plot_df <- arrange(plot_df, timestamp)

# One frame per day of the week (old gganimate frame aesthetic)
p <- ggplot(plot_df, aes(x = num_time, y = forecast, frame = day_of_week)) +
  stat_summary(fun.y = mean, geom = 'line', aes(col = season_color), lwd = 1.2) +
  ggtitle(sprintf('%s Consumption for: ', year_num)) + xlab('Time of Day (hrs)') +
  ylab('Energy (kWh)') + coord_cartesian(xlim = c(0, 24)) +
  scale_x_continuous(breaks = seq(0, 24, 4)) + theme_classic(12) +
  scale_color_identity(labels = unique(plot_df$season),
                       guide = 'legend') + labs(color = 'season') +
  theme(axis.text = element_text(color = 'black'),
        plot.title = element_text(hjust = 0.5))

gganimate(p, sprintf('%s_snapshot.gif', year_num), interval = 0.5, title_frame = TRUE,
          ani.width = 650, ani.height = 650)

We now have a pretty good understanding of the typical patterns of office building energy use. Office buildings generally experience a peak in consumption during the afternoon, have higher electricity usage during the work week, and depending on the geographic location, will exhibit different trends across seasons. From these patterns, we can determine useful info such as occupancy schedule and operating hours. To determine opportunities to increase efficiency, we could compare the actual occupancy schedule with that observed from the energy usage and see if there is a mismatch. Maybe the HVAC starts two hours earlier than it needs to, or there might be an unexpected increase on the weekends due to a system malfunction. It’s often surprising how little building owners know about the energy use patterns of their buildings and if we point out the inconsistencies, they can be addressed to both save money and reduce the building’s carbon footprint.

Correlations

Once we have observed the typical patterns of energy use, the next question is: what weather conditions might cause consumption to increase or decrease beyond the normal range? To address that query, we will mainly rely on a simple measure called the Pearson correlation coefficient, or r-value. This statistic ranges from -1 to +1 and tells us the direction and strength of a linear relationship between two variables. Two variables that are perfectly positively linearly correlated have an r-value of +1, while two variables with a perfectly negative linear correlation have an r-value of -1. As an example, if temperature and energy have an r-value of 0.8, then as temperature increases, the energy also tends to increase in a linear fashion. The r-value does not tell us the slope of this relationship, only whether it is linear and positive/negative. Once we have established a linear relationship, we can use modeling to find the slope, or how large an effect changing one variable has on another. Examples of different r-values are shown below.

Different Pearson Correlation Coefficients (source)
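As a quick illustration of the statistic itself (on toy data, not one of the buildings), base R’s cor() computes Pearson’s r directly:

set.seed(42)
temp <- runif(100, min = 20, max = 40)          # daily mean temperature (C)
energy <- 50 + 3 * temp + rnorm(100, sd = 10)   # energy that rises with temperature
cor(temp, energy, method = "pearson")           # strongly positive, well above 0.8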

We could look at tables of the correlation coefficients, but I find that rather dull. A visual representation is much better at conveying relative differences and is easier to understand at a glance. A quick method for creating correlation plots is ggpairs, a function in the GGally package in R.

suppressMessages(library(GGally))

# Load in dataframes
srp.daily.summer <- suppressMessages(as.data.frame(read_csv('srp_daily_summer.csv')))
srp.daily.winter <- suppressMessages(as.data.frame(read_csv('srp_daily_winter.csv')))

# Rename the forecast column for clarity
srp.daily.summer <- dplyr::rename(srp.daily.summer, energy = forecast)
srp.daily.winter <- dplyr::rename(srp.daily.winter, energy = forecast)

# Lower-panel function: scatterplot with a linear fit
linear_func <- function(data, mapping, ...){
  p <- ggplot(data = data, mapping = mapping) +
    geom_point(alpha = 0.5, color = "blue") +
    geom_smooth(method = "lm", fill = "red", color = "red")
  p
}

ggpairs(data = srp.daily.summer,
        columns = c("ghi", "temp", "rh", "pwat", "ws", "energy"),
        lower = list(continuous = linear_func),
        title = "Phoenix Summer Correlations")

ggpairs(data = srp.daily.winter,
        columns = c("ghi", "temp", "rh", "pwat", "ws", "energy"),
        lower = list(continuous = linear_func),
        title = "Phoenix Winter Correlations")

These show correlations between all variables for a single building (in Phoenix, AZ) in summer and winter. The lower left contains scatterplots between each pair of variables, and the upper right shows the numerical value of r. The diagonals are density plots of the variables, because the correlation of a variable with itself is always 1! To find the correlation coefficient between two variables, pick one variable and move along its row to the other variable. For instance, in the summer plot, in the top right corner, we can see that energy has a 0.125 correlation with ghi (global horizontal irradiance, a measure of the intensity of sunlight hitting a surface). This is a slightly positive linear correlation, and if we move to the bottom left corner, we can observe the relationship in a scatterplot of ghi vs energy. The strongest correlation with energy during the summer is temperature, at a value of 0.575. This agrees with our intuition: when it is warmer during the summer, a building has to use more air conditioning and hence more electricity.

What if we want to look at the correlation values for all of the weather variables with energy for all of the buildings? That’s a lot of information for one plot, but it is doable with one of my favorite graphs: the correlation heatmap. There are built-in libraries in R for making heatmaps, but I decided to create my own starting with the basic geom_tile() in ggplot2.

library(reshape2)  # melt

# Pretty bad heatmap to use as a starter
plot_heatmap <- function(A, season = '') {
  # Create a matrix of the weather correlations
  weather.matrix <- as.matrix(A[1:7])

  # Rename the rows with the buildings
  rownames(weather.matrix) <- A$location

  # Turn the matrix into long format
  melted.weather <- melt(weather.matrix)
  colnames(melted.weather) <- c('Building', 'Weather', 'corr')

  # Create a rough draft of the correlation matrix
  print(ggplot(melted.weather, aes(x = Building, y = Weather, fill = corr)) +
          geom_tile() + ggtitle(sprintf('%s Correlation Heatmap', season)) +
          theme(axis.text.x = element_text(angle = 60, vjust = 0.5)))
  return(melted.weather)
}

# summer.corrs and winter.corrs hold each building's correlations with energy
summer.corrs <- dplyr::select(summer.corrs, ghi, dif, gti, temp, rh, pwat, ws, location)
winter.corrs <- dplyr::select(winter.corrs, ghi, dif, gti, temp, rh, pwat, ws, location)

# Use the function to create melted dataframes
melted.summer <- plot_heatmap(A = summer.corrs, season = 'Summer')
melted.winter <- plot_heatmap(A = winter.corrs, season = 'Winter')

# Create a quality correlation matrix heatmap by building on the previous one
plot_final_heatmap <- function(melted.A, season = '') {

  # Initial heatmap
  heatmap <- ggplot(melted.A, aes(x = Building, y = Weather, fill = corr)) +
    geom_tile(color = "white") +
    # Create a diverging color scale to highlight differences
    scale_fill_gradient2(low = "navy", high = "red", mid = "lemonchiffon",
                         midpoint = 0, limit = c(-0.8, 0.8), space = "Lab",
                         name = "Pearson Corr. Coeff.") +
    # Adjust the themes as needed
    theme_minimal() + xlab('Building Location') +
    theme(axis.text.x = element_text(angle = 45, vjust = 1,
                                     size = 12, hjust = 1, color = 'black'),
          axis.text.y = element_text(vjust = 0.5, size = 12, color = 'black'),
          panel.grid = element_blank(),
          plot.title = element_text(hjust = 0.5)) +
    coord_fixed() + ggtitle("Weather Correlation with Elec. Cons.")

  # Add labels to the heatmap and refine the theme
  finished.heatmap <- heatmap +
    geom_text(aes(Building, Weather,
                  label = round(corr, 2)), color = "black", size = 3.5) +
    theme(axis.title = element_text(size = 16, hjust = 0.5),
          panel.border = element_blank(),
          panel.background = element_blank(),
          axis.ticks = element_blank()) +
    guides(fill = guide_colorbar(barheight = 12, vjust = 0)) +
    ggtitle(sprintf('%s Correlation Heatmap', season))

  # Print the heatmap
  print(finished.heatmap)
}

plot_final_heatmap(melted.summer, season = 'Summer')
plot_final_heatmap(melted.winter, season = 'Winter')

In these plots, the color corresponds to the value of r, making it very easy to spot the anomalous building in the summer graph! The building in Portland, OR displays nearly the exact opposite correlations from all of the other buildings for every weather variable. Instead of an increase in energy use when the temperature rises during the summer, this building shows a decrease in energy use. Likewise, all other buildings show a decrease in energy use when the relative humidity increases, except for Portland! I notified the team and we got in contact with the building owners to see what was going on. It turns out they had no explanation for us and could not point out anything that made this building different. Our two hypotheses were: one, that the building had solar power, which produced more energy as the temperature rose because the sun would be out (and the irradiance would rise), so the building would draw less electricity from the grid; or two, that the electricity meter timestamp was off by some fixed amount. Our best guess was that the meter was 12 hours out of sync with the real time, but we have not had a chance to discuss this with the utility. The building owner has confirmed she is not aware of any solar panels, and I went so far as to look at the building on Google Maps to see if there were renegade solar panels anywhere nearby. Alas, I could not find any and the mystery remains. If you have any additional ideas, feel free to contact us!

A point of interest from the winter heatmap is that some of the correlation values for temperature are positive and some are negative. This is what the team told me to expect, because during the winter, buildings with electrical heating need to use more electricity as the temperature declines, leading to a negative coefficient. However, buildings with gas heating, or no heating at all, will not use more electricity as the temperature declines and may have a positive or near-zero r-value. This is a fact I could have realized beforehand, but it took actually digging through the data and creating some visuals to make the point clear.

Note: If you run the function for the first heatmap (plot_heatmap), you will see an awful plot that is barely comprehensible. I decided this was unacceptable and used the second function (plot_final_heatmap) to improve on the first. I enjoy the iterative process of data visualization and spending the extra time to make presentable plots. You can communicate the same info in a table or basic graph, but people are more likely to pay attention if you give them the message in something they will want to look at!

Visualizing Relationships

After identifying relationships, the next step is generally to quantify them by creating models that let us see how much a change in one variable affects the value of another. That will be the sole focus of report three, and for now, I want to take a quick look at how these relationships play out. The following code creates a plot of daily average energy use and daily average temperature during the summer for the building in Phoenix.

library(tidyr)  # gather

# srp.daily (daily averages for the Phoenix building) and the summer month
# vector are defined in the full code on GitHub
srp.daily.summer <- srp.daily[which(month(srp.daily$date) %in% summer), ]

# Reshape to long format for plotting multiple variables
srp_summer_long <- gather(srp.daily.summer, key = 'variable', value = 'value',
                          rh, energy, temp)
srp_summer_long$md <- paste0('0', lubridate::month(srp_summer_long$date), '-',
                             lubridate::day(srp_summer_long$date))
srp_summer_long <- srp_summer_long %>% group_by(md, variable) %>%
  summarize_all(funs(mean))
srp_summer_long$date <- as.Date(srp_summer_long$md, format = '%m-%d')

# Scale the variables so they can share a y-axis (energy: 15-min avg * 96;
# temp is additionally stretched by 1.5 for visual alignment)
srp_summer_long[which(srp_summer_long$variable == 'energy'), ]$value <-
  srp_summer_long[which(srp_summer_long$variable == 'energy'), ]$value * 96
srp_summer_long[which(srp_summer_long$variable == 'temp'), ]$value <-
  srp_summer_long[which(srp_summer_long$variable == 'temp'), ]$value * 1.5 * 96

# Create a plot of average temp vs average elec. cons.
ggplot(data = dplyr::filter(srp_summer_long, variable == 'energy' | variable == 'temp'),
       aes(as.Date(date), value, color = variable)) +
  geom_point(size = 1.5) +
  geom_line(size = 1.1) +
  scale_y_continuous(sec.axis = sec_axis(~ . * (1 / (96 * 1.5)),
                                         name = "Avg. Temp (C)")) +
  labs(color = 'variable') + xlab('') + ylab('kWh') +
  ggtitle('Phoenix Summer Average Daily Energy Use and Temp.') +
  scale_color_manual(values = c("firebrick", "darkgoldenrod1")) +
  theme(axis.text.y = element_text(color = 'firebrick', face = 'bold'),
        axis.text.y.right = element_text(color = 'darkgoldenrod1', face = 'bold'),
        legend.position = 'bottom')

The figure illustrates how temperature and daily energy use generally track one another; the correlation coefficient in this case was 0.77. It appears that the temperature leads the energy consumption: the temperature rises and then the energy consumption rises, which demonstrates the concept of thermal mass. This is the thermodynamic equivalent of inertia, which says that of two objects going the same speed, the heavier one will be more difficult to slow down. Likewise, a building with a greater thermal mass takes longer to change temperature, whether that is an increase or decrease. There are other variables at play here besides temperature that we should take into account, but overall the trend shows that temperature and energy use are positively correlated for this building during the summer. An animated version of this plot for one week is the title image of the post.

To see a negative correlation, we can look at the relationship between daily average relative humidity and energy consumption.

# Rescale relative humidity onto the energy axis
srp_summer_long[which(srp_summer_long$variable == 'rh'), ]$value <-
  srp_summer_long[which(srp_summer_long$variable == 'rh'), ]$value * 90

ggplot(data = dplyr::filter(srp_summer_long, variable == 'energy' | variable == 'rh'),
       aes(as.Date(date), value, color = variable)) +
  geom_point(size = 1.5) +
  geom_line(size = 1.1) +
  scale_y_continuous(sec.axis = sec_axis(~ . * (1 / 90),
                                         name = "Avg. RH (%)")) +
  labs(color = 'variable') + xlab('') + ylab('kWh') +
  ggtitle('Phoenix Summer Average Daily Energy Use and Relative Humidity') +
  scale_color_manual(values = c("firebrick", "darkorange1")) +
  theme(axis.text.y = element_text(color = 'firebrick', face = 'bold'),
        axis.text.y.right = element_text(color = 'darkorange1', face = 'bold'),
        legend.position = 'bottom')

The relative humidity has a correlation of -0.44 with energy use. This is a bit more difficult to spot in the graph because the relationship is not as strong as that for temperature, but we can see that the two lines tend to travel in opposite directions.

Machine Learning Modeling

What we have learned so far is that there are many time patterns and weather correlations in the energy data. Now the question is: can we create a machine learning model that predicts the energy consumption from the weather and time conditions? Part three will go into detail about the modeling process, but I wanted to show the evaluation results of several different models here. The overall objective is to make a model that can predict six months’ worth of energy consumption given the weather and time information. The model is trained on all the data except for the last six months and then tested on the held-out data. During training, the model is able to see both the weather and time variables (known as features in machine learning) and the energy consumption (called the target). The goal of training is for the model to learn the mapping between features and the target; the model tries to figure out how much energy will be used during a 15-minute interval based on the weather, time of day, day of the week, and so on. Then, to assess the accuracy of the model, it has to predict electricity consumption on the test set from only the features (it is never allowed to see the test set targets because that would be cheating!).
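A minimal sketch of that time-based split, assuming a cleaned building data frame with a timestamp column (here a toy frame, bldg, stands in for one of the building files):

library(lubridate)

# Toy frame standing in for a cleaned building file
bldg <- data.frame(timestamp = seq(ymd_hms("2015-01-01 00:00:00"),
                                   ymd_hms("2016-12-31 23:45:00"),
                                   by = "15 min"))
bldg$forecast <- runif(nrow(bldg))

# Hold out the final six months for testing; train on everything before
cutoff <- max(bldg$timestamp) %m-% months(6)
train <- subset(bldg, timestamp <= cutoff)
test <- subset(bldg, timestamp > cutoff)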

Creating an accurate model requires lots of engineering, and we first need to choose a type of model before we can delve into optimizing that model. To cover a wide range of model complexity, I choose three algorithms for evaluation: linear regression, support vector machine regression, and random forest regression. If you’re not familiar with these methods, it’s not that important for this part of the report, and they will get a full explanation in the next part. For now, just know that linear regression tries to fit a straight line to the data, a support vector machine transforms the features into a higher dimension (yes, it is pretty much magic) and then fits a line to the data, and the random forest is composed of hundreds of decision trees. A single decision tree is basically a flowchart of questions. Starting at the root, you follow the branches down, answering each question until you arrive at an answer and a random forest works by taking a vote of many of these trees to find the average prediction, which is going to be closer to the true value on average than any one particular answer. A decision tree used to determine if an applicant should be approved for a loan based on their info is presented below. Start at the top, answer each question based on the individual’s background, and eventually you arrive at a decision!

Example of a Decision Tree (source)
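The models in this report were actually implemented in Python, but as a rough R illustration, here is a sketch fitting the same three model families to toy data, using stats::lm, e1071::svm, and randomForest::randomForest (the package choices are mine, not necessarily the project’s):

library(e1071)        # support vector machine regression
library(randomForest) # random forest regression

set.seed(0)
n <- 500
toy <- data.frame(temp = runif(n, 0, 40), num_time = runif(n, 0, 24))
toy$energy <- 5 + 0.5 * toy$temp + 3 * sin(pi * toy$num_time / 24) + rnorm(n)

fit_lm <- lm(energy ~ temp + num_time, data = toy)            # straight-line fit
fit_svm <- svm(energy ~ temp + num_time, data = toy)          # kernel regression
fit_rf <- randomForest(energy ~ temp + num_time, data = toy)  # ensemble of trees

# All three models predict the same way
preds <- predict(fit_rf, newdata = toy)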

Model Evaluation

With details to come in part three, here I will just show the results of evaluation. The models were compared to one another by splitting the data into a random training and testing set (using the same set for each model), training the model, testing the model, and comparing models using several metrics. The data preparation was completed in R, and the actual implementation of the algorithms was done in Python. The metrics chosen are presented below:

1. rmse: root mean square error, the square root of the average squared deviation between the predictions and the true values. This is a useful metric because it has the same units (kWh) as the true value, so it serves as a measure of how many kWh the average estimate is from the known true target. The rmse depends on the magnitude of the target variable and is often normalized for model evaluation.

2. MAPE: mean absolute percentage error, the average over all predictions of the absolute error divided by the true value and multiplied by 100. The percentage accuracy can be found by subtracting the MAPE from 100%.

3. R-squared: the percentage of the variation in the dependent variable explained by the independent variables in the model. In other words, with an r-squared of 0.6, our model can explain 60% of the variability in the energy consumption from the weather and time variables. (All three metrics are sketched in code just after this list.)
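Here is a minimal sketch of all three metrics as R functions (my formulations of the standard definitions, not the project’s evaluation code):

rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
mape <- function(actual, pred) mean(abs((actual - pred) / actual)) * 100
r_squared <- function(actual, pred) {
  1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
}

actual <- c(10, 12, 15)
pred <- c(11, 11, 14)
rmse(actual, pred)       # 1 kWh off on average
mape(actual, pred)       # ~8.3% error
r_squared(actual, pred)  # ~0.76 of the variance explained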

A lower MAPE is better because it is the percentage error, and a higher r-squared is better because it means the model captures the relationships in the data more completely. The MAPE and R-Squared results are presented below the code used to generate them:

# Read in results saved as a feather file
results <- as.data.frame(feather::read_feather('feather/results.feather'))
# Building names come from the correlation frame defined earlier
results$building <- winter.corrs$location

# Melt into long format
results_long <- reshape::melt(as.data.frame(results), id = c('season', 'building'))
results_long$variable <- as.character(results_long$variable)

# Shorten the random forest column names
results_long[which(results_long$variable == "random_forest_r_squared"), ]$variable <- "rf_r_squared"
results_long[which(results_long$variable == "random_forest_rmse"), "variable"] <- "rf_rmse"
results_long[which(results_long$variable == "random_forest_mape"), "variable"] <- "rf_mape"

# Split the columns into model name and metric
results_long$model <- stringr::str_split_fixed(results_long$variable, "_", 2)[, 1]
results_long$metric <- stringr::str_split_fixed(results_long$variable, "_", 2)[, 2]

# Order the seasons for plotting
results_long$season <- factor(results_long$season,
                              levels = c('spring', 'summer', 'fall', 'winter'))

# Plot MAPE
ggplot(filter(results_long,
              metric == 'mape' & building != "Lewisville, TX" & building != "Sacramento, CA"),
       aes(season)) +
  geom_bar(aes(y = value, fill = model),
           color = "black", width = 0.8, position = 'dodge', stat = 'identity') +
  facet_wrap(~building, scales = 'free_y') + xlab("season") + ylab("mape (%)") +
  theme_hc(12) + theme(axis.text.x = element_text(angle = 60, vjust = 0.5),
                       legend.position = 'right',
                       plot.title = element_text(hjust = 0.5, size = 16),
                       axis.text = element_text(color = 'black'),
                       axis.title = element_text(color = 'black', size = 14)) +
  ggtitle("Model MAPE Comparison") + scale_fill_wsj()

# Plot r-squared
ggplot(filter(results_long,
              metric == 'r_squared' & building != "Lewisville, TX" & building != "Sacramento, CA"),
       aes(season)) +
  geom_bar(aes(y = value, fill = model),
           color = "black", width = 0.8, position = 'dodge', stat = 'identity') +
  facet_wrap(~building, scales = 'free_y') + xlab("season") + ylab("r-squared") +
  theme_hc(12) + theme(axis.text.x = element_text(angle = 60, vjust = 0.5),
                       legend.position = 'right',
                       plot.title = element_text(hjust = 0.5, size = 16),
                       axis.text = element_text(color = 'black'),
                       axis.title = element_text(color = 'black', size = 14)) +
  ggtitle("Model R-Squared Comparison") + scale_fill_wsj()

These results show that the random forest is without a doubt the winner! Next steps for the random forest are tuning the hyperparameters (adjusting the settings to maximize performance) and then assessing how well it can predict six months of energy use. To give you a little idea of what to expect in the next part, the following graphs show an entire dataset with the predictions of the random forest overlaid, and then a typical weekly prediction.

The Random Forest shows much promise for prediction, but there is definitely some work to do! That however can wait until the next part…

Conclusions

After an enjoyable exploration of the eight building energy datasets, we can draw the following conclusions:

1. Building energy use exhibits daily, weekly, seasonal, and yearly patterns.
2. Buildings in different climate zones have different patterns on every timescale and varying correlations between weather and energy use.
3. Temperature and global horizontal irradiance are the weather conditions most positively linearly correlated with a building’s energy use during the summer (except for that anomalous Portland mystery).
4. A linear model is not able to capture all of the non-linear time trends and relationships between weather and electricity consumption.
5. Machine learning techniques such as a random forest may be able to accurately predict electricity consumption from weather and time conditions, a possibility which will be fully investigated in further work!

Thanks for making it to the end. As a special bonus, here is another animated plot:

See you in part 3!
