Building Energy Data Analysis Part One

Background, Exploratory Questions, and First Plots!

Author’s Note

Since August 2017, I have had the privilege of working as a data science researcher with the EDIFES project at Case Western Reserve University. The name is awkward (as a US Department of Energy project, it has to have an acronym), but the idea is simple: take the traditional time-consuming and expensive in-person building energy audit process and make it virtual using data science techniques. The goal is to greatly reduce the time and cost of energy audits while increasing their reliability, resulting in lower building carbon footprints and significant savings for companies. We want to make it easy for companies to do well by doing good!

As a first step in my involvement with the project, I was given a small subset of building data and told to get familiar with the data by performing a full exploratory data analysis (EDA). An EDA is a great way to gain an intuitive understanding of the data involved in the project. Moving forward, this understanding will be vital, as I will need to know the expected patterns in order to quickly spot anomalies or interesting trends in the data. Moreover, a full EDA will let me try out some of the functions already developed by the EDIFES team and potentially develop new models. Finally, doing an EDA is enjoyable! There is something almost magical about taking millions of lines of data and crafting a single chart to explain all the trends, or making a model that predicts the future with astonishing accuracy.

This will be the first in a three-part series (with a bonus fourth article entirely devoted to machine learning!). Although promising multiple parts usually leads to disappointment when the author fails to deliver (looking at you, George RR Martin), take heart in the knowledge that part 2 is already complete and part 3 is well under way! If you decide to stick with me for the whole journey, you will be treated to the wonders of the R programming language, Python, Tableau, machine learning with scikit-learn, and maybe even some deep learning with TensorFlow! All of these tools are vital to data scientists, and I hope this may spark some interest in anyone looking to get into this exciting (and high-paying) field. As always, I welcome any feedback by email and encourage everyone to check out the whole project on GitHub.

Project Overview

Problem Statement

An energy audit is an inspection and analysis of the electricity consumption patterns of a building, carried out by documenting incoming and outgoing energy flows. When done correctly, energy audits can significantly increase the overall efficiency of a building, thereby reducing electricity costs and greenhouse gas emissions. However, traditional physical building energy audits are resource intensive in terms of both time and cost. In-person audits require a team of auditors to travel to a building and perform a series of tests and inspections with specialized equipment to determine the efficiency of a building and identify areas for improvement. In addition to the resource costs of a physical audit, assessments can vary significantly between auditing teams, calling into doubt any economic benefit of an in-person audit. Real, concrete gains in building energy use are possible: the US Department of Energy estimates that building efficiency could increase by 30% by 2030 [1] by implementing existing technologies. However, it is clear that physical energy audits are not the optimal tool for identifying efficiency improvement opportunities.

EDIFES Approach

EDIFES (Energy Diagnostic Investigator for Efficiency Savings) is an ARPA-E funded project [1] jointly undertaken by the Great Lakes Energy Institute and CWRU that aims to develop a software tool for performing virtual energy audits. The overall objective of the project, under the direction of Professors Alexis Abramson and Roger French, is to eliminate the need for physical audits by determining areas for efficiency improvements using only overall building electricity consumption data and building metadata (location and square footage). Part of this process is developing a set of building markers from the electricity data, such as building occupancy schedule and HVAC on/off cycle times, that can be used to characterize a building. [2] In the first phase of the project, EDIFES has partnered with several companies to gain access to building data, perform preliminary analyses of those buildings, and receive feedback on the accuracy of the resulting reports. [3]

Introduction to Dataset

I will be exploring a dataset of time-series electricity consumption data from 8 Progressive office buildings. In addition, the building information contains metadata allowing the weather data associated with each building to be obtained from solarGIS. The data has already been cleaned by members of the EDIFES team and the corresponding weather data is attached to the time-series for each building. Each building is contained in a separate csv file. In the initial cleaning process performed by the EDIFES team [4], 2 of the office buildings did not pass the data quality standards established by the project because of issues with anomalies and missing values. As these buildings have already been partially modeled by the team, I will focus on exploring questions that might have been beyond the scope of the initial EDIFES investigation.

Building Metadata

The building metadata contains the city, square footage, climate zone, and annual consumption of each building. The location is primarily useful for obtaining the corresponding weather data from solarGIS.

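```{r}
# Load the needed libraries without displaying messages
suppressMessages(library(tidyverse))            # read_csv, ggplot2, dplyr
suppressMessages(library(knitr))                # kable
suppressMessages(library(PerformanceAnalytics)) # chart.Correlation
# Load in the metadata with no messages
metadata <- suppressMessages(read_csv("metadata/progressive_metadata.csv"))
# Make a nicely formatted table of metadata
kable(metadata[c(1, 2, 4, 6, 8)], caption = "Progressive Building Metadata")
```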
Table 1: Progressive Building Metadata

The buildings with reference numbers 92 and 95 were not analyzed by EDIFES because they did not pass the data quality standards of the project.

A brief explanation of the metadata variables follows:

• Ref: reference number
• Location: City, State
• Square Footage: Size of building in ft²
• Climate Zone: Köppen-Geiger climate zone classification
• Annual Consumption: Total consumption per year in kWh

Energy and Weather Data

The energy data for 8 of the buildings has already been cleaned and each building is in a separate csv file with associated weather information. Each row of the csv data files contains one time interval of fifteen minutes. The columns contain the energy consumption and weather data associated with that building and timestamp. The csv file for each building contains the following columns.

Table 2: Building Energy and Weather Data
```{r}
# Read in data for one building and display the first five rows of each column group
f.SRP <- suppressMessages(read_csv("data/f-SRP_weather.csv"))
head(f.SRP[1:6], 5)
head(f.SRP[7:13], 5)
head(f.SRP[14:20], 5)
```
Figure 1: Typical Energy and Weather Data for a Building

Not all of the variables were examined in the initial modeling of the data by the EDIFES team. The following variables were of primary interest to the team on the first pass through the data.

• timestamp: gives the date and time for the energy measurement in fifteen minute intervals
• elec_cons: the raw electricity consumption data (kWh)
• elec_cons_imp: 1 or 0 marker if the data was linearly imputed
• power_dem: difference in power (kW) between measurements
• biz_day: business day or not (accounts for holidays)
• day_of_week: relatively self-explanatory
• week_day_end: weekday or weekend
• num_time: number of hours since the start of day
• temp: temperature, the weather variable most correlated with energy consumption
• rh: relative humidity, the weather variable with the second highest positive correlation with energy consumption
• forecast: final cleaned energy consumption with anomalies removed and missing data imputed using a custom function
• anom_missed_flag: marker that tells if forecast energy had to be corrected from the raw number. There are two reasons for the raw data to be corrected: the data was missing, or the data was anomalous (an outlier).

Although not all of the variables were used in the modeling by the EDIFES team, I will keep all of the weather data to try and find relationships that may have gone overlooked.

Data Cleaning

The data has been completely cleaned. Team members used the tsoutliers R package to identify and correct anomalies. Missing data was either linearly imputed (if fewer than 4 points in a row were missing) or imputed using a custom function developed for the project. Assembling the dataframes will be as simple as reading in the data from the correct csv file. I expect to have one dataframe for each building rather than collecting them into one massive data structure. The EDIFES team is more concerned with analyzing buildings individually than with performing comparisons, so the best strategy is to perform the EDA one building at a time.
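To make the short-gap rule concrete, here is a minimal sketch of linear imputation restricted to runs of fewer than 4 missing points. It is not the EDIFES cleaning code: the helper name `impute_short_gaps`, its `max_gap` parameter, and the toy vector are assumptions made for illustration, and the returned 0/1 marker only mimics the idea behind the `elec_cons_imp` column.

```r
# Hypothetical helper (not EDIFES code): linearly interpolate only short gaps
impute_short_gaps <- function(x, max_gap = 3) {
  missing <- is.na(x)
  if (!any(missing)) {
    return(list(values = x, imputed = integer(length(x))))
  }
  # Length of the run (missing or observed) that each element belongs to
  runs <- rle(missing)
  run_len <- rep(runs$lengths, runs$lengths)
  # Linear interpolation between the nearest observed neighbours;
  # rule = 1 leaves leading and trailing gaps as NA
  idx <- seq_along(x)
  interp <- approx(idx[!missing], x[!missing], xout = idx, rule = 1)$y
  # Fill only missing points that sit in a run of at most max_gap points
  fill <- missing & run_len <= max_gap & !is.na(interp)
  x[fill] <- interp[fill]
  list(values = x, imputed = as.integer(fill))
}

# Toy series: one single-point gap and one four-point gap
cons <- c(10, NA, 14, NA, NA, NA, NA, 20, 22)
result <- impute_short_gaps(cons)
result$values   # the single gap is filled, the four-point gap stays NA
result$imputed  # 1 marks the linearly imputed point
```

In this sketch the longer gap is deliberately left as NA, which is where a separate method such as the project's custom function would take over.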

Exploratory Questions

With all this beautiful tidy data at my hands, what questions should I ask?

My primary interest is correlations between weather and energy consumption. As building energy consumption can be strongly dependent on the weather, this research holds potential for identifying opportunities for significant energy savings. In particular, I want to identify what weather conditions result in higher energy usage. This information could then be used to inform building energy managers that their building could be optimized to their climate or could be used to develop a more efficient heating/cooling schedule in anticipation of forecasted weather.

Furthermore, if it were possible to develop a model that accurately predicted energy consumption for the next day based on the weather forecast, the predictions could be used to set the building energy schedule in preparation for the coming weather conditions. I would like to see if a machine learning model could accurately predict the next day’s energy consumption based on current/forecasted weather conditions. With the addition of keras into R, it is now possible to develop a deep or recurrent neural network relatively quickly. The weather neural network would take in the previous day’s weather and electricity consumption as features and try to predict the next day’s electricity consumption as the target. A recurrent neural network with LSTM cells would be an ideal model for this task because it has the capability to “remember” previous inputs, which is vital in time series (or any sequence) analysis.
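Before reaching for a neural network, it may help to see how the problem can be framed as supervised learning. The sketch below pairs each day's weather and consumption with the following day's consumption and fits a plain linear regression as a baseline. This is a swapped-in stand-in, not the LSTM described above, and the data is simulated: only the column names `temp` and `forecast` come from the dataset, while everything else is an assumption for illustration.

```r
# Simulated daily data (placeholder values, not the Progressive buildings)
set.seed(42)
n <- 120
daily <- data.frame(
  date = seq(as.Date("2016-01-01"), by = "day", length.out = n),
  temp = 20 + 10 * sin(2 * pi * (1:n) / 365) + rnorm(n, sd = 2)
)
daily$forecast <- 50 + 2 * daily$temp + rnorm(n, sd = 3)

# Pair today's weather and consumption (inputs) with tomorrow's consumption (target)
supervised <- data.frame(
  temp_today    = head(daily$temp, -1),
  cons_today    = head(daily$forecast, -1),
  cons_tomorrow = tail(daily$forecast, -1)
)

# Chronological split: train on the first 80% of days, test on the rest
n_train <- floor(0.8 * nrow(supervised))
train <- supervised[seq_len(n_train), ]
test  <- supervised[-seq_len(n_train), ]

# Linear regression baseline that any neural network would need to beat
baseline <- lm(cons_tomorrow ~ temp_today + cons_today, data = train)
pred <- predict(baseline, newdata = test)
rmse <- sqrt(mean((test$cons_tomorrow - pred)^2))
rmse
```

Splitting by time rather than at random keeps future observations out of the training set, which matters for any forecasting model.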
I am also interested in comparative analysis. Although I will look at the buildings individually, it might be useful to compare buildings with similar characteristics located in different climate zones. Moreover, I want to determine how a “good” building performs with regard to energy efficiency. I think these questions could be answered by comparing the buildings and using average building energy use statistics from other data sources.

Finally (at least for now), I would like to apply the next series of 10 building markers currently being developed to these buildings. Studying the feedback provided by Progressive on the first 10 building markers was a good learning experience because it showed what the team was able to predict correctly and which parts of the models need to be adjusted. A similar prediction-feedback cycle will take place with the second set of 10 markers, and it will show what we can and cannot understand about a building from a single stream of data. This would involve using code currently under development by EDIFES members. I hope to develop a pipeline that will automatically run the 10 building markers for a building. For the first 10 markers, the EDIFES team developed a script that took in a cleaned building dataset and generated a complete report. A script for the next 10 markers will need to be applicable to any cleaned data and not only the dataset I have for this project.

Based on these three paths of exploration, I formulated the following set of 6 questions:

1. Which weather variables are most correlated with energy consumption and what is the physical explanation behind this?

2. Can the current day’s weather data be used to predict tomorrow’s energy consumption?

3. Controlling for building size, which climate zone is the most energy intensive?

4. Are these buildings “good” in terms of energy efficiency? What does “good” mean in this context (i.e. how can it be quantified)?

5. Can the next 10 markers accurately characterize these buildings?

6. Based on the answers to the previous 5 questions, are there concrete recommendations for building managers to reduce energy use?
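For question 3, one simple way to control for building size is to divide annual consumption by square footage and compare the result across climate zones. The sketch below shows the arithmetic only: the data frame, its column names, and every number in it are placeholders invented for illustration, not values from the Progressive metadata.

```r
# Placeholder metadata (hypothetical buildings and made-up numbers)
buildings <- data.frame(
  ref          = c("A", "B", "C", "D"),
  climate_zone = c("BWh", "BWh", "Dfa", "Cfa"),
  sq_ft        = c(100000, 50000, 80000, 120000),
  annual_kwh   = c(2000000, 900000, 1200000, 2400000)
)

# Normalize by size: annual consumption per square foot
buildings$kwh_per_sqft <- buildings$annual_kwh / buildings$sq_ft

# Average the size-normalized consumption within each climate zone
zone_eui <- aggregate(kwh_per_sqft ~ climate_zone, data = buildings, FUN = mean)

# Rank zones from most to least energy intensive
zone_eui <- zone_eui[order(-zone_eui$kwh_per_sqft), ]
zone_eui
```

With only a handful of buildings per zone, a ranking like this is suggestive at best, so it would be a starting point for comparison rather than a conclusion.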

I expect these questions to change and adapt over the course of this exploration. EDA is an iterative process, and trying to answer one question can lead to 5 additional queries. Moreover, all data has fundamental limitations and may not be appropriate to answer the initial questions. This is not an undesirable situation because the heart of EDA is to learn what your data can teach you before employing more sophisticated statistical methods to test hypotheses.

In the words of John Tukey (the mathematician who co-developed the FFT algorithm and introduced the boxplot):

Data analysis, like experimentation, must be considered as an open-ended, highly interactive, iterative process, whose actual steps are selected segments of a stubbily branching, tree-like pattern of possible actions.

Exploratory Visualizations

Yeah, that’s great and all, but show us some plots!

First up, I wanted to show the location and the associated climate zone. This demonstrates the diversity in climate zones. This visual was made using Tableau, a powerful visualization tool (and as students, we can get the professional edition for free). Part of data science is using the appropriate tools for the situation, and if you have clean, tidy data and want to make a decent looking map in five minutes, then Tableau is the best tool for the task! (If you have a csv file and are thinking about making a plot in Excel, stop, do yourself a favor, and open Tableau so you can make a professional looking chart.)

Figure 2: Location of Buildings Sized by Annual Consumption per Square Footage

As can be seen, 6 different climate zones are represented, meaning there is considerable variability for comparison purposes.

Now we can go to the world of R for a few sample plots of the energy consumption and weather data. These plots look at one of the buildings in Phoenix, AZ. The first plot shows the raw energy consumption data.

```{r}
# Plot the raw electricity consumption over time
ggplot(f.SRP) +
  geom_jitter(aes(timestamp, elec_cons), color = "blue", alpha = 0.2) +
  ylab("Elec. Cons. (kWh)") + xlab('') +
  ggtitle("Raw Electricity Consumption")
```
Figure 3: Raw Electricity Consumption of a Typical Building

The next figure shows the cleaned energy consumption for the same building.

```{r}
# Plot the complete cleaned energy data over time
ggplot(f.SRP) +
  geom_jitter(aes(timestamp, forecast), color = "red", alpha = 0.2) +
  ylab("Elec. Cons. (kWh)") + xlab('') +
  ggtitle("Cleaned Electricity Consumption")
```
Figure 4: Cleaned Energy Consumption for a Typical Building

These graphs are somewhat unrefined and cluttered, but they primarily serve to demonstrate the cleaned data does not contain the anomalies or missing data that are present in the raw data.

A better visualization is to plot the mean daily energy consumption over the extent of the data. This smooths out the intraday variability inherent when taking observations every 15 minutes.

```{r}
# Create a date-only column in the dataframe
f.SRP$date <- as.Date(format(f.SRP$timestamp, "%Y-%m-%d"))
# Average the cleaned consumption for each date
# aggregate() names the grouping column Group.1 and the averaged column x
daily.df <- aggregate(f.SRP$forecast, by = list(f.SRP$date), FUN = mean)
# Plot the average daily electricity consumption
ggplot(daily.df, aes(Group.1, x)) +
  geom_point() +
  geom_line() +
  geom_smooth() +
  xlab('') + ylab("Elec Cons. (kWh)") +
  ggtitle("Avg. Daily Electricity Consumption")
```
Figure 5: Mean Daily Energy Consumption

As expected for Phoenix, AZ, in the American southwest, energy consumption rises during the summer months and falls during the winter months. It will be interesting to see how much of a difference a “cool” summer or a “warm” winter has on energy consumption.

The next two plots look at possible correlations between energy use and weather conditions. As a hypothesis, I would expect electricity consumption to increase with an increase in temperature and decrease with a decrease in relative humidity. The first plot shows the relationship between average daily electricity consumption and average daily temperature.

```{r}
# Create a dataframe of daily averages
# aggregate() names the grouping column Group.1 and keeps the names of the averaged columns
daily.df <- aggregate(f.SRP[, c("forecast", "temp", "rh", "ws")],
                      by = list(f.SRP$date), FUN = mean)
# Plot the daily average electricity consumption and daily average temperature
ggplot(daily.df, aes(x = Group.1)) +
  geom_point(aes(y = forecast)) +
  geom_line(aes(y = forecast)) +
  geom_smooth(aes(y = forecast, color = "Avg. Elec Cons.")) +
  geom_smooth(aes(y = temp, color = "Avg Temp")) +
  xlab('') + ylab("Avg. Elec. Cons.") +
  labs(color = "Legend") +
  # The secondary axis reuses the primary scale unchanged (multiplied by 1)
  scale_y_continuous(sec.axis = sec_axis(~ . * 1, name = "Avg. Temp (C)")) +
  ggtitle("Avg. Daily Electricity Consumption and Temperature")
```
Figure 7: Average Daily Temperature and Energy Consumption

The next plot shows the relationship between average daily relative humidity and average daily energy consumption.

```{r}
# Plot the average daily electricity consumption
# and the average daily relative humidity
ggplot(daily.df, aes(x = Group.1)) +
  geom_point(aes(y = forecast)) +
  geom_line(aes(y = forecast)) +
  geom_smooth(aes(y = forecast, color = "Avg. Elec Cons.")) +
  geom_smooth(aes(y = rh, color = "Relative Humidity")) +
  xlab('') + ylab("Avg. Elec. Cons.") +
  labs(color = "Legend") +
  scale_y_continuous(sec.axis = sec_axis(~ . * 1, name = "Avg. Relative Humidity (%)")) +
  ggtitle("Avg. Daily Electricity Consumption and Relative Humidity")
```
Figure 8: Average Daily Relative Humidity and Energy Consumption

Finally, to quantify all relationships between weather variables and energy consumption, I can generate the entire correlation matrix. To capture the relationships, this code calculates the Pearson correlation coefficient, a measure of the linear relationship between two variables. It ranges from +1 indicating a positive perfectly linear relationship to -1, indicating a negative perfectly linear relationship. The image below shows illustrative values of the Pearson correlation coefficient.

Figure 9: Varying Values of the Pearson Correlation Coefficient [source]
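As a small illustration of what the coefficient measures, the sketch below computes Pearson's r directly from its definition (the sum of the products of deviations from the mean, divided by the square root of the product of the summed squared deviations) and compares it with R's built-in `cor()`. The helper name `pearson_r` and the six toy values are made up for the example.

```r
# Pearson correlation from its definition (hypothetical helper for illustration)
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

# Toy values: consumption rises with temperature, but not perfectly linearly
temp <- c(12, 15, 20, 25, 30, 33)
cons <- c(40, 44, 52, 61, 70, 75)

pearson_r(temp, cons)  # hand-rolled value
cor(temp, cons)        # built-in, should agree with the line above

# Perfectly linear relationships sit at the two extremes
# (+1 and -1, up to floating-point rounding)
pearson_r(temp, 3 * temp + 1)
pearson_r(temp, -2 * temp + 5)
```

Because the coefficient only captures linear association, a strong curved relationship can still produce a value near zero, which is one reason to look at the scatterplots as well.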

In the correlation plot below, forecast refers to the final cleaned energy value. Looking down the forecast column shows first the distribution of the cleaned energy (as a histogram) and then three scatterplots showing the trends between energy consumption and temperature, relative humidity, and wind speed respectively. Moving rightward across the forecast row are the correlation coefficients of the weather variables with energy. The stars represent the statistical significance of each correlation, with three stars indicating a p-value below 0.001. The strength of the linear relationship is given by the size of the coefficient itself.

```{r}
# Create a correlation matrix with daily average final cleaned energy
# (forecast), temperature, relative humidity, and windspeed
# Only days before 2016-05-01 are included
daily.subset <- dplyr::filter(daily.df, Group.1 < as.Date("2016-05-01"))
suppressWarnings(chart.Correlation(daily.subset[, c("forecast", "temp", "rh", "ws")],
                                   histogram = TRUE, pch = 19))
```
Figure 10: Correlations between Weather Variables and Energy Consumption

There are definitely hints of meaningful relationships between forecast (the final cleaned energy) and temperature and relative humidity. There does not appear to be much of a relationship between wind speed and energy consumption, but the possibility cannot yet be ruled out with these preliminary visualizations!
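One way to probe a weak relationship further is a formal test of the correlation. The sketch below runs `cor.test()` on simulated data in which consumption depends on temperature but not on wind speed, then maps each p-value to significance stars with `symnum()`. The data, the helper name `signif_stars`, and the cutpoints (R's conventional 0.001, 0.01, 0.05, 0.1 levels) are assumptions for illustration and are not taken from the building data.

```r
# Simulated daily averages (placeholder values): consumption depends on temp only
set.seed(1)
n <- 100
temp <- rnorm(n, mean = 25, sd = 5)
ws <- rnorm(n, mean = 3, sd = 1)
forecast <- 50 + 2 * temp + rnorm(n, sd = 4)

# Pearson correlation tests against the null hypothesis of zero correlation
temp_test <- cor.test(forecast, temp, method = "pearson")
ws_test <- cor.test(forecast, ws, method = "pearson")

# Hypothetical helper: convert a p-value to conventional significance stars
signif_stars <- function(p) {
  as.character(symnum(p, corr = FALSE, na = FALSE,
                      cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                      symbols = c("***", "**", "*", ".", " ")))
}

# Coefficient, p-value, and stars for each weather variable
data.frame(
  variable = c("temp", "ws"),
  r        = c(unname(temp_test$estimate), unname(ws_test$estimate)),
  p_value  = c(temp_test$p.value, ws_test$p.value),
  stars    = c(signif_stars(temp_test$p.value), signif_stars(ws_test$p.value))
)
```

A small p-value says the correlation is unlikely to be zero, not that it is large, so the coefficient and the stars answer different questions.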


This first report identified the importance of the project and approach of the EDIFES team, posed a few questions to guide the analysis, and explored some preliminary relationships quantitatively and visually. My exploration of the data will be guided by a set of six initial questions with more to come as I start a thorough examination of the data. Finally, I have made some preliminary charts of the data and have identified possible relationships for further investigation. See you in Report 2!


[1]: “About the Commercial Buildings Integration Program | Department of Energy”, 2016. [Online].

[2]: E. Pickering, “EDIFES 0.4 Scalable Data Analytics for Commercial Building Virtual Energy Audits”, Master of Science in Mechanical Engineering, Case Western Reserve University, 2017.

[3]: M. Hossain, “Development of Building Markers and an Unsupervised Non-Intrusive Disaggregation Model for Commercial Building Energy Usage”, Ph.D., Case Western Reserve University, Department of Mechanical and Aerospace Engineering, 2017.

[4]: Rojiar Haddadian, Arash Khalilnejad, Mohammad Hossain, Jack Mousseau, Shreyas Kamath, Ethan Pickering

Data Scientist at Cortex Intel, Data Science Communicator