By Steve Mock, director of data and information services, Pecan Street Inc.

Pecan Street’s database of real-world energy and water use is the largest on the planet, and it has allowed us and researchers from around the world to explore a host of research questions.

You can browse Pecan Street’s published research, or you can scroll through the more than 300 peer-reviewed research papers that have used Pecan Street data.

But beyond any one analysis, there’s a bigger reason why we collect so much data. Emerging tools like artificial intelligence and machine learning have the potential to revolutionize how we generate, move, store and use critical resources like electricity and water. But to reach their potential, these tools require mountains of data.

The devices we install in our participants’ homes collect electricity data from between 12 and 32 circuits per home across five power dimensions (real power, total harmonic distortion, current, angle, and apparent power), sampled either once per minute or once per second. Our ETL (Extract, Transform, Load) process streams this data out of the homes, parses it, and inserts it into a relational database hosted on-site in Pecan Street’s data center.
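
To make the stream, parse, and insert steps concrete, here is a minimal sketch in Python. It is not our production pipeline: the JSON message format, the field and column names, and the in-memory SQLite database standing in for the on-site relational database are all assumptions made for illustration.

```python
import json
import sqlite3

# The five power dimensions named above; these column names are hypothetical.
DIMENSIONS = ("real_power", "thd", "current", "angle", "apparent_power")


def parse_reading(line):
    """Transform: turn one raw JSON line into a flat row tuple."""
    msg = json.loads(line)
    return (msg["home_id"], msg["circuit"], msg["ts"]) + tuple(
        float(msg[d]) for d in DIMENSIONS
    )


def load(conn, rows):
    """Load: bulk-insert parsed rows into the relational table."""
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows
    )
    conn.commit()


def run_etl(conn, stream):
    """Extract lines from the stream, parse them, and insert them."""
    rows = [parse_reading(line) for line in stream if line.strip()]
    load(conn, rows)
    return len(rows)


# In-memory SQLite is only a stand-in for the real database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (home_id INTEGER, circuit INTEGER, ts TEXT, "
    + ", ".join(f"{d} REAL" for d in DIMENSIONS)
    + ")"
)

# One made-up reading in the assumed message format.
sample_stream = [
    json.dumps({
        "home_id": 1, "circuit": 3, "ts": "2024-01-01T00:00:00",
        "real_power": 120.5, "thd": 0.04, "current": 1.1,
        "angle": 12.0, "apparent_power": 123.0,
    }),
]
inserted = run_etl(conn, sample_stream)
```

Batching rows into a single `executemany` call, rather than inserting one reading at a time, is the kind of choice that starts to matter once the feed reaches billions of points a day.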

In addition to homes that participated in the past, we have 732 active homes in Texas, California, and New York. That amounts to data from more than 1,000 homes, and we’re still expanding the network.

A few years ago, we reached one petabyte of on-site storage capability, a milestone that pushed us into true big-data territory. We’ve since added close to another half petabyte. Because the database has to accommodate the maximum data feed from our “most measured homes,” it grows by about 92 gigabytes per day. Our electricity data set alone grows by 7 to 10 billion data points a day.
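
For a rough sense of scale, the short Python sketch below projects that daily growth forward. It assumes decimal units (1 GB = 10^9 bytes, 1 PB = 10^15 bytes) and a constant growth rate, neither of which is guaranteed in practice.

```python
GB = 10**9   # decimal gigabyte (assumption)
TB = 10**12  # decimal terabyte (assumption)
PB = 10**15  # decimal petabyte (assumption)

daily_growth_bytes = 92 * GB  # "about 92 gigabytes per day"

# Growth over one year at a constant rate.
yearly_growth_tb = daily_growth_bytes * 365 / TB

# Days for that growth rate to fill half a petabyte of added storage.
days_to_fill_half_pb = 0.5 * PB / daily_growth_bytes

print(f"Yearly growth: {yearly_growth_tb:.2f} TB")
print(f"Days to fill 0.5 PB: {days_to_fill_half_pb:.0f}")
```

At a steady 92 gigabytes per day, the database adds roughly 33.6 terabytes a year.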

For example, in a recent 24-hour period, we received data from 624 homes. Of those, 523 read 32 circuits across the five power dimensions every second; the remaining 101 read 12 circuits across the five dimensions every minute. So, that’s:

  • 523 homes x 86,400 seconds in a day x 32 circuits x 5 electricity readings = 7,229,952,000 data points, plus,
  • 101 homes x 1,440 minutes in a day x 12 circuits x 5 electricity readings = 8,726,400 data points
  • Together, that’s about 7.2 billion data points per day in just our electricity data set!
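
The same arithmetic can be checked in a few lines of Python. The function and variable names here are just for illustration; the home counts, circuit counts, and sampling rates are the ones listed above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400
MINUTES_PER_DAY = 24 * 60       # 1,440
DIMENSIONS = 5                  # electricity readings per circuit sample


def points_per_day(homes, circuits, samples_per_day, dimensions=DIMENSIONS):
    """Data points produced in one day by a group of identical homes."""
    return homes * circuits * samples_per_day * dimensions


per_second_homes = points_per_day(523, 32, SECONDS_PER_DAY)
per_minute_homes = points_per_day(101, 12, MINUTES_PER_DAY)
total = per_second_homes + per_minute_homes

print(f"{per_second_homes:,}")  # 7,229,952,000
print(f"{per_minute_homes:,}")  # 8,726,400
print(f"{total:,}")             # 7,238,678,400
```

The one-second homes account for nearly all of the volume: the 101 one-minute homes contribute only about 0.1 percent of the daily total.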

Over time, this mountain of data becomes more and more useful for AI and ML models that can predict energy consumption and generation, model grid scenarios or identify what is consuming the energy at the plug.