Power Outages in SF Bay Area (Part 1)
I never thought much about power reliability—six years in downtown Chicago, and I can’t recall an outage that impacted my life. But in the Berkeley Hills, nestled in the wealthy, high-tech San Francisco Bay region, we lose power a few afternoons a year. It’s baffling that homeowners in $2M houses tolerate sudden WiFi, fridge, and light failures. During our third power outage in as many months, I wondered: How reliable is my electricity? Is it better a few blocks away?
Averages Don't Tell It All
Power outages don’t follow city or county lines: when the power goes out at my house, we walk 10 minutes to a café and keep working.
I imagined a map showing power reliability across my area—something like this:
Author’s recording of power outages within 10km of Berkeley between November 22, 2024 and December 10, 2024. Dark orange is a lot of time without power; grey is a bit of time without power; no color is sweet 100% reliable electricity.
Surely someone must publish this map. So I went looking for it online...
Currently Available Outage Information
Real-Time Outage Information from PG&E
In my neighborhood, power outages tend to be highly localized. My intuition is backed up by actual data.
PG&E, our flawed utility provider, shares the exact location of houses without power. For example, as I type this, about 1 customer in this area of roughly 15 houses is without power due to an unplanned outage. Their neighbors have power.
The homes in the green-colored area have had their power out for the last 5 hours. I’m sure someone writing a book on PG&E’s organizational culture can write a nice chapter on why this area is colored green.
Historical Outage Information from PG&E
While PG&E publishes this high-resolution real-time outage data, historical performance is only reported at something approaching the county level:
https://www.pge.com/assets/pge/docs/about/pge-systems/CPUC-2023-Annual-Electric-Reliability-Report.pdf
This is cute; it clearly doesn’t tell a good story, but it also doesn’t tell the full story.
They also have a data request program, which allows academic researchers and governments to request data. However, the log of past requests and the description of available data suggest that polygon-level outage information is not on offer.
Outage Information from Poweroutage.us
The private market has come to the rescue, offering outage information for the entire country.
https://poweroutage.us/area/state/california
Their product page boasts: “Need historical data? We've got it! Every data point collected is also archived! Outage data can be retrieved at the utility, state, county, and city levels. Data can also be summarized in any way that meets your needs, raw data, hourly, daily, outage events, etc”.
Inquiring about purchasing polygon-level data reveals that they can’t provide anything more granular than the city/county level.
Outage Information on GitHub
User Simon Willison scrapes PG&E’s outage map and posts a history of it on GitHub!
It seems promising, but a look through the readme shows that Simon’s data doesn’t have any polygon information. Per his comments:
This repository only archives outages that are reported for a single location.
The outage map also includes polygon data, which is much more interesting... but has not proven practical to archive here, for reasons explained in this issue comment. Short version: I'd have to constantly archive 100MB of data per snapshot because the polygons are so large!
It also suggests that people aren’t capturing this information because it’s too much data. Hmmm…
Outage Information from the California Office of Emergency Services (CalEOS)
The State of California essentially re-publishes the outage maps from PG&E and several other large California utilities on ESRI’s ArcGIS Hub.
This is the juicy data I’m after! From a state agency! Surely they’re capturing historical data!
Alas, an email reveals they are not capturing historical data. However, there’s both a “download data” button and an explicitly labeled Public Use license.
Perhaps I can make a contribution here…
Putting the pieces together: scraping CalEOS’s data
Spread across three files, CalEOS shares 11 MB of data (approximately 2 iPod songs of data) updated every 10-20 minutes.
Indeed, this adds up to a lot of data. Storing the raw GeoJSON would take something like 1 GB per day – enough that 2 weeks would fill a Gmail inbox on the free tier.
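(Back of the envelope: 11 MB per snapshot, at 3-6 snapshots an hour, is roughly 0.8-1.6 GB per day, and a free Gmail account tops out at 15 GB of storage.)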
As I designed my data collection and analysis workflow, I had a few goals:
- Start quickly. Time marches forward, and every 10 minutes that passes is 10 minutes of unanalyzable history.
- Run locally. I’m cheap, so running the workload on the Intel NUC computer in my closet is more attractive than paying $48/year for a VPS. If I want it, I can achieve greater reliability by putting an old computer in a friend’s closet.
- Maintain high resolution. Low resolution data already exists, so high resolution data is the main differentiator of this project.
This led to a few decisions:
- Build a dumb scraper. I love SQL databases and eventually want all this data loaded into PostGIS, but I could write my scraping code now and then worry about my analysis and importing code later.
- Be (slightly) reckless with storage. Since I could run it on my old computer with 100 GB of storage, I could simplify the program by not over-optimizing storage consumption. I can strike a balance by only recording the body of files I’ve never seen before and figuring out further deduplication later. This is another benefit of running locally: storage is a key price differentiator in the VPS world.
- Use basic Unix tools. Since the scraper was pretty dumb, I don’t need to worry about setting up and maintaining Python, PostGIS, ZFS, or whatever else on my scraping machine. Curl, cron, shasum, and the file system can do the heavy lifting.
- Use immutable data structures. Nothing should delete or edit data that’s already been written: the scraper records facts, and the eventual analysis will turn those facts into a story.
- Use hashes to manage duplicates. The data from CalEOS only changes every 10 minutes or so, but I want to estimate the timing of each change as precisely as possible, which means downloading more often than every 10 minutes. That could balloon the storage required by downloading tons of repeated data. To manage that, I’d compute a hash for each file, record the payload of the file when it’s new, and record only the hash and time when we’ve seen it before.
Because I wanted to use basic Unix tools, I needed to think a lot about files. For simplicity, the filename does a lot of the heavy lifting, so it’s the first thing to design:
2024-12-02/layer0-2024_12_02T05_20_02_utc-0dff5…136f.json
- 2024-12-02: to simplify human management (and, eventually, file system performance) we create a new folder for each day.
- layer0: CalEOS publishes their data across 3 ArcGIS layers, so record which layer this is.
- 2024_12_02T05_20_02_utc: the UTC timestamp of when we downloaded this file. We use a human-readable date because a growing application will need some human file management. We use UTC so that multiple computers in multiple time zones have a chance of collaborating successfully. We put “utc” in the file name because I don’t want to worry about time zones in the future.
- 0dff5…136f: the hash of the file. Hashes are a way of uniquely identifying a larger piece of information: to a reasonable approximation, when the file changes, the hash of the file changes.
- .json: tells me that the file contains the GeoJSON payload of data, rather than simply being a file describing that we’ve seen this hash before. (Use .info if it does not contain the GeoJSON payload.)
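To make that concrete, here’s a minimal sketch of how the pieces can be assembled in bash. The variable names, and the choice of SHA-256 via shasum, are illustrative rather than gospel:

```bash
tmpfile="$1"                                        # freshly downloaded temp file (illustrative)
layer="layer0"                                      # which of the 3 ArcGIS layers this is
day="$(date -u +%F)"                                # daily folder, e.g. 2024-12-02
stamp="$(date -u +%Y_%m_%dT%H_%M_%S)_utc"           # human-readable UTC download time
hash="$(shasum -a 256 "$tmpfile" | cut -d' ' -f1)"  # any stable hash works; SHA-256 shown here

mkdir -p "$day"
path="$day/${layer}-${stamp}-${hash}.json"          # gets a .info extension instead if duplicate
```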
With the key pieces written down, we can now write our bash function to download a file (a sketch follows the list of steps below):
- Change to the proper directory, creating it if it doesn’t exist
- Use curl to download the file to a temp file
- Record the time the download finished
- Compute the hash of the downloaded file
- Write a file to disk using the naming scheme above. If we’ve never seen the file, record the entire file. If it’s a duplicate, create an empty file to record that the data has not changed.
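Here’s roughly what that function ends up looking like. Treat it as a sketch rather than the exact script: the base directory, the SHA-256 choice, and the error handling are all details to tune to taste.

```bash
#!/usr/bin/env bash
set -u

BASE_DIR="$HOME/caleos-outages"   # where all snapshots live (illustrative path)

# download_layer LAYER URL: fetch one CalEOS layer and record it immutably.
download_layer() {
    local layer="$1" url="$2"

    # 1. Change to the proper directory, creating it if it doesn't exist
    local day; day="$(date -u +%F)"
    mkdir -p "$BASE_DIR/$day"
    cd "$BASE_DIR/$day" || return 1

    # 2. Use curl to download the file to a temp file
    local tmpfile; tmpfile="$(mktemp)"
    if ! curl --silent --fail --output "$tmpfile" "$url"; then
        rm -f "$tmpfile"
        return 1
    fi

    # 3. Record the time the download finished
    local stamp; stamp="$(date -u +%Y_%m_%dT%H_%M_%S)_utc"

    # 4. Compute the hash of the downloaded file (same filename pieces as above)
    local hash; hash="$(shasum -a 256 "$tmpfile" | cut -d' ' -f1)"

    # 5. New hash: keep the whole payload. Seen before: write an empty .info
    #    marker so we still know when we saw this unchanged data.
    if ls "$BASE_DIR"/*/"${layer}"-*-"${hash}".json >/dev/null 2>&1; then
        touch "${layer}-${stamp}-${hash}.info"
        rm -f "$tmpfile"
    else
        mv "$tmpfile" "${layer}-${stamp}-${hash}.json"
    fi
}
```

Because every write is either a brand-new .json or an empty .info marker, the scraper never edits or deletes anything it has already written, which keeps the immutable-data goal intact.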
Then, we loop that across all of the URLs we want to download, tell cron to run it every 2 minutes, and we’re off to the races!
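For completeness, here’s a sketch of that last step. The three URLs are placeholders for the real CalEOS ArcGIS endpoints, and the script path in the crontab line is just wherever the scraper lives:

```bash
# Loop the download function over the three CalEOS layers.
# (Placeholder URLs: substitute the real ArcGIS layer endpoints.)
LAYER0_URL="https://example.invalid/caleos/layer0/query?f=geojson"
LAYER1_URL="https://example.invalid/caleos/layer1/query?f=geojson"
LAYER2_URL="https://example.invalid/caleos/layer2/query?f=geojson"

download_layer layer0 "$LAYER0_URL" || echo "layer0 failed" >&2
download_layer layer1 "$LAYER1_URL" || echo "layer1 failed" >&2
download_layer layer2 "$LAYER2_URL" || echo "layer2 failed" >&2
```

A single crontab entry keeps it running:

```
# crontab -e: run the scraper every 2 minutes (path is illustrative)
*/2 * * * * /home/me/scrape-caleos.sh
```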
Check back for updates about analyzing this data!