Methods for Nine Months of Unplanned PG&E Electricity Outages
This is a lightly-edited draft. Please send any feedback to {my first name}@srcramer.com
More than a year ago, I started to wonder about the reliability of my electricity in the Berkeley Hills: we get power outages a few afternoons per year, but is it really better a few blocks away?
I spent more than a year collecting house-level power outage data from a publicly licensed State of California dataset, and I've now put 9 months of it on a web map. While other data sources stop at the county or regional level, this map shows shape-level information that is often accurate down to the individual power customer. Questions, comments, or collaboration requests? Send me an email at {my first name}@srcramer.com.
This document describes how I collected the data, sources of error, validation, information on the basemap, and a copyright notice.
Data collection and transformation
Step 0: CalOES gathers information
I download data from the public use licensed "Power Outage Incidents" from CalOES GIS Data Manager.
I assume the data reaches CalOES like this:
- each utility has internal power outage detection methods which feed its internal dashboards
- the utility turns this data into something fit for consumption by the outside world
- CalOES has a server which receives or requests this data and periodically (every 15 min, according to the documentation) re-publishes it to their website
Step 1: gather data
Every 2 minutes, a server downloads JSON files from the CalOES map. The map publishes three data layers, of which we download two:
| layer | meaning | action taken |
|---|---|---|
| 0 | Point information about where power outages are, which utility is responsible, how many customers are impacted, and some other utility-specific metadata | Downloaded but not yet analysed |
| 1 | Polygon information about what areas power outages cover, which utility is responsible, how many customers are impacted, and some other utility-specific metadata. PG&E is the only utility to provide polygon information for this layer, though in the future more utilities may | Downloaded and analysed |
| 2 | County-level summary information about the number of customers with and without power, plus the polygon shape of each county | Not downloaded (it's too many bytes for too little meaning) |
The data gathering process downloads the files, computes whether it's seen the payload before, and then saves new payloads with a timestamp. If it's seen a payload before, it simply records that payload's hash and observation time.
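The dedup logic can be sketched as follows. This is a minimal illustration, not the real scraper: the actual storage layout and download code are not shown, and the in-memory set stands in for whatever durable store records hashes.

```python
import hashlib

seen_hashes = set()      # in practice: a durable table of payload hashes
observation_log = []     # in practice: appended to durable storage

def record_payload(payload: bytes, observed_at: str) -> bool:
    """Save the payload if it is new; otherwise just log (time, hash).

    Returns True when the payload had not been seen before.
    """
    digest = hashlib.sha256(payload).hexdigest()
    is_new = digest not in seen_hashes
    if is_new:
        seen_hashes.add(digest)
        # a real pipeline would write the payload to disk here, named by timestamp
    observation_log.append((observed_at, digest))
    return is_new
```

Calling this twice with identical bytes stores the payload once but logs both observation times, which is what makes the later "time between observations" arithmetic possible.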
Step 2: Compress data
A normal day of outages produces at least 1GB of text JSON files. Much of it is repeated: because outages last hours, the same polygon appears across many files.
To reduce long-term storage challenges, I compress these text files using Zstandard at compression level 22, which brings the average day to under 5MB.
The data produced can be really spiky. During December 2025 when SF had massive power outages, the raw text files ran to 12GB and the compressed data to 300MB!
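The compression win comes almost entirely from that repetition. As a rough stdlib illustration (using zlib as a stand-in; the pipeline itself uses Zstandard at level 22, which compresses considerably better), a file made of the same polygon JSON repeated shrinks by orders of magnitude:

```python
import json
import zlib

# A stand-in for one outage polygon's JSON, repeated across many snapshots.
polygon = json.dumps({
    "OutageType": "UNPLANNED",
    "geometry": {"rings": [[[-122.27697, 37.86750], [-122.27710, 37.86762]]]},
})
snapshot = (polygon + "\n") * 10_000   # ~1 MB of highly repetitive text

compressed = zlib.compress(snapshot.encode(), level=9)
ratio = len(compressed) / len(snapshot.encode())
print(f"{len(snapshot):,} bytes -> {len(compressed):,} bytes ({ratio:.2%})")
```

Real payloads are less uniform than this, but the same mechanism is why a 1GB day can land under 5MB.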
Step 3: Ingest data
I download each day's files to my laptop and ingest the data into my local PostGIS server.
Step 4: Select data
For this analysis, I chose data that:
- is in the polygon layer (as of publication, PG&E is the only utility providing polygon information)
- is an unplanned outage (as determined by the utility metadata)
- is an outage which the utility reports impacts two or more customers
- was observed between 2025-07-01 and 2026-03-05.
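Since only the polygon layer is ingested, the first criterion is implicit; the remaining filters can be sketched as follows. The field names here (OutageType, ImpactedCustomers, ObservedDate) are illustrative stand-ins, not the actual CalOES schema keys.

```python
from datetime import date

START, END = date(2025, 7, 1), date(2026, 3, 5)

def keep(outage: dict) -> bool:
    """Filter one polygon-layer record down to the analysis set.

    OutageType, ImpactedCustomers, and ObservedDate are hypothetical
    field names standing in for the utility metadata.
    """
    return (
        outage["OutageType"] == "UNPLANNED"
        and outage["ImpactedCustomers"] >= 2
        and START <= outage["ObservedDate"] <= END
    )
```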
Step 5: Compute polygons with the same outage performance
Power outages have a distinctive spatial structure: failures in the infrastructure tend to create a consistent pattern of outages. For example, when a power line on my street touches a tree, it shuts down power for all of the houses on that line until it's fixed. I could show standardized square or hexagonal bins, which would give a sense of those underlying structures, but I wanted to be able to see them directly.
To compute those polygons for California, I first divide the state into tiles of 0.1° latitude by 0.1° longitude. Then, for each tile, I:
- clip the geometry at the tile boundary
- compute all of the polygon boundaries as lines using the PostGIS functions ST_ExteriorRing(ST_Dump(ST_Intersection(geo)).geom)
- find all of the intersections using ST_Union(geo)
- rebuild as atomic non-overlapping polygons using ST_Dump(ST_Polygonize(geo))
- delete small polygons (small polygons can be generated when, e.g., the utility slightly changes the boundaries of which transformers affect which areas)
Step 6: Turn the outage observations into hours
At its core, we have point-in-time outage information, but we want duration information. As an example:
| time | was the power on? |
|---|---|
| 10:02 AM | yes |
| 10:18 AM | no |
| 10:28 AM | no |
| 10:42 AM | yes |
We chose to interpret this as: "the power was out from 10:18 AM until 10:42 AM". In reality we know this is wrong -- even assuming the utility reports the information accurately, the power could have gone out anytime from 10:02 to 10:18, and come back on anytime from 10:28 to 10:42. This interpretation means the variability in the start and end times should cancel out, giving us a view that's at least internally consistent and hopefully right on average.
As a practical matter, we:
- grab all of the observation times (e.g. 10:18 AM)
- compute the "next" observation time using the SQL window function lead() (so that the 10:18 AM row has a next observation time of 10:28 AM)
- compute the time between the two observations (10 min), capped at 1 hr
We cap the time between the observations at 1hr because the data collecting apparatus is prone to small failures and we don't want missing data to unfairly label an area as having bad power.
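In Python terms, the lead-and-cap arithmetic looks like this. This is a sketch of the SQL logic, not the actual query; the real computation runs in PostGIS with lead() over a window ordered by observation time.

```python
from datetime import datetime, timedelta

CAP = timedelta(hours=1)

def outage_minutes(observations: list[tuple[str, bool]]) -> float:
    """Sum the capped gaps following each 'power off' observation.

    observations: (time 'HH:MM', power_on) pairs, already sorted by time.
    """
    times = [datetime.strptime(t, "%H:%M") for t, _ in observations]
    total = timedelta()
    for i, (_, power_on) in enumerate(observations[:-1]):
        if not power_on:                   # power was out at this observation
            gap = times[i + 1] - times[i]  # lead(): time to the next observation
            total += min(gap, CAP)         # cap missing-data gaps at 1 hr
    return total.total_seconds() / 60

# The example from the table: out at 10:18 and 10:28, back by 10:42.
print(outage_minutes([("10:02", True), ("10:18", False),
                      ("10:28", False), ("10:42", True)]))  # → 24.0
```

The 24 minutes is exactly the span from 10:18 AM to 10:42 AM, matching the chosen interpretation.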
We then apply this back to all of the polygons:
- compute an arbitrary point that's within that polygon using ST_PointOnSurface(geo)
- select the power outage observations that apply to that point using ST_Within (e.g. the 10:18 AM and 10:28 AM observations)
- sum up all of the capped times between the observations (e.g. 10 min + 14 min)
- apply that value computed at a point to the polygon, which we know has the same outage information
Step 7: Simplify the shapes and put it on a map
To reduce the amount of data, I aggressively simplify the polygons. I found this sequence of PostGIS functions does a nice job of removing vertices that do not affect the polygon's shape (ST_Buffer(..., 'join=mitre')), separating polygons that are not actually connected (ST_Dump()), and keeping house-level detail (ST_SnapToGrid()).
(ST_Dump(
  ST_SimplifyPreserveTopology(
    ST_SnapToGrid(
      ST_Buffer(
        ST_Buffer(geo, 0.00001, 'join=mitre'),
        -0.00001, 'join=mitre'),
      0.00001),
    0.00001))).geom
Then, for every tile, Python generates a GeoJSON file that contains each polygon and how long its power was out, as computed above. To make the map performant and hostable on Cloudflare Pages, Tippecanoe stitches those GeoJSON files together to generate PMTiles vector map tiles for the various zoom levels. I then display them with OpenLayers, which renders smoothly while panning and zooming in Chrome and Safari.
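Each per-tile file is ordinary GeoJSON. A minimal sketch of one feature is below; the property name outage_minutes is my illustration, not necessarily the key the real pipeline emits.

```python
import json

feature = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        # one small polygon; GeoJSON coordinates are [longitude, latitude]
        "coordinates": [[[-122.277, 37.867], [-122.276, 37.867],
                         [-122.276, 37.868], [-122.277, 37.867]]],
    },
    "properties": {"outage_minutes": 117},
}
collection = {"type": "FeatureCollection", "features": [feature]}
geojson = json.dumps(collection)
```

Tippecanoe then consumes one such FeatureCollection per tile and bakes the property into the vector tiles, so the map can color each polygon by its outage duration.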
Sources of error
CalOES data does not match what's happening on the ground
For example, we see that parts of the Yosemite Wilderness had 193 hours of power outages. This is absurd: there is no electric service in the Yosemite Wilderness. We see a similar phenomenon around Lake Merritt, Tilden Park and assuredly more.
We also see inaccurate polygons in urban areas, like a polygon for a planned outage that covers 6 houses but where the structured information says only 1 customer is impacted. We excluded outages with only 1 impacted customer before we saw this example, but it stands to reason that there are more like it hiding in the data we show on our map.
Short outages are badly represented
CalOES publishes a point-in-time snapshot every 15(-ish) minutes. So if a property had a 5-minute outage, it's unlikely to be included at all. And if it is, it's more likely to be recorded as roughly 15 minutes of outage.
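A quick way to see the size of this effect: with idealized snapshots exactly every 15 minutes (which the real feed doesn't quite deliver), a 5-minute outage starting at a uniformly random time is caught only about a third of the time. A small deterministic sweep:

```python
SAMPLE_EVERY = 15.0   # minutes between snapshots (idealized)
OUTAGE_LEN = 5.0      # minutes the power is actually out

detected = 0
trials = 1500
for i in range(trials):
    start = i * SAMPLE_EVERY / trials   # outage start within one sampling period
    end = start + OUTAGE_LEN
    # detected iff some snapshot time k*15 falls inside [start, end)
    if any(start <= k * SAMPLE_EVERY < end for k in range(3)):
        detected += 1

print(detected / trials)   # ≈ 1/3
```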
CalOES does not update
On average I downloaded a unique version of data from CalOES every 12.3 minutes.
The longest intervals between datasets are:
- Jan 23, 2026 from 12:52 AM Pacific for the next 8 hours
- December 15, 2025 from 2:57 AM Pacific for the next 4 hours
- July 13, 2025 from 6:04 AM Pacific for the next 2 hours
From what I can tell, these are caused by errors in the pipeline on CalOES's end where they are simply not updating their data for a while.
Downloaded invalid JSON
I occasionally download JSON that was truncated and as a result invalid. For example:
| time (UTC) | hash | size | last few chars of file | valid |
|---|---|---|---|---|
| 04:16:02 | 3352a | 37 kb | [-122.276971210187, | ❌ |
| 04:18:03 | 0da49 | 102 kb | 37.3675048846639], | ❌ |
| 04:20:01 | 02dc8 | 195 kb | "IncidentId":"2748472"}}]} | ✅ |
| 04:22:02 | 02dc8 | 195 kb | "IncidentId":"2748472"}}]} | ✅ |
For now I simply ignore the invalid JSON observations. As a result, I wouldn't record the observation until 4:20, even though it could be stale data from 4:16 or earlier. To me, this demonstrates the value of over-scraping this point-in-time data: I get high-by-default temporal resolution, and when something goes wrong the pipeline recovers a bit faster.
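Detecting the truncation is straightforward: attempt to parse and reject on failure. A minimal sketch (the real pipeline also records the hash and size of the rejected file):

```python
import json

def is_valid_payload(raw: bytes) -> bool:
    """True when the download parses as JSON; truncated files fail."""
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid_payload(b'{"features": [{"IncidentId": "2748472"}]}'))  # True
print(is_valid_payload(b'{"features": [[-122.276971210187,'))          # False
```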
My data collection infrastructure fails
During the time period published on this map, I had no infrastructure failures. Because I'm recording point-in-time observations, there is no way to back-compute data that was not observed.
Outside of this period, I had a few scraping outages, most often servers running out of storage.
Outages caused by privately owned infrastructure
We have no visibility into events like a house blowing a fuse. Or an apartment building having a problem with its internal power distribution. Similarly, we have no visibility into recovery options like someone running a generator or using their in-home solar battery backup.
Validation
Having data pulled from the internet is good, but having it match ground truth is even better. Long story short: the data matched the other sources I found. If you have PG&E, I'd be really excited if you could take 5 minutes and help me validate more, particularly if you live outside the core Oakland/Berkeley/SF area.
Validating one outage at my house
My house had an outage on December 29, 2025. I found two sources of outage history for my house: PG&E power outage alert text messages and an export of hourly electricity usage. Both tell the same story as the CalOES data.
PG&E text messages
I've signed up on PG&E's website to receive texts about power outages at my house. I have no memory of the alert texts being inaccurate, so they probably match reality. Indeed, they sent me three messages that day:
- 1:33 AM - PG&E: Investigating potential outage near {your address}. More info to follow soon. Manage alerts: pge.com/myalerts
- 1:48 AM - PG&E: Outage near {your address}. We plan to have power on by Dec 29 08:00AM.
- 3:32 AM - PG&E: Power has been restored near {your address}. This was an unplanned outage. Still no power? Go to {our website}
These messages imply that the power was out for about 2 hours while I was asleep.
PG&E hourly power usage
On PG&E's website you can download your electricity usage broken down by either hour or (for some customers) 15-minute period. Interpreting the hourly electricity usage is a bit finicky: they don't always say "your power was out so your bill is zero". Instead, when the power was out, the export leaves a note saying "This data was estimated", often accompanied by lower usage for the period. I saw:
| Start Time | End Time | kWh | Note |
|---|---|---|---|
| 00:00 | 00:59 | 0.35 | |
| 01:00 | 01:59 | 0.16 | * This data was estimated |
| 02:00 | 02:59 | 0.15 | * This data was estimated |
| 03:00 | 03:59 | 0.15 | * This data was estimated |
| 04:00 | 04:59 | 0.29 | |
Worth noting: there must be other reasons that data can be estimated. One friend had their data estimated for a month, but I did not hear about any month-long power outage, so this needs to be taken with a grain of salt. Regardless, multiple consecutive records of estimated data tend to match up with a power outage, so it's better than nothing.
In an ideal world a person would write a statistical model to notice that a bunch of power use was missing and use that as confirmatory evidence. But I've found my house's hourly consumption to be highly variable, so I'm not optimistic that it would tell me more than this quick eyeballing.
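The quick eyeballing amounts to a simple heuristic: flag a run of two or more consecutive "estimated" hours. A sketch (the two-hour threshold is my choice, not anything PG&E documents):

```python
def looks_like_outage(rows: list[tuple[str, float, bool]], min_run: int = 2) -> bool:
    """rows: (start_hour, kWh, was_estimated) tuples in time order.

    True when at least min_run consecutive hours are flagged as estimated.
    """
    run = 0
    for _, _, estimated in rows:
        run = run + 1 if estimated else 0
        if run >= min_run:
            return True
    return False
```

On the December 29 table above this fires (three consecutive estimated hours); on the November 11 table it doesn't, because the two estimated hours aren't adjacent.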
Data downloaded from CalOES
And in the underlying power outage data I see:
- 1:40 AM - no outage
- 1:50 AM - outage
- [12 omitted observations] - outage
- 3:32 AM - outage
- 3:47 AM - no outage
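Running the Step 6 arithmetic over those observations: the outage observations run from 1:50 AM until the first "power back" observation at 3:47 AM, and since no gap between observations exceeds the 1-hour cap, the capped gaps telescope to the full span.

```python
from datetime import datetime

out_since = datetime.strptime("01:50", "%H:%M")  # first observation showing an outage
back_at = datetime.strptime("03:47", "%H:%M")    # first observation showing power restored
minutes = (back_at - out_since).total_seconds() / 60
print(int(minutes) // 60, "hr", int(minutes) % 60, "min")  # 1 hr 57 min
```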
And on the map
On the map we report 1 hr 57 min of outage for that day, which matches the PG&E text messages and is also consistent with the hourly usage info (no power in the 1AM, 2AM, and 3AM hours). Always good to know that our data transformations are working.
Validating months of outages at my house
So I can make this little table:
| Date | SMS | Hourly Data | CalOES |
|---|---|---|---|
| 2/11/25 | outage | outage | outage |
| 4/28/25 | outage | outage | outage |
| 10/19/25 | outage | outage | outage |
| 11/11/25 | no | no? | no |
| 12/29/25 | outage | outage | outage |
The only interesting example is 11/11/25. There, I had two non-contiguous hours where the data was estimated. Maybe there were two small power blips? (I wouldn't expect to be able to identify those if they did happen.) However, I'm inclined to believe there was no extended outage, if for no other reason than the 5AM hour in the middle has lower usage than either of the estimated 4AM and 6AM hours.
| Start Time | End Time | kWh | Note |
|---|---|---|---|
| 03:00 | 03:59 | 1.10 | |
| 04:00 | 04:59 | 0.71 | * This data was estimated |
| 05:00 | 05:59 | 0.32 | |
| 06:00 | 06:59 | 1.10 | * This data was estimated |
| 07:00 | 07:59 | 1.88 | |
Validating with exported power usage
I asked four friends in Berkeley, Oakland, and San Francisco to share their hourly power usage from their PG&E portal. Their data exports mostly matched up with the CalOES data and what appears on the map!
My friend in Berkeley lives a few blocks away. They confirmed that my house indeed has far more power outages than theirs.
Showing the outline of PG&E's service area
It was surprisingly challenging to identify where PG&E does and does not distribute electricity to end users. I could easily find this PDF map but struggled to find a representation of it that was suitable to import into PostGIS.
Instead I found this ArcGIS map which shows the outlines of a few utilities (including PG&E) from 2021-ish, mashed that together with an outline of California, and went about assigning approximate labels to the areas that were in the state but not covered by the major utilities listed.
In hindsight, I think I could have used this map of overlapping electricity distribution networks. I dismissed it at first glance because it seemed obviously wrong (e.g. it appeared to say the City and County of San Francisco, rather than PG&E, provides electricity in SF). However, I missed that the polygons overlap, and the overlaps can be interpreted usefully.
I gained a greater appreciation for why it's hard to say which utilities cover which geographic areas: it's because large customers make special arrangements with different utilities, sometimes called releases. For example, most of SF municipal Hetch Hetchy Power's electricity is distributed via the PG&E grid (so the outage would appear in the PG&E data). However, it provides power directly to large projects like SFO airport, Salesforce Transit Center, 60% of SF streetlights, and a few SF neighborhoods (SFPUC pg 13). I generally stuck with the boundaries listed in the official maps, labeling SFO and Treasure Island as part of Hetch Hetchy Power, but leaving streetlights and Salesforce Transit Center as visually part of the PG&E area.
PG&E outages outside PG&E territory
The difficulty mapping PG&E's territory means that there are some places where PG&E shows an outage but my map data says another utility services the area.
For example, according to all the polygons I've been able to find online, PG&E does not service the Merced city center and the Merced Irrigation District does. However, there are clearly power outages reported near there, and the underlying data (which contains a "company" field) confirms that the company is PG&E. The town's website says that both utilities service the area, and Redditors suggest they have a choice between utilities at a given service address.
The interconnected nature of the grid means that PG&E might assign outage polygons to areas where they handle some middle part of the distribution. For example, the City of Alameda (which has its own municipal power distribution company) had an outage caused by a fire at a PG&E substation which supplies power to the municipal utility (source).
I opted to let the PG&E boundaries generally reflect the official maps, but still show the power outages under them. If you have any ground-level information about where people in these conflicting areas get their power, please do reach out so I can correct my visualization.
Optimizing the basemap
I opted to use vector basemaps because they make it easy to put the basemap on the bottom, the outage data in the middle, and the labels on top (thanks to the friend who suggested it).
I tried a few mostly-monochrome basemaps, but chose a colored one so that the user can clearly see green nature areas and blue water (which rarely have power coverage and as a result don't have power outages). I started with the Bright style from OpenFreeMap.org and then edited the stylesheet to remove the yellow-ish colors that could be confused for a power outage (e.g. yellow roads, sand) and a bunch of lines (e.g. county boundaries, ferry routes) that are unlikely to help the user orient to power outages.
Future work could include re-computing these map tiles so that they only include the elements I render, further reducing lag when panning/zooming the map.
Copyright information
Basemap data is © OpenStreetMap contributors.
Power outage information as presented is © by the author. All rights reserved.
Service area polygons as presented are © by the author.
