PubPub Infrastructure: A Technical Breakdown
This post is a companion to our infrastructure transition announcement. It goes into the technical details of the cost reductions we’ve made.
The short version
We’re still waiting for final bills to settle, but our current analysis puts the new infrastructure at less than 1% of the cost of our previous setup. Every single month of old server costs buys us over 100 months of service in our new configuration. Over 8 years!
To be clear, these numbers are comparing direct server and infrastructure costs for PubPub. The larger cost is paying people to maintain everything, keep it updated and healthy, and be available to the community for support, which is why the Sustainability Fund matters so much. But the sustainability of our technical infrastructure is in better shape than ever before.
These changes were made possible by a combination of better database tooling, LLM-assisted code analysis, and generous nonprofit programs from companies like Fastly, Cloudflare, and Sentry. Together, these made it possible to do in a few months what would have otherwise taken years (or been too risky to attempt) just a year ago. We got lucky with timing: these tools became available right when we most needed to cut costs, and our reduced team size meant we couldn’t afford a slow, cautious approach.
Some of these savings look like things we should have done sooner, but many genuinely were not feasible until recently. The tooling wasn’t there, the risk was too high, or both.
Here’s the full picture:
| Category | Cost Reduction | Notes |
|---|---|---|
| Server hosting | 98.6% | Database optimizations, Docker-based deployments on generic cloud servers |
| Full-text search | 100% | Migrated off SaaS into app codebase |
| Community Impact analytics | 100% | Migrated off SaaS into app codebase |
| PDF generation | 100% | Migrated off SaaS into app codebase |
| Analytics infrastructure (Redshift, Stitch, Metabase) | 100% | Migrated off SaaS into app codebase |
| CDN and asset delivery | 100% | Provided by Fastly’s nonprofit program |
| Image optimization and resizing | 100% | Moved from AWS CloudFormation stack to Fastly |
| DNS, security, and caching | 100% | Provided by Cloudflare’s nonprofit program |
| Custom-domain certificate provisioning and management | 100% | Moved off Heroku via Cloudflare and Fastly |
| Bug tracking | 100% | Provided by Sentry’s nonprofit program |
| Real-time collaboration storage (Firebase) | 100% | Cold-storage migration to PostgreSQL heavily reduced usage |
Many of these changes were made possible by a cascade of other changes on the list. For example, more affordable, containerized servers meant we had more capacity to move certain tooling into the codebase, which meant certain dependencies on other infrastructure were released, and so on. Few, if any, of the changes on the list above could’ve been done on their own.
Now, the details.
Server costs
The biggest single change. Identifying expensive database queries let us fine-tune our PostgreSQL setup in ways we couldn’t before. Previously, we relied on a much more expensive and complex (and therefore fragile) setup of read replicas, huge memory allotments, and complicated hosted services to handle the scale and spikes of activity on PubPub. Better indexes and more efficient queries reduced the compute cost per page view, but they also eliminated some huge bottlenecks. The savings cascaded: less memory pressure on the database meant faster responses, which meant less time the servers spent waiting on queries, which meant we didn’t need as many replicated server instances to handle the same traffic. We didn’t need as many beefy machines, because they didn’t need as much memory and they weren’t spending so long waiting around.
This was made possible by modern tooling that let us go deep on database diagnostics and address nearly a decade of technical debt. LLM-assisted analysis of performance metrics helped us find and refactor expensive queries that had been accumulating since PubPub’s early days. These optimizations would not have been feasible even a year ago.
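To make that concrete, here’s the kind of diagnostic this work starts from. This is a minimal sketch rather than our actual tooling, and it assumes the `pg_stat_statements` extension is enabled: it surfaces the queries consuming the most cumulative execution time, which is where indexing and refactoring effort pays off first.

```ts
import { Client } from "pg";

// Illustrative sketch (not PubPub's actual tooling): list the queries with
// the highest cumulative execution time — the best place to start looking
// for missing indexes and refactor candidates. Assumes the
// pg_stat_statements extension is enabled on the database.
async function topQueries(connectionString: string) {
  const client = new Client({ connectionString });
  await client.connect();
  try {
    const { rows } = await client.query(`
      SELECT query,
             calls,
             round(total_exec_time) AS total_ms, -- cumulative across all calls
             round(mean_exec_time)  AS mean_ms   -- average per call
      FROM pg_stat_statements
      ORDER BY total_exec_time DESC
      LIMIT 20;
    `);
    return rows;
  } finally {
    await client.end();
  }
}

topQueries(process.env.DATABASE_URL!).then((rows) => {
  for (const { mean_ms, calls, query } of rows) {
    console.log(`${mean_ms}ms avg × ${calls} calls: ${query}`);
  }
});
```

From there, running `EXPLAIN ANALYZE` on the worst offenders usually points at a missing index or a query pattern worth refactoring.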
Bringing services in-house
A huge number of services we relied on externally (search, analytics, PDF generation) have been folded into the main PubPub codebase. Previously, these were spread across four separate services and roughly eight different AWS tools. Bringing them in-house is a win on three fronts: it cuts costs, it reduces the technical complexity of our stack, and it makes self-hosting PubPub dramatically simpler since there are fewer external dependencies to configure and maintain.
Modern code-assist tools were critical here. For analytics, they helped us navigate lots of legacy data and schemas, align them to a single schema, validate at huge analytics-scale volume, and proceed with confidence. For search, they helped us optimize in-database search vectors stored directly in PostgreSQL, replacing an external search service entirely. Testing these optimizations and identifying slow query patterns needed to happen at a scale that made the effort too expensive to attempt in the past; with our reduced team size, a migration this clean would not have been possible even a year ago.
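For flavor, here’s the general shape of in-database search we’re describing, sketched against a hypothetical schema (these table and column names are illustrative, not PubPub’s): a generated `tsvector` column that PostgreSQL keeps in sync itself, a GIN index over it, and a ranked query against it.

```ts
import { Client } from "pg";

// Illustrative only: table and column names are hypothetical, not PubPub's
// actual schema. Shows the general pattern of replacing an external search
// service with search vectors stored directly in PostgreSQL.
async function setupSearch(client: Client) {
  // A generated column means PostgreSQL keeps the vector in sync on every
  // insert/update — no external indexing pipeline to run or pay for.
  await client.query(`
    ALTER TABLE pubs
      ADD COLUMN IF NOT EXISTS search_vector tsvector
      GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(body_text, '')), 'B')
      ) STORED;
  `);
  await client.query(`
    CREATE INDEX IF NOT EXISTS pubs_search_idx
      ON pubs USING GIN (search_vector);
  `);
}

async function search(client: Client, term: string) {
  const { rows } = await client.query(
    `SELECT id, title, ts_rank(search_vector, query) AS rank
       FROM pubs, websearch_to_tsquery('english', $1) AS query
      WHERE search_vector @@ query
      ORDER BY rank DESC
      LIMIT 10`,
    [term]
  );
  return rows;
}
```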
Nonprofit partner programs
A big part of our stack was also helped by organizations that have been generous in support of our mission and nonprofit status. Monthly charges we were paying to Fastly (CDN and caching), Cloudflare (DNS, security, and caching), and Sentry (bug tracking) have been reduced to zero, thanks to the free premium accounts all three offer our nonprofit.
Fastly and Cloudflare in particular unlocked major architectural changes:
- Custom-domain management was moved off of Heroku, which made it possible to move to much cheaper Docker-based server deployments.
- Our asset CDN and dynamic image resizer were moved off of AWS, where they had been a CloudFormation stack of Lambda, CloudFront, EC2, and more. All replaced by Fastly’s infrastructure.
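As a sketch of what that second change looks like in practice (with an illustrative hostname, not our real one): Fastly’s Image Optimizer expresses transforms as URL query parameters, so a resize that once required a Lambda-based stack becomes just a URL.

```ts
// Sketch of dynamic resizing via Fastly's Image Optimizer query-parameter
// API. The hostname is illustrative; width, height, quality, and auto=webp
// are standard Image Optimizer parameters.
function resizedImageUrl(
  path: string,
  opts: { width?: number; height?: number; quality?: number } = {}
): string {
  const url = new URL(path, "https://assets.example.org");
  if (opts.width) url.searchParams.set("width", String(opts.width));
  if (opts.height) url.searchParams.set("height", String(opts.height));
  if (opts.quality) url.searchParams.set("quality", String(opts.quality));
  // Let Fastly negotiate WebP for browsers that accept it.
  url.searchParams.set("auto", "webp");
  return url.toString();
}

// => https://assets.example.org/uploads/cover.jpg?width=800&quality=80&auto=webp
resizedImageUrl("/uploads/cover.jpg", { width: 800, quality: 80 });
```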
Analytics infrastructure
Moving analytics in-house let us decommission Redshift, Stitch, and Metabase hosted servers on AWS entirely.
Firebase and real-time collaboration
We refactored our Firebase storage (which provides real-time collaboration on Pubs) by taking a cold-storage approach to how changes and checkpoints are stored. This was the change we announced in January: we no longer store eternal step-by-step edit history for all Pubs. Our data showed fewer than 0.1% of users ever used that feature, and it was enormously expensive.
We reduced active storage on Firebase to 0.03% of what it had been (three hundredths of a percent) while moving the rest to cold storage in our PostgreSQL database. A large part of this was content in Pubs that were published years ago, probably never going to be edited again, but sitting in expensive real-time storage.
Quick note for users: ‘Cold’ storage isn’t actually that cold. There is no change in user experience whether your pub draft is in cold storage or not. If someone returns to a draft that is in cold storage, we simply re-hydrate the Firebase database on the fly from our backend. At worst, it adds ~50ms to the first page load after returning months later.
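For the curious, the rehydration path looks roughly like this. Everything here is a hypothetical sketch — the function, path, and table names are illustrative, not PubPub’s actual code:

```ts
import * as admin from "firebase-admin";
import { Client } from "pg";

// Hypothetical sketch of the cold-storage rehydration path; names are
// illustrative, not PubPub's actual code. Runs before the editor loads.
// Assumes admin.initializeApp() has been called with database credentials.
async function ensureDraftIsHot(pubId: string, pg: Client): Promise<void> {
  const ref = admin.database().ref(`drafts/${pubId}`);
  const snapshot = await ref.once("value");
  if (snapshot.exists()) return; // draft is already "hot" in Firebase

  // The draft was archived: pull its document state out of PostgreSQL
  // cold storage and write it back into Firebase on the fly.
  const { rows } = await pg.query(
    "SELECT content FROM draft_cold_storage WHERE pub_id = $1",
    [pubId]
  );
  if (rows.length === 0) return; // brand-new draft, nothing to restore
  await ref.set(rows[0].content); // the ~50ms of extra work on first load
}
```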
Firebase is a great example of why this work couldn’t have happened sooner. We had already compressed data stored there to keep costs “low.” That earlier compression made careful parsing of edge cases difficult, and the risk of a migration wasn’t worth the savings. What changed was having an LLM that could parse through thousands of Pubs, make sense on the fly of data structures that are basically unreadable by humans, validate the migration, catch edge cases, and find bug patterns. That gave us the confidence to go ahead. We’ve done migrations like this before, and historically they took 4 to 6 months and left a long trail of edge cases to catch because of the sheer volume of content PubPub has accumulated over nearly a decade. This time, it was fast and clean.
What this means for users
These migrations are live and should have no impact on the user experience other than better and more stable performance. The one visible change is the reduced pub history time-travel tool, as described in our January update.
We still see some cost-saving opportunities in smarter management of the user-uploaded assets we maintain, and we hope to lower costs even further, but we’re getting into diminishing-returns territory at this point.
As we work through a changing landscape, we’ve been forced to adapt how we operate, but we’ve put in place the technical foundation that will allow PubPub to persist for a very long time.
It’s worth saying: any one of these cost reductions would have taken months or been impossible in the past. The fact that the team was able to do all of them is part of the trick. It wasn’t any single thing. It was all the things that needed doing, done in short order. And just like reduced query memory pressure opened up headroom for higher throughput on the database, a similar thing happened across the whole stack. As we simplified one piece, more optimization opportunities cascaded out of it. Bringing services in-house simplified our deployment, which made it easier to containerize, which made it cheaper to host, which made it easier to reason about performance, which surfaced more savings. It wasn’t about getting one thing done. It was about being able to do all of them together.
This was actually super fun…
This was a blast. It felt like spring cleaning. We finally got to go after technical debt we’d been wanting to address for years but couldn’t justify the risk or the time. There’s no better combination of work than something that is mission-aligned, good for the organization’s sustainability, and technically exciting.
We’d love to do this kind of work more. If you’re sitting on aging infrastructure and wondering what a similar cleanup could look like, we’d enjoy that conversation. Reach out.