When the Biden Administration announced its federal student loan forgiveness program—an initiative that cancels up to $20,000 of student loan debt for millions of people—borrowers flocked to the StudentAid.gov website to check their eligibility. But just hours after the August announcement, many people encountered issues logging into the site or couldn’t access it at all.
Some were sent to a virtual waiting room that explained the site was experiencing high volumes of visitors. Others who made it through the waiting room were met with another message: “A lot of people are interested in our website. As a result, some pages may take longer to display than usual.”
While outages like this can frustrate users, it’s important to remember that there are civil servants who work tirelessly to keep public services running. Outages are just as frustrating for the operations staff who want to help. At Nava, we know this to be the case from our experience rebuilding another site affected by a spike in traffic: HealthCare.gov.
In 2013, the Affordable Care Act website launched new functionality that allowed users to apply for Medicaid or CHIP, or to purchase individual private health insurance, generally at reduced rates. HealthCare.gov experienced a similar user overload, and on its first day, only six people were able to sign up for health insurance.
Members of the team that helped re-launch HealthCare.gov after its rocky start went on to found Nava. One of their key contributions was to help prepare the Department of Health and Human Services (HHS), the agency that oversees HealthCare.gov, to adapt to abrupt changes like policy shifts or a spike in site traffic. This process revealed the importance of pre-planning to develop a strong, scalable site architecture that meets users’ needs and can withstand unexpected change.
Of course, building simple, effective, and accessible government services requires more than being able to handle spikes in traffic. As the Department of Education and other government agencies look forward, they have an opportunity to keep pace and even lead the way as policy and technology change. By building digital services in a modular, human-centered way, they can create equitable government services.
For this post, we'll focus on a core component of building any successful digital service: secure, reliable infrastructure that allows government agencies to adapt to shocks to the system, such as a historic student loan cancellation announcement. Conducting user load tests and gathering key metrics via pilots before a site’s launch can help prevent crashes like those of HealthCare.gov, and may offer insight into how the StudentAid.gov crash might have been prevented.
Testing a site’s limits helps prepare for a successful launch
When launching a new website or feature, it’s crucial to ask “How many people will this serve?” Whether the answer is 10 or a billion, estimating your site’s user load is the first step in conducting user-centered research that will help your site launch smoothly.
Once you’ve landed on a number, test whether your site can easily handle the user load you foresee. This is how we approached testing our Scalable Login System (SLS), which provides authentication and account management for millions of people on HealthCare.gov. SLS is a RESTful API service built on Amazon Web Services. As an added experiment, we wanted to see how far the system could scale by running a load test of 1 billion users, 50 times the 20 million accounts that SLS currently handles.
At a throughput of 7,754 transactions per second over the course of an hour, the 90th-percentile response time was 128 ms, and there were zero errors. In other words, our test confirmed that currently available open source software and a solid cloud infrastructure can succeed under intense user loads.
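To make those three numbers concrete, here is a minimal sketch of how throughput, 90th-percentile response time, and error rate might be computed from raw load-test samples. It is not the tooling used for the SLS test; the `Sample` type, the `summarize` function, and the nearest-rank percentile method are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    """One request observed during a load test (hypothetical record shape)."""
    latency_ms: float
    ok: bool


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value that has at least
    pct percent of all values at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(samples, duration_s):
    """Reduce raw samples to the headline load-test metrics."""
    if not samples or duration_s <= 0:
        raise ValueError("need samples and a positive duration")
    latencies = [s.latency_ms for s in samples]
    errors = sum(1 for s in samples if not s.ok)
    return {
        "throughput_tps": len(samples) / duration_s,
        "p90_ms": percentile(latencies, 90),
        "error_rate": errors / len(samples),
    }
```

Reporting a percentile rather than an average matters because a small share of very slow requests can hide behind a healthy mean. For a sense of scale, 7,754 transactions per second held for an hour works out to roughly 27.9 million transactions.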
Rolling out pilots and collecting metrics are the keys to success
It’s not enough to passively conduct a user load test—while performing your test, it’s important to gather goal-oriented metrics that reveal whether your service will function for the people who use it.
Last year, Nava partnered with California’s Employment Development Department (EDD) to rebuild a web application for people to confirm their status for unemployment benefits during the pandemic. The state’s goal was to confirm claimants’ eligibility and pay out unemployment benefits as quickly as possible. With this goal in mind, we tracked login and completion rates to measure the web application’s efficacy. We found that 93 percent of people who logged in were able to complete the multi-page unemployment certification, confirming that the application worked for end-users.
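As an illustration of a goal-oriented metric like the one above, the following sketch computes a completion rate from two lists of user identifiers. The function name and the input shape are assumptions; the source does not describe how the EDD metrics were actually calculated.

```python
def completion_rate(logged_in_ids, completed_ids):
    """Share of users who logged in and also finished the form.

    Both arguments are iterables of user identifiers (a hypothetical
    input shape). Completions by users with no recorded login are
    ignored, so the result always falls between 0.0 and 1.0.
    """
    logged_in = set(logged_in_ids)
    if not logged_in:
        return 0.0
    completed = logged_in & set(completed_ids)
    return len(completed) / len(logged_in)
```

Using sets means a user who logs in several times is counted once, which keeps repeat visits from deflating the rate.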
But sometimes it isn’t so easy. Sometimes your program runs into bugs or unexpected snags. In order to prevent catastrophe in the event of a hiccup, it’s important to roll your program out in small bites, or pilots.
In California, we rolled out a soft launch of our retroactive certification form. On the first day, we emailed a link to 10,000 people, fewer than 1 percent of total claimants. On the second day, we sent the link to 100,000 people. Every day we monitored Google Analytics to determine whether the form was performing as expected. Within a few hours on the first day, our metrics revealed that some users were not able to log in. Our team rapidly diagnosed and fixed the issue, all before the form was officially launched.
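A staged rollout like this can be sketched as two small pieces: a plan that splits recipients into progressively larger batches, and a gate that decides whether to send the next batch. This is an illustrative outline only. The function names and the 95 percent login success threshold are hypothetical, not details of the California rollout.

```python
def plan_batches(recipients, batch_sizes):
    """Split recipients into consecutive batches of the given sizes.

    Anyone left over after the listed sizes goes into one final batch,
    which stands in for the full launch.
    """
    batches = []
    start = 0
    for size in batch_sizes:
        if start >= len(recipients):
            break
        batches.append(recipients[start:start + size])
        start += size
    if start < len(recipients):
        batches.append(recipients[start:])
    return batches


def should_continue(login_attempts, login_successes, min_success_rate=0.95):
    """Gate the next batch on the login success rate of the current one.

    The 0.95 default is a hypothetical threshold for illustration.
    With no data yet, hold the rollout rather than assume success.
    """
    if login_attempts == 0:
        return False
    return login_successes / login_attempts >= min_success_rate
```

With batch sizes of 10,000 and 100,000, the first two batches mirror the first two days described above, and the remainder waits until the gate passes.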
These types of precautions don’t require large teams or billions of dollars—in fact, catching errors before a government program launches can save taxpayers money. The old HealthCare.gov login system, for example, cost $250 million to launch and would have cost another $70 million to stay online. The new SLS cost $4 million to launch and costs less than $1 million per year to stay online.
In the case of government programs that serve millions of people, the importance of planning ahead—and iterating along the way—cannot be overstated. Rolling programs out in small bites, collecting essential metrics, and testing a site’s user load are small steps that can help prevent big issues like the StudentAid.gov crash. Most importantly, developing secure, reliable, and scalable infrastructure helps agencies prepare for unexpected events or changes. Taking these steps can help get essential services—like student loan forgiveness—to those who need them most, and it can build trust in our public institutions.
Special thanks to Zoe Blumenfeld, Sha Hwang, Cyrus Sethna, and Karen Turner for their contributions to this article.
Written by
Editorial manager
Senior Infrastructure Engineer
Project Manager
Director of Engineering, Growth and Strategy