Nava’s Scalable Login System (SLS) provides authentication and account management for users on HealthCare.gov. It is a RESTful API service built on Amazon Web Services. The Nava team wanted to see just how far SLS could scale, so we ran a load test with one billion users, 50 times the 20 million accounts that SLS currently handles.
Summary
SLS served 7,754 transactions per second for an hour. Response time at the 90th percentile was 128 ms, and there were zero errors. The test ran on roughly nine times the compute capacity of current production, measured in cores: 70 4-core machines (280 cores) versus 15 2-core machines (30 cores). Application server CPU utilization stayed at a comfortable 50 percent.
Nava’s Scalable Login System running at 7,754 transactions per second, with a 128 ms response time at the 90th percentile and zero errors.
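The capacity comparison in the summary can be checked with a little arithmetic. The sketch below uses only the figures quoted above. The variable names are illustrative, and the per-core number assumes that all 70 machines are application servers sharing the load evenly, which the post does not state.

```python
# Figures quoted in the summary; variable names are illustrative.
test_servers, test_cores_each = 70, 4
prod_servers, prod_cores_each = 15, 2
throughput_tps = 7754

test_cores = test_servers * test_cores_each   # 280 cores in the test fleet
prod_cores = prod_servers * prod_cores_each   # 30 cores in production

server_ratio = test_servers / prod_servers    # about 4.7x as many machines
core_ratio = test_cores / prod_cores          # about 9.3x as many cores

# Rough per-core throughput, ASSUMING load is spread evenly across all cores.
tps_per_core = throughput_tps / test_cores    # about 27.7 transactions/s

print(f"machines: {server_ratio:.1f}x, cores: {core_ratio:.1f}x")
print(f"throughput per core: {tps_per_core:.1f} transactions/s")
```

Counted by machines the test fleet is under five times the size of production; the "nine times" figure only appears when comparing cores.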
Approach
Tools
Nava has developed its own load testing infrastructure based on two open source tools: Apache JMeter, an industry-standard load testing tool, and ruby-jmeter. The tests are written in Ruby from reusable components that simulate SLS client HTTP requests. All components and tests are revision controlled in GitHub. The tests were conducted from a distributed load generation “grid” (of only two machines!) with a total of 6,000 worker threads.
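To illustrate the idea of reusable request components driven by a pool of worker threads, here is a minimal sketch in Python. It is not Nava's code: the real tests are JMeter plans written with ruby-jmeter and run against SLS over HTTP. Everything named here (`FakeLoginService`, `user_scenario`, `run_load`, the status codes, the worker count) is a hypothetical stand-in, and the "service" is an in-memory object so the example runs anywhere.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeLoginService:
    """In-memory stand-in for the API under test (hypothetical, not SLS)."""

    def __init__(self):
        self._users = {}
        self._lock = threading.Lock()

    def register(self, name, password):
        with self._lock:
            self._users[name] = password
        return 201

    def login(self, name, password):
        with self._lock:
            ok = self._users.get(name) == password
        return 200 if ok else 401

    def get_user(self, name):
        with self._lock:
            return 200 if name in self._users else 404


def timed(call, *args):
    """Run one request component and return (status, elapsed seconds)."""
    start = time.perf_counter()
    status = call(*args)
    return status, time.perf_counter() - start


def user_scenario(service, worker_id):
    """One reusable scenario: register, log in, fetch user information."""
    name, password = f"user-{worker_id}", "hypothetical-password"
    return [
        timed(service.register, name, password),
        timed(service.login, name, password),
        timed(service.get_user, name),
    ]


def run_load(service, workers=50):
    """Run one scenario per worker thread and summarize the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        batches = list(pool.map(lambda i: user_scenario(service, i), range(workers)))
    results = [r for batch in batches for r in batch]
    latencies = sorted(elapsed for _, elapsed in results)
    errors = sum(1 for status, _ in results if status >= 400)
    # Simple index-based estimate of the 90th percentile latency.
    p90 = latencies[int(0.9 * (len(latencies) - 1))]
    return {"requests": len(results), "errors": errors, "p90_seconds": p90}


summary = run_load(FakeLoginService())
print(summary)
```

The summary reports the same kinds of figures this post quotes for the real test: request count, error count, and 90th percentile latency.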
Architecture
The test
We wanted to see how far the current SLS architecture could scale. The current system has over 20 million users in the database. The observed peak load during HealthCare.gov’s 2015 Open Enrollment was about 150 requests per second, on December 14th. The load test simulated key API requests for registering users, logging in and getting user information.
Extrapolating from current peak service size and usage, our goal was a database of one billion users and a request rate of 7,500 per second (50 times both the current number of users and the peak throughput SLS had seen). While populating the database with one billion users, Brendan posted progress updates in Slack.
Running the test proved to be... uneventful.
With the database of one billion users in place, the load test was ready to run. We sustained 7,754 requests per second for one hour with acceptable latency and zero errors.
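A short worked check ties the target and the result together. The inputs are the numbers quoted in this post. The last step applies Little's Law (concurrency = throughput × time per iteration) and assumes that all 6,000 worker threads were active for the whole run and that a "transaction" is a single request; neither is stated explicitly, so treat that figure as an estimate.

```python
# Inputs quoted in the post; variable names are illustrative.
peak_rps = 150            # observed production peak
scale_factor = 50         # one billion users / 20 million users
achieved_rps = 7754
worker_threads = 6000
test_seconds = 3600       # one hour

target_rps = peak_rps * scale_factor            # 7,500 requests/s
headroom = achieved_rps / target_rps - 1        # about 3.4% above target
total_requests = achieved_rps * test_seconds    # 27,914,400 requests

# Little's Law, ASSUMING every thread was busy for the full hour:
# average time each thread spends per request, including any pauses.
seconds_per_iteration = worker_threads / achieved_rps   # about 0.77 s

print(f"target: {target_rps} rps, achieved: {achieved_rps} rps (+{headroom:.1%})")
print(f"total requests in one hour: {total_requests:,}")
print(f"average time per request per thread: {seconds_per_iteration:.2f} s")
```

Under those assumptions each thread issued a request about every 0.77 seconds. Since the 90th percentile response time was 128 ms, most of that interval would be pacing or client-side overhead rather than waiting on SLS.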
The most time-consuming part of this exercise was populating the database with one billion users. At 2,000 users per second, it still took over a week!
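The scale of the population step is easier to appreciate with the arithmetic written out. This sketch uses only the two figures above. The conclusion it draws about the average rate is an inference from those figures, not something the post states.

```python
# Figures quoted above; variable names are illustrative.
users = 1_000_000_000
insert_rate = 2000                 # users per second
seconds_per_day = 86_400

# Time needed if the quoted rate were sustained without interruption.
ideal_seconds = users / insert_rate            # 500,000 s
ideal_days = ideal_seconds / seconds_per_day   # about 5.8 days

# Average rate implied by a load that takes exactly one week.
rate_for_one_week = users / (7 * seconds_per_day)   # about 1,653 users/s

print(f"at a steady {insert_rate}/s: {ideal_days:.1f} days")
print(f"a one-week load averages {rate_for_one_week:.0f} users/s")
```

A steady 2,000 users per second would finish in just under six days, so a load that took over a week must have averaged somewhat less than that, for example because of pauses or restarts along the way.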
Conclusion
Currently available open source software and commodity cloud infrastructure can, if properly implemented, perform under intense loads. SLS has performed very well for the more than 20 million users currently on HealthCare.gov, and this load test demonstrates that SLS can comfortably accommodate roughly three times the entire American population without any major changes.
Brendan Neutra, who designed and executed the test, was recognized as a 2016 FCW Rising Star for this load testing work.
Written by
Senior Infrastructure Engineer
Software Engineer