From Minutes to Milliseconds
I was working with a health insurance company that was creating a new mobile application for their customers, and they needed to get a subscriber’s claims data to display in the application. An existing data endpoint gave them almost everything they needed, except that it took two minutes to return a response. The new app needed only two things: the addition of “just a few simple things” to the data and a response time of less than 30 seconds. When mobile asked the department that owned the service how long the adjustments would take, the estimate came back as one month to return the new data fields and an additional five months to get the response under 30 seconds. When the mobile group heard the estimate of 1-6 months, they came to the team I was working with to help them get it done sooner.
Requested Response Time - 0:30 | Current Response Time - 6:00
Our first pass took us a couple of days. Initially, we reached out to the group that owned the endpoint and some of the data analysts for the system to talk about where and how the data was stored. We then threw together a very quick, simple, and dirty implementation of the endpoint that duplicated the existing logic in a very verbose and ugly way. We rewrote the same basic data call in a slightly new way, deployed a new endpoint, and had something that did everything the mobile app needed. The only problem was that the mobile call to our endpoint took six minutes to return the data.
Mobile was not happy with a 6-minute call and made it absolutely clear that such performance was unacceptable (which we already knew). Mobile also discovered that the data they’d requested was incorrect, and they needed some changes. This discovery allowed both them and us to catch and correct a missed requirement early and adjust very quickly. The existing endpoint owners sat back smugly and said, “See! We told you so!” We just smiled and said, “Just wait, that’s only version one,” and started making our walking skeleton better.
Requested Response Time - 0:30 | Current Response Time - 2:00
Our second pass was delivered one day later. We’d added simple indexes to the tables that we were using, and our return time was now hovering around two minutes. Mobile was still not happy, even though the response now included the corrected data. It was still too slow for a mobile environment. The existing endpoint owners were definitely less smug, as it had taken us only three working days to deliver the requested new fields in the same two-minute call.
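To make the effect of a "simple index" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, column, and index names are hypothetical and are not the insurer's actual schema; the point is only that the same lookup goes from a full table scan to an index search once the index exists.

```python
import sqlite3


def build_claims_db(with_index: bool) -> sqlite3.Connection:
    """Create an in-memory claims table (hypothetical schema) with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE claims ("
        "id INTEGER PRIMARY KEY, subscriber_id INTEGER, amount REAL)"
    )
    # 10,000 fake claims spread across 1,000 fake subscribers.
    rows = [(i % 1000, float(i)) for i in range(10_000)]
    conn.executemany(
        "INSERT INTO claims (subscriber_id, amount) VALUES (?, ?)", rows
    )
    if with_index:
        # The "simple index": one index on the column used in the WHERE clause.
        conn.execute(
            "CREATE INDEX idx_claims_subscriber ON claims (subscriber_id)"
        )
    return conn


def query_plan(conn: sqlite3.Connection) -> str:
    """Return SQLite's description of how it would run the subscriber lookup."""
    plan = conn.execute(
        "EXPLAIN QUERY PLAN SELECT amount FROM claims WHERE subscriber_id = ?",
        (42,),
    ).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " ".join(row[3] for row in plan)
```

Calling `query_plan(build_claims_db(with_index=True))` should mention `idx_claims_subscriber`, while the unindexed database reports a scan of the whole table. The exact wording of the plan varies between SQLite versions.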
Requested Response Time - 0:30 -> 0:02 | Current Response Time - 0:30
Knowing two minutes was bad, we started our next refinement: how we queried the information database. We looked at the data relations, at other source tables, and at how data moved through the system. We found that there were other ways to aggregate the claims data and speed up the response time. This work was very detailed, and we did a lot of experimental prototyping of different ideas in the two days it took us to deliver the next endpoint version. That version returned all the data needed for the call in around 30 seconds. Hooray, we’d hit the goal set by mobile in a total of five days! Unfortunately, as commonly happens, mobile was not satisfied with 30 seconds; they moved the goal to a two-second response time. The owners of the current REST endpoint weren’t talking to us anymore for some reason.
Requested Response Time - 0:02 | Current Response Time - 0:01
With our new goal of two seconds, we went back to work. Hitting 30 seconds for a response time had been tough and required us to do the obvious, easy, and even medium-hard data optimizations. There wasn’t much left to do in terms of query optimization, so we started to look at the data we were retrieving. Did we need to deliver all historical claims, or could we just return a year’s worth? It turned out that mobile was already trimming the claims to 12 months. Win: less work for them and us! Armed with that new insight, we were able to go back and change some things, and after another two days’ worth of work, we beat mobile’s “new” two-second requirement by returning the data in one second.
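The idea of returning only a year's worth of claims can be sketched as a date filter applied at the data source rather than in the client. This is an illustration only, again using sqlite3; the `claims` table, the `service_date` column (stored as ISO 8601 text), and the function name are assumptions, not details from the real system.

```python
import sqlite3
from datetime import date, timedelta


def claims_since(conn: sqlite3.Connection, subscriber_id: int,
                 today: date, window_days: int = 365) -> list:
    """Return (id, service_date) rows no older than window_days, newest first."""
    # ISO 8601 date strings sort in calendar order, so a plain text
    # comparison in SQL is enough to apply the cutoff.
    cutoff = (today - timedelta(days=window_days)).isoformat()
    return conn.execute(
        "SELECT id, service_date FROM claims "
        "WHERE subscriber_id = ? AND service_date >= ? "
        "ORDER BY service_date DESC",
        (subscriber_id, cutoff),
    ).fetchall()
```

Filtering in the query means older rows are never read, serialized, or sent over the network, which is where the savings come from when the client was going to discard them anyway.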
Mobile was ecstatic and very pleased with the work that we’d done. The previous endpoint owners had taken us off their Christmas list, and we still didn’t feel we were done. We requested another couple of days to see if we could improve performance to what we felt was an acceptable mobile response time: under 100 milliseconds. Since the group had given us three months to do this work, and we’d just beaten their performance requests twice, they were happy with us taking a little extra time to “work our magic” on things.
Requested Response Time - 0:02 | Current Response Time - 0:00.024
So, we started pushing the envelope of the whole system. We added in system-level caching, something completely new to our entire system. We also implemented a data prefetching system that would put claims data into the cache when the user logged into the mobile app. We did this by creating simple subsystems that were available for the whole system to use. Now the entire system had access to caching and the ability to preload any data we wanted for a user. This work took us a little bit longer to do, but we’d played around with different prototypes of this thinking along the way, so we were able to put in a good solid system in a little over four working days. Our system now had a response time of 16-24 ms. Mobile was speechless.
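The combination of caching and login-time prefetching can be sketched roughly as follows. This is a simplified, in-process illustration and not the subsystem we actually built: the `ClaimsCache` class, the `on_login` hook, the loader callback, and the five-minute expiry are all hypothetical, and a production version would most likely prefetch asynchronously and use a shared cache.

```python
import time
from typing import Callable, Dict, List, Tuple


class ClaimsCache:
    """Tiny time-limited cache keyed by subscriber ID (illustrative only)."""

    def __init__(self, loader: Callable[[str], List[dict]],
                 ttl_seconds: float = 300.0) -> None:
        self._loader = loader          # the slow call to the real data source
        self._ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, List[dict]]] = {}

    def prefetch(self, subscriber_id: str) -> None:
        """Load claims ahead of time and store them with an expiry time."""
        expires_at = time.monotonic() + self._ttl
        self._entries[subscriber_id] = (expires_at, self._loader(subscriber_id))

    def get(self, subscriber_id: str) -> List[dict]:
        """Serve from the cache, reloading only if missing or expired."""
        entry = self._entries.get(subscriber_id)
        if entry is None or entry[0] <= time.monotonic():
            self.prefetch(subscriber_id)
            entry = self._entries[subscriber_id]
        return entry[1]


def on_login(cache: ClaimsCache, subscriber_id: str) -> None:
    """Hypothetical login hook: warm the cache before the app asks for claims."""
    cache.prefetch(subscriber_id)
```

With this shape, the slow lookup happens once at login, and the later claims request is answered from memory. That is the general reason a prefetched cache can turn a one-second call into one measured in milliseconds.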
This highlights the extreme power of iterating on simple things. We started by delivering something simple, quick, and dirty. We knew it was not the end state, but it minimally met the requirements and allowed for integration with everyone involved. It allowed us to play with how we were doing things by emphasizing evolutionary design, constant iterative improvement, and delivering working software. Early on in the project, we were advised to take 1-2 weeks to plan our approach, relying on subject matter experts within the company for help. These were the same experts who had estimated that the work would take five months. No reasonable amount of planning, even with all the right people, would have resulted in the changes we made. Further, I believe that this planning would have hindered us from recognizing emerging solutions by anchoring us to false expectations and the “sunk cost” effect. In the end, the mobile group got exactly the data they needed blazingly fast, vastly outperforming their initial, extremely hopeful request of 30 seconds and their later request of 2 seconds. We also beat the 2-3 months they had expected to give us to deliver this. In two days, they had something they could work against; in five working days, they had their initial request met; and in 11 working days, they had a comprehensive solution. Finally, in 12 days, they had a system that exceeded their expectations and gave the whole system significantly more flexibility.