How we managed to reduce our latency by 3 times

‍

Did you ever have performance issues with a big micro-services architecture? Today I will share with you my personal story on how we managed to reduce latency by 3 times in the Wix authorization system, ending up with single-digit (in ms) responses consistently.

Some details:

My team and I are responsible for the authorization system in Wix. The system should (among others) be able to answer a single simple question: “Is the calling entity allowed to perform the requested operation on the provided resource?”. The answer to this question can be “Yes” or “No”, but getting this answer can be difficult. Let’s call this question isPermitted from now on.

In the last couple of months, we did a complete re-write of the authorization system in Wix. We switched from a role-based access control (AKA RBAC) to attribute-based access control (ABAC) which gave us more flexibility in terms of the features we could provide. We expected the new system to be much more robust and performant, and when we hit a major milestone of feature parity, we realized that although the new system performs way better than the old one, it is still not good enough.

One might say that “not enough” is something to argue about. In some cases, a latency of one second is enough, and in other cases, every nanosecond counts. When we finished the re-write, our system was able to answer in ~30ms. Sounds great, right? Wrong.

Consider the fact that each and every endpoint in Wix gets traffic both from the inside and from external callers. Also consider that a single call can pass through multiple services, aggregating info from each one. Hence, on a single roundtrip of a client request, multiple isPermitted calls are expected and acceptable. Now, when each call to isPermitted takes ~30ms, we can easily get to a few hundreds of ms per call, without performing any additional logic. Obviously, this number is no good. We had to reconsider many of the decisions we made, and understand how we can do better.

‍

‍

First of all, we needed to define our goals. 30 milliseconds is a single number, which doesn’t say much on its own. We started off by understanding the total picture. Is 30ms the time it takes an average call to be processed? Is it the worst case? Is it the best case? In solving performance issues, it is common to measure such things by percentiles. You’d often see the words p50, p95, p99, etc. These all mean the percentile on which the parameters were measured. p50 means the 50th percentile (meaning how well the system performs for 50% of the callers), p99 means the 99th percentile, etc. We measured our endpoint under these parameters, here is what we got:

In p50, we were able to respond in 28ms.
In p95–56ms.
In p99–90ms.

The picture is becoming a bit clearer. We were able to see that the issue is already big on the 50th percentile, so we were aiming to target fixes in this area first, and then move on to improving the more edge case issues.

Now that we have those three numbers, we still don’t have a good idea of where we should improve. Our next step was to break down the internal logic into measurable units, to pinpoint the problems. There are some tools that can do that for you, but the basic idea is to wrap every such logic unit with a timer, start it before performing the code, and stop it right afterward. An example of such logic can be a read from some external datastore, a call to an external service, or a read operation from a file. Usually (unless you are writing a super complicated business logic with multiple iterations over large amounts of data) — most of the time will be spent on IO of all kinds.

Next, you should report the results to an external log, and preferably aggregate it to a nice visual graph (we used Grafana for it). Now, we have a clear view of the internal core of our endpoint, and we can see what and how we can improve.

An important note: You should make sure that reporting this data does not affect the performance as well. Writing to a log file (for example) synchronically can have its own overhead, and might cause your whole service to perform even worse. Verify that the tool you use does not create an impact of its own.

‍

‍

Now, we had this awesome graph that was able to point us to the actual effect of each change. After we deployed each improvement, we waited a while and looked at the updated graph for the reduction of all (or some) of the metrics. This gave us immediate feedback on how well the deployed fix behaved.

Here comes the difficult part — the prioritization. Once you have in hand each part of the flow and how long it takes — you have all the information you need in order to decide which of the parts you want to tackle and improve first. From my experience, there are a few factors that you need to take in mind in order to make a decision:

1. What is the benefit of it? — How many seconds (or ms, in our case) will you save, and what part does the improvement take out of the total response time (in percentages)?

2. Which improvements are you able to make, and how difficult each of them is?
This requires a bit of deep-diving into the code and understanding what takes so long and how you can make it better. I will share some common use cases further in this post, but each business domain might have its own tweaks. You need to be able to declare (more or less) — where the pain points are in each flow, and how you can make them hurt less.

3. External dependencies — Some improvements might require other teams in your organization to participate (might be changes in the system or the framework, might be external services you use that operate too slow). You need to take this into consideration. If you have an endpoint that takes 5 seconds, 4 of which are spent on waiting for some service to respond — you don’t really have much to do (but the owners of the external service probably do).

4. Backward compliance — Breaking changes are hurtful. They hurt you (since you need to make sure all clients are migrated), and they hurt your clients (because now they have another task that they weren’t planning on doing). When improving performance, you might want to get some extra parameters from the caller, for example. Usually, it’s better not to force your client to pass these parameters, but to state that everyone doing this, will gain a performance boost. This will push the clients that are suffering the most to make the change you need them to.

‍

‍

Once you’ve considered all these issues, my suggestion is to start with quick wins first. You might have a very small change, that will not turn the world upside down, but you will reduce a couple of ms from the execution time. These quick wins pile over one another, ending up with a big improvement at a very small cost.

Another thing to keep in mind is your long term strategy. Don’t make changes that do not match the world you want to end up with, because eventually, you will have to revert them. In our case, some things are more complicated in the ABAC world, by definition. We could have simplified them up, and in return lose some of the features that ABAC provides — but keeping in mind the end goal we want to reach pushed us not to simplify it up.

I feel like I’ve been talking enough about general ideas, let’s get into specifics! A small side note beforehand — what we did is no silver bullet. An improvement in one system might be the wrong thing to do for another. Use these suggestions as a way of thinking, and not as the way-to-go.

‍

‍

Scaling up

The first thing you can think about in order to perform better is to scale up your system. It might be more instances, stronger instances, more memory, etc. A good indication that this is needed is that your service operates at high CPU most of the time, that a lot of time is wasted waiting for a free thread to handle the requests or a free connection to the DB (in that case maybe you need to consider scaling up the DB, and not the app). In case you have actual problems with the logic, and not a lack of resources — this will probably not help, however, this is the first (and easy) win you might be able to get. In our case, we were lacking memory. We scaled it up accordingly and ended up with a small, but essential, improvement.

Splitting DB READs from WRITEs

If you manage a service that receives a lot of traffic, you probably have more than one replica of your database. This is the case we had — our service had several replicas of our DB in each data center, in a primary-secondary architecture. Our configuration states that reads can be performed from the closest replica, but writes always hit the primary DB in the main DC.

The first thing we did (or more accurately, finished doing, as it was already partially done), is to make sure all the reads hit the secondary DBs. This has a massive effect, especially on the higher percentiles. Consider this: a cross-DC call (meaning — a call to a very remote instance) results in a penalty of 100ms (for example). This will cause DB reads to travel all the way to the main DC, instead of a super-quick operation in a very close machine. Another bad aspect of this solution is that the primary DB will be loaded with many reads, that it doesn’t have to do. This can cause slowness(you’d have to wait for a free connection) and in the worst cases even downtime! After we separated the reads from the writes we saw a nice reduction of up to 16%. We were starting to make progress!

Is everything you think that is in-memory actually is?

‍

The next thing we saw is that an operation that we thought was supposed to be super quick, took us ~2ms for each call. We’ve done this operation from 7 to 15 times in the flow of this code. The operation was reading a specific header. What we didn’t know is that this header was kept there in a serialized manner. We had to deserialize it every time we wanted to read it, before being able to use the object we got. We did not have that in mind at all when implementing the first solution, and we would have never known about it without the metrics we implemented at earlier stages!

At this point, the solution was “easy”. All we had to do is to deserialize it once and pass it forward to the underlying layers. In theory, this sounds very simple, but in reality, we know that permeating variables into methods can be a messy procedure, especially if there are tests and mocks involved. Such changes should be done with a surgical concentration, as changing both production code and test at once is very risky. The lesson to learn here is that it is always better for internal components not to rely on anything from outside, besides the given parameters. All external reads should be set up outside the business logic, which we want to always operate the same on the same input, regardless of the environment state. This is something we had to learn the hard way. After this change was done, we saw an amazing decrease in our latency of about 32%.

Killing Proxies

As part of our re-write, we kept the old authorization system (the one we replaced), as a proxy to the new system. We did it in order to avoid having all the clients switch their reads to the new system, in order to be able to control the flow of the requests, and fallback to the old one if needed. One of the things we noticed is that when calling through the proxy, the latency of the requests rose. This was due to an extra call that the proxy had to make, and a penalty you must pay whenever adding another hop to your roundtrip.

The next improvement we made was killing this proxy. As I said, breaking changes are hurtful, so in order to not break anyone — we suggested our clients migrate on their side in order to gain the performance boost. This moved the ball to our clients’ field — whoever feels the pain of the latency, migrated immediately. Everyone else will migrate little by little in an organized deprecation process. Remember that every hop on the way to the answer has a cost, and try to reduce these hops as much as possible.

Notice that not every service on the way to yours needs to be killed and its logic should be moved. Micro-service architecture is important, it’s much more scalable than a massive monolith, and it has an enormous effect on velocity. Killing services almost always have a cost (which might not be immediate).

Caching

The authorization world is tricky. You never want to allow anyone to do something they are not allowed to do on one hand, but on the other hand, most of the questions that come your way are ones you had already answered multiple times. We saw that we retrieve the same policy over and over, for most calls in the process.

This is the point we decided to cache such common policies. Caching is no magic solution, and as Phil Karlton once said — “There are only two hard things in Computer Science: cache invalidation and naming things.”. However, in this case, we had a clear indication that removing this roundtrip to the DB (or, more precisely, doing it once every long while) — will benefit our common clients a lot.

Conclusion

We started our journey with a single number (the response time of the call) and a statement that it is too slow. In the process, we introduced many metrics (that are also very helpful for production monitoring) and improved non-optimized parts of the system, step by step. We ended up with amazing results, as you can see in the graph below, and with a lot of lessons learned.

Here are my takeaways from this project:

Before diving in and jumping to conclusions — measure everything you can.
Prioritize what will give you the most benefit VS what will cost you the most (money or dev time).
Scale resources up if needed.
Split DB reads from writes.
Revisit assumptions you made regarding how fast parts of your system works.
Know how many hops a request has to make, and reduce some if possible.
Consider caching as an optimization.

Our journey isn’t done yet

We are thinking of multiple solutions to keep improving performance, therefore this is only the first step in our story. I hope I’ll be able to have another post in the near future, with the same title!

‍