Strategies to Handle Challenges in Caching
Insights into various challenges in different caching strategies and how to make a decision to pick one
Previously, I wrote about the basics of caching to give a high-level overview of the various types of caching. If you didn’t get a chance to read it, you can access it here:
In this article, I will dive into the factors that help determine whether you need a cache, the issues that commonly occur with different types of caches, and the considerations to keep in mind when choosing one over another.
How to determine if caching is needed for your system/service?
Cache Hit Ratio: If the data your service provides is retrieved frequently and doesn’t need to be refreshed often, a large share of requests can be served from the cache (a high hit ratio), and you should consider caching it.
Tolerance to Eventual Consistency: Consider the rate at which the source data changes and how often the cache is refreshed. Together these determine how stale a cached response can be, so weigh how important it is for the consumers of your service to fetch recent data.
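To make the hit ratio concrete, here is a minimal sketch that computes it from hit and miss counters and estimates the average latency it implies. The function names and the latency numbers are illustrative assumptions, and the model assumes a miss pays for both the cache lookup and the fetch from the source.

```python
def hit_ratio(hits: int, misses: int) -> float:
    """Fraction of lookups that were served from the cache."""
    total = hits + misses
    return hits / total if total else 0.0


def expected_latency_ms(ratio: float, cache_ms: float, source_ms: float) -> float:
    """Average latency, assuming a miss costs a cache lookup plus a source fetch."""
    return ratio * cache_ms + (1 - ratio) * (cache_ms + source_ms)


# Hypothetical counters and latencies, for illustration only.
ratio = hit_ratio(hits=900, misses=100)
print(f"hit ratio: {ratio:.2f}")
print(f"expected latency: {expected_latency_ms(ratio, 1.0, 50.0):.1f} ms")
```

If the estimated latency with a cache is not meaningfully better than going to the source directly, the extra complexity of a cache is probably not worth it.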
Types of Caching Strategy and their Challenges
Local (in-memory) cache: It is easier to implement, since it only needs some storage within the service (say, a hashmap), but it leads to the cache coherence problem: being local, the cache differs per server and can return inconsistent data. For example, server S1 responds to request R1 with data D1 and stores it in its local cache. If the data is then updated in the DB to version D2 and the same request R1 is made again, the response will be either D1 or D2 depending on which server the request lands on.
Cold start is another major issue with in-memory caches. Whenever a new server is added, or during a deployment, each server starts up with an empty cache, so the number of requests to the downstream dependencies can increase. This can be addressed using request coalescing.
External cache: It addresses the above issues since the cache is stored separately from the service, e.g. in Redis or Memcached.
It provides more storage and reduces cache eviction due to capacity.
Challenges include increased complexity of the overall system and the operational overhead of maintaining an additional cache server.
Always add code in the service to handle scenarios where cache might not be available:
Either you can fall back to calling the downstream services during that time. However, this could put increased load on them if the cache outage lasts for a long time.
Or, you can use an external cache along with a local in-memory cache, so that you avoid falling back to the downstream dependency completely.
You can also consider load shedding techniques, which restrict the number of requests sent to the downstream services so that they are not overwhelmed.
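The three options above can be combined. The following is a rough sketch, not a production implementation: `external_get`, `fetch_downstream`, `CacheUnavailable` and `Overloaded` are hypothetical names standing in for your cache client, your downstream call and their errors. The local copy is a plain unbounded dict here, and writing back to the external cache is omitted for brevity.

```python
import threading


class CacheUnavailable(Exception):
    """Raised by the (hypothetical) external cache client when it is unreachable."""


class Overloaded(Exception):
    """Raised when a request is shed instead of being sent downstream."""


class FallbackReader:
    def __init__(self, external_get, fetch_downstream, max_in_flight=10):
        self._external_get = external_get          # callable: key -> value or None
        self._fetch_downstream = fetch_downstream  # callable: key -> value
        self._local = {}                           # local in-memory copy
        # Caps how many downstream calls may run at once (load shedding).
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def get(self, key):
        try:
            value = self._external_get(key)
            if value is not None:
                self._local[key] = value
                return value
        except CacheUnavailable:
            # External cache is down: prefer the local copy if we have one.
            if key in self._local:
                return self._local[key]

        # Cache miss, or outage with no local copy: go downstream,
        # but shed the request if too many calls are already in flight.
        if not self._slots.acquire(blocking=False):
            raise Overloaded(key)
        try:
            value = self._fetch_downstream(key)
        finally:
            self._slots.release()
        self._local[key] = value
        return value
```

A caller would typically translate `Overloaded` into a "try again later" response rather than retrying immediately, since immediate retries defeat the purpose of shedding load.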
To handle scaling of cache and elasticity issues:
If the cache reaches its capacity, we need to scale it by adding more nodes. A deeper understanding of how the system behaves as it approaches capacity (for example, memory utilization per container rising as the cache fills up) helps in setting up accurate alarms, and these alarms can be used to trigger scaling of the caching service. While scaling, keep a couple of things in mind: does your caching cluster support adding nodes without any downtime, and does it support consistent hashing to distribute the traffic evenly? Always make sure to test the scaling strategy by simulating failures.
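To illustrate why consistent hashing matters when adding nodes, here is a minimal hash ring sketch. The class name, the node names and the choice of 100 virtual nodes per server are assumptions made for the example; real cache clients ship their own implementations.

```python
import bisect
import hashlib


class HashRing:
    """Maps keys to cache nodes so that adding a node moves only a fraction of keys."""

    def __init__(self, nodes=(), vnodes=100):
        self._vnodes = vnodes   # virtual points per node, to even out the spread
        self._points = []       # sorted list of (hash, node)
        self._hashes = []       # hashes only, kept in the same order for bisect
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value):
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add_node(self, node):
        for i in range(self._vnodes):
            self._points.append((self._hash(f"{node}#{i}"), node))
        self._points.sort()
        self._hashes = [h for h, _ in self._points]

    def node_for(self, key):
        if not self._points:
            raise LookupError("no cache nodes in the ring")
        # First point clockwise from the key's hash, wrapping around at the end.
        index = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._points[index][1]


ring = HashRing(["cache-1", "cache-2", "cache-3"])
print(ring.node_for("user:42"))
ring.add_node("cache-4")  # only keys that now belong to cache-4 change owner
print(ring.node_for("user:42"))
```

With a naive `hash(key) % number_of_nodes` scheme, adding a node would remap most keys at once and cause a burst of cache misses, which is exactly what you want to avoid while scaling under load.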
To handle robustness of data:
Code and cached data must stay compatible in both directions: older code should be able to read data written to the cache in a newer format, and newer code should be able to handle older versions of the data being served from the cache.
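One common way to get this robustness is to store a version number alongside each cached value. The sketch below is an assumption-laden example: the `v` and `data` field names, the `locale` field and its default are all hypothetical. Newer code upgrades old entries on read, and any code that sees a version it doesn't understand treats the entry as a cache miss.

```python
import json

CURRENT_VERSION = 2  # hypothetical: version 2 added a "locale" field


def encode(user: dict) -> str:
    """Serialize a value for the cache, tagged with the format version."""
    return json.dumps({"v": CURRENT_VERSION, "data": user})


def decode(raw: str):
    """Return the cached value, or None if the entry should be treated as a miss."""
    entry = json.loads(raw)
    version = entry.get("v")
    data = entry.get("data")
    if version == 2:
        return data
    if version == 1:
        # Old entry written before "locale" existed: fill in a default.
        return {**data, "locale": "en"}
    # Unknown (probably newer) format: don't guess, fall back to the source.
    return None


print(decode(encode({"name": "Ada", "locale": "fr"})))
print(decode('{"v": 1, "data": {"name": "Ada"}}'))
print(decode('{"v": 3, "data": {"name": "Ada"}}'))
```

Treating unknown versions as a miss is what keeps a rollback safe: the old code simply refetches instead of misreading data written by the newer deployment.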
Critical Cache Implementation Considerations
Cache size: Size the cache based on the type of data that flows through your service, so that it is large enough to hold the frequently requested data and increase the cache hit rate.
Eviction policy: This refers to moving data out of the cache when it hits capacity. The most common eviction policy is LRU (Least Recently Used), which removes the entry that has gone unused the longest.
Expiration policy: This determines how long data can be retained in the cache, depending on how frequently the data is refreshed or how much stale data the customer can tolerate.
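As a minimal sketch of LRU eviction, the class below keeps entries in an `OrderedDict` ordered from least to most recently used. The class name and the capacity of 2 in the usage lines are just for illustration.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict the least recently used


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is now more recently used than "b"
cache.put("c", 3)  # over capacity, so "b" is evicted
print(cache.get("b"), cache.get("a"), cache.get("c"))
```

For caching the results of a pure function, Python's built-in `functools.lru_cache` decorator gives the same behaviour without a hand-written class.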
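An expiration policy is often implemented as a time-to-live (TTL) stored with each entry. Here is a small sketch, assuming a single TTL for all keys and lazy expiry on read; the `clock` parameter is an addition of mine so that the behaviour can be tested without sleeping.

```python
import time


class TTLCache:
    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = {}  # key -> (value, expires_at)

    def put(self, key, value):
        self._items[key] = (value, self._clock() + self._ttl)

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: drop it and report a miss
            return default
        return value


cache = TTLCache(ttl_seconds=60)
cache.put("price:sku-1", 19.99)
print(cache.get("price:sku-1"))
```

A short TTL keeps data fresher at the cost of more trips to the source, so the right value follows directly from how much staleness your consumers can tolerate.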
Downstream Service Unavailability
If the downstream service is unavailable for some reason, the cache service should be able to retain the cached data for a longer duration and, instead of hitting the downstream service with requests to refresh the data, wait until it recovers. Based on the trade-off you can make with your customers, either serve stale data from the cache and avoid browning out the downstream service with requests, or have a mechanism to cache the error response from the downstream service and translate it accordingly.
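The "serve stale data" option can be sketched as follows. This is one possible shape under stated assumptions: `fetch` is a hypothetical callable for the downstream call, and `retry_after_seconds` is a made-up knob that controls how long the stale entry is kept before the downstream service is tried again.

```python
import time


class StaleOnErrorCache:
    def __init__(self, fetch, ttl_seconds, retry_after_seconds, clock=time.monotonic):
        self._fetch = fetch                     # callable: key -> value, may raise
        self._ttl = ttl_seconds
        self._retry_after = retry_after_seconds
        self._clock = clock
        self._items = {}                        # key -> (value, refresh_at)

    def get(self, key):
        now = self._clock()
        entry = self._items.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]                     # still fresh (or in back-off)
        try:
            value = self._fetch(key)
        except Exception:
            if entry is None:
                raise                           # nothing cached to fall back on
            # Downstream is failing: keep the stale value a little longer and
            # leave the downstream service alone until the back-off elapses.
            self._items[key] = (entry[0], now + self._retry_after)
            return entry[0]
        self._items[key] = (value, now + self._ttl)
        return value
```

Whether stale data is acceptable at all is a product decision; for data where it is not, caching the error response for a short time is the alternative described above.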
Security: One concern with a cache cluster is the security of sensitive and/or customer data that’s being cached. Sensitive data should be encrypted before it is stored, and it should be protected in transit to and from the cache.
Also, since cached data is returned faster than data fetched from the DB, attackers can identify the type of requests being made by a service based on response times. This is a side-channel timing attack.
“Thundering Herd” Problem
During the unavailability of the downstream service, if multiple requests are made to get un-cached data, each of them can trigger retries against the downstream service, which in turn could brown out the service. We can combine a few strategies to mitigate this issue, such as per-customer or per-request throttling and request coalescing, i.e. sending only one request for the same un-cached data.
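Request coalescing can be sketched with threads as below. `SingleFlight` and `do` are names I chose for the example (the pattern is often called "single flight"); the idea is that the first caller for a key does the real work and every concurrent caller for the same key waits for that result instead of issuing its own request.

```python
import threading


class _Call:
    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}  # key -> in-flight _Call

    def do(self, key, fn):
        """Run fn() once per key at a time; concurrent callers share the result."""
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._calls[key] = call

        if leader:
            try:
                call.result = fn()
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._calls[key]
                call.done.set()  # wake up everyone waiting on this key
        else:
            call.done.wait()

        if call.error is not None:
            raise call.error
        return call.result
```

A service would wrap its cache-miss path with something like `flights.do(key, lambda: load_from_downstream(key))`, where `load_from_downstream` is your own function. Note that this only coalesces requests within one process, so it is usually combined with throttling across servers.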
Caching is a means of providing faster access to data and increasing the availability of the downstream service, but it comes at the cost of the increased complexity of managing cache nodes. By understanding the requirements of the downstream service, we can come up with a caching solution, which should then be monitored diligently so that its parameters can be tuned for scenarios such as traffic peaks, cache unavailability, downstream service brownouts, etc.
I hope this article provides a sneak peek into some of the core considerations to keep in mind while deploying a caching solution for your service.
If you like the post, share and subscribe to the newsletter to stay up to date with tech/product musings.
(The contents of this blog reflect my personal opinions and my own reading of a bunch of articles, and are in no way influenced by my employer.)