The Importance of Caching Technology
Caching is an indispensable technology in computer systems, used at many levels, from hardware caches and operating systems to web browsers, and it plays an especially significant role in backend development. For a company the size of Meta, caching is particularly critical: it reduces latency, absorbs heavy workloads, and saves cost. Meta's reliance on caching has pushed it to set a high bar when facing the challenges of cache invalidation. Through sustained effort, Meta has reached a very high level of cache consistency, with fewer than one inconsistency per ten billion cache writes.
Cache Invalidation and Cache Consistency
A cache is not the original data source, so when the data source is updated, the cached copy must be updated or invalidated as well to stay consistent. This is harder than it sounds: if it is not handled properly, cached data can remain inconsistent with the data source for a long time. Ideally, a mechanism proactively refreshes or expires stale cache entries. Configuring a TTL (Time To Live) on cache entries bounds how long stale data can survive, which keeps most caches reasonably fresh. For Meta, however, relying on TTL alone is not sufficient; it must guarantee a high degree of cache consistency even in a distributed system.
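To make the TTL approach concrete before moving on, here is a minimal sketch of an in-process cache whose entries expire after a fixed TTL. This is purely illustrative, not Meta's implementation; the TTLCache name and ttl_seconds parameter are assumptions for the example.

import time

class TTLCache:
    """Minimal illustrative cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def set(self, key, value):
        # Store the value together with the time at which it stops being valid.
        self._store[key] = (value, time.monotonic() + self.ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None  # cache miss
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # The entry is stale; drop it so the caller falls back to the data source.
            del self._store[key]
            return None
        return value

With cache = TTLCache(ttl_seconds=60), a stale entry can survive for at most a minute, which is exactly why TTL alone only bounds inconsistency rather than eliminating it.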
For example, consider an ordering problem: an operation is on its way to update the cache, but before it arrives, the data has already been changed again in the original data source. If the stale refresh arrives after the newer one, the cache ends up holding old data, causing an inconsistency. A common remedy is a versioning mechanism: each write carries a version field that is used to resolve conflicts, so an older version of the data can never overwrite a newer one. This is sufficient for most internet companies, but for a system as complex as Meta's, a more advanced cache consistency strategy is required.
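The following sketch, which is not taken from Meta's code, shows the idea: the cache stores the version alongside the value and refuses any update whose version is not strictly newer.

def set_if_newer(cache, key, value, version):
    """Apply an update only if its version is newer than what the cache holds.

    `cache` maps key -> (value, version); this layout is illustrative only.
    """
    current = cache.get(key)
    if current is not None and current[1] >= version:
        # A newer (or identical) version is already cached; drop the stale update.
        return False
    cache[key] = (value, version)
    return True

# Out-of-order arrival: the older write (version 3) shows up after version 4.
cache = {}
set_if_newer(cache, "x", 4, version=4)   # accepted
set_if_newer(cache, "x", 3, version=3)   # rejected, cache still holds version 4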
Why Meta Emphasizes Cache Consistency
From Meta's perspective, cache inconsistency can cause problems as serious as data loss, and for users it directly degrades their experience. For instance, sending a message on Instagram involves writing to backend storage and updating the corresponding cache entries, and consistency issues can directly affect whether the message is delivered and displayed correctly.
In a hypothetical scenario, there are three users: Bob, Mary, and Alice. Bob is in the United States, Alice is in Europe, and Mary lives in Japan. Both Bob and Mary want to send Alice a message. To reduce latency, the system routes each message through the data-storage region closest to the sender. A problem can arise if the servers in Bob's and Mary's regions are inconsistent with each other: queries against TAO replicas may return stale data, the message may not be routed correctly to Alice's region, and Alice never receives it. This is a real risk in the message-delivery path and one that urgently needs to be addressed.
To address cache invalidation and data inconsistency, the first step is precise monitoring: build a mechanism that accurately measures cache consistency and alerts the relevant engineers as soon as an inconsistency is detected. The measurements also have to be trustworthy. If the metric produces frequent false alarms, on-call engineers will start ignoring it, and the measurement loses its value.
A naive alternative would be to record and track every cache state change. That is feasible when data volumes are small, but for a system like Meta's, which performs more than a trillion cache operations a day, logging every change would be prohibitively expensive, and it would still leave an enormous amount of data to sift through when debugging.
Polaris attempts to resolve these issues more efficiently. It interacts with stateful services purely as a client and does not need to understand the internals of the service it checks. Its design principle is that "the cache will eventually be consistent with the database." When Polaris receives an invalidation event, it queries all replicas to check whether any of them violate that guarantee. For example, if Polaris receives an invalidation event saying "x=4 @ version 4", it checks every cache replica to confirm there is no inconsistency. If a replica still reports "x=3 @ version 3", Polaris marks that replica as inconsistent and enqueues it so the same cache host can be checked again later.
Furthermore, Polaris reports inconsistencies at multiple time scales, for example 1 minute, 5 minutes, and 10 minutes. This multi-window design makes it easy to manage backoff and retries with separate queues, and it is also crucial for avoiding false positives.
To give a more specific example, suppose Polaris receives a notification that "x=4 @ version 4" has been invalidated but cannot find any entry for x in the cache. Polaris should flag this situation as potentially inconsistent. There are two possibilities: the earlier version-3 entry for x may simply never have been present in the cache, or version 4 may be the latest write for that key and the missing entry is a genuine cache inconsistency.
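A highly simplified sketch of this check follows. It is not Polaris's actual code; the replicas structure, the recheck_queue list, and the handle_invalidation_event name are all assumptions made for illustration.

recheck_queue = []  # (host, key, version) tuples to re-verify later

def handle_invalidation_event(replicas, key, expected_version):
    """Check every cache replica after an invalidation event.

    `replicas` is assumed to map host name -> {key: (value, version)}.
    """
    for host, cache in replicas.items():
        entry = cache.get(key)
        if entry is None:
            # Missing entry: possibly stale, possibly never cached; recheck later.
            recheck_queue.append((host, key, expected_version))
            continue
        _, version = entry
        if version < expected_version:
            # Replica still holds an older version: flag it and recheck later.
            recheck_queue.append((host, key, expected_version))

# Example: one replica lags behind after an invalidation of "x=4 @ version 4".
replicas = {
    "host-a": {"x": (4, 4)},
    "host-b": {"x": (3, 3)},   # stale
}
handle_invalidation_event(replicas, "x", expected_version=4)
print(recheck_queue)  # [('host-b', 'x', 4)]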
In a complex system, distinguishing real inconsistencies from benign ones is itself a challenge. For example, a later write at version 5 may have deleted key x, in which case the missing cache entry is not an inconsistency at all. To disambiguate such situations, the checker has to verify the data against the database directly, bypassing the cache. Those queries are resource-intensive, and issuing them too frequently would put additional load and risk on the database.
To strike a balance between verifying consistency and protecting the database, Polaris only queries the database once a potential inconsistency has persisted past a certain time threshold, such as 1 or 5 minutes. This delayed verification is what allows it to report metrics such as "within 5 minutes, 99.99999999% of cache writes are consistent with the database" while keeping the extra load on the database small.
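A minimal sketch of this "check the database only after a threshold" idea is shown below. The suspects structure and the read_from_database callback are assumptions for the example, not part of any real Polaris API.

import time

CHECK_THRESHOLD_SECONDS = 60  # e.g. only verify suspicions older than 1 minute

def verify_suspected_inconsistencies(suspects, cache, read_from_database):
    """Confirm or dismiss suspected inconsistencies that are old enough.

    `suspects` is assumed to be a list of (key, expected_version, first_seen) tuples;
    `read_from_database(key)` is assumed to return (value, version) from the source of truth.
    """
    still_suspect, confirmed = [], []
    for key, expected_version, first_seen in suspects:
        if time.time() - first_seen < CHECK_THRESHOLD_SECONDS:
            # Too early: the invalidation may simply not have propagated yet.
            still_suspect.append((key, expected_version, first_seen))
            continue
        db_value, db_version = read_from_database(key)   # bypasses the cache
        cached = cache.get(key)
        if cached is None or cached[1] < db_version:
            confirmed.append((key, db_version))          # a real inconsistency
    return still_suspect, confirmed

Only the suspicions that survive the threshold trigger a database read, which is how the extra load on the data source stays bounded.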
Consider how to resolve cache inconsistency at the code level. Assume a cache system maintains a key-value pair mapping as well as an association of keys to version numbers. When the cache system receives a read request, it first looks up the cache; if there’s a miss, it retrieves data from the database and populates the cache. This can be implemented with an asynchronous thread.
Here is a code snippet simulating this process:
import threading
import time

# In-memory cache: key -> cached value, and key -> cached version
cache_data = {}
cache_version = {}

# The "database": key -> value, and key -> version of the latest write
meta_data_table = {"1": 42}
version_table = {"1": 4}

def read_value(key):
    value = read_value_from_cache(key)
    if value is not None:
        return value
    else:
        # Cache miss: serve the read directly from the database
        return meta_data_table[key]

def read_value_from_cache(key):
    if key in cache_data:
        return cache_data[key]
    else:
        # Populate the cache asynchronously and report a miss to the caller
        fill_cache_thread = threading.Thread(target=fill_cache, args=(key,))
        fill_cache_thread.start()
        return None

def fill_cache(key):
    fill_cache_metadata(key)
    fill_cache_version(key)

def fill_cache_metadata(key):
    meta_data = meta_data_table[key]
    print("Filling cache meta data for", key)
    cache_data[key] = meta_data

def fill_cache_version(key):
    time.sleep(2)  # Simulate work: the version is cached later than the value
    version = version_table[key]
    print("Filling cache version data for", key)
    cache_version[key] = version

def write_value(key, value):
    version = version_table.get(key, 0) + 1
    write_in_database_transactionally(key, value, version)
    time.sleep(3)  # Simulate work before the invalidation is sent
    invalidate_cache(key, value, version)

def write_in_database_transactionally(key, data, version):
    meta_data_table[key] = data
    version_table[key] = version

def print_values():
    # Small helper (assumed for the example below) to dump cache and database state
    print("cache:", cache_data, cache_version)
    print("database:", meta_data_table, version_table)
Between the moment a write updates the metadata and version in the database and the moment the cache is invalidated, the cache and the database may already be inconsistent. The system has to handle this window intelligently, preserving consistency as much as possible while avoiding unnecessary performance overhead.
When an inconsistency between the cache and the database state is observed, the conventional operation is to restore consistency by invalidating the cache. However, there are cases where cache invalidation may fail. Although this situation is uncommon, when it does occur, it can lead to some perplexing problems.
For example, consider the following cache invalidation function:
def invalidate_cache(key, metadata, version):
    try:
        # Intentionally introduce an error to simulate an invalidation failure
        cache_data = cache_data[key][value]  ## To produce error
    except Exception:
        # When an exception occurs, fall back to dropping the cache entry
        drop_cache(key, version)
If the invalidation fails for any reason, the exception handler falls back to another function that forcibly removes the cache entry.
def drop_cache(key, version):
    # Get the version currently cached for this key
    cache_version_value = cache_version[key]
    # Only remove the entry if the supplied version is newer than the cached one
    if version > cache_version_value:
        cache_data.pop(key)
        cache_version.pop(key)
Now, let’s imagine a scenario where read and write threads are launched simultaneously:
read_thread = threading.Thread(target=read_value, args=("1",))
write_thread = threading.Thread(target=write_value, args=("1", 43))
print_thread = threading.Thread(target=print_values)

read_thread.start()
write_thread.start()
print_thread.start()
In this scenario, cache invalidation can fail under specific timing. If the read thread caches the old value before the write commits but picks up the new version number afterward (the version is filled two seconds later), drop_cache sees that the cached version is already the latest, concludes there is nothing to remove, and the stale metadata is retained in the cache indefinitely.
The above example is just a simplified version. In reality, bugs causing cache inconsistency can be much more complex, involving database replication and cross-regional communication. These bugs typically require a series of specific interacting steps and particular timing to trigger, making them relatively rare occurrences.
Behind these interleaved operations lies tightly coupled error-handling logic whose job is to keep the system consistent. If you are the engineer responsible for these issues, you will typically learn about a cache inconsistency from a Polaris report, and the first task is to check the logs to determine where the problem originated.
While it is impractical to record every change to cached data, we can log the critical operations that might cause changes. With that in place, when a data item becomes inconsistent, the on-call engineer can check whether the cache server received the corresponding invalidation event, whether the server processed that event correctly, and whether the item subsequently ended up inconsistent.
To address these challenges, Meta built a stateful tracing library. It records and tracks cache mutations within the small window of time, depicted as a purple window in Meta's diagrams, where complex interleavings of operations can trigger the bugs that lead to cache inconsistency.
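As a rough illustration only (this is not Meta's library; the record fields and the trace_event name are assumptions), such a tracing library might append a structured record for each invalidation-related event, so that the on-call engineer can answer the three questions above directly from the log:

import json
import time

trace_log = []  # in a real system this would be persisted, not kept in memory

def trace_event(kind, key, version, ok, detail=""):
    """Append one structured trace record for a cache-mutation event."""
    trace_log.append({
        "ts": time.time(),
        "kind": kind,          # e.g. "invalidation_received", "invalidation_applied"
        "key": key,
        "version": version,
        "ok": ok,              # whether this step succeeded
        "detail": detail,
    })

# Example: the log shows the event was received but failed to apply.
trace_event("invalidation_received", "x", 4, ok=True)
trace_event("invalidation_applied", "x", 4, ok=False,
            detail="drop_cache skipped: version not newer")
print(json.dumps(trace_log, indent=2))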
In managing any distributed system, a reliable monitoring and logging setup is key: it catches errors in real time and makes it possible to trace problems back to their root cause quickly, so they can be mitigated and resolved. At Meta, Polaris raises timely alerts when it detects an abnormal state, and the consistency tracing mechanism enables on-call engineers to pinpoint the root cause within roughly 30 minutes.
This kind of efficient monitoring and tracking system is crucial for ensuring the stable operation and high efficiency of a system. With these technologies, companies can respond quickly to various emergencies, ensuring service quality and customer satisfaction.
Whether you run a major cloud platform or a small individual service, Meta's experience shows the value of monitoring and log management as an effective way to improve service quality and respond to errors faster.