Honeycomb Architecture: A Highly Available Cloud Architecture

2024-05-14 10:41:22

Delving into Honeycomb Architecture: Enhancing the Availability of Multi-Tenant Applications

Overview of Honeycomb Architecture
Honeycomb (cell-based) architecture is a design strategy that improves the availability of multi-tenant applications by dividing the system into compartmentalized, independently operating components, or “cells”. Each isolated cell is a standalone, self-sufficient application instance that can operate normally without any other cell, handling its own share of traffic independently. If a problem occurs in a particular cell, the impact is limited to the users served by that cell and does not spill over to others, which shrinks the potential blast radius of a fault and keeps the service level agreement (SLA) intact for the majority of users. Cells can be organized, and traffic routed to them, in a variety of ways depending on strategy and needs.

Key Points in Automating Honeycomb Architecture
The critical issues that must be solved in the process of automating honeycomb architecture include:

  • Isolation: How to ensure clear boundaries are maintained between cells?
  • New Cells: How to continuously and efficiently create new cells and put them into operation?
  • Deployment: How to ensure that each cell receives the latest code updates?
  • Permissions: How to ensure the security of cells and manage their network access rights?
  • Monitoring: How can operators quickly understand the health status of all cells and identify those affected by faults?

There is no shortage of tools and methods to address these issues; this article discusses the solutions adopted at Momento. Before delving into the specific issues, however, it is crucial to standardize certain processes. By standardizing parts of the build, test, and deployment stages, we can establish common automated processes and reuse infrastructure code across all components in every cell. To be clear, “standardization” here is not the same as “homogenization”: modern cloud applications are often composed of various microservices running on different frameworks and platforms such as Kubernetes, AWS Lambda, and EC2. Even in such a diverse environment, we can automate what components have in common by standardizing specific segments of their lifecycle.

Standardization – Deployment Templates

In software development, successfully deploying code changes to a production environment involves a series of standard steps. These steps commonly include:

  • Developers submit code to the version control system.
  • Utilizing the latest modifications, software components are built, which may be in the form of Docker images, JAR files, ZIP files, or other formats.
  • Components are released to the corresponding repositories, such as Docker images being pushed to Docker repositories, JAR files uploaded to Maven repositories, ZIP files stored in cloud services, etc.
  • Components are deployed to the production environment, typically one cell at a time.

For any single component of an application, this deployment process can be viewed as a template.

The honeycomb architecture aims to minimize the blast radius of system failures, which often occur shortly after a deployment. To reduce this risk, we add safeguards to the deployment process: incorporating a “staging” phase and introducing a “baking” period between deployments to successive cells is a very good practice. During the baking period, monitoring metrics and alerts can surface anomalies, and the deployment is paused until the issues are resolved.
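As a simplified illustration of the idea (the tooling Momento actually uses for this is described below), a bake-period gate can be reduced to checking whether any of a cell's CloudWatch alarms is currently firing; the function and parameter names here are hypothetical.

```typescript
// Hypothetical bake-period gate: block promotion to the next cell while
// any of the cell's CloudWatch alarms is firing. Illustrative only.
import { CloudWatchClient, DescribeAlarmsCommand } from "@aws-sdk/client-cloudwatch";

export async function cellIsHealthy(region: string, alarmNames: string[]): Promise<boolean> {
  const cw = new CloudWatchClient({ region });
  const resp = await cw.send(
    new DescribeAlarmsCommand({ AlarmNames: alarmNames, StateValue: "ALARM" })
  );
  // Healthy only if none of the watched alarms is currently in ALARM state.
  return (resp.MetricAlarms ?? []).length === 0;
}

// During the baking period, a scheduler (for example a wait/poll loop) would call
// cellIsHealthy() repeatedly and only continue the rollout once the full bake
// window has elapsed without any alarm firing.
```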

The goal is to generalize the automation process, so that any component of an application can easily go through this series of deployment steps, regardless of the underlying technology. Currently, many tools are capable of automating the steps mentioned above, hence we can utilize specific automation tools, choosing the most suitable ones according to different needs and environments.

Momento, like many companies, deploys most of its infrastructure on AWS, so we tend to use the tools AWS provides to run deployment jobs. For components that run on EC2 and are deployed through CloudFormation, we use:

  • AWS CodePipeline to define and implement deployment stages;
  • AWS CodeBuild to execute the build process;
  • AWS Elastic Container Registry to release new Docker images;
  • AWS CloudFormation to deploy the new version to various units;
  • AWS Step Functions to monitor alerts during the “baking” stage and decide whether to safely deploy the changes to the next unit.

As for the components based on Kubernetes, we can make appropriate adjustments to achieve similar deployment steps. For example, using AWS Lambda to invoke the Kubernetes API to deploy new images to each unit.

When implementing the software deployment phase, whether a component runs on Kubernetes or another platform, we can define a single, unified template for deploying changes despite the diverse technology stacks in use. With this template and a consistent toolchain to implement it, we only need small adjustments where necessary. Standardizing the build lifecycle in this way lets us build the automated steps generically and reuse a large amount of infrastructure code, keeping the deployment process consistent and recognizable across components.

How do we standardize build targets? An effective approach is to define a set of standardized build targets and use them consistently across components. At Momento, for instance, we adopted a time-tested technique: Makefiles. They are simple to understand, have a long history, and are highly efficient. By defining the same build targets, services written in different languages, such as Kotlin and Rust, share a unified approach even though the underlying build commands differ. The “pipeline-build” target in each Makefile standardizes the build steps.

We also defined a “unit bootstrapping” target, so that the target name is the same whether a component is being deployed to an AWS unit or a GCP unit. Because every component exposes it consistently, the rest of the infrastructure can rely on it during deployment.

Another tool for standardization is the “unit registry”, which is a mechanism that provides a directory and basic metadata of all units. At Momento, we chose to use TypeScript to build the unit registry, using approximately a hundred lines of code to define simple interfaces, making it easy for us to express all relevant data of each unit.

The CellConfiguration interface is very crucial as it contains all the key information required for a given unit, such as whether the unit is for production or development, its region, the DNS name of endpoints within the unit, and whether it’s AWS or GCP. In addition, the MomentoOrg interface contains an array of CellConfiguration. Utilizing the models provided by these interfaces, we can write more TypeScript code to instantiate them and obtain specific data for each cell.

Our “alpha” cell, for example, is described by a complete set of data, including the unit name, account ID, region, DNS configuration, and other key information. When we want to create a new unit, we simply add its entry to the array in the unit registry, which keeps the process concise and clear.
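The article does not show the registry code itself, so the following is only a minimal sketch of what the CellConfiguration and MomentoOrg interfaces and an “alpha” entry might look like; all field names and values are illustrative assumptions based on the description above.

```typescript
// Illustrative sketch of a cell registry model; field names are assumptions.
export type CloudProvider = "aws" | "gcp";

export interface CellConfiguration {
  name: string;              // e.g. "alpha"
  production: boolean;       // production vs. development cell
  cloud: CloudProvider;      // AWS or GCP
  accountId: string;         // cloud account that hosts the cell
  region: string;            // e.g. "us-west-2"
  dnsName: string;           // public endpoint for this cell
}

export interface MomentoOrg {
  cells: CellConfiguration[];
}

// Example instantiation for a hypothetical "alpha" cell.
export const org: MomentoOrg = {
  cells: [
    {
      name: "alpha",
      production: false,
      cloud: "aws",
      accountId: "111111111111",
      region: "us-west-2",
      dnsName: "alpha.cell.example.com",
    },
  ],
};
```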

The next important step is to ensure that this data about the cells can be accessed by other parts of the infrastructure. In practice, sometimes it is necessary to store the data in a queryable database for more complex operations. However, for our needs, it suffices to store the data in JSON format in the S3 cloud storage service.

To access and use this data, we developed a small TypeScript library whose job is to retrieve the data from S3 and convert it into TypeScript objects. We published the library to a private npm repository so it can be reused across our infrastructure code. This lets us build common patterns into our infrastructure automation and perform the same automated configuration for each unit.
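A minimal sketch of such a library, assuming the registry is stored as a single JSON object in S3 (the bucket name, key, and module paths are invented):

```typescript
// Minimal sketch of a registry-loading library (bucket/key are hypothetical).
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import type { MomentoOrg } from "./cell-registry-model"; // interfaces from the previous sketch

const BUCKET = "example-cell-registry";
const KEY = "cells.json";

export async function loadCellRegistry(region = "us-west-2"): Promise<MomentoOrg> {
  const s3 = new S3Client({ region });
  const resp = await s3.send(new GetObjectCommand({ Bucket: BUCKET, Key: KEY }));
  // In SDK v3 the response body is a stream; transformToString() reads it fully.
  const json = await resp.Body!.transformToString();
  return JSON.parse(json) as MomentoOrg;
}
```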

Standardization – Unit Bootstrapping Script

To achieve automation with consistency and convenience, we have introduced the “Unit Bootstrapping Script.” This script greatly simplifies the process of deploying application components to a new unit and ensures consistent operations across different units. For example, if your application components are spread across different git repositories, by using the bootstrapping script, enabling a new unit can be completed in the following steps:

  1. Utilize the unit registry to obtain metadata, such as AWS account ID and DNS configurations, among other information.
  2. For each application component, perform the following steps:
    • Clone the corresponding git repository.
    • Execute the standardized cell-bootstrap target in the Makefile.

With only about five lines of code, this bootstrapping script provides a generic and extensible way to deploy a new application unit. Even as the application gains new components, the script remains applicable and keeps the deployment process simple and consistent.
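In spirit, the script is just a loop over component repositories. The following TypeScript rendering of that idea is a sketch, not Momento's actual script; the repository list, environment variable names, and helper imports are invented.

```typescript
// Illustrative cell-bootstrapping script: clone each component repo and run
// its standardized Makefile target. Repo list and env variable names are made up.
import { execSync } from "node:child_process";
import { loadCellRegistry } from "./cell-registry-client"; // sketch from earlier

const COMPONENT_REPOS = [
  "git@github.com:example-org/service-a.git",
  "git@github.com:example-org/service-b.git",
];

async function bootstrapCell(cellName: string): Promise<void> {
  const org = await loadCellRegistry();
  const cell = org.cells.find((c) => c.name === cellName);
  if (!cell) throw new Error(`unknown cell: ${cellName}`);

  for (const repo of COMPONENT_REPOS) {
    const dir = repo.split("/").pop()!.replace(/\.git$/, "");
    execSync(`git clone ${repo} ${dir}`, { stdio: "inherit" });
    // Every component exposes the same target, so the script never needs
    // component-specific logic; cell metadata is passed via the environment.
    execSync("make cell-bootstrap", {
      cwd: dir,
      stdio: "inherit",
      env: {
        ...process.env,
        CELL_ACCOUNT_ID: cell.accountId,
        CELL_REGION: cell.region,
        CELL_DNS_NAME: cell.dnsName,
      },
    });
  }
}

bootstrapCell(process.argv[2]).catch((err) => {
  console.error(err);
  process.exit(1);
});
```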

In reviewing and addressing the challenges of infrastructure automation, we have defined a standardized way to organize unit information and generalized the lifecycle tasks for application components.

In the AWS environment, the most direct way to ensure isolation between units is to create a separate AWS account for each unit. Initially, this may appear complex, but with the maturing of AWS tools, the process is now much simpler than before. An independent AWS account not only achieves default isolation from other units but also allows you to set complex cross-account IAM policies for the interactions among the different units.

If you instead deploy multiple cells into a single AWS account, you must write meticulous IAM policies to prevent improper interactions between units. Managing IAM policies is undoubtedly one of the most challenging aspects of working with AWS, and avoiding these configurations as much as possible saves precious time and reduces complexity. A further advantage of the multi-account strategy is that accounts can be brought together under AWS Organizations, and per-unit cost analysis in AWS Cost Explorer is straightforward because each unit has its own account. Conversely, if you deploy via a single account, you need to meticulously tag each unit's resources to trace each unit's spending.

Adopting a cell-based architecture also raises routing questions. If each isolated unit runs a replica of the application, appropriate strategies must be in place to ensure that user requests are routed to the correct unit. For users who interact with the application via an SDK or other client software, a simple solution is to assign a unique DNS name to each unit, and this is exactly the approach we adopted at Momento. When we create authentication tokens for users, the DNS name of the target unit is included as a claim within the token, allowing our client libraries to direct traffic correctly based on that information.
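As a simplified illustration (the actual claim name in Momento's tokens is not given here, so `endpoint` is an assumption), a client could read the target cell's DNS name straight out of a JWT-style token payload before opening a connection:

```typescript
// Hypothetical example: extract the cell endpoint baked into an auth token.
// Assumes a JWT-style token whose payload carries an "endpoint" claim.
export function cellEndpointFromToken(token: string): string {
  const payloadB64 = token.split(".")[1];
  const payload = JSON.parse(Buffer.from(payloadB64, "base64url").toString("utf8"));
  return payload.endpoint; // e.g. "alpha.cell.example.com"
}

// Usage sketch:
// const endpoint = cellEndpointFromToken(authToken);
// client.connect(`https://${endpoint}`);
```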

But when users interact with the service through a web browser, the situation is different. In that case a public DNS name must be provided so the service can be reached from the browser, which leads us to create a lightweight routing layer. The routing layer's design should be as simple as possible: it only needs enough logic to identify the user, determine the appropriate unit from information in the request, and proxy or redirect the request accordingly. Although this setup gives users a smoother experience, since they never need to know about cells, it also means maintaining and monitoring a global component that is a potential single point of failure, exactly the kind of risk a cell-based architecture is meant to avoid. That is why it is essential to keep the routing layer as small and simple as possible.
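A minimal sketch of such a routing layer, assuming users can be mapped to cells with a simple lookup (the header, lookup table, and redirect behavior are illustrative, not Momento's implementation):

```typescript
// Illustrative routing layer: identify the user, look up their cell,
// and redirect the browser to that cell's endpoint.
import http from "node:http";

// In practice this mapping would come from the cell registry or a user database.
const userToCellDns: Record<string, string> = {
  "user-123": "alpha.cell.example.com",
  "user-456": "bravo.cell.example.com",
};

const server = http.createServer((req, res) => {
  const userId = req.headers["x-user-id"]; // however the user is identified
  const cellDns = typeof userId === "string" ? userToCellDns[userId] : undefined;
  if (!cellDns) {
    res.writeHead(404).end("unknown user");
    return;
  }
  // Redirect to the user's cell (a proxy would forward the request instead).
  res.writeHead(307, { Location: `https://${cellDns}${req.url ?? "/"}` }).end();
});

server.listen(8080);
```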

Another advantage of such a routing layer is that users can be migrated transparently from one unit to another without their knowledge. For example, if a user needs a larger or less heavily loaded unit, a new unit can be provisioned for them, and by rolling out a series of routing configuration changes, the user can be moved to the new unit seamlessly. As long as the new unit follows the standardized setup described earlier, most of the related work is already in place. The problem then becomes how to create new units efficiently. The process usually includes the following steps: open a brand-new AWS account within the Organization; add that account to the unit registry; and run the unit bootstrapping script to build and deploy all necessary components.

Because we had standardized each component's build lifecycle steps in Makefiles, rolling out a new unit was easy; thanks to the generic deployment logic, starting the new unit required almost no extra effort. Deployment is one of the most challenging problems in application architecture, especially in a cell-based architecture. Fortunately, with the rapid development of infrastructure-as-code (IaC) tools in recent years, these challenges have become easier to address.

In the IaC domain, most tools have traditionally used declarative configuration syntax, such as YAML or JSON, to describe the resources to be created. A newer trend, however, lets developers define infrastructure in real programming languages they already know: instead of wrangling tedious, lengthy configuration files, they describe infrastructure components in a familiar language. Here are some examples of tools that adopt this approach:

  • AWS CDK (Cloud Development Kit) – A tool for deploying CloudFormation infrastructure.
  • AWS cdk8s – A tool for deploying Kubernetes infrastructure.
  • CDKTF (CDK for Terraform) – A tool for deploying infrastructure through HashiCorp Terraform.

These tools allow us to reduce a significant amount of YAML/JSON boilerplate configuration through programming constructs, such as for loops.

Another advantage of expressing infrastructure definitions in a programming language is that we can use npm libraries as dependencies, so our IaC projects can depend on the unit registry library and access the array of all unit metadata. By iterating over this array, we can define the infrastructure required for each unit. When new units are added or the unit registry is updated, the infrastructure configuration is updated automatically.
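For example, a CDK app can take the registry in as an npm dependency and stamp out one stack per cell; the package name `@example/cell-registry` and the stack contents below are placeholders:

```typescript
// CDK app that derives its stacks from the cell registry (names are placeholders).
import { App, Stack, StackProps } from "aws-cdk-lib";
import { Construct } from "constructs";
import { org } from "@example/cell-registry"; // hypothetical npm package exposing cell metadata

class CellStack extends Stack {
  constructor(scope: Construct, id: string, props: StackProps) {
    super(scope, id, props);
    // ...define the per-cell resources for this component here...
  }
}

const app = new App();
for (const cell of org.cells) {
  // One stack per cell, deployed into that cell's own account and region.
  new CellStack(app, `my-component-${cell.name}`, {
    env: { account: cell.accountId, region: cell.region },
  });
}
```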

Combining the powerful features of AWS CDK and AWS CodePipeline, we can define a common pipeline pattern for each application component, ensuring that the necessary build and deployment steps are configured for every component while sharing most of the code. At Momento, we have written TypeScript CDK code for the various stages that may need to be added to an AWS CodePipeline, such as building projects, pushing Docker images, deploying CloudFormation stacks, and deploying new images to Kubernetes clusters. We place these stages in an array and add them to each pipeline by iterating over it.
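A sketch of that pattern using CDK Pipelines, where the per-cell deployment stages are added to every component's pipeline in the same order; the source connection, stage contents, and registry package are assumptions:

```typescript
// Sketch: a reusable pipeline that deploys a component to every cell in order.
// Call this from within the pipeline's own Stack.
import { Stage, StageProps } from "aws-cdk-lib";
import { CodePipeline, ShellStep, CodePipelineSource } from "aws-cdk-lib/pipelines";
import { Construct } from "constructs";
import { org } from "@example/cell-registry"; // hypothetical registry package

class CellDeploymentStage extends Stage {
  constructor(scope: Construct, id: string, props: StageProps) {
    super(scope, id, props);
    // ...instantiate this component's stack(s) for the target cell...
  }
}

export function buildComponentPipeline(scope: Construct, componentName: string): void {
  const pipeline = new CodePipeline(scope, `${componentName}-pipeline`, {
    synth: new ShellStep("Synth", {
      input: CodePipelineSource.connection(`example-org/${componentName}`, "main", {
        connectionArn:
          "arn:aws:codestar-connections:us-west-2:111111111111:connection/placeholder",
      }),
      commands: ["make pipeline-build", "npx cdk synth"],
    }),
  });

  // Deploy to pre-production cells before production cells.
  const orderedCells = [...org.cells].sort(
    (a, b) => Number(a.production) - Number(b.production)
  );
  for (const cell of orderedCells) {
    pipeline.addStage(
      new CellDeploymentStage(scope, `${componentName}-${cell.name}`, {
        env: { account: cell.accountId, region: cell.region },
      })
    );
    // A bake step (such as the alarm check sketched earlier) could be attached
    // here as a post-deployment step before moving on to the next cell.
  }
}
```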

On top of this sits a “meta” pipeline, a pipeline of pipelines, responsible for creating the individual pipeline for each application component. Its repository serves as the single source of truth for all of our deployment logic: whenever developers need to change the deployment infrastructure, they do so here. Any change we make to the list of deployment steps (for example, changing the order of units or using more complex “baking” steps) is automatically reflected in all component pipelines.

When a new unit is added, the pipeline of pipelines runs and updates all component pipelines, adding the new unit to the list of deployment steps. To protect availability, we carefully consider the order in which production units receive deployments. Units are grouped by size, importance, and traffic level. In the first stage, we deploy to pre-production units, where changes are tested before being pushed to production units.

If these deployments go smoothly, we gradually deploy to increasingly larger production units. This phased deployment approach makes change deployments controllable and increases the likelihood of capturing issues before they affect more customers.

To manage access to units, we rely primarily on AWS SSO (now IAM Identity Center). This service gives us a single sign-on page where all developers can log in with their Google identities and then access the AWS consoles they are authorized to use. It also provides access to target accounts via the command line and the AWS SDK, which makes automation tasks easy.

The management interface provides granular control over user access within each account. For example, roles like “Read-Only” and “Unit Operator” are defined within the unit accounts, granting different levels of permissions. By combining the role-mapping capabilities of AWS SSO with CDK and our unit registry, we can fully automate the inbound and outbound permissions for each unit account.

For inbound permissions, we can iterate through all developers and unit accounts in the registry and grant the appropriate roles using the CDK. When a new account is added to the unit registry, the automation mechanism automatically sets up the correct permissions.
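With the CfnAssignment construct from the CDK's SSO module, that loop is straightforward; the instance and permission-set ARNs and the developer list below are placeholders:

```typescript
// Sketch: grant every developer a role in every cell account via IAM Identity Center.
import { Stack } from "aws-cdk-lib";
import { CfnAssignment } from "aws-cdk-lib/aws-sso";
import { org } from "@example/cell-registry"; // hypothetical registry package

const SSO_INSTANCE_ARN = "arn:aws:sso:::instance/ssoins-placeholder";
const READ_ONLY_PERMISSION_SET_ARN =
  "arn:aws:sso:::permissionSet/ssoins-placeholder/ps-placeholder";
const DEVELOPER_PRINCIPAL_IDS = ["user-guid-1", "user-guid-2"]; // identity-store user IDs

export function grantInboundAccess(stack: Stack): void {
  for (const cell of org.cells) {
    for (const principalId of DEVELOPER_PRINCIPAL_IDS) {
      new CfnAssignment(stack, `ro-${cell.name}-${principalId}`, {
        instanceArn: SSO_INSTANCE_ARN,
        permissionSetArn: READ_ONLY_PERMISSION_SET_ARN,
        principalId,
        principalType: "USER",
        targetId: cell.accountId, // the cell's AWS account
        targetType: "AWS_ACCOUNT",
      });
    }
  }
}
```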

For outbound permissions, we loop through each unit in the registry and grant access to resources, such as ECR images or private VPCs, as needed.

Monitoring a large number of units can be difficult. The key is a monitoring approach that lets operations personnel assess the health of services in all units from a single view; expecting them to review metrics in each unit account separately is not a scalable solution.

To solve this problem, we need a centralized metrics solution into which all unit accounts can export their metrics, and it must support grouping metrics by dimensions such as the unit name. Many metrics solutions provide this capability; CloudWatch, for example, can aggregate metrics from multiple accounts into a central monitoring account.
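For example, assuming CloudWatch cross-account observability is enabled between the cell accounts and a central monitoring account, a CDK-defined dashboard there can pull the same metric from every cell and plot one series per cell (the namespace, metric, and dimension names are invented):

```typescript
// Sketch: a single dashboard in the monitoring account with one series per cell.
// Assumes CloudWatch cross-account observability is already enabled between accounts.
import { Stack } from "aws-cdk-lib";
import { Dashboard, GraphWidget, Metric } from "aws-cdk-lib/aws-cloudwatch";
import { org } from "@example/cell-registry"; // hypothetical registry package

export function addCellHealthDashboard(stack: Stack): void {
  const widget = new GraphWidget({ title: "Request error rate by cell" });
  for (const cell of org.cells) {
    widget.addLeftMetric(
      new Metric({
        namespace: "ExampleService",        // invented namespace
        metricName: "ErrorRate",            // invented metric
        dimensionsMap: { Cell: cell.name }, // grouping dimension
        account: cell.accountId,            // read from the cell's account
        region: cell.region,
        statistic: "avg",
      })
    );
  }
  new Dashboard(stack, "CellHealthDashboard", { widgets: [[widget]] });
}
```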

In the fields of availability and cloud architecture, cell-based architecture has received widespread attention for its strong high-availability characteristics. By replicating services across independent units, this approach improves the durability and resilience of the overall system and helps ensure that service level agreements (SLAs) are met. Automating the cell-based architecture is key to improving business agility and engineering velocity.

With the help of modern infrastructure and tools, we can automate the deployment and management of cell-based architecture. The automation tools and techniques introduced in this article, such as a series of AWS tools, are not the only options; other cloud service providers like GCP and Azure also offer corresponding solutions. In addition, third-party tools such as Datadog, New Relic, LightStep, and Chronosphere are available on the market to provide support.

Automation grants us the ability to rapidly expand new units. With standardized infrastructure and unit bootstrapping scripts, deploying a new unit from scratch can take just a few hours, greatly improving efficiency. For startups and small businesses, this means the ability to respond to user demands in a short time frame, which could be key to securing important deals.

Furthermore, developers can create their own units within their personal accounts to facilitate testing and debugging of complex features that rely on interactions between multiple services or components. Compared to shared development environments, personal units reduce conflicts and interruptions during development, enhancing productivity and allowing developers to focus on their tasks.

It is important to recognize that no universal solution can apply to every situation. Different businesses and environments require different tools and levels of automation. The development in the field of Infrastructure as Code (IaC) has made automation easier to implement. Taking advantage of these opportunities to standardize component definitions is a critical step.
