Since its launch, Apache Kafka has earned a strong reputation in stream processing for its excellent design and implementation. Throughout the evolution of modern stream processing architectures, Kafka’s innovative distributed log abstraction has driven revolutionary progress in real-time data stream processing and analysis. Kafka’s success stems largely from its ability to meet the widespread need for high-throughput, low-latency data processing across businesses of all sizes. Over time, Kafka’s ecosystem has grown increasingly rich and has become a de facto industry standard.
Nevertheless, with the rapid rise of cloud computing and cloud-native technologies, Kafka has encountered a series of new challenges. Its traditional storage model struggles to satisfy users’ increasingly urgent demands for cost optimization and elasticity in the cloud. Solutions such as tiered storage have been proposed to address these issues, aiming to reduce costs and extend data retention by spreading storage across different media tiers. In practice, however, they have not delivered the expected results and have instead increased system complexity and maintenance costs.
With the continuous improvement of cloud services, we believe shared storage is the right answer to Kafka’s long-standing pain points. Our innovative shared storage architecture brings Kafka a storage solution superior to both tiered storage and Freight Clusters. By separating storage from compute, we rebuilt Kafka’s storage layer into a shared streaming storage repository while reusing all of its compute code, ensuring full compatibility with the Kafka API protocol and its ecosystem. By grounding Kafka’s storage layer on object storage, we have significantly improved Kafka in both technology and cost, thoroughly resolving its long-standing cost and elasticity issues.
The rest of this article walks through the evolution of Kafka’s storage architecture and the reasoning behind our shared storage architecture, with the aim of offering readers more inspiration for the design of modern streaming storage systems.
The Shared Nothing Architecture on Local Disks
In its earlier days, Apache Kafka, as a mainstream streaming platform, adopted a Shared Nothing architecture centered on local disks. As cloud technology has evolved, however, this traditional storage framework has begun to fall short of the new era’s requirements. In a Shared Nothing architecture, the system relies on no shared resources: each node has its own storage and compute and can complete tasks independently. Although this design provides good scalability and fault tolerance in certain scenarios, its limitations in a cloud environment are increasingly evident:
- Storage costs: Local disk storage is expensive, and the use of multiple data replicas makes the cost issue even more prominent.
Consider a Kafka cluster backed by EBS GP3 volumes. If Kafka is configured with three replicas to ensure data reliability, the storage cost is triple the base price, jumping from $0.08/GiB/month to $0.24/GiB/month. To leave enough headroom for data growth and potential disaster recovery, at least 50% of storage space is usually reserved, which doubles the overall cost again to $0.48/GiB/month. Such costs are uneconomical for systems that need to retain large amounts of data over the long term. The problem is especially evident when Kafka is used to store historical data for later replay.
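As a quick sanity check, the effective per-GiB price can be derived from the figures above (three replicas on GP3 at $0.08/GiB-month, with a 50% utilization target):

$$ \text{effective cost} = \frac{\$0.08 \times 3\ \text{replicas}}{0.5\ \text{utilization}} = \$0.48\ \text{per GiB per month} $$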
- Operational complexity: Although the Shared Nothing architecture brings Kafka the benefits of distributed processing, it introduces a series of complexities at the operational level. Under this architecture, each node must independently manage its own storage, tightly coupling the compute nodes (Brokers) with local storage. As a result, horizontally scaling a Kafka cluster requires migrating partition data between Brokers, which is time-consuming and resource-intensive: it consumes a great deal of network bandwidth and disk I/O, interfering with the cluster’s normal reads and writes. This process can last for hours or even days, severely impacting the availability of the Kafka cluster.
- Performance bottlenecks: Kafka clusters exhibit read bottlenecks, especially when handling large amounts of historical data, because local disk I/O throughput is limited. When the system must read historical data from disk while also processing real-time streams, the two workloads contend for resources. This not only slows response times but can also delay service, degrading overall data processing performance. For log analysis or data replay scenarios, the high latency of these cold reads directly affects the timeliness of analysis.
- Lack of elasticity: The Shared Nothing architecture limits a Kafka cluster’s scalability. Because Broker nodes are tightly coupled to local disks, clusters cannot respond quickly to dynamic load changes: they are slow to expand when traffic spikes and equally slow to shrink when load drops, leading to low resource utilization and wasted cost. Users cannot take advantage of the resource sharing and pay-as-you-go model of the public cloud and must instead reserve resources the traditional way, leaving idle capacity to go to waste. Moreover, even manual horizontal scaling remains a high-risk operation because partition data must be replicated. Since clusters cannot scale in or out rapidly, true pay-as-you-go is out of reach, and neither cloud Spot instances nor a genuinely Serverless architecture can be used effectively.
- Multi-AZ network traffic costs: In multi-availability-zone (AZ) architectures, cross-AZ data transfer billed by cloud providers can be surprisingly expensive; in some deployments it accounts for more than 80% of total spend. Take a Kafka cluster designed for multi-AZ disaster recovery with three replicas: replicating data across AZs generates a large amount of network I/O and, with it, considerable cost. Taking read/write traffic and cluster scaling into account, this local-disk-based Shared Nothing architecture is, in economic terms, a relatively inefficient way to deploy.
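To make the scale of the problem concrete, here is a rough estimate assuming the commonly quoted AWS cross-AZ transfer price of $0.01/GB in each direction ($0.02/GB per cross-AZ copy) and replicas spread across three AZs, so that every produced byte is copied to two remote AZs:

$$ \text{replication traffic cost} \approx 2\ \text{copies} \times \$0.02/\text{GB} = \$0.04\ \text{per GB produced} $$

At a sustained 100 MB/s of ingress, that is roughly 260 TB per month, or on the order of $10,000 per month for replication traffic alone, before any producer or consumer cross-AZ traffic is counted.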
Tiered Storage Can’t Save Kafka
Although the Apache Kafka community discussed the concept of Tiered Storage for many years, as of version 3.7 Kafka itself had yet to ship a mature, production-grade tiered storage implementation. Some Kafka vendors, such as Confluent and Aiven, have introduced their own tiered storage solutions. These solutions exploit the low cost of object storage, moving older data from local disks to object storage to reduce long-term storage costs. However, such attempts have not solved some of the core issues Kafka faces, such as:
- High cross-AZ network traffic costs: Even though tiered storage reduces the volume of data held locally, the coupling between Brokers and local disks remains unchanged, so these measures still fail to address the root cause.
- Stateful Brokers: Kafka tiered storage requires at least one log segment to remain on the local disk, which means Brokers are still stateful under this scheme and keep a certain amount of local data (see the configuration sketch after this list).
- Operational complexity: Even with tiered storage, Kafka operations remain complex, and the introduction of a new component such as object storage arguably increases that complexity. Data migration during horizontal scaling in or out remains complicated and error-prone.
- Higher infrastructure costs: Cloud disks still serve as the primary landing place for data and as a buffer against unpredictable workloads and peak-period availability requirements, forcing users to provision large amounts of local disk space and driving up infrastructure costs.
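To make this concrete, the sketch below shows how tiered storage is typically enabled in recent Kafka versions (KIP-405, available as early access since 3.6). It is illustrative only: the RemoteStorageManager plugin is vendor-specific and the class name shown here is a placeholder. The point to note is that the local retention settings keep recent segments on the local disk, so the Broker remains stateful.

```properties
# Broker configuration: turn on the tiered storage subsystem (KIP-405)
remote.log.storage.system.enable=true
# A vendor-specific RemoteStorageManager must be plugged in; this class name is a placeholder
remote.log.storage.manager.class.name=com.example.S3RemoteStorageManager

# Topic configuration: enable remote storage for a topic
remote.storage.enable=true
# Only segments older than the local retention window are offloaded to object storage;
# recent segments always stay on the local disk, so the Broker is still stateful
local.retention.ms=3600000
retention.ms=604800000
```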
Given the high cost of cloud disks, they significantly increase data storage expenses. Our analysis shows that tiered storage does alleviate the cost of storing historical data to some extent and reduces the volume of data moved during scaling in certain cases. However, it does not fundamentally solve Kafka’s challenges: the storage cost and elasticity problems inherent to the local-disk-based Shared Nothing architecture persist.
Even the seemingly attractive approach of writing data directly to S3 cannot comprehensively resolve Kafka’s issues. While this strategy improves elasticity and storage costs, it inevitably sacrifices latency. Stream processing systems are extremely sensitive to latency; in scenarios with strict real-time requirements such as financial trading, online gaming, and streaming media, latency is a hard, uncompromising metric. Accepting high latency would prevent stream processing systems from meeting critical business needs, directly contradicting Kafka’s core design goal of low latency.
Writing directly to object storage does make compute nodes stateless and eliminates hefty cross-AZ network costs, but it is clearly not the best form of a shared storage architecture. A write to S3 typically introduces latency of more than 600 milliseconds, which is unacceptable in many scenarios that require real-time performance. When designing systems, we must ensure that the most frequently used paths have the lowest latency and the best performance. For Kafka, reading hot data is the most frequent operation, so this path above all must have the lowest latency and best performance.
Today, many customers still choose Kafka over traditional message queues such as RabbitMQ, and they are extremely sensitive to latency. Even millisecond-level delays can cause unacceptable lag in message processing, affecting business processes and the end-user experience. Although writing directly to S3 can provide cost-efficient storage in some cases, it does not meet Kafka’s need for low latency on high-frequency data paths.
To keep Kafka competitive as a streaming platform, cost and performance must be balanced in the design, especially on high-frequency data paths, to preserve the user experience and the system’s efficiency. Finding a storage solution that reduces cost, improves Kafka’s elasticity, and still maintains low latency is therefore the central goal of our innovation for Kafka.
On the shared storage front, our project AutoMQ proposes an innovative shared storage architecture: a small amount of EBS is used as a Write-Ahead Log (WAL) in combination with S3.
This storage architecture gives users many of the benefits of writing directly to S3, such as excellent scalability, no cross-AZ data replication, and low cost, while ensuring that latency does not increase. As the architecture diagram above shows, the storage layer is implemented with care: once WAL data has been persisted via Direct I/O, consumers can immediately read the persisted stream data from the memory cache. The architecture keeps only recent WAL data on EBS, while historical data is read from S3, so only a small EBS volume is required (usually about 10 GB), and EBS accounts for a minimal share of overall storage cost. If a Broker node fails, EBS multi-attach allows the volume to be remounted on another node within milliseconds to recover the data in the WAL.
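The write path can be summarized with the following conceptual sketch. It is a simplification for illustration only, not AutoMQ’s actual code; the interfaces and names are hypothetical:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;

/** Conceptual write path: WAL on a small EBS volume first, S3 asynchronously, memory cache for hot reads. */
public class SharedStorageWriteSketch {

    interface Wal {                                           // small EBS volume written with Direct I/O
        CompletableFuture<Long> append(ByteBuffer record);    // completes once the record is durable
        void trim(long offset);                               // discard entries already uploaded to S3
    }

    interface ObjectStore {                                   // S3 or any compatible object storage
        CompletableFuture<Void> putObject(String key, ByteBuffer data);
    }

    interface LogCache {                                      // in-memory cache serving tail (hot) reads
        void put(long offset, ByteBuffer record);
    }

    private final Wal wal;
    private final ObjectStore objectStore;
    private final LogCache cache;

    SharedStorageWriteSketch(Wal wal, ObjectStore objectStore, LogCache cache) {
        this.wal = wal;
        this.objectStore = objectStore;
        this.cache = cache;
    }

    /** Acknowledge the producer as soon as the record is durable in the WAL (millisecond latency). */
    CompletableFuture<Long> produce(ByteBuffer record) {
        return wal.append(record.duplicate()).thenApply(offset -> {
            cache.put(offset, record);      // tail consumers read from memory right away
            uploadAsync(offset, record);    // the S3 upload stays off the latency-critical path
            return offset;
        });
    }

    private void uploadAsync(long offset, ByteBuffer record) {
        objectStore.putObject("segment-" + offset, record.duplicate())
                   .thenRun(() -> wal.trim(offset));  // the WAL only ever holds data not yet on S3
    }
}
```

The point this illustrates is that the only synchronous, latency-critical step is the WAL append on EBS; everything involving S3 happens asynchronously, which is why a WAL of only a few gigabytes suffices.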
EBS + S3 ≠ Tiered Storage
Those unfamiliar with our storage architecture can easily confuse this shared storage design with Tiered Storage. In a traditional tiered storage system, the Broker is essentially stateful and tightly coupled to its local disks. In our Shared Storage architecture, by contrast, the Broker is decoupled from EBS: when a compute node fails, EBS multi-attach and NVMe reservations allow failover and recovery to complete within milliseconds. AutoMQ’s Shared Storage architecture treats EBS as shared storage; when an EC2 instance fails, its EBS volume can be swiftly attached to another node through multi-attach and continue serving reads and writes without disruption. In this sense both EBS and S3 are shared storage, unlike traditional stateful local disks. Because EBS is fully decoupled from the Broker, Brokers in AutoMQ are stateless. EBS and S3 are storage services provided by the cloud, and by fully exploiting their capabilities we can share EBS volumes seamlessly among Brokers, forming our innovative Shared Storage architecture.
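A simplified view of that failover flow might look like the following. Again, this is a conceptual sketch under the assumptions of EBS multi-attach and NVMe reservations described above; the interfaces are illustrative, not AutoMQ’s API:

```java
/** Conceptual failover: a surviving Broker takes over a failed node's WAL volume. */
public class WalFailoverSketch {

    interface EbsVolume {
        void attach(String instanceId);   // multi-attach: mount the volume on the surviving node
        void fence();                     // NVMe reservation: prevent further writes from the failed node
    }

    interface WalRecovery {
        void replayToObjectStore(EbsVolume volume);  // upload any WAL data not yet on S3
    }

    void takeOver(EbsVolume orphanedWal, String survivorInstanceId, WalRecovery recovery) {
        orphanedWal.attach(survivorInstanceId);      // completes in milliseconds, no data copy
        orphanedWal.fence();                         // ensure the old owner can no longer write
        recovery.replayToObjectStore(orphanedWal);   // only a few GB of WAL data at most
        // After replay the volume can be released; partitions reopen on healthy Brokers.
    }
}
```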
EBS is a cloud service, not just a physical volume
We are fully committed to building the next generation of truly cloud-native streaming systems, ones that make the most of public cloud services. The key is that building a cloud-native system requires fully exploiting public cloud services, with their economies of scale and technical advantages. This requires a shift from a hardware-oriented software mindset to design thinking grounded in cloud services. As a cloud service, EBS is essentially distributed storage: it not only implements replication techniques similar to those in HDFS, BookKeeper, and other systems, but also benefits from the scale of the cloud, which lowers marginal costs. Public cloud providers have accordingly invested heavily in EBS, both in production and in R&D.
On the cloud storage side, Alibaba Cloud has invested and optimized consistently for more than a decade, accumulating over 15 million lines of C++ code along the way. Similarly, AWS has adopted Nitro cards that integrate software and hardware, even rewriting network protocols suited to local area network environments and offloading them to hardware, all in order to provide durable, reliable, low-latency cloud storage. Today, the view that cloud disks are less reliable than local disks or suffer from serious latency problems no longer holds. The cloud disk services offered by mainstream providers are very mature. By building on their vast infrastructure rather than starting from scratch, we can drive cloud-native software innovation far more effectively.
As for the high cost of EBS: one article compared the per-GB storage cost of S3 and EBS in a 3-replica Kafka cluster and found that EBS can be up to 24 times more expensive than S3. For users with large clusters and long data retention, EBS storage can account for a significant share of total cluster cost, and careless use of cloud storage media can make storage costs spike. It is important to note that EBS and S3 are designed for different read and write scenarios: EBS is optimized for low latency and high IOPS, while S3 suits cold data, high throughput, and latency-insensitive workloads. Weighing the characteristics of each cloud storage service and choosing the right one for the job maximizes cost-effectiveness while preserving performance and availability.

Our Shared Storage design follows the principle that “high-frequency access paths require the lowest latency and best performance,” and it combines the storage characteristics of EBS and S3 to offer low latency, high throughput, and effectively unlimited streaming storage at low cost. In Kafka, freshly written hot data is promptly read by consumers; by using EBS’s durability, low latency, and attachable, virtualized nature, together with techniques such as Direct I/O, data is persisted with minimal latency, and once persisted, consumers can read it straight from the memory cache, giving the high-frequency read and write path millisecond latency. Because EBS serves only as a recovery WAL, we need only about 5-10 GB of EBS capacity; a 10 GB AWS GP3 volume costs roughly $0.8 per month (30 days). This makes full use of EBS’s strengths while eliminating the latency problem of writing directly to S3.
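For reference, the arithmetic is simple (using the GP3 list price of $0.08/GiB-month; GP3 also includes a 3,000 IOPS and 125 MB/s baseline at no extra charge):

$$ 10\ \text{GiB} \times \$0.08/\text{GiB·month} \approx \$0.80\ \text{per month} $$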
One remaining issue is that AWS EBS cannot on its own provide disaster recovery (DR) across availability zones. AWS is somewhat unusual here: unlike several other cloud providers, the EBS it offers has no region-level variant.
Several major cloud providers, including Azure, GCP, and Alibaba Cloud (which was expected to launch the service in June 2024), offer Regional EBS. On these platforms, the Shared Storage solution can connect directly to Regional EBS, gaining availability-zone-level disaster recovery and making full use of the technical strengths of cloud storage services.
On AWS, availability-zone-level disaster recovery can be achieved by duplicating data onto EBS volumes in different availability zones, or by using S3 Express One Zone instead. Thus, as cloud computing and cloud-native thinking continue to evolve, Kafka faces challenges but its position remains irreplaceable, and it will continue to grow and develop. The Shared Storage architecture will undoubtedly inject new vitality into the Kafka ecosystem and help it truly step into the cloud-native era.