What physical limits must data systems still challenge on the path to AGI?

2024-05-14 10:42:15

The Logical Starting Point of Distributed Data Warehouses and Business Requirements

Distributed data warehouses are a response to the three core drivers of modern business requirements: responsiveness, accuracy, and real-time capability. They are not inventive creations but rather a profound insight into and strategic response to the limitations of existing data systems. This discussion focuses on how to explore and develop the physical limits of data systems through distributed data warehouses.

The Challenges Faced by Modern Data Systems

Imagine a bed-and-breakfast booking application: property owners list their properties on a platform, while consumers search for and book the houses they prefer on that same platform. Users complete every related step on the platform, from booking and check-in to writing reviews. These seemingly simple processes require strong support from the data system.

In the early stages of the application, a simple relational database (such as MySQL or PostgreSQL) could meet the task requirements. However, as the user base grows rapidly and the volume of data and query pressure soar, a single database cannot bear this burden and encounters severe performance limitations.
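
As a minimal sketch of this early stage (with an invented table layout, shown on SQLite purely so it runs standalone), the data and the point lookup behind a listing's detail page might look like this:

```python
import sqlite3

# Illustrative schema for the early, single-database stage of the booking app.
# Table and column names are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE listing (
        id          INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city        TEXT NOT NULL,
        rooms       INTEGER,
        price_cents INTEGER
    )
""")
conn.execute(
    "INSERT INTO listing VALUES (?, ?, ?, ?, ?)",
    (1, "Lakeside Cottage", "Hangzhou", 3, 42000),
)

# The "simple query" case: fetch one listing's detail-page data by primary key.
print(conn.execute("SELECT * FROM listing WHERE id = ?", (1,)).fetchone())
```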

Diverse Solutions to Data Challenges

To overcome these challenges, non-relational databases like MongoDB were introduced to achieve horizontal scaling and handle ever-growing data volumes and query load. Further user growth brought new requirements: users wanted to search for accommodations using keywords rather than exact names, and conventional data processing methods could not handle such search queries quickly and effectively.

Search engines such as Elasticsearch provided a solution, as they can efficiently handle complex keyword search tasks. However, to make the search engine useful, a data synchronization link must be established to feed data into it. Data synchronization methods generally fall into two types: periodic full-volume synchronization, and incremental synchronization for real-time requirements, which usually relies on tools like Kafka or Flink.
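
A rough sketch of the incremental path follows; it assumes a Kafka CDC topic named `listing-changes`, uses the kafka-python client, and stands in a placeholder `index_document` function for the real search-engine client, so it is an illustration rather than a production pipeline:

```python
import json

from kafka import KafkaConsumer  # kafka-python client; a running Kafka broker is assumed

def index_document(doc: dict) -> None:
    """Placeholder for pushing one changed listing into the search engine."""
    print("indexing listing", doc.get("id"))

# Consume change events from the (assumed) CDC topic and apply them to the search index.
consumer = KafkaConsumer(
    "listing-changes",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:              # blocks and processes changes as they arrive
    change = message.value            # e.g. {"op": "upsert", "id": 1, "name": "..."}
    if change.get("op") in ("insert", "update", "upsert"):
        index_document(change)
```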

Exploration and Significance of Distributed Data Warehouses

Only after all data is synchronized to the search engine can efficient search services based on the search engine be realized. In this process, the concept of distributed data warehouses was born, aiming to integrate and optimize the above data processing flows and to explore the physical limits of data systems in different business scenarios to meet the most demanding business requirements.

Conclusion

In summary, with the constant evolution of user needs, traditional data processing systems are no longer sufficient. The distributed data warehouse, as an industry response to this challenge, is not only for performance improvements but also for gradually meeting the high standards of business in terms of accuracy and real-time capabilities, thereby providing a superior service experience for users.

In modern home-sharing reservation applications, user experience can be greatly improved through advanced search technologies. Users can now quickly filter for accommodation options that meet specific needs simply by entering conditions such as “smoke-free parking spots”. With the significant progress of large language models in recent years, machines have begun to understand the meaning of natural language, unlocking a method of search previously unavailable in commercial services.

Besides keyword searches, users can now conduct searches with natural language narratives, such as looking for “a minimalist homestay that is pet-friendly”. To support this advanced search demand, vector databases have been introduced into the systems, utilizing full and incremental data synchronization links to ensure real-time updates to the vector databases, and then collaborating with large language models to complete complex semantic searches.

Furthermore, summarized analysis is also a focal point for users. For example, users may want to know which are the highest-rated homestays in a city over the past three days. To address this need, the system introduces data warehouse products such as ClickHouse, Hive, or Snowflake. This requires developing new data synchronization links to ensure that data reaches the data warehouse, after which complex analysis needs are met through BI tools.
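
The shape of such an aggregate query is sketched below; the table and column names are invented, and it is written against SQLite only so the example runs standalone, while a warehouse such as ClickHouse or Snowflake would execute an equivalent statement over far larger data:

```python
import sqlite3

# Toy review data; schema and values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE review (listing_id INTEGER, city TEXT, rating REAL, created_at TEXT)")
conn.executemany(
    "INSERT INTO review VALUES (?, ?, ?, date('now'))",
    [(1, "Hangzhou", 4.9), (1, "Hangzhou", 4.7), (2, "Hangzhou", 4.2)],
)

# Summarized analysis: highest-rated listings in a city over the past three days.
top = conn.execute("""
    SELECT listing_id, AVG(rating) AS avg_rating, COUNT(*) AS review_count
    FROM review
    WHERE city = 'Hangzhou' AND created_at >= date('now', '-3 days')
    GROUP BY listing_id
    ORDER BY avg_rating DESC
    LIMIT 10
""").fetchall()
print(top)
```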

From initially simple relational databases to now applications that include various data types and complex business scenarios, this homestay application has progressively evolved into a complex system in terms of its data architecture. We can categorize the data needs as follows:

  • Structured data: Includes basic information such as the name, geographic location, number of rooms, and price of the homestay.
  • Semi-structured data: Involves facility styles and functional parameters, such as the capacity of refrigerators or the power of microwave ovens.
  • Unstructured data: Information such as homestay photos and text reviews.

The main usage scenarios for data include:

  • Simple query: Quickly obtaining detailed information about a particular homestay.
  • Keyword search: Based on specific keywords, such as “smoke-free parking spots”.
  • Semantic search: Based on more complex natural language descriptions, such as “a minimalist homestay that is pet-friendly”.
  • Summarized analysis: Comprehensive analysis of the highest-rated homestays in a city over a certain period.

As business demands grow, the data systems built to serve them become increasingly complex, a typical pattern in many commercial services. This complexity, however, brings its own challenges, which we can analyze from three perspectives: research and development, maintenance, and business.

From the research and development perspective, building such a complex system requires R&D personnel to master a range of different products, understand each product's limitations, and know the corresponding workarounds, which significantly raises the difficulty and barrier to entry of development. For smaller enterprises, recruiting experienced data engineers is a real challenge, and this scarcity of talent often means they cannot fully exploit the commercial value of their data.

In today’s data-intensive environment, professional data engineers are a tremendous asset for a company. Yet even with these professionals, companies still face significant challenges when synchronizing data across multiple products. Data engineers have to build one synchronization link after another, which not only increases their workload and lowers overall data-development efficiency, but also slows business iteration, becoming a major obstacle to growth.

From the maintenance perspective, operating such a multitude of products also presents significant challenges, and data synchronization tasks are often the most error-prone part of the data system. When synchronization issues occur, different components of the system may expose inconsistent data states. Moreover, the same data has to be synchronized into several products, leading to redundancy that consumes more storage resources and drives up cost.

On the business level, although the data architecture keeps evolving to meet business needs, the many synchronization hops still introduce data delays, and with the architecture's growing complexity, data inconsistencies become inevitable, so business requirements are never fully satisfied.

When discussing core business requirements, one can reference Elon Musk’s insights on the X platform: “The correct way to evaluate a product is not just by comparing it to competitors (too simple), but rather by comparing it to physical limits.” If pursuing physical limits becomes our goal, then how should we define the physical limits of a data system? On what dimensions should we evaluate the limits of a data system? In this regard, we approach from the following three dimensions: performance, accuracy, and real-time.

First is the demand for performance: as the business grows, so does the demand for performance. Given sufficient resources, we expect the system to meet the business's performance requirements in every respect, because the moment performance falls short, problems surface immediately; this is why performance has always been a core driving force in the big data field.

Regarding accuracy: We aspire for the data stored in the system to always be accurate and for all data-based query results to be beyond doubt. Ensuring data accuracy is crucial for making reliable decisions, and this is also a basic but core requirement faced by data systems.

The last aspect is real-time capability: different business requirements demand different degrees of freshness, ranging from hour-level to minute-level or even second-level. A data system that challenges the physical limits must cope with the diverse real-time requirements of various business scenarios while keeping data latency minimal.

Based on the above analysis, we further explore how to build a data system for different scenarios that can achieve an ideal state across three dimensions: performance, correctness, and real-time, thereby challenging the physical limits of the data system.

In exploring the physical limits of data systems, the first scenario we consider is simple queries, examined along the same three core dimensions: performance, correctness, and real-time capability. Consider a use case in which a user opens the detail page of a particular guesthouse, requiring the system to quickly fetch all the information about that guesthouse.

Initially, the system adopted a relational database. Because the relational database stores the source data directly, it naturally keeps data up to date in real time. Moreover, its support for transactions ensures strong data consistency, meeting the need for correctness. However, the performance of a relational database is bounded by a single server: once the business's performance requirements exceed what one server can deliver, a standalone database can no longer keep up.

An intuitive solution is horizontal scaling, distributing data storage across many servers. Although this provides the necessary performance headroom, it also brings a new challenge: how to maintain data integrity and correctness on top of it. Take the system's resident table (storing residents' basic information) and order table (recording all orders) as an example. Sharding can scatter a resident's information and their order records across different servers. Some operations, such as deleting a resident together with all of their orders, must then modify multiple servers at once, and ensuring the atomicity and consistency of those changes requires distributed transactions.
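
A minimal, purely in-memory sketch of a two-phase-commit-style delete across two shards follows; the `Shard` class and data layout are invented, and the sketch deliberately glosses over failure recovery, which is where the real difficulty of distributed transactions lies:

```python
class Shard:
    """Toy in-memory shard with a prepare/commit interface (illustration only)."""

    def __init__(self, rows):
        self.rows = rows        # committed data
        self.pending = None     # staged change awaiting commit

    def prepare_delete(self, resident_id):
        self.pending = [r for r in self.rows if r["resident_id"] != resident_id]
        return True             # a real participant could also vote to abort here

    def commit(self):
        self.rows, self.pending = self.pending, None


# After sharding, a resident's row and their order rows live on different servers.
resident_shard = Shard([{"resident_id": 7, "name": "Alice"}])
order_shard = Shard([
    {"resident_id": 7, "order_id": 101},
    {"resident_id": 8, "order_id": 102},
])

# Phase 1: every shard stages the delete; Phase 2: commit only if all of them agreed.
shards = [resident_shard, order_shard]
if all(shard.prepare_delete(7) for shard in shards):
    for shard in shards:
        shard.commit()

print(resident_shard.rows)   # []
print(order_shard.rows)      # only resident 8's order remains
```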

However, distributed transactions are technically challenging. To avoid this problem, NoSQL databases have become one of the solutions. Among the many types of NoSQL databases, the document-oriented database is outstanding. Here, we take the document-oriented database as an example to analyze how it deals with the problems of horizontal scaling.

The main challenge of horizontal scaling lies in implementing distributed transactions across servers. If all the data entities that need to be modified together can be guaranteed to live on the same server, cross-server distributed transactions can be avoided. The document model is built on this idea: it groups all related entities that need to be modified together into a single document and ensures that each document is stored on one server and never split. For example, the same tenant-and-orders data can be laid out either as rows across relational tables or as one nested document, as sketched below.
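
Here is a minimal illustration of the two layouts using plain Python data structures; the field names are assumptions rather than a real schema:

```python
# Relational layout: two tables linked by resident_id, which sharding may
# place on different servers.
residents = [{"resident_id": 7, "name": "Alice"}]
orders = [
    {"order_id": 101, "resident_id": 7, "listing": "Lakeside Cottage"},
    {"order_id": 102, "resident_id": 7, "listing": "City Loft"},
]

# Document layout: one nested document that always lives on a single server,
# so deleting the resident and all of their orders touches exactly one document.
resident_doc = {
    "_id": 7,
    "name": "Alice",
    "orders": [
        {"order_id": 101, "listing": "Lakeside Cottage"},
        {"order_id": 102, "listing": "City Loft"},
    ],
}
```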

After adopting the document model, operations that used to be awkward, such as deleting a tenant and all of their order information, become exceptionally simple: just remove that tenant's document. The document database thus sidesteps the complexity of cross-server distributed transactions and achieves genuine horizontal scaling, which is particularly important for meeting ever-growing performance demands.

The document model is well suited to one-to-one and one-to-many relationships between entities, since it lets users concentrate associated entities within a single document. When it encounters many-to-many relationships, however, its limitations become apparent. For these complex relationships there are usually two approaches. The first is to duplicate the shared entity into every document that references it.

But when we later modify such an entity, we must update every document that stores a copy of it, and those documents may be spread across different servers; making that modification atomic brings us right back to distributed transactions. The second approach is to store different entities in separate documents and reference them by ID.

However, this still presents challenges, and maintaining reference consistency is a major issue. Whenever we create a document, we must ensure that all the referenced documents already exist. Conversely, when deleting a document, we must confirm that no other document is referencing it. Since a document and its referenced documents may be located on multiple servers, ensuring the consistency of this reference relationship still relies on distributed transactions.

In the end, to accurately express all relationships between entities in a document-based database, distributed transactions become an indispensable link. Only by implementing distributed transactions can we ensure the perpetual correctness and consistency of data. This leaves us facing two choices: one is to realize distributed transactions on top of the document-based database to solve the issue of data accuracy; the other is to realize distributed transactions on the relational database while increasing its horizontal scaling capability to address performance issues.

To make a decision, let us compare these two choices. A major advantage of document-based databases, in addition to horizontal scaling, is their support for semi-structured data. Users can easily change the document structure without going through the complicated process of database table structure modification. This offers tremendous convenience for agile data development, and NoSQL databases, represented by document-based databases, have thus been seen as symbols of agile development and horizontal scaling.

On the other hand, relational databases also have their undeniable advantages, the most notable of which is the power of the SQL query language. SQL simplifies database query processing, and its ecosystem is very robust, with many tools providing excellent SQL support.

Taking these factors together, the relational database appears to be favored due to many characteristics that document-based databases lack. Therefore, building on relational databases seems to be a more reasonable solution. Of course, challenges still exist, especially with the implementation of distributed transactions. However, Google’s Spanner has successfully implemented distributed transactions and has published detailed methodology in the form of a paper. Spanner’s great success within Google’s internal applications not only proves the feasibility of distributed transactions theoretically but also demonstrates the potential in engineering practice.

With distributed transaction technology, we can use data sharding to scale databases horizontally and meet performance needs. Choosing a relational database as the foundation does not mean giving up the document database's strength in handling semi-structured data: the type system of relational databases is extensible, and by introducing JSON data types a relational database supports semi-structured data with ease. Once distributed transactions and semi-structured data handling are integrated, relational databases not only cover the advantages of NoSQL but also show unique value in several other respects.
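
A small sketch of the idea follows, written against Python's built-in sqlite3 and its JSON functions purely so it runs standalone; in a PostgreSQL-style engine this would be a JSONB column, and the table and key names here are invented:

```python
import sqlite3

# Structured columns plus a JSON column for semi-structured facility data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listing (id INTEGER PRIMARY KEY, name TEXT, facilities TEXT)")
conn.execute(
    "INSERT INTO listing VALUES (1, 'Lakeside Cottage', ?)",
    ('{"fridge_litres": 120, "microwave_watts": 800, "parking": true}',),
)

# Query a field inside the JSON document without any table-schema change.
print(conn.execute(
    "SELECT name, json_extract(facilities, '$.fridge_litres') FROM listing"
).fetchone())
```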

For simple queries we have now arrived at a system that performs well on performance, correctness, and real-time capability. Turning to keyword search, where data synchronization and search engines are usually combined, let's use a practical example to see why real-time capability is so crucial in search.

In the architecture of one homestay application, data is synchronized from the database to MongoDB and then on to the search engine to support keyword searching. A homestay administrator, for instance, searched for listings with the keywords “smoke-free” and “parking”, but their own listing did not appear in the results. The reason was that the administrator had not included “parking” in the listing's information. After realizing the oversight, the administrator promptly added it in the management back-end, yet even after repeating the search, the listing still did not show up. The cause was the delay in synchronizing data from the database to the search engine.

The above example clearly shows that real-time capability plays an indispensable role in the search scenario. While most search engines perform well in terms of performance, they often fall short when it comes to achieving the ultimate correctness and real-time responses. To ensure that search engine results are always the freshest, a viable solution is to perform searches directly on the original data, without relying on data synchronization. Since the original data is stored in a relational database, this naturally meets the requirements of correctness and real-time capability, and the issue boils down to performance.

In this scenario, we are dealing with a single-machine performance issue brought about by new usage scenarios of the data, not a challenge of scalability. Even if the data volume is not large and can be processed on a single server, using a relational database alone for searching may result in decreased performance. To overcome this challenge, we propose building on the foundation of a distributed database to enhance search performance.

One of the core technologies behind search engines' exceptional performance is the inverted index. With an inverted index, documents are broken down into individual keywords, recording which documents each keyword appears in. The IDs of those documents are sorted to form a posting list. When a user searches, the system looks up the posting list for each keyword and, through union or intersection operations, quickly locates the documents that match the query.
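
A toy, pure-Python version of the idea (the sample documents and the tokenization are deliberately simplistic):

```python
from collections import defaultdict

# Build a tiny inverted index: keyword -> sorted posting list of document IDs.
docs = {
    1: "smoke-free room with parking",
    2: "pet friendly loft, smoke-free",
    3: "parking available, city centre",
}
postings = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.replace(",", " ").split():
        postings[token].add(doc_id)
inverted = {token: sorted(ids) for token, ids in postings.items()}

def search_all(*keywords):
    """AND query: intersect the posting lists of every keyword."""
    lists = [set(inverted.get(k, ())) for k in keywords]
    return sorted(set.intersection(*lists)) if lists else []

print(search_all("smoke-free", "parking"))   # -> [1]
```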

Index types, in turn, are an extension mechanism of relational databases. By incorporating the inverted index as a new index type, a relational database gains efficient search capability. Moreover, such inverted indexes are not limited to keyword search over text; they apply equally to structured columns in tables, where they speed up filtering and combined lookups.
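
For instance, in a PostgreSQL-flavoured engine the inverted index can surface as a GIN index over a text-search vector; the statements below are purely illustrative, with the table, column, and text-search configuration all assumed:

```python
# Illustrative DDL and query; they would be sent to a PostgreSQL-compatible engine.
CREATE_FULLTEXT_INDEX = """
CREATE INDEX listing_description_fts
    ON listing
 USING GIN (to_tsvector('english', description));
"""

KEYWORD_QUERY = """
SELECT id, name
  FROM listing
 WHERE to_tsvector('english', description) @@ to_tsquery('smoke-free & parking');
"""
```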

Therefore, on the basis of relational databases we can absorb the inverted-index technology of search engines and build a system that delivers fast search performance while also guaranteeing data accuracy and real-time freshness.

In the past few years, generative AI technology has made significant progress, especially large language models, which can provide unexpectedly high-quality answers to most questions. However, given that these models can only access public domain data during training, they may not be able to reach their full potential when answering questions involving private domain data.

In modern technical solutions there are challenges in applying large language models to private domain data. One method is to fine-tune the model on the private data. This approach has limitations, however, such as a high barrier to entry, and the risk that fine-tuning fails to capture important details of the private data: a model's knowledge acquisition is essentially a lossy compression process, so potentially valuable details can be lost along the way.

To overcome these limitations, Retrieval-Augmented Generation (RAG) offers another solution. Its workflow adds a recall step before generation: when a user poses a question, the system submits it to the recall system, which searches the knowledge base for relevant documents. For example, given keywords like “smoke-free parking space”, the knowledge base returns the associated documents. The user's question, the system prompt, and the recalled documents are then fed to the large language model as context, and the answer is generated on that basis.
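
A minimal sketch of this flow is shown below; `retrieve_documents` and `call_llm` are hypothetical placeholders standing in for the recall system and the large language model:

```python
def retrieve_documents(question: str, top_k: int = 3) -> list[str]:
    """Placeholder recall step: keyword or vector search over the private knowledge base."""
    return ["Listing 12: smoke-free, two parking spots, close to the lake."][:top_k]

def call_llm(prompt: str) -> str:
    """Placeholder for a call to a large language model."""
    return "Listing 12 matches: it is smoke-free and offers parking."

def answer(question: str) -> str:
    # Assemble question, system instructions, and recalled documents into one prompt.
    docs = retrieve_documents(question)
    prompt = (
        "Answer the question using only the context below.\n"
        "Context:\n" + "\n".join(docs) + "\n"
        "Question: " + question
    )
    return call_llm(prompt)

print(answer("Which listings are smoke-free and have parking?"))
```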

Retrieval methods are diverse: besides keyword search, semantic retrieval is becoming popular. Semantic retrieval lets users find relevant information with natural language. Its basic building block is the embedding model, which turns each document into a high-dimensional vector summarizing its content; distances between these vectors reflect how similar the documents are. When a user asks a question, the system embeds it into a vector as well and searches the high-dimensional space for documents with nearby vectors, which are then handed to the large language model to produce an answer.

In a system containing hundreds of millions of documents and their corresponding vectors, the indexing scalability of relational databases and the use of vector indices (such as IVFFlat or HNSW) make efficient searching for similar vectors in high-dimensional spaces possible. Hence, combining vector retrieval with large language models’ capabilities can create an efficient, accurate, and real-time semantic retrieval system, without the need for an additional vector database, relying only on relational databases to meet the needs.
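
A sketch of how this can look inside a relational database follows, using pgvector-style PostgreSQL syntax; the table name, vector dimension, and operator class are assumptions for illustration:

```python
# Illustrative statements for a pgvector-style setup inside a relational database.
CREATE_EMBEDDINGS = """
CREATE TABLE listing_embedding (
    listing_id BIGINT PRIMARY KEY,
    embedding  vector(768)
);
CREATE INDEX ON listing_embedding USING hnsw (embedding vector_l2_ops);
"""

# Nearest-neighbour search: the query vector comes from the same embedding model
# that produced the stored document vectors.
SEMANTIC_QUERY = """
SELECT listing_id
  FROM listing_embedding
 ORDER BY embedding <-> %(query_vector)s
 LIMIT 10;
"""
```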

In today’s data-intensive business environment, real-time response plays a key role in summarized analysis. The data system not only stores basic data about homestays and customers but also holds performance telemetry from the application and its services, which is essential for detecting and troubleshooting performance problems. When operations staff spot a performance issue in monitoring, they urgently need to understand what is slowing the application down, and then determine whether it is a common problem affecting all users or only a specific user group.

If only some users are affected, operations staff investigate further, for example whether different application versions, phone models, or network types affect performance differently. Through a series of ad-hoc queries they quickly narrow down the likely source of the problem, often a configuration error under specific circumstances. Once the configuration is changed, they need to verify immediately whether the change restores performance for the affected users. Real-time data feedback is therefore crucial for operations staff to iterate quickly and resolve issues, and how fresh the data can be ultimately depends on how it is written into the data warehouse.

One common method of writing data is through batch processing. For example, the system might collect a batch of data every five minutes and then write it all at once into the data warehouse. This method can enhance performance and ensure data accuracy, but there may be a slight shortfall in immediacy. Although data delays can be reduced to the level of hours or even minutes, achieving lower latencies remains challenging even when resources are allocated reasonably. Many data warehouses and data lakes use this batch processing approach to fulfill the needs of performance and accuracy, but quasi-real-time processing is not the same as real-time feedback; it does not achieve absolute immediacy.

Another way to ingest data is to write it immediately, and such systems generally guarantee only so-called “eventual consistency”: if no new writes arrive for a while, the data will eventually read as consistent. But if real-time writes keep flowing in, that “eventually” may never actually arrive, so users risk reading inconsistent data. For example, an update to bank accounts may consist of two parts: deducting funds from one account and adding the same amount to another. Even though the synchronized writes ensure both changes are recorded in the data warehouse, a read may observe only one of them, leading to an incorrect total balance.
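
A toy illustration of this read anomaly in plain Python (the account values are invented; the point is that a non-atomic read can observe a state that never logically existed):

```python
# A transfer is written as two updates; a reader that sees only one of them
# computes a total balance that never logically existed.
accounts = {"A": 100, "B": 100}

def transfer(src: str, dst: str, amount: int) -> dict:
    accounts[src] -= amount        # first write becomes visible...
    snapshot = dict(accounts)      # ...a non-atomic read lands here...
    accounts[dst] += amount        # ...before the second write is applied
    return snapshot

mid_read = transfer("A", "B", 30)
print(sum(mid_read.values()))      # 170: deduction visible, credit missing
print(sum(accounts.values()))      # 200: the final state is consistent again
```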

To illustrate further, some B&B managers advertise in order to raise their occupancy rates. This generates two different events, an exposure and a click, which do not necessarily occur in the same transaction but are causally related. Even if, for performance reasons, the system writes them together in the same request, a read may still surface only the click event and miss the exposure, violating causality.

Write atomicity is highly valued while read atomicity is often overlooked, yet read atomicity is just as important for data correctness. Although real-time writing satisfies the business's performance and freshness needs, eventual consistency is not absolute consistency; it can yield inconsistent reads and therefore fails to fully meet the business's expectation of correctness.

To satisfy correctness and freshness at the same time, the system must support real-time writes with strong consistency. Distributed databases already do both well, and the only remaining challenge is performance. To be clear, the performance problem here is not scalability but single-machine performance under the new usage patterns of the data.

Performance optimization can be achieved by leveraging the efficient analysis capabilities of data warehouses. Data warehouses are able to achieve high performance mainly due to the following three technologies:

  • Columnar Storage: Storing the data of the same column together yields a higher compression ratio and reduces the I/O needed to read data. During query execution, only the columns the query actually touches need to be read, which further cuts I/O and improves query efficiency.
  • Vectorized Execution: Unlike traditional row-by-row processing, vectorized execution significantly boosts query speed by processing data in micro-batches. It should be noted that vectorized execution is not related to embedding vectors and should not be confused with them.
  • Materialized Views: By precomputing common queries and storing the results, query performance is improved. For example, precomputed aggregate indicators can be used directly, avoiding redundant calculations. Materialized views are a common abstraction of this type of precomputation.

After implementing key technologies like columnar storage, vectorized execution, and materialized views, systems can support high-performance analytical scenarios while ensuring accuracy and real-time performance.
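
As a concrete illustration of the materialized-view idea above, the statements below precompute a per-listing daily rating aggregate in PostgreSQL-flavoured SQL; the view, table, and column names are invented for the example:

```python
# Precompute the aggregate once, then let dashboards read the stored result
# instead of re-aggregating the raw review table on every query.
CREATE_VIEW = """
CREATE MATERIALIZED VIEW daily_listing_rating AS
SELECT listing_id,
       date_trunc('day', created_at) AS day,
       AVG(rating)                   AS avg_rating,
       COUNT(*)                      AS review_count
  FROM review
 GROUP BY listing_id, date_trunc('day', created_at);
"""

# Kept fresh via refresh (or incremental maintenance, where the engine supports it).
REFRESH_VIEW = "REFRESH MATERIALIZED VIEW daily_listing_rating;"
```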

On the path of technological integration, we are much like the physicist Maxwell. By combining the earlier equations of electricity and magnetism, he formed Maxwell's equations, which predicted the entirely new physical phenomenon of electromagnetic waves. Just as in the four scenarios discussed above, we choose relational databases as our foundation and draw on optimization techniques from other systems to push performance, accuracy, and real-time capability to their limits. These optimization techniques are compatible and mutually reinforcing, and they can be integrated into a single system, just as Maxwell unified the previously separate laws of electricity and magnetism.

In this way, we can build a powerful system that not only stores and processes structured, semi-structured, and unstructured data, but also meets the highest requirements for performance, accuracy, and real-time capability across scenarios such as simple queries, keyword search, semantic retrieval, and summarized analysis. The system can even support emerging query patterns, such as combining keyword and natural-language queries, which no single product could handle before.

Invoking Maxwell's approach of drawing on many strengths is not to claim comparable greatness, but to emphasize that through integration and innovation we can meet new technical challenges and push the boundaries of data storage and processing technology.

Achieving this is no easy feat. Maxwell noticed an asymmetry in the equations of his time: a changing magnetic field could generate an electric field, but a changing electric field could not produce a magnetic field. This clashed with his sense of aesthetics, so he made a “small” adjustment by adding the displacement current. The seemingly minor change was in fact profound: it not only resolved the conflict between the equations and the conservation of charge but also predicted electromagnetic waves, which in turn revealed that light itself is such a wave.

Similarly, in the world of data systems we keep looking for something akin to the displacement current: a minimalist experience. Once business requirements are fully met, the pursuit of excellence in the development and operations experience becomes the next topic, and a superior user experience gradually becomes a key differentiator between products.

To provide a better experience in development and operations, we need to start from several aspects:

  • Creating a unified API that offers consistent user experience whether for simple or complex search, analysis, and other scenarios, avoiding the sensation of switching between different products.
  • Establishing a unified data storage that satisfies multiple scenarios with one set of data and ensures that data synchronization operations are optional rather than mandatory.
  • Ensuring new products are compatible with the existing ecosystem, reducing the learning curve, allowing users to proficiently use new products based on their existing knowledge.
  • Effectively isolating different scenarios, whether through software or hardware means, so that resource allocation and operations in different scenarios do not interfere with each other.
  • Focusing on adaptability: the system should tune itself for different scenarios, for instance automatically adjusting concurrency or execution plans based on the type and complexity of queries, to optimize performance and avoid unnecessary cost.

To achieve true adaptability, in-depth consideration must be given at the onset of architectural design, not merely as an afterthought following product development. Our goal is to create a seamless and efficient system for business scenarios that demand diversity, complexity, and high performance.

When faced with business demands of varying intensity, selecting the right technological approach is crucial. For situations that can tolerate a certain degree of real-time sacrifice, a system designed to meet the highest standards may lead to wastage of resources. In such cases, the system should optimize based on specific business requirements to avoid unnecessary expenses. For instance, for analytical tasks that don’t have particularly high real-time requirements, data can be written using batch processing. This method can effectively reduce additional performance burdens that come from seeking extremely low latency, ensuring that technical solutions match business scenarios, and achieving optimum performance.

With the continuous advancement of technology, a new class of data products—Distributed Data Warebases—has emerged. This concept is derived from the combination of “Data Warehouse” and “Database,” signifying that it integrates all the advantages of both data warehouses and databases. Its characteristics are as follows:

  • Inclusive: Capable of storing all types of data, whether structured, semi-structured, or unstructured.
  • Widely applicable: Supports various data processing scenarios from simple queries, keyword searches to semantic queries, and summary analysis.
  • Pushing boundaries: Not only provides basic guarantees of performance, correctness, and real-time, but also strives to excel in these areas.

As a database, the Distributed Data Warebase solves horizontal scaling through distributed technology; as a data warehouse, it also overcomes the challenges of accuracy and real-time performance. Moreover, to simplify development and operations, it offers a minimalist experience: unified APIs, unified data storage, compatibility with existing ecosystems, and workload isolation. Its adaptive characteristics ensure the system optimizes itself for different scenarios.

The Distributed Data Warebase embodies the pursuit of excellence grounded in business scenarios; it is not merely an invention but an important discovery, and its emergence marks a new stage in data development. In the data architecture of the homestay application, for instance, adopting a Distributed Data Warebase not only makes the architecture more concise but also meets the business's ultimate needs for performance, accuracy, and real-time capability.

Here is an intriguing story: a little girl, seeing a paper magazine for the first time, tried to pinch-zoom it with her fingers, just as she would on an iPad. After several attempts she realized the magazine did not respond, as if it were an iPad with limited functionality. Perhaps one day we will likewise find that existing databases and data warehouses look limited when compared with the new Distributed Data Warebase.

The highly anticipated ArchSummit Global Architects Summit will be held in Shenzhen from June 14th to 15th. At this premium technology event, Hu Yuejun, Vice President of Technology for ProtonBase, will deliver a fascinating presentation titled “Exploration and Practice of Data Warehouse in the Intelligentization of Enterprise Private Domain Data”. Hu Yuejun will delve into how Data Warehouse solutions, while integrating traditional TP, AP, and text search functions, can help enterprises optimize data asset management and thus drive business innovation and growth.
