As data intelligence permeates every industry, optimizations in infrastructure, computing power, architecture, and data training have become critically important. To examine the current AI landscape and how data is processed and used intelligently, the relevant discussions are summarized below.
Impact and Challenges Brought by AI Large Models
With AI-generated content (AIGC) becoming increasingly popular, the training of large models has placed unprecedented demands on the underlying computational infrastructure. Experts with extensive experience in cloud computing suggest that the era of large models is shifting attention away from the traditional triad of compute, network, and storage toward algorithms, computing power, and data. Today, the pressure on computing power stems mainly from the need for higher degrees of parallelism, vector computation, and matrix computation, especially on GPUs. Beyond that, the coordination problems posed by interconnected computation within large-model clusters, such as the need for high-throughput, low-latency networks during massive data exchange and for parallel high-speed reading of training data, have set new infrastructure requirements, which is one reason data processing units (DPUs) have been introduced. The age of large models has therefore brought challenges distinctly different from traditional distributed computing scenarios such as large-scale promotions or big data processing.
Database Technology Requirements in the New Era
In the AI era, data has become the core of enterprise competitiveness, so businesses are increasingly focused on intelligent, full-chain management of their private domain data. The current reality is that enterprise data often resides in distinct systems depending on the application scenario. For businesses at a relatively small scale, simple queries and transactional storage can be handled by databases such as MySQL and PostgreSQL.
As the business expands, companies may move their data into MongoDB for better horizontal scalability. With its document model, MongoDB supports the storage and retrieval of semi-structured data. Nevertheless, it still struggles to express many-to-many relationships and to handle transactional updates.
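As an illustration of this trade-off, the sketch below (assuming a locally running MongoDB instance and the `pymongo` driver; the database and collection names are made up) shows how naturally a semi-structured order document fits the model, and how a many-to-many relationship still needs an explicit `$lookup` join.

```python
from pymongo import MongoClient

# Assumes a local MongoDB instance; database/collection names are illustrative.
client = MongoClient("mongodb://localhost:27017")
db = client["demo"]

# Semi-structured data fits the document model naturally: nested fields need no schema migration.
db.orders.insert_one({
    "order_id": "o-1001",
    "user": {"id": "u-1", "name": "Alice"},
    "items": [
        {"sku": "sku-9", "qty": 2, "attrs": {"color": "red"}},
        {"sku": "sku-3", "qty": 1},
    ],
})

# A many-to-many relationship (users <-> groups) is harder: documents hold arrays of
# references, and "joining" them back requires an explicit $lookup aggregation stage.
db.users.insert_one({"_id": "u-1", "name": "Alice", "group_ids": ["g-1", "g-2"]})
db.groups.insert_many([{"_id": "g-1", "name": "ops"}, {"_id": "g-2", "name": "analytics"}])

groups_of_alice = list(db.users.aggregate([
    {"$match": {"_id": "u-1"}},
    {"$lookup": {"from": "groups", "localField": "group_ids",
                 "foreignField": "_id", "as": "groups"}},
]))
# Multi-document transactional updates additionally require a replica set and explicit sessions.
```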
In keyword-search scenarios, as business data volumes grow, many companies turn to Elasticsearch as their search engine. However, this choice can bring delayed and unwieldy data updates. Another common use case is data aggregation and analysis, where the industry often relies on tools such as ClickHouse or Snowflake. Yet these tools may not fully satisfy real-time processing and data-update needs.
The continuous advancement of AI and AIGC technology, along with the improving natural language understanding of large models, has made semantic search increasingly important. However, these large models are usually trained on public domain data, so their usefulness with private domain data is often limited. One strategy is to fine-tune a large model to better fit the characteristics of private domain data, but this path is typically cost-prohibitive because it requires retraining the model.
Another strategy is to combine large models with an enterprise's private knowledge base: the internal knowledge base is searched through techniques such as vectorization to provide semantically relevant results, an approach commonly known as retrieval-augmented generation (RAG). However, this approach also faces challenges in data updates and in integrating structured data.
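A minimal sketch of this RAG flow is shown below. The `embed()` function here is only a toy stand-in for a real embedding model or API, and `ask_llm()` is a placeholder for a hosted or privately deployed large model; the documents are invented.

```python
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a real embedding model: hashes character trigrams into a fixed vector."""
    vec = np.zeros(64)
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % 64] += 1.0
    return vec

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to a hosted or privately deployed large model."""
    return f"[model answer based on a prompt of {len(prompt)} characters]"

# Private-domain documents are vectorized once and kept alongside their text.
knowledge_base = [
    {"text": doc, "vec": embed(doc)}
    for doc in ["Refund policy for cancelled shipments ...",
                "Fleet maintenance manual, chapter 3 ...",
                "2023 regional pricing sheet ..."]
]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def rag_answer(question: str, top_k: int = 2) -> str:
    q_vec = embed(question)
    # Retrieve the most semantically similar private documents.
    hits = sorted(knowledge_base, key=lambda d: cosine(q_vec, d["vec"]), reverse=True)[:top_k]
    context = "\n\n".join(h["text"] for h in hits)
    # The model answers from the retrieved context instead of being fine-tuned on it.
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return ask_llm(prompt)

print(rag_answer("How are cancelled shipments refunded?"))
```

The key point is that only the knowledge base needs to be re-vectorized when private data changes, not the model itself, which is exactly where the update and structured-data-integration challenges mentioned above arise.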
In summary, solving these scenario-specific problems forces enterprises to adopt multiple technical solutions, which inevitably increases development and maintenance costs. More seriously, distributing data across multiple systems adds storage costs, and moving data between them creates consistency and timeliness problems. With the adoption of AI and semantic search, and the introduction of vectorized data, efficiently integrating vectors with existing structured data to improve the accuracy of semantic search has become a new challenge.
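One common way to approach this integration, sketched below under the assumption of a PostgreSQL-compatible engine with a pgvector-style `vector` type and `<->` distance operator (the connection details, tables, and columns are made up), is to filter on structured columns and rank by vector similarity in a single query on the same system.

```python
import psycopg2

# Illustrative connection and schema, assuming a PostgreSQL-compatible engine
# with a pgvector-style "vector" column type.
conn = psycopg2.connect(host="localhost", dbname="demo", user="app", password="app")
cur = conn.cursor()

query_vec = "[0.12, -0.03, 0.88]"  # embedding of the user's question, serialized for the driver

cur.execute(
    """
    SELECT o.order_id, o.amount, d.content
    FROM orders o
    JOIN docs d ON d.order_id = o.order_id
    WHERE o.region = %s                  -- structured predicate
      AND o.created_at >= %s
    ORDER BY d.embedding <-> %s::vector  -- semantic ranking in the same engine
    LIMIT 10
    """,
    ("east-china", "2024-01-01", query_vec),
)
for order_id, amount, content in cur.fetchall():
    print(order_id, amount, content[:80])
```

Keeping the structured filter and the vector ranking in one engine avoids shipping candidate sets between systems, which is where consistency and freshness problems usually creep in.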
Returning to users' basic needs for data: they expect consistency, correctness, and timeliness, along with high-performance storage, querying, and data mining. Meeting these challenges means pursuing these demands uncompromisingly, right up to the physical limits.
From the perspective of the data analysis field, large model technology is likely to change the data architecture of enterprises and fundamentally improve the daily data analysis experience of users. The introduction of large models requires data architectures to adapt to new application scenarios, such as model training, tuning, and inference. This not only raises the requirements for data quality but also means that unstructured data, such as audio, images, and videos, needs to be taken into consideration. Data architecture must evolve accordingly, transitioning from traditional warehouse-style structures to more flexible data platform structures.
At the same time, the application of large models can expose inconsistencies in metadata management, for example between real-time and offline data sources, as well as new metadata demands arising from new algorithmic application scenarios.
Unified metadata is crucial for data architecture: it provides a single data view across usage scenarios, which facilitates the application of large models in inference and training. In the field of data analysis, as Professor Wang Haihua has pointed out, incorporating large models is expected to significantly improve the efficiency and automation of data analysis. Traditional data analysis tools often require complex operations, such as writing SQL queries and doing bespoke development, whereas large models, with their strong semantic understanding and logical reasoning, can simplify the process and let business users engage in data analysis more easily. Large models can also enable automated data analysis through logical reasoning and intelligent-agent techniques: a simple question can yield a complex analysis result, greatly improving productivity. In short, large-model technology brings revolutionary changes to data architecture and significantly improves the efficiency of data analysis, making enterprise data applications more convenient and innovative.
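As a concrete illustration of the "simple question to analysis result" flow, the sketch below uses Python's built-in sqlite3 together with a placeholder `generate_sql()` function standing in for a large model's text-to-SQL capability; the schema, data, and question are invented so the example runs end to end.

```python
import sqlite3

def generate_sql(question: str, schema: str) -> str:
    """Placeholder for a large-model call that translates a business question into SQL.
    Here it returns a canned query so the sketch runs without a model backend."""
    return (
        "SELECT city, COUNT(*) AS orders "
        "FROM orders WHERE order_date >= '2024-06-01' "
        "GROUP BY city ORDER BY orders DESC LIMIT 5"
    )

# A tiny in-memory table standing in for the enterprise's unified data platform.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, city TEXT, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("o1", "Shanghai", "2024-06-03"), ("o2", "Beijing", "2024-06-05"), ("o3", "Shanghai", "2024-06-07")],
)

schema = "orders(order_id TEXT, city TEXT, order_date TEXT)"
question = "Which cities had the most orders this month?"

sql = generate_sql(question, schema)   # the large model handles the semantic-to-SQL translation
for row in conn.execute(sql):          # the data platform still does the actual computation
    print(row)
```

The division of labor matters: the model interprets intent and the metadata, while correctness of the numbers still rests on the underlying data platform, which is why unified metadata and data quality remain prerequisites.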
Facing the changes brought by the era of AI large models, enterprises must pay attention to several key technical points when building intelligent computing platforms. In practical operation and implementation, the transformations in management, computation, storage, and network interconnection deserve particular attention.
For example, on the management side the interaction between the scheduler and the worker nodes may not change much, but computation is a different story. Given the demand for high GPU utilization with large models, companies often use bare-metal servers with direct access to the GPU cards rather than virtualization. At the virtual-network level, DPUs are needed to improve connectivity between bare-metal servers. In scenarios with heavy data exchange, low-latency networking technologies such as RDMA and InfiniBand must be introduced. Moreover, to meet the highly parallel data-loading demands of large-model training, storage-side network performance also needs to be strengthened, for example by adopting RoCE or InfiniBand networks.
At the software level, GPU scheduling technology is needed to keep large-scale computing tasks running, along with parallel file systems to satisfy the high-speed parallel data-loading requirements of large-model training. Building an intelligent computing platform therefore means comprehensively restructuring compute, network, and storage to meet the special requirements of AI workloads.
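A minimal sketch of how these pieces surface at the software level is shown below, assuming PyTorch with the NCCL backend (which can use RDMA/InfiniBand or RoCE when present) and a dataset that would normally live on a parallel file system; the environment variables and synthetic data are illustrative and would in practice be provided by the GPU scheduling platform.

```python
import os
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# RANK/WORLD_SIZE/MASTER_ADDR are normally injected by the scheduler, one process per GPU.
dist.init_process_group(backend="nccl")          # NCCL rides on RDMA/InfiniBand/RoCE when available
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# Training data would live on a parallel file system mounted on every node;
# a synthetic tensor dataset stands in for it here.
dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))
sampler = DistributedSampler(dataset)            # each rank reads its own shard in parallel
loader = DataLoader(dataset, batch_size=256, sampler=sampler, num_workers=4, pin_memory=True)

model = torch.nn.Linear(128, 10).cuda(local_rank)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(2):
    sampler.set_epoch(epoch)                     # reshuffle shards consistently across ranks
    for x, y in loader:
        x, y = x.cuda(local_rank, non_blocking=True), y.cuda(local_rank, non_blocking=True)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                          # gradients are all-reduced over the NCCL fabric
        optimizer.step()
```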
Furthermore, building an intelligent computing platform brings multiple challenges. Enterprises may need to procure expensive new technologies and hardware, and perform extensive compatibility work for them, including new networking setups, drivers, plugins, and kernel modules. To fully exploit this expensive hardware, they also need to deeply optimize scheduling strategies, GPU sharing, and network and storage protocols. Clearly, constructing an intelligent computing platform is a complex undertaking spanning hardware, software, deployment, and optimization, and any enterprise must invest considerable technical effort and resources to meet these challenges.
Tianyi Cloud has accumulated rich experience in supporting enterprises through this transformation. Against the backdrop of rapidly developing large models, Tianyi Cloud, as a cloud computing provider focused on ToB customers, has gained particularly valuable experience helping customers put large models into practice. In most cases, because training a large model from scratch is prohibitively expensive, enterprises prefer to optimize existing models. They therefore need to conduct a comprehensive assessment when deciding to adopt large models, including in-depth evaluation of computing power, data storage needs, and network architecture.
First, assessing computing power requirements is crucial: the scale of model parameters, the volume of data to be processed, and the tuning algorithms used determine the necessary number of GPU cards and the training time and cost. The data-storage assessment must guarantee the read speed of training data, which involves the data volume and the choice of file system. The network evaluation covers whether a high-performance, low-latency RDMA network is needed to support large-model communication. Second, assessing data quality is equally important, since low-quality data can waste resources. Before building a large-model platform, the complexity of the build must also be evaluated: whether a GPU scheduling platform needs to be developed, whether a deep learning platform is required on top, and even whether a model-training platform is needed. Only after this series of evaluations can one work with customers to create a computing platform that uses resources efficiently.
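For the computing-power part of such an assessment, a rough back-of-envelope estimate is often made with the commonly cited approximation of about 6 x parameters x training tokens floating-point operations for one training pass. Every number in the sketch below is an illustrative assumption, not a measurement.

```python
# Rough training-time estimate; every input value is an illustrative assumption.
params = 13e9            # model parameters (e.g., a 13B model)
tokens = 300e9           # training tokens to process
flops_needed = 6 * params * tokens   # common approximation: ~6 FLOPs per parameter per token

gpu_peak_flops = 312e12  # assumed peak dense FP16/BF16 throughput of one accelerator
utilization = 0.40       # fraction of peak realistically sustained (model FLOPs utilization)
num_gpus = 256

seconds = flops_needed / (num_gpus * gpu_peak_flops * utilization)
print(f"~{seconds / 86400:.1f} days on {num_gpus} GPUs at {utilization:.0%} utilization")
# Varying the parameter count, token budget, GPU count, or utilization is how the
# trade-off between cluster size, training time, and cost is typically explored.
```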
When using large models for intelligent data analysis, enterprises face multifaceted challenges. From the data analysis perspective, the introduction of large models has raised new issues for data architecture, especially in processing private domain data and unifying data. In response, Tianyi Cloud has already made practical advances and gained experience in intelligent data analysis. In business scenarios such as the freight industry, companies have grown into medium- or even large-scale Internet businesses. Companies that value data have begun to comprehensively collect and store it, and to use it to support business decisions, precision marketing, risk control, and map and location services, among other scenarios. With large models applied to intelligent data analysis, deeper insights are expected, along with further progress toward truly intelligent data analysis.
With the development of technology, we have made some progress in intelligence, but we recognize that there is still much room for improvement in deep data analysis and the application of AI. In particular, the emergence of large models gives us more possibilities for integrating them into daily data analysis work. For example, our operations team makes a large number of business strategy adjustments and effect analyses every day, involving key monitoring metrics such as business data ownership and order and user growth. These scenarios are crucial for the team.
With these key business scenarios in mind, we decided to create a project called “Fast Search.” Its purpose is to give the operations team an easy, low-threshold entry point so they can quickly and directly obtain the business insights they need, without relying on complex data products or in-depth analysis reports.
Through the “Fast Search” project, our goal is to progressively simplify and integrate existing data products and ultimately establish a unified entry point for intelligent data analysis. We expect this platform to provide intuitive, easy-to-use queries and insights, integrating various analysis functions so that users can express their needs as semantic queries and receive simplified output.
Nevertheless, in the process of simplifying data products, we face challenges of data quality and consistency. Our goal is to gradually unify scattered data resources and establish a reliable set of metrics, further improving the quality of data and metadata. Only then does querying and gaining insights become an easy task.
At the same time, we recognize the current limitations of large models in reasoning. It is crucial to find a balance between data accuracy and reasoning capability, especially for applications where the correctness of data is key. We have therefore thought deeply and run trials at the data level to ensure high data quality and precise metrics. We also understand that the capabilities of large models are not fixed and will change over time, so we must keep applying judgment and improving efficiency to address challenges as they arise.
I firmly believe these thoughts and practices benefit not only data analysis specifically, but also the application and implementation of technology in other fields.
Regarding how to raise the level of intelligence while maintaining the accuracy and stability of the data analysis platform in practice, we are well aware that model choice has a decisive impact on results. Vendors ranging from internationally renowned ones such as OpenAI and Google to domestic providers such as Wenxin and Tongyi Qianwen offer a diverse selection of models. Depending on the stage of the business, we tend to choose the best models in the industry to maximize business benefit. Once in the service phase, other factors such as data security and cost also need to be weighed.
As for the models themselves, we mainly use two categories at our company. One is commercial models suited to simple scenarios, such as Alibaba's Tongyi model, which performs well in many business situations. The other is privately deployed models, applicable where data security requirements are very high, such as applications involving personal privacy data and tasks like summarizing driver recordings.
For scenarios that require domain-specific fine-tuning, we prioritize privately deployed, fine-tuned models to ensure they meet specific customization needs; commercial models often lack this flexibility. In addition, when putting models into online service, both business effectiveness and safety are essential considerations. We must ensure that the model does not generate inappropriate content, such as material involving pornography, violence, or terrorism, which requires effective review and risk-control capabilities. Compared with traditional content-control methods, large models can improve both safety and review efficiency. Choosing the model best suited to a specific scenario is therefore a comprehensive decision based on cost, security, effectiveness, and the need for custom fine-tuning.
Turning to the database field, the core technology of Data Warebase, which combines big data and artificial intelligence, is gradually becoming a trend. It covers traditional transaction processing, analytical processing, and text search (TP/AP/Text Search), and through vector computing and integration with large models it enables intelligent use of private domain data and universal search, helping enterprises exploit their data assets more deeply to drive business innovation.
Work on data analysis and intelligent experience is constantly improving. To improve precision, govern data effectively, and solve the problems that can arise when combining large models with data, many experts with rich experience in developing and applying large models have contributed valuable knowledge. In today's enterprise digital transformation, scenarios requiring multiple systems to work together arise frequently, so there is an urgent need for a unified data product that integrates database technology, big data, and artificial intelligence to solve these complex problems. To this end, our Data Warebase product offers the following features:
- Supports row, column, bitmap, and vector indexes, strengthening both traditional TP/AP queries and semantic search.
- Invests in in-depth R&D on vectorized execution, distributed transactions, and system optimization, ensuring the accuracy and performance of the system.
- Adopts a compute-storage separation architecture, cleanly separating compute and storage to deliver elastic scaling and reduce costs.
- Offers strong adaptability, performing well in areas such as index selection, query concurrency control, and transaction commit.
- Is compatible with standard SQL and the PostgreSQL ecosystem, so users can keep their existing tools without additional changes.
In summary, we are committed to using Data Warebase to meet the challenges brought by AI, improving overall efficiency by integrating and optimizing data and supporting training. We believe future AI technology will understand metrics more accurately and work in synergy with structured data, greatly advancing precision.
We believe Data Warebase will become a unified system that meets users' data storage and computation needs, enabling the intelligent transformation of enterprise private data and the integration of diverse computation scenarios.
Among Tianyi Cloud's offerings, Cloud Scorcher focuses on the intelligent computing platform itself, mainly GPU resource scheduling and underlying infrastructure management, built on technologies such as bare-metal GPUs, DPUs, parallel file systems, and RDMA networks. Wisdom Gatherer focuses on the service layer of intelligent computing, offering task scheduling and processing to help users train or fine-tune models, significantly lowering the barriers to training, fine-tuning, deploying, and serving large models. Energy Ground is akin to a computing-power marketplace aligned with the national “East Data, West Computing” strategy: computing power from various regions is registered on an online platform, and users choose appropriate clusters for their tasks as needed. Energy Ground implements comprehensive multi-region, multi-cluster scheduling, a clear advantage over traditional virtual-machine and container computing.
Mr. Wang Haihua has extensive experience in data architecture, having worked at companies such as Didi, Ele.me, and Pinduoduo. Facing data-platform architectures of different scales and requirements, he has unique insights, especially into handling data complexity and ensuring the performance and accuracy of large models.
In the rapidly growing e-commerce and food-delivery businesses, volume increases quickly month over month, creating huge demand for computing power and storage. The first challenge for the data architecture is strong scalability to support this surge. Fortunately, many components in the big data ecosystem are already mature in scalability and fault tolerance, and only minor additions are needed to keep up with growth. The second challenge is that rapid growth also brings many new business requirements, so the data platform and architecture must adapt quickly to support new projects, sometimes trading perfection for something acceptable so that resources can be allocated reasonably.
We adopted Doris early on to support real-time data writing and efficient querying. At that time Doris had not yet fully matured and occasional issues arose, but we always maintained open communication with the business departments, clarifying problems and explaining situations to keep collaboration smooth on both sides.