The advancement of generative artificial intelligence has significantly propelled the widespread application of large-scale pre-trained models. Central to this development is the vector database, a type of database designed specifically for storing and retrieving vector-form data. By transforming unstructured data such as text, images, and videos into mathematical vector representations, vector databases enable effective retrieval and similarity analysis of these data types.
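The core operation behind such similarity analysis is comparing embedding vectors. The following sketch uses hypothetical 4-dimensional embeddings and a brute-force scan; a real vector database would use learned embedding models and an approximate index such as HNSW instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest(query, items):
    """Brute-force nearest neighbour; real vector databases replace this
    linear scan with an approximate index such as HNSW."""
    return max(items, key=lambda kv: cosine_similarity(query, kv[1]))

# Hypothetical embeddings for three documents (illustrative values only).
store = [
    ("doc_cats", [0.9, 0.1, 0.0, 0.1]),
    ("doc_dogs", [0.6, 0.5, 0.2, 0.0]),
    ("doc_finance", [0.0, 0.1, 0.9, 0.4]),
]
query_vec = [0.85, 0.2, 0.05, 0.05]
best_id, _ = nearest(query_vec, store)  # the document closest in embedding space
```

Because similar content maps to nearby vectors, the query lands on the semantically closest document even without any keyword overlap.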
In the realm of large models, vector databases have demonstrated great potential to improve efficiency and precision, and this advantage has drawn strong enthusiasm from investors. For example, the Dutch AI-native vector database company Weaviate raised fifty million US dollars in Series B funding in the second half of last year; shortly afterwards, the US company Pinecone announced a one-hundred-million-US-dollar Series B round. These active capital operations show that the popularity of vector databases has reached a new peak.
This heightened attention is closely tied to the benefits vector databases bring to large models. Despite their exceptional performance, large language models still have several problems: delayed knowledge updates, hallucinations, a lack of industry-specific or proprietary knowledge, and difficulty providing secure answers. Acting as a kind of long-term memory for large language models, vector databases can address these issues at a relatively low cost by storing relevant vector data and feeding it back into the model.
It is precisely because vector databases contribute so much to the application of large models that a growing number of vendors have begun developing their own vector database products. Baidu Intelligent Cloud recently launched version 1.0 of its enterprise-oriented VectorDB (abbreviated VDB). The database adopts a newly designed kernel, scales elastically to tens of billions of vectors, and performs one to ten times better than comparable open-source products across various application scenarios.
Against this backdrop, why has Baidu Intelligent Cloud chosen to release a specialized vector database product now? Is there real market demand behind it, or is it merely a speculative bubble? How did the team overcome the technical challenges along the way? We recently had the opportunity to discuss these questions in depth with Zhu Jie, chief architect of Baidu’s database products, and senior architect Guo Bo.
When asked why Baidu would develop its own vector database and whether the new solution could surpass past practice, Guo Bo noted that Baidu had previously used its own optimized version of ES (Elasticsearch, abbreviated BES) together with the HNSW (Hierarchical Navigable Small World) algorithm in large-model workloads, proving that the earlier solution could meet certain needs; however, current technology trends and business goals call for a more efficient and professional vector database.
In the current technological environment, the application scenarios of large models are no longer limited to basic applications but are continuously expanding into fields such as knowledge base management and RAG (Retrieval-Augmented Generation). These scenarios demand strong retrieval capabilities as well as database-related abilities. For large B2B clients in particular, vector search is only part of the requirement; advanced database functionality is equally crucial.
Many open-source products on the market have some vector-processing capability, and a rich array of vector search algorithms and libraries is readily available. In the future, however, the competitive edge of vector databases will not be confined to their “vector” processing ability but will increasingly rest on their comprehensive “database” functionality. An outstanding vector database should integrate the two. Last year we observed that many vector database products focused too heavily on the “vector” side while neglecting the breadth of “database” features: some emphasized integrating open-source vector processing technologies but overlooked the construction of advanced database capabilities. Such products may satisfy small-scale clients with loose requirements, but in the long run they will prove insufficient for larger B2B clients with stricter requirements. Baidu therefore believes it is necessary to build a professional vector database with deep database capabilities, positioning itself for the future.
When further discussing the profound meaning of “database,” we must realize that as businesses develop, customers indeed have such needs. These requirements cover not only vector search technology but also support for various data types and processing capabilities, as well as many enterprise-level advanced features—such as high reliability, high availability, low-cost operation, and ease of use. These seemingly fundamental yet crucial elements are also indispensable.
For instance, enterprises face numerous security requirements in document management, such as how to achieve multi-tenant isolation between different departments, how to ensure the execution of different permissions, and how to audit sensitive data access. Additionally, clients may require geo-redundant active-active solutions, demanding high reliability and high availability that can extend to deeper application scenarios. The current market’s vector databases mostly provide only basic vector search functionalities, lacking in other areas. From a long-term perspective and the viewpoint of actual clients, these features are very important.
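To make the multi-tenant and auditing requirements above concrete, here is a minimal sketch of tenant-scoped storage. All class and method names are illustrative, not a real product API; the point is that including the tenant in the lookup key makes cross-tenant reads impossible by construction.

```python
class TenantScopedStore:
    """Minimal sketch of multi-tenant isolation with access auditing.
    Names are illustrative only, not any real product's API."""

    def __init__(self):
        self._data = {}       # (tenant_id, collection) -> list of records
        self._audit_log = []  # every read of sensitive data is recorded

    def insert(self, tenant_id, collection, record):
        self._data.setdefault((tenant_id, collection), []).append(record)

    def query(self, tenant_id, collection):
        # Audit the access before serving it.
        self._audit_log.append(("query", tenant_id, collection))
        # The lookup key includes the tenant, so one department can
        # never see another department's records.
        return self._data.get((tenant_id, collection), [])

store = TenantScopedStore()
store.insert("dept_a", "docs", {"id": 1})
store.insert("dept_b", "docs", {"id": 2})
```

Real systems layer role-based permissions and encryption on top, but the isolation principle is the same: the tenant boundary is enforced in the storage path itself, not only in the application.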
We believe that for enterprise-level clients, it is essential to focus on professional vector database solutions. Therefore, a database should not be limited to offering vector search functionalities but should also meet a wide range of corporate needs, including data type support, interface flexibility, usability, security features, multi-tenancy processing, access auditing, and geo-redundant active-active requirements.
Regarding the development of database solutions for enterprise-level users, is it necessary to adopt a purely specialized vector database to solve the problem? The answer is affirmative; only through professional customization can the technical requirements and business challenges be fully met.
When we turn to the common types of vector databases on the market, we find that traditional relational databases such as MySQL and PostgreSQL, despite their strengths, are not well suited for use as vector databases. Their limitations in how data storage is organized, the lack of retrieval engines optimized for vector data, interfaces that are not well suited to modern vector workloads (for example, no native vector APIs), and their shortcomings in large-scale distributed data management all mean they may prove inadequate in vector data scenarios.
On the other hand, non-relational databases such as MongoDB and Elasticsearch (ES) have certain advantages over traditional relational databases, but restrictive factors remain for specific companies. After certain versions, both MongoDB and Elasticsearch switched their licenses to the Server Side Public License (SSPL), making those versions unsuitable as a foundation for commercial products. Moreover, in distributed-system capability, memory efficiency, and other respects, these systems have not yet reached the expected level, which in the long run makes them an economically inefficient choice.
Therefore, considering the needs of customers and within the enterprise, we need to independently develop a vector database, which is a necessary step that fits our requirements. If we decide to develop our own database system, it is essential to carry out comprehensive optimization to ensure that it performs exceptionally well in distributed system architecture, storage engines, retrieval engines, and aspects related to vector data performance.
When asked whether the vector database includes a retrieval engine and whether it was developed entirely from scratch like a traditional database, the representative pointed out that apart from open-source algorithms such as HNSW, everything else was developed independently by the team: Baidu’s self-developed Puck algorithm, the storage format and engine, the schema and data types, the distributed architecture, the product management layer, and the console and frontend. Having an HNSW implementation is not enough; the algorithms must be combined with the other components of the database to form a complete retrieval engine that can provide vector search as a service.
Regarding the question of how long the project took from initiation to release and how many resources were invested, we learned that after in-depth research and determining the necessity of developing our own vector database, the team started the project initiation and system design work in August last year. Subsequently, in mid-September, the team entered a tense phase of development and completed the development and testing of the first version at the end of January this year, successfully launching a public beta test.
Our team consists of members from both cloud storage and database domains, boasting a balanced talent mix. Not only does the team include experts proficient in distributed storage engines, but it also has developers with deep understanding in the R&D of database products. Such a team structure ensures the comprehensiveness and depth of system and product development.
In the native design phase of the product, we endowed the system with robust vector processing capabilities, avoiding the optimization difficulties faced when adding vector plugins to existing systems afterwards. With the belief that a column-based engine is better suited for managing vector data than traditional row-based engines, we developed a specialized columnar storage engine. Our engineering optimization focuses on reducing non-floating point computation overhead, improving resource efficiency, and emphasizing memory overhead optimization to enhance the effective use of memory.
Specific engineering optimization practices, for example, using a columnar storage engine can provide more nuanced processing for vector data characteristics. For data containing multiple vector fields, different fields may originate from different content and use different embedding models; using a columnar storage engine can isolate these different vector fields’ data. For instance, when building indexes and retrieving data, the system only needs to focus on the original data and indexes related to the targeted fields, significantly avoiding the resource overhead of unrelated data. These are all optimizations implemented from the ground up, centered around vector data and specific scenarios, to ensure resources are fully utilized and performance advantages are maximized.
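The field-isolation benefit described above can be sketched with a toy comparison of row versus columnar layouts. The data and the `build_index` helper are illustrative; the point is that in the columnar layout, building an index over one vector field never touches the other fields.

```python
# Row layout: every row bundles all fields, so building an index for
# one vector field still has to read entire rows.
rows = [
    {"id": 1, "title_vec": [0.1, 0.2], "image_vec": [0.9, 0.8], "meta": "a"},
    {"id": 2, "title_vec": [0.3, 0.1], "image_vec": [0.7, 0.6], "meta": "b"},
]

# Columnar layout: each field lives in its own contiguous column. An
# index build over `title_vec` never reads `image_vec` or `meta`, which
# matters when the two vector fields come from different embedding models.
columns = {
    "id": [1, 2],
    "title_vec": [[0.1, 0.2], [0.3, 0.1]],
    "image_vec": [[0.9, 0.8], [0.7, 0.6]],
    "meta": ["a", "b"],
}

def build_index(column):
    """Stand-in for index construction: it receives only the one column
    it needs, so unrelated data causes no I/O or memory overhead."""
    return {i: vec for i, vec in enumerate(column)}

title_index = build_index(columns["title_vec"])
```

A real columnar engine adds compression, schema enforcement, and on-disk layout on top, but the access-pattern argument is the same.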
We recently completed a series of performance evaluations, including benchmark tests against a well-known open-source vector database. The results show that whether it is for datasets of 128, 768, or 960 dimensions, our database significantly outperforms the open-source system in retrieval performance at the same recall rate. In different scenarios, the performance improvement ranges from nearly double to tenfold, underscoring the significant performance advantage of our system.
From the perspective of recall, relaxing the recall requirement, for example from 99% to 95% and then to 90%, can significantly improve the retrieval performance of our database. Developing the database involved numerous technical challenges; vector retrieval and the in-memory engine were among the biggest, and overcoming them produced a series of solutions and some notable milestones.
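The recall rate discussed here is the standard recall@k metric for approximate search: the fraction of the true top-k neighbours that the approximate index actually returns. A minimal sketch:

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k neighbours found by approximate search."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Exact top-5 neighbours versus what an ANN index (e.g. HNSW with a
# smaller search beam) might return: 4 of 5 correct -> 80% recall.
exact = [7, 3, 9, 1, 5]
approx = [7, 3, 9, 1, 8]
r = recall_at_k(approx, exact)  # 0.8
```

Relaxing the target from 99% to 90% lets the index examine far fewer candidates per query, which is why throughput rises so sharply.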
A major challenge lies in the development of the storage engine, particularly the columnar storage part. After thorough investigation of several approaches, we chose Baidu’s in-house developed columnar storage format library, suitable for vector scenarios, which includes key features such as columnar storage, columnar compression, and KV separation. Even though it is just a format library and has not yet developed into a complete engine, it offers a solid foundation. Building on this foundation, we continued to develop to supplement the Schema system, in-memory tables, snapshots, compaction, query optimization, and anomaly recovery, transforming the format library into a complete storage engine.
To overcome these challenges, we first assembled team members knowledgeable about KV engines, broke down the big problem into several smaller ones, determined the interface definitions and hierarchies of various submodules, and then developed efficiently to achieve the set objectives. For the vector search engine, we started with basic open-source search libraries, connected to the backend storage engine, and constructed a comprehensive search system. This required transformations in multiple aspects, such as Schema, storage organization, index organization, and hybrid scalar-vector search mechanisms. Also, while building the search engine, high-level abstract design was necessary to adapt to different vector indices and search methods, ensuring universal support for various algorithms and scenarios.
For the aforementioned technical challenges, we emphasize mutual understanding and cooperation among team members, continue to iterate on solutions, and accelerate the R&D process by constructing exceptional scenarios to test the solution’s completeness and robustness. All these efforts converge at the milestone moments: at the end of October when the storage engine completed its basic functions and was put into operation; by the end of November when the indexing engine layer was fully connected, realizing vector retrieval; by the beginning of December, the basic performance tests were preliminarily completed, demonstrating superior performance.
The programming language used for development is C++ 17. In our test process, we mainly focus on two aspects: testing for cloud products and the database kernel level. At the database kernel level, testing is divided into basic functional testing and more cutting-edge chaos testing.
Function testing focuses on functionality and interface accuracy, ensuring the system can handle user requests correctly as expected. Chaos testing focuses on testing the robustness of the distributed system, especially the system’s high availability and reliability. Chaos testing introduces a variety of exceptional scenarios such as hardware, network, resource issues, and node failure to simulate extremely adverse operational conditions. If the database kernel still operates stably under these conditions, it indicates that the system has strong fault tolerance. Chaos testing, as a tradition in the field of Baidu’s distributed storage, has also been incorporated into regular testing procedures. In addition, distributed systems are also subject to formal verification to detect distributed logic issues at an early stage, and our vector database kernel has also gone through this verification process.
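The fault-injection idea behind chaos testing can be sketched as a storage wrapper that randomly fails, with the system under test expected to stay correct anyway. Class and function names here are invented for illustration; real chaos tooling injects faults at the process, disk, and network level.

```python
import random

class FlakyStorage:
    """Wraps an in-memory store and randomly injects I/O failures,
    the way a chaos test simulates disk or network faults."""

    def __init__(self, fail_rate, seed=42):
        self._rng = random.Random(seed)  # seeded for reproducible runs
        self._fail_rate = fail_rate
        self._data = {}

    def put(self, key, value):
        if self._rng.random() < self._fail_rate:
            raise IOError("injected fault")
        self._data[key] = value

def put_with_retry(storage, key, value, attempts=20):
    """The logic under test: it must complete correctly despite faults."""
    for _ in range(attempts):
        try:
            storage.put(key, value)
            return True
        except IOError:
            continue
    return False

storage = FlakyStorage(fail_rate=0.3)
ok = all(put_with_retry(storage, f"k{i}", i) for i in range(100))
```

If every write lands despite a 30% injected failure rate, the retry path is exercised continuously, which is exactly the kind of adverse-condition coverage chaos testing is after.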
Regarding whether the database supports multimodal and overall performance issues, the vector database provides support for multimodal functions at the database level, but does not directly process multimodal data—this processing is mostly done at the product or solution level. Our vector database can support various data types, enabling the native data of different modalities and their corresponding vector data to be stored and managed in an associated manner. When performing vector retrieval, it is possible to recall various modality native data related to the target vector, for further processing by higher-level programs.
End-to-end processing of multimodal data, such as text, images, and videos, requires front-end embedding capabilities and models to assist businesses with embedding. At the product level, we will integrate Baidu Intelligent Cloud’s Qianfan mega-model and provide high-level SDKs. By utilizing the Qianfan SDK to implement text and image embedding, we then use the vector database to accomplish associated storage and recall of these data.
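The associated storage and recall of multimodal data can be sketched as follows. The `embed_text` and `embed_image` stubs stand in for a real multimodal model (the interview mentions the Qianfan SDK; these names and the toy in-memory collection are illustrative only, not its API).

```python
# Stub embedding functions standing in for a real multimodal model.
def embed_text(text):
    return [len(text) % 7 / 7.0, text.count("a") / 10.0]

def embed_image(pixels):
    return [sum(pixels) / (255.0 * len(pixels)), len(pixels) / 100.0]

# One logical record associates each modality's raw data with its
# vector, so a vector hit can recall the original content.
collection = []

def insert(doc_id, text, pixels):
    collection.append({
        "id": doc_id,
        "text": text, "text_vec": embed_text(text),
        "image": pixels, "image_vec": embed_image(pixels),
    })

def search_by_text(query, k=1):
    qv = embed_text(query)
    dist = lambda r: sum((a - b) ** 2 for a, b in zip(qv, r["text_vec"]))
    return sorted(collection, key=dist)[:k]

insert("d1", "a cat", [10, 20, 30])
insert("d2", "finance report", [200, 210, 220])
hit = search_by_text("a cat")[0]  # the hit carries the associated raw image too
```

The key point is in the record layout: because native data and vectors of all modalities are stored in association, recalling by one modality hands the higher-level program everything it needs about the others.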
With end-to-end multimodal support, our system can not only handle various forms of data, such as images, videos, and text, but also conveniently complete the construction of the AI workflow. The continuous enrichment of toolkits such as High-Level SDKs will make users more adept when dealing with multimodal data.
As of now, the multimodal domain is still in an actively developing stage, although there haven’t been any particularly leading application cases in the industry yet. However, some clients are already attempting to apply multimodal technologies into practice, such as autonomous driving companies, which stand out for their significant needs in content semantic management and processing of images. On the other hand, enterprises in the entertainment and cultural sectors have a massive amount of image and video resources accompanied by textual information, forming typical multimodal application scenarios. Although there are not many examples of in-depth applications of multimodal technology, it is foreseeable that the rapid growth of this field will drive the gestation and emergence of a series of AI-native applications.
Not long ago, vector databases were a relatively quiet technical area, but with the popularity of large models they gradually gained favor with venture capital and at one point became very hot. In the second half of 2023, however, Retrieval-Augmented Generation (RAG) seems to have taken over the spotlight from vector databases. In discussing RAG and vector databases, we came to understand that RAG is not a newly emerged technology; rather, it is a technique that OpenAI has reignited. RAG does not sit at the same level as vector databases and should be seen as a distinct technique or solution.
Initially, the idea of RAG came from the search engine field, seeking to enhance results through internal enterprise search. OpenAI was among the first to use a vector database as a separate external tool. Later, to simplify the user experience, OpenAI built a plugin into its system, allowing users to easily upload document data and build a simplified RAG system. The advantage of this approach is that it helps users handle document data and create demos, and it is especially suited for small teams rapidly building product prototypes. Nevertheless, the plugin-based solution has limitations, such as a cap on the number of documents that can be uploaded and a lack of effective permission management, which are insufficient for enterprise-level applications. Moreover, high cost limits document processing and indexing customization.
Taking the long view, we believe RAG (Retrieval-Augmented Generation) systems will shine for larger enterprises and institutions, especially those whose internal data cannot be reached by large models. Such internal data must often adhere to strict permission management and cannot be used for model training. Over time, large models that successfully integrate with enterprise-level vector databases and continue to develop will win the future.
Vector databases required by enterprises need not only excellent extensibility but also comprehensive security and permission management, distributed performance, and low operating costs to meet the varied needs of enterprise customers. Although large models are improving daily and can process and learn ever more information, RAG remains crucial when dealing with proprietary enterprise data. OpenAI’s promotion of the RAG concept certainly has value, but the truly significant application scenarios for RAG lie in enterprise data. It is precisely when an enterprise possesses rare and unique data that a RAG system becomes indispensable.
Regarding the open-source program for vector databases, we are not in a hurry to push forward at the moment. Given that both big data models and vector databases are evolving rapidly, we believe that the most critical task is to continue building a stronger product capability. Open sourcing allows more people to use the product with a lower threshold, but this purpose can already be achieved through cloud services or other means. Currently, we have made our cloud services freely available to the public, and everyone is welcome to visit the official website to experience VectorDB. In the future, we might consider allowing customers to deploy our products in their private environments, which is also part of our plans.
Regarding the team’s work status and moments that made a deep impression on the members, our team indeed experienced many profound moments throughout the R&D process. Firstly, all team members have shown extremely high morale and confidence. Each of us is excited to participate in such an innovative project and feels an urgency to push it forward. Although the workload is heavy, everyone strives to keep pace with few complaints. Secondly, members understand and are receptive to each other’s technical approaches. Some of our members have deep experience in distributed systems, some are experts in storage engines, others focus on vector indexing and retrieval, and still others have rich backgrounds in product management and console development. Everyone actively learns each other’s expertise, which has deepened the team’s overall understanding of the system and the product and greatly improved the efficiency of problem-solving.
In the fields of cloud computing and artificial intelligence, the great opportunity that is unique to the next few years is undoubtedly the emergence of large models. This phenomenon is destined to drive the leap in development of the vector database field. For every product and technical team member involved, this is not only exciting but also a precious opportunity. It is heartening that our product has recently started public testing, and according to preliminary market feedback, user response is extremely positive.
In a very short period of time, a large number of users have already activated free clusters for actual use and testing, and the weekly user growth data shows an astonishing speed. Users not only propose various new requirements after using the service but also provide extremely positive feedback. Seeing the rapid increase in the number of users, receiving wide recognition, and getting feedback on all sorts of questions and demands are all extremely exhilarating and valuable experiences for us.
About the continuous growth of future market demand for vector databases, fundamentally, the types of data that enterprises own will directly determine the potential emerging application scenarios in the future. For example, the ride-hailing service DiDi was born based on mobile location data. In other words, the core value of data lies in its potential space for innovation. An enterprise’s innovation capacity is closely related to the types of data it possesses; different types of data empower enterprises to hatch various applications, and whether an enterprise can use vector databases to develop applications depends entirely on the kind of data it has.
Today, enterprises span many industries and hold diverse types of data, and we believe this is why large models have the ability to restructure all existing systems. Vector databases will undoubtedly enhance and restructure existing systems such as office automation, autonomous-driving image management, sales processes, ERP, and even financial systems. At the same time, AI-native applications may well exceed our current imagination: just as location data spawned new businesses, new intelligent agents will seek knowledge, whether from large models or from enterprises’ internal proprietary knowledge bases, and data will therefore become a valuable source of knowledge for future intelligent agents.
Nowadays, the challenge we face is that a large amount of valuable document data, such as PDF, DOC, and PPT files, is still stored in an unstructured form on personal computers or the cloud. Past technological limitations have failed to fully tap into and manage the potential value of these document contents. However, with the breakthrough of large model technology, we now have the capability to conduct in-depth mining and management at the semantic level, addressing the missing links in content management, and refocusing on the value of these document contents.
In today’s era of information explosion, businesses urgently need to extract valid information from a large volume of unstructured files. To this end, we can leverage technology for deep text content processing, involving analysis, segmentation, vectorization, building keywords and semantic indices, extracting metadata, and more. Furthermore, with specific technologies such as multimodal technology, we can build connections between different types of content and centralize this information in vector databases. This not only supports content applications, such as the popular RAG, but also helps various enterprises to unearth the value of unstructured files, enhance existing business efficiency, and explore new business opportunities.
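The processing chain described above (analysis, segmentation, vectorization, keyword and semantic indexing, metadata extraction) can be sketched end to end. The `chunk`, `keywords`, and `embed` helpers are deliberately naive stand-ins for real segmenters, extractors, and embedding models.

```python
import re

def chunk(text, size=40):
    """Naive fixed-size segmentation; production pipelines split on
    semantic boundaries such as sentences or sections."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def keywords(text):
    """Toy keyword extraction: lowercase tokens longer than 3 chars."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if len(t) > 3}

def embed(text):
    """Stub standing in for a real embedding model."""
    return [len(text) / 100.0, text.count("e") / 10.0]

def index_document(doc_id, text):
    """Analyse -> segment -> vectorise -> build keyword and vector
    entries plus metadata for each chunk."""
    entries = []
    for i, c in enumerate(chunk(text)):
        entries.append({
            "doc_id": doc_id, "chunk_no": i,
            "text": c,
            "keywords": keywords(c),   # feeds the full-text/keyword index
            "vector": embed(c),        # feeds the semantic index
            "meta": {"source": doc_id, "length": len(c)},
        })
    return entries

entries = index_document("report.pdf", "Quarterly revenue grew strongly. " * 3)
```

Each chunk ends up retrievable by keyword and by semantic similarity, with metadata linking it back to its source file, which is precisely what RAG-style applications consume.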
The future battleground for document management will likely shift from unstructured object storage to more semi-structured database systems, especially vector database systems. In this field, we see great potential opportunities.
When facing the upcoming challenges and opportunities, we will focus on three main directions for technical research and deepening:
- Support for RAG: We anticipate that RAG will become a key point. To optimize support for RAG, vector databases will need to provide not only efficient vector retrieval capabilities but also integrate key technologies such as full-text search, multi-route retrieval recall, and hybrid sorting. These are important areas for technological innovation this year.
- More efficient, cost-effective solutions: Many customers are extremely sensitive to costs. In pursuit of cost reduction and efficiency enhancement, we will devote ourselves to developing more cost-effective vector indexing and retrieval mechanisms and exploring intelligent capabilities to further improve cost-effectiveness.
- Development of AI application suites: AI applications based on vector databases will be another research focus. Starting with RAG and multimodal technologies, we will develop full process workflow suites, significantly reducing the difficulty and threshold of application development, and improving development efficiency.
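The multi-route retrieval and hybrid sorting named in the first direction above are often combined with a rank-fusion step. One widely used technique is reciprocal rank fusion (RRF); a minimal sketch with made-up document IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists (e.g. a full-text route and a vector
    route) into one: each route contributes 1 / (k + rank) per document,
    so documents found by multiple routes rise to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fulltext_hits = ["d3", "d1", "d7"]   # keyword / full-text route
vector_hits   = ["d1", "d5", "d3"]   # semantic vector route
fused = reciprocal_rank_fusion([fulltext_hits, vector_hits])
# d1 and d3 appear in both routes, so they outrank the single-route hits.
```

The constant `k` damps the influence of any single route’s top ranks; 60 is a conventional default, not a tuned value.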
Moreover, we will also pay attention to the integration of internal and external ecosystems, including but not limited to the application of technologies in Baidu’s internal services such as cloud storage, document libraries, etc. Through integrated applications, our products will become more comprehensive. Our strong ecosystem will be key to the success of the product.
As for the future, the technical development of vector databases will unfold in the following directions:
- Continuously strengthen basic capabilities such as large-scale, high-performance, and low-cost to meet diverse customer needs.
- Optimize key technologies that support AI applications, improve retrieval efficiency and quality, and provide the necessary support for content management and knowledge base construction.
- Provide rich enterprise-level features to meet needs for secondary development, cost budgeting, security auditing, and integration with existing business architectures.
- Integrate the large model ecosystem to provide customers with end-to-end tools and solutions.
- Build a semantic-based content management platform to propel the content management industry into a new era of AI nativity.