As large models see ever wider application, multimodal technology is increasingly regarded as a major trend in future technological development. Despite its vast prospects, the field still faces numerous technical challenges, including data fusion, model integration, and cross-modal information fusion, and addressing them is key to further progress.
At the AICon Global Conference on AI Development and Application and the accompanying Large Model Application Ecosystem Exhibition, the track on multimodal technology and applications is particularly eye-catching. Meng Erli, technical director of the machine learning team at Xiaomi AI Lab, personally selected four experts in the field to share their insights and experience.
How speech foundation models advance sound understanding and generation
At the summit, Wang Yujun will share his in-depth insights into speech foundation models. As the leader of Xiaomi’s speech technology and head of the acoustics and speech direction at the AI Lab, Wang Yujun has more than 20 years of experience in sound research. Under his leadership, Xiaomi’s acoustics and speech team covers a wide range of areas, including speech understanding, generation, and evaluation. The team has delivered notable results in both service quality and technological innovation: its speech services handle 1.26 billion requests per day, and it has won top places in numerous acoustics and speech competitions.
In his talk, Wang Yujun will focus on how Xiaomi’s speech foundation model has evolved and how it improves support for sound understanding and generation on both the encoder and decoder sides. From his presentation, the audience will gain a deeper understanding of how speech foundation models can advance acoustics, as well as the field’s current challenges and future directions.
Towards the practical application of large multimodal models
We then invited Yao Yuan, a researcher at Mianbi Intelligence and a postdoctoral fellow in the Department of Computer Science at Tsinghua University, who has extensive experience in research on large multimodal models. His presentation focused on the latest progress in putting large multimodal models into practical use and on his team’s recent results.
He analyzed the many challenges that large multimodal models face in practical deployment, such as enormous parameter scales, high computational costs, limited image resolution, and constrained language processing capabilities. Yao Yuan then shared his team’s latest explorations and introduced breakthroughs in areas such as building efficient on-device base models like MiniCPM-V 2.0, high-resolution image encoding, cross-lingual generalization, and reinforcement learning from multimodal human feedback.
With 2.8 billion parameters, MiniCPM-V 2.0 not only outperforms other mainstream models in overall performance on common benchmarks, but also excels at OCR, high-resolution image processing, and bilingual support, and has made notable progress in trustworthy behavior. The model has attracted wide attention and recognition on the international open-source platform Hugging Face.
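For readers who want to try the model themselves, below is a minimal sketch of loading it through the Hugging Face transformers library. The repo id openbmb/MiniCPM-V-2 and the chat() call reflect the model’s public release rather than anything stated in this article, so treat them as assumptions and consult the model card for the authoritative API.

```python
# Minimal sketch (not from the article): trying MiniCPM-V 2.0 via Hugging Face transformers.
# The repo id "openbmb/MiniCPM-V-2" and the chat() interface are assumptions based on the
# model's open-source release; check the model card for the exact, current API.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-V-2"  # assumed Hugging Face repo id
model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("document_photo.jpg").convert("RGB")  # any local image file
msgs = [{"role": "user", "content": "Please transcribe the text in this image."}]  # exercises OCR

# chat() is inference code shipped with the model repository (via trust_remote_code),
# not part of the core transformers API.
answer, _, _ = model.chat(
    image=image, msgs=msgs, context=None, tokenizer=tokenizer,
    sampling=True, temperature=0.7
)
print(answer)
```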
Through Yao Yuan’s presentation, the audience can gain a deeper understanding of the problems and challenges that large multimodal models face in practical applications, along with the corresponding solutions and technical approaches, helping them apply these technologies more effectively in practice.
In the financial sector, frontline experts are continuously deepening the research and application of multimodal large models. Among them is Zhou Siji, Director of Financial Solutions and Head of Financial Large Models at Volcano Engine. Drawing on her in-depth research and practice in natural language processing, machine learning, computer vision, and related fields, she is dedicated to applying artificial intelligence in the financial industry.
In her talk, Zhou noted that large models are shifting from single modality to multimodality, which will bring disruptive productivity tools to different industries and may fundamentally change business models. For the financial industry, multimodal large models can process text, data, tables, and visual information simultaneously, allowing them to comprehensively understand professional financial documents and significantly improve the effectiveness of fintech applications. She also analyzed, from multiple perspectives, the technology’s future directions, application scenarios, and expectations for its practical implementation in the financial field.
Meanwhile, in the field of content understanding and generation, Li Yan, head of Kuaishou’s “Ketu” large model team and a Ph.D. graduate of the Institute of Computing Technology, Chinese Academy of Sciences, has drawn on more than ten years of experience to develop and deploy Kuaishou’s self-developed text-to-image large model, and in this session he walks through the model’s development history and technical path.
Dr. Li Yan also analyzed the technical details of the text-to-image large model and its specific applications in the Kuaishou app. His talk offered a comprehensive view, from developing large models to evaluating their effectiveness, selecting the most suitable scenarios for deployment, and avoiding potential application risks, giving the audience a clear picture of the text-to-image model’s value in real operations and of efficient ways to achieve business goals.
The future direction and potential of artificial intelligence and large model applications will be explored further at the upcoming AICon Global Conference on AI Development and Application.