The Sora team says the model is still too powerful to open up to the general public for now.

2024-05-14 10:41:40


The “Coronation of a New King” in Video Generation: Why Has Sora Become a Global Focus?
Since 2023, multimodal video generation technology has made significant breakthroughs, and the pace of development keeps accelerating. Most notably, in February 2024 OpenAI’s video generation model Sora attracted widespread global attention. The name Sora is derived from the Japanese “そら” (“sky”), symbolizing the vast expanse of the sky and the model’s boundless creative potential.
Compared with predecessors such as Runway, Pika, and VideoPoet, Sora has clear advantages in the quality of its generated video, and its outstanding performance has made it one of the most prominent models in the field. As one professional put it: “Sora’s debut came ahead of our expectations; OpenAI delivered a pleasant surprise ahead of schedule.”
Although Sora does not introduce a completely new theoretical framework, it achieves high overall performance by efficiently integrating existing techniques. Analysts note that the DiT architecture Sora adopts builds on the Diffusion Transformer concept from a paper presented at ICCV 2023.
In its concrete technical approach, Sora departs from earlier models. Previous models were built on diffusion techniques: they progressively add Gaussian noise to the data and then denoise it frame by frame through a U-Net network. Lacking long-range temporal connections, such models could typically generate only a few seconds of video and suffered from severe frame-jumping. Sora instead replaces the U-Net with a Transformer that operates on the video as a whole, maintaining coherence across the timeline, adapting to different aspect ratios, and generating longer, higher-quality videos.
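OpenAI’s technical report describes Sora as representing compressed video as “spacetime patches,” which a Transformer then processes as a token sequence, so each token spans several frames rather than belonging to a single one. A minimal NumPy sketch of that patchify step (the patch sizes and latent shape here are illustrative; Sora’s actual values are not public):

```python
import numpy as np

def patchify_video(latent, pt=2, ph=4, pw=4):
    """Split a video latent of shape (T, H, W, C) into spacetime patch tokens.

    Each token covers pt frames x ph x pw spatial cells, so a single token
    carries information across time -- unlike per-frame U-Net denoising.
    Patch sizes are hypothetical; this only illustrates the idea.
    """
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # Group each axis into (number of patches, patch size).
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Bring the patch-grid axes to the front: (T', H', W', pt, ph, pw, C).
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # Flatten into one row (token) per spacetime patch.
    return x.reshape(-1, pt * ph * pw * C)

# A 16-frame, 32x32 latent with 4 channels becomes a sequence of tokens:
latent = np.random.randn(16, 32, 32, 4)
tokens = patchify_video(latent)
print(tokens.shape)  # (512, 128): 8*8*8 patches, each 2*4*4*4 values
```

Because the token sequence mixes time and space, self-attention can relate a patch in the first second to one in the last, which is one common explanation for the improved temporal coherence.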
It is reported that Sora can now generate videos up to one minute long, and theoretically, it can produce even longer video content. However, doing so brings more uncertainties and requires significantly more computing resources. An industry expert pointed out: “It’s relatively simple to innovate from zero to one, but further improvements require a qualitative leap and the difficulty increases significantly.”
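The compute cost the expert alludes to can be made concrete with a back-of-envelope calculation: token count grows linearly with duration, while self-attention cost grows roughly quadratically with token count. All numbers below (frame rate, patch sizes, spatial grid) are hypothetical, chosen only to show the scaling:

```python
def num_tokens(seconds, fps=24, pt=2, spatial_patches=64):
    """Rough spacetime-token count for a clip (all parameters hypothetical)."""
    frames = seconds * fps
    # Temporal patches (pt frames each) times spatial patches per slice.
    return (frames // pt) * spatial_patches

t10 = num_tokens(10)  # a 10-second clip
t60 = num_tokens(60)  # a one-minute clip
print(t60 / t10)        # 6x the tokens...
print((t60 / t10) ** 2) # ...but ~36x the self-attention cost
```

Under these assumptions, a one-minute video has 6x the tokens of a 10-second clip but roughly 36x the attention compute, which is consistent with the claim that longer generation demands significantly more resources.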

In the current field of video generation, maintaining consistency between characters and scenes is an important and challenging task. In a generated continuous video sequence, for example, the appearance of the protagonist and supporting characters, such as face shape, skin tone, eye size, and the position of moles, should remain unchanged. Yet models like Sora currently cannot guarantee full character consistency in every frame of a generated video. Scene consistency poses similar problems: different camera angles may introduce changes, and maintaining the continuity and authenticity of scenes is crucial. Solving these problems would bring results closer to what film studios expect.

Furthermore, opinions within the industry vary on how Sora implements a “world model.” OpenAI even claims that scaling video generation models like Sora is a path toward building a universal simulator of the physical world. This statement has not been accepted by all experts, however. Turing Award winner Yann LeCun, for example, has argued that Sora does not truly understand the physical world and remains skeptical of its attempt at a “world model.”

If Sora can show a higher level of understanding when processing video content, for example keeping the plot and its logic consistent when extending or filling in a video segment, then it will have made a significant leap in simulating the real world. This would mean Sora understands the world not just visually but can deeply grasp video content, including physical laws and social norms. Just as a film’s plot resonates from beginning to end, if Sora can achieve a similar effect in video generation, it would indicate a longer causal chain of understanding, an important milestone in its evolution.

Sora is currently in its early stages of development, reminiscent of the early days of GPT-3. Although it has demonstrated certain capabilities, it is still far from the maturity of ChatGPT. Nevertheless, Sora is progressing rapidly and is expected to keep evolving quickly, with many breakthroughs anticipated over the course of 2024.

At present, Sora is still not open for public use. During a recent interview, core team members of Sora mentioned that given its powerful capabilities, it is not yet suitable for immediate use by the general public. OpenAI is actively collecting user feedback and engaging in extensive safety-related work. According to a previous statement by OpenAI’s CTO Mira Murati, Sora could be publicly tested as early as this year.

The foundational model Sora will have a comprehensive impact on all industries, especially in film and television, e-commerce, and gaming, where Sora will spark endless new creativity. The film and television industry will particularly benefit from Sora’s technology, as this imaginative model has an incredibly broad scope of practical applications in the industry.

In the film production field, the workflow is complex and highly interconnected. Once a screenplay is completed by the writer, the project moves into the stage of finding a suitable director. At this point, whether it’s in Hollywood or a well-known domestic production studio, production teams commonly make a 30 to 40-minute sample reel to showcase the core content of the movie. This refined edit must demonstrate the plot, characters, environment, and special effects, and it also serves to prove the project’s innovation and market potential to potential investors. Formal investment only begins once investors are convinced of the project.

But this step is not cheap, with costs running from 10,000 to 20,000 yuan per minute of footage. With video generation technologies like Sora, this cost is expected to drop sharply, possibly to just a few thousand yuan per minute. Sora can also eliminate the laborious processes of scene setup, special-effects production, and post-production in filmmaking, greatly improving efficiency.

“In working with the Beijing film and television industry, I noticed that teams often encountered scenes that could only be completed with 3D post-production, such as the Big Bang or orbital shots of the Earth and the Moon,” one expert recalled. The production cost of such scenes is extremely high: at the common film and television frame rate of 25 frames per second, a scene lasting 2 to 3 seconds needs about 70 frames, so even very brief shots can cost thousands or even tens of thousands of yuan. The expert noted that video generation models like Sora offer a new means of presentation for scenes that were costly and difficult to capture through traditional filming.
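The frame-count arithmetic quoted above checks out; at 25 fps, a 2 to 3 second shot falls in the range that brackets the cited figure of roughly 70 frames:

```python
# At the common film/TV frame rate of 25 fps, a 2-3 second shot is
# 50-75 frames, consistent with the "about 70 frames" cited above.
fps = 25
frames = {seconds: seconds * fps for seconds in (2, 3)}
print(frames)  # {2: 50, 3: 75}
```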

The impact of Sora is not limited to the field of film and television production, but is also significant in traditional advertising, gaming, and streaming industries. The application of AI technology allows filming work that used to take a long time to complete in just a few minutes, and significantly reduces costs.

In the e-commerce sector, Sora has also brought transformation to video content generation. Traditional production of promotional videos involves selecting models, building sets, filming, and post-production. With Sora, merchants need only submit a text description or pictures to quickly obtain high-fidelity video content, greatly improving the efficiency of marketing-material production. Merchants can also use Sora to create different scene experiences for products, or to show products in various spatial layouts, further stimulating consumers’ desire to buy.

Although Sora has made breakthroughs in video content production, some challenges must be overcome before it can truly be applied in the e-commerce market. However exciting its demonstrations, Sora still lacks precise controllability and intuitive ways to interact with it.

E-commerce is exploring new approaches to applying AI-generated content. Merchants hope to create video content around their existing products, but current generation models like Sora cannot yet naturally integrate a product into a video and present it in a realistic scene that attracts potential customers. While Sora has demonstrated powerful creative capabilities, there is still considerable room, and need, to optimize how such technologies integrate with existing e-commerce platforms to generate content for specific needs.

Meanwhile, to expand the application range of Sora technology in the e-commerce industry, it will require further research and exploration of new methods to achieve the perfect integration of products and videos.

In the gaming industry, AIGC technology was put into use relatively early. Sora-style technology could, for example, generate in-game seasonal-transition videos, compressing the changing of scenery into a few seconds to enrich the game’s visuals while lowering development costs. The design of game characters’ clothing and accessories can also be realized through Sora, including simulating actions such as running and flying. Traditional modeling and motion synthesis can now be accomplished with AI assistance, combined with manual adjustment, for more vivid and realistic results.

Other important aspects of game production, such as lighting effects and material simulation, can also be handled with AI. For example, the different draping, falling speeds, and sounds of garments made of various materials as a character falls from a height, and the motion design of four-legged animals, are complex tasks that AI can tackle. Common animals such as cats and dogs, which raise no complex intellectual-property issues, are especially suitable as experimental cases for developing solutions.

As for image generation models, their greater technological maturity has made their applications more extensive and in-depth across many industries. Tools like Stable Diffusion can already generate large volumes of images, such as in-game islands or home-base design drawings, to inspire art designers. Although these generated images and videos cannot be used directly, they give designers highly valuable creative inspiration and reference.

In e-commerce, image generation models are already widely used and maturing. WeShop, a smart product-image generation tool built on the Stable Diffusion model, is representative of this growth; its core team comes from the popular e-commerce platform Mushroom Street. WeShop serves two main types of customers: supply-chain factory owners, who use the tool to turn product photos into images with various models and backgrounds; and e-commerce businesses expanding into international markets, who use it to convert domestic product images into versions with models better suited to foreign consumers.

Discussing the technology’s future, industry insiders believe the foundational models in image generation will take further leaps. Development is trending toward unification, with a recognized and urgent need to introduce Diffusion Transformer (DiT) structures. Purely Transformer-based architectures are still being explored at this stage and have not yet revolutionized the state of the art. But with Sora proving the effectiveness of large-scale models, more resources are expected to flow into image generation, driving the field forward. Even though this shift has not yet become a focus of public attention, there is firm belief that the technology will soon achieve significant breakthroughs.

Innovative technologies like Sora not only have the potential to transform industries but also pose challenges to the job market. More and more practitioners are beginning to worry whether their jobs can withstand the impact of new technologies. In the film and television industry especially, there are discussions about the future of special-effects companies and whether directors and post-production workers face unemployment. Most experts remain optimistic, however, citing the development of CG technology as an example. Although CG initially worried animators, it did not lower the cost of film and television animation; instead, budgets rose together with the visual quality of the works. That change stimulated the creativity of artists and directors and raised the standard of the entire industry.

Facing technological change, we should adopt a positive attitude, embrace it, and look for new opportunities rather than indulge in blind worry and resistance. Those who cling to old ways of working and refuse to adapt may indeed run into problems. But new technologies have also opened broader markets, enhanced the industry’s development potential, and brought more opportunities to experiment and innovate. “If you insist on seeing yourself as someone new technology will replace, that outcome may indeed become hard to avoid,” as many in the industry put it.

Standing at the forefront of technological change, we can happily observe that as new technologies continue to emerge, the labor force once bound by old work models is gradually being liberated. This liberation confronts us with new opportunities and challenges. For tools like Sora, their current role is not to replace humans but to serve as aids, helping to improve our work efficiency.

Take film production as an example: even with language models like ChatGPT assisting screenwriters, human screenwriters still play a vital role in crafting storylines. In storyboard planning, the director remains indispensable, for only the director can breathe life into a script through personal emotion and soul. Actors are equally important, since their performances give a film its unique appeal.

So what has Sora actually changed? It has indeed made video and film production more efficient, reduced costs, and may partially replace certain special-effects processes. More often, though, Sora offers a way to create demos, supplying ideas for subsequent effects production rather than replacing it outright. Work that used to take ten days can now be done in a few. Still, we should be clear that any technology is just a tool whose ultimate purpose is to serve people, so absolute replacement does not exist; the more critical role of tools is to cut costs and raise efficiency.

In the rapid evolution and innovation of AI technology over the past two years, we stand on the threshold of a new era. We must maintain continuous attention to the development of AI technology and be ready to welcome new challenges and changes at any time. By learning and adapting to new technologies, we can ensure that we hold our ground in the tide of AI technology.

As Wu Haibo said, “Our current goal is to first engage in this transformation, transforming ourselves into an AI Native company.” Embracing an entrepreneurial mindset to disrupt traditional business models and exploring the boundless possibilities of the AI era with an open mind is the choice that must be made at present. Currently, although the specific impact of AI technology in the e-commerce sector has not been fully revealed, it is certain that AI technology will bring sector-wide innovation to the industry, from the intelligent enhancement of existing e-commerce platforms to the birth of an entirely new e-commerce ecosystem.

In the film and gaming industries, the transformative potential of artificial intelligence technology is gradually becoming apparent. What these fields urgently need now is a comprehensive, practical solution. Such an ideal toolkit should be able to convert text and hand-drawn manuscripts into a range of content: including 2D images, 3D models, character standees, skeletal rigging, and the automatic generation of motions.

From the perspective of game development, the emergence of such a tool will greatly improve the efficiency of development. Developers can automatically generate all the necessary resources for game production by simply inputting text descriptions and drafts, significantly reducing the tedious work during the development process.

Similarly, the film production field also needs such an all-in-one tool. This tool can directly parse the script content and generate video materials based on established plot development. This ensures not only the uniformity of the visual style and quality but also significantly improves the production efficiency and quality of the entire industry’s works.