
Sam Altman redefines AI battle lines with the launch of Sora

Sam Altman, CEO of OpenAI, has launched Sora, the company's text-to-video model, with access open to filmmakers, designers and visual artists

Sam Altman, CEO of OpenAI, took to the social media platform X to announce the launch of Sora, a text-to-video model. He said: “here is Sora, our video generation model. today we are starting red-teaming and offering access to a limited number of creators.”

Apart from generating a video from text instructions, the model can also take an existing image and generate a video from it, animating the image’s contents with accuracy and attention to detail. The model can also take an existing video and extend it or even fill in missing frames.

Commending the team on building the product, Altman added that they are focussed on teaching AI to understand and simulate the physical world in motion. OpenAI’s chief added that the goal is to train models that help people solve problems that require real-world interaction.

A platform for the creative industry

Access to the platform is currently being given to designers, filmmakers, and visual artists to gather feedback for advancing the model. Sora is a diffusion model: it generates a video by starting with one that looks like static noise, then gradually transforms it by removing the noise over multiple steps.
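
To picture the idea in rough terms, here is a minimal sketch of diffusion-style sampling in Python. The `denoiser` network, step count, and tensor shape are all hypothetical stand-ins; Sora’s actual architecture and sampler have not been published.

```python
import torch

# Toy illustration of iterative denoising: start from pure noise and
# progressively remove the noise the model predicts. The `denoiser`
# callable and all sizes here are assumptions, not Sora's real setup.
def sample_video(denoiser, steps=50, shape=(16, 3, 64, 64)):
    x = torch.randn(shape)  # (frames, channels, height, width) of static noise
    for t in reversed(range(steps)):
        predicted_noise = denoiser(x, t)   # model estimates the noise at step t
        x = x - predicted_noise / steps    # peel a fraction of it away
    return x  # after many steps, a coherent video emerges
```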

“By giving the model foresight of many frames at a time, we’ve solved the challenging problem of making sure a subject stays the same even when it goes out of view temporarily. Similar to GPT models, Sora uses a transformer architecture, unlocking superior scaling performance,” stated the company blog.

Sora represents videos and images as collections of smaller units of data called patches, each of which is akin to a token in GPT. By unifying how the data is represented, OpenAI can train diffusion transformers on a wider range of visual data, spanning different durations, aspect ratios, and resolutions.
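
As a rough illustration of the patch idea (the function name, patch sizes, and tensor layout below are assumptions, not Sora’s published details), a video tensor can be sliced into fixed-size spacetime blocks and flattened into a token-like sequence:

```python
import torch

# Illustrative only: split a video tensor into spacetime "patches",
# each playing a role analogous to a token in GPT.
def video_to_patches(video, pt=4, ph=16, pw=16):
    # video: (frames, channels, height, width)
    f, c, h, w = video.shape
    patches = (
        video.reshape(f // pt, pt, c, h // ph, ph, w // pw, pw)
             .permute(0, 3, 5, 1, 2, 4, 6)   # group by patch position
             .reshape(-1, pt * c * ph * pw)  # one flat vector per patch
    )
    return patches  # (num_patches, patch_dim), a token-like sequence

video = torch.randn(16, 3, 256, 256)
tokens = video_to_patches(video)
print(tokens.shape)  # torch.Size([1024, 3072])
```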

Building on the DALL-E and GPT models, Sora uses the recaptioning technique from DALL-E 3, generating highly descriptive captions for the visual training data. As a result, the model can follow the user’s text instructions in the generated video more faithfully.
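
Conceptually, recaptioning is a preprocessing pass over the training set; the `captioner` object and its `describe` method below are hypothetical placeholders for whatever captioning model OpenAI uses internally.

```python
# Hypothetical sketch of recaptioning: a captioning model rewrites short
# human labels into rich, descriptive captions before training.
def recaption_dataset(samples, captioner):
    recaptioned = []
    for video, short_label in samples:
        # Ask the captioner for a far more detailed description of the clip.
        detailed_caption = captioner.describe(video, hint=short_label)
        recaptioned.append((video, detailed_caption))
    return recaptioned
```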

“Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt,” stated the company’s blog note. The note added that Sora will first be made available to red teamers to assess critical areas for harms or risks.

“We’re sharing our research progress early to start working with and getting feedback from people outside of OpenAI and to give the public a sense of what AI capabilities are on the horizon,” added the blog.

Understanding the specifics

The interesting thing about the platform is that it can create specific types of motion, multiple characters, and accurate details of both the subject and the background. The model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world.

Armed with a deep understanding of language, Sora can interpret prompts and generate compelling characters that express vivid emotions. It can also create several shots within a single generated video while accurately maintaining characters and visual style.

However, there are weaknesses. The model may still struggle with simulating the physics of a complex scene and may find it difficult to understand specific causes and effects. For example, if a person takes a bite of a cookie, the bite marks may be missing.

The model also struggles with the spatial details of a prompt, for example mixing up left and right, and with precise descriptions of events that unfold over time, such as following a specific camera trajectory.

“We’ll be taking several important safety steps ahead of making Sora available in OpenAI’s products. We are working with red teamers — domain experts in areas like misinformation, hateful content, and bias — who will be adversarially testing the model.”

The world of AI battles

The digital landscape is evolving at an unprecedented pace. Artificial Intelligence (AI), the metaverse, and Web3 have captured the imagination of global leaders and tech titans alike. Among the cacophony of buzzwords echoing through boardrooms and virtual spaces, two resound with particular fervour: Generative AI (GenAI) and the metaverse.

Notably, this surge of interest extends its tendrils into the entertainment industry. Recently, Camb.AI, a dubbing platform based in the United Arab Emirates, secured a hefty $4 million in seed funding. Its services were prominently featured during coverage of the 2024 Australian Open tennis tournament and in the spine-chilling film “Three”.

By 2023, a deluge of chatbots had inundated the market: Google’s Bard (now Gemini), Meta AI, Elon Musk’s Grok, and Samsung’s Gauss, each vying for supremacy. Fast forward to 2024, and the battleground shifts to industries and enterprises, with heavyweight contenders unveiling their arsenals.

Google debuts Gemini as a standalone Android app, while OpenAI teases GPT-5 with enhanced features. Amazon stealthily introduces its AI chatbot, Nvidia unveils Chat with RTX, and OpenAI board chair Bret Taylor launches Sierra. Meanwhile, PayPal-backed Rasa secures $30 million in Series C funding to bolster its enterprise-focused conversational AI product.

Addressing safety concerns

The OpenAI blog added that the company is also building tools to help detect misleading content, such as a detection classifier that can tell whether a video was generated by Sora. The team also plans to include C2PA metadata in the future if the model is deployed in an OpenAI product.
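
In spirit, provenance metadata of this kind binds a piece of content to a claim about how it was produced. The toy manifest below only gestures at the idea; it is not the real C2PA specification or OpenAI’s implementation.

```python
import hashlib
import json

# Toy provenance record in the spirit of C2PA: bind a content hash to a
# claim about how the content was generated. Field names are illustrative.
def make_provenance_manifest(video_bytes: bytes, generator: str = "Sora (assumed)"):
    return {
        "claim_generator": generator,
        "assertion": "ai_generated",
        "content_sha256": hashlib.sha256(video_bytes).hexdigest(),
    }

print(json.dumps(make_provenance_manifest(b"<video bytes>"), indent=2))
```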

As the battle for AI intensifies, one of the biggest concerns has been security and bias. With these tools already used by several million people, the security concerns are real, and OpenAI has found itself in the eye of the storm several times.

A report by Menlo Security stated that the concerns are only growing, especially after the OpenAI data breach in March last year, when the data of over 1.2 million subscribers was exposed. This has led to the creation of several newer platforms, like Nvidia’s Chat with RTX, which focusses primarily on privacy by keeping data restricted to the user’s environment.

To address this, Altman added that Sora leverages the existing safety methods that OpenAI built for its products that use DALL-E 3. Currently, a text classifier checks and immediately rejects prompts involving extreme violence, celebrity likeness, the intellectual property of others, sexual content and more.
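
As a crude stand-in for such a classifier (OpenAI’s real one is a learned model, not a keyword list, and its categories and terms are not public), prompt screening might look like this:

```python
# Crude keyword stand-in for OpenAI's (unpublished) text classifier.
# Categories mirror those named in the article; terms are examples only.
BLOCKED_TERMS = {
    "violence": ["gore", "torture"],
    "sexual_content": ["explicit"],
    "celebrity_likeness": ["as <celebrity name>"],
}

def should_reject(prompt: str) -> bool:
    """Return True if the prompt matches any blocked category."""
    lowered = prompt.lower()
    return any(
        term in lowered
        for terms in BLOCKED_TERMS.values()
        for term in terms
    )

print(should_reject("a quiet beach at sunset"))  # False
```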

“We’ve also developed robust image classifiers that are used to review the frames of every video generated to help ensure that it adheres to our usage policies before it’s shown to the user,” stated the blog.
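
The per-frame review described in the quote can be pictured as a gate between generation and delivery; `frame_classifier` below is a hypothetical object standing in for OpenAI’s internal classifiers.

```python
# Hypothetical gate: every generated frame must pass policy review
# before the finished video is shown to the user.
def video_passes_review(frames, frame_classifier) -> bool:
    return all(not frame_classifier.flags_violation(frame) for frame in frames)
```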

The team is already engaging with educators, artists, and policymakers around the world to understand core concerns and identify positive use cases.

“Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI,” the note said.