Mira Murati Unclear on Sora’s Data Source

Dren.H

2 months ago

OpenAI CTO Mira Murati is unclear about the data sources for the Sora model, amid legal scrutiny over AI training practices.

OpenAI CTO Mira Murati says Sora was trained on publicly available and licensed data pic.twitter.com/rf7pZ0ZX00
— Tsarathustra (@tsarnick) March 13, 2024

In a recent interview with The Wall Street Journal, OpenAI’s Chief Technology Officer Mira Murati was asked about the data used to train Sora, the company’s newest AI venture. The forthcoming video generation tool, which creates videos from text prompts, is at the center of questions about where its training data came from. Murati said Sora was trained on a mix of publicly available and licensed data, but she offered no specifics, particularly on whether that data included content from major platforms such as YouTube, Instagram, or Facebook.

OpenAI, a major player in the AI sector valued at $80 billion, trains its models on vast datasets. These datasets are crucial for teaching AI systems to recognize patterns, interpret language, or predict outcomes. Murati has been instrumental in leading key projects such as the DALL-E 3 image generator, the Whisper speech-recognition software, and the ChatGPT chatbot, and she briefly served as interim CEO after Sam Altman’s ouster in November 2023.

The Complex Web of AI Training Data

The conversation then turned to OpenAI’s partnership with Shutterstock and whether Shutterstock’s library contributed to Sora’s training. Murati confirmed Shutterstock’s involvement, but the discussion of data specifics ended there, leaving the full composition of the training dataset undisclosed. This opacity feeds a broader debate about the ethical and legal frameworks that govern how AI models are trained.

OpenAI’s data practices have drawn scrutiny before. In July 2023, authors Sarah Silverman, Richard Kadrey, and Christopher Golden sued the company, arguing that ChatGPT’s content generation infringed on copyrighted material. Similarly, The New York Times sued OpenAI and Microsoft over the alleged unauthorized use of its journalism for AI training. Another lawsuit in California accused OpenAI of harvesting private online data without consent to refine ChatGPT’s responses.