
Twelve Labs Unveils Pegasus-1: A Video-To-Text Foundation Model

Sven

October 28th, 2023

~ 3 min read

Twelve Labs, a San Francisco Bay Area-based AI research and product company, has announced the release of its video-language foundation model, Pegasus-1. The new model reflects Twelve Labs' commitment to comprehensive video understanding and ships alongside a suite of APIs for a range of video understanding tasks. In this blog post, we delve into the Pegasus-1 technical report and explore the model's features, performance, and applications.

Product

Alongside Pegasus-1, Twelve Labs is releasing a suite of Video-to-Text APIs: the Gist API, the Summary API, and the Generate API. Each lets developers generate a specific kind of text output from video data with a single API call. The Gist API produces concise outputs such as titles and hashtags; the Summary API generates video summaries and highlights; and the experimental Generate API lets users prompt for customized formats and styles, from bullet points to creative lyrics. A sketch of what such calls might look like follows below.
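To make the shape of these calls concrete, here is a minimal sketch of requesting each output type over HTTP. The base URL, endpoint paths, payload fields, and the `video_id` value are illustrative assumptions inferred from the descriptions above, not confirmed API signatures; the official Twelve Labs API documentation is the authoritative reference. The sketch also assumes the video has already been uploaded and indexed, which is what makes each text output a single call.

```python
import requests

# NOTE: base URL, endpoint paths, and payload fields below are assumptions
# sketched from the blog's description, not taken from official API docs.
API_KEY = "your-api-key"
BASE_URL = "https://api.twelvelabs.io/v1.1"  # assumed base URL
HEADERS = {"x-api-key": API_KEY, "Content-Type": "application/json"}
VIDEO_ID = "abc123"  # hypothetical ID of a previously indexed video

# Gist API: concise outputs such as a title and hashtags.
gist = requests.post(
    f"{BASE_URL}/gist",
    headers=HEADERS,
    json={"video_id": VIDEO_ID, "types": ["title", "hashtag"]},
).json()

# Summary API: a summary (or highlights) of the whole video.
summary = requests.post(
    f"{BASE_URL}/summarize",
    headers=HEADERS,
    json={"video_id": VIDEO_ID, "type": "summary"},
).json()

# Generate API (experimental): open-ended prompting for custom formats.
lyrics = requests.post(
    f"{BASE_URL}/generate",
    headers=HEADERS,
    json={"video_id": VIDEO_ID, "prompt": "Write song lyrics inspired by this video."},
).json()

print(gist, summary, lyrics, sep="\n")
```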

Product and Research Philosophy

Unlike other approaches that frame video understanding as an image or speech understanding problem, Twelve Labs adopts a "Video First" strategy. They believe that video understanding requires a unique approach that combines visual perception with sequential and contextual nuances from audio and text. Four core principles guide their philosophy: Efficient Long-form Video Processing, Multimodal Understanding, Video-native Embeddings, and Deep Alignment between Video and Language Embeddings.

The New Model

Pegasus-1 has approximately 80 billion parameters and consists of three components: a video encoder, a video-language alignment model, and a language decoder. The components are trained jointly to generate video-native embeddings, align those embeddings with language embeddings, and decode the result into human-readable text.
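The report does not describe the internals of each component, but the high-level data flow can be sketched. Below is a minimal structural sketch in PyTorch, assuming placeholder layer types and dimensions throughout; it shows only how the three components compose, not Pegasus-1's actual architecture.

```python
import torch
import torch.nn as nn

class VideoLanguagePipeline(nn.Module):
    """Sketch of a three-component video-language model.

    All layer choices and sizes are illustrative placeholders,
    not Pegasus-1's real design.
    """

    def __init__(self, frame_dim=1024, embed_dim=768, vocab_size=32000):
        super().__init__()
        # 1. Video encoder: turns per-frame features into video-native embeddings.
        self.video_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=frame_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # 2. Video-language alignment model: maps video embeddings into the
        #    language decoder's embedding space.
        self.alignment = nn.Linear(frame_dim, embed_dim)
        # 3. Language decoder: produces text conditioned on aligned video embeddings.
        self.language_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(embed_dim, vocab_size)

    def forward(self, frame_features, token_embeddings):
        video_emb = self.video_encoder(frame_features)   # (B, T, frame_dim)
        aligned = self.alignment(video_emb)              # (B, T, embed_dim)
        hidden = self.language_decoder(token_embeddings, aligned)
        return self.lm_head(hidden)                      # next-token logits
```

Joint training means gradients from the text loss flow back through all three stages, which is what allows the video and language embeddings to become deeply aligned rather than glued together after the fact.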

Dataset

To train Pegasus-1, Twelve Labs has collected over 300 million diverse video-text pairs, making it one of the largest video-text corpora available for training video-language foundation models. The technical report is based on an initial training run over roughly a 10% subset of this corpus (35 million video-text pairs), supplemented with over 1 billion image-text pairs.

Performance

Pegasus-1 outperforms the previous state-of-the-art video-language model, with a 61% relative improvement on the MSR-VTT dataset and a 47% relative improvement on the Video Descriptions dataset, as measured by the QEFVC Quality Score. It also surpasses ASR+LLM pipelines, which transcribe a video's speech and feed the transcript to a language model: Pegasus-1 outperforms Whisper-ChatGPT (OpenAI) and a leading commercial ASR+LLM product by 79% on MSR-VTT and 188% on Video Descriptions.
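As a reminder, relative improvement is computed against the baseline's score rather than as an absolute difference. The scores in the example below are made-up placeholders chosen only to illustrate the arithmetic, not values from the report.

```python
def relative_improvement(new_score: float, baseline: float) -> float:
    """Percentage relative improvement of new_score over baseline."""
    return (new_score - baseline) / baseline * 100

# Hypothetical scores for illustration only (not from the report):
# a baseline of 0.36 and a new score of 0.58 give ~61% improvement.
print(round(relative_improvement(0.58, 0.36)))  # -> 61
```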

API Access to Pegasus-1

Developers can access Pegasus-1 through Twelve Labs' Video-to-Text APIs. The waitlist for API access can be found on their website.

Closing Remarks

Twelve Labs' Pegasus-1 represents a significant advance in multimodal video understanding. With its video-first approach, large training corpus, and strong benchmark results, Pegasus-1 opens new possibilities for video-to-text generation. While limitations and challenges remain, Twelve Labs is dedicated to further improving the model and to the responsible deployment of advanced technologies.