AdsQA: Towards Advertisement Video Understanding

Xinwei Long1*, Kai Tian1*, Peng Xu1, Guoli Jia1, Jingxuan Li1, Sa Yang2, Yihua Shao3, Kaiyan Zhang1,
Che Jiang1, Hao Xu1, Yang Liu1, Jiaheng Ma1, Bowen Zhou1†

1Tsinghua University,  2Peking University,  3Institute of Automation, Chinese Academy of Sciences

* Indicates Equal Contribution      † Indicates Corresponding Author

[Figure: AdsQA teaser]

Abstract

Large language models (LLMs) have taken a great step towards AGI. Meanwhile, a growing number of domain-specific problems such as math and programming drive these general-purpose models to keep evolving by learning deeper expertise. It is thus time to further extend the diversity of specialized applications for knowledgeable LLMs, though collecting high-quality data with unexpected and informative tasks remains challenging. In this paper, we propose to use advertisement (ad) videos as a challenging test-bed to probe the ability of LLMs to perceive beyond the objective physical content of the common visual domain. Our motivation is to take full advantage of the clue-rich and information-dense traits of ad videos, e.g., marketing logic, persuasive strategies, and audience engagement. Our contribution is three-fold: (1) To our knowledge, this is the first attempt to use ad videos with well-designed tasks to evaluate LLMs. We contribute AdsQA, a challenging ad video QA benchmark derived from 1,544 ad videos with 10,962 clips, totaling 21.1 hours, and providing 5 challenging tasks. (2) We propose ReAd-R, a DeepSeek-R1-styled RL model that reflects on questions and generates answers via reward-driven optimization. (3) We benchmark 14 top-tier LLMs on AdsQA, and our ReAd-R achieves the state of the art, outperforming strong competitors equipped with long-chain reasoning capabilities (e.g., VOT and MCTSr) by a clear margin.

Benchmark Overview

[Figure: Benchmark overview]

AdsQA introduces a novel benchmark for understanding advertisement videos using LLMs. These videos are clue-rich, persuasion-driven, and semantically dense, making them ideal for evaluating cognitive-level multimodal reasoning. The benchmark consists of 1,544 ad videos, 10,962 clips, and five reasoning tasks: visual concepts, emotion recognition, theme extraction, persuasion strategy, and audience modeling. Our proposed model, ReAd-R, a reinforcement-learned ad reasoner, achieves state-of-the-art performance across all tasks.
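To make the data layout concrete, here is a minimal sketch of what one AdsQA entry and a loader could look like. The field names (`video_id`, `clip_ids`, `task`, `question`, `answer`) and the JSON Lines layout are illustrative assumptions, not the released format.

```python
import json

# Hypothetical schema for a single AdsQA entry; the actual released field
# names and file format may differ.
example_entry = {
    "video_id": "ad_000123",
    "clip_ids": ["ad_000123_clip_00", "ad_000123_clip_01"],
    "task": "persuasion_strategy",  # one of the five reasoning tasks above
    "question": "Which persuasion strategy does this ad primarily rely on?",
    "answer": "Emotional appeal: the ad ties the product to family warmth.",
}

def load_benchmark(path: str) -> list[dict]:
    """Load AdsQA-style QA entries from a JSON Lines file (one entry per line)."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```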

Benchmark Tasks

Dataset Overview & Statistics

[Figure: Dataset overview and statistics]

AdsQA is a comprehensive, large-scale video QA benchmark designed around the complex, information-rich nature of advertisement videos. It offers a diverse and well-structured data source for evaluating LLMs on implicit reasoning tasks.

The dataset is constructed using a novel Role-Played Multi-Agent Annotation framework that simulates human expert behaviors, including those of marketers, visual designers, and psychologists, to automatically generate rich, specialized, and insightful QA pairs for each advertisement video.
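The sketch below illustrates the role-played annotation idea, generating one QA pair per simulated expert role. The role prompts and the generic `chat(system_prompt, user_prompt)` callable are placeholder assumptions; the framework's actual prompts, roles, and quality-control steps are more elaborate.

```python
from typing import Callable

# Placeholder role prompts; the framework's real prompts are more detailed.
ROLES = {
    "marketer": (
        "You are a marketing expert. Write one question and answer about the "
        "ad's persuasion strategy and target audience."
    ),
    "visual_designer": (
        "You are a visual designer. Write one question and answer about the "
        "ad's visual concepts and symbolism."
    ),
    "psychologist": (
        "You are a psychologist. Write one question and answer about the "
        "emotions the ad tries to evoke."
    ),
}

def annotate_video(video_description: str,
                   chat: Callable[[str, str], str]) -> list[dict]:
    """Generate one role-specific QA pair per simulated expert for one ad."""
    qa_pairs = []
    for role, system_prompt in ROLES.items():
        reply = chat(system_prompt, f"Ad video content: {video_description}")
        qa_pairs.append({"role": role, "qa": reply})
    return qa_pairs
```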

🧠 Understanding the Need for Specialized Reasoning

Advertisement videos are fundamentally different from typical instructional or user-generated videos. They are designed to convey powerful messages in short time spans, often through symbolism, emotion, metaphor, and subtle psychological cues. These layers of meaning are not explicitly stated; they are intentionally implicit, aiming to influence viewer perception on a subconscious level.

As a result, reasoning over ad content requires cognitive abilities beyond simple perception. It’s not just about identifying what appears on screen, but understanding why it appears, what it aims to evoke, and for whom it was crafted. Traditional video-language models often fall short when asked to interpret persuasive strategies, emotional triggers, or audience targeting embedded in ad narratives.

This gap in reasoning capability motivates the need for a new model: one that not only processes visual and textual inputs, but also reflects, evaluates, and reasons like a human expert. This is where ReAd-R, our Reinforced Ad Reasoner, comes in.

ReAd-R: Our Reinforced Ad Reasoner Model

[Figure: ReAd-R model overview]

ReAd-R is our proposed reinforcement-learning-based reasoning model, inspired by DeepSeek-R1 and tailored to advertisement video understanding. Unlike traditional chain-of-thought methods, ReAd-R reflects on the ad's contents and learns from feedback to generate high-quality answers in a human-like manner.

It integrates a vision encoder, a lightweight language model, and a multi-stage policy optimization framework. By leveraging rule-based reward modeling, it discourages hallucination and promotes precise, grounded reasoning. ReAd-R was trained with only 500 annotated ad clips, yet it achieves state-of-the-art performance across multiple QA tasks, especially those involving implicit reasoning such as persuasion strategies and emotion recognition.
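As one concrete illustration of rule-based reward modeling, the sketch below combines a format reward (does the response follow a think/answer template?) with an exact-match accuracy reward, in the DeepSeek-R1 style. The `<think>`/`<answer>` tags, the exact-match rule, and the equal weighting are assumptions for illustration, not ReAd-R's actual reward definition.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed <think>...</think><answer>...</answer> template."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, reference: str) -> float:
    """1.0 if the extracted answer exactly matches the reference (case-insensitive)."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == reference.strip().lower() else 0.0

def total_reward(response: str, reference: str) -> float:
    """Scalar reward used to drive policy optimization (equal weighting assumed)."""
    return format_reward(response) + accuracy_reward(response, reference)
```

Because such rewards are computed by fixed rules rather than a learned reward model, they are cheap to evaluate and hard to game, which is one reason rule-based rewards help keep the policy's reasoning grounded.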