AdsQA: Towards Advertisement Video Understanding

Xinwei Long1*, Kai Tian1*, Peng Xu1, Guoli Jia1, Jingxuan Li1, Sa Yang2, Yihua Shao3, Kaiyan Zhang1,
Che Jiang1, Hao Xu1, Yang Liu1, Jiaheng Ma1, Bowen Zhou1†

1Tsinghua University,  2Peking University,  3Institute of Automation, Chinese Academy of Sciences

* Indicates Equal Contribution      † Indicates Corresponding Author

[Figure: AdsQA teaser]

Abstract

Large language models (LLMs) have taken a great step towards AGI. Meanwhile, a growing number of domain-specific problems such as math and programming drive these general-purpose models to keep evolving by learning deeper expertise. It is thus time to further extend the diversity of specialized applications for knowledgeable LLMs, though collecting high-quality data with unexpected and informative tasks remains challenging. In this paper, we propose to use advertisement (ad) videos as a challenging test-bed to probe the ability of LLMs to perceive beyond the objective physical content of the common visual domain. Our motivation is to take full advantage of the clue-rich and information-dense traits of ad videos, e.g., marketing logic, persuasive strategies, and audience engagement. Our contribution is three-fold: (1) To our knowledge, this is the first attempt to use ad videos with well-designed tasks to evaluate LLMs. We contribute AdsQA, a challenging ad video QA benchmark derived from 1,544 ad videos with 10,962 clips, totaling 21.1 hours, and providing 5 challenging tasks. (2) We propose ReAd-R, a DeepSeek-R1-styled RL model that reflects on questions and generates answers via reward-driven optimization. (3) We benchmark 14 top-tier LLMs on AdsQA, and our ReAd-R achieves the state of the art, outperforming strong competitors equipped with long-chain reasoning capabilities (e.g., VOT and MCTSr) by a clear margin.

Benchmark Overview

[Figure: Benchmark overview]

AdsQA introduces a novel benchmark for understanding advertisement videos using LLMs. These videos are clue-rich, persuasion-driven, and semantically dense, making them ideal for evaluating cognitive-level multimodal reasoning. The benchmark consists of 1,544 ad videos, 10,962 clips, and five reasoning tasks: visual concepts, emotion recognition, theme extraction, persuasion strategy, and audience modeling. Our proposed model, ReAd-R, a reinforcement-learned ad reasoner, achieves state-of-the-art performance across all tasks.
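To make the data layout concrete, here is a minimal sketch of what one AdsQA entry and a loader could look like. The field names (`video_id`, `clip_ids`, `task`, `question`, `answer`) and the JSON Lines layout are illustrative assumptions, not the released format.

```python
import json

# Hypothetical schema for a single AdsQA entry; the actual released field
# names and file format may differ.
example_entry = {
    "video_id": "ad_000123",
    "clip_ids": ["ad_000123_clip_00", "ad_000123_clip_01"],
    "task": "persuasion_strategy",  # one of the five reasoning tasks above
    "question": "Which persuasion strategy does this ad primarily rely on?",
    "answer": "Emotional appeal: the ad ties the product to family warmth.",
}

def load_benchmark(path: str) -> list[dict]:
    """Load AdsQA-style QA entries from a JSON Lines file (one entry per line)."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```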

Benchmark Tasks

Dataset Overview & Statistics

[Figure: Dataset overview and statistics]

AdsQA is a comprehensive, large-scale video QA benchmark designed around the complex, information-rich nature of advertisement videos. It offers a diverse and well-structured data source for evaluating LLMs on implicit reasoning tasks.

The dataset is constructed using a novel Role-Played Multi-Agent Annotation framework that simulates human expert behaviors, including those of marketers, visual designers, and psychologists, to automatically generate rich, specialized, and insightful QA pairs for each advertisement video.
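The sketch below illustrates the role-played annotation idea, generating one QA pair per simulated expert role. The role prompts and the generic `chat(system_prompt, user_prompt)` callable are placeholder assumptions; the framework's actual prompts, roles, and quality-control steps are more elaborate.

```python
from typing import Callable

# Placeholder role prompts; the framework's real prompts are more detailed.
ROLES = {
    "marketer": (
        "You are a marketing expert. Write one question and answer about the "
        "ad's persuasion strategy and target audience."
    ),
    "visual_designer": (
        "You are a visual designer. Write one question and answer about the "
        "ad's visual concepts and symbolism."
    ),
    "psychologist": (
        "You are a psychologist. Write one question and answer about the "
        "emotions the ad tries to evoke."
    ),
}

def annotate_video(video_description: str,
                   chat: Callable[[str, str], str]) -> list[dict]:
    """Generate one role-specific QA pair per simulated expert for one ad."""
    qa_pairs = []
    for role, system_prompt in ROLES.items():
        reply = chat(system_prompt, f"Ad video content: {video_description}")
        qa_pairs.append({"role": role, "qa": reply})
    return qa_pairs
```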

🧠 Understanding the Need for Specialized Reasoning

Advertisement videos are fundamentally different from typical instructional or user-generated videos. They are designed to convey powerful messages in short time spans, often through symbolism, emotion, metaphor, and subtle psychological cues. These layers of meaning are not explicitly stated; they are intentionally implicit, aiming to influence viewer perception on a subconscious level.

As a result, reasoning over ad content requires cognitive abilities beyond simple perception. It’s not just about identifying what appears on screen, but understanding why it appears, what it aims to evoke, and for whom it was crafted. Traditional video-language models often fall short when asked to interpret persuasive strategies, emotional triggers, or audience targeting embedded in ad narratives.

This gap in reasoning capability motivates the need for a new model: one that not only processes visual and textual inputs, but also reflects, evaluates, and reasons like a human expert. This is where ReAd-R, our Reinforced Ad Reasoner, comes in.

ReAd-R: Our Reinforced Ad Reasoner Model

[Figure: ReAd-R model overview]

ReAd-R is our proposed reinforcement-learning-based reasoning model, inspired by DeepSeek-R1 and tailored to advertisement video understanding. Unlike traditional chain-of-thought methods, ReAd-R reflects on the ad's contents and learns from feedback to generate high-quality answers in a human-like manner.

It integrates a vision encoder, a lightweight language model, and a multi-stage policy optimization framework. By leveraging rule-based reward modeling, it discourages hallucination and promotes precise, grounded reasoning. ReAd-R was trained with only 500 annotated ad clips, yet it achieves state-of-the-art performance across multiple QA tasks, especially those involving implicit reasoning such as persuasion strategies and emotion recognition.
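As one concrete illustration of rule-based reward modeling, the sketch below combines a format reward (does the response follow a think/answer template?) with an exact-match accuracy reward, in the DeepSeek-R1 style. The `<think>`/`<answer>` tags, the exact-match rule, and the equal weighting are assumptions for illustration, not ReAd-R's actual reward definition.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed <think>...</think><answer>...</answer> template."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, reference: str) -> float:
    """1.0 if the extracted answer exactly matches the reference (case-insensitive)."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == reference.strip().lower() else 0.0

def total_reward(response: str, reference: str) -> float:
    """Scalar reward used to drive policy optimization (equal weighting assumed)."""
    return format_reward(response) + accuracy_reward(response, reference)
```

Because such rewards are computed by fixed rules rather than a learned reward model, they are cheap to evaluate and hard to game, which is one reason rule-based rewards help keep the policy's reasoning grounded.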