MetaGPT: A Large Vision-Language Model for Meme Metaphor Understanding

AAAI 2026

1Dalian University of Technology
2RMIT University
Corresponding authors.


Abstract

Memes are an expressive medium that often conveys rich emotions and intentions. Recent studies have confirmed the critical role of metaphors in meme understanding. However, existing metaphor research relies heavily on manual annotation, and mainstream vision-language models (VLMs) still struggle to recognize and comprehend metaphors. To address these challenges, we introduce MetaGPT, the first vision-language model specifically designed for meme metaphor understanding. MetaGPT is capable of identifying and extracting metaphors in memes and generating accurate meme interpretations. Furthermore, we construct a dedicated dataset for meme understanding, MUnd, which comprises approximately 32,000 high-quality question-answer (QA) pairs across three core tasks: metaphor detection, metaphor domain extraction, and meme interpretation. Based on MUnd, we further propose an evaluation benchmark for meme understanding and conduct a comprehensive assessment of existing VLMs. Experimental results reveal that current models still face challenges in metaphor comprehension, while MetaGPT consistently outperforms them across all tasks, highlighting its potential for advancing meme understanding. Our code and appendix are available in the supplementary materials.

🔥Highlights

  • We propose MetaGPT, the first large vision-language model specifically designed for metaphor understanding and meme interpretation. MetaGPT effectively identifies and extracts complex cross-modal metaphors in memes and provides coherent meme interpretations.
  • We construct a large-scale, high-quality dataset for meme understanding, named MUnd. MUnd comprises approximately 32,000 high-quality QA pairs across three core tasks, providing a reliable foundation for training and evaluating models in meme understanding.
  • We define a new task, metaphor domain extraction, as a crucial step toward deeper meme understanding with VLMs. Furthermore, extensive experiments validate the strong potential of MetaGPT in this domain.


The MUnd Dataset

Existing meme datasets primarily focus on annotating visual content, while the metaphorical mappings embedded in memes remain underexplored. However, effective meme understanding requires not only surface-level visual analysis but also the ability to capture implicit mappings between source and target domains. To address this gap, we extend two high-quality metaphor-rich datasets, MET-Meme and MEMECAP, to enhance VLMs' performance in meme understanding. The data construction pipeline of MUnd is illustrated in Figure 2. MUnd consists of three core QA tasks: metaphor detection, metaphor domain extraction, and meme interpretation.


Figure 2. Data construction pipeline of MUnd. The pipeline supports three QA tasks: metaphor detection, metaphor domain extraction, and meme interpretation.
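The three QA tasks above can be illustrated with a small sketch of what MUnd-style QA pairs might look like. Note that the field names (`task`, `question`, `answer`) and the question wordings are hypothetical placeholders for illustration, not the dataset's actual schema:

```python
# Hypothetical sketch of MUnd-style QA pairs for the three core tasks.
# Field names and question phrasings are illustrative, not the real schema;
# each pair would additionally be grounded in a meme image.
qa_pairs = [
    {
        "task": "metaphor_detection",
        "question": "Does this meme contain a metaphor?",
        "answer": "Yes",
    },
    {
        "task": "metaphor_domain_extraction",
        "question": "Extract the source-target domain pair expressed by this meme.",
        "answer": "source: <source domain>; target: <target domain>",
    },
    {
        "task": "meme_interpretation",
        "question": "Explain what this meme conveys.",
        "answer": "<free-form interpretation>",
    },
]

for qa in qa_pairs:
    print(qa["task"])
```

Structuring all three tasks as QA pairs lets a single instruction-tuned VLM handle detection, extraction, and interpretation with one training format.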

Overall Framework of MetaGPT

The overall training framework of MetaGPT is illustrated in Figure 3. It consists of three main components: a visual encoder, a visual projector, and a large language model (LLM). Given a meme image X from the MUnd dataset, the visual encoder first extracts visual features. These features are then projected into the language space via the visual projector, resulting in V. Finally, V is concatenated with the instruction prompt T and fed into the LLM to generate the response.
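The data flow described above can be sketched with toy dimensions. This is a minimal, shape-level illustration using random matrices; the actual encoder, projector, and LLM architectures are not specified here, and the LLM call itself is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only): 16 visual patches, visual feature
# dim 32, LLM embedding dim 64, instruction prompt length 8 tokens.
n_patches, d_vis, d_lm, n_prompt = 16, 32, 64, 8

# 1) Visual encoder: extracts one feature vector per patch of meme image X.
visual_feats = rng.standard_normal((n_patches, d_vis))

# 2) Visual projector: maps visual features into the language embedding
#    space, yielding V (here modeled as a single linear layer).
W_proj = rng.standard_normal((d_vis, d_lm))
V = visual_feats @ W_proj                      # shape: (n_patches, d_lm)

# 3) Concatenate V with the embedded instruction prompt T; the resulting
#    sequence would be fed to the LLM to generate the response.
T_embed = rng.standard_normal((n_prompt, d_lm))
llm_input = np.concatenate([V, T_embed], axis=0)

print(llm_input.shape)  # visual tokens followed by prompt tokens
```

The key design point is that the projector aligns the two modalities in a shared embedding space, so the LLM can attend over visual and textual tokens uniformly.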


Figure 3. Training framework of MetaGPT.

Main Results


Metaphor Domain Extraction

Table 1: Results of metaphor domain extraction. We use BERTScore F1 to compute the similarity between the predicted source-target domain pairs and the references under different thresholds τ ∈ {0.5, 0.6, 0.7, 0.8}. A prediction is considered correct if its score exceeds the threshold. '-' denotes that the model fails to perform the task and receives a score of zero.

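The threshold-based matching used in Table 1 can be sketched as follows. The similarity values below are made-up stand-ins for BERTScore F1 scores between predicted and reference domain pairs, not outputs of an actual BERTScore model:

```python
# Sketch of threshold-based accuracy for metaphor domain extraction.
# Each score is a (hypothetical) BERTScore F1 between a predicted
# source-target domain pair and its reference.
def accuracy_at_threshold(scores, tau):
    """Fraction of predictions whose similarity exceeds the threshold tau."""
    return sum(s > tau for s in scores) / len(scores)

scores = [0.91, 0.74, 0.62, 0.55, 0.48]  # placeholder similarity values

for tau in (0.5, 0.6, 0.7, 0.8):
    print(f"tau={tau}: acc={accuracy_at_threshold(scores, tau):.2f}")
```

Sweeping τ from lenient (0.5) to strict (0.8) shows how accuracy degrades as the matching criterion tightens, which is why Table 1 reports all four thresholds.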

Meme Interpretation

Table 2: Performance comparisons on meme interpretation. The top-2 scores are marked in bold and underlined, respectively. △ indicates the performance gap between our method and the best baseline.


Metaphor Detection

Table 3: Results of metaphor detection.


Human Evaluation

Table 4: Human evaluation on metaphor domain extraction.


Case Study

As illustrated in Figure 4, we conduct a case study on both the meme interpretation and metaphor domain extraction tasks. For meme interpretation, MetaGPT successfully captures the implicit relationship within the meme and identifies its underlying intent, whereas other models tend to make incorrect associations based solely on the textual content. For metaphor domain extraction, although MetaGPT generates one incorrect pair, it still identifies the valid source-target domain mapping; the other models fail to extract any metaphorical mapping, indicating their limited capacity for understanding metaphorical expressions.


Figure 4. Qualitative examples of responses, with incorrect parts highlighted in red.

The MUnd Dataset Examples