SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models

We introduce the SCAM datasets to study and evaluate the robustness of multimodal foundation models against typographic attacks.

The benchmark comes in three variants:

- SCAM: real-world typographic attacks
- SynthSCAM: synthetic typographic attacks
- NoSCAM: attack removed

Abstract

Typographic attacks exploit the interplay between text and visual content in multimodal foundation models, causing misclassifications when misleading text is embedded within images. However, existing datasets are limited in size and diversity, making it difficult to study such vulnerabilities. In this paper, we introduce SCAM, the largest and most diverse dataset of real-world typographic attack images to date, containing images across hundreds of object categories and attack words.

Through extensive benchmarking of Vision-Language Models (VLMs) on SCAM, we demonstrate that typographic attacks significantly degrade performance, and we identify that training data and model architecture influence susceptibility to these attacks. Our findings reveal that typographic attacks persist in state-of-the-art Large Vision-Language Models (LVLMs) due to the choice of their vision encoder, though larger Large Language Model (LLM) backbones help mitigate this vulnerability.

Key Contributions

SCAM Dataset

SCAM is the largest and most diverse real-world typographic attack dataset to date, containing images across hundreds of object categories and attack words.

- 1,162 data points
- 660 distinct object labels
- 206 unique attack words
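
Concretely, each data point pairs an image with an object label and an attack word. The sketch below shows one way such samples might be represented and how accuracy could be aggregated; the field names, file layout, and helper names are illustrative assumptions, not the dataset's actual distribution format.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScamSample:
    """One data point: an image of an object with a misleading attack word on it."""
    image_path: str    # path to the photo (assumed local layout)
    object_label: str  # one of the ~660 distinct object labels
    attack_word: str   # one of the ~206 unique attack words
    variant: str       # "NoSCAM", "SCAM", or "SynthSCAM"


def accuracy(samples: List[ScamSample],
             predict: Callable[[str, str, str], str]) -> float:
    """Fraction of samples where the model picks the object label over the attack word.

    `predict(image_path, object_label, attack_word)` is any classifier returning
    one of the two labels, e.g. the zero-shot VLM sketch further below.
    """
    correct = sum(
        predict(s.image_path, s.object_label, s.attack_word) == s.object_label
        for s in samples
    )
    return correct / max(len(samples), 1)
```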

Attack Word Categories

Distribution of attack words in SCAM into categories, highlighting both everyday terms and safety-critical vocabulary.

Evaluation Methodology

VLM Evaluation

We evaluate the performance of VLMs in a zero-shot classification task. For each image, we compute the cosine similarity between its embedding and the text embeddings of both the object label and the attack word. The predicted label is determined based on the highest cosine similarity score.
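
A minimal sketch of this zero-shot protocol using OpenCLIP; the backbone, pretraining tag, and prompt template below are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import open_clip
from PIL import Image

# Illustrative backbone; the paper benchmarks many different VLMs.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()


@torch.no_grad()
def predict(image_path: str, object_label: str, attack_word: str) -> str:
    """Return whichever label (object or attack word) is closer to the image embedding."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    # Simple prompt template; the exact wording is an assumption here.
    texts = tokenizer([f"a photo of a {object_label}", f"a photo of a {attack_word}"])

    image_feat = model.encode_image(image)
    text_feat = model.encode_text(texts)

    # Cosine similarity = dot product of L2-normalized embeddings.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    sims = (image_feat @ text_feat.T).squeeze(0)

    return object_label if sims[0] >= sims[1] else attack_word
```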

LVLM Evaluation

To assess the robustness of an LVLM against typographic attacks, we evaluate whether its output changes when exposed to typographic modifications. We provide the model with an image and a simple prompt asking which of two options, the object label or the attack word, is depicted in the image.
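
A sketch of this two-option querying, written against the Hugging Face transformers LLaVA interface; the checkpoint name, prompt wording, and answer parsing are assumptions and may differ from the paper's exact setup.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Illustrative checkpoint; the paper evaluates several LLaVA variants.
MODEL_ID = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID)  # move to GPU / half precision as needed
model.eval()


@torch.no_grad()
def query_lvlm(image_path: str, object_label: str, attack_word: str) -> str:
    """Ask the LVLM which of the two options is depicted and parse its answer."""
    image = Image.open(image_path).convert("RGB")
    # Simple two-option prompt; the paper's exact wording may differ.
    prompt = (
        "USER: <image>\n"
        f"What entity is depicted in the image: '{object_label}' or '{attack_word}'? "
        "Answer with one of the two options only.\n"
        "ASSISTANT:"
    )
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    answer = processor.decode(output_ids[0], skip_special_tokens=True)
    answer = answer.split("ASSISTANT:")[-1].strip().lower()
    return object_label if object_label.lower() in answer else attack_word
```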

Results

Impact on Vision-Language Models

Accuracy distribution of 99 VLMs across NoSCAM, SCAM, and SynthSCAM datasets. VLMs experience an average accuracy drop of 26% when evaluated on the SCAM dataset, with an even steeper decline of 35% on SynthSCAM.

Impact on Large Vision-Language Models and Dependence on Prompt

Smaller LLaVA models suffer substantial accuracy drops of 30-50%, while models with larger LLM backbones perform better under attack. We further evaluate whether prompting LVLMs to ignore the attack text and focus on the object improves robustness. While we cannot rule out the existence of an effective prompt, our results suggest that this strategy is not effective.

Impact of Model Parameters

Susceptibility to typographic attacks is largely independent of VLM size, measured in millions of parameters: model size alone does not correlate with robustness. For LVLMs, however, larger LLM backbones help mitigate vulnerability to typographic attacks.

Impact of Attack Size

Model accuracy decreases as the post-it area increases: the larger the attack text, the greater the performance degradation.

Key Findings

- 📈 Performance Impact: Typographic attacks cause an average accuracy drop of 26% in VLMs and up to 50% in smaller LLaVA models.
- 👁️ Vision Encoder Vulnerability: LVLMs inherit their vulnerability to typographic attacks from their vision encoders, particularly the ViT-L-14-336 backbone.
- 🧠 LLM Backbone Effect: Larger LLM backbones can compensate for vision encoder limitations, making models more resilient to typographic attacks.
- 🔄 Synthetic Validity: Synthetic typographic attacks closely mirror real-world scenarios, validating their use for evaluating model robustness.

Resources

Citation

@misc{scambliss2025,
    title={SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models},
    author={Justus Westerhoff and Erblina Purelku and Jakob Hackstein and Leo Pinetzki and Lorenz Hufe},
    year={2025},
    eprint={2504.04893},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2504.04893},
}