SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models

We introduce the SCAM datasets to study and evaluate the robustness of multimodal foundation models against typographic attacks.

The benchmark comes in three variants:

- SCAM: real-world typographic attacks
- SynthSCAM: synthetic typographic attacks
- NoSCAM: attack removed

Abstract

Typographic attacks exploit the interplay between text and visual content in multimodal foundation models, causing misclassifications when misleading text is embedded within images. However, existing datasets are limited in size and diversity, making it difficult to study such vulnerabilities. In this paper, we introduce SCAM, the largest and most diverse dataset of real-world typographic attack images to date, containing images across hundreds of object categories and attack words.

Through extensive benchmarking of Vision-Language Models (VLMs) on SCAM, we demonstrate that typographic attacks significantly degrade performance, and we identify that training data and model architecture influence susceptibility to these attacks. Our findings reveal that typographic attacks persist in state-of-the-art Large Vision-Language Models (LVLMs) due to the choice of their vision encoder, though larger Large Language Model (LLM) backbones help mitigate this vulnerability.

Key Contributions

SCAM Dataset

SCAM is the largest and most diverse real-world typographic attack dataset to date, containing images across hundreds of object categories and attack words.

- 1,162 data points
- 660 distinct object labels
- 206 unique attack words
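
Concretely, each data point pairs an image with an object label and an attack word. The sketch below shows one way such samples might be represented and how accuracy could be aggregated; the field names, file layout, and helper names are illustrative assumptions, not the dataset's actual distribution format.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScamSample:
    """One data point: an image of an object with a misleading attack word on it."""
    image_path: str    # path to the photo (assumed local layout)
    object_label: str  # one of the ~660 distinct object labels
    attack_word: str   # one of the ~206 unique attack words
    variant: str       # "NoSCAM", "SCAM", or "SynthSCAM"


def accuracy(samples: List[ScamSample],
             predict: Callable[[str, str, str], str]) -> float:
    """Fraction of samples where the model picks the object label over the attack word.

    `predict(image_path, object_label, attack_word)` is any classifier returning
    one of the two labels, e.g. the zero-shot VLM sketch further below.
    """
    correct = sum(
        predict(s.image_path, s.object_label, s.attack_word) == s.object_label
        for s in samples
    )
    return correct / max(len(samples), 1)
```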

Attack Word Categories

Distribution of attack words in SCAM into categories, highlighting both everyday terms and safety-critical vocabulary.

Evaluation Methodology

VLM Evaluation

We evaluate the performance of VLMs in a zero-shot classification task. For each image, we compute the cosine similarity between its embedding and the text embeddings of both the object label and the attack word. The predicted label is determined based on the highest cosine similarity score.
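
A minimal sketch of this zero-shot protocol using OpenCLIP; the backbone, pretraining tag, and prompt template below are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import open_clip
from PIL import Image

# Illustrative backbone; the paper benchmarks many different VLMs.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()


@torch.no_grad()
def predict(image_path: str, object_label: str, attack_word: str) -> str:
    """Return whichever label (object or attack word) is closer to the image embedding."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    # Simple prompt template; the exact wording is an assumption here.
    texts = tokenizer([f"a photo of a {object_label}", f"a photo of a {attack_word}"])

    image_feat = model.encode_image(image)
    text_feat = model.encode_text(texts)

    # Cosine similarity = dot product of L2-normalized embeddings.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    sims = (image_feat @ text_feat.T).squeeze(0)

    return object_label if sims[0] >= sims[1] else attack_word
```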

LVLM Evaluation

To assess the robustness of an LVLM against typographic attacks, we evaluate whether its output changes when exposed to typographic modifications. We provide the model with an image and a simple prompt asking which of two options, the object label or the attack word, is depicted in the image.
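
A sketch of this two-option querying, written against the Hugging Face transformers LLaVA interface; the checkpoint name, prompt wording, and answer parsing are assumptions and may differ from the paper's exact setup.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Illustrative checkpoint; the paper evaluates several LLaVA variants.
MODEL_ID = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID)  # move to GPU / half precision as needed
model.eval()


@torch.no_grad()
def query_lvlm(image_path: str, object_label: str, attack_word: str) -> str:
    """Ask the LVLM which of the two options is depicted and parse its answer."""
    image = Image.open(image_path).convert("RGB")
    # Simple two-option prompt; the paper's exact wording may differ.
    prompt = (
        "USER: <image>\n"
        f"What entity is depicted in the image: '{object_label}' or '{attack_word}'? "
        "Answer with one of the two options only.\n"
        "ASSISTANT:"
    )
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    answer = processor.decode(output_ids[0], skip_special_tokens=True)
    answer = answer.split("ASSISTANT:")[-1].strip().lower()
    return object_label if object_label.lower() in answer else attack_word
```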

Results

Impact on Vision-Language Models

Accuracy distribution of 99 VLMs across NoSCAM, SCAM, and SynthSCAM datasets. VLMs experience an average accuracy drop of 26% when evaluated on the SCAM dataset, with an even steeper decline of 35% on SynthSCAM.

Impact on Large Vision-Language Models and Dependence on Prompt

Smaller LLaVA models suffer substantial accuracy drops of 30-50%, while models with larger LLM backbones perform better under attack. We further evaluate whether prompting LVLMs to ignore the attack text and focus on the object improves robustness. While we cannot rule out the existence of an effective prompt, our results suggest that this strategy is not effective.

Impact of Model Parameters

Susceptibility to typographic attacks is largely independent of VLM size, measured in millions of parameters: model size alone does not correlate with robustness. For LVLMs, however, larger LLM backbones help mitigate vulnerability to typographic attacks.

Impact of Attack Size

Model accuracy decreases as the post-it area increases: the larger the attack text, the greater the performance degradation.

Key Findings

- 📈 Performance Impact: Typographic attacks cause an average accuracy drop of 26% in VLMs and up to 50% in smaller LLaVA models.
- 👁️ Vision Encoder Vulnerability: LVLMs inherit their vulnerability to typographic attacks from their vision encoders, particularly the ViT-L-14-336 backbone.
- 🧠 LLM Backbone Effect: Larger LLM backbones can compensate for vision encoder limitations, making models more resilient to typographic attacks.
- 🔄 Synthetic Validity: Synthetic typographic attacks closely mirror real-world scenarios, validating their use for evaluating model robustness.

Resources

Citation

@misc{scambliss2025,
    title={SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models},
    author={Justus Westerhoff and Erblina Purelku and Jakob Hackstein and Leo Pinetzki and Lorenz Hufe},
    year={2025},
    eprint={2504.04893},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2504.04893},
}