Real-world Typographic Attack
Synthetic Typographic Attack
Removed Attack
Typographic attacks exploit the interplay between text and visual content in multimodal foundation models, causing misclassifications when misleading text is embedded within images. However, existing datasets are limited in size and diversity, making it difficult to study such vulnerabilities. In this paper, we introduce SCAM, the largest and most diverse dataset of real-world typographic attack images to date, containing images across hundreds of object categories and attack words.
Through extensive benchmarking of Vision-Language Models (VLMs) on SCAM, we demonstrate that typographic attacks significantly degrade performance, and identify that training data and model architecture influence the susceptibility to these attacks. Our findings reveal that typographic attacks persist in state-of-the-art Large Vision-Language Models (LVLMs) due to the choice of their vision encoder, though larger Large Language Models (LLMs) backbones help mitigate their vulnerability.
SCAM is the largest and most diverse real-world typographic attack dataset to date, containing images across hundreds of object categories and attack words.
Data points
Distinct object labels
Unique attack words
Distribution of attack words in SCAM into categories, highlighting both everyday terms and safety-critical vocabulary.
We evaluate the performance of VLMs in a zero-shot classification task. For each image, we compute the cosine similarity between its embedding and the text embeddings of both the object label and the attack word. The predicted label is determined based on the highest cosine similarity score.
To assess the robustness of an LVLM against typographic attacks, we evaluate whether its output changes when exposed to typographic modifications. Our evaluation involves providing the model with an image along with a simple prompt asking it to identify what entity is depicted in the image with two options.
Accuracy distribution of 99 VLMs across NoSCAM, SCAM, and SynthSCAM datasets. VLMs experience an average accuracy drop of 26% when evaluated on the SCAM dataset, with an even steeper decline of 35% on SynthSCAM.
Smaller LLaVA models suffer substantial accuracy drops of 30-50%, while models with larger LLM backbones exhibit better performance under attack. Further, we evaluate whether prompting LVLMs to ignore the attack and focus on the object improves robustness. While we cannot rule out the existence of an effective prompt, our results suggest it is not effective.
Susceptibility to typographic attack is agnostic of VLMs size, measured in millions of parameters. While model size alone does not correlate with robustness, larger LLM backbones in LVLMs help mitigate vulnerability to typographic attacks.
Model accuracy decreases as the post-it area increases. The size of the attack text correlates with model accuracy, with larger post-it notes causing greater performance degradation.
Typographic attacks cause an average accuracy drop of 26% in VLMs and up to 50% in smaller LLaVA models.
LVLMs inherit vulnerability to typographic attacks from their vision encoders, particularly the ViT-L-14-336 backbone.
Larger LLM backbones can compensate for vision encoder limitations, making models more resilient to typographic attacks.
Synthetic typographic attacks closely mirror real-world scenarios, validating their use for evaluating model robustness.
@misc{scambliss2025, title={SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models}, author={Justus Westerhoff and Erblina Purelku and Jakob Hackstein and Leo Pinetzki and Lorenz Hufe}, year={2025}, eprint={2504.04893}, archivePrefix={arXiv}, primaryClass={cs.CV}, url={https://arxiv.org/abs/2504.04893}, }