Self-Calibrated Consistency can Fight Back for Adversarial Robustness in Vision-Language Models

ICML 2026

Jiaxiang Liu, Jiawei Du, Xiao Liu, Shangyang Li, Songchen Ma, Changshuo Wang, Prayag Tiwari, Mingkun Xu
SCC concept

Left: under adversarial counterattack, an embedding drifts toward hard-negative classes (e.g. Dog → Wolf). Right: SCC uses semantic consistency and spatial consistency together to pull it back into the correct class region.

Abstract

Pre-trained vision–language models (VLMs) such as CLIP have demonstrated strong zero-shot capabilities across diverse domains, yet remain highly vulnerable to adversarial perturbations that disrupt image–text alignment and compromise reliability. Existing defenses typically rely on adversarial fine-tuning with labeled data, limiting their applicability in zero-shot settings.

In this work, we identify two key weaknesses of current CLIP adversarial attacks—lack of semantic guidance and vulnerability to view variations—collectively termed semantic and viewpoint fragility. To address these challenges, we propose Self-Calibrated Consistency (SCC), an effective test-time defense. SCC consists of two complementary modules: Semantic consistency, which leverages soft pseudo-labels from a counterattack warm-up and multi-view predictions to regularize cross-modal alignment and separate target embeddings from confusable negatives; and Spatial consistency, aligning perturbed visual predictions via augmented views to stabilize inference under adversarial perturbations.

Together, these modules form a plug-and-play inference strategy. Extensive experiments on 22 benchmarks under diverse attack settings show that SCC consistently improves the zero-shot robustness of CLIP while maintaining accuracy, and can be seamlessly integrated with other VLMs for further gains. These findings highlight the great potential of establishing an adversarially robust paradigm from CLIP, with implications extending to broader VLMs such as BiomedCLIP.

Method

Test-time defense paradigms

Three test-time defenses, side by side. R-TPT updates text prompts online; TTC applies a corrective perturbation to the input; SCC adds a semantic pseudo-label and aggregates several views, giving a stable and accurate recovery.
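The idea of aggregating several views into one soft pseudo-label can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the temperature, the KL-based consistency term, and the toy logits are all assumptions made for illustration, and the counterattack warm-up described in the abstract is not reproduced here.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_pseudo_label(view_logits, temperature=1.0):
    """Average per-view class probabilities into one soft pseudo-label."""
    probs = [softmax(v, temperature) for v in view_logits]
    n_views, n_classes = len(probs), len(probs[0])
    return [sum(p[c] for p in probs) / n_views for c in range(n_classes)]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def view_consistency(view_logits, temperature=1.0):
    """Mean KL divergence from the pseudo-label to each view's prediction.

    Zero when all views agree; grows as the views disagree.
    """
    target = soft_pseudo_label(view_logits, temperature)
    probs = [softmax(v, temperature) for v in view_logits]
    return sum(kl_divergence(target, p) for p in probs) / len(probs)

# Hypothetical image-text similarity logits for three augmented views,
# with classes ordered [dog, wolf, cat].
views = [
    [2.0, 1.5, 0.1],
    [2.2, 1.0, 0.0],
    [1.2, 1.6, 0.2],  # this view leans toward the hard negative "wolf"
]
label = soft_pseudo_label(views)
predicted = max(range(len(label)), key=label.__getitem__)
penalty = view_consistency(views)
```

In this toy case one view favours the hard negative, yet the averaged pseudo-label still selects the first class, and the consistency term is positive because the views disagree. A defense in this spirit would use such a term to steer the corrective perturbation; how SCC does so exactly is specified in the paper, not here.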

Main results (Table 1)

Classification accuracy (Acc., %) and adversarial accuracy (Rob., %) under a PGD-10 attack (ε_a = 1/255) across 16 datasets. SCC is a pure test-time defense: no adversarial fine-tuning, no labels.
Δ = SCC (ours) minus vanilla CLIP, in percentage points.

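Column groups: Adversarial Finetuning = CLIP-FT, TeCoA, PMG-AFT, FARE; Test-time Defense = RN, Anti-adv, HD, TTC, DOC, SCC (ours).

| Dataset | Metric | CLIP | CLIP-FT | TeCoA | PMG-AFT | FARE | RN | Anti-adv | HD | TTC | DOC | SCC (ours) | Δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | Rob. | 0.74 | 3.34 | 33.61 | 40.66 | 19.65 | 2.01 | 12.39 | 17.22 | 28.75 | 47.78 | 59.18 | +58.44 |
| CIFAR-10 | Acc. | 85.12 | 84.90 | 64.61 | 70.69 | 74.44 | 81.18 | 83.52 | 78.23 | 81.18 | 81.99 | 82.24 | -2.88 |
| STL-10 | Rob. | 11.00 | 12.73 | 70.08 | 73.08 | 59.06 | 16.23 | 37.42 | 39.02 | 76.70 | 86.33 | 90.50 | +79.50 |
| STL-10 | Acc. | 96.40 | 94.49 | 87.40 | 88.56 | 91.72 | 95.85 | 95.45 | 89.50 | 95.85 | 96.04 | 95.62 | -0.78 |
| ImageNet | Rob. | 1.15 | 0.93 | 18.89 | 21.43 | 14.00 | 1.77 | 8.67 | 6.63 | 38.41 | 43.72 | 49.77 | +48.62 |
| ImageNet | Acc. | 59.69 | 54.24 | 34.89 | 36.12 | 48.79 | 59.34 | 54.27 | 54.54 | 49.39 | 46.46 | 56.03 | -3.66 |
| OxfordPets | Rob. | 1.04 | 2.10 | 38.35 | 41.18 | 31.07 | 1.86 | 20.42 | 12.04 | 57.87 | 67.18 | 76.67 | +75.63 |
| OxfordPets | Acc. | 87.44 | 84.14 | 62.12 | 65.88 | 79.37 | 87.41 | 80.62 | 80.91 | 83.35 | 81.36 | 86.48 | -0.96 |
| Caltech256 | Rob. | 8.47 | 6.76 | 43.19 | 45.91 | 38.79 | 11.33 | 25.36 | 23.48 | 60.11 | 65.93 | 72.88 | +64.41 |
| Caltech256 | Acc. | 81.72 | 78.53 | 61.14 | 62.24 | 73.32 | 81.25 | 79.38 | 79.12 | 79.66 | 79.45 | 81.16 | -0.56 |
| Flowers102 | Rob. | 1.14 | 0.54 | 21.94 | 23.43 | 17.14 | 1.52 | 7.16 | 7.29 | 39.14 | 45.55 | 54.59 | +53.45 |
| Flowers102 | Acc. | 65.46 | 53.37 | 36.80 | 37.00 | 47.98 | 64.62 | 62.66 | 58.22 | 64.16 | 63.14 | 64.16 | -1.30 |
| Food101 | Rob. | 0.70 | 0.42 | 13.90 | 18.57 | 11.65 | 1.20 | 13.12 | 8.03 | 57.84 | 62.00 | 65.39 | +64.69 |
| Food101 | Acc. | 83.88 | 64.86 | 29.98 | 36.61 | 55.31 | 83.44 | 75.81 | 80.30 | 82.18 | 81.06 | 82.13 | -1.75 |
| PCAM | Rob. | 0.08 | 1.11 | 48.24 | 46.18 | 16.23 | 0.41 | 4.97 | 44.74 | 52.85 | 62.44 | 69.99 | +69.91 |
| PCAM | Acc. | 52.02 | 47.21 | 49.96 | 50.03 | 52.54 | 52.73 | 52.49 | 50.38 | 52.73 | 53.46 | 54.41 | +2.39 |
| Avg. (16) | Rob. | 2.70 | 2.91 | 26.54 | 28.76 | 20.00 | 3.86 | 12.01 | 13.81 | 39.17 | 46.04 | 51.68 | +48.98 |
| Avg. (16) | Acc. | 61.51 | 55.80 | 40.25 | 42.30 | 51.02 | 61.61 | 57.35 | 56.62 | 59.75 | 59.60 | 60.21 | -1.30 |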

Eight of the 16 datasets shown. SCC outperforms every prior test-time defense on every dataset in adversarial accuracy while preserving clean accuracy (–1.30 pp on average versus CLIP). The full table covering all 16 datasets and stronger attacks (ε = 4/255, CW, AutoAttack, PGD-100) is in the paper.
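For readers unfamiliar with the attack setting, the sketch below shows a generic L∞ PGD loop with 10 steps and ε = 1/255. It is a toy under stated assumptions: the victim is a hypothetical linear classifier with an analytic input gradient rather than a CLIP image encoder, there is no random start, and the step size (2.5ε / steps) is a common heuristic, not a value taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad(x, w, b, y):
    """Binary cross-entropy of a linear classifier and its gradient w.r.t. x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(z)
    eps = 1e-12  # guards the logarithms against log(0)
    loss = -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    grad = [(p - y) * wi for wi in w]  # d(loss)/dx for the sigmoid + BCE pair
    return loss, grad

def pgd_linf(x, w, b, y, epsilon=1 / 255, steps=10, step_size=None):
    """L-infinity PGD: signed gradient ascent on the loss, followed by
    projection onto the epsilon-ball around x and the pixel range [0, 1]."""
    if step_size is None:
        step_size = 2.5 * epsilon / steps  # assumed heuristic, not from the paper
    x_adv = list(x)
    for _ in range(steps):
        _, grad = loss_and_grad(x_adv, w, b, y)
        for i, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            xi = x_adv[i] + step_size * sign
            xi = min(max(xi, x[i] - epsilon), x[i] + epsilon)  # epsilon-ball
            x_adv[i] = min(max(xi, 0.0), 1.0)                  # valid pixels
    return x_adv

# Hypothetical 4-pixel "image" and linear model, true label y = 1.
x = [0.2, 0.8, 0.5, 0.1]
w = [3.0, -2.0, 1.5, -4.0]
b = 0.1
y = 1

epsilon = 1 / 255
x_adv = pgd_linf(x, w, b, y, epsilon=epsilon, steps=10)
clean_loss, _ = loss_and_grad(x, w, b, y)
adv_loss, _ = loss_and_grad(x_adv, w, b, y)
max_shift = max(abs(a - c) for a, c in zip(x_adv, x))
```

The perturbation never exceeds ε in any coordinate, yet the loss on the perturbed input is higher than on the clean one. Against a deep model the same loop would use automatic differentiation in place of the closed-form gradient.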

Medical benchmarks (Table 3)

Adversarial robustness on 6 medical datasets under PGD-10 (εa = 1/255). SCC is a drop-in replacement for the test-time stage and works for both CLIP and BiomedCLIP backbones.

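| Backbone | Method | BUSI | BTMRI | CHMNIST | COVID-19 | DermaMNIST | KneeXray | Avg. Rob. |
|---|---|---|---|---|---|---|---|---|
| CLIP | CLIP | 0.00 | 0.00 | 0.00 | 0.13 | 0.02 | 0.00 | 0.02 |
| CLIP | TTC | 11.67 | 8.93 | 2.20 | 7.51 | 9.11 | 7.68 | 7.38 |
| CLIP | SCC | 23.85 | 16.19 | 9.12 | 7.30 | 12.40 | 11.08 | 13.32 |
| BiomedCLIP | BiomedCLIP | 0.00 | 0.49 | 0.00 | 2.72 | 0.00 | 0.00 | 0.08 |
| BiomedCLIP | TTC | 7.95 | 22.20 | 2.80 | 18.36 | 4.91 | 7.51 | 10.62 |
| BiomedCLIP | SCC | 31.92 | 48.93 | 16.56 | 57.58 | 20.64 | 28.35 | 34.00 |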

On BiomedCLIP, SCC raises average robustness from 0.08% to 34.00%, which is 23.38 pp above TTC (10.62%), with clean accuracy preserved at 44.63%.
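The averages quoted for the BiomedCLIP backbone can be checked directly from the per-dataset numbers in Table 3. The snippet below is only a small arithmetic check; the variable names are illustrative.

```python
from statistics import mean

datasets = ["BUSI", "BTMRI", "CHMNIST", "COVID-19", "DermaMNIST", "KneeXray"]

# Robust accuracy (%) on the BiomedCLIP backbone, copied from Table 3.
biomedclip_ttc = [7.95, 22.20, 2.80, 18.36, 4.91, 7.51]
biomedclip_scc = [31.92, 48.93, 16.56, 57.58, 20.64, 28.35]

avg_ttc = round(mean(biomedclip_ttc), 2)
avg_scc = round(mean(biomedclip_scc), 2)
gain_over_ttc = round(avg_scc - avg_ttc, 2)  # percentage points

# Per-dataset gain of SCC over TTC, in percentage points.
per_dataset_gain = {
    name: round(scc - ttc, 2)
    for name, scc, ttc in zip(datasets, biomedclip_scc, biomedclip_ttc)
}
```

The gain is positive on all six datasets, so the average improvement is not driven by a single benchmark.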

Efficiency (Table 4)

Per-image runtime and robustness on DTD. SCC is roughly 30× faster than R-TPT (0.0125 s versus 0.37 s per image) while delivering higher robust accuracy, and it costs about the same per image as TTC.

| Method | Time / image | Robust Acc. (%) |
|---|---|---|
| R-TPT (64 views) | 0.37 s | 32.8 |
| TTC | 0.012 s | 27.4 |
| SCC (ours) | 0.0125 s | 34.6 |
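The speed comparison follows from the per-image times in Table 4. The short calculation below makes the ratios explicit; the throughput figures assume images are processed one at a time, which is a simplification.

```python
# Seconds per image on DTD, copied from Table 4.
time_per_image = {
    "R-TPT (64 views)": 0.37,
    "TTC": 0.012,
    "SCC": 0.0125,
}

# How many times faster SCC is than R-TPT.
speedup_vs_rtpt = time_per_image["R-TPT (64 views)"] / time_per_image["SCC"]

# Relative extra time SCC needs compared with TTC (0.04 means 4% slower).
overhead_vs_ttc = time_per_image["SCC"] / time_per_image["TTC"] - 1.0

# Images per second under one-at-a-time processing.
throughput = {name: 1.0 / t for name, t in time_per_image.items()}
```

The ratio comes out at 29.6, which the text rounds to 30×, while SCC takes roughly 4% longer per image than TTC.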

Ablation

Ablation across datasets

Robust accuracy per dataset for the four combinations of sec (semantic) and spa (spatial) consistency. Turning either module off causes a noticeable drop; turning both off collapses robustness — confirming the two components are complementary, not redundant.

BibTeX

@inproceedings{liu2026scc,
  title     = {Self-Calibrated Consistency can Fight Back for Adversarial Robustness in Vision-Language Models},
  author    = {Liu, Jiaxiang and Du, Jiawei and Liu, Xiao and Li, Shangyang and Ma, Songchen and Wang, Changshuo and Tiwari, Prayag and Xu, Mingkun},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026}
}