ETVA is a text-to-video alignment evaluation framework that provides fine-grained assessment scores highly consistent with human judgment.
Precisely evaluating semantic alignment between text prompts and generated videos remains a challenge in Text-to-Video (T2V) generation. Existing text-to-video alignment metrics like CLIPScore only generate coarse-grained scores without fine-grained alignment details, and therefore fail to align with human preference. To address this limitation, we propose ETVA, a novel Evaluation method of Text-to-Video Alignment via fine-grained question generation and answering. First, a multi-agent system parses the prompt into a semantic scene graph and generates atomic questions from it. Then we design a knowledge-augmented multi-stage reasoning framework for question answering, in which an auxiliary LLM first retrieves relevant common-sense knowledge (e.g., physical laws), and a video LLM then answers the generated questions through a multi-stage reasoning mechanism. Extensive experiments demonstrate that ETVA achieves a Spearman's correlation coefficient of 58.47, a much higher correlation with human judgment than existing metrics, which attain only 31.0. We also construct a comprehensive benchmark specifically designed for text-to-video alignment evaluation, featuring 2k diverse prompts and 12k atomic questions spanning 10 categories. Through a systematic evaluation of 15 existing text-to-video models, we identify their key capabilities and limitations, paving the way for next-generation T2V generation.
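To make the reported agreement numbers concrete, here is a minimal sketch of how metric-human correlation can be computed, assuming paired per-video metric scores and human ratings; `human_correlation` is an illustrative helper reporting Spearman's rho on a 0-100 scale, not part of the released code.

```python
# Minimal sketch: Spearman rank correlation between a metric's per-video
# scores and human ratings, scaled to 0-100 as in the numbers quoted above.
from scipy.stats import spearmanr

def human_correlation(metric_scores, human_scores):
    rho, _ = spearmanr(metric_scores, human_scores)  # rank correlation in [-1, 1]
    return 100.0 * rho

# Hypothetical usage: three videos scored by a metric and rated by annotators.
print(human_correlation([0.75, 0.38, 1.00], [4.5, 2.0, 5.0]))
```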
ETVA consists of a multi-agent framework for generating atomic questions and a knowledge-augmented multi-stage reasoning framework for answering them.
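The end-to-end scoring flow sketched below is an illustration under assumed names: the question generator, knowledge retriever, and video LLM are passed in as callables rather than taken from the project's actual API, and the final aggregation (fraction of questions answered "yes") is an assumption consistent with the 1/8-step ETVA values in the table further down.

```python
from typing import Callable, List

def etva_score(
    prompt: str,
    video_path: str,
    generate_questions: Callable[[str], List[str]],   # multi-agent question generation
    retrieve_knowledge: Callable[[str], str],         # auxiliary LLM (e.g., physical laws)
    answer_question: Callable[[str, str, str], str],  # video LLM -> "yes" / "no"
) -> float:
    # Stage 1: parse the prompt into a semantic scene graph and emit
    # atomic questions (delegated to the injected generator).
    questions = generate_questions(prompt)

    # Stage 2: knowledge-augmented QA. For each question, retrieve relevant
    # common-sense context, then let the video LLM answer with it.
    answers = [answer_question(video_path, q, retrieve_knowledge(q)) for q in questions]

    # Aggregation (assumed): fraction of atomic questions answered "yes".
    return sum(a == "yes" for a in answers) / len(questions)
```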
We construct ETVABench-2k for evaluating open-source Text-to-Video models and ETVABench-105 for evaluating both open-source and closed-source Text-to-Video models. A question-driven classification method groups these prompts into 10 distinct categories.
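How prompts are routed to categories is only summarized above; the sketch below illustrates one plausible question-driven scheme in which a prompt inherits the category its atomic questions most often probe. The keyword map is a placeholder, not the benchmark's actual 10-category taxonomy.

```python
# Placeholder taxonomy: maps a keyword found in an atomic question to a
# hypothetical category name. The real benchmark defines 10 categories.
from collections import Counter

KEYWORD_TO_CATEGORY = {
    "color": "appearance",   # hypothetical entry
    "moving": "motion",      # hypothetical entry
    "fall": "physics",       # hypothetical entry
}

def classify_prompt(questions: list[str]) -> str:
    """Route a prompt to the category its atomic questions most often probe."""
    votes = Counter(
        category
        for q in questions
        for keyword, category in KEYWORD_TO_CATEGORY.items()
        if keyword in q.lower()
    )
    return votes.most_common(1)[0][0] if votes else "other"

# Hypothetical usage:
print(classify_prompt(["Is the ball falling?", "Does the ball fall to the ground?"]))
```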
We visualize the evaluation results of 10 open-source Text-to-Video models and 5 closed-source Text-to-Video models across 10 dimensions on ETVABench-105.
We visualize the evaluation results of 10 open-source Text-to-Video models across 10 dimensions on ETVABench-2k.
Per-video scores from the qualitative comparison of alignment metrics (one row per sampled video):

| Video | BLIP_BLEU | CLIPScore | VideoScore | ETVA |
|-------|-----------|-----------|------------|------|
| 1  | 0.264 | 0.384 | 2.110 | 0.750 |
| 2  | 0.163 | 0.384 | 1.960 | 0.375 |
| 3  | 0.081 | 0.374 | 1.690 | 1.000 |
| 4  | 0.194 | 0.361 | 1.870 | 0.875 |
| 5  | 0.176 | 0.373 | 2.004 | 0.500 |
| 6  | 0.183 | 0.320 | 2.006 | 0.375 |
| 7  | 0.132 | 0.366 | 2.330 | 0.375 |
| 8  | 0.203 | 0.321 | 2.010 | 0.375 |
| 9  | 0.139 | 0.323 | 2.213 | 0.625 |
| 10 | 0.264 | 0.318 | 2.271 | 0.250 |
| 11 | 0.177 | 0.343 | 2.434 | 1.000 |
| 12 | 0.268 | 0.347 | 2.477 | 0.500 |
| 13 | 0.201 | 0.378 | 1.946 | 0.250 |
| 14 | 0.212 | 0.344 | 1.767 | 0.500 |
| 15 | 0.157 | 0.331 | 2.354 | 0.250 |
| 16 | 0.177 | 0.351 | 2.545 | 1.000 |
| 17 | 0.128 | 0.324 | 2.713 | 0.250 |
| 18 | 0.277 | 0.321 | 2.707 | 0.500 |
| 19 | 0.277 | 0.384 | 2.590 | 0.250 |
| 20 | 0.264 | 0.339 | 2.666 | 0.250 |