AV Benchmark: Audio‑Text and Audio‑Visual Generation Metrics

This project provides a comprehensive evaluation toolkit for assessing generated audio, text, and video content. It implements a range of metrics for audio-text and audio-visual generative tasks, including Fréchet distances, Inception scores, KL divergence, CLAP similarity, ImageBind scoring, and audiovisual synchronization (DeSync). Designed for research benchmarking, it facilitates evaluation of models that align audio with text or visual modalities.
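At the core of the Fréchet-distance metrics is the closed-form 2-Wasserstein distance between two Gaussians fitted to embedding sets (here, audio embeddings from PaSST, PANNs, or VGGish). Below is a minimal self-contained sketch of that computation; the function name and the toy data are illustrative, not the toolkit's actual API.

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2):

        FD = ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * sqrt(sigma1 @ sigma2))
    """
    diff = mu1 - mu2
    # Matrix square root of the covariance product; disp=False returns
    # (sqrtm, error_estimate) instead of printing warnings
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    # Numerical error can leave a tiny imaginary component; drop it
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2.0 * np.trace(covmean))

# Toy example: fit Gaussians to "reference" and "generated" embeddings
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(1000, 8))
fake = rng.normal(0.5, 1.0, size=(1000, 8))
mu_r, sig_r = real.mean(axis=0), np.cov(real, rowvar=False)
mu_f, sig_f = fake.mean(axis=0), np.cov(fake, rowvar=False)
print(frechet_distance(mu_r, sig_r, mu_f, sig_f))
```

The same formula underlies FID for images; only the embedding network changes per metric.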

Key Features

  • Computes Fréchet distances across audio embedding spaces (PaSST, PANNs, VGGish)
  • Calculates Inception scores for audio using PaSST and PANNs
  • Measures mean KL divergence for audio feature distributions
  • Evaluates audio-text similarity via LAION‑CLAP and MS‑CLAP models
  • Scores audio-visual alignment using ImageBind embeddings
  • Quantifies synchronization misalignment in seconds using Synchformer
  • Includes Python scripts for extracting features from text, audio, and video
  • Works with video datasets, integrating ffmpeg for frame-audio extraction
  • Compatible with Python 3.9+ and PyTorch 2.5.1+, with GPU acceleration support
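To illustrate the mean KL divergence item above: audio benchmarks of this kind typically compare classifier output distributions (e.g. from PaSST or PANNs) for paired reference and generated clips, averaging the per-pair KL. A minimal sketch, with illustrative function names and randomly generated logits standing in for real classifier outputs:

```python
import numpy as np

def softmax(logits):
    """Convert per-clip logits (n_clips, n_classes) to probabilities."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mean_kl(p_ref, p_gen, eps=1e-8):
    """Mean KL(p_ref || p_gen) over paired clips.

    p_ref, p_gen: arrays of shape (n_clips, n_classes) containing
    classifier probabilities for reference and generated audio.
    """
    p = np.clip(p_ref, eps, 1.0)
    q = np.clip(p_gen, eps, 1.0)
    return float(np.mean(np.sum(p * np.log(p / q), axis=1)))

# Toy example: random logits in place of real classifier outputs
rng = np.random.default_rng(1)
p_ref = softmax(rng.normal(size=(16, 10)))
p_gen = softmax(rng.normal(size=(16, 10)))
print(mean_kl(p_ref, p_gen))
```

A lower mean KL indicates that the generated audio triggers class distributions closer to those of the reference audio; identical distributions give exactly zero.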