This project provides a comprehensive evaluation toolkit for assessing generated audio, text, and video content. It implements a range of metrics for audio-text and audio-visual generative tasks, including Fréchet distances, Inception scores, KL divergence, CLAP similarity, ImageBind scoring, and audio-visual synchronization (DeSync). Designed for research benchmarking, it facilitates evaluation of models that align audio with text or visual modalities.
Key Features
- Computes Fréchet distances across audio embedding spaces (PaSST, PANNs, VGGish); see the sketch after this list
- Calculates Inception scores for audio using PaSST and PANNs (classifier-probability sketch below)
- Measures mean KL divergence between audio feature distributions (covered in the same sketch)
- Evaluates audio-text similarity via LAION-CLAP and MS-CLAP models (cosine-similarity sketch below)
- Scores audio-visual alignment using ImageBind embeddings (same cosine-similarity sketch)
- Quantifies audio-visual synchronization misalignment in seconds using Synchformer (DeSync)
- Includes Python scripts for extracting features from text, audio, and video
- Works with video datasets, using ffmpeg to extract frames and audio tracks (extraction sketch below)
- Requires Python 3.9+ and PyTorch 2.5.1+; supports GPU acceleration
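Under the usual Gaussian assumption, each Fréchet distance reduces to a closed form over the means and covariances of the two embedding sets. A minimal sketch of that computation (the function name and array layout are illustrative, not this toolkit's actual API):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Fréchet distance between two (num_samples, dim) embedding matrices,
    e.g. PaSST features for generated vs. reference audio."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # sqrtm can return tiny imaginary components from numerical noise.
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```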
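The Inception score and mean KL divergence both operate on classifier probabilities (e.g. softmaxed PaSST or PANNs logits). A sketch under that assumption, with hypothetical helper names:

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """probs: (num_samples, num_classes) class probabilities, rows summing to 1.

    IS = exp( E_x [ KL(p(y|x) || p(y)) ] ), where p(y) is the marginal.
    """
    marginal = probs.mean(axis=0, keepdims=True)
    kl = probs * (np.log(probs + eps) - np.log(marginal + eps))
    return float(np.exp(kl.sum(axis=1).mean()))

def mean_kl(p_ref: np.ndarray, p_gen: np.ndarray, eps: float = 1e-12) -> float:
    """Row-wise KL(p_ref || p_gen), averaged over paired reference/generated samples."""
    kl = p_ref * (np.log(p_ref + eps) - np.log(p_gen + eps))
    return float(kl.sum(axis=1).mean())
```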
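Both the CLAP audio-text score and the ImageBind audio-visual score are, at their core, cosine similarities between paired embeddings from two modalities; producing those embeddings requires the respective pretrained models, which are omitted here. A minimal scoring sketch:

```python
import torch
import torch.nn.functional as F

def pairwise_score(emb_x: torch.Tensor, emb_y: torch.Tensor) -> torch.Tensor:
    """Cosine similarity per pair for two (batch, dim) embedding tensors,
    e.g. CLAP audio vs. text embeddings, or ImageBind audio vs. video."""
    return F.cosine_similarity(emb_x, emb_y, dim=-1)
```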
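For video datasets, the audio track can be pulled out with a plain ffmpeg call; the mono 16 kHz PCM settings below are illustrative defaults, not necessarily the toolkit's actual extraction configuration:

```python
import subprocess

def extract_audio(video_path: str, wav_path: str, sample_rate: int = 16000) -> None:
    """Strip the audio track from a video as a mono 16-bit PCM WAV file."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn",                      # drop the video stream
         "-ac", "1",                 # downmix to mono
         "-ar", str(sample_rate),    # resample
         "-acodec", "pcm_s16le",     # 16-bit PCM WAV
         wav_path],
        check=True,
    )
```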