This repository contains the code for a master's thesis project on VideoBERT, a model that learns joint representations of video and text. The project covers the full pipeline: collecting data from the HowTo100M dataset, extracting visual features with the I3D model, clustering those features into discrete visual tokens, and training a VideoBERT model for masked language modeling and video captioning. The implementation is built on Hugging Face's Transformers library.
Key Features
Data Collection: Gathering videos and text annotations from the HowTo100M dataset (see the download sketch below).
Data Transformation: Adjusting video frame rates and adding punctuation to the text annotations (frame-rate sketch below).
Feature Extraction: Using the I3D model to extract clip-level features from the videos (sketch below).
Clustering: Applying hierarchical k-means clustering to quantize the video features into discrete visual tokens (sketch below).
Model Training: Training a VideoBERT model on the processed data via masked language modeling (vocabulary sketch below).
Evaluation: Assessing the model's performance on the YouCookII validation dataset (scoring sketch below).
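The repository's download tooling isn't shown in this overview, so the following is only a minimal sketch of fetching HowTo100M clips by their YouTube ids with yt-dlp. The CSV name `HowTo100M_v1.csv` and its `video_id` column are assumptions about the dataset release; adjust them to your copy.

```python
import csv

from yt_dlp import YoutubeDL  # pip install yt-dlp

def download_videos(csv_path: str, out_dir: str, limit: int = 10) -> None:
    """Download the first `limit` videos listed in the HowTo100M CSV."""
    opts = {
        "format": "mp4",
        "outtmpl": f"{out_dir}/%(id)s.%(ext)s",  # save as <video_id>.mp4
        "quiet": True,
    }
    with open(csv_path, newline="") as f, YoutubeDL(opts) as ydl:
        for i, row in enumerate(csv.DictReader(f)):
            if i >= limit:
                break
            ydl.download([f"https://www.youtube.com/watch?v={row['video_id']}"])

download_videos("HowTo100M_v1.csv", "videos", limit=5)
```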
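For the frame-rate adjustment, re-encoding each video at a fixed rate with ffmpeg is the usual approach; the 10 fps target below is a placeholder rather than this project's confirmed setting, and punctuation restoration (typically done with an off-the-shelf punctuation model) is not shown.

```python
import subprocess
from pathlib import Path

def resample(video: Path, out_dir: Path, fps: int = 10) -> None:
    """Re-encode `video` at a fixed frame rate using ffmpeg's fps filter."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video),
         "-filter:v", f"fps={fps}", str(out_dir / video.name)],
        check=True,
    )

for mp4 in Path("videos").glob("*.mp4"):
    resample(mp4, Path("videos_10fps"))
```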
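A minimal feature-extraction sketch, assuming the Kinetics-pretrained I3D exposed through PyTorchVideo's torch.hub entry point. Swapping the head's `proj` layer for `Identity` (so the network returns 2048-d pooled features instead of class logits) relies on PyTorchVideo's ResNet head layout; inspect the loaded model if your checkpoint differs.

```python
import torch

# Kinetics-400 pretrained I3D from PyTorchVideo's hub.
model = torch.hub.load("facebookresearch/pytorchvideo", "i3d_r50",
                       pretrained=True)
model.blocks[-1].proj = torch.nn.Identity()  # features, not logits
model.eval()

# One clip: (batch, channels, frames, height, width).
clip = torch.randn(1, 3, 8, 224, 224)
with torch.no_grad():
    feats = model(clip)  # -> shape (1, 2048)
```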
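The clustering step can be sketched as recursive k-means over the extracted features, in the style of VideoBERT's tokenizer: k clusters per level, recursing `depth` times, for k**depth possible leaf ids (the VideoBERT paper uses k=12, depth=4, i.e. 20736 visual "words"). The exact settings here are assumptions; the recursion is built on scikit-learn's KMeans.

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans(feats: np.ndarray, k: int = 12,
                        depth: int = 4) -> np.ndarray:
    """Assign each feature a token id in [0, k**depth) by recursively
    clustering into k groups, depth levels deep."""
    tokens = np.zeros(len(feats), dtype=np.int64)

    def split(idx: np.ndarray, level: int) -> None:
        if level == depth:
            return
        if len(idx) < k:  # too few points to split; pad the id instead
            tokens[idx] *= k ** (depth - level)
            return
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(feats[idx])
        for c in range(k):
            sub = idx[labels == c]
            tokens[sub] = tokens[sub] * k + c  # append one base-k digit
            split(sub, level + 1)

    split(np.arange(len(feats)), 0)
    return tokens

tokens = hierarchical_kmeans(np.random.randn(2000, 2048).astype(np.float32))
```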
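For training, VideoBERT's recipe appends one new token per visual cluster to a pretrained BERT vocabulary and continues masked-language-model training over interleaved text and visual tokens. Below is a minimal setup sketch with Hugging Face Transformers; the `[VID_i]` naming scheme is a placeholder, not the project's actual convention.

```python
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# One token per visual cluster: 12**4 = 20736 leaves from the
# clustering step; "[VID_i]" is a placeholder naming scheme.
video_tokens = [f"[VID_{i}]" for i in range(12 ** 4)]
tokenizer.add_tokens(video_tokens)
model.resize_token_embeddings(len(tokenizer))

# A training example interleaves text and visual tokens, e.g.
# "cut the tomato [SEP] [VID_301] [VID_87]". From here, standard
# masked-LM training (e.g. transformers' Trainer with
# DataCollatorForLanguageModeling) applies unchanged.
```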
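Evaluation details aren't spelled out in this overview; for the captioning task, one common YouCookII metric is corpus-level BLEU-4, sketched here with NLTK (the toy reference/hypothesis pair is illustrative only).

```python
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu

# Each sample has a list of reference captions and one hypothesis.
references = [[["melt", "butter", "in", "a", "pan"]]]
hypotheses = [["melt", "the", "butter", "in", "a", "pan"]]

bleu4 = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {bleu4:.3f}")
```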