
Transformers for Video Analytics

vijay m
“Transformers: the best architecture that has come up recently in AI.” (Andrej Karpathy, on Lex Fridman's podcast)

Transformers are a neural network architecture that has gained popularity in recent years for tasks such as natural language processing and image classification. They are particularly well suited to video analytics because they can handle long sequences of data and model relationships between the different elements of a video. Their efficiency, performance, and ability to handle long sequences make them a better choice for video analytics than LSTMs.


Video Vision Transformers


“Video Vision Transformers are an extension of Vision Transformers (ViT) that model pairwise interactions between all spatio-temporal tokens.”

Tokens can be obtained with tubelet embedding: non-overlapping spatio-temporal “tubes” are extracted from the input video, and each tube is linearly projected to obtain a token.
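As a concrete illustration, here is a minimal TensorFlow sketch of tubelet embedding. This is not the implementation from this post; the tube size of (2, 16, 16) and the embedding dimension of 128 are illustrative assumptions. The key idea is that a 3D convolution whose kernel size equals its stride reads each non-overlapping tube exactly once and linearly projects it:

```python
import tensorflow as tf

class TubeletEmbedding(tf.keras.layers.Layer):
    """Extracts non-overlapping spatio-temporal tubes and projects them to tokens."""

    def __init__(self, embed_dim=128, tube_size=(2, 16, 16)):
        super().__init__()
        # kernel_size == strides makes the tubes non-overlapping.
        self.projection = tf.keras.layers.Conv3D(
            filters=embed_dim,
            kernel_size=tube_size,
            strides=tube_size,
            padding="valid",
        )
        self.flatten = tf.keras.layers.Reshape((-1, embed_dim))

    def call(self, videos):
        # videos: (batch, frames, height, width, channels)
        tubes = self.projection(videos)  # (batch, t', h', w', embed_dim)
        return self.flatten(tubes)       # (batch, num_tokens, embed_dim)

# Example: 32 frames of 64x64 RGB video -> a (1, 256, 128) token sequence.
tokens = TubeletEmbedding()(tf.zeros((1, 32, 64, 64, 3)))
```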





Drawing cues from the original BERT paper, an additional CLS token is added to the set of embedded tokens; it is responsible for aggregating global video information and for the final classification.
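For illustration, a learnable CLS token can be prepended to the token sequence with a small custom layer like the hedged sketch below; the layer name and initializer are assumptions, not part of this post's implementation:

```python
class AddClassToken(tf.keras.layers.Layer):
    """Prepends a learnable CLS token, following the BERT convention."""

    def build(self, input_shape):
        # One learnable vector, broadcast across the batch at call time.
        self.cls_token = self.add_weight(
            name="cls_token",
            shape=(1, 1, input_shape[-1]),
            initializer="random_normal",
            trainable=True,
        )

    def call(self, tokens):
        batch_size = tf.shape(tokens)[0]
        cls = tf.tile(self.cls_token, [batch_size, 1, 1])
        # The CLS token sits at index 0; after the encoder, its output
        # feeds the classification head.
        return tf.concat([cls, tokens], axis=1)
```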



The Transformer encoder itself is a stack of identical blocks. Each block begins with a Multi-Head Self-Attention (MSA) layer and ends with a Multi-Layer Perceptron (MLP) block. In this architecture every spatio-temporal token attends to every other one, so the attention cost grows quadratically with the number of tokens. To alleviate this, we can use a Factorised Encoder model.
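To make the block structure concrete, here is a hedged Keras sketch of one pre-norm encoder block. The head count, embedding width, and MLP width are illustrative assumptions rather than values from our implementation:

```python
class TransformerBlock(tf.keras.layers.Layer):
    """One encoder block: pre-norm MSA and MLP, each with a residual connection."""

    def __init__(self, embed_dim=128, num_heads=8, mlp_dim=256):
        super().__init__()
        self.norm1 = tf.keras.layers.LayerNormalization()
        # Self-attention over all tokens: this is the quadratic-cost step.
        self.msa = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim // num_heads)
        self.norm2 = tf.keras.layers.LayerNormalization()
        self.mlp = tf.keras.Sequential([
            tf.keras.layers.Dense(mlp_dim, activation="gelu"),
            tf.keras.layers.Dense(embed_dim),
        ])

    def call(self, x):
        h = self.norm1(x)
        x = x + self.msa(h, h)               # MSA + residual
        return x + self.mlp(self.norm2(x))   # MLP + residual
```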



This model consists of two transformer encoders in series. The first models interactions between tokens extracted from the same temporal index, producing one latent representation per time index; the second models interactions between those time steps. It thus corresponds to a “late fusion” of spatial and temporal information.
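The data flow can be sketched as follows, reusing the TransformerBlock sketch above. This is an assumed simplification (mean pooling per frame, fixed block counts, layers built inline), not our actual model:

```python
def factorised_encoder(tokens, num_spatial=2, num_temporal=2):
    # tokens: (batch, time, tokens_per_frame, embed_dim)
    t, n, d = tokens.shape[1], tokens.shape[2], tokens.shape[3]

    # 1) Spatial encoder: fold time into the batch axis so attention
    #    only spans tokens that share the same temporal index.
    x = tf.reshape(tokens, (-1, n, d))
    for _ in range(num_spatial):
        x = TransformerBlock(embed_dim=d)(x)
    frames = tf.reshape(tf.reduce_mean(x, axis=1), (-1, t, d))  # one latent per time index

    # 2) Temporal encoder: attention across time steps only ("late fusion").
    for _ in range(num_temporal):
        frames = TransformerBlock(embed_dim=d)(frames)
    return tf.reduce_mean(frames, axis=1)  # video-level representation

# Example: 8 time indices, 16 tokens each, 128-dim embeddings -> (2, 128).
features = factorised_encoder(tf.random.normal((2, 8, 16, 128)))
```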


Implementation


We implemented both variants of the model in TensorFlow, and for the input data pipeline we use DeepMind's DMVR video pipeline readers. We treat this model as the baseline against which we compare all of our other video analytics implementations. This model is part of the 'FramebyFrame' video analytics stack.

Reference

A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lucic, and C. Schmid, “ViViT: A Video Vision Transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.


