“Transformers: The best architecture that has come up recently in AI” – Andrej Karpathy, on Lex Fridman's podcast
Transformers are a neural network architecture that has gained popularity in recent years for tasks such as natural language processing and image classification. They are particularly well suited to video analytics because they can handle long sequences and model relationships between elements that are far apart in a video. Their efficiency, performance, and ability to handle long sequences make them a better choice for video analytics than LSTMs.
Video Vision Transformers
“Video Vision Transformers are an extension of Vision Transformers (ViT) that model pairwise interactions between all spatio-temporal tokens.”
Tokens are mapped using tubelet embedding: non-overlapping spatio-temporal “tubes” are extracted from the input video and linearly projected to obtain tokens.
![](https://static.wixstatic.com/media/8c2927_5f91daa345d24b3eb8ac997211081645~mv2.png/v1/fill/w_409,h_222,al_c,q_85,enc_auto/8c2927_5f91daa345d24b3eb8ac997211081645~mv2.png)
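As a rough sketch of tubelet embedding in TensorFlow (our own illustration, not the code we shipped): a 3D convolution whose kernel size equals its stride extracts non-overlapping tubes and linearly projects each one to a token. The tube size (2, 16, 16), embedding dimension of 128, and input resolution below are illustrative assumptions.

```python
import tensorflow as tf

class TubeletEmbedding(tf.keras.layers.Layer):
    """Extracts non-overlapping spatio-temporal tubes and projects them to tokens."""

    def __init__(self, embed_dim=128, tube_size=(2, 16, 16), **kwargs):
        super().__init__(**kwargs)
        # kernel_size == strides means the tubes do not overlap; the Conv3D
        # acts as a linear projection of each tube.
        self.projection = tf.keras.layers.Conv3D(
            filters=embed_dim, kernel_size=tube_size,
            strides=tube_size, padding="valid")
        self.flatten = tf.keras.layers.Reshape(target_shape=(-1, embed_dim))

    def call(self, videos):
        # videos: (batch, frames, height, width, channels)
        tubes = self.projection(videos)   # (batch, T', H', W', embed_dim)
        return self.flatten(tubes)        # (batch, num_tokens, embed_dim)

# Example: 32 frames of 224x224 RGB -> (32/2) * (224/16) * (224/16) = 3136 tokens.
tokens = TubeletEmbedding()(tf.zeros((1, 32, 224, 224, 3)))
```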
Following the original BERT paper, an additional CLS token is prepended to the set of embedded tokens; it aggregates global video information and is used for the final classification. The Video Vision Transformer architecture is shown below:
![](https://static.wixstatic.com/media/8c2927_433d778f5d3946fab1bff040bd57bfea~mv2.png/v1/fill/w_837,h_433,al_c,q_90,enc_auto/8c2927_433d778f5d3946fab1bff040bd57bfea~mv2.png)
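A rough sketch of the token-preparation step (again our own illustration rather than the paper's code): a learnable CLS token is prepended and learned positional embeddings are added. The layer name, shapes, and initializers are assumptions.

```python
import tensorflow as tf

class AddCLSAndPosition(tf.keras.layers.Layer):
    """Prepends a learnable CLS token and adds learned positional embeddings."""

    def __init__(self, num_tokens, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.cls_token = self.add_weight(
            name="cls_token", shape=(1, 1, embed_dim), initializer="zeros")
        self.pos_embedding = self.add_weight(
            name="pos_embedding", shape=(1, num_tokens + 1, embed_dim),
            initializer="random_normal")

    def call(self, tokens):
        batch_size = tf.shape(tokens)[0]
        cls = tf.tile(self.cls_token, [batch_size, 1, 1])
        tokens = tf.concat([cls, tokens], axis=1)   # (batch, num_tokens + 1, dim)
        return tokens + self.pos_embedding

# Example, continuing from the tubelet tokens above (3136 tokens of dim 128).
tokens = AddCLSAndPosition(num_tokens=3136, embed_dim=128)(tokens)
```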
The Transformer Encoder is made up of several identical blocks stacked in sequence. Each block begins with a Multi-Head Self-Attention (MSA) layer and ends with a Multi-Layer Perceptron (MLP) block. In this architecture every token attends to every other spatio-temporal token, so the attention cost is quadratic in the number of tokens. To alleviate this, we can use a Factorised Encoder model, shown in the figure after the encoder-block sketch below.
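A single encoder block can be sketched as a small Keras function. The pre-norm ordering, GELU activation, and MLP expansion factor of 4 used here are common ViT defaults and are assumptions on our part, not details taken from this post.

```python
import tensorflow as tf

def transformer_encoder_block(x, num_heads=8, embed_dim=128, mlp_ratio=4):
    """One pre-norm encoder block: Multi-Head Self-Attention followed by an
    MLP, each wrapped in a residual connection. Hyperparameters are illustrative."""
    # Multi-Head Self-Attention over all tokens in the sequence
    # (cost is quadratic in the number of tokens).
    h = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x)
    h = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=embed_dim // num_heads)(h, h)
    x = x + h
    # MLP block: two dense layers with a GELU non-linearity in between.
    h = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x)
    h = tf.keras.layers.Dense(embed_dim * mlp_ratio, activation="gelu")(h)
    h = tf.keras.layers.Dense(embed_dim)(h)
    return x + h

# Example: a stack of 4 blocks over 3137 tokens (3136 tubelet tokens + CLS).
x = tf.zeros((1, 3137, 128))
for _ in range(4):
    x = transformer_encoder_block(x)
```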
![](https://static.wixstatic.com/media/8c2927_1c16e42fa2924d2a8af5e098a081cd57~mv2.png/v1/fill/w_602,h_359,al_c,q_85,enc_auto/8c2927_1c16e42fa2924d2a8af5e098a081cd57~mv2.png)
This model consists of two transformer encoders in series: the first models interactions between tokens extracted from the same temporal index to produce a latent representation per time index; the second models interactions between time steps. It thus corresponds to a “late fusion” of spatial and temporal information.
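A much-simplified sketch of this late-fusion scheme, reusing the transformer_encoder_block function from the sketch above. The (batch, time, space, dim) token layout, the layer counts, and the mean pooling used in place of per-encoder CLS tokens are our own simplifying assumptions.

```python
import tensorflow as tf

def factorised_encoder(tokens, num_spatial_layers=2, num_temporal_layers=2,
                       num_heads=8, embed_dim=128):
    """Late fusion: a spatial encoder within each temporal index, then a
    temporal encoder across time steps. Simplified illustration only."""
    t, s = tokens.shape[1], tokens.shape[2]     # tokens: (batch, time, space, dim)

    # 1) Spatial encoder: tokens attend only to others from the same time index.
    x = tf.reshape(tokens, (-1, s, embed_dim))  # fold time into the batch axis
    for _ in range(num_spatial_layers):
        x = transformer_encoder_block(x, num_heads, embed_dim)
    frame_repr = tf.reduce_mean(x, axis=1)      # one latent vector per time index
    frame_repr = tf.reshape(frame_repr, (-1, t, embed_dim))

    # 2) Temporal encoder: frame-level representations attend across time.
    for _ in range(num_temporal_layers):
        frame_repr = transformer_encoder_block(frame_repr, num_heads, embed_dim)
    return tf.reduce_mean(frame_repr, axis=1)   # global video representation

# Example: batch of 2 videos, 16 time indices, 196 spatial tokens of dim 128.
video_repr = factorised_encoder(tf.zeros((2, 16, 196, 128)))  # -> (2, 128)
```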
Implementation
We implemented both variants of the model in TensorFlow, and for the input data pipeline we use DeepMind's DMVR video pipeline readers. We treat this model as the baseline against which we compare all of our other video analytics implementations. This model is part of the 'FramebyFrame' Video Analytics stack.
Reference
A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lucic, and C. Schmid, “ViViT: A Video Vision Transformer,” Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.