Learning Binary Residual Representations for Domain-specific Video Streaming


This post briefly summarizes our work on using deep learning to improve video streaming quality for existing video compression standards. The work is published as an AAAI 2018 paper (joint work with Yi-Hsuan Tsai, Deqing Sun, Ming-Hsuan Yang, and Jan Kautz). Some results can be found in the accompany video.


Existing video compression standards (e.g., MPEG4, H.264, and HEVC) can effectively compress most of the information in a video. What are left uncompressed are the residual images, which are the difference between the compressed and original videos. The residual images are difficult to compress because they contain highly non-linear, domain-specific patterns. In this work, we ask the following two questions, hoping that they can lead us to better compress videos for video streaming.

  1. Whether one can improve existing video compression algorithms to achieve a better visual quality using the same bandwidth if one is willing to limit the use of the resulting algorithm in a domain-specific manner.
  2. Whether the improved algorithm can be seamlessly integrated to existing video compression standard.

Why 1? We ask the first question because many interesting video streaming services are domain-specific. For example, as using video game streaming services (e.g., NVIDIA GeForce Now), the game videos are first rendered in the GPU server. They are then compressed and delivered to the end user. Over a period of hours, the videos to be streamed are all in the same domain.

Why 2? We ask for a seamless integration because we want to leverage existing video compression infrastructure, including those hardware-optimized computation and well-established software stacks.


We believe that we have come out with a hybrid system to address the two questions mentioned above. In our hybrid system, we first apply an existing video compression algorithm (e.g., H.264) to compress domain-specific videos and train a binary autoencoder (the latent representations are either 0 or 1) to encode the resulting residual images into binary representations in frame-by-frame. We then apply Huffman coding to losslessly compress the binary representations. The compressed binary representations can be sent to the user in the meta data field in the existing video streaming network packet. This way, our system is compatible the existing video streaming standard. We illustrate our system design in the following figure.



This allows us to achieve a better visual quality using the same channel bandwidth. Specifically, we can reduce the bandwidth assigned to the existing video compression algorithm and use the saved bandwidth for transmitting the binary residual representation computed by the binary autoencoder. Our experiment results show that by this way, we can improve the visual quality up to 1.7 db using this hybrid system. The results for the SkyRim game is show in the following figure (click and zoom in for a better view). For more details about the algorithm and experiment results, please check out our paper.