Visually Informed Binaural Audio Generation without Binaural Audios

Computer Vision and Pattern Recognition (CVPR) 2021


Stereophonic audio, especially binaural audio, plays an essential role in immersive viewing environments. Recent re- search has explored generating stereophonic audios guided by visual cues and multi-channel audio collections in a fully- supervised manner. However, due to the requirement of professional recording devices, existing datasets are limited in scale and variety, which impedes the generalization of supervised methods to real-world scenarios. In this work, we propose PseudoBinaural, an effective pipeline that is free of binaural recordings. The key insight is to carefully build pseudo visual-stereo pairs with mono data for training. Specifically, we leverage spherical harmonic decomposition and head-related impulse response (HRIR) to identify the relationship between the location of a sound source and the received binaural audio. Then in the visual modality, corre- sponding visual cues of the mono data are manually placed at sound source positions to form the pairs. Compared to fully-supervised paradigms, our binaural-recording-free pipeline shows great stability in the cross-dataset evalua- tion and comparable performance under subjective prefer- ence. Moreover, combined with binaural recorded data, our method is able to further boost the performance of binaural audio generation under supervised settings.

Demo Video



  title={Visually Informed Binaural Audio Generation without Binaural Audios},
  author={Xu, Xudong and Zhou, Hang and Liu, Ziwei and Dai, Bo and Wang, Xiaogang and Lin, Dahua },
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)},