Chinese-LiPS Dataset

A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides

Dataset Overview

License: CC BY-NC-SA-4.0

The Chinese-LiPS dataset is a multimodal audio-visual speech recognition (AVSR) dataset that pairs Chinese speech audio with two visual streams: the speaker's lip-reading video and the accompanying presentation-slide video.

Dataset Splits

| Split      | Duration (hours) | Segments | Speakers |
|------------|------------------|----------|----------|
| Train      | 85.37            | 30,341   | 175      |
| Validation | 5.35             | 1,959    | 11       |
| Test       | 10.12            | 3,908    | 21       |
| All        | 100.84           | 36,208   | 207      |
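The split statistics above can be mirrored in code for sanity-checking a downloaded copy of the dataset. This is a minimal sketch; the dictionary simply restates the table, and the helper verifies that the per-split figures sum to the "All" row:

```python
# Split statistics for Chinese-LiPS, mirroring the table above.
SPLITS = {
    "train":      {"hours": 85.37, "segments": 30_341, "speakers": 175},
    "validation": {"hours": 5.35,  "segments": 1_959,  "speakers": 11},
    "test":       {"hours": 10.12, "segments": 3_908,  "speakers": 21},
}

def totals(splits):
    """Sum duration, segment count, and speaker count across splits."""
    return {
        "hours": round(sum(s["hours"] for s in splits.values()), 2),
        "segments": sum(s["segments"] for s in splits.values()),
        "speakers": sum(s["speakers"] for s in splits.values()),
    }

# Matches the "All" row: 100.84 hours, 36,208 segments, 207 speakers.
print(totals(SPLITS))
```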

Dataset Access

The dataset is distributed through Hugging Face Datasets; see the project page for download links.

Dataset Organization

The dataset includes three main modalities (audio, slide video, and lip-reading video), organized into separate files.
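As an illustration of how the three modalities might be paired per segment, the sketch below builds the expected file paths for one segment. The directory layout, file names, and extensions here are hypothetical assumptions, not the dataset's documented structure; adjust them to match the released archives:

```python
from pathlib import Path

# Hypothetical layout: one directory per split, one subdirectory per
# modality, one file per segment. The actual Chinese-LiPS naming may differ.
ROOT = Path("Chinese-LiPS")

def segment_paths(split: str, segment_id: str) -> dict:
    """Return audio, slide-video, and lip-video paths for one segment."""
    base = ROOT / split
    return {
        "audio": base / "audio" / f"{segment_id}.wav",
        "slide": base / "slide" / f"{segment_id}.mp4",
        "lip":   base / "lip"   / f"{segment_id}.mp4",
    }

paths = segment_paths("train", "spk001_seg0001")
```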

Citation

@misc{zhao2025chineselipschineseaudiovisualspeech,
  title={Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides}, 
  author={Jinghua Zhao and Yuhang Jia and Shiyao Wang and Jiaming Zhou and Hui Wang and Yong Qin},
  year={2025},
  eprint={2504.15066},
  archivePrefix={arXiv},
  primaryClass={cs.MM},
  url={https://arxiv.org/abs/2504.15066}
}

Contact

If you have any questions or suggestions, feel free to reach out via email at zhao1jing1hua@gmail.com.