Chinese-LiPS Dataset

A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides

Dataset Overview

License: CC BY-NC-SA-4.0

The Chinese-LiPS dataset is a multimodal audio-visual speech recognition (AVSR) dataset that pairs Chinese speech audio with two visual streams: the speaker's lip-reading video and the accompanying presentation-slide video.

Dataset Splits

| Split      | Duration (hours) | Segments | Speakers |
|------------|------------------|----------|----------|
| Train      | 85.37            | 30,341   | 175      |
| Validation | 5.35             | 1,959    | 11       |
| Test       | 10.12            | 3,908    | 21       |
| All        | 100.84           | 36,208   | 207      |
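The split statistics above can be mirrored in code for sanity-checking a downloaded copy of the dataset. This is a minimal sketch; the dictionary simply restates the table, and the helper verifies that the per-split figures sum to the "All" row:

```python
# Split statistics for Chinese-LiPS, mirroring the table above.
SPLITS = {
    "train":      {"hours": 85.37, "segments": 30_341, "speakers": 175},
    "validation": {"hours": 5.35,  "segments": 1_959,  "speakers": 11},
    "test":       {"hours": 10.12, "segments": 3_908,  "speakers": 21},
}

def totals(splits):
    """Sum duration, segment count, and speaker count across splits."""
    return {
        "hours": round(sum(s["hours"] for s in splits.values()), 2),
        "segments": sum(s["segments"] for s in splits.values()),
        "speakers": sum(s["speakers"] for s in splits.values()),
    }

# Matches the "All" row: 100.84 hours, 36,208 segments, 207 speakers.
print(totals(SPLITS))
```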

Dataset Access

The dataset is distributed through Hugging Face Datasets; see the project page for download links.

Dataset Organization

The dataset includes three main modalities (audio, slide video, and lip-reading video), organized into separate files.
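As an illustration of how the three modalities might be paired per segment, the sketch below builds the expected file paths for one segment. The directory layout, file names, and extensions here are hypothetical assumptions, not the dataset's documented structure; adjust them to match the released archives:

```python
from pathlib import Path

# Hypothetical layout: one directory per split, one subdirectory per
# modality, one file per segment. The actual Chinese-LiPS naming may differ.
ROOT = Path("Chinese-LiPS")

def segment_paths(split: str, segment_id: str) -> dict:
    """Return audio, slide-video, and lip-video paths for one segment."""
    base = ROOT / split
    return {
        "audio": base / "audio" / f"{segment_id}.wav",
        "slide": base / "slide" / f"{segment_id}.mp4",
        "lip":   base / "lip"   / f"{segment_id}.mp4",
    }

paths = segment_paths("train", "spk001_seg0001")
```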

Citation

@misc{zhao2025chineselipschineseaudiovisualspeech,
  title={Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides}, 
  author={Jinghua Zhao and Yuhang Jia and Shiyao Wang and Jiaming Zhou and Hui Wang and Yong Qin},
  year={2025},
  eprint={2504.15066},
  archivePrefix={arXiv},
  primaryClass={cs.MM},
  url={https://arxiv.org/abs/2504.15066}
}

Contact

If you have any questions or suggestions, feel free to reach out via email at zhao1jing1hua@gmail.com.