
Open-Source MagicData-RAMC: 180-Hour Conversational Speech Dataset in Mandarin Released

2022-04-14 21:48

BEIJING, April 14, 2022 /PRNewswire/ -- MagicHub, an open-source community for AI, has released a 180-hour conversational speech dataset in Mandarin for free, enriching the pool of open-source speech corpora and promoting the development of spoken language processing technology and conversational AI.

Data Profile

MagicData-RAMC is a collection of high-quality, richly annotated training data. It comprises 351 multi-turn Mandarin conversations recorded indoors on smartphones, with a total duration of 180 hours.

To reflect real-world conversation scenarios as closely as possible, MagicData-RAMC was collected with a balanced gender and geographic distribution and a diverse range of topics. The dataset has 663 speakers in total: 368 male and 295 female, with 334 from the north and 329 from the south.

The annotation information of each conversation includes transcribed text, voice activity timestamp, speaker information, recording information, and topic information. The speaker information includes gender, age, and geography, and the recording information includes environment and device.
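As a rough illustration of how the annotation fields listed above could be held in code, the following Python sketch models one conversation. Every class name, field name, type, and sample value here is a hypothetical choice made for illustration; the release does not describe the dataset's actual file layout or schema, so consult the download page for the real format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Speaker:
    speaker_id: str   # hypothetical identifier, not taken from the dataset
    gender: str
    age: int          # assumed to be a single number; the real format may differ
    geography: str    # e.g. "north" or "south"


@dataclass
class Recording:
    environment: str  # e.g. "indoor"
    device: str       # e.g. "smartphone"


@dataclass
class Segment:
    start: float      # voice activity start time, in seconds
    end: float        # voice activity end time, in seconds
    speaker_id: str
    text: str         # transcribed text for this stretch of speech


@dataclass
class Conversation:
    topic: str
    recording: Recording
    speakers: List[Speaker]
    segments: List[Segment] = field(default_factory=list)

    def speech_seconds(self) -> float:
        """Sum of segment durations; overlapping speech is counted once per speaker."""
        return sum(seg.end - seg.start for seg in self.segments)


def example_conversation() -> Conversation:
    """Build a tiny made-up conversation; all values are illustrative only."""
    return Conversation(
        topic="travel",
        recording=Recording(environment="indoor", device="smartphone"),
        speakers=[
            Speaker("spk_a", "female", 30, "north"),
            Speaker("spk_b", "male", 25, "south"),
        ],
        segments=[
            Segment(0.0, 2.0, "spk_a", "你好"),
            Segment(2.5, 5.5, "spk_b", "你好，最近怎么样？"),
        ],
    )
```

Keeping speaker and recording metadata separate from the timestamped segments is one way to let the same record serve speech recognition (text per segment) and speaker diarization (speaker per time span).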

MagicData-RAMC is currently available for download at

https://magichub.com/datasets/magicdata-ramc

Research Based on MagicData-RAMC

Magic Data, together with the Institute of Acoustics of the Chinese Academy of Sciences, Shanghai Jiao Tong University, and Northwestern Polytechnical University, has completed research on speech recognition, speaker diarization, and keyword search based on MagicData-RAMC. The work has been submitted to Interspeech 2022, a top conference in the field of speech.

A preprint is available on arXiv:

https://arxiv.org/abs/2203.16844

Challenge and Baseline

Together with the Institute of Acoustics of the Chinese Academy of Sciences and Jiangsu Normal University, Magic Data held the Magic Data ASR-SD Challenge from July to October 2021 to evaluate MagicData-RAMC.

The baseline and more information can be found at

https://github.com/MagicHub-io/Magic-Data-ASR-SD-Challenge

A follow-up challenge based on MagicData-RAMC will be launched soon. Stay updated through the MagicHub open-source community.

About MagicHub

The MagicHub community is an open-source data platform developed by Magic Data, dedicated to assisting AI developers with model training and to promoting the development of the open-source ecosystem.

For more information, contact open@magicdatatech.com

View original content: https://www.prnewswire.com/news-releases/open-source-magicdata-ramc-180-hour-conversational-speech-dataset-in-mandarin-released-301525875.html

Source: Magic Data