BEIJING, April 14, 2022 /PRNewswire/ -- MagicHub, an open-source community for AI, has released a 180-hour Mandarin conversational speech dataset free of charge, enriching the open-source speech corpus and promoting the development of spoken language processing technology and conversational AI.
Data Profile
MagicData-RAMC is a collection of high-quality, richly annotated training data comprising 351 multi-turn Mandarin conversations recorded indoors on smartphones, with a total duration of 180 hours.
To reflect real-world conversation scenarios as closely as possible, the collection process for MagicData-RAMC ensured a balanced gender and geographic distribution as well as a diversity of topics. MagicData-RAMC has 663 speakers in total, including 368 males and 295 females, with 334 from the north and 329 from the south.
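As a quick check on the figures above, the following minimal Python sketch derives a few summary statistics from the numbers stated in this release. The function name `summarize` and the dictionary keys are illustrative choices, and the 180-hour total is assumed to be a rounded figure, so the average conversation length is only approximate.

```python
# Figures quoted in the release; variable and key names are illustrative only.
TOTAL_HOURS = 180          # assumed to be a rounded total
CONVERSATIONS = 351
GENDER = {"male": 368, "female": 295}
REGION = {"north": 334, "south": 329}


def summarize() -> dict:
    """Derive simple summary statistics from the published counts."""
    total_speakers = sum(GENDER.values())
    # Both breakdowns should describe the same set of speakers.
    assert total_speakers == sum(REGION.values())
    return {
        "total_speakers": total_speakers,
        "avg_minutes_per_conversation": TOTAL_HOURS * 60 / CONVERSATIONS,
        "male_share_pct": 100 * GENDER["male"] / total_speakers,
        "north_share_pct": 100 * REGION["north"] / total_speakers,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions, each conversation averages roughly half an hour, and both the gender and regional splits stay within a few percentage points of an even division.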
The annotation for each conversation includes the transcribed text, voice activity timestamps, speaker information, recording information, and topic information. The speaker information covers gender, age, and geography, and the recording information covers environment and device.
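The annotation fields described above could be modeled roughly as follows. This is a minimal sketch only: the class names, field names, and units are hypothetical and do not reflect the dataset's actual file format, which should be checked against the downloaded files.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Speaker:
    # Hypothetical names for the speaker information: gender, age, geography.
    speaker_id: str
    gender: str
    age: int
    geography: str


@dataclass
class Segment:
    # One voice-activity span (assumed to be in seconds) with its transcript.
    start_sec: float
    end_sec: float
    speaker_id: str
    text: str


@dataclass
class Conversation:
    conversation_id: str
    topic: str          # topic information
    environment: str    # recording information
    device: str         # recording information
    speakers: List[Speaker] = field(default_factory=list)
    segments: List[Segment] = field(default_factory=list)

    def speech_seconds(self, speaker_id: str) -> float:
        """Total voiced time attributed to one speaker."""
        return sum(
            seg.end_sec - seg.start_sec
            for seg in self.segments
            if seg.speaker_id == speaker_id
        )
```

With a structure like this, per-speaker talk time or per-topic counts can be computed directly from the timestamps and speaker labels; all values used with it here would be placeholders rather than real dataset entries.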
MagicData-RAMC is currently available for download at
https://magichub.com/datasets/magicdata-ramc
Research Based on MagicData-RAMC
Magic Data, together with the Institute of Acoustics, Chinese Academy of Sciences, Shanghai Jiao Tong University and Northwestern Polytechnical University, has completed research on speech recognition, speaker diarization and keyword search based on MagicData-RAMC. The resulting paper has been submitted to Interspeech 2022, a leading conference in the field of speech.
The preprint is available on arXiv at
https://arxiv.org/abs/2203.16844
Challenge and Baseline
Together with the Institute of Acoustics, Chinese Academy of Sciences and Jiangsu Normal University, Magic Data held the Magic Data ASR-SD Challenge from July to October 2021 to evaluate MagicData-RAMC.
The baseline and more information can be found at
https://github.com/MagicHub-io/Magic-Data-ASR-SD-Challenge
A follow-up challenge evaluating MagicData-RAMC will be launched soon. Stay updated through the MagicHub open-source community.
About MagicHub
The MagicHub community is an open-source data platform developed by Magic Data, dedicated to assisting AI developers with model training and promoting the development of the open-source ecosystem.
For more information, contact open@magicdatatech.com
View original content: https://www.prnewswire.com/news-releases/open-source-magicdata-ramc-180-hour-conversational-speech-dataset-in-mandarin-released-301525875.html