Mozilla's voice data crowdsourcing project Common Voice launches in Simplified Chinese Mandarin

2019-05-09 10:15
  • Mozilla is now supporting voice data collection in Simplified Chinese Mandarin to build a publicly available voice dataset for everyone to use
  • Voice data collection is under way in 27 languages so far, with 72 more in progress
  • Covering 18 languages and almost 1,400 hours of recorded voice data from more than 42,000 contributors, the latest Common Voice data release is the largest public domain transcribed voice dataset to date.

TAIPEI, May 8, 2019 /PRNewswire/ -- Mozilla, the non-profit organization behind the open source Firefox browser, is excited to announce that Common Voice, its initiative to crowdsource a large dataset of human voices for use in speech technology, has launched in Simplified Chinese Mandarin. Thanks to Mozilla's communities and our deeply engaged language partners, people can now donate their voice at https://voice.mozilla.org/zh-CN.

Speech interfaces are the next frontier for the Internet. In-car assistants, smart watches, lightbulbs, bicycles and thermostats - the number of speech-enabled devices is increasing daily. However, there are barriers to global innovation: startups, researchers and anyone else who wants to build voice-enabled technologies need large amounts of high-quality, transcribed voice data on which to train machine learning algorithms. But publicly available datasets are limited, both in quantity and in language representation, and the cost of proprietary voice data -- owned by only a handful of companies -- is enormous.

Launched in June 2017, Mozilla's project Common Voice aims to change the current market dynamics by building a global corpus of open voice data that can power the voice interfaces of the future. Mozilla believes these interfaces shouldn't be controlled by a few companies as gatekeepers to voice-enabled services, and Mozilla wants users to be understood consistently, in their own languages and accents.

Voice data collection in 27 languages so far, including Simplified Chinese Mandarin

Since Mozilla enabled multi-language support in June 2018, Common Voice has grown to be more global and more inclusive. Over the last 10 months, volunteer communities have enthusiastically rallied around the project, launching data collection efforts in 27 languages, with 72 more currently in progress on the Common Voice website.

Our latest addition is Simplified Chinese Mandarin. Speakers from around the world can now donate their voice or validate samples from others at https://voice.mozilla.org/zh-CN.

Mozilla’s voice data crowdsourcing project Common Voice launches in Simplified Chinese Mandarin.

Voice contributors also have the option to create a saved profile, which allows them to keep track of their progress. Providing some optional demographic information also makes the audio data more useful for training accurate speech recognition.

As with all Common Voice languages, our goal for Simplified Chinese Mandarin is to capture about 10,000 validated hours of audio. This is approximately the amount required to train a production speech recognition system. And the good thing is that literally everyone can help reach this goal and make voice recognition better: on the commute to work, on the bus, during lunch time, at home or together with friends and family, either via voice.mozilla.org or the iOS app. All you need is your phone or your computer.

George Roter, Director Open Innovation Programs at Mozilla, said: "You may just record or listen for a few seconds - but imagine if hundreds of thousands of people did this! The more people help, the faster this dataset becomes valuable for everyone."

Multi-language dataset release

In keeping with its promise, Mozilla will continue to make the collected voice data available for use. In February this year, Mozilla shared its first multi-language dataset, with 18 languages represented, including English, French, German and Traditional Chinese Mandarin, as well as, for example, Welsh and Kabyle. Altogether, the new dataset includes approximately 1,400 hours of voice clips from more than 42,000 people.

With this release, the continuously growing Common Voice dataset is now the largest of its kind, with tens of thousands of people contributing their voices and original written sentences to the public domain (CC0). The full dataset can be downloaded from the Common Voice website.

George Roter added: "Mozilla aims to contribute to a more diverse and innovative voice technology ecosystem. Our goal is to both release voice-enabled products ourselves, while also supporting researchers and smaller players. We are thrilled to see the growing support we are getting to build the world's largest public multi-language voice dataset and we are grateful to all the volunteers who made the launch in simplified Mandarin Chinese possible."

Photo - https://photos.prnasia.com/prnh/20190508/2460230-1-a
Photo - https://photos.prnasia.com/prnh/20190508/2460230-1-b  

Source: Mozilla