Datasets WG – 12.10.25

Updated Mar 17, 2026 Datasets & Benchmarks

Dec 10, 2025

Invited: Rosina Haberl, John Marconi, Peter Chang, Petrut Bogdan, raashid.ansari@silabs.com, sam@imagimob.com, Tamas Daranyi, Eiman Kanjo, Emil Jørgensen Njor, Eric Smiley, Goran Vuksic, Kiruba Subramani, Seung Hun SHIN/신승훈, ~~Pete Bernard~~, ~~Sherif Eissa~~, ~~suzen@embedl.com~~, ~~Vijay Janapa Reddi~~, ~~Weier Wan~~, ~~EAIF Team Calendar~~

Summary

Adam Fuks, Rosina Haberl, and Tomer Badug decided to postpone the next data set working group meeting from January 7th to January 14th because of CES. Tomer Badug reported that Ceva secured third-party approval to share a wake word data set, estimated at tens of thousands of utterances, and that a draft from Ceva's legal team is needed to remove liability before it can be shared; Eric Smiley confirmed that a team member named John is actively assisting with hosting this data. The data set working group and audio working group plan to collect speech data from Foundation members across different languages, accents, and genders. Petrut Bogdan suggested a crowdsourcing app for recordings, and Adam Fuks proposed using networks to validate the collected audio data.

Details

Wake Word Data Set Sharing: Tomer Badug provided an update on sharing a wake word data set that Ceva purchased from a third party. Tomer Badug explained that Ceva obtained approval from the third party to share the data set despite initial contractual restrictions (00:03:29). Before the data set can be uploaded and shared, Ceva's legal team needs to draft a license or similar document that removes liability. Eric Smiley confirmed that someone on their team (John) is actively working with Tomer Badug on the next steps for hosting it (00:04:50) (00:13:54).
Data Set Size and Contents: Adam Fuks asked Tomer Badug about the size of the wake word data set, noting that a high-quality data set would need tens of thousands of utterances. Tomer Badug confirmed that the Alexa wake word data set should be around that size, in the tens of thousands. Tomer Badug also mentioned work on a separate keyword spotting data set that includes commands like “play” and “pause,” indicating that multiple data sets are under discussion (00:10:42). Tomer Badug acknowledged the legal complexities and delays in obtaining approvals for sharing this data (00:13:01).
Speech Data Collection and Crowdsourcing: The audio working group and data set working group plan to collaborate on collecting speech data from Foundation members to obtain a large amount of data across different languages, accents, and genders (00:04:50). Petrut Bogdan suggested crowdsourcing recordings via a phone app, which could pre-label the data and randomize instructions, rather than limiting collection to live events (00:06:44). Tomer Badug agreed that collection should not be limited to live events, although a booth at an event could boost visibility, and mentioned a suggestion from the audio working group about an existing open-source web app for collecting data (00:08:27).
Data Validation and Augmentation: Adam Fuks proposed using networks instead of humans to validate the collected audio data, checking that the transcript is correct or that the speaker said the requested words. Tomer Badug agreed and suggested running a speech-to-text system and calculating a word error rate (00:09:39). Adam Fuks also emphasized the need to grow the data set, which Tomer Badug said they had previously interpreted as “augmentation” in the training-process sense (00:15:55). The speakers concluded by agreeing to remove the initial personal conversation regarding citizenship from the transcription.
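The word error rate check suggested above could be sketched roughly as follows. This is only an illustration, not something presented in the meeting: the function names (`normalize`, `word_error_rate`, `is_valid_recording`), the text normalization, and the 0.2 acceptance threshold are all assumptions, and the transcript is assumed to come from whichever speech-to-text system the group ends up choosing.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def is_valid_recording(prompt: str, transcript: str, max_wer: float = 0.2) -> bool:
    """Accept a recording if its transcript is close enough to the prompt.

    max_wer is a hypothetical threshold chosen for illustration only.
    """
    return word_error_rate(prompt, transcript) <= max_wer
```

For example, comparing the prompt "play the music" with the transcript "play music" gives one deletion over three reference words, so a WER of 1/3, and the recording would be rejected under the assumed threshold. For single-word prompts such as a wake word, any mismatch yields a WER of 1.0, so the check effectively becomes an exact-match test after normalization.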

Suggested next steps

Tomer Badug will double-check and get back to Adam Fuks with the number of utterances in the Alexa wake word data set.
Tomer Badug will keep Eric Smiley posted on the data set clearance.
Eric Smiley will follow up with John regarding the effort to upload or add the data set.
Adam Fuks and Tomer Badug will coordinate a meeting to discuss options to grow the data set during Adam Fuks’s time in Israel.