THE CHALLENGES OF COLLECTING HIGH-QUALITY SPANISH VOICE DATA
November 4, 2024
Computational linguists Emma O’Neill and Claire O’Neill are heading to Miami this week to represent SoapBox Labs and our parent company, Curriculum Associates (CA), at New Ways of Analyzing Variation (NWAV 52), the global academic conference focusing on socially driven language research.
Building a Spanish-language voice engine
Since 2013, our voice engine has been designed and built to understand children’s voices accurately and equitably. We are now expanding our capabilities to design and deliver a Spanish-language voice engine. This new engine will enable young Spanish-speaking learners to actively practice and develop critical literacy skills through voice-enabled tools, accelerating their path toward reading proficiency.
Why is it a Priority?
Spanish is the second most spoken language in the US, with over 4 million school-age speakers. Strengthening students’ reading skills in their primary language creates a stronger foundation for reading in a second language. By developing Spanish voice-enabled learning products, we directly support the needs of students in dual-immersion and bilingual programs who are learning to read in Spanish.
Two ethical data collection considerations
Privacy
Our privacy-by-design approach means we always collect children’s voice data responsibly, putting privacy and child safety first. This proactive approach ensures that children’s data privacy rights are protected from the earliest design stages through to the end-user experience. Any voice data collected is:
- Anonymized and de-identified
- Encrypted in our engine
- Retained only to improve the accuracy of our voice engine
- Externally audited by PRIVO to comply with all regulations and AI-related guidance, including COPPA in the US and GDPR in Europe.
Accuracy
As pioneers in building a child-specific voice engine, we have championed equity from the start. We believe that for a voice-enabled learning product to work, it must work equally well for every child, which means it must:
- Understand young children’s voices accurately
- Understand the diversity in language and speaking styles across Spanish-speaking communities in the US.
Continuously testing and refining our voice engine across accents, dialects and children’s different speaking styles ensures that it works accurately and honors every child’s voice equally.
The Challenges of Gathering Voice Data
At NWAV 52, Emma and Claire will use their poster session to address the challenges of gathering the accurate, representative voice data needed for robust testing, and to share insights on how they and their Dublin-based team have worked on:
- Capturing Spanish language diversity
- Designing engaging voice prompts for young children
- Designing tasks that mirror a classroom setting
Meet Us in Person!
Want to learn more? Connect with Emma and Claire at any time during the conference, or come check out their poster session on Day 2 – Fri, Nov 8th – from 2:20 to 3:30 p.m.
And watch this space for insights from NWAV 52!