THE CHALLENGES OF COLLECTING HIGH QUALITY SPANISH VOICE DATA

November 4, 2024

Contents

Building a Spanish language voice engine

Two ethical data collection considerations

The Challenges of Gathering Voice Data

Meet Us in Person!

Computational linguists Emma O’Neill and Claire O’Neill are heading to Miami this week to represent SoapBox Labs and our parent company, Curriculum Associates (CA), at New Ways of Analyzing Variation (NWAV 52), the global academic conference focusing on socially driven language research.

Building a Spanish language voice engine

Since 2013, our voice engine has been designed and built to understand children’s voices accurately and equitably. We are now expanding our capabilities to design and deliver a Spanish-language voice engine. This new voice engine will enable young Spanish-speaking learners to actively practice and develop critical literacy skills through voice-enabled tools, accelerating their path toward reading proficiency.

Why is it a Priority?

Spanish is the second most spoken language in the US, with over 4 million school-age speakers. Strengthening students’ reading skills in their primary language creates a stronger foundation for reading in a second language. By developing Spanish voice-enabled learning products, we directly support the needs of students in dual-immersion and bilingual programs who are learning to read in Spanish.

Two ethical data collection considerations

Privacy

Our privacy-by-design approach means we always collect children’s voice data responsibly, putting privacy and child safety first. This proactive approach ensures that children’s data privacy rights are protected from the earliest design stages through to the end-user experience. Any voice data collected is:

Anonymized and de-identified

Encrypted in our engine

Retained only to improve the accuracy of our voice engine

Externally audited by PRIVO to comply with all regulations and AI-related guidance, including COPPA in the US and GDPR in Europe

Accuracy

As pioneers in building a child-specific voice engine, we have championed equity from the start. We believe that for a voice-enabled learning product to work, it must work equally well for every child, which means it must:

Understand young children’s voices accurately

Understand the diversity in language and speaking styles across Spanish-speaking communities in the US.

By continuously testing and refining our voice engine across accents, dialects, and the different speaking styles of children, we ensure that it works accurately and honors every child’s voice equally.

The Challenges of Gathering Voice Data

At NWAV 52, Emma and Claire will use their poster session to address the challenges of gathering the accurate, representative voice data needed for robust testing. They will also share insights on how they and their Dublin-based team have approached:

Capturing Spanish language diversity

Designing engaging voice prompts for young children

Designing tasks that mirror a classroom setting

Meet Us in Person!

Want to learn more? Connect with Emma and Claire at any time during the conference, or come check out their poster session on Day 2 (Fri, Nov 8th) from 2:20 to 3:30 p.m.

And watch this space for insights from NWAV 52!
