On a recent visit to a museum in Mexico, Tuochao Chen, a doctoral student at the University of Washington, ran into a problem familiar to many travelers: the struggle to understand and communicate in a foreign language. He tried a translation app on his phone, but the museum’s ambient noise garbled the transcription of the tour guide’s speech, rendering the translated text useless.
Various technologies promising seamless translation have emerged in recent years, but none has effectively handled multiple speakers talking at once in public spaces. Meta’s new glasses, for example, translate only an isolated speaker, playing back an automated voice translation after the speaker has finished talking.
To tackle this problem, Chen and a team of UW researchers developed a headphone system called Spatial Speech Translation. It translates the speech of multiple speakers simultaneously while preserving the direction and distinctive qualities of each speaker’s voice. Using off-the-shelf noise-canceling headphones fitted with microphones, the team’s algorithms separate the speakers in a space, track them as they move, translate their speech, and play it back with a delay of 2 to 4 seconds.
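The article does not include the team’s code, but the final playback step can be illustrated with standard audio processing. The Python sketch below shows one simple way to re-render a translated mono voice so it seems to arrive from the original speaker’s direction, using a basic interaural time/level difference model rather than whatever rendering the UW system actually uses. The names `separate`, `translate`, and `locate`, along with the sample rate and head-model constants, are hypothetical placeholders, not details from the research.

```python
# Illustrative sketch only -- not the UW team's implementation.
# Re-renders a translated mono voice in stereo so it appears to come
# from the original speaker's direction, via a crude interaural
# time difference (ITD) and level difference (ILD) model.

import numpy as np

SAMPLE_RATE = 16_000      # Hz, assumed
HEAD_RADIUS = 0.0875      # meters, rough average head radius
SPEED_OF_SOUND = 343.0    # m/s

def spatialize(mono: np.ndarray, azimuth_deg: float) -> np.ndarray:
    """Render a mono signal as stereo at the given azimuth.

    Positive azimuth = source to the listener's right. Uses a simple
    ITD-plus-ILD model, not a measured head-related transfer function.
    """
    az = np.radians(azimuth_deg)
    # Interaural time difference (Woodworth approximation).
    itd = HEAD_RADIUS / SPEED_OF_SOUND * (abs(az) + abs(np.sin(az)))
    lag = int(round(itd * SAMPLE_RATE))
    # Interaural level difference: attenuate the far ear by up to ~6 dB.
    far_gain = 10 ** (-6.0 * abs(np.sin(az)) / 20)
    near = mono
    far = np.concatenate([np.zeros(lag), mono])[: len(mono)] * far_gain
    left, right = (far, near) if azimuth_deg > 0 else (near, far)
    return np.stack([left, right], axis=1)

def pipeline(frames, separate, translate, locate):
    """Illustrative flow: separate speakers, translate each stream,
    then re-render the synthesized voice from its source direction,
    trailing the live speech by a few seconds."""
    for speaker_audio in separate(frames):        # one stream per speaker
        translated = translate(speaker_audio)     # speech-to-speech model
        yield spatialize(translated, locate(speaker_audio))
```

The key design point this sketch captures is that translation and spatial rendering are decoupled: the translated voice is synthesized first, then placed back at the angle where the speaker was heard.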
The team presented the research at the ACM CHI Conference on Human Factors in Computing Systems in Yokohama, Japan. Unlike traditional translation technologies, which assume only one person is speaking, the new system preserves both the sound of each speaker’s voice and the direction it comes from.
The system involves three key innovations. First, when switched on, it detects how many people are speaking in the space, like radar scanning 360 degrees around the wearer. Second, it translates each speaker’s speech while preserving the expressive qualities and volume of their voice, running on devices with Apple M2 chips, such as laptops. Third, it tracks the direction and vocal characteristics of each speaker as they move, so the translation stays attached to the right person.
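The researchers do not spell out their detection algorithm, but the direction-finding part of that “radar-like” scan can be sketched with textbook signal processing. The snippet below illustrates one common approach: estimating a talker’s direction from the arrival-time difference between the headphone’s left- and right-ear microphones. It is not the UW method, and `SAMPLE_RATE` and `MIC_SPACING` are assumed values.

```python
# Illustrative sketch only -- one standard way to estimate a talker's
# direction from the delay between two ear-mounted microphones.

import numpy as np

SAMPLE_RATE = 16_000   # Hz, assumed
MIC_SPACING = 0.18     # meters between the two earcup mics, assumed

def estimate_azimuth(left: np.ndarray, right: np.ndarray) -> float:
    """Return the apparent source azimuth in degrees.

    Cross-correlates the two channels to find the lag at which they
    align best, then converts that lag to an angle using the far-field
    approximation sin(theta) = c * delay / mic_spacing.
    """
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)   # samples; sign gives the side
    delay = lag / SAMPLE_RATE                  # seconds
    sin_theta = np.clip(343.0 * delay / MIC_SPACING, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))
```

Counting the speakers in a room could then amount to clustering such angle estimates over time, which is one plausible reading of the 360-degree scan the researchers describe.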
In tests across various indoor and outdoor settings, the system proved effective and reliable. Users preferred a 3-to-4-second translation delay, since shorter delays produced more errors. The system currently handles common languages such as Spanish, German, and French, and it could be trained to translate many more in the future.
Overall, the Spatial Speech Translation system represents a significant step toward breaking down language barriers in everyday settings. With this technology, travelers like Chen could navigate foreign environments with ease, understanding and interacting with people who speak different languages. The team’s research opens up new possibilities for inclusive communication and cross-cultural interaction.