Meta and a group of researchers at the University of Texas at Austin (UT Austin) are working to bring realistic audio to the metaverse.
As Kristen Garuman, research director at Meta AI, explains, there’s more to augmented and virtual reality (AR and VR, respectively) than just visuals. Audio plays a major role in making a world feel alive. Garuman says that “audio is shaped by the environment that [it’s] inside.” Several factors influence how sound behaves, such as the geometry of a room, what’s in the room, and how far you are from the source.
To accomplish this, Meta plans to use AR glasses to record audio and video from a location, then use a set of three AI models to transform and clean the recording so that, when you play it back at home, it sounds like it’s happening right in front of you. The AI takes into account the room you’re in so the audio can match the environment.
Judging by the designs, Meta is focusing on AR glasses. Its plan for VR headsets includes replicating the sights and sounds of an environment, such as a concert, so it feels like you’re there in person.
We asked Meta how people will hear the enhanced audio: will they need a pair of headphones, or will it come from the headset? We got no response.
We also asked Meta how developers can get hold of these AI models. The models have been made open source so that third-party developers can work with the technology, but Meta offered no further details.
Transformed by AI
The question is how Meta can record audio on a pair of AR glasses and have it reflect a new setting.
The first model is AViTAR, a “Visual Acoustic Matching” model. This is the AI that transforms audio to match a new environment. Meta offers the example of a mother recording her son’s dance recital in an auditorium with a pair of AR glasses.
One of the researchers claims that the mother can take this recording and play it back at home, where the AI will transform the audio. It will sweep the room, take into account any obstacles, and make the recital sound like it’s happening right in front of her, with the same glasses on. The researcher claims the audio will come from the glasses themselves.
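At its core, acoustic matching means imprinting a target room’s reflections onto a recording. The toy sketch below is not AViTAR (which learns the room’s acoustics from video); it simply assumes the target room’s impulse response is already known and applies it to a dry signal by convolution, which is the standard way to simulate a room’s sound:

```python
# Toy acoustic-matching sketch: convolve a dry signal with a room's
# impulse response. NOT Meta's AViTAR, which infers the response from
# visuals; here the impulse response is assumed known.

def convolve(signal, impulse_response):
    """Discrete convolution: applies a room's acoustics to a dry signal."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

# A dry click, and a hypothetical living-room response: the direct
# sound plus two weaker reflections arriving later.
dry = [1.0, 0.0, 0.0, 0.0]
living_room_ir = [1.0, 0.0, 0.4, 0.0, 0.15]

wet = convolve(dry, living_room_ir)
# The click now carries the room's echoes.
```

A real system would also have to estimate that impulse response from what the glasses see, which is the hard part AViTAR tackles.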
To help clean up the audio, there is Visually-Informed Dereverberation. In short, it removes the reverb from a clip. The example given is recording a violin concert at a train station, taking it home, and having the AI clean the clip so you hear nothing but the music.
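Dereverberation amounts to inverting the echoes a space adds to a signal. The sketch below is a deliberately simple stand-in for Meta’s model (which estimates the reverb from video): it models reverb as a single feedback echo with a known delay and gain, then applies the exact inverse filter to recover the clean signal:

```python
# Toy dereverberation sketch. The reverb is modeled as one feedback
# echo, y[n] = x[n] + gain * y[n - delay], with parameters assumed
# known; the real visually-informed model estimates them from video.

def add_echo(x, delay, gain):
    """Simulate a reverberant space with a single feedback echo."""
    y = list(x)
    for n in range(delay, len(y)):
        y[n] += gain * y[n - delay]
    return y

def remove_echo(y, delay, gain):
    """Exact inverse of add_echo: x[n] = y[n] - gain * y[n - delay]."""
    x = list(y)
    for n in range(delay, len(x)):
        x[n] -= gain * y[n - delay]
    return x

clean = [1.0, 0.5, 0.0, 0.0, 0.0, 0.0]
reverberant = add_echo(clean, delay=2, gain=0.6)
recovered = remove_echo(reverberant, delay=2, gain=0.6)
# recovered matches the original clean signal
```

Real rooms produce dense, overlapping reflections rather than one echo, which is why learned models are needed in practice.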
The latest model is VisualVoice, which uses a combination of visual and audio cues to separate voices from other noise. Imagine recording a video of two people arguing. This AI will isolate one voice so you can understand it while silencing everything else. Meta explains that visual cues are important because the AI needs to see who is speaking in order to pick up certain nuances.
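The idea of using vision to guide separation can be illustrated with a very reduced sketch. This is not VisualVoice (which learns masks from lip motion and facial appearance); here a per-frame visual cue, whether the target speaker’s mouth is moving, is simply assumed given and used to mask the mixed audio:

```python
# Toy visually guided separation: keep the frames where the video shows
# the target speaker talking, silence the rest. The mouth_moving flags
# are a hypothetical visual cue, not output of any real detector.

def separate_with_visual_cue(mixture_frames, mouth_moving):
    """Mask a mixed signal using a per-frame 'is the target talking' cue."""
    return [frame if moving else 0.0
            for frame, moving in zip(mixture_frames, mouth_moving)]

mixture = [0.2, 0.9, 0.8, 0.1, 0.7]                # mixed audio, one value per frame
target_talking = [False, True, True, False, True]  # derived from the video stream

isolated = separate_with_visual_cue(mixture, target_talking)
# → [0.0, 0.9, 0.8, 0.0, 0.7]
```

A real separator works on time-frequency bins and soft masks rather than whole frames, but the role of the visual cue is the same: it tells the model which parts of the mixture belong to the person you are watching.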
Regarding visuals, Meta says it plans to bring in video and other cues to further enhance the AI-driven audio. Since this technology is still in early development, it’s unknown if or when Meta will bring these AIs to a Quest headset near you.
Be sure to read our latest review of the Oculus Quest 2 if you’re thinking of buying one. Spoiler alert: we like it.