Hello everyone,
I am trying to implement Acoustic Echo Cancellation (AEC) in a Unity project using the Azure Speech SDK and the Microsoft Audio Stack (MAS), but I cannot get it to work correctly. The speech recognizer continues to pick up and transcribe audio from my speakers.
Goal: My goal is to configure the Speech SDK to perform speech recognition on a microphone input while simultaneously ignoring audio being played out of the system's speakers. Essentially, if someone is speaking through the speakers, the speech recognizer should not transcribe that audio, but it should still be able to recognize and transcribe a user speaking into the microphone.
Setup:
- Engine: Unity 2022.3.x
- SDK: Azure Speech SDK for C#
- Feature: Azure Speech SDK with Microsoft Audio Stack (MAS), enabled via AudioProcessingOptions
- Hardware:
- Output: A 5.1 speaker system
- Input: A standard microphone placed in front of the user.
Problem Details: I followed the documentation (https://learn.microsoft.com/en-us/azure/ai-services/speech-service/audio-processing-speech-sdk?tabs=csharp) to enable the Microsoft Audio Stack. My expectation was that MAS would use the speaker output as a reference signal to cancel it out from the microphone's input, thereby only recognizing the user's speech.
However, when voice is being played over the speakers, the SpeechRecognizer transcribes everything the speaker says. This indicates that the AEC is not functioning.
My hypothesis is that MAS does not have the correct reference audio signal for the echo cancellation. The documentation mentions that MAS uses the "last channel of the input device" as the reference channel, but I'm unsure how to configure my system to correctly route the speaker's voice to this channel or how to verify if this is the root cause.
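If the default microphone input simply never carries a loopback channel, my fallback idea is to build the input stream myself and put the speaker signal in the last channel, so that SpeakerReferenceChannel.LastChannel actually has a reference to cancel. Below is an untested sketch of what I have in mind; the 16 kHz / 16-bit stereo format, the Mono geometry preset, and the FromStreamInput overload that takes AudioProcessingOptions are my reading of the docs, and capturing/resampling the microphone and loopback audio is left out entirely.

using Microsoft.CognitiveServices.Speech.Audio;

public class MasReferenceChannelSketch
{
    private readonly PushAudioInputStream pushStream;
    public readonly AudioConfig AudioConfig;

    public MasReferenceChannelSketch()
    {
        // 2 channels: channel 0 = microphone, channel 1 (last) = speaker loopback reference.
        // 16 kHz / 16-bit is an assumption on my side, not something the docs mandate for MAS.
        var format = AudioStreamFormat.GetWaveFormatPCM(16000, 16, 2);
        pushStream = AudioInputStream.CreatePushStream(format);

        var options = AudioProcessingOptions.Create(
            AudioProcessingConstants.AUDIO_INPUT_PROCESSING_ENABLE_DEFAULT,
            PresetMicrophoneArrayGeometry.Mono,      // single physical microphone (my assumption)
            SpeakerReferenceChannel.LastChannel);    // last stream channel is the echo reference

        AudioConfig = AudioConfig.FromStreamInput(pushStream, options);
    }

    // Interleaves one block of mic samples and loopback samples (same length,
    // each already 16 kHz / 16-bit mono) and pushes it to the SDK.
    public void PushFrame(short[] micSamples, short[] loopbackSamples)
    {
        var interleaved = new byte[micSamples.Length * 2 * sizeof(short)];
        for (int i = 0; i < micSamples.Length; i++)
        {
            System.BitConverter.GetBytes(micSamples[i]).CopyTo(interleaved, i * 4);
            System.BitConverter.GetBytes(loopbackSamples[i]).CopyTo(interleaved, i * 4 + 2);
        }
        pushStream.Write(interleaved);
    }

    public void Close() => pushStream.Close();
}

I have not verified that this is the intended way to supply an external reference signal, so corrections are welcome.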
What I've Tried:
- Minimal Sample Project: I created a minimal Unity project to isolate the issue from our main, more complex project.
- Dependencies: I set up the dependencies by manually including the DLLs for Microsoft.CognitiveServices.Speech and Microsoft.CognitiveServices.Speech.Extension.MAS, and by using NuGetForUnity for the Azure.Core dependency.
- Code Implementation: I am using a SpeechRecognizer with AudioConfig.FromDefaultMicrophoneInput(audioProcessingOptions) to set up MAS.
Code
The complete Unity project is too big to paste here, but this is the only script with any functionality:
using UnityEngine;
using UnityEngine.UI;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using TMPro;
#if PLATFORM_ANDROID
using UnityEngine.Android;
#endif

public class AzureSpeechRecognizer : MonoBehaviour
{
    private string subscriptionKey = "YOUR_SUBSCRIPTION_KEY_HERE"; // Replace with your Azure Speech subscription key
    private string region = "YOUR_REGION";

    // Public fields for the Unity Inspector
    [Header("UI Elements")]
    [Tooltip("The button to start speech recognition.")]
    public Button startRecognitionButton;
    [Tooltip("The UI Text element to display the recognized text.")]
    public TextMeshProUGUI outputText;
    public AudioSource speaker;

    // Internal objects for speech recognition
    private SpeechRecognizer recognizer;
    private SpeechConfig speechConfig;
    private AudioConfig audioConfig;
    private bool isRecognizing = false;
    private object threadLocker = new object();
    private string message;

    void Start()
    {
        // --- Initialization ---
        if (outputText == null)
        {
            Debug.LogError("Output Text field is not assigned in the inspector.");
            return;
        }
        if (startRecognitionButton == null)
        {
            Debug.LogError("Start Recognition Button is not assigned in the inspector.");
            return;
        }

        // Add a listener to the button to call the StartRecognition method when clicked
        startRecognitionButton.onClick.AddListener(StartRecognition);

        // --- Permission Handling for Android ---
#if PLATFORM_ANDROID
        if (!Permission.HasUserAuthorizedPermission(Permission.Microphone))
        {
            Permission.RequestUserPermission(Permission.Microphone);
        }
#endif

        // --- Speech SDK Configuration ---
        // Creates an instance of a speech config with specified subscription key and service region.
        speechConfig = SpeechConfig.FromSubscription(subscriptionKey, region);

        speaker.PlayDelayed(5);
    }

    /// <summary>
    /// Called when the start recognition button is clicked.
    /// </summary>
    public async void StartRecognition()
    {
        if (isRecognizing)
        {
            // If already recognizing, stop the recognition
            await recognizer.StopContinuousRecognitionAsync();
            isRecognizing = false;
            UpdateUI("Recognition stopped.");
            return;
        }

        // --- Audio Configuration ---
        // Creates an audio configuration that will use the default microphone.
        AudioProcessingOptions audioProcessingOptions = AudioProcessingOptions.Create(
            AudioProcessingConstants.AUDIO_INPUT_PROCESSING_ENABLE_DEFAULT,
            PresetMicrophoneArrayGeometry.Linear2,
            SpeakerReferenceChannel.LastChannel);

        foreach (var device in Microphone.devices)
        {
            Debug.Log("Name " + device);
        }

        audioConfig = AudioConfig.FromDefaultMicrophoneInput(audioProcessingOptions);

        // --- Speech Recognizer Creation ---
        // Creates a speech recognizer from the speech and audio configurations.
        recognizer = new SpeechRecognizer(speechConfig, audioConfig);

        // --- Event Subscriptions ---
        // Subscribes to events.
        recognizer.Recognizing += (s, e) =>
        {
            lock (threadLocker)
            {
                message = $"RECOGNIZING: Text={e.Result.Text}";
            }
        };
        recognizer.Recognized += (s, e) =>
        {
            if (e.Result.Reason == ResultReason.RecognizedSpeech)
            {
                lock (threadLocker)
                {
                    message = $"RECOGNIZED: Text={e.Result.Text}";
                }
            }
            else if (e.Result.Reason == ResultReason.NoMatch)
            {
                lock (threadLocker)
                {
                    message = "NOMATCH: Speech could not be recognized.";
                }
            }
        };
        recognizer.Canceled += (s, e) =>
        {
            lock (threadLocker)
            {
                message = $"CANCELED: Reason={e.Reason}";
            }
            if (e.Reason == CancellationReason.Error)
            {
                Debug.LogError($"CANCELED: ErrorDetails={e.ErrorDetails}");
                Debug.LogError("CANCELED: Did you set the speech resource key and region values?");
            }
        };
        recognizer.SessionStarted += (s, e) =>
        {
            Debug.Log("Session started event.");
        };
        recognizer.SessionStopped += (s, e) =>
        {
            Debug.Log("Session stopped event.");
            isRecognizing = false;
        };
        recognizer.SpeechStartDetected += (s, e) =>
        {
            Debug.Log("Speech Started");
        };
        recognizer.SpeechEndDetected += (s, e) =>
        {
            Debug.Log("Speech Ended");
        };

        // --- Start Recognition ---
        // Starts continuous recognition.
        // Uses StopContinuousRecognitionAsync() to stop recognition.
        await recognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);
        isRecognizing = true;
        UpdateUI("Say something...");
    }

    void Update()
    {
        lock (threadLocker)
        {
            if (outputText != null)
            {
                outputText.text = message;
            }
        }
    }

    /// <summary>
    /// Updates the UI text on the main thread.
    /// </summary>
    /// <param name="text">The text to display.</param>
    private void UpdateUI(string text)
    {
        lock (threadLocker)
        {
            message = text;
        }
    }

    void OnDestroy()
    {
        // --- Cleanup ---
        if (recognizer != null)
        {
            recognizer.Dispose();
        }
    }
}
There is an AudioSource (speaker) in the scene that plays a voice clip over the speakers, a Button (startRecognitionButton) that starts the recognition, and a TextMeshPro text (outputText) that shows the transcribed result.
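One thing I still plan to do is confirm that the MAS extension is actually being loaded inside the Unity player at all. My idea (not yet wired into the script above) is to turn on Speech SDK file logging before creating the recognizer and check the log for the MAS extension; the helper below is just my own sketch, only PropertyId.Speech_LogFilename comes from the SDK.

using UnityEngine;
using Microsoft.CognitiveServices.Speech;

public static class SpeechSdkLogging
{
    // Writes the Speech SDK's native log to a file so I can check whether the
    // Microsoft.CognitiveServices.Speech.Extension.MAS extension is loaded and
    // what audio processing it reports. Call this before creating the recognizer.
    public static void Enable(SpeechConfig config)
    {
        string logPath = System.IO.Path.Combine(Application.persistentDataPath, "speech-sdk.log");
        config.SetProperty(PropertyId.Speech_LogFilename, logPath);
        Debug.Log("Speech SDK log file: " + logPath);
    }
}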
Question: Has anyone successfully implemented AEC with the Azure Speech SDK and MAS in Unity? Is there a specific configuration required for the audio output or the system's microphone channels to provide the correct reference signal for echo cancellation?
Any guidance or examples would be greatly appreciated. Thank you!