Azure Speech to Text cannot process spelled-out words

Question

Tim 0

When using real-time speech to text, if the speaker spells out a word or name, the result contains the whole word as if it had been spoken normally rather than spelled. For example, the audio says "My name is John. J-O-H-N", but the result I get is "My name is John. John". My setup is quite basic:

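        import {
          AudioConfig, AudioFormatTag, AudioInputStream, AudioStreamFormat,
          ResultReason, SpeechConfig, SpeechRecognizer,
        } from "microsoft-cognitiveservices-speech-sdk";

        // Credentials and recognition language.
        const speechConfig = SpeechConfig.fromSubscription(
          process.env.AZURE_SPEECH_KEY ?? "",
          process.env.AZURE_SPEECH_REGION ?? ""
        );
        speechConfig.speechRecognitionLanguage = "en-US";

        // Push stream that receives 8 kHz mono mu-law audio.
        const pushStream = AudioInputStream.createPushStream(
          AudioStreamFormat.getWaveFormat(8000, 16, 1, AudioFormatTag.MuLaw)
        );
        const audioConfig = AudioConfig.fromStreamInput(pushStream);

        const speechRecognizer = new SpeechRecognizer(speechConfig, audioConfig);

        // Log each final (non-empty) recognition result.
        speechRecognizer.recognized = async (s, e) => {
          if (e.result.reason === ResultReason.RecognizedSpeech) {
            if (!e.result.text) return;
            console.log(e.result.text);
          }
        };
        speechRecognizer.startContinuousRecognitionAsync();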

1 answer

Answer 1

Hello Tim,

What you're seeing, where spelled-out letters such as "J-O-H-N" are transcribed as the whole word "John", is a common limitation. The Azure Speech service is designed to interpret natural, conversational speech, so it often converts a sequence of spelled-out letters into the word it considers most likely.

Currently, there is no direct configuration or option in the basic setup to force the service to transcribe each letter individually when words are spelled out. This behavior is intentional to enhance the user experience for most typical speech recognition cases.

Here are a few suggestions to address or work around this limitation:

You may want to explore using the Custom Speech feature in Azure, which allows you to train models for more specialized vocabulary or behaviors. However, even with custom models, letter-by-letter spelling may not always be perfectly recognized.

Consider providing context or cues in your audio (such as saying “that’s J-O-H-N, spelled J-O-H-N”) to help the model treat the input as individual letters, though results may still vary.

If your application requires strict letter-by-letter recognition (for example, for name spelling or codes), you might need to add logic in your code that post-processes the transcript and checks for sequences that look like spelled-out letters, possibly using regex or NLP techniques.

If this functionality is crucial for your use case, I recommend submitting feedback to Microsoft via the Azure portal or their user voice forums. They continue to improve their models and may add this feature if there’s demand.
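
As a rough illustration of the post-processing idea above, here is a minimal TypeScript sketch. The function name findSpelledWords and the rule of "three or more standalone letters separated by spaces, hyphens, or periods" are my own assumptions, not part of the Speech SDK. Note that it can only help when the transcript you receive still contains the individual letters; it cannot recover letters the service has already merged into a word.

```typescript
// Hypothetical helper, not part of the Azure Speech SDK.
// Matches runs of three or more standalone letters separated by
// spaces, hyphens, or periods, e.g. "J-O-H-N", "j o h n", "J. O. H. N".
const SPELLED_RUN = /\b[A-Za-z]\b(?:[ .\-]+\b[A-Za-z]\b){2,}/g;

// Returns every spelled-out run found in the transcript,
// joined into a single upper-case word.
function findSpelledWords(transcript: string): string[] {
  const runs = transcript.match(SPELLED_RUN) ?? [];
  return runs.map((run) => run.replace(/[^A-Za-z]/g, "").toUpperCase());
}

console.log(findSpelledWords("My name is John. J-O-H-N")); // [ 'JOHN' ]
console.log(findSpelledWords("My name is John. John"));    // []
```

You could call this on e.result.text inside the recognized handler and treat any returned value as the spelled form. The three-letter minimum is only there to reduce false positives from ordinary single-letter words such as "I" or "a"; adjust it to your data.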

Best Regards,

Jerald Felix
