How is the Synthesized Characters count calculated for Azure's Text to Speech service when generating from SSML?

Question

ggg 0

How is the Synthesized Characters count calculated for Azure's Text to Speech service when generating speech from an SSML document? What are the specific rules?
I converted the following SSML file into speech:

<!--ID=FCB40C2B-1F9F-4C26-B1A1-CF8E67BE07D1;Version=1|{"Files":{}}-->
<!--ID=5B95B1CC-2C7B-494F-B746-CF22A0E779B7;Version=1|{"Locales":{"de-DE":{"AutoApplyCustomLexiconFiles":[{}]},"en-US":{"AutoApplyCustomLexiconFiles":[{}]}}}-->
<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US">

<voice name="en-US-AvaMultilingualNeural"><prosody rate="-10.00%">about </prosody></voice>
<voice name="en-US-AndrewMultilingualNeural"><prosody rate="-10.00%">about </prosody></voice>
<voice name="en-US-EmmaMultilingualNeural"><prosody rate="-10.00%">about </prosody></voice>
<voice name="en-US-AlloyTurboMultilingualNeural"><prosody rate="-10.00%">about </prosody></voice>
<voice name="en-US-EchoTurboMultilingualNeural"><prosody rate="-10.00%">about </prosody></voice>
<voice name="en-US-FableTurboMultilingualNeural"><prosody rate="-10.00%">about </prosody></voice>
<voice name="en-US-OnyxTurboMultilingualNeural"><prosody rate="-10.00%">about </prosody></voice>
<voice name="en-US-NovaTurboMultilingualNeural"><prosody rate="-10.00%">about </prosody></voice>
<voice name="en-US-ShimmerTurboMultilingualNeural"><prosody rate="-10.00%">about </prosody></voice>
<voice name="en-US-BrianMultilingualNeural"><prosody rate="-10.00%">about </prosody></voice>
<voice name="en-US-JennyNeural"><prosody rate="-10.00%">about </prosody></voice>
<voice name="en-US-DavisNeural"><prosody rate="-10.00%">about </prosody></voice>

</speak>

After checking the billing, why was the above content charged for 480 characters when there are only 5 * 12 = 60 characters?!

Additionally, I am developing a paid text-to-speech feature for users in my application. When a user clicks to generate speech, I want to call an Azure API to pre-calculate the number of characters that will be consumed and estimate the cost of converting the text to speech. This way, the user can see the cost and confirm before proceeding with the actual conversion. How can I achieve this? Is there such an API?

1 answer

Answer 1

Hello ggg,

Thank you for posting your question in the Microsoft Q&A forum.

Azure's Text-to-Speech (TTS) service counts billable characters from the SSML that is submitted, not only from the visible text. Markup inside each <voice> element is counted together with the text, while the <speak> and <voice> tags themselves are excluded, according to the Speech service documentation on billable characters. In the provided example, the visible text comes to only 60 characters (5 characters × 12 voices), but each voice block contains <prosody rate="-10.00%">about </prosody>, which is 40 characters: 24 for the opening <prosody> tag with its rate attribute, 6 for "about" plus the trailing space, and 10 for the closing </prosody> tag. Twelve such blocks give 40 × 12 = 480 characters, which matches the billed amount.
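The 480 figure can be checked with a few lines of arithmetic. The sketch below only measures the <prosody> element that appears inside each <voice> block of the question's SSML and multiplies it by the 12 voices. It assumes that the <speak> and <voice> tags, the leading comments, and the line breaks between voice blocks are not counted, which is consistent with this bill but is not an official calculation.

```python
# Content of one <voice> block in the question's SSML, copied verbatim.
element = '<prosody rate="-10.00%">about </prosody>'

# Python's len() counts Unicode code points, one per character here.
per_voice = len(element)

# The SSML in the question contains 12 <voice> blocks with identical content.
voices = 12
total = per_voice * voices

print(per_voice, total)  # 40 480
```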

To estimate costs before conversion, developers need to parse the SSML themselves and replicate Azure's counting logic, as no dedicated pre-calculation API exists. A workaround is to count characters client-side: leave out the <speak> and <voice> tags, then count everything that remains inside each voice block, including markup such as <prosody>, punctuation, and whitespace. The estimate can be validated by synthesizing a few test inputs and comparing the result with the billed usage, and the per-character rate can be taken from the pricing page or from the Azure Retail Prices API. For user-facing cost estimates, the application can multiply the local count by that rate and show the result before the user confirms. Until Microsoft offers a native solution, a client-side calculation, combined with caching of already generated audio so that the same text is not synthesized twice, remains the most practical approach for cost transparency in paid TTS features.
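A minimal sketch of such a client-side estimate in Python follows. The function names `estimate_billable_chars` and `estimate_cost` are hypothetical, and so is the `price_per_million_chars` parameter, whose value has to come from the pricing page for the voice type in use. The sketch assumes that XML comments are not billed and that only the content inside <voice> elements counts; it does not handle any language-specific counting rules, so its output should be compared with real billing data before it is shown to users.

```python
import re

# Matches the content of each <voice ...>...</voice> element (non-greedy).
VOICE_BODY = re.compile(r"<voice\b[^>]*>(.*?)</voice>", re.DOTALL)
# Matches XML comments; treating them as unbilled is an assumption.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def estimate_billable_chars(ssml: str) -> int:
    """Estimate billable characters: everything inside <voice> elements,
    including nested markup and whitespace, excluding the <voice> tags."""
    without_comments = COMMENT.sub("", ssml)
    return sum(len(body) for body in VOICE_BODY.findall(without_comments))


def estimate_cost(ssml: str, price_per_million_chars: float) -> float:
    """Convert the estimated count into a cost using a caller-supplied rate."""
    return estimate_billable_chars(ssml) * price_per_million_chars / 1_000_000


if __name__ == "__main__":
    sample = (
        '<speak version="1.0" xml:lang="en-US">\n'
        '<voice name="en-US-JennyNeural">'
        '<prosody rate="-10.00%">about </prosody></voice>\n'
        '<voice name="en-US-DavisNeural">'
        '<prosody rate="-10.00%">about </prosody></voice>\n'
        "</speak>"
    )
    # The rate below is a placeholder, not an actual Azure price.
    print(estimate_billable_chars(sample), estimate_cost(sample, 16.0))
```

Applied to the full SSML from the question, this approach yields 480, the same number that appeared on the bill. Keeping the rate as a parameter avoids hard-coding a price that can differ by region and voice type.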

If this answer helped, please remember to select "Accept Answer", as this may help other community members who face a similar issue. Your contribution to the Microsoft Q&A community is highly appreciated.

ggg 0 Reputation points

2025-07-27T21:51:05.3266667+00:00

Oh, my god! This goes completely against people's intuitive understanding. I strongly demand that you clearly emphasize this point on the price description page: https://azure.microsoft.com/zh-cn/pricing/details/cognitive-services/speech-services/ (That the charging is based not on the actual characters but on the characters within the XML file elements.) Otherwise, I believe this is a consumer trap!

Since you don't provide a method for estimating costs, and we can only discover this through the bill, our expenses have skyrocketed to seven times the original amount. If we had known in advance about this charging method, we would have used the synthesisConfig.rate with inputKind set to "PlainText" to achieve the same functional effect, instead of using SSML. Where can I apply for a refund? This is not just a mere technical issue; it's a serious consumer transaction problem.
