Hello ggg,
Thank you for posting your question in the Microsoft Q&A forum.
Azure's Text-to-Speech (TTS) service bills synthesized characters based on the SSML that is submitted, not just the visible text. Markup such as elements and their attributes counts toward the total, along with letters, punctuation, and whitespace. According to the documented billing rules, the <speak> and <voice> tags themselves are the exception. In the provided example, the word "about" accounts for only 60 characters (5 characters × 12 voices), yet the billed count reached 480 characters because the markup inside each <voice> block, such as the nested <prosody> elements and attributes like rate, contributes to the total. Each <voice> block is counted separately, so the per-block markup overhead is multiplied by the number of voices.
To estimate costs before conversion, developers need to parse the SSML themselves and replicate Azure's counting logic, since no dedicated pre-calculation API exists. A workaround is to remove XML comments and count the remaining characters client-side. Whitespace should stay in the count, because spaces, tabs, and line breaks are documented as billable. Alternatively, logging the character counts of test requests and comparing them with the billed usage, or integrating the Azure Retail Prices API, can help project expenses. For user-facing cost estimates, applications should pre-process the SSML locally and multiply the resulting count by Azure's per-character price for the chosen voice type. Until Microsoft offers a native solution, combining client-side calculations with caching of already synthesized audio remains the most practical approach for cost transparency in paid TTS features.
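As a rough illustration of the client-side estimate, here is a minimal Python sketch. It is not Azure's actual counting logic. It assumes that comments and the <speak> and <voice> tags are not billed and that every other character (text, remaining markup, punctuation, whitespace) is billed as one character. The function names and the price value are hypothetical placeholders, so please validate the result against your real billed usage before showing it to users.

```python
import re

# Assumption: XML comments are not billed (the workaround described above).
_COMMENTS = re.compile(r"<!--.*?-->", re.DOTALL)
# Assumption: <speak> and <voice> opening/closing tags are excluded from billing.
_EXCLUDED_TAGS = re.compile(r"</?(?:speak|voice)\b[^>]*>", re.IGNORECASE)


def estimate_billable_chars(ssml: str) -> int:
    """Approximate the billable character count of an SSML document."""
    remaining = _COMMENTS.sub("", ssml)
    remaining = _EXCLUDED_TAGS.sub("", remaining)
    # Everything left (text, other markup, whitespace) is counted as-is.
    return len(remaining)


def estimate_cost(ssml: str, price_per_million_chars: float) -> float:
    """Multiply the estimated count by a per-million-character price."""
    return estimate_billable_chars(ssml) * price_per_million_chars / 1_000_000


if __name__ == "__main__":
    sample = (
        '<speak version="1.0" xml:lang="en-US">'
        '<voice name="en-US-JennyNeural">'
        '<prosody rate="slow">about</prosody>'
        "</voice></speak>"
    )
    # 16.0 is a hypothetical price; look up the real one for your region and voice.
    print(estimate_billable_chars(sample), estimate_cost(sample, 16.0))
```

The regular expressions assume well-formed SSML without ">" inside attribute values. For anything more complex, an XML parser would be the safer choice.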
If this answer helped, please remember to click "Accept Answer" so that other community members facing a similar issue can find it. Your contribution to the Microsoft Q&A community is highly appreciated.