Get topics inference insights

2025-06-09

Topics inference derives topic insights from three sources: transcribed audio, on-screen text extracted by OCR, and celebrities that the Video Indexer facial recognition model recognizes in the video.

In the web portal, the extracted topics and categories (when available) are listed in the Insights tab. To jump to a topic's instance in the media file, select the topic, and then select Play Previous or Play Next.

Topics inference use cases

Personalization: using topics inference to match customer interests. For example, a website about England can post promotions for English movies or festivals.
Deep search: searching archives for insights on specific topics. For example, a news agency can use it to create feature stories about companies, personas, or technologies.
Monetization: increasing the value of extracted insights. For example, industries that rely on ad revenue, such as news or social media, can deliver more relevant ads by passing the extracted insights to the ad server as additional signals.

View the insight JSON with the web portal

After you upload and index a video, download insights in JSON format from the web portal.

Select the Library tab.
Select the media you want.
Select Download, and then select Insights (JSON). The JSON file opens in a new browser tab.
Find the key-value pairs described in the example response.
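
Once the file is downloaded, the topics array can also be located programmatically. The following is a minimal sketch, not an official sample. It assumes the file was saved locally (the path is hypothetical), and because the nesting of the topics key inside the file isn't described here, the hypothetical helper find_key walks the whole document instead of assuming a fixed path.

```python
import json
from typing import Any, Iterator


def find_key(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key`, at any depth of a parsed JSON document."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending: the same key may also appear deeper in the tree.
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def load_topics(path: str) -> list:
    """Return all topic objects found in a downloaded insights file (hypothetical path)."""
    with open(path, encoding="utf-8") as f:
        doc = json.load(f)
    topics = []
    for value in find_key(doc, "topics"):
        if isinstance(value, list):
            topics.extend(value)
    return topics
```

If the file lists topics in more than one place, load_topics returns all of them, so deduplicate by id or name as needed.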

Use the API

Send a Get Video Index request with the query parameter includeSummarizedInsights=false.
Find the key-value pairs described in the following example response.
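
As an illustration, the request URL could be assembled as follows. This is a sketch under stated assumptions: the base URL, the path layout, and the accessToken query parameter are assumptions that should be checked against the API reference, and the function name and argument values are hypothetical. Only includeSummarizedInsights=false comes from this article. The sketch builds the URL and doesn't send the request.

```python
from urllib.parse import quote, urlencode

# Assumption: base URL and path layout; confirm both against the API reference.
API_BASE = "https://api.videoindexer.ai"


def build_get_video_index_url(location: str, account_id: str,
                              video_id: str, access_token: str) -> str:
    """Build a Get Video Index URL that asks for the full (not summarized) insights."""
    segments = (location, "Accounts", account_id, "Videos", video_id, "Index")
    path = "/".join(quote(part, safe="") for part in segments)
    query = urlencode({
        "accessToken": access_token,  # assumption: token passed as a query parameter
        "includeSummarizedInsights": "false",
    })
    return f"{API_BASE}/{path}?{query}"
```

The resulting string can be passed to any HTTP client to issue the GET request.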

Example response

    "topics": [
      {
        "id": 1,
        "name": "Pens",
        "referenceId": "Category:Pens",
        "referenceUrl": "https://en.wikipedia.org/wiki/Category:Pens",
        "referenceType": "Wikipedia",
        "confidence": 0.6833,
        "iabName": null,
        "language": "en-US",
        "instances": [
          {
            "adjustedStart": "0:00:30",
            "adjustedEnd": "0:01:17.5",
            "start": "0:00:30",
            "end": "0:01:17.5"
          }
        ]
      },
      {
        "id": 2,
        "name": "Musical groups",
        "referenceId": "Category:Musical_groups",
        "referenceUrl": "https://en.wikipedia.org/wiki/Category:Musical_groups",
        "referenceType": "Wikipedia",
        "confidence": 0.6812,
        "iabName": null,
        "language": "en-US",
        "instances": [
          {
            "adjustedStart": "0:01:10",
            "adjustedEnd": "0:01:17.5",
            "start": "0:01:10",
            "end": "0:01:17.5"
          }
        ]
      }
    ]
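
The start and end values in each instance are strings in h:mm:ss form with an optional fractional part. The following sketch shows one way to convert them to seconds and list where each topic occurs. The helper names are hypothetical, and the sample data repeats only the fields from the example response that the code needs.

```python
from datetime import timedelta


def parse_timestamp(value: str) -> timedelta:
    """Convert an 'h:mm:ss[.fraction]' string such as '0:01:17.5' to a timedelta."""
    hours, minutes, seconds = value.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


def topic_spans(topics: list) -> list:
    """Return (name, start_seconds, end_seconds) for every instance of every topic."""
    spans = []
    for topic in topics:
        for inst in topic.get("instances", []):
            start = parse_timestamp(inst["start"]).total_seconds()
            end = parse_timestamp(inst["end"]).total_seconds()
            spans.append((topic["name"], start, end))
    return spans


# Trimmed copy of the two topics from the example response.
sample = [
    {"name": "Pens", "instances": [{"start": "0:00:30", "end": "0:01:17.5"}]},
    {"name": "Musical groups", "instances": [{"start": "0:01:10", "end": "0:01:17.5"}]},
]
for name, start, end in topic_spans(sample):
    print(f"{name}: {start:.1f}s to {end:.1f}s")
```

For the sample data, this prints "Pens: 30.0s to 77.5s" and "Musical groups: 70.0s to 77.5s".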

Important

Read the transparency note overview for all VI features. Each insight also has its own transparency note.

Topics inference notes

When uploading a file, always use high-quality video content. The recommended maximum frame size is HD, and the recommended maximum frame rate is 30 FPS. A frame should contain no more than 10 people. When outputting frames from videos to AI models, send only two or three frames per second. Processing 10 or more frames per second might delay the AI result.
When uploading a file, always use high-quality audio and video content. At least 1 minute of spontaneous conversational speech is required to perform analysis. Audio effects are detected in nonspeech segments only. The minimal duration of a nonspeech section is 2 seconds. Voice commands and singing aren't supported.
Typically, small people or objects under 200 pixels and people who are seated might not be detected. People wearing similar clothes or uniforms might be detected as being the same person and are given the same ID number. People or objects that are obstructed might not be detected. Tracks of people with front and back poses might be split into different instances.
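
The frame-rate guidance above amounts to simple arithmetic. As a rough sketch, assuming frames are picked at a fixed interval (the function name and defaults are hypothetical, and this isn't how Video Indexer itself samples frames), a 30 FPS source reduced to about three frames per second keeps every tenth frame:

```python
def sample_frame_indices(total_frames: int, source_fps: float = 30.0,
                         target_fps: float = 3.0) -> list:
    """Pick evenly spaced frame indices so about `target_fps` frames per second remain."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Keep one frame out of every `step` frames; never skip below a step of 1.
    step = max(1, round(source_fps / target_fps))
    return list(range(0, total_frames, step))
```

For example, sample_frame_indices(90) covers three seconds of 30 FPS video and keeps nine frames.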

Topics inference components

Component	Definition
Source language	The user uploads the source file for indexing.
Preprocessing	Transcription, OCR, and facial recognition AIs extract insights from the media file.
Insights processing	Topics AI analyzes the transcription, OCR, and facial recognition insights extracted during preprocessing: - Transcribed text: each line of the transcribed text is examined using ontology-based AI technologies. - OCR and facial recognition: these insights are examined together using ontology-based AI technologies.
Post-processing	- Transcribed text: insights are extracted and tied to a topic category together with the line number of the transcribed text. For example, Politics in line 7. - OCR and facial recognition: each insight is tied to a topic category together with the time of the topic’s instance in the media file. For example, Freddie Mercury in the People and Music categories at 20.00.
Confidence value	The estimated confidence level of each topic is expressed as a value from 0 to 1. The confidence score represents the certainty in the accuracy of the result. For example, an 82% certainty is represented as a 0.82 score.
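
Because the confidence value is a plain number between 0 and 1, it can be used to filter and rank topics. The following is a minimal sketch. The helper names and the 0.5 default threshold are hypothetical choices for illustration, not recommended values.

```python
def confident_topics(topics: list, min_confidence: float = 0.5) -> list:
    """Keep topics at or above `min_confidence`, highest confidence first."""
    if not 0.0 <= min_confidence <= 1.0:
        raise ValueError("min_confidence must be between 0 and 1")
    kept = [t for t in topics if t.get("confidence", 0.0) >= min_confidence]
    return sorted(kept, key=lambda t: t["confidence"], reverse=True)


def as_percent(confidence: float) -> str:
    """Format a 0-to-1 confidence score as a whole-number percentage."""
    return f"{confidence:.0%}"
```

With this helper, as_percent(0.82) returns "82%", which matches the example in the table.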

Sample code

See all samples for VI
