Help with Object Detection and Bounding Box Extraction Using Azure Content Understanding

Duygu Doğan 20 Reputation points
2025-05-13T07:18:44.16+00:00

Hello,

I’m currently working on a video analysis project using Azure Content Understanding. My goal is to detect objects in video files and extract their bounding box positions (coordinates within the frame).

However, I’m having trouble getting the expected results. I always receive an "Invalid request" error, even though I’ve tried changing the requested fields and categories based on the documentation.

Could you please guide me on:

How to correctly request object detection and retrieve bounding box information (e.g., x, y, width, height) using Azure Content Understanding?

What fields and categories are required or supported to get object location data?

How can I troubleshoot or resolve the "Invalid request" error?

Any example requests, payload structures, or best practices would be greatly appreciated.

Thank you!

Azure AI Video Indexer
An Azure video analytics service that uses AI to extract actionable insights from stored videos.

Accepted answer
    Sina Salam 22,576 Reputation points Volunteer Moderator
    2025-05-13T13:09:16.1633333+00:00

    Hello Duygu Doğan,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you're having difficulty extracting object bounding boxes from video files using Azure Content Understanding.

    To clarify, Azure Content Understanding is not directly optimized for full video analysis. Instead, Microsoft provides a more suitable tool for this purpose: Azure Video Indexer. This service is specifically designed to process video content and extract rich metadata, including object detection with bounding box coordinates.

    Best Practice Approach:

    1. Use Azure Video Indexer for video analysis: if your goal is to detect objects and retrieve their positions within video frames, Azure Video Indexer is the most appropriate solution.
    2. Begin by uploading your video to Azure Video Indexer via the portal or API. Once uploaded, the service will automatically analyze the content and generate insights.
    3. After indexing is complete, navigate to the Library section in the portal. Select your video, then choose Download > Insights (JSON). This file contains all detected metadata, including objects, scenes, and timestamps.
    4. Within the JSON, look for the detectedObjects section. Each object entry includes its type (e.g., "Car", "Person"), confidence score, and time range. While bounding box coordinates are not always listed directly, they can be inferred from associated thumbnails or frame-level metadata; see the parsing sketch after this list. For more details, check the official documentation on Azure Video Indexer Object Detection.
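
    As a rough illustration, here is a minimal Python sketch for pulling detected objects out of the downloaded insights JSON. It assumes the documented videos[0].insights.detectedObjects shape; field names can vary by API version, so verify them against your own file:

    import json

    def list_detected_objects(insights_path):
        # Load the insights file downloaded from the Video Indexer portal or API.
        with open(insights_path, "r", encoding="utf-8") as f:
            insights = json.load(f)
        for video in insights.get("videos", []):
            # "detectedObjects" is absent when no objects were found.
            for obj in video.get("insights", {}).get("detectedObjects", []):
                name = obj.get("displayName", "unknown")
                for inst in obj.get("instances", []):
                    # Each instance carries a time range and a confidence score.
                    print(f"{name}: {inst.get('adjustedStart')} - {inst.get('adjustedEnd')}, "
                          f"confidence {inst.get('confidence')}")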

    An alternative method: frame-by-frame analysis using Azure AI Vision.

    This works well if you prefer more control or need precise bounding box coordinates per frame. The approach is to extract individual frames from the video and analyze them with the Azure AI Vision API, which involves two main steps:

    1. Extract frames from the video: use a tool like OpenCV to grab frames at a desired interval (e.g., one frame per second) and save each frame as an image file.
    2. Send each frame to Azure AI Vision: call the /vision/v3.2/detect endpoint to analyze each image. The API returns the detected objects along with their bounding box coordinates (x, y, w, h) and confidence scores.

    You can use a Python script like the one below for frame extraction and object detection; the same logic can be adapted to any other language:

    import cv2
    import requests
    import os

    def extract_frames(video_path, output_folder, frame_rate=1):
        """Save frames from the video at roughly `frame_rate` frames per second."""
        video = cv2.VideoCapture(video_path)
        fps = video.get(cv2.CAP_PROP_FPS)
        # Guard against videos that report an FPS lower than the requested rate.
        frame_interval = max(1, int(fps / frame_rate))
        if not os.path.exists(output_folder):
            os.makedirs(output_folder)
        frame_count = 0
        saved_frame_count = 0
        while video.isOpened():
            ret, frame = video.read()
            if not ret:
                break
            # Keep only every `frame_interval`-th frame.
            if frame_count % frame_interval == 0:
                frame_filename = os.path.join(output_folder, f"frame_{saved_frame_count}.jpg")
                cv2.imwrite(frame_filename, frame)
                saved_frame_count += 1
            frame_count += 1
        video.release()
        return saved_frame_count

    def detect_objects_in_frame(frame_path, subscription_key, endpoint):
        """Send one frame to the Azure AI Vision object detection endpoint."""
        analyze_url = f"{endpoint}/vision/v3.2/detect"
        headers = {
            'Ocp-Apim-Subscription-Key': subscription_key,
            'Content-Type': 'application/octet-stream'
        }
        with open(frame_path, 'rb') as f:
            data = f.read()
        response = requests.post(analyze_url, headers=headers, data=data)
        response.raise_for_status()  # Surface HTTP errors (e.g., 400 Invalid request) early.
        return response.json()

    def process_video(video_path, output_folder, subscription_key, endpoint):
        """Extract frames, run detection on each, and print the bounding boxes."""
        frame_count = extract_frames(video_path, output_folder)
        print(f"Extracted {frame_count} frames from the video.")
        for i in range(frame_count):
            frame_path = os.path.join(output_folder, f"frame_{i}.jpg")
            result = detect_objects_in_frame(frame_path, subscription_key, endpoint)
            print(f"Results for frame {i}:")
            for obj in result.get('objects', []):
                print(f"Object: {obj['object']}, Bounding Box: {obj['rectangle']}, Confidence: {obj['confidence']}")
    

    The above Python script automates the entire process:

    • Extracts frames from the video.
    • Sends each frame to Azure AI Vision.
    • Prints the detected objects and their bounding boxes.
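
    For example, here is a minimal usage sketch for the script above (the video path, output folder, key, and endpoint are hypothetical placeholders to replace with your own values):

    # Hypothetical placeholder values -- substitute your own resource details.
    subscription_key = "your_subscription_key"
    endpoint = "https://your_region.api.cognitive.microsoft.com"
    process_video("input_video.mp4", "frames_output", subscription_key, endpoint)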

    If you prefer using the Azure SDK for Python, below is a simplified version using the azure-cognitiveservices-vision-computervision package:

    from azure.cognitiveservices.vision.computervision import ComputerVisionClient
    from msrest.authentication import CognitiveServicesCredentials

    # Replace with your own Azure AI Vision key and endpoint.
    subscription_key = "your_subscription_key"
    endpoint = "https://your_region.api.cognitive.microsoft.com"
    client = ComputerVisionClient(endpoint, CognitiveServicesCredentials(subscription_key))

    # Detect objects in a single extracted frame.
    with open("frame_0.jpg", "rb") as image_stream:
        detection_result = client.detect_objects_in_stream(image=image_stream)

    # Each result exposes the object name, bounding box, and confidence.
    for obj in detection_result.objects:
        print(f"{obj.object_property} at {obj.rectangle.x}, {obj.rectangle.y}, "
              f"{obj.rectangle.w}, {obj.rectangle.h} with confidence {obj.confidence}")
    

    For more details, check the Azure AI Vision SDK Docs.

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.


    Please don't forget to close the thread by upvoting and accepting this as an answer if it is helpful.
