Analyze Videos with Azure OpenAI GPT-4 Turbo with Vision and Azure Data Factory
Published Jan 18 2024

Azure OpenAI's GPT-4 Turbo with Vision (GPT-4V) is revolutionizing how businesses utilize video data. This powerful AI tool, built on Azure's robust cloud platform, offers scalable video analysis with enterprise-level security. Whether it's streamlining quality control in manufacturing with precise defect detection, assessing damage to products in transit, detecting a specific image in a video, or summarizing videos, GPT-4V provides swift and accurate analysis, saving valuable time and resources.

 

Yet, for those not versed in Python or .NET, tapping into Azure OpenAI's potential can seem daunting. Azure Data Factory (ADF) steps in as a low-code solution to orchestrate Azure OpenAI service calls and manage output ingestion. ADF makes it easy to configure, customize, and parameterize prompts and other Azure OpenAI inputs as well as data sources. This parameterization makes the pipelines reusable across different data sources, such as the storage account that contains the videos for GPT-4V analysis, and across different prompts and system messages. ADF seamlessly and securely connects with Azure OpenAI and other Azure resources, like Key Vault, Storage Accounts, and databases such as Azure Cosmos DB or Azure SQL. With ADF, data developers can swiftly craft secure, maintainable, and reusable pipelines.

 

In this blog post, I cover an ADF solution that loops through a folder of videos and calls a pipeline to create a video retrieval index, ingest the video into the index, call the GPT-4V deployment, and store the results in a database.

 

This solution is ideal for a development/test environment where you can refine the system prompt, user message, and other inputs until you are satisfied with the GPT-4V output. You can then schedule the pipeline for batch processing or change the solution to analyze each video as soon as a blob storage event occurs.

 

Architecture

 

jehayes_0-1705533940755.png

  1. Land videos in Azure Blob storage with Azure Event Grid, Azure Logic Apps, Azure Functions, other ADF pipelines or other applications. 
  2. The ADF pipeline retrieves the Azure AI API endpoints, keys and other configurations from Key Vault.
  3. The blob storage URL for the video file is retrieved.
  4. With Azure Computer Vision, a video retrieval index is created for the file and the video is ingested. Depending on your use case, you could ingest multiple videos to the same index.
  5. Call the GPT-4V deployment in Azure OpenAI, passing in the video URL, the video retrieval index, the system message, the user prompt, and other inputs.
  6. Save the response to Azure Cosmos DB.
  7. If the video processes successfully, move the video to an archive folder.

Resources Used in this Solution

Security Requirements

  • The ADF managed identity needs the following access on these resources:

    • Key Vault- Key Vault Secrets User role
    • Cosmos – Contributor
    • Storage Account that contains videos – Storage Blob Data Reader
    • Storage Account for archive of videos - Storage Blob Data Contributor
    • Computer Vision – Cognitive Services Contributor
  • Store the following Secrets in Azure Key Vault:
    • Computer Vision Endpoint and Key
    • A Shared Access Signature Token for the container that has the videos. (SAS is currently required for Computer Vision Video Retrieval and Azure OpenAI to access the storage container)
    • Open AI Endpoint and Key
    • GPT-4V deployment name. This is not really an input that needs to be secured, but I saved it here for ease of use. (A sketch of the Key Vault call the pipeline uses to retrieve these secrets follows this list.)
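
Under the hood, the secret lookups are just Web activities issuing REST GET calls against Key Vault with the ADF managed identity. Here is a minimal sketch of the equivalent call (the vault and secret names below are illustrative, not the ones in the solution):

```python
import requests
from azure.identity import DefaultAzureCredential  # pip install azure-identity

VAULT_URL = "https://my-keyvault.vault.azure.net"   # illustrative vault name
SECRET_NAME = "cv-key"                              # illustrative secret name

# ADF acquires the same kind of token with its managed identity.
token = DefaultAzureCredential().get_token("https://vault.azure.net/.default").token

resp = requests.get(
    f"{VAULT_URL}/secrets/{SECRET_NAME}",
    params={"api-version": "7.4"},
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
print(resp.json()["value"])  # the secret value the pipeline stores in a return variable
```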

jehayes_0-1705534728052.png

 

ADF Orchestration Pipeline

The ADF orchestration pipeline takes parameter inputs and gets secrets from Azure Key Vault, then loops through a Storage Account container, calling another pipeline to ingest each video into a Computer Vision Video Retrieval index, call GPT-4V, and ingest the results into Cosmos DB.

jehayes_1-1705535078360.png

  1. Input parameters for pipeline
    1. sys_message – initial instructions to the model about the task GPT-4V is expected to perform
    2. user_prompt – the query to be answered by GPT-4V
    3. storageaccounturl – endpoint for the storage account
    4. storageaccountcontainer – the container that contains the videos
    5. temperature – value between 0 and 2, where 0 gives the most deterministic, consistent results and 2 the most creative
    6. top_p – value between 0 and 1 that restricts sampling to the most probable subset of tokens
  2. Get the secrets from key vault and store them as return variables
  3. Set a variable that contains the name/value pair for temperature. With the temperature parameter set to 0.5 above, this returns "temperature": 0.5
  4. Set a variable that contains the name/value pair for top_p. The parameter above is not set, so this will be blank.
  5. Get the child items (video file names) in the location specified by the storageaccounturl and storageaccountcontainer values.
  6. For each video, call the pipeline childAnalyzeVideo, passing in the following values for the pipeline parameters (a sketch of this logic follows the list):

jehayes_1-1705535323816.png
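
Conceptually, steps 3 through 6 turn an optional sampling parameter into a JSON name/value fragment and then fan out over the videos in the container. A minimal Python sketch of that idea (the real pipeline does this with ADF expressions and a ForEach activity, so treat the code below as illustrative only):

```python
import json

def sampling_fragment(name: str, value) -> str:
    """Mimic the ADF variable: return a '"name": value' fragment, or '' when unset."""
    return "" if value in (None, "") else f'"{name}": {json.dumps(value)}'

print(sampling_fragment("temperature", 0.5))  # '"temperature": 0.5'
print(sampling_fragment("top_p", None))       # ''  (parameter not set, so the variable stays blank)

# The Get Metadata activity lists the video files in the container, and the ForEach
# activity then calls childAnalyzeVideo once per file, passing the file name plus
# sys_message, user_prompt, and the two fragments above as pipeline parameters.
```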

 

Video Ingestion/GPT-4V with Pipeline childAnalyzeVideo

The child pipeline will create an index, ingest the video into the index, call GPT-4V to analyze the video, store the results in Cosmos DB, and move the file to the appropriate folder. 

 

 

jehayes_0-1705538247648.png

 

  1. Parameters for processing the video file (see previous section, bullet point 6, on parameters and inputs to this pipeline)
  2. Set the indexName variable – the index name must be unique and can only include letters, numbers, and hyphens (see the naming sketch after the next step)

jehayes_2-1705535479848.png

 

  1. Create an Index Id – this also must be unique

jehayes_1-1705535463976.png
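
Both names have to be unique per video run, and the index name is restricted to letters, numbers, and hyphens. A quick sketch of one way to derive both from the video file name (the exact expressions in the pipeline differ; this is just to show the idea):

```python
import re
import uuid

def make_index_name(video_file: str) -> str:
    """Lowercase the file name, drop the extension, replace anything that is not
    a letter, number, or hyphen, and append a short unique suffix."""
    stem = video_file.rsplit(".", 1)[0].lower()
    safe = re.sub(r"[^a-z0-9-]", "-", stem).strip("-")
    return f"{safe}-{uuid.uuid4().hex[:8]}"

index_name = make_index_name("Truck Dent 01.mp4")   # e.g. 'truck-dent-01-3f9a2c1d'
ingestion_id = f"{index_name}-ingestion"            # the ingestion/index ID must also be unique
print(index_name, ingestion_id)
```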

 

  1. Create the Computer Vision Video Retrieval index. The next three steps for creating the index, ingesting the video, and checking for ingestion completion follow the first three steps of this How-To Guide. Here's the complete Video Retrieval API Reference. (A sketch of these three REST calls appears after step 3.)

jehayes_3-1705535552558.png
  2. Ingest the video into the index

jehayes_1-1705538301059.png

     

  3. Call the Computer Vision Video Retrieval API until ingestion is complete or has timed out

jehayes_0-1705537489565.png
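
For reference, the three Video Retrieval calls behind these Web activities looked roughly like this at the time of writing; the API is in preview, so confirm the version and payload shapes against the linked API reference. The endpoint, key, and SAS URL come from the Key Vault secrets, and the names reuse the naming sketch above:

```python
import time
import requests

cv_endpoint = "https://<your-computer-vision>.cognitiveservices.azure.com"  # from Key Vault
cv_key = "<computer-vision-key>"                                            # from Key Vault
video_sas_url = "https://<storageaccount>.blob.core.windows.net/videos/truck.mp4?<sas-token>"
index_name = "truck-dent-01-3f9a2c1d"     # from the naming sketch above
ingestion_id = f"{index_name}-ingestion"
document_id = "truck.mp4"                  # the video's ID within the index

params = {"api-version": "2023-05-01-preview"}   # preview version at the time of writing
headers = {"Ocp-Apim-Subscription-Key": cv_key, "Content-Type": "application/json"}
base = f"{cv_endpoint}/computervision/retrieval/indexes/{index_name}"

# 1. Create the index (add a "speech" feature too if you want the audio transcript indexed).
requests.put(base, params=params, headers=headers,
             json={"features": [{"name": "vision"}]}).raise_for_status()

# 2. Ingest the video into the index via its SAS URL.
requests.put(f"{base}/ingestions/{ingestion_id}", params=params, headers=headers,
             json={"videos": [{"mode": "add",
                               "documentId": document_id,
                               "documentUrl": video_sas_url}]}).raise_for_status()

# 3. Poll the ingestion state until it completes, fails, or we give up.
for _ in range(30):
    state = requests.get(f"{base}/ingestions", params=params,
                         headers=headers).json()["value"][0]["state"]
    if state in ("Completed", "Failed"):
        break
    time.sleep(10)
```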

     

  4. If the index has been created and the video successfully ingested ...

jehayes_5-1705535743354.png

     

a. Call GPT-4V with inputs including the system message and user prompt, and store the results in Cosmos DB

Copy Data source properties: a REST API linked service to the GPT-4V deployment. This article shows a good example of this REST API POST for GPT-4V, and here's the Chat Completion API Reference. (A sketch of the request body follows the screenshot below.)

jehayes_2-1705536106268.png
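
At the time of writing, the POST body the Copy activity sends to the GPT-4V deployment's extensions/chat/completions endpoint looked roughly like the sketch below. The video-enhancement schema is preview and has evolved, so verify the field names and api-version against the Chat Completion API Reference; all placeholder values are illustrative:

```python
import requests

aoai_endpoint = "https://<your-aoai>.openai.azure.com"   # from Key Vault
aoai_key = "<aoai-key>"                                   # from Key Vault
deployment = "<gpt-4v-deployment-name>"                   # from Key Vault
cv_endpoint = "https://<your-computer-vision>.cognitiveservices.azure.com"
cv_key = "<computer-vision-key>"
index_name = "truck-dent-01-3f9a2c1d"
document_id = "truck.mp4"
video_sas_url = "https://<storageaccount>.blob.core.windows.net/videos/truck.mp4?<sas-token>"
sys_message = "Your task is to analyze vehicles for damage. ..."
user_prompt = "Describe any damage to the vehicle in this video."

body = {
    "enhancements": {"video": {"enabled": True}},
    "dataSources": [{
        "type": "AzureComputerVisionVideoIndex",
        "parameters": {
            "computerVisionBaseUrl": f"{cv_endpoint}/computervision",
            "computerVisionApiKey": cv_key,
            "indexName": index_name,
            "videoUrls": [video_sas_url],
        },
    }],
    "messages": [
        {"role": "system", "content": sys_message},
        {"role": "user", "content": [
            {"type": "acv_document_id", "acv_document_id": document_id},
            {"type": "text", "text": user_prompt},
        ]},
    ],
    "temperature": 0.5,   # only included when the temperature parameter is set
    "max_tokens": 800,
}

resp = requests.post(
    f"{aoai_endpoint}/openai/deployments/{deployment}/extensions/chat/completions",
    params={"api-version": "2023-12-01-preview"},
    headers={"api-key": aoai_key},
    json=body,
)
resp.raise_for_status()
answer = resp.json()["choices"][0]["message"]["content"]   # mapped to Cosmos DB
usage = resp.json()["usage"]                               # prompt_tokens / completion_tokens
```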

I also added Additional Columns to the source:

jehayes_0-1705593408164.png

 

 

The sink properties simply point to the Cosmos DB container:

jehayes_0-1705588554281.png

 

 

Set the mapping properties, including the content from the GPT-4V return message, prompt_tokens, completion_tokens, and the additional columns from the source:

jehayes_1-1705536542776.png

b. Do a Lookup on the Cosmos DB item just added to get the Damage Probability (see the next section on creating results that can easily be queried)

jehayes_2-1705536926461.png

c. If the Damage Probability value is greater than one, set the processedfolder value to "reviewfordamage"; otherwise set it to "processed".

d. Move the video to the folder specified by the processedfolder variable and delete it from the source storage location
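
In ADF, steps c and d are an If Condition plus Copy and Delete activities. For context, an equivalent sketch using the Azure Storage SDK (account, container, and blob names below are illustrative):

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob

damage_probability = 8   # value returned by the Lookup activity
processed_folder = "reviewfordamage" if damage_probability > 1 else "processed"

source_url = "https://<storageaccount>.blob.core.windows.net/videos/truck.mp4?<sas-token>"
archive = BlobServiceClient("https://<archiveaccount>.blob.core.windows.net",
                            credential=DefaultAzureCredential())

# Copy the video into the chosen archive folder...
dest = archive.get_blob_client(container="archive", blob=f"{processed_folder}/truck.mp4")
dest.start_copy_from_url(source_url)  # in production, wait for the copy to complete first

# ...then delete it from the source container.
source = BlobServiceClient("https://<storageaccount>.blob.core.windows.net",
                           credential=DefaultAzureCredential())
source.get_blob_client(container="videos", blob="truck.mp4").delete_blob()
```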

 

That's it! The videos can be analyzed with GPT-4 Turbo with Vision, without having to worry about environment variables or config files for API keys and endpoints! No need to develop in Python and deploy apps to Azure Functions! The same ADF orchestration pipeline can be used for many types of video analysis by just changing the system message, prompt, and storage location for each ADF trigger that calls it!

 

Author comments:

Crafting the system message or user prompt to create data fields

In this solution, I wanted very specific output from analyzing the videos. The videos I have are of vehicles that may or may not have damage. So I composed my system message so GPT-4V would assess the likelihood of any damage, the severity of the damage, the location of the damage, and the type of vehicle it is viewing, and return that information in a specific format. Below is my system message:

 

Your task is to analyze vehicles for damage. You need to inspect the video closely and describe any damage to the vehicle, such as dents, scratches, broken lights, broken windows, etc. Sometimes duct tape may be used to cover up damage; this should be treated as potential damage and described as well. You need to pay close attention, especially to distinguish between damage to the vehicle's body and glare from the lights in the garage. First provide a summary of the vehicle and the damage or potential damage to the vehicle in the video. Also return a description of what type of vehicle it is in the format of VehicleType[vehicletype], for example VehicleType[Ford F150]. If you can't identify the exact model, return what type of vehicle it is, such as VehicleType[Sedan] or VehicleType[Truck]. Rank each video on a scale of 1 to 10, where 1 is the probability of no damage and 10 is a high probability of damage. Describe your reasoning for the rank and output your rank in the format of DamageProbability[rank], for example DamageProbability[4]. If there is damage, along with describing what the damage is, provide a short description of the damage in the format of Damage[damages], for example Damage[dent] or Damage[dent, scratch]. If there is no damage, return Damage[NA]. Also rank the severity of the damage, where a scratch or small dent would be Low; multiple scratches, larger dents, or broken headlights would be Medium; broken windows or very large dents would be High. Provide the severity ranking in the format of Severity[severityranking], for example Severity[medium]. If there is no damage, return Severity[NA]. Provide a short description of the location of the damage in the format of Location[damagelocation], for example Location[hood] or Location[front passenger door, hood]. If there is no damage, return the general location of the portion of the vehicle being examined, for example Location[passenger side low].
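
Because every field comes back in a predictable Tag[value] format, the response can be parsed downstream. The solution does this with a database query (next section), but here is a minimal sketch of the same extraction, using an illustrative GPT-4V response:

```python
import re

response = (
    "The video shows a white Ford F150 with a large dent on the hood. "
    "VehicleType[Ford F150] DamageProbability[8] Damage[dent] Severity[Medium] Location[hood]"
)  # illustrative output following the system message's format

def extract(tag: str, text: str):
    """Return the value inside Tag[...], or None if the tag is missing."""
    match = re.search(rf"{tag}\[(.*?)\]", text)
    return match.group(1) if match else None

fields = {tag: extract(tag, response)
          for tag in ("VehicleType", "DamageProbability", "Damage", "Severity", "Location")}
print(fields)  # {'VehicleType': 'Ford F150', 'DamageProbability': '8', ...}
```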

 

I can then query the results in the database:

jehayes_2-1705538733026.png

 

One video per index? Or many videos per index?

In this solution, I create a separate index for each video. You can also create an index and ingest many videos. I created one index per video because I wanted very specific questions about each video and wanted each video analyzed separately. If I had broader questions about all the videos, such as "count how many blue trucks are being transported", I would have created an index with many video ingestions.

 

I hope you enjoyed this article! This solution is available in our FTA GitHub repo, AI-in-a-Box! Check out the other OpenAI solutions there that are ready to be deployed into your Azure subscription! Also check out Analytics-in-a-Box for solutions on Data Factory, Synapse, and Fabric!

 

Below are some sample videos you can use with the AI-in-a-Box solution:
