Building a Document Intelligence Custom Classification Model with the Python SDK
Published Apr 03 2024 07:28 AM
Microsoft

 

Introduction:

 

In the world of document processing and automation, one of the most frequent use cases is categorizing and organizing documents into predefined classes. For instance, an organization may have a process that ingests documents that then need to be classified into separate categories such as “invoices”, “contracts”, “reports”, etc. Azure AI Document Intelligence custom classification models can address these needs and offer a powerful way to bring order to document management.


Document Intelligence is a cloud-based Azure AI service that uses machine learning models to automate document processing in applications and workflows. New users and those unfamiliar with Document Intelligence's capabilities may be interested in starting their journey using Document Intelligence Studio, an online tool to visually explore, understand, train, and implement features from the Document Intelligence service without having to write a single line of code. However, more advanced use cases and integrations may necessitate interacting with the Document Intelligence service programmatically. This can be achieved using the Document Intelligence REST API or SDKs available for .NET, Java, JavaScript, and Python. In this article we'll focus specifically on building a custom classification model using Python, one of the more popular languages amongst data science and machine learning developers.


Those wanting to get a head start creating a custom classification model programmatically may look to utilize the existing sample_classify_document.py code sample from the azure-sdk-for-python repository. However, for this sample script to work, the classifier training data set must already include ocr.json files for each document. Optical Character Recognition (OCR) is a critical step in converting scanned documents into editable and searchable data. While Azure AI Document Intelligence Studio automatically generates the OCR files behind-the-scenes when building a custom classification model using the visual interface, those utilizing the Python SDK may find themselves at a crossroads due to the lack of this built-in functionality.

[Figure: creating a classifier project in Document Intelligence Studio]

 

The Challenge:

 

The Document Intelligence Python SDK provides a powerful set of tools for extracting information from forms and documents. However, one key limitation is its lack of a method to easily generate ocr.json files from layout analysis results, a feature that is completely integrated and handled automatically in Document Intelligence Studio.

 

As described in the documentation here, the required ocr.json files can be created by analyzing each training document with Document Intelligence's prebuilt layout model and saving the results in the proper API response format. There is a sample Python script, sample_analyze_layout.py, but because the SDK's layout results object is structured differently from the API's layout results object, there isn't a clear way to generate the required ocr.json files strictly using the Python SDK. This blog post delves into the custom solution we developed to code this process manually, addressing a common problem discussed in the Microsoft community.

 

 


Our Custom Solution:

 

To bridge this gap and create the ocr.json files in the correct format programmatically, we implemented custom code that accesses the API layout results object through a little-known callback available within the Document Intelligence Python SDK (the cls keyword argument). This custom classifier code emulates the OCR file creation process that Document Intelligence Studio performs: it uses the Python SDK to analyze each document, captures the raw API response containing the extracted text and structural data, and saves it in the JSON structure required for OCR files.
 

Step-by-Step Guide to Building the Classifier:

 

The custom classifier code consists of several key components:

 

1. Preparation of Documents: Start by gathering the documents you wish to analyze. These can be in various formats, such as PDFs, Word documents, or images. You can reference the documentation here for the full list of supported file types. Make sure they are in a separate "training folder" that the code will reference, structured as follows:
 
[Screenshot: training folder structure]
 

 

2. Document Analysis with Azure AI Document Intelligence Layout Model: Utilize the Azure Document Intelligence Layout model to analyze the documents. This is done by running analyze_layout.py, which will iterate through files in the specified directory (TRAINING_DOCUMENTS) and analyze each document using Azure AI Document Intelligence. It saves the results in a .ocr.json file alongside the original document. This format mirrors the OCR output of the Document Intelligence Studio, maintaining consistency and compatibility.
 
# Use begin_analyze_document to start the analysis process, and pass a callback via cls in order to receive the raw response
with open(document_file_path, "rb") as f:
    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-layout",
        analyze_request=f,
        content_type="application/octet-stream",
        cls=lambda raw_response, _, headers: create_ocr_json(ocr_json_file_path, raw_response),
    )
# ... other code ...
# Callback function to save the raw API response as a .ocr.json file
def create_ocr_json(ocr_json_file_path, raw_response):
    # Write the response body unchanged so the file keeps the API's JSON format
    with open(ocr_json_file_path, "w", encoding="utf-8") as f:
        f.write(raw_response.http_response.body().decode("utf-8"))
    print(f"\tOutput saved to {ocr_json_file_path}")
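The excerpt above does not show how the script walks the training folder. The following is a minimal sketch of that iteration, not the actual analyze_layout.py: the helper name iter_training_documents and the extension set are hypothetical, and the output name is assumed to be the document's file name with ".ocr.json" appended.

```python
from pathlib import Path

# Assumed subset of input types; check the service documentation for the full list.
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".tiff", ".bmp"}


def iter_training_documents(training_dir):
    """Yield (document_path, ocr_json_path) pairs for documents that still need analysis."""
    for path in sorted(Path(training_dir).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED_EXTENSIONS:
            continue  # skips folders, existing .ocr.json outputs, and .jsonl lists
        # Assumption: the OCR file sits next to the document as "<file name>.ocr.json"
        ocr_json_path = path.with_name(path.name + ".ocr.json")
        if ocr_json_path.exists():
            continue  # already analyzed on a previous run
        yield path, ocr_json_path
```

Each yielded pair could then be passed as document_file_path and ocr_json_file_path to the analysis call. Skipping documents that already have an OCR file avoids paying for a second layout analysis when the script is rerun.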

 

3. Upload Documents with the Labeled Data to an Azure Blob Storage Container: This is done by running upload_documents.py, which uploads all the training documents along with the .ocr.json files and a .jsonl file that the classifier build uses to reference each of the documents. The .jsonl file allows us to process multiple documents in a batch, improving the efficiency of the training process.
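As an illustration of what such a .jsonl file list might contain, here is a hedged sketch that writes one file list per document class. It assumes one subfolder per class inside the training folder and assumes each line has the form {"file": "<path relative to the container>"}, which should be verified against the service documentation. The function name write_file_lists is hypothetical and is not part of upload_documents.py.

```python
import json
from pathlib import Path


def write_file_lists(training_dir):
    """Write one <class>.jsonl per class subfolder and return {class_name: jsonl_path}."""
    root = Path(training_dir)
    file_lists = {}
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # List only the source documents, not the generated OCR files or file lists
        documents = sorted(
            p for p in class_dir.iterdir()
            if p.is_file() and not p.name.endswith((".ocr.json", ".jsonl"))
        )
        jsonl_path = root / f"{class_dir.name}.jsonl"
        with open(jsonl_path, "w", encoding="utf-8") as f:
            for doc in documents:
                # Assumed line format: a blob path relative to the container root
                f.write(json.dumps({"file": f"{class_dir.name}/{doc.name}"}) + "\n")
        file_lists[class_dir.name] = jsonl_path
    return file_lists
```

The blob paths written here only resolve correctly if the documents are uploaded with the same relative paths, so the upload step and the file lists need to agree on the folder layout.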

 

4. Build Classifier: The build_classifier.py script initiates the process of building a custom document classifier using the document types and labeled data from the .jsonl files. It utilizes the DocumentIntelligenceAdministrationClient and BlobServiceClient classes, which are used to interface with the Document Intelligence and Azure Blob Storage services to retrieve and process the training data uploaded in the previous step. Once finished, it prints the results including the classifier ID, API version, description, and document classes used for training.
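To make the shape of a build request more concrete, the sketch below assembles a request body as a plain dictionary, with one file-list source per document type. The field names (classifierId, docTypes, azureBlobFileListSource, containerUrl, fileList) reflect our understanding of the REST API's build-classifier body and should be checked against the API reference. The function build_classifier_request is a hypothetical helper, not code from build_classifier.py, and the SDK's typed request models may be used instead.

```python
def build_classifier_request(classifier_id, container_url, file_lists, description=None):
    """Return a build-classifier request body with one file-list source per document type.

    file_lists maps each document type to the name of its .jsonl blob in the container.
    """
    body = {
        "classifierId": classifier_id,
        "docTypes": {
            doc_type: {
                # Assumed field names, mirroring the REST request body
                "azureBlobFileListSource": {
                    "containerUrl": container_url,
                    "fileList": file_list,
                }
            }
            for doc_type, file_list in file_lists.items()
        },
    }
    if description:
        body["description"] = description
    return body
```

Keeping the body construction separate from the service call makes it easy to inspect or log exactly which document types and file lists will be submitted before starting a build.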

 

5. Classify Documents: classify_document.py utilizes the DocumentIntelligenceClient class to classify a document using a trained document classifier. It analyzes one document at a time and returns the document type along with the confidence score.
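Reading the document type and confidence out of a classification result can be sketched as follows. This helper is hypothetical and works on the result in dictionary form, assuming the REST field names docType and confidence inside a documents list. When working with the SDK's result object, the equivalent attributes would be read instead.

```python
def summarize_classification(analyze_result):
    """Return (doc_type, confidence) pairs from a result dict, highest confidence first."""
    documents = analyze_result.get("documents") or []
    pairs = [(doc.get("docType"), doc.get("confidence")) for doc in documents]
    # Treat a missing confidence as 0.0 so sorting never compares None with a float
    return sorted(pairs, key=lambda pair: pair[1] or 0.0, reverse=True)
```

Sorting by confidence is a design choice for readability. For a single-page document the list would typically hold one entry, and a low confidence score is a signal to route the document for manual review.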

Conclusion:

 

While the Python SDK does not provide an out-of-the-box solution for OCR file generation, our custom classifier code offers a viable workaround. By understanding the limitations of the SDK, we were able to create a tool that not only solves the immediate problem but also enhances our overall document processing capabilities.
Last update: Apr 10 2024 06:04 AM