How to handle Large Objects (LOB) migration from Db2, Oracle to Azure SQL & Datalake - Part 2

Introduction

 

Optimizing Large Object (LOB) data storage for very large databases (VLDBs) is critical for the optimal performance of your workload. This two-part blog covers the different strategies and technical designs that help customers implement these data ingestion and storage mechanisms.

 

This technical blog is part 2 of the “How to handle Large Object (LOBs) migration from Db2, Oracle to Azure SQL & Datalake” blog. In this article we discuss moving large volumes of LOB data to Azure Data Lake Storage (ADLS) and Azure SQL DB (Stage 2 of the blog) using Apache Spark in Azure Synapse Analytics. This approach enables large-volume LOB data migration using highly scalable Azure Synapse Analytics jobs executed on Spark clusters.

 

General Overview

 

The earlier blog outlined how to copy BLOB data from a Db2 z/OS database (the same concept applies to Db2 LUW, Db2 i, Oracle, PostgreSQL, etc.) to Azure SQL in Stage 1, and it is a prerequisite for this article. Once the LOB data is available in Azure SQL, we use Synapse Analytics with the Spark runtime to write it to Azure Data Lake Storage (ADLS).

 

Azure Data Factory is a managed cloud service that's built for complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects. It offers many connectors and activities, and the Synapse Notebook activity is one of them.

Azure Synapse Analytics is an enterprise analytics service that brings together the best of SQL technologies used in enterprise data warehousing, Spark technologies used for big data and pipelines for data integration.

 

In our solution, we use a parameterized ADF pipeline that invokes an Azure Synapse notebook. The ADF pipeline can be replaced with a Synapse pipeline, and the Spark engine is available in both Azure Synapse Analytics and Azure Databricks.

 

As we are dealing with a large volume of data, a lookup table is created to multithread the activity. For reference, we have used a field called “Partid”, which holds the partition key of the Db2 table and was used to run copy activities in parallel from Db2 to Azure SQL DB in Stage 1. We have used the same key to partition this workflow. As our intent is to build an end-to-end pipeline from Db2 to Azure SQL DB and ADLS, we have therefore used the same key in our code as well.

 

High level steps:

  • Create a lookup table with a column to hold the partition key of the source table.
  • Create an Azure Synapse Analytics notebook that will:
    • Read Azure SQL DB for a given partition key
    • Upload LOB data to ADLS as a blob object
    • Populate Azure SQL DB with object metadata
  • Create an ADF pipeline with a lookup to run multiple copy activities.

 

The diagram below describes the high-level architecture of the approach.

 

[Diagram: high-level architecture of the approach]

 

Components present in the above diagram:

  1. Azure SQL DB with Db2 LOB data migrated AS IS (in stage 1).
  2. Azure Data Factory for Lookup and invoking Synapse Notebook. Alternatively, Synapse Pipeline can be used.
  3. Azure Synapse Analytics with Spark Runtime.
  4. ADLS Gen 2 with LOB data as an object.
  5. Azure SQL DB table to hold metadata of blob object in ADLS.

Configurations

 

Create linked service to Azure SQL DB

The Azure Data Factory Azure SQL DB connector can be used to create a linked service to Azure SQL DB. Instructions can be found here.

 

Create Synapse Workspace

Instructions on creating an Azure Synapse workspace can be found here.

 

Create Linked Service to Azure Synapse Analytics

The Azure Data Factory Azure Synapse Analytics (Artifacts) linked service can be used to access Synapse Analytics artifacts such as notebooks. Instructions can be found here.

 

Create Apache Spark Pool

Once the Synapse workspace is created, we need to attach a Spark pool to it. Instructions on creating one are listed here.

 

Lookup Table Information SQL:

 

CREATE TABLE [dbo].[LOBLOOKUP](
    [IDENTITY_ID] [int] IDENTITY(1,1) NOT NULL,   -- surrogate key
    [Partid] [int] NOT NULL,                      -- partition key of the source table
    [SQLtoADLS_STATUS] [varchar](20) NULL,        -- copy status, e.g. TOBECOPIED / COPYCOMPLETE
    [ADLSCONTAINERNAME] [varchar](30) NULL        -- target ADLS container name
)
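The lookup table can be seeded from the staged LOB table created in Stage 1 (dbo.BLOB_TABLE in this example), one row per partition, with the initial status set to TOBECOPIED. The snippet below is only an illustrative sketch, assuming a Synapse notebook with JDBC access to the Azure SQL DB; the server, database, credentials, and container values are placeholders, and an equivalent T-SQL INSERT ... SELECT works just as well.

# Illustrative sketch only: seed dbo.LOBLOOKUP with one row per source partition.
# Assumes a Synapse notebook ("spark" session is pre-defined) with JDBC connectivity
# to the Azure SQL DB that holds the staged LOB table (dbo.BLOB_TABLE); the server,
# database, credentials, and container below are placeholders.
from pyspark.sql import functions as F

jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>"
jdbc_props = {"user": "<user>", "password": "<password>",
              "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

# One row per distinct partition key in the staged LOB table
partitions = (spark.read
              .jdbc(url=jdbc_url,
                    table="(SELECT DISTINCT Partid FROM dbo.BLOB_TABLE) p",
                    properties=jdbc_props)
              .withColumn("SQLtoADLS_STATUS", F.lit("TOBECOPIED"))
              .withColumn("ADLSCONTAINERNAME", F.lit("<target-container>")))

# IDENTITY_ID is omitted so SQL Server generates it on insert
partitions.write.jdbc(url=jdbc_url, table="dbo.LOBLOOKUP",
                      mode="append", properties=jdbc_props)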

 

 
Target Table Information on Azure SQL to hold Metadata:

 

CREATE TABLE [dbo].[BLOB_TABLE_METADATA](
    [PART_ID] [int] NOT NULL,                 -- partition id
    [PK_CUSTID] [int] NOT NULL,               -- key of the source row
    [RECORDTYPE] [char](3) NULL,              -- record type indicator
    [BLOB_PATH_FRONT] [varchar](500) NULL,    -- ADLS path of the front image object
    [BLOB_PATH_BACK] [varchar](500) NULL      -- ADLS path of the back image object
) ON [PRIMARY]
GO

 

 

Synapse Notebook Setup

Below is the high-level logic of the PySpark script:

  • Accepts the source partition id and ADLS container name as input parameters.
  • Reads the Azure SQL table for the given partition id.
  • Writes the LOB data as objects into the ADLS container.
  • Updates the ADLS blob object paths for the front and back images in the SQL metadata table.

This logic is for reference only; based on requirements, it can be amended in PySpark or in other languages supported by Synapse notebooks. A minimal sketch of the logic follows, and the screen print further below shows the outline of the notebook.
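To make the outline concrete, below is a minimal PySpark sketch of this logic; it is not the actual notebook (see the contact note below for the code artifacts). It assumes the staged table dbo.BLOB_TABLE from Stage 1 with hypothetical LOB columns FRONT_IMAGE and BACK_IMAGE, JDBC connectivity to Azure SQL, and the azure-storage-blob package available on the Spark pool; all connection values are placeholders.

# Minimal sketch of the notebook logic; connection values, the staged table name
# (dbo.BLOB_TABLE), and the image columns (FRONT_IMAGE, BACK_IMAGE) are assumptions.
from azure.storage.blob import BlobServiceClient

# Parameters supplied by the ADF Notebook activity (see the pipeline section below)
Partid = "1"
ADLSCONTAINERNAME = "<target-container>"

jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>"
jdbc_props = {"user": "<user>", "password": "<password>",
              "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

# 1. Read the Azure SQL table for the given partition id
src_query = (f"(SELECT PK_CUSTID, RECORDTYPE, FRONT_IMAGE, BACK_IMAGE "
             f"FROM dbo.BLOB_TABLE WHERE Partid = {int(Partid)}) src")
df = spark.read.jdbc(url=jdbc_url, table=src_query, properties=jdbc_props)

# 2. Write each LOB as a blob object in the ADLS container
container = BlobServiceClient(
    account_url="https://<storageaccount>.blob.core.windows.net",
    credential="<account-key-or-sas>",
).get_container_client(ADLSCONTAINERNAME)

metadata = []
for row in df.collect():  # acceptable per partition; batch further for very large partitions
    front_path = f"{Partid}/{row.PK_CUSTID}_front"
    back_path = f"{Partid}/{row.PK_CUSTID}_back"
    # assumes non-null image columns for simplicity
    container.upload_blob(front_path, bytes(row.FRONT_IMAGE), overwrite=True)
    container.upload_blob(back_path, bytes(row.BACK_IMAGE), overwrite=True)
    metadata.append((int(Partid), row.PK_CUSTID, row.RECORDTYPE, front_path, back_path))

# 3. Update the metadata table in Azure SQL with the blob object paths
meta_df = spark.createDataFrame(
    metadata,
    "PART_ID int, PK_CUSTID int, RECORDTYPE string, BLOB_PATH_FRONT string, BLOB_PATH_BACK string")
meta_df.write.jdbc(url=jdbc_url, table="dbo.BLOB_TABLE_METADATA",
                   mode="append", properties=jdbc_props)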

 

[Screenshot: outline of the Synapse notebook]

 

Please connect with our team (datasqlninja@microsoft.com) for the code artifacts.

Azure Data Factory Workflow

In the ADF pipeline, include a Lookup activity that fetches data from the lookup table (dbo.LOBLOOKUP).

Sample data is shown below; this lookup query can be modified per business requirements.

[Screenshot: sample rows from dbo.LOBLOOKUP]

 

We have included all rows (i.e., partitions) in the Lookup query; it can be altered to select only the TOP (n) rows, based on integration runtime (IR) configuration and other considerations.

 

 

[Screenshot: Lookup activity configuration in the ADF pipeline]

 

For each record returned by the Lookup, the ForEach loop is repeated.

In this example, the ForEach loop has two activities:

  • Synapse notebook execution (refer to the Synapse Notebook Setup section above).
  • Lookup table update to set SQLtoADLS_STATUS to COPYCOMPLETE.

[Screenshot: ForEach activity containing the Synapse notebook and script activities]

 

Components present in the above screenshot:

 

  1. Include the “Notebook” activity.
  2. Under “Settings”, select the notebook that is intended to be executed.
  3. The notebook accepts two parameters. The first one is “Partid”; provide the lookup table column that holds the partition id (see the example after this list).
  4. The second parameter is “ADLSCONTAINERNAME”; include this parameter and set its value from the corresponding column in the lookup table.
  5. In the dropdown, select the Spark pool that is configured as part of the Synapse workspace.
  6. Executor Size – allows selecting the Spark executor size based on performance needs.
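For reference, the two base parameter values in the Notebook activity can be supplied as dynamic content from the ForEach item, along the lines of the illustrative mapping below (assuming the lookup table columns shown earlier):

  Partid:            @{item().Partid}
  ADLSCONTAINERNAME: @{item().ADLSCONTAINERNAME}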

 

Script Execution to update Lookup Table

 

[Screenshot: Script activity configuration to update the lookup table]

 

For the Notebook activity, a “Script” activity is attached on success. This Script activity updates the SQLtoADLS_STATUS column from “TOBECOPIED” to “COPYCOMPLETE”.

 

Sample SQL Query:

 

UPDATE dbo.LOBLOOKUP set SQLtoADLS_STATUS = 'COPYCOMPLETE' 
where Partid = '@{item().Partid}' ;

 

 

Another Script activity, with a different status value (for example, 'COPYFAILED'), can be added for the failure path of the notebook execution.

Sample Results

The input SQL table has 502 rows, as shown below.

 

[Screenshot: input SQL table row count (502 rows)]

 

Now execute the ADF pipeline.

 

[Screenshot: ADF pipeline execution]

 

For testing, we are using “Debug” here. Based on business requirements, different kinds of triggers can be used. More information on ADF triggers can be found here.

Once the ADF pipeline is executed, the status can be seen in the “Monitor” section.

Based on the Spark pool configuration, it might take a couple of minutes for the Spark cluster to start.

On successful completion of the ADF pipeline, check the BLOB_TABLE_METADATA and LOBLOOKUP tables in Azure SQL.

 

[Screenshot: BLOB_TABLE_METADATA and lookup table contents after pipeline completion]

 

The LOB data is now available in ADLS.

 

[Screenshot: LOB objects in the ADLS container]

 

Important points for consideration

  • Storing Db2 LOB data as blob objects will require application changes to reference the blob objects in the storage account. This article doesn’t cover those changes.
  • The above example illustrates the important steps for copying LOB data from Azure SQL to ADLS and is not exhaustive. In an actual implementation, based on volume, ADF copy activities may be run in multiple threads, and similarly in Synapse Analytics.
  • In this implementation, Azure Data Factory is used to invoke the Synapse notebook. Synapse pipelines can also be used for this implementation.
  • Monitoring, logging, and error handling would need to be considered.
  • The ADF ForEach batch count that controls how many activities run in parallel has a default of 20 and a maximum of 50.
  • Once the LOB data is moved to the Azure storage account and reconciled, the table dbo.BLOB_TABLE (in this case) can be dropped or archived.
  • The above technical blog provides high-level guidance with no warranties with respect to the information contained herein.

Feedback and suggestions 

If you have feedback or suggestions for improving this data migration asset, please contact the Databases SQL Ninja Engineering Team (datasqlninja@microsoft.com). Thanks for your support!

Note: For additional information about migrating various source databases to Azure, see the Azure Database Migration Guide.
