Introduction
This technical article provides example for how to perform parallel data copy from Db2 to Azure SQL Database by generating ADF Copy activities dynamically. Solution provided in this article is performing data copy from Db2 installed on Windows (or Linux/Mainframe) to Azure SQL Database using Azure Data Factory Db2 Connector. Please note this concept applies to data copy from any RDBMS source to any RDBMS target supported by ADF.
Below diagram describes high level architecture of data copy from Db2 hosted in Azure Windows VM to Aure SQL Database using ADF.
High level architecture to Copy Data from Db2 to Azure SQL DB by generating multiple Copy activities dynamically
Components present in above diagram:
Pre-requisite:
1. Linked service to Source Db2 database can be created by following steps mentioned at doc.
2. Linked service to Target Azure SQL database can be created by following steps mentioned at doc.
3. Source Db2 Test Table
Table Name: SANDIPK."Employee" definition
CREATE TABLE "SANDIPK"."Employee" ( "EmployeeId" INTEGER NOT NULL , "LastName" VARCHAR(20) NOT NULL , "FirstName" VARCHAR(20) NOT NULL , "Title" VARCHAR(30) , "ReportsTo" INTEGER , "BirthDate" DATE , "HireDate" DATE , "Address" VARCHAR(70) , "City" VARCHAR(40) , "State" VARCHAR(40) , "Country" VARCHAR(40) , "PostalCode" VARCHAR(10) , "Phone" VARCHAR(24) , "Fax" VARCHAR(24) , "Email" VARCHAR(60), "ID" BIGINT NOT NULL DEFAULT 0 ) ORGANIZE BY ROW; CREATE INDEX "SANDIPK"."ID_INDEX" ON "SANDIPK"."Employee" ("ID" ASC) COLLECT DETAILED STATISTICS; |
Test Environment Configuration:
Configuration Item |
Value |
Schema Name |
SANDIPK |
Db2 Table Name |
Employee |
Number of rows |
29,360128 (29.36 million) |
Size of Employee Table |
5.6 GB |
ADF Integration Run time |
Azure Auto Resolve Integration Runtime for Dynamic scaling |
Target Azure SQL DB |
Business Critical / Storage: 100 GB / cores: 8 vCores |
Driver used to connect to Db2 LUW database |
Microsoft OLEDB Provider for Db2 |
ADF Pipeline to Copy data from Db2 to Azure SQL DB
Below screenshot shows how ADF Pipeline should be setup to perform dynamic data copy activities generation.
Every activity and corresponding setup will be explained in sections below:
1. Define Parameters for the pipeline
These parameters are defined in a such a way that it allows dynamic generation of ADF Copy activities. Even this pipeline can be called from master pipeline or external methods (.NET SDK / Azure PowerShell / REST API / Python SDK) etc. which will enable to run multiple pipelines concurrently. Every parameter is explained below:
Sr.No. | Parameter Name | Description |
a) | lowerBound | Minimum value of the partitionColumn |
b) | upperBound | Maximum value of the partitionColumn |
c) | numPartitions | Number of partitions in which input lowerBound to upperBound data should be divided. Note this is not number of parallel executions. Number of parallel executions will be less than or equal to number of Partitions. In ADF there is max cap limit of 50 parallel executions at a time where numPartitions can be any integer value. |
d) | schemaName | Source Db2 Schema Name |
e) | tableName | Source Db2 Table Name |
f) | partitionColumn | Source Db2 Table Column based on which logical partitions needs to be created |
g) | copyInRangeOnly |
'Y' – Copy data only in lowerBound and upperBound 'N' – Copy all data from source to target |
2. Set Size of One Partition
Based on provided parameters above we need to drive size of one partition as below:
upperBound - lowerBound
DataCopyInOneActivitySize = -------------------------------
numPartitions
Corresponding Dynamic content for the SetDataCopyInOneParitionSize activity variable
DataCopyInOneActivitySize value is shown below:
@string(div(sub(pipeline().parameters.upperBound,pipeline().parameters.lowerBound),pipeline().parameters.numPartitions))
Screenshot for Set Variable:
Dynamic content for the ForEach activity variable DataCopyInOneActivitySize Items is shown below:
@range(0,pipeline().parameters.numPartitions)
Screenshot for Set Variable:
3. Generate ADF Copy activities in parallel After clicking on ForEach Configuration option; we will be able to see most important activity of the pipeline which is Copy data activity as shown below:
A) Source tab
i) Source dataset: It points to source Db2 table. Link Service is created using step mentioned
in Create linked service to Db2 on Windows Server.
ii) Dynamic Query for generating Copy Activity 'SELECT' Query during Runtime:
For each execution of ForEach Activity dynamic query will generate SQL Query for every logical partition.
Example: If parameter number of logical partition (numPartitions) is provided as 150; 150 SELECT queries will be generated; Out of 150 max number of queries that will run in parallel will be defined by Batch Count parameter of ForEach Activity.
This dynamic generation of SELECT query where clause is based on provided pipeline parameter. Source Db2 database table may or may not have physical partitions. Though having physical partition on Source Db2 table will definitely lead to improvement in copy performance but this is not mandatory requirement.
Dynamic query generation for limited number of rows (only 17) is explained below to explain the concept. In actual execution there might be millions of rows to be migrated from Db2 to sink Azure SQL DB.
Case 1 : copyInRangeOnly = 'Y'
When option copyInRangeOnly is set to ‘Y’ dynamic SQL SELECT queries will only be generated for the data in the mentioned lowerBound and upperBound as shown in diagram below. Data outside lowerBound and upperBound range will NOT be copied from Source Db2 to sink SQL DB.
As shown in screenshot above 3 queries (equal to numPartitions) will be generated dynamically to copy data only in range lowerBound (1) and upperBound (10).
Case 2 : copyInRangeOnly = 'N'
When option copyInRangeOnly is set to ‘N’ dynamic SQL SELECT queries will be generated for the data in the mentioned lowerBound and upperBound. Also last query will also copy all data outside the range lowerBound and upperBound which means entire data from source will be copied to sink.
As shown in screenshot above 3 queries (equal to numPartitions) will be generated dynamically to copy entire data from the source table to sink.
B) Sink tab
i) Sink dataset: It points to sink Azure SQL DB table. Link Service is created using step mentioned in Create linked service to Azure SQL DB.
ii) Bulk insert table lock: Use this option to improve copy performance during bulk insert operation on table with no index.
Pipeline Execution:
Let’s see everything in action. We will run pipeline to copy 29,360128 (29.38 million) records of size 5.6 GB from Db2 on Windows to Azure SQL DB Business Critical 8 vCores.
Set pipeline parameters as shown in below screenshot and run pipeline.
Db2 DBMS Windows VM Statistics:
Below screenshot shows 32 threads were running in time in parallel to get data from Db2 on Windows virtual machine. In absence of dynamic SELECT query generation based on partitioning we will be able to see only 1 thread. Please note our pipeline execution BatchCount for ForEach activity was kept 50 which is maximum number of ADF activities that can run in parallel. In this case at the time when screenshot was taken 32 threads were running in parallel.
Sink Azure SQL DB Statistics:
As shown below Azure SQL DB Log I/O percentage is hitting 100 %. Maximum transaction log throughput rate for the configuration that we have used i.e. Business 8 vCores is 96 MB/s. Azure SQL DB Log I/O limits are explained at this Microsoft documentation page: link. This mean that parallel execution of SQL Queries is making sure that we are utilizing all the available capacity viz. getting data from source / transferring data on the network / writing data to sink.
Single Copy Activity statistics:
Single copy activity statistics are shown below. As it can be seen from the screenshot since we have divided input data copy in multiple logical partition Time to get first byte from source is less (5 seconds). Overall time to get 195,734 records from Db2 and write to Azure SQL DB is 32 seconds.
Now comes the most interesting statistics. How much time it took to copy 5.6 GB data with around 29 million records? Any guess?
As shown in below screenshot it took just 1 minute and 47 seconds to perform complete data copy of size 5.6 GB with around 29 million records from Db2 on Windows to Azure SQL DB.
If the same copy activity was run with single copy activity the amount of time it takes is around 24:30 minutes.
So, with this method of dynamic generation of SQL Queries and Parallel execution of the same to extract data from Source Db2 and writing to sink Azure SQL Database we were able to reduce time to copy from 24:30 minutes to 01:47 minutes; which is around 93 % reduction in Data Copy time.
Conclusion
As evident from above test data copy performance can be drastically improved by Dynamic generation of SQL Queries and parallel execution of them to get data from Source Db2 and writing to sink Azure SQL DB.
Below chart summarizes result of test execution with using dynamic copy activity generation approach and usual single copy activity approach:
Important points for Consideration
If you have feedback or suggestions for improving this data migration asset, please contact the Azure Databases SQL Customer Success Engineering Team. Thanks for your support!
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.