Change Data Capture in Database Converters
This whitepaper explores the three most popular methods of Change Data Capture (CDC), a technique used for tracking and storing incremental changes in a source database to replicate them in near real-time to other databases. CDC has emerged as an efficient solution for low-latency database migration and synchronization between relational databases in high-load environments, whether they are on-premises or in the cloud. CDC methods are widely used in database converters focused on reducing system downtime and overall performance of the migration or replication process.
The most straightforward and easy-to-implement approach to data migration and synchronization is called ‘snapshot replication’. Database converters working on this method capture state of the source database as a snapshot. Then the tool replicates it to the destination database transferring all data in a single operation. This approach is not suitable for systems with permanent data flow since any updates of the source database are not allowed during the snapshot replication and this will cause essential downtime of the database.
Bottleneck of the system downtime especially affects migration to the cloud platforms in the circumstances when permanent connection to the target online database cannot be interrupted. Database converters involved into such projects must guarantee that updated data is constantly replicating.
Change Data Capture offers three important advantages over the snapshot replication. First, it minimizes system downtime and disruption of the source database. Second, CDC helps real-time applications to get the most recent data as fast as possible, which is important for some business models. Finally, Change Data Capture essentially decreases volume of the data being transferred over the network by processing incremental changes.
Change Data Capture Techniques
There are three approaches to Change Data Capture implementation having its own strengths and weaknesses:
· Timestamp tracking. This technique is based on adding a service column with the timestamp of the last change to the tables included in CDC. Any row with a timestamp after the last capture time is considered modified. The pros of this approach are its simplicity, but the cons are that it doesn’t track deleted rows and requires additional CPU resources to scan tables for modified data.
· Trigger-based Change Data Capture. This method is commonly used for capturing data changes in large-scale databases. Triggers for insert, update, and delete operations are created for each table involved in CDC. These triggers store data-changing events in a special ‘history’ table within the source database. The pros include reliability, detailed tracking, and an easily accessible journal of all changes. However, it can impact performance due to multiple writes for each data update, and the performance of the triggers should be tested and evaluated for tolerable overhead.
· Transaction Log CDC. Modern DBMS systems like Oracle, MySQL, PostgreSQL, and MS SQL use transaction logs for backup and recovery purposes, which can be leveraged to track and replicate changes to the target database. The benefits of this approach are that it doesn’t affect the source database system, as there are no additional transactions for each data update, and it doesn’t require changes to the table structures or the creation of a service table. However, extracting updated data from the transaction log can be challenging because the log format is not openly accessible. Parsing algorithms may not work with new versions of the database management system if the log format is modified.
Trigger-based vs Transaction Log CDC
To capture and load changes into the target database, the Transaction Log CDC technique can be combined with the Kafka event streaming platform, offering advantages compared to the trigger-based CDC approach. Unlike the trigger-based method that requires modifications at the application level and scanning of the ‘history’ table, Transaction Log CDC utilizes the native transaction logs (redo logs) of the database engine to extract information about data updates.
The redo logs serve as a storage mechanism for all database activities, ensuring database recovery in case of failures. By leveraging the transaction log CDC approach, changes are recorded directly from the redo log file, eliminating the need to create a separate events journal. This leads to improved performance since database converters implementing CDC operations are not directly tied to every transaction in the database.
However, there are challenges associated with the transaction log CDC technique due to limited control over the transaction log. The format of transaction logs is often undocumented by database management systems, making it difficult to develop parsing algorithms. Additionally, database administrators may be cautious about enabling transaction logs on the database server due to potential risks involved.
Those are reasons why most database converters use the trigger-based technique for capturing changed data. This approach guaranties full control over the changes tracking and it is relatively easy to develop and maintain. Finally, trigger-based CDC is not based on the feature that may be disabled in some database configurations or modified in next releases of DBMS.