Source Column Transformation #
From version 22.07.19.3 onwards, Replicant allows you to transform the data of source tables before it reaches the desired target. This is achieved via a configuration file where you can specify the transformation logic for each individual table. As Replicant processes data from source tables, it applies the transformation rules to the data, and then loads the transformed data into the destination tables. The column on the destination could either be a new column, or a source column with transformed values.
Source Column Transformation solves several business cases for data migration, while making it possible to build new features like data encryption and obfuscation.
Currently, Source Column Transformation is only supported for the following pipelines:
- MySQL to Databricks
- Informix to PostgreSQL
- Informix to Kafka
Transformation Configuration File #
The Transformation Conifiguration File has all the transformation logic specified on a per-table basis. You can access a sample config file called column_transformation.yaml
from the conf/transformation/
directory of your Replicant download folder replicant-cli
. The following sample describes Transformation rules for two tables nation
and lineitem
:
type: COLUMN
enable: true
per-table-config:
- catalog: tpch
schema: public
tables:
nation:
operations:
- modulo:
enable: true
source-column: A
mod-by-column: D
mod-by-value: 5
computed-column: F
computed-column-datatype: "INTEGER"
computed-column-key-type: SHARDKEY
- concat:
enable: true
source-column: P
concat-by-columns: ["_", "column:Q"]
computed-column: S
computed-column-datatype: ""
- trim:
enable: true
source-column: X
computed-column: Y
computed-column-datatype: ""
lineitem:
operations:
- modulo:
enable: true
source-column: l_orderkey
mod-by-column: l_partkey
computed-column: l_orderkey1
computed-column-datatype: "DECIMAL(15, 2)"
- modulo:
enable: true
source-column: l_linenumber
mod-by-value: 10
computed-column: l_linenumber
computed-column-datatype: ""
- concat:
enable: true
source-column: l_comment
concat-by-columns: [ " ", "column:l_shipmode", " ", "column:l_linestatus", " " ]
computed-column: concat_test
computed-column-datatype: "STRING(100)"
- trim:
enable: true
source-column: l_shipinstruct
computed-column: trim_test
Below are more details about the parameters:
type
: The type of transformation. In this case the only allowed value isCOLUMN
.enable
:true
orfalse
Whether to enable transformation or not.per-table-config
: Specify transformation logic on a per-table basis. This allows you to specify different transformation logic for each table. For example, in the sample config above, we have different rules fornation
andlineitem
tables.catalog
: The database catalog. Disable this parameter if not supported by source.schema
: The database schema. Disable this parameter if not supported by source.tables
: The specific tables to apply Transformation to.nation
:operations
:modulo
:computed-column
is modulo ofsource-column
by eithermod-by-column
ormod-by-value
.enable
:true
orfalse
. Enables/disables this modulo operation.source-column
: Source column which you want to Transform. You must specify numeric value column only.mod-by-column
: The numeric value column used to calculate the modulo of source column:F = A % D
. Don’t specifymod-by-column
ifmod-by-value
is used. Data type should be same assource-column
.mod-by-value
: The numeric value to calculate modulo of source column:F = A % 5
. Don’t specifymod-by-value
ifmod-by-column
is used.computed-column
: The column as result of the modulo operation. This can be the new column which will be created on Target, or same as the Source column, or any other column in same table.computed-column-datatype
: Specifies datatype for computed column. Ifnull
, the datatype ofsource-column
is used. If you specifycomputed-column-datatype
, then Replicant will try to convert the operation result into that datatype. The conversion might fail due to compatibility issue.computed-column-key-type
: Supported value isSHARDKEY
. With this parameter specified, it uses thecomputed-column
as shard key for the Destination table. As in this example, thecomputed-column
valueF
will be used as theSHARDKEY
column on Destination database.
concat
: Thecomputed-column
is a concatenation ofsource-column
andconcat-by-columns
.enable
:true
orfalse
. Enables/disables this concat operation.source-column
: The source column name.concat-by-columns
: List of string values and column names which we want to concat withsource-column
. Concatenation is done in the same order as you specify in the list. See below to understand how it works:"_"
: Concats this string value to source column."column: Q"
: Specifies concat value of columnQ
to Source column."column:"
: The column name identifier.
computed-column
: Computed column is the result of concatenation. For example, in the sample config above, it is:S = P + "_" + Q
.computed-column-datatype
: Specifies datatype for computed column. Ifnull
, the datatype ofsource-column
is used.
trim
:computed-column
is trim ofsource-column
. Removes space from front and back of string.enable
:true
orfalse
. Enables/disables this concat operation.source-column
: Please specify source column name.computed-column
: Computed column is result of trim. For example, in the sample config above, ifP = "name "
, thenS = "name"
.computed-column-datatype
: Specifies datatype for computed column. Ifnull
, the datatype of source-column is used.
- Computed column can be the same as source column, new column, or any other column in same table.
- If
computed-column-datatype
is not specified andcomputed-column
does not exist in Source table, then Replicant will use the datatype ofsource-column
.- If
computed-column-datatype
is specified, then Replicant will try its best to covert operation result into that datatype. The conversion might fail due to compatibility issue.- If
computed-column
exists in Source table, then Replicant will use the datatype ofsource-column
.We recommend that you do not use
computed-column-datatype
. In that case Replicant can fall back to using the datatype ofsource-column
.
Limitations #
- Column can be a part of
source-column
,computed-column
in only one operation. For example, if columnF
is part of modulo operation as asource-column
orcomputed-column
, then it cannot be used in any other operation. - Certain operations support specific data types only. For example, the
modulo
operation can be performed on numeric types only. - Only supported for
snapshot
,realtime
andfull
mode for now.
Running Replicant #
You can provide the location of the Column Transformation file to Replicant via the --transform
argument. For example, the following command runs Replicant in snapshot
mode with the Transformation configuration file specified:
./bin/replicant snapshot \
conf/conn/source_database_name_src.yaml \
conf/conn/target_database_name_dst.yaml \
--extractor conf/src/source_database_name.yaml \
--applier conf/dst/target_database_name.yaml \
--filter filter/source_database_name_filter.yaml \
--transform conf/transformation/column_transform.yaml \
--id repl2 --replace –overwrite