Source Cassandra #
The extracted replicant-cli will be referred to as the $REPLICANT_HOME directory.
CDC Prerequisites #
Arcion supports two mechanisms for accessing the Cassandra CDC log files. Which one to use depends on whether the CDC log files are locally accessible to Replicant.
- If Replicant runs on the same node as the source Cassandra server, or the Cassandra CDC log files are accessible to Replicant through an NFS mount, set access-method to LOCAL and provide the locations of cdc-log-dir and cdc-raw-dir in the connection configuration file (explained in step I).

    cdc-log-config:
      access-method: LOCAL
      cdc-log-dir: '/var/lib/cassandra/commitlog'
      cdc-raw-dir: '/var/lib/cassandra/cdc_raw'

- If Replicant runs on a different node than the source Cassandra server and the Cassandra CDC logs are not accessible to Replicant through an NFS mount, set access-method to SFTP, provide the locations of cdc-log-dir and cdc-raw-dir, and set up the SFTP configuration in the connection configuration file (explained in step I).

    cdc-log-config:
      access-method: SFTP
      cdc-log-dir: '/var/lib/cassandra/commitlog'
      cdc-raw-dir: '/var/lib/cassandra/cdc_raw'
      sftp-config:
        username: 'cassandra'
        password: 'cassandra'
        port: 22

Steps to enable CDC Replication
- Enable CDC logging for the desired tables by setting the table's cdc property to true when creating or altering the table:

    CREATE TABLE foo (a int, b text, PRIMARY KEY(a)) WITH cdc=true;

  OR

    ALTER TABLE foo WITH cdc=true;

- When Replicant is configured to use the SFTP access-method, an SFTP server must be running on the machine that hosts the Cassandra node. If no SFTP server is present on that machine, you can install one with the following commands:

    sudo apt-get update
    sudo apt-get install openssh-server
    /etc/init.d/ssh start

I. Set up Connection Configuration #
- From $REPLICANT_HOME, navigate to the connection configuration file:

    vi conf/conn/cassandra.yaml

- You can store your connection credentials in a secrets management service and tell Replicant to retrieve them. For more information, see Secrets management.
  Otherwise, you can specify credentials such as usernames and passwords in plain text, as in the sample below:

    type: CASSANDRA

    cassandra-nodes:
      node1:
        host: 172.17.0.2
        port: 9042
        cdc-log-config:
          access-method: SFTP # access-method can be LOCAL, SFTP
          cdc-log-dir: '/var/lib/cassandra/commitlog' # Enter the path of the directory containing Cassandra commit log
          cdc-raw-dir: '/var/lib/cassandra/cdc_raw' # Enter the path of the directory containing Cassandra CDC log
          #Only specify the following configurations if your access method is SFTP
          sftp-config:
            username: 'cassandra'
            password: 'cassandra'
            port: 22
      #If you are using multiple nodes, specify them in this section using the format above

    csv-load-connection:
      storage-location: /path/to/extracted/csvs
      access-method: LOCAL # access-method can be LOCAL, SFTP
      max-connections: 30
      sftp-config:
        username: 'cassandra' # if access-method is SFTP, provide sftp-username to log on to host using SFTP
        password: 'cassandra' # if access-method is SFTP, provide sftp-password to log on to host using SFTP
        port: 22 # if access-method is SFTP, provide port on which SFTP service is running

    username: 'cassandra'
    password: 'cassandra'

    read-consistency-level: LOCAL_QUORUM #Enter one of the allowed values: ANY, ONE, TWO, THREE, QUORUM, ALL, LOCAL_QUORUM, EACH_QUORUM, SERIAL, LOCAL_SERIAL, LOCAL_ONE
    auth-type: "PlainTextAuthProvider" #Enter one of the allowed values: DsePlainTextAuthProvider, PlainTextAuthProvider
    max-connections: 30 #Maximum number of connections Replicant can open in Cassandra

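
The sample notes that additional nodes follow the same format as node1. A minimal sketch of a two-node cassandra-nodes section is shown below. The key name node2 and the host 172.17.0.3 are illustrative assumptions, not values from the Arcion documentation; substitute the addresses and credentials of your own cluster.

```yaml
cassandra-nodes:
  node1:
    host: 172.17.0.2
    port: 9042
    cdc-log-config:
      access-method: SFTP
      cdc-log-dir: '/var/lib/cassandra/commitlog'
      cdc-raw-dir: '/var/lib/cassandra/cdc_raw'
      sftp-config:
        username: 'cassandra'
        password: 'cassandra'
        port: 22
  # Hypothetical second node; the key name and host are assumptions for illustration
  node2:
    host: 172.17.0.3
    port: 9042
    cdc-log-config:
      access-method: SFTP
      cdc-log-dir: '/var/lib/cassandra/commitlog'
      cdc-raw-dir: '/var/lib/cassandra/cdc_raw'
      sftp-config:
        username: 'cassandra'
        password: 'cassandra'
        port: 22
```

Each node carries its own cdc-log-config because CDC log files are written locally on every Cassandra node.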
II. Set up Filter Configuration #
- From $REPLICANT_HOME, navigate to the filter configuration file:

    vi filter/cassandra_filter.yaml

- Specify the data to replicate according to your replication needs. Use the format shown in the following example:

    allow:
      #In this example, data of object type Table in the schema tpch will be replicated
      schema: "tpch"
      types: [TABLE]

      #From schema tpch, only the lineitem, ORDERS, and usertable tables will be replicated.
      #Note: Unless specified, all tables in the catalog will be replicated
      allow:
        lineitem:
          #Within lineitem, only the item_one and item_two columns will be replicated
          allow: ["item_one", "item_two"]

        ORDERS:
          #Within ORDERS, only the test_one and test_two columns will be replicated as long as they meet the condition "o_orderkey < 5000"
          allow: ["test_one", "test_two"]
          conditions: "o_orderkey < 5000"

        usertable: #All columns in the table usertable will be replicated without any predicates

The following is a template of the format you must follow:

    allow:
      schema: <your_schema_name>
      types: <your_object_type>

      allow:
        your_table_name_1:
          allow: ["your_column_name"]
          conditions: "your_condition"

        your_table_name_2:

        your_table_name_3:
          allow: ["your_column_name"]
          conditions: "your_condition"

For a detailed explanation of the configuration parameters in the filter file, see Filter Reference.
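
As the note in the example filter states, all tables are replicated unless a nested allow list names specific ones. Under that assumption, a minimal sketch of a filter that replicates every table in a single keyspace could look like the following. The schema name my_keyspace is a hypothetical placeholder.

```yaml
allow:
  # Hypothetical keyspace name; replace with the keyspace you want to replicate
  schema: "my_keyspace"
  types: [TABLE]
  # No nested allow list: all tables in the schema are replicated,
  # with all columns and no row conditions
```

Add a nested allow list only when you need to restrict replication to particular tables, columns, or rows.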
III. Set up Extractor Configuration #
For real-time replication, you must create a heartbeat table in the source Cassandra.
- Create a heartbeat table in the catalog you are going to replicate with the following DDL:

    CREATE TABLE "<user_keyspace>"."replicate_io_cdc_heartbeat"("timestamp" BIGINT, PRIMARY KEY("timestamp"));

- Grant INSERT, UPDATE, and DELETE privileges to the user configured for replication.
- From $REPLICANT_HOME, navigate to the extractor configuration file:

    vi conf/src/cassandra.yaml

- If required, make the necessary changes as follows:

    snapshot:
      extraction-method: CSVLOAD #Allowed values are QUERY, CSVLOAD
      native-extract-options:
        control-chars:
          delimiter: ','
          quote: '"'
          escape: "\u0000"
          null-string: "NULL"
          line-end: "\n"

    realtime:
      heartbeat:
        enable: true
        table-name [20.09.14.3]: replicate_io_cdc_heartbeat #Heartbeat table name if changed
        column-name [20.10.07.9]: timestamp #Heartbeat table column name if changed

Limitations #
The following limitations apply when replicating from Cassandra as a source:
- Replication of counter tables is not supported.
- Changes resulting from any of these features are ignored:
  - TTL on collection-type columns
  - Range deletes
  - Static columns
  - Triggers
  - Secondary indices
  - Lightweight transactions
- Unsupported data types:
  - map
  - set
Note: During real-time replication, the dashboard displays the operation (Insert/Update/Delete) count as (number of operations) * (replication factor).
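
As a worked illustration of this note, assume a hypothetical keyspace with a replication factor of 3 in which 100 insert operations are applied during real-time replication. The figures are assumptions chosen for the example only:

```latex
\text{displayed count}
  = (\text{number of operations}) \times (\text{replication factor})
  = 100 \times 3
  = 300
```

The dashboard would then show 300 inserts even though only 100 were issued against the source.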