Setup guide

Setup guide for Google BigQuery target #

The extracted replicant-cli will be referred to as the $REPLICANT_HOME directory in the following steps.

I. Obtain the JDBC driver for Google BigQuery #

Replicant requires the JDBC driver for Google BigQuery as a dependency. To obtain the appropriate driver, follow these steps:

II. Required permissions #

To load data into BigQuery, you need to make sure you have the necessary IAM permissions. These permissions are required to run a load job and load data into BigQuery tables and partitions. For more information about these permissions, see Permissions to load data into BigQuery.

III. Set up connection configuration #

Specify our BigQuery connection details to Replicant with a connection configuration file. You can find a sample connection configuration file bigquery.yaml in the $REPLICANT_HOME/conf/conn directory.

Configure BigQuery server connection #

You can store your connection credentials in a secrets management service and tell Replicant to retrieve the credentials. For more information, see Secrets management.

Otherwise, you can put your credentials like usernames and passwords in plain form like the following sample:

type: BIGQUERY

host: https://www.googleapis.com/bigquery/v2
port: 443
project-id: bigquerytest-268904
auth-type: 0
o-auth-service-acc-email: bigquerytest@bigquerytest-268904.iam.gserviceaccount.com
o-auth-pvt-key-path: <path_to_oauth_private_key>
location: US
timeout: 500

username: "xxx"
password: "xxxx"

max-connections: 20

max-retries: 10
retry-wait-duration-ms: 1000

Arcion supports both HTTP and HTTPS for BigQuery connection.

To use HTTP, follow these steps:

  • Set http as the protocol in the host URL.
  • Set port to 80.

To use HTTPS, follow these steps:

  • Set https as the protocol in the host URL.
  • Set port to 443.

Configure stage #

Arcion supports CSV and Parquet as intermediatary formats to send data to the BigQuery server. To specify the stage format, use the stage field in the connection configuration file:

stage:
  type: NATIVE
  file-format: {CSV|PARQUET}

IV. Set up Applier configuration #

To configure replication according to your requirements, specify your configuration in the Applier configuration file. You can find a sample Applier configuration file bigquery.yaml in the $REPLICANT_HOME/conf/dst directory.

Arcion Replicant supports the following modes of replication for BigQuery target:

  • snapshot
  • realtime
  • full

For more information about different Replicant modes, see Running Replicant.

For a detailed explanation of configuration parameters in the Applier file, see Applier Reference.

Configure snapshot replication #

For snapshot replication, Replicant supports the following two methods:

Sample configuration with load job method #

snapshot:
  threads: 16

  batch-size-rows: 100_000_000
  txn-size-rows: 1_000_000_000
  
  use-write-storage-api: false

  bulk-load:
    enable: true
    type: FILE
    save-file-on-error: true
    serialize: true

  use-quoted-identifiers: false

Sample configuration with Storage Write API #

snapshot:
  threads: 16

  batch-size-rows: 100_000_000
  txn-size-rows: 1_000_000_000
  
  use-write-storage-api: true

  bulk-load:
    enable: true
    type: FILE
    save-file-on-error: true
    serialize: true

For more information about the configuration parameters for snapshot mode, see Snapshot Mode.

Configure realtime replication #

For realtime replication, Replicant supports the following two methods:

Sample configuration with load job method #

realtime:
  threads: 16
    _traceDBTasks: true
    skip-tables-on-failures : false
    replay-strategy: IN_MEMORY_MERGE
    per-table-config:
      - catalog: tpch_scale_0_01
        tables:
          nation:
            replay-strategy: MERGE
            enable-dependency-tracking: false
          region:
            replay-strategy: INSERT_MERGE
            enable-dependency-tracking: true

Sample configuration with Storage Write API #

realtime:
  threads: 16
  _traceDBTasks: true
  skip-tables-on-failures : false

  use-write-storage-api: true

  replay-strategy: IN_MEMORY_MERGE
  per-table-config:
    - catalog: tpch_scale_0_01
      tables:
        nation:
          replay-strategy: MERGE
          enable-dependency-tracking: false
        region:
          replay-strategy: INSERT_MERGE
          enable-dependency-tracking: true

For more information about the configuration parameters for realtime mode, see Realtime Mode.

Replay strategies #

Arcion uses replay strategies to implement CDC changes and apply the changes in realtime to the target. You can specify a replay strategy by setting the replay-strategy parameter as you can see in the preceding samples. Arcion supports the following replay strategies:

  • NONE
  • INSERT_DELETE
  • INSERT_MERGE
  • IN_MEMORY_MERGE