Setup guide for Google BigQuery target #
The extracted replicant-cli will be referred to as the $REPLICANT_HOME directory in the following steps.
I. Obtain the JDBC driver for Google BigQuery #
Replicant requires the JDBC driver for Google BigQuery as a dependency. To obtain the appropriate driver, follow these steps:
- Go to the JDBC drivers for BigQuery page.
- From there, download the latest JDBC 4.2-compatible JDBC driver ZIP.
- From the downloaded ZIP, locate and extract the
GoogleBigQueryJDBC42.jarfile. - Put the
GoogleBigQueryJDBC42.jarfile inside$REPLICANT_HOME/libdirectory.
II. Required permissions #
To load data into BigQuery, you need to make sure you have the necessary IAM permissions. These permissions are required to run a load job and load data into BigQuery tables and partitions. For more information about these permissions, see Permissions to load data into BigQuery.
III. Set up connection configuration #
Specify our BigQuery connection details to Replicant with a connection configuration file. You can find a sample connection configuration file bigquery.yaml in the $REPLICANT_HOME/conf/conn directory.
Configure BigQuery server connection #
You can store your connection credentials in a secrets management service and tell Replicant to retrieve the credentials. For more information, see Secrets management.
Otherwise, you can put your credentials like usernames and passwords in plain form like the following sample:
type: BIGQUERY
host: https://www.googleapis.com/bigquery/v2
port: 443
project-id: bigquerytest-268904
auth-type: 0
o-auth-service-acc-email: bigquerytest@bigquerytest-268904.iam.gserviceaccount.com
o-auth-pvt-key-path: <path_to_oauth_private_key>
location: US
timeout: 500
username: "xxx"
password: "xxxx"
max-connections: 20
max-retries: 10
retry-wait-duration-ms: 1000
Arcion supports both HTTP and HTTPS for BigQuery connection.
To use HTTP, follow these steps:
- Set
httpas the protocol in thehostURL. - Set
portto80.
To use HTTPS, follow these steps:
- Set
httpsas the protocol in thehostURL. - Set
portto443.
Configure stage #
Arcion supports CSV and Parquet as intermediatary formats to send data to the BigQuery server. To specify the stage format, use the stage field in the connection configuration file:
stage:
type: NATIVE
file-format: {CSV|PARQUET}
IV. Set up Applier configuration #
To configure replication according to your requirements, specify your configuration in the Applier configuration file. You can find a sample Applier configuration file bigquery.yaml in the $REPLICANT_HOME/conf/dst directory.
Arcion Replicant supports the following modes of replication for BigQuery target:
snapshotrealtimefull
For more information about different Replicant modes, see Running Replicant.
For a detailed explanation of configuration parameters in the Applier file, see Applier Reference.
Configure snapshot replication
#
For snapshot replication, Replicant supports the following two methods:
- Loading data with conventional load job method (Default method).
- Streaming data using BigQuery Storage Write API.
Sample configuration with load job method #
snapshot:
threads: 16
batch-size-rows: 100_000_000
txn-size-rows: 1_000_000_000
use-write-storage-api: false
bulk-load:
enable: true
type: FILE
save-file-on-error: true
serialize: true
use-quoted-identifiers: false
Sample configuration with Storage Write API #
snapshot:
threads: 16
batch-size-rows: 100_000_000
txn-size-rows: 1_000_000_000
use-write-storage-api: true
bulk-load:
enable: true
type: FILE
save-file-on-error: true
serialize: true
For more information about the configuration parameters for snapshot mode, see Snapshot Mode.
Configure realtime replication
#
For realtime replication, Replicant supports the following two methods:
- Loading data with conventional load job method (Default method).
- Streaming data using BigQuery Storage Write API.
Sample configuration with load job method #
realtime:
threads: 16
_traceDBTasks: true
skip-tables-on-failures : false
replay-strategy: IN_MEMORY_MERGE
per-table-config:
- catalog: tpch_scale_0_01
tables:
nation:
replay-strategy: MERGE
enable-dependency-tracking: false
region:
replay-strategy: INSERT_MERGE
enable-dependency-tracking: true
Sample configuration with Storage Write API #
realtime:
threads: 16
_traceDBTasks: true
skip-tables-on-failures : false
use-write-storage-api: true
replay-strategy: IN_MEMORY_MERGE
per-table-config:
- catalog: tpch_scale_0_01
tables:
nation:
replay-strategy: MERGE
enable-dependency-tracking: false
region:
replay-strategy: INSERT_MERGE
enable-dependency-tracking: true
For more information about the configuration parameters for realtime mode, see Realtime Mode.
Replay strategies #
Arcion uses replay strategies to implement CDC changes and apply the changes in realtime to the target. You can specify a replay strategy by setting the replay-strategy parameter as you can see in the preceding samples. Arcion supports the following replay strategies:
NONEINSERT_DELETEINSERT_MERGEIN_MEMORY_MERGE