Source Snowflake #
The following steps refer the extracted Arcion self-hosted CLI download as the $REPLICANT_HOME
directory.
Prerequisites for CDC-based replication #
Make sure that you possess the following object privileges for CDC-based replication:
Object | Privilege |
---|---|
DATABASE |
USAGE |
SCHEMA |
USAGE , CREATE |
TABLE |
SELECT , CREATE STREAM , CREATE TABLE |
Limitations #
- Streams may become stale over time. For more information, see Data Retention Period and Staleness .
- Snowflake can extract data on a per-table basis. Therefore, you don’t need to create heartbeat table manually.
Warning: If a stream goes stale, Replicant drops and recreates the stream. This might cause data loss. So we highly recommend that you take necessary measures so that streams don’t become stale.
I. Set up Connection Configuration #
-
From
$REPLICANT_HOME
, navigate to the sample connection configuration file:vi conf/conn/snowflake.yaml
-
The configuration file has two parts:
- Parameters related to source Snowflake server connection.
- Parameters related to stage configuration.
Parameters related to Source Snowflake server connection #
Note: All communications with Snowflake happens through port 443, the standard port for HTTPS. So all data is encrypted and secure with SSL by default.
For connecting to Source Snowflake server, you can choose between two methods for an authenticated connection:- RSA key pair authentication
- Basic username and password authentication
For connecting to Snowflake via basic username and password authentication, you have two options:
Fetch credentials from AWS Secrets Manager #
You can choose to store your username and password in AWS Secrets Manager, and tell Replicant to retrieve them. For more information, see Retrieve credentials from AWS Secrets Manager.
Specify credentials in plain form #
You can also specify your credentials in plain form in the connection configuration file like the sample below:
type: SNOWFLAKE host: SNOWFLAKE_HOSTNAME port: PORT_NUMBER warehouse: "WAREHOUSE_NAME" username: "USERNAME" password: "PASSWORD" max-connections: 20 #Specify the maximum number of connections replicant can open in Snowflake max-retries: 10 retry-wait-duration-ms: 1000
Replace the following:
SNOWFLAKE_HOSTNAME
: The Snowflake hostname. To find your Snowflake hostname, follow these steps:- Go to the Snowflake web interface and sign in into your account.
- Click the Account selector toolbar in the bottom of the left navigation menu.
- Hover the mouse over your account and click Copy account URL in the items list.
PORT_NUMBER
: The port number of Snowflake host.WAREHOUSE_NAME
: The name of the Snowflake warehouse.USERNAME
: The username to connect to the Snowflake server.PASSWORD
: The password associated withUSERNAME
.
Additional parameters #
-
credential-store
: Replicant supports consumingusername
andpassword
configurations from a credentials store rather than having users specify them in plain text config file. You can use keystores to store your credentials related to your Snowflake server connections.The following parameters are available:type
: Type of the keystore. Allowed types arePKCS12
,JKS
, andJCEKS
.path
: Location of the key-store.key-prefix
: You should create entries in the credential store for your configs using a prefix and specify the prefix here. For example, you can create keystore entries with aliasessnowflake1_username
andsnowflake1_password
. You can then specify the prefix here assnowflake1_
.password
: This field is optional. If you don’t specify the keystore password here, then you must use the UUID from your license file as the keystore password. Remember to keep your license file somewhere safe in order to keep the password secure.
Parameters related to stage configuration #
-
stage
: By default, Replicant uses Snowflake’s native stage for bulk loading. But it’s also possible to use an external stage like Azure. This section allows you to specify the details Replicant needs to connect to and use a specific stage. -
type
[v21.06.14.1]: The stage type. Allowed stages areNATIVE
,S3
, andAZURE
. -
root-dir
: Specify a directory on stage which can be used to stage bulk-load files. -conn-url
[v21.06.14.1]: URL for the stage. For example, if stage isS3
, specify bucket name; forAZURE
, specify container name. -
key-id
: This config is valid forS3
stage type only. Access Key ID for AWS account hosting s3. -
account-name
[v21.06.14.1] : This config is valid forAZURE
type only. Name of the ADLS storage account. -secret-key
[v21.06.14.1]: This config is valid for bothS3
andAZURE
types. For example, Secret Access Key for AWS account hosting s3 or ADLS account. -
token
[v21.06.14.1]: This config is valid forAZURE
type only. Indicates the SAS token for Azure storage.
Use RSA key pair for authentication #
You can also choose to use Snowflake’s key pair authentication support for enhanced authentication security instead of using basic authentication via username and password.
To set up key pair authentication using RSA keys, follow the steps below:
Generate the private key #
From your command line, execute the following command to generate an encrypted private key:
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -v1 PBE-SHA1-RC4-128 -out rsa_key.p8
The command generates a private key in PEM format:
-----BEGIN ENCRYPTED PRIVATE KEY----- MIIFHDBOBgkqhkiG9w0BBQ0wQTApBgkqhkiG9w0BBQwwHAQIK0h8dqI1n5sCAggA MAwGCCqGSIb3DQIJBQAwFAYIKoZIhvcNAwcECNDwqMf6Xx1pBIIEyNmf044S+pEQ ... -----END ENCRYPTED PRIVATE KEY-----
Important: The command above to generate an encrypted key prompts for a passphrase to grant access to the key. We recommend using a passphrase that complies with PCI DSS standards to protect the generated private key. Additionally, we recommend storing the passphrase in a secure location. When using an encrypted key to connect to Snowflake, you will need to input the passphrase during the initial connection. The use of the passphrase is only for protecting the private key; it’s never to sent to Snowflake servers.
To generate a long and complex passphrase based on PCI DSS standards, follow the steps below:
- Go to the PCI Security Standards Document Library.
- For PCI DSS, select the most recent version and your desired language.
- Complete the form to access the document.
- Search for
Passwords/passphrases must meet the following:
and follow the recommendations for password/passphrase requirements, testing, and guidance.
Generate a public key #
From the command line, generate the public key by referencing the private key. The following command references the private key contained in a file named
rsa_key.p8
created in the previous step:openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub
The command generates a public key in PEM format:
-----BEGIN PUBLIC KEY----- MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAslwTa+Lj5SMI58GiDzWq 0rwj4FBymfKzHT16RXecnMcx7uI6KsVpqfh9HH0FMb/3C6YEMeGPkaRmKvXYjM5s ... -----END PUBLIC KEY-----
Store the Private and Public Keys Securely #
After following the above steps, you should find the private and public key files saved in a local directory of your system. Note down the path to those files. The private key is stored using the PKCS#8 (Public Key Cryptography Standards) format and is encrypted using the passphrase you specified in the first step.
However, maintain caution in protecting the file from unauthorized access using the file permission mechanism provided by your operating system. It’s your responsibility to secure the file when it’s not being used.
Assign the public key to a Snowflake user #
Execute the following command to assign the public key to a Snowflake user.
alter user jsmith set rsa_public_key='MIIBIjANBgkqh...';
- Only security administrators (i.e. users with the SECURITYADMIN role) or higher can alter a user.
- Exclude the public key delimiters in the SQL statement.
Verify the user’s public key fingerprint #
Execute the following command to verify the user’s public key:
DESC USER jsmith;
The command output is similar to the following:
+---------------------+-----------------------------------------------------+---------+----------------------------------------------+ | property | value | default | description | +---------------------+-----------------------------------------------------+---------+----------------------------------------------+ | NAME | JSMITH | null | Name | ... ... | RSA_PUBLIC_KEY | MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAslwT... | null | RSA public key of the user | | RSA_PUBLIC_KEY_FP | SHA256:nvnONUsfiuycCLMXIEWG4eTp4FjhVUZQUQbNpbSHXiA= | null | Fingerprint of user's RSA public key. | | RSA_PUBLIC_KEY_2 | null | null | Second RSA public key of the user | | RSA_PUBLIC_KEY_2_FP | null | null | Fingerprint of user's second RSA public key. | ... +---------------------+-----------------------------------------------------+---------+----------------------------------------------+
Edit the connection configuration file #
You need to modify Replicant’s connection configuration file for Snowflake and include RSA key information there. Specifically, add the following parameters to the connection configuration file:
private-key-path: "/PATH_TO_GENERATED_KEY/rsa_key.p8" private-key-passphrase: "PRIVATE_KEY_PASSPHRASE"
Replace the following:
PATH_TO_GENERATED_KEY
: the local directory path to thersa_key.p8
keyfilePRIVATE_KEY_PASSPHRASE
: the private key passphrase you specified in the first step
If you specify the
private-key-path
andprivate-key-passphrase
parameters, you don’t need to specify thepassword
parameter in the connection configuration file.
II. Set up Extractor Configuration #
To configure replication mode according to your requirements, specify your configuration in the Extractor configuration file. You can find a sample Extractor configuration file snowflake.yaml
in the $REPLICANT_HOME/conf/src
directory. For example:
Sample snapshot
mode configuration
#
snapshot:
threads: 32
fetch-size-rows: 100000
min-job-size-rows: 1000000
max-jobs-per-chunk: 32
_traceDBTasks : true
per-table-config:
- catalog: DEMO_DB
schema: tpch
tables:
CUSTOMER:
num-jobs: 32
split-key: C_CUSTKEY
row-identifier-key: [ C_CUSTKEY ]
ORDERS:
num-jobs: 32
split-key: O_ORDERKEY
row-identifier-key: [ O_ORDERKEY ]
split-hints:
row-count-estimate: 15000
For more information about the configuration parameters for snapshot
mode, see Snapshot mode.
Sample realtime
mode configuration
#
realtime:
threads: 32
fetch-size-rows: 100000
_traceDBTasks: true
fetch-interval-s: 0
For more information about the configuration parameters for realtime
mode, see Realtime mode.