Source Databricks Delta Lake #
The extracted replicant-cli directory will be referred to as `$REPLICANT_HOME` in the following steps.
I. Obtain the JDBC Driver for Databricks #
Replicant requires the Databricks JDBC Driver as a dependency. To obtain the appropriate driver, follow the steps below.

For legacy Databricks (connection type `DATABRICKS_DELTALAKE`):

- Download the JDBC 4.2-compatible Databricks JDBC Driver ZIP.
- From the downloaded ZIP, locate and extract the `SparkJDBC42.jar` file.
- Put the `SparkJDBC42.jar` file inside the `$REPLICANT_HOME/lib` directory.

For Databricks Unity Catalog (connection type `DATABRICKS_LAKEHOUSE`):

- Go to the Databricks JDBC Driver download page and download the driver.
- From the downloaded ZIP, locate and extract the `DatabricksJDBC42.jar` file.
- Put the `DatabricksJDBC42.jar` file inside the `$REPLICANT_HOME/lib` directory.
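The extract-and-place steps above can be sketched programmatically. To stay runnable anywhere, the snippet below fabricates a dummy driver ZIP; in practice you would point it at the ZIP you actually downloaded (all file and directory names here are illustrative):

```python
import pathlib
import tempfile
import zipfile

work = pathlib.Path(tempfile.mkdtemp())
replicant_home = work / "replicant-cli"        # stands in for $REPLICANT_HOME
(replicant_home / "lib").mkdir(parents=True)

# Simulate the downloaded driver ZIP (in practice: the real DatabricksJDBC42 ZIP).
driver_zip = work / "DatabricksJDBC42.zip"
with zipfile.ZipFile(driver_zip, "w") as zf:
    zf.writestr("DatabricksJDBC42.jar", b"")   # dummy jar for illustration

# Extract the jar and place it inside $REPLICANT_HOME/lib.
with zipfile.ZipFile(driver_zip) as zf:
    for name in zf.namelist():
        if name.endswith("DatabricksJDBC42.jar"):
            zf.extract(name, replicant_home / "lib")

print(sorted(p.name for p in (replicant_home / "lib").iterdir()))
```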
II. Set up Connection Configuration #
1. From `$REPLICANT_HOME`, navigate to the sample connection configuration file:

   ```
   vi conf/conn/databricks.yaml
   ```

2. You can store your connection credentials in a secrets management service and tell Replicant to retrieve the credentials. For more information, see Secrets management.

   Otherwise, you can put your credentials, such as usernames and passwords, in plain form as in the sample below:
```yaml
type: DATABRICKS_DELTALAKE

host: "HOSTNAME"
port: "PORT_NUMBER"
url: "jdbc:databricks://HOST:PORT/DATABASE_NAME;transportMode=http;ssl=1;httpPath=<http-path>;AuthMech=3" # This URL can be copied from the Databricks cluster info page

username: "USERNAME"
password: "PASSWORD"

max-connections: 30
max-retries: 100
retry-wait-duration-ms: 1000
```

Replace the following:

- `HOSTNAME`: the hostname of your Databricks host.
- `PORT_NUMBER`: the port number of the Databricks cluster.
- `USERNAME`: a valid username that connects to your Databricks server. If you're using personal access tokens for authentication, set this parameter to `token`.
- `PASSWORD`: the password associated with `USERNAME`. If you're using personal access tokens for authentication, set this parameter to the value of your token, for example, `fapi1234567890ab1cde1f3ab456c7d89efa`.
Important: For Databricks Unity Catalog, set the connection `type` to `DATABRICKS_LAKEHOUSE`. For more information, see Databricks Unity Catalog Support.
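As a sanity check on the `url` value, note that it is assembled from the host, port, database name, and HTTP path. The helper below is purely illustrative: it is not part of Replicant, and the host and HTTP path shown are made up.

```python
# Illustrative helper (not part of Replicant): build a Databricks JDBC URL
# with the same shape as the sample configuration above.
def databricks_jdbc_url(host: str, port: int, database: str, http_path: str) -> str:
    return (
        f"jdbc:databricks://{host}:{port}/{database}"
        f";transportMode=http;ssl=1;httpPath={http_path};AuthMech=3"
    )

# Hypothetical values; copy the real ones from your cluster's info page.
url = databricks_jdbc_url("dbc-example.cloud.databricks.com", 443,
                          "default", "sql/protocolv1/o/0/0123-456789-abc")
print(url)
```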
III. Set up Extractor Configuration #
The Extractor configuration file has two parts:
- Parameters related to snapshot mode.
- Parameters related to realtime mode.
Parameters related to snapshot mode #
For snapshot mode, make the necessary changes as follows:
```yaml
snapshot:
  threads: 16
  fetch-size-rows: 5_000 # Maximum number of records/documents fetched by Replicant at once from the source system
  min-job-size-rows: 1_000_000 # Tables/collections are chunked into multiple jobs for replication. This sets a minimum size for each such job and correlates positively with Replicant's memory footprint
  max-jobs-per-chunk: 32 # Maximum number of jobs created per source table/collection
  _traceDBTasks: true

  per-table-config:
  - catalog: io_blitzz
    tables:
      orders:
        num-jobs: 10 # Number of parallel jobs used to extract rows from this table. Overrides the number of jobs Replicant calculates internally
        split-key: ORDERKEY # Column Replicant uses to split the table into multiple jobs for parallel extraction
      lineitem:
        split-key: orderkey
```
Important: For Unity Catalog, specify both `catalog` and `schema` in `per-table-config`.
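To build intuition for `num-jobs` and `split-key`: a table is divided into contiguous ranges of the split-key column, one range per job, which are then extracted in parallel. The sketch below is only a mental model of that chunking, not Replicant's actual algorithm:

```python
# Mental model only (not Replicant's actual implementation): split a numeric
# split-key range into num_jobs contiguous ranges for parallel extraction.
def split_jobs(min_key: int, max_key: int, num_jobs: int):
    total = max_key - min_key + 1
    size = -(-total // num_jobs)  # ceiling division: rows per job
    jobs, lo = [], min_key
    while lo <= max_key:
        hi = min(lo + size - 1, max_key)
        jobs.append((lo, hi))
        lo = hi + 1
    return jobs

# e.g. ORDERKEY values 1..1000 with num-jobs: 4
print(split_jobs(1, 1000, 4))  # [(1, 250), (251, 500), (501, 750), (751, 1000)]
```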
Parameters related to realtime mode #
If you want to operate in realtime mode, you can use the realtime section to specify your configuration. For example:
```yaml
realtime:
  threads: 16
  fetch-size-rows: 5_000
  _traceDBTasks: true
```
For a detailed explanation of configuration parameters in the Extractor file, see Extractor Reference.
IV. Set up Filter configuration (Optional) #
1. From `$REPLICANT_HOME`, navigate to the sample Filter configuration file:

   ```
   vi filter/databricks.yaml
   ```

2. The sample contains the following:

   ```yaml
   allow:
   - catalog: "tpch"
     types: [TABLE]
     allow:
       nation:
       region:
   ```

   Important: For Unity Catalog, specify both `catalog` and `schema` under the list `allow`.
For a detailed explanation of configuration parameters in the Filter file, see Filter Reference.
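Conceptually, an `allow` rule matches on catalog, object type, and the listed table names. The sketch below mimics that matching in plain Python; it illustrates the semantics of the sample above and is not Replicant's actual filter engine:

```python
# Hypothetical sketch of how an allow-list like the sample filters tables.
# Mirrors the sample Filter configuration above; illustrative only.
allow = [{"catalog": "tpch", "types": ["TABLE"],
          "allow": {"nation": {}, "region": {}}}]

def is_allowed(catalog: str, obj_type: str, table: str) -> bool:
    for rule in allow:
        if (rule["catalog"] == catalog
                and obj_type in rule["types"]
                and table in rule["allow"]):
            return True
    return False

print(is_allowed("tpch", "TABLE", "nation"))    # True
print(is_allowed("tpch", "TABLE", "lineitem"))  # False
```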
Databricks Unity Catalog Support (Beta) #
Note: This feature is currently in beta.
Arcion supports Databricks Unity Catalog from version 22.08.31.3 onwards. This support is still in beta, with complete support landing gradually in future releases.
As of now, note the following about the state of Arcion’s Unity Catalog support:
- Legacy Databricks only supports a two-level namespace:

  - Schemas
  - Tables

  With the introduction of Unity Catalog, Databricks now exposes a three-level namespace that organizes data:

  - Catalogs
  - Schemas
  - Tables

  Arcion adds support for Unity Catalog by introducing a new child storage type (`DATABRICKS_LAKEHOUSE`, a child of `DATABRICKS_DELTALAKE`).

- If you're using Unity Catalog, note the following when configuring your Source Databricks with Arcion:

  - Set the connection `type` to `DATABRICKS_LAKEHOUSE` in the connection configuration file.
  - Specify both `catalog` and `schema` as part of `per-table-config` in the Extractor configuration file.
  - If you want to configure Filter on your Source Databricks, specify both `catalog` and `schema` under the list `allow` in the Filter configuration file.

- We use the `SparkJDBC42` driver for legacy Databricks (`DATABRICKS_DELTALAKE`) and `DatabricksJDBC42` for Unity Catalog (`DATABRICKS_LAKEHOUSE`). For instructions on how to obtain these drivers, see Obtain the JDBC Driver for Databricks.

- Replicant currently supports Unity Catalog on AWS and Azure.
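The practical difference between the two models is an extra catalog level in fully qualified object names. A small illustration (the catalog name `main` is hypothetical):

```python
# Illustrative: fully qualified names under legacy (two-level) Databricks
# versus Unity Catalog (three-level).
legacy_name = ".".join(["tpch", "orders"])         # schema.table
unity_name = ".".join(["main", "tpch", "orders"])  # catalog.schema.table
print(legacy_name)  # tpch.orders
print(unity_name)   # main.tpch.orders
```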