Databricks Delta Lake

Source Databricks Delta Lake #

The extracted replicant-cli directory will be referred to as $REPLICANT_HOME in the following steps.

I. Obtain the JDBC Driver for Databricks #

Replicant requires the Databricks JDBC Driver as a dependency. To obtain the appropriate driver, follow the steps below:

  • Go to the Databricks JDBC Driver download page and download the driver.
  • From the downloaded ZIP, locate and extract the DatabricksJDBC42.jar file.
  • Put the DatabricksJDBC42.jar file inside the $REPLICANT_HOME/lib directory.

II. Set up Connection Configuration #

  1. From $REPLICANT_HOME, navigate to the sample connection configuration file:

    vi conf/conn/databricks.yaml
    
  2. You can store your connection credentials in a secrets management service and tell Replicant to retrieve the credentials. For more information, see Secrets management.

    Otherwise, you can specify your credentials, such as usernames and passwords, in plain form as in the sample below:

    type: DATABRICKS_DELTALAKE
    
    host: "HOSTNAME"
    port: "PORT_NUMBER"
    
    url: "jdbc:databricks://HOST:PORT/DATABASE_NAME;transportMode=http;ssl=1;httpPath=<http-path>;AuthMech=3" # You can copy this URL from the Databricks cluster info page
    
    username: "USERNAME"
    
    password: "PASSWORD"
    
    max-connections: 30
    
    max-retries: 100
    retry-wait-duration-ms: 1000
    

    Replace the following:

    • HOSTNAME: the hostname of your Databricks host
    • PORT_NUMBER: the port number of the Databricks cluster
    • USERNAME: a valid username that connects to your Databricks server. If you’re using personal access tokens for authentication, set this parameter to token.
    • PASSWORD: the password associated with USERNAME. If you’re using personal access tokens for authentication, set this parameter to the value of your token, for example, dapi1234567890ab1cde1f3ab456c7d89efa.
    Important: For Databricks Unity Catalog, set the connection type to DATABRICKS_LAKEHOUSE. For more information, see Databricks Unity Catalog Support.
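
    As a concrete sketch, a connection file using personal access token authentication might look like the following. All values here are illustrative placeholders, not real credentials:

    ```yaml
    type: DATABRICKS_DELTALAKE

    host: "dbc-a1b2c3d4-e5f6.cloud.databricks.com"   # hypothetical workspace hostname
    port: "443"

    url: "jdbc:databricks://dbc-a1b2c3d4-e5f6.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=<http-path>;AuthMech=3"

    username: "token"                                  # literal value token for PAT authentication
    password: "dapi1234567890ab1cde1f3ab456c7d89efa"   # placeholder token value

    max-connections: 30

    max-retries: 100
    retry-wait-duration-ms: 1000
    ```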

III. Set up Extractor Configuration #

The Extractor configuration file has two parts:

  • Parameters related to snapshot mode.
  • Parameters related to realtime mode.

For snapshot mode, make the necessary changes as follows:

snapshot:
  threads: 16
  fetch-size-rows: 5_000 # Maximum number of records/documents Replicant fetches at once from the source system
  min-job-size-rows: 1_000_000 # Replicant chunks tables/collections into multiple jobs for replication. This sets a minimum size for each job and correlates positively with Replicant's memory footprint.
  max-jobs-per-chunk: 32 # Maximum number of jobs created per source table/collection
  _traceDBTasks: true

  per-table-config:
  - catalog: io_blitzz
    tables:
      orders:
        num-jobs: 10 # Number of parallel jobs used to extract rows from the table. Overrides the job count Replicant calculates internally.
        split-key: ORDERKEY # Column Replicant uses to split the table being replicated into multiple jobs for parallel extraction
      lineitem:
        split-key: orderkey
Important: For Unity Catalog, specify both catalog and schema in per-table-config.
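
For example, under Unity Catalog the same per-table-config might look like this. The catalog and schema names (and the assumption that the key is named schema) are illustrative:

```yaml
  per-table-config:
  - catalog: io_blitzz    # Unity Catalog: three-level namespace
    schema: tpch          # schema must be specified alongside catalog
    tables:
      orders:
        num-jobs: 10
        split-key: ORDERKEY
      lineitem:
        split-key: orderkey
```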

If you want to operate in realtime mode, you can use the realtime section to specify your configuration. For example:

realtime:
  threads: 16
  fetch-size-rows: 5_000
  _traceDBTasks: true

For a detailed explanation of configuration parameters in the Extractor file, see Extractor Reference.

IV. Set up Filter configuration (Optional) #

  1. From $REPLICANT_HOME, navigate to the sample Filter configuration file:

    vi filter/databricks.yaml
    
  2. The sample contains the following:

    allow:
    - catalog: "tpch"
      types: [TABLE]
      allow:
        nation:
        region:
    
    Important: For Unity Catalog, specify both catalog and schema under the list allow.
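
    For instance, a Unity Catalog Filter sketch might look like the following, where the catalog and schema names are illustrative:

    ```yaml
    allow:
    - catalog: "tpch_catalog"   # hypothetical Unity Catalog catalog name
      schema: "tpch"            # Unity Catalog requires the schema as well
      types: [TABLE]
      allow:
        nation:
        region:
    ```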

For a detailed explanation of configuration parameters in the Filter file, see Filter Reference.

Databricks Unity Catalog Support (Beta) #

Note: This feature is currently in beta.

From version 22.08.31.3 onwards, Arcion supports Databricks Unity Catalog. The support is still in beta, with complete support landing gradually in future releases.

As of now, note the following about the state of Arcion’s Unity Catalog support:

  • Legacy Databricks only supports a two-level namespace:

    • Schemas
    • Tables

    With the introduction of Unity Catalog, Databricks now exposes a three-level namespace that organizes data:

    • Catalogs
    • Schemas
    • Tables

    Arcion adds support for Unity Catalog by introducing a new child storage type (DATABRICKS_LAKEHOUSE, a child of DATABRICKS_DELTALAKE).

  • If you’re using Unity Catalog, note the following when configuring your Source Databricks with Arcion:

  • Replicant uses the SparkJDBC42 driver for legacy Databricks (DATABRICKS_DELTALAKE) and the DatabricksJDBC42 driver for Unity Catalog (DATABRICKS_LAKEHOUSE). For instructions on how to obtain these drivers, see Obtain the JDBC Driver for Databricks.

  • Replicant currently supports Unity Catalog on AWS and Azure.
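
Putting the notes above together, a Unity Catalog connection sketch differs from the legacy sample mainly in its type. The values below are placeholders:

```yaml
type: DATABRICKS_LAKEHOUSE   # child of DATABRICKS_DELTALAKE, used for Unity Catalog

host: "HOSTNAME"
port: "PORT_NUMBER"

url: "jdbc:databricks://HOST:PORT/DATABASE_NAME;transportMode=http;ssl=1;httpPath=<http-path>;AuthMech=3"

username: "USERNAME"
password: "PASSWORD"
```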