
Add Databricks SQL Warehouse Support to Golang Migrate #1167

Open · wants to merge 4 commits into master

Conversation

caldempsey

Currently, Databricks does not offer a built-in tool for deterministic schema migrations between Delta Table schemas. While schema evolution tools are available for managing changes in Delta Lake, they do not provide a controlled, additive approach to schema modifications. When transforming unstructured data into highly structured data within Delta Lake, precise schema management calls for a more controlled migration strategy.

This PR introduces support for Databricks SQL Warehouse. This enhancement allows for precise and controlled schema management through Unity Catalog, facilitating seamless integration with both internal and external tables, such as Delta Lake or Iceberg tables. If you plan to use this, please review the Known Issues section, as there are some quirks in the implementation that need to be addressed.

Implementation Details:

  • SQL Warehouse Support: Enables schema migrations and version management within Databricks environments using SQL Warehouse.
  • Databricks CLI Integration: The implementation currently uses the Databricks CLI agent for connectivity. Future releases may extend support to ODBC or JDBC connections, broadening connectivity options and facilitating integration with Apache Hive through JDBC or ODBC drivers.
  • Migrations Management: Handles migration operations, including running migrations from input streams, setting and retrieving version information, and ensuring the migrations table exists (see the interface sketch after this list).
  • Table Management: Provides functionality for dropping tables and creating the migrations table if it does not already exist.
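
For context, these are the methods a golang-migrate database driver must implement (the database.Driver interface from github.com/golang-migrate/migrate/v4/database); the SQL Warehouse driver in this PR supplies each of them:

// database.Driver, as defined in github.com/golang-migrate/migrate/v4/database.
// Any new backend, including this SQL Warehouse driver, implements this set.
type Driver interface {
	Open(url string) (Driver, error)               // parse the DSN and return a ready driver
	Close() error                                  // release the underlying connection
	Lock() error                                   // acquire an exclusive lock for a migration run
	Unlock() error                                 // release that lock
	Run(migration io.Reader) error                 // execute a single migration's statements
	SetVersion(version int, dirty bool) error      // record version state in the migrations table
	Version() (version int, dirty bool, err error) // read the current version state
	Drop() error                                   // drop everything in the connected database
}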

Usage:

migrate -source file://blah -database databricks-sqlwarehouse://token:{{token}}@{{workspace_id}}.cloud.databricks.com:443/sql/1.0/warehouses/{{warehouse_id}} {{arg}}
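
Migrations can also be driven from Go rather than the CLI. A minimal sketch, assuming the driver registers itself under the databricks-sqlwarehouse scheme when its package is imported (the commented import path is a guess; use this PR's actual package path):

package main

import (
	"errors"
	"log"

	"github.com/golang-migrate/migrate/v4"
	_ "github.com/golang-migrate/migrate/v4/source/file"
	// _ "github.com/golang-migrate/migrate/v4/database/databricks" // assumed path
)

func main() {
	m, err := migrate.New(
		"file://./migrations",
		"databricks-sqlwarehouse://token:{{token}}@{{workspace_id}}.cloud.databricks.com:443/sql/1.0/warehouses/{{warehouse_id}}",
	)
	if err != nil {
		log.Fatal(err)
	}
	// Apply all pending up migrations; ErrNoChange just means nothing to do.
	if err := m.Up(); err != nil && !errors.Is(err, migrate.ErrNoChange) {
		log.Fatal(err)
	}
}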

Known Issues:

This implementation was developed quickly to address immediate needs for a controlled schema migration process. It may not handle all edge cases perfectly. The primary challenges include:

  • Error Handling: Error messages can be unclear when dealing with dirty migrations. While the migration process generally works, you may encounter issues with error reporting.
  • Transactions: Databricks SQL Driver does not yet support transactions. As a result, concurrent operations are not advised.
  • Parameters: The Databricks SQL Driver does not support query parameters. Creating the migrations table and other internal operations therefore interpolate values directly into SQL strings rather than using bound parameters.
  • Catalog Requirements: The hive_metastore catalog and its default schema must exist in Unity Catalog (UC), as the migrations table is stored there. This may become configurable in future versions.
  • Multiple Queries: The driver does not handle multiple SQL queries in a single request well. To avoid issues, split multiple table creations into separate migrations.
  • Version Management: The migrate tool requires manual intervention if it reports a dirty database and prompts for a forced version change. Manually force the database version back to a known-good state, then re-apply migrations (see the example after this list).
  • SQL Syntax: The Databricks SQL Driver is stricter with SQL syntax compared to the notebooks. Ensure SQL queries are accurate and test them in an environment that closely mirrors the driver’s requirements.
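
For the forced-version case above, the standard golang-migrate workflow applies: force the recorded version back to the last migration known to have applied cleanly (which also clears the dirty flag), then re-run the remaining migrations. Placeholders as in the usage example:

migrate -source file://blah -database databricks-sqlwarehouse://token:{{token}}@{{workspace_id}}.cloud.databricks.com:443/sql/1.0/warehouses/{{warehouse_id}} force {{version}}
migrate -source file://blah -database databricks-sqlwarehouse://token:{{token}}@{{workspace_id}}.cloud.databricks.com:443/sql/1.0/warehouses/{{warehouse_id}} up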

Disclaimer: The author accepts no responsibility for any damage to personal or business systems, databases, networks, device drivers, or any other components resulting from the use of this driver.

var (
	multiStmtDelimiter = []byte(";")

	DefaultMigrationsTable = "schema_migrations"
)
Author


As in my notes, we could and probably should point this at catalog_name.schema_name.schema_migrations before merging.
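
For illustration, a fully qualified variant could look like the sketch below. The catalog and schema names are placeholders, the helper is hypothetical (not this PR's code), and the version/dirty columns follow golang-migrate's usual migrations-table shape:

import (
	"database/sql"
	"fmt"
)

// ensureMigrationsTable is a hypothetical sketch of creating the migrations
// table under an explicit catalog.schema instead of hive_metastore.default,
// e.g. qualifiedName = "catalog_name.schema_name.schema_migrations".
func ensureMigrationsTable(db *sql.DB, qualifiedName string) error {
	// Interpolated rather than parameterised, since the Databricks SQL
	// driver does not support query parameters (see Known Issues).
	query := fmt.Sprintf(
		"CREATE TABLE IF NOT EXISTS %s (version BIGINT NOT NULL, dirty BOOLEAN NOT NULL)",
		qualifiedName,
	)
	_, err := db.Exec(query)
	return err
}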

Comment on lines +108 to +111
return database.CasRestoreOnErr(&d.isLocked, false, true, database.ErrLocked, func() error {
	// Databricks SQL Warehouse does not support locking
	// Placeholder for actual lock code
	return nil
})
Author


The SQL Warehouse might support locking, but the database driver hasn't implemented it.

Comment on lines +1 to +7
CREATE EXTERNAL TABLE IF NOT EXISTS `dog-park-db`.default.cat_naps (
nap_id STRING NOT NULL, -- id of the nap
nap_location STRING NOT NULL, -- location where the nap took place
checkpoint_id LONG NOT NULL, -- ID given to the batch per checkpoint, assigned to many process runs.
batch_id STRING NOT NULL, -- ID given to each independent batch
recorded_at TIMESTAMP NOT NULL -- Timestamp indicating when the nap was recorded.
) LOCATION 's3://dog-park-db-tables/cat_naps';
Author

@caldempsey caldempsey Sep 18, 2024


Just wrote a migration that does this; we should add one:

ALTER TABLE `dog-park-db`.default.cat_naps
ADD COLUMNS (
    md5 STRING COMMENT 'MD5 checksum of the file content'
);
