Distributed Data Mesh, Delta Lake, and Terraform

Gain practical insights into how to automate it all

Serge Smertin
Senior Specialist Solutions Architect, Databricks

About Serge
▪ Lead maintainer of Databricks Terraform Provider
▪ Worked in all stages of the data lifecycle for the past 14 years
▪ Built a couple of data science platforms from scratch
▪ Tracked cyber criminals through massively scaled data forensics
▪ Focusing on automation and integration aspects now

https://martinfowler.com/articles/data-monolith-to-mesh.html

Distributed data mesh turns out to be just a mess.

https://martinfowler.com/articles/data-monolith-to-mesh.html THIS TALK

Terraform Basics

Infrastructure as Code! … like HTML, but for the Cloud

We are the infrastructure for Data + AI in the cloud, so we need to codify it: repeatable, shareable, auditable, and with the whole provisioning process automated.

Provide consistency … across multiple clouds and environments
• Key enabler for expansion
• Authoritative state
• Less tribal knowledge
• Peer-review changes
• Supports all Databricks entities based on 50+ APIs
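As a minimal sketch of what "codifying" a workspace starts with, the configuration below declares the Databricks provider and reads back the current user. The variable names `databricks_host` and `databricks_token` are illustrative assumptions and do not come from the talk; other authentication methods are possible.

```hcl
terraform {
  required_providers {
    databricks = {
      source = "databricks/databricks"
    }
  }
}

# Assumption: workspace URL and token are passed in as variables.
variable "databricks_host" {
  type = string
}

variable "databricks_token" {
  type      = string
  sensitive = true
}

provider "databricks" {
  host  = var.databricks_host
  token = var.databricks_token
}

# Reads the identity Terraform is authenticated as.
data "databricks_current_user" "me" {}

output "current_user" {
  value = data.databricks_current_user.me.user_name
}
```

Because the whole setup is plain text, it can be peer-reviewed and versioned like any other code.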

Generally available after 2+ years in Labs

Don’t have time to write configs from scratch? No problem: experimental tooling is available to generate the configuration for you!
* Rewrite them a bit afterwards, to modularize ;)

Automation on macro-level vs micro-level

Micro level

data "databricks_current_user" "me" {}

resource "databricks_dbfs_file" "this" {
  content_base64 = base64encode(jsonencode({
    "host" : "abc",
    "username" : "admin",
    "password" : "!password123@#"
  }))
  path = "${data.databricks_current_user.me.home}/config.json"
}

resource "databricks_notebook" "this" {
  language = "PYTHON"
  content_base64 = base64encode(<<-EOT
    import json
    with open('/dbfs/${data.databricks_current_user.me.home}/config.json', 'r') as f:
        config = json.load(f)
    print('User is {username} and password is {password}'.format(**config))
    EOT
  )
  path = "${data.databricks_current_user.me.home}/DAIS2022/ReadClearText"
}

output "notebook_url" {
  value = databricks_notebook.this.url
}

DON’T DO IT

data "databricks_current_user" "me" {}
data "databricks_spark_version" "latest" {}
data "databricks_node_type" "smallest" {
  local_disk = true
}

resource "databricks_notebook" "this" {
  language = "PYTHON"
  content_base64 = base64encode(<<-EOT
    username = dbutils.widgets.get('username')
    password = dbutils.widgets.get('password')
    print(f'User is {username} and password is {password}')
    EOT
  )
  path = "${data.databricks_current_user.me.home}/DAIS2022/NotebookTaskArguments"
}

resource "databricks_job" "this" {
  name = "DAIS 2022 - Task Arguments (${data.databricks_current_user.me.alphanumeric})"
  new_cluster {
    num_workers   = 1
    spark_version = data.databricks_spark_version.latest.id
    node_type_id  = data.databricks_node_type.smallest.id
  }
  notebook_task {
    notebook_path = databricks_notebook.this.path
    base_parameters = {
      "host" : "abc",
      "username" : "admin",
      "password" : "!password123@#"
    }
  }
}

output "notebook_url" {
  value = databricks_notebook.this.url
}

output "job_url" {
  value = databricks_job.this.url
}

DON’T DO IT

data "databricks_current_user" "me" {}
data "databricks_spark_version" "latest" {}
data "databricks_node_type" "smallest" {
  local_disk = true
}

resource "databricks_dbfs_file" "this" {
  content_base64 = base64encode(<<-EOT
    import sys
    username, password = sys.argv[1:]
    print(f'User is {username} and password is {password}')
    EOT
  )
  path = "${data.databricks_current_user.me.home}/run.py"
}

resource "databricks_job" "this" {
  name = "DAIS 2022 - Python Arguments"
  new_cluster {
    num_workers   = 1
    spark_version = data.databricks_spark_version.latest.id
    node_type_id  = data.databricks_node_type.smallest.id
  }
  spark_python_task {
    python_file = databricks_dbfs_file.this.dbfs_path
    parameters  = ["admin", "!password123@#"]
  }
}

output "job_url" {
  value = databricks_job.this.url
}

DON’T DO IT

// data sources "me", "latest" and "smallest" are declared as in the previous examples

resource "databricks_secret_scope" "app" {
  name = "dais2022-tfdemo"
}

resource "databricks_secret" "pw" {
  key          = "somepassword"
  string_value = "!password123@#" // would be something else in real life
  scope        = databricks_secret_scope.app.id
}

resource "databricks_job" "this" {
  name = "DAIS 2022 - Spark Conf (${data.databricks_current_user.me.alphanumeric})"
  new_cluster {
    num_workers   = 1
    spark_version = data.databricks_spark_version.latest.id
    node_type_id  = data.databricks_node_type.smallest.id
    spark_conf = {
      "demo.dais.username" : "admin",
      // resolved from the secret scope when the cluster starts
      "demo.dais.password" : "{{secrets/${databricks_secret_scope.app.name}/${databricks_secret.pw.key}}}",
    }
  }
  notebook_task {
    notebook_path = databricks_notebook.this.path
  }
}

resource "databricks_notebook" "this" {
  language = "PYTHON"
  content_base64 = base64encode(<<-EOT
    username = spark.conf.get('demo.dais.username')
    password = spark.conf.get('demo.dais.password')
    print(f'User is {username} and password is {password}')
    EOT
  )
  path = "${data.databricks_current_user.me.home}/DAIS2022/NotebookTaskSparkConf"
}

SAFER

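The secret example hard-codes the password and notes it "would be something else in real life". One possible variant, sketched here as an assumption rather than the speaker's method, is to feed the value through a sensitive input variable. The variable name `app_password` is hypothetical; the two resources are meant to replace the scope and secret declared in the example, not to sit next to them.

```hcl
# Hypothetical input variable; supply it outside of version control,
# for example through the TF_VAR_app_password environment variable.
variable "app_password" {
  description = "Password stored in the Databricks secret scope"
  type        = string
  sensitive   = true
}

resource "databricks_secret_scope" "app" {
  name = "dais2022-tfdemo"
}

resource "databricks_secret" "pw" {
  key          = "somepassword"
  string_value = var.app_password
  scope        = databricks_secret_scope.app.id
}
```

Marking the variable as sensitive keeps the value out of plan output, but it is still written to the Terraform state, so the state itself needs to be stored securely.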
Macro level

https://registry.terraform.io/namespaces/databricks MLOps

Pattern: Isolated full control

data "databricks_group" "users" {
  display_name = "users"
}

data "databricks_user" "everyone" {
  for_each = data.databricks_group.users.users
  user_id  = each.value
}

resource "databricks_repo" "project" {
  for_each = data.databricks_user.everyone
  url      = "https://github.com/databricks/notebook-best-practices"
  path     = "${each.value.repos}/main-project"
}

resource "databricks_job" "this" {
  for_each = data.databricks_user.everyone
  name     = "Experiment of ${each.value.display_name}"
  new_cluster {
    num_workers   = 1
    spark_version = data.databricks_spark_version.latest.id
    node_type_id  = data.databricks_node_type.smallest.id
  }
  notebook_task {
    notebook_path = "${databricks_repo.project[each.key].path}/notebooks/covid_eda_raw"
  }
}

resource "databricks_group" "oncall" {
  display_name = "on-call"
}

data "databricks_current_user" "me" {}

resource "databricks_permissions" "job_usage" {
  for_each = {
    for k, v in data.databricks_user.everyone : k => v
    if v.user_name != data.databricks_current_user.me.user_name
  }
  job_id = databricks_job.this[each.key].id
  access_control {
    user_name        = each.value.user_name
    permission_level = "IS_OWNER"
  }
  access_control {
    group_name       = databricks_group.oncall.display_name
    permission_level = "CAN_MANAGE"
  }
}

data "databricks_spark_version" "latest" {}
data "databricks_node_type" "smallest" {
  local_disk = true
}

Pattern: Library Management

resource "databricks_dbfs_file" "app" {
  source = "${path.module}/app-0.0.1.jar"
  path   = "/FileStore/app-0.0.1.jar"
}

data "databricks_clusters" "all" {
}

resource "databricks_library" "app" {
  for_each   = data.databricks_clusters.all.ids
  cluster_id = each.key
  jar        = databricks_dbfs_file.app.dbfs_path
}

Pattern: Extending Cluster Policies

// Module: ../modules/databricks-cluster-policy

variable "team" {
  description = "Team that performs the work"
}

variable "policy_overrides" {
  description = "Cluster policy overrides"
}

locals {
  default_policy = {
    "autotermination_minutes" : {
      "type" : "fixed",
      "value" : 20,
      "hidden" : true
    },
    "custom_tags.Team" : {
      "type" : "fixed",
      "value" : var.team
    }
  }
}

resource "databricks_cluster_policy" "fair_use" {
  name       = "${var.team} cluster policy"
  definition = jsonencode(merge(local.default_policy, var.policy_overrides))
}

resource "databricks_permissions" "can_use_cluster_policyinstance_profile" {
  cluster_policy_id = databricks_cluster_policy.fair_use.id
  access_control {
    group_name       = var.team
    permission_level = "CAN_USE"
  }
}

// Usage: one module instance per team

module "marketing_compute_policy" {
  source = "../modules/databricks-cluster-policy"
  team   = "marketing"
  policy_overrides = {
    // only marketing guys will benefit
    // from delta cache this way
    "spark_conf.spark.databricks.io.cache.enabled" : {
      "type" : "fixed",
      "value" : "true"
    },
  }
}

module "engineering_compute_policy" {
  source = "../modules/databricks-cluster-policy"
  team   = "engineering"
  policy_overrides = {
    "dbus_per_hour" : {
      "type" : "range",
      // only engineering guys can spin
      // up big clusters
      "maxValue" : 50
    },
  }
}

Pattern: Secure Bucket

// Step 1: Create bucket policy that will give full access to this bucket
data "databricks_aws_bucket_policy" "ds" {
  provider         = databricks.mws
  full_access_role = aws_iam_role.data_role.arn
  bucket           = aws_s3_bucket.ds.bucket
}

// Step 2: Create cross-account policy, which allows Databricks to pass given list of data roles
data "databricks_aws_crossaccount_policy" "this" {
  pass_roles = [aws_iam_role.data_role.arn]
}

// Step 3: Allow Databricks to perform actions within your account, given requests are with AccountID
data "databricks_aws_assume_role_policy" "this" {
  external_id = var.account_id
}

// Step 4: Register cross-account role for multi-workspace scenario (only if you're using multi-workspace setup)
resource "databricks_mws_credentials" "this" {
  provider         = databricks.mws
  account_id       = var.account_id
  credentials_name = "${var.prefix}-creds"
  role_arn         = aws_iam_role.cross_account.arn
}

// Step 5: Register instance profile at Databricks
resource "databricks_instance_profile" "ds" {
  instance_profile_arn = aws_iam_instance_profile.this.arn
  skip_validation      = false
}

// Step 6: now you can do `%fs ls /mnt/experiments` in notebooks
resource "databricks_mount" "this" {
  mount_name = "experiments"
  s3 {
    instance_profile = databricks_instance_profile.ds.id
    bucket_name      = aws_s3_bucket.this.bucket
  }
}

Unity Catalog

resource "databricks_metastore" "this" {
  provider      = databricks.workspace
  name          = "primary"
  storage_root  = "s3://${aws_s3_bucket.metastore.id}/metastore"
  owner         = var.unity_admin_group
  force_destroy = true
}

resource "databricks_metastore_data_access" "this" {
  provider     = databricks.workspace
  metastore_id = databricks_metastore.this.id
  name         = aws_iam_role.metastore_data_access.name
  aws_iam_role {
    role_arn = aws_iam_role.metastore_data_access.arn
  }
  is_default = true
}

resource "databricks_metastore_assignment" "default_metastore" {
  provider             = databricks.workspace
  for_each             = toset(var.databricks_workspace_ids)
  workspace_id         = each.key
  metastore_id         = databricks_metastore.this.id
  default_catalog_name = "hive_metastore"
}

resource "databricks_catalog" "sandbox" {
  provider     = databricks.workspace
  metastore_id = databricks_metastore.this.id
  name         = "sandbox"
  comment      = "this catalog is managed by terraform"
  properties = {
    purpose = "testing"
  }
  depends_on = [databricks_metastore_assignment.default_metastore]
}

resource "databricks_grants" "sandbox" {
  provider = databricks.workspace
  catalog  = databricks_catalog.sandbox.name
  grant {
    principal  = "Data Scientists"
    privileges = ["USAGE", "CREATE"]
  }
  grant {
    principal  = "Data Engineers"
    privileges = ["USAGE"]
  }
}

resource "databricks_schema" "things" {
  provider     = databricks.workspace
  catalog_name = databricks_catalog.sandbox.id
  name         = "things"
  comment      = "this database is managed by terraform"
  properties = {
    kind = "various"
  }
}

resource "databricks_grants" "things" {
  provider = databricks.workspace
  schema   = databricks_schema.things.id
  grant {
    principal  = "Data Engineers"
    privileges = ["USAGE"]
  }
}

resource "databricks_cluster" "unity_sql" {
  provider                = databricks.workspace
  cluster_name            = "Unity SQL"
  spark_version           = data.databricks_spark_version.latest.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 60
  enable_elastic_disk     = false
  num_workers             = 2
  aws_attributes {
    availability = "SPOT"
  }
  data_security_mode = "USER_ISOLATION"
}

Every developer wants their own dev catalog?

data "databricks_group" "users" {
  display_name = "users"
}

data "databricks_user" "everyone" {
  for_each = data.databricks_group.users.users
  user_id  = each.value
}

resource "databricks_catalog" "sandbox" {
  for_each     = data.databricks_user.everyone
  metastore_id = databricks_metastore.this.id
  name         = "sandbox_${each.value.alphanumeric}"
  owner        = each.value.user_name
  comment      = "this catalog is managed by terraform"
  properties = {
    purpose = "research sandbox"
  }
}

resource "databricks_grants" "sandbox" {
  for_each = data.databricks_user.everyone
  catalog  = databricks_catalog.sandbox[each.key].name
  grant {
    principal  = "Data Scientists"
    privileges = ["USAGE"]
  }
}

You can now explore the realms of possibility.

Pattern: Disaster Recovery

Remember: you can generate configurations from an existing workspace as a one-off action

Pattern: user sync

[Diagram] Automated process, runs every 30 minutes
• SPN with ADB Contributor role: Contributor on workspaces (part of “admins” group in workspaces); adds and removes users
• Azure Active Directory: AAD Groups, read with Directory.Read.All or Directory.AccessAsUser.All
• Azure Databricks Workspace #1: Databricks Groups, Tables, Clusters, Secret Scopes
• Azure Databricks Workspace #2: Databricks Groups, Tables, Clusters, Secret Scopes

// define which groups have access to a particular workspace
variable "groups" {
  default = {
    "AAD Group A" = {
      workspace_access            = true
      allow_databricks_sql_access = false
    },
    "AAD Group B" = {
      workspace_access            = false
      allow_databricks_sql_access = true
    }
  }
}

// read group members of given groups from AzureAD
// every time Terraform is started
data "azuread_group" "this" {
  for_each     = toset(keys(var.groups))
  display_name = each.value
}

// create or remove groups within Azure Databricks:
// all governed by "groups" variable
resource "databricks_group" "this" {
  for_each         = data.azuread_group.this
  display_name     = each.key
  workspace_access = var.groups[each.key].workspace_access
  // key name matches the "groups" variable above
  databricks_sql_access = var.groups[each.key].allow_databricks_sql_access
}

// read users from AzureAD every time Terraform is started
data "azuread_user" "this" {
  for_each  = toset(flatten([for g in data.azuread_group.this : g.members]))
  object_id = each.value
}

// all governed by AzureAD, create or remove users from
// Azure Databricks workspace
resource "databricks_user" "this" {
  for_each     = data.azuread_user.this
  user_name    = each.value.user_principal_name
  display_name = each.value.display_name
  active       = each.value.account_enabled
}

// put users to respective groups
resource "databricks_group_member" "this" {
  for_each = toset(flatten([
    for group_name in keys(var.groups) : [
      for member_id in data.azuread_group.this[group_name].members :
      jsonencode({
        user : member_id,
        group : group_name
      })
    ]
  ]))
  group_id  = databricks_group.this[jsondecode(each.value).group].id
  member_id = databricks_user.this[jsondecode(each.value).user].id
}

Other patterns

We simply have no time to go over them all
• “Project Workspaces”
  ◦ gather a team
  ◦ spin up a carbon copy of a workspace
  ◦ work on a project for a couple of weeks or months
  ◦ tear down the workspace in the end
• Code Artifacts: shared and custom libraries
  ◦ think about databricks_mount and databricks_library
• Networking: AWS Private Link, IP Access Control Lists, etc.
  ◦ see guides on the Databricks provider page on the Terraform registry
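For the networking bullet, a minimal sketch of an IP access control list might look like the following. The label and the CIDR range (a reserved documentation range) are illustrative assumptions, not values from the talk.

```hcl
# IP access lists have to be switched on for the workspace first.
resource "databricks_workspace_conf" "this" {
  custom_config = {
    "enableIpAccessLists" = true
  }
}

# Hypothetical allow list: only the given range may reach the workspace.
resource "databricks_ip_access_list" "allowed" {
  label        = "allow_in"
  list_type    = "ALLOW"
  ip_addresses = ["203.0.113.0/24"]
  depends_on   = [databricks_workspace_conf.this]
}
```

The explicit `depends_on` makes sure the feature is enabled before the list is created.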

How to run it all?

Thank you

Serge Smertin, Databricks
