Databricks Platform Architecture and Main Concepts

By Kaden Sungbin Cho

Databricks is a company founded by the engineers who created Spark at UC Berkeley. It is a fast-growing cloud company behind major open source data projects, including Spark, Delta Lake, MLflow, and Koalas, and it offers the ‘Lakehouse Platform’ as a service.

In this article, we will look at the architecture and concepts of major services and how they provide a data platform environment:

  • Databricks’ services
  • Analytics Platform architecture
  • Analytics Platform main concepts

Databricks’ services

Databricks' service wraps a cloud provider and offers a notebook-based data environment for data products such as analytics and AI. Starting from the initial Spark data processing environment, Databricks created open source projects such as Delta Lake and MLflow and has since expanded into areas such as data architecture management, catalog management, and MLOps. Viewed as the layers Cloud - Runtime - Workspace, it has the following structure:

Image from [2]

Currently, existing and added sub-products are provided comprehensively under the name Data Lakehouse, and the base environment can be linked with the three major cloud companies: AWS, Azure, and Google Cloud.

Image from Databricks

What differentiates it from the many data services offered by the three cloud providers is a UX focused on the needs of each type of user. Databricks simplifies the platform so that data analysts, engineers, scientists, and platform operators can quickly build a data environment and carry out their tasks on a single integrated platform.

Analytics Platform Architecture

The platform's architecture is largely divided into two planes: control and data:

  • Control plane: Includes the backend services managed by Databricks; notebook commands and other workspace settings are encrypted and stored here.
  • Data plane: Managed through your own AWS, Azure, or Google Cloud account, this is where data is stored and processed. From your cloud account, you can also access external data stores or ingest data from external streaming sources.
Platform overview - Image from [2]

Analytics Platform Key Concepts

Let’s look at key concepts that frequently appear when using Databricks’ analytics platform [6].

Workspace

Workspace is an environment where you can access all Databricks assets that exist in the Control Plane above. Workspaces organize objects such as notebooks, libraries, dashboards, and experiments into folders and provide access to data objects and computing resources. The main objects are as follows:

  • Notebook: Web-based interface to documents including executable commands, visualizations, text, etc.
  • Dashboard: Interface that provides visualization
  • Library: A package of code available to notebooks or jobs running on a cluster. Databricks ships a set of libraries by default, and you can add your own.
  • Repo: A folder whose contents are version controlled by syncing to a remote Git repository
  • Experiment: A bundle of MLflow runs for training ML models.

Interface

Databricks provides several interfaces for accessing its assets: a UI, a REST API, and a CLI.
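Concretely, the API can be called with any HTTP client and a personal access token. Below is a minimal, hypothetical sketch of listing workspace objects through the Workspace REST API; the host, token, and path are placeholders.

```python
# A minimal sketch of calling the Databricks REST API with a personal access
# token; the workspace URL and token are placeholders, not real values.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                   # placeholder

# List objects (notebooks, folders, libraries, ...) at the workspace root.
resp = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/workspace/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"path": "/"},
)
resp.raise_for_status()
for obj in resp.json().get("objects", []):
    print(obj["object_type"], obj["path"])
```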

Data Management

This section covers the objects that hold the data on which you perform analytics or that feed into ML models.

Databricks File System

DBFS is a filesystem abstraction layer over blob storage; it contains directories, which in turn hold files and other directories. DBFS is automatically populated with sample datasets you can use to learn Databricks.
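As a quick illustration, a notebook can browse DBFS directly; `dbutils` and `spark` are available automatically inside a Databricks notebook. The sample dataset path below follows the pattern used in the Databricks quickstarts and may differ in your workspace.

```python
# A minimal notebook sketch of browsing DBFS and reading one of the bundled
# sample datasets; the exact dataset path is an assumption.
display(dbutils.fs.ls("/databricks-datasets"))   # list the bundled sample datasets

df = spark.read.csv(
    "dbfs:/databricks-datasets/samples/population-vs-price/data_geo.csv",
    header=True,
    inferSchema=True,
)
display(df.limit(5))                              # preview the first rows
```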

Metastore

This is the component that stores structural information about the tables and partitions in a data warehouse: the columns and column types, the serializers and deserializers used to read and write data, and the files where the data is stored. Every Databricks deployment has a central Hive metastore, accessible from all clusters, that persists table metadata. Alternatively, you can point Databricks at an existing external Hive metastore.
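For example, the metadata held in the metastore can be inspected from a notebook with Spark SQL; the table name below is a placeholder.

```python
# A minimal sketch of inspecting metastore metadata from a notebook.
# 'default' is the database that exists out of the box.
for table in spark.catalog.listTables("default"):
    print(table.name, table.tableType)

# Column names, types, and the storage location come from the metastore too
# (replace 'my_table' with a table that actually exists in your workspace).
spark.sql("DESCRIBE TABLE EXTENDED default.my_table").show(truncate=False)
```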

Computation management

This section covers the concepts you need to know to compute in Databricks.

Cluster

A cluster is a set of computation resources and configurations on which notebooks and jobs run. There are two cluster types (a sketch of creating one via the REST API follows the list):

  • all-purpose: You can create this type of cluster using the UI, CLI, or REST API. All-purpose clusters can be shut down and restarted manually, and multiple users can share one for collaboration.
  • job: When the Databricks job scheduler runs a job on a new cluster, it creates the job cluster for that run and terminates it when the job is finished.
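As mentioned above, an all-purpose cluster can be created through the REST API. Below is a minimal, hypothetical sketch; the host, token, node type, and runtime version are placeholders and differ per cloud and workspace.

```python
# A minimal sketch of creating an all-purpose cluster via the Clusters API;
# every value below is a placeholder.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                   # placeholder

cluster_spec = {
    "cluster_name": "analytics-shared",
    "spark_version": "13.3.x-scala2.12",   # a Databricks Runtime version string
    "node_type_id": "i3.xlarge",           # AWS example; differs on Azure/GCP
    "num_workers": 2,
    "autotermination_minutes": 60,         # shut down after an hour of inactivity
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```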

Pool

A pool is a set of idle, ready-to-use instances that reduces cluster start and auto-scaling times. When a cluster is attached to a pool, it allocates its driver and worker nodes from the pool. If the pool does not have enough idle resources to serve a cluster's request, it expands by allocating new instances from the instance provider. When an attached cluster terminates, the instances it used are returned to the pool and can be reused by other clusters.
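Building on the cluster sketch above, attaching a cluster to a pool amounts to referencing the pool's ID in the cluster spec instead of a node type; the ID below is a placeholder.

```python
# A cluster that draws its driver and workers from an existing pool.
cluster_spec_with_pool = {
    "cluster_name": "pooled-cluster",
    "spark_version": "13.3.x-scala2.12",
    "instance_pool_id": "<pool-id>",   # placeholder; instances come from this pool
    "num_workers": 2,
}
```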

Databricks Runtime

The Databricks Runtime is the set of core components that run on clusters managed by Databricks. Databricks provides several runtime types:

  • Databricks Runtime: Includes Apache Spark, while adding several components and updates to improve usability, performance, and security for analytics.
  • Databricks Runtime for ML: Built on the Databricks Runtime and provides an environment for ML and data science. Includes popular libraries such as Tensorflow, Keras, PyTorch, and XGBoost.
  • Databricks Runtime for Genomics: This is a version of Databricks runtime optimized for handling genomic and biomedical data.
  • Databricks Light: A packaging of the open source Apache Spark runtime. You can use it for tasks that don't require the additional performance or features that Databricks Runtime provides.

Job

A non-interactive mechanism to run a notebook or library immediately or on a scheduled basis.
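For instance, a notebook can be scheduled as a job through the Jobs REST API. The sketch below is hypothetical; the host, token, notebook path, and cron expression are placeholders.

```python
# A minimal sketch of creating a scheduled notebook job via the Jobs API;
# all values are placeholders.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                   # placeholder

job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "run_notebook",
            "notebook_task": {"notebook_path": "/Users/someone@example.com/etl"},
            "new_cluster": {                   # a job cluster created per run
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # every day at 02:00
        "timezone_id": "UTC",
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json()["job_id"])
```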

Workload

Databricks distinguishes between two types of workloads that are subject to different pricing policies:

  • Data Engineering: Workloads that run on job clusters that the Databricks job scheduler creates for each workload.
  • Data analytics: Interactive workloads that run on all-purpose clusters. Interactive workloads typically run commands in a Databricks notebook, but a job run on an existing all-purpose cluster is also treated as an interactive workload.

Execution Context

The state of the REPL environment for each supported programming language. The supported languages are Python, R, Scala, and SQL.

Model Management

This section covers concepts for training ML models.

Model

A mathematical function that represents the relationship between a set of predictors (features) and an outcome. ML consists of training and inference steps: you train a model on an existing dataset and then use it to predict outcomes for new data.

Run

A collection of parameters, metrics, and tags related to ML model training.

Experiment

An experiment is the main unit of organization and access control for runs; all MLflow runs belong to an experiment. Experiments let you visualize, search, and compare runs, and download run artifacts or metadata for analysis in other tools.
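The three concepts fit together as in the minimal MLflow sketch below: an experiment groups runs, each run records parameters, metrics, and tags, and a trained model is logged as an artifact of the run. The experiment path and the tiny in-memory dataset are illustrative only.

```python
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression

# Placeholder experiment path inside the workspace.
mlflow.set_experiment("/Users/someone@example.com/demo-experiment")

X = [[1.0], [2.0], [3.0], [4.0]]   # toy training data
y = [2.0, 4.0, 6.0, 8.0]

with mlflow.start_run(run_name="linear-baseline") as run:
    model = LinearRegression().fit(X, y)
    mlflow.log_param("model_type", "LinearRegression")   # parameter of the run
    mlflow.log_metric("train_r2", model.score(X, y))     # metric of the run
    mlflow.sklearn.log_model(model, "model")             # model artifact
    print("run_id:", run.info.run_id)
```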

Authentication and authorization

Authentication and authorization are handled around the concepts of User, Group, and Access Control List (ACL).

Reference

[1] https://en.wikipedia.org/wiki/Databricks

[2] https://docs.databricks.com/getting-started/overview.html

[3] https://docs.microsoft.com/en-us/azure/databricks/getting-started/overview

[4] https://docs.gcp.databricks.com/getting-started/overview.html

[5] https://databricks.com/product/data-lakehouse

[6] https://docs.databricks.com/getting-started/concepts.html

[7] https://community.cloud.databricks.com/login.html
