Azure Databricks

What is Delta Lake in Databricks?

If you’re not familiar with Delta Lake in Databricks, I’ll cover what you need to know here. Delta Lake is a technology developed by the same team that created Apache Spark. It’s designed to bring reliability to your data lakes: it provides ACID transactions, handles metadata at scale, and unifies streaming and batch data processing.

Let’s begin with some of the challenges of data lakes:

  • Data lakes are notoriously messy as everything gets dumped there. Sometimes, we may not have a rhyme or reason for dumping data there; we may be thinking we’ll need it at some later date.
  • Much of this mess comes from the fact that your data lake ends up with a lot of small files of many different data types. Because those small files are never compacted, trying to read them in any meaningful way is difficult, if not impossible.
  • Data lakes often contain bad or corrupted data files, so you can’t analyze them without going back and essentially starting over.

This is where Delta Lake comes to the rescue! It delivers an open-source storage layer that brings ACID transactions to Apache Spark big data workloads. So, instead of the mess I described above, Delta Lake gives you a transactional layer over your data lake. It provides ACID transactions through a log associated with each Delta table created in your data lake. This log records the history of everything that was ever done to that table or data set, which gives you a high level of reliability and stability in your data lake.
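To make this concrete, here is a minimal sketch in PySpark of how a Delta table and its transaction log come into play. It assumes you’re in a Databricks notebook (so the `spark` session already exists), and the table name is purely illustrative.

```python
from pyspark.sql import functions as F

# Write a small DataFrame as a Delta table. Every commit is recorded as an entry
# in the table's _delta_log directory, which is what gives you ACID guarantees.
df = spark.range(0, 1000).withColumn("value", F.col("id") * 2)
df.write.format("delta").mode("overwrite").saveAsTable("demo_delta_table")

# The log doubles as the table's history, which you can inspect at any time.
spark.sql("DESCRIBE HISTORY demo_delta_table").show(truncate=False)
```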

Key Features of Delta Lake are:

  • ACID Transactions (Atomicity, Consistency, Isolation, Durability) – With Delta you don’t need to write any extra code; every transaction is automatically written to the log. That transaction log is the key, and it represents the single source of truth for the table.
  • Scalable Metadata Handling – Handles terabytes or even petabytes of data with ease. Metadata is stored just like data, and you can display it with the DESCRIBE DETAIL command, which shows all the metadata associated with the table (see the first sketch after this list). This puts the full force of Spark behind your metadata.
  • Unified Batch & Streaming – There’s no longer a need for separate architectures for reading a stream of data versus a batch of data, so Delta overcomes the limitations of split streaming and batch systems. A Delta table is both a batch and a streaming source and sink. You can do concurrent streaming and batch writes to your table and it all gets logged, so it’s safe and sound in your Delta table (a streaming sketch follows the list).
  • Schema Enforcement – This is what makes Delta strong in this space: it enforces your schemas. If you put a schema on a Delta table and try to write data that doesn’t conform to it, Delta raises an error and refuses the write, protecting you from bad writes. The enforcement reads the schema as part of the metadata, checks every column and data type, and makes sure what you’re writing matches what the table’s schema declares, so you don’t have to worry about writing bad data to your table (see the schema-enforcement sketch after this list).
  • Time Travel (Data Versioning) – You can query an older snapshot of your data, which gives you data versioning and the ability to roll back or audit changes (the first sketch after this list includes a time-travel query).
  • Upserts and Deletes – These operations are typically hard to do without something like Delta. Delta makes upserts, or merges, very easy. A merge works like a SQL MERGE into your Delta table: you can merge data from another DataFrame into the table and perform updates, inserts, and deletes in one pass. You can also run an ordinary update or delete with a predicate against a table – something that was almost unheard of on a data lake before Delta (see the merge sketch at the end of the examples below).
  • 100% Compatible with Apache Spark
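Here is a sketch of the metadata and time-travel features mentioned above, continuing with the illustrative demo_delta_table from the earlier example.

```python
# DESCRIBE DETAIL surfaces the table's metadata: location, format, size in bytes,
# number of files, partition columns, and so on.
spark.sql("DESCRIBE DETAIL demo_delta_table").show(truncate=False)

# Time travel: query an older snapshot of the table by version number
# (TIMESTAMP AS OF works the same way with a timestamp).
spark.sql("SELECT * FROM demo_delta_table VERSION AS OF 0").show(5)
```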
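Next, a sketch of the same table acting as both a streaming source and a batch sink; the checkpoint path and the copy table name are made up for the example.

```python
from pyspark.sql import functions as F

# Read the Delta table as a stream: each new commit arrives as a micro-batch.
stream_df = spark.readStream.table("demo_delta_table")

# Continuously write the stream out to another Delta table.
query = (stream_df.writeStream
         .format("delta")
         .option("checkpointLocation", "/tmp/demo_delta_checkpoint")
         .toTable("demo_delta_copy"))

# An ordinary batch append to the source table is picked up by the stream,
# because both kinds of writes go through the same transaction log.
spark.range(2000, 2010).withColumn("value", F.col("id") * 2) \
    .write.format("delta").mode("append").saveAsTable("demo_delta_table")
```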
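Schema enforcement in action, again as a sketch: the table’s value column is a long, so appending a string column of the same name is refused.

```python
from pyspark.sql import functions as F

# The existing table has the schema (id: long, value: long). Appending a
# DataFrame whose "value" column is a string violates that schema, so Delta
# rejects the write instead of silently corrupting the table.
bad_df = spark.range(0, 10).withColumn("value", F.lit("not a number"))

try:
    bad_df.write.format("delta").mode("append").saveAsTable("demo_delta_table")
except Exception as e:
    print("Write rejected by schema enforcement:", type(e).__name__)
```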
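Finally, a sketch of an upsert, an update, and a delete using the Delta Lake merge API; the key column and values are illustrative.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

target = DeltaTable.forName(spark, "demo_delta_table")
updates = spark.range(990, 1010).withColumn("value", F.col("id") * 10)  # overlapping and new ids

# Merge: update rows that already exist, insert the ones that don't.
(target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Plain updates and deletes with a predicate are just as direct.
target.delete("id < 100")
target.update(condition="id >= 900", set={"value": F.lit(0).cast("long")})
```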

Delta Lake is really a game changer and I hope you educate yourself more and start using it in your organization. You’ll find a great training resource from the Databricks community at: https://academy.databricks.com/category/self-paced

Or reach out to us at 3Cloud. Our expert team and solution offerings can help your business with any Azure product or service, including Managed Services offerings. Contact us at 888-8AZURE or  [email protected].

 

Brian Custer

How to Upload and Query a CSV File in Databricks

Welcome to another post in our Azure Every Day mini-series covering Databricks. Are you just starting out with Databricks and need to learn how to upload a CSV? In this post I’ll show you how to upload and query a file in Databricks. For a more detailed, step-by-step view, check out my video at the end of the post. Let’s get started!
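As a preview, here is a minimal sketch of reading and querying an uploaded CSV in a Databricks notebook; the file path and names are illustrative and assume the file was uploaded to DBFS through the UI.

```python
# Path where the Databricks upload UI typically lands files (illustrative name).
csv_path = "/FileStore/tables/sales_data.csv"

df = (spark.read
      .format("csv")
      .option("header", "true")        # first row holds the column names
      .option("inferSchema", "true")   # let Spark guess the column types
      .load(csv_path))

# Register a temporary view so the file can be queried with SQL.
df.createOrReplaceTempView("sales_data")
spark.sql("SELECT COUNT(*) AS row_count FROM sales_data").show()
```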

Andie Letourneau

How to Merge Data Using Change Data Capture in Databricks

My post today in our Azure Every Day Databricks mini-series is about Databricks Change Data Capture (CDC). A common use case is capturing changes from one or many sources into a set of Databricks Delta tables. The goal is to merge those changes into Databricks Delta.
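As a rough sketch of that pattern, the merge below assumes a change feed DataFrame called changes with the target’s key column, the new values, and an operation column marking each row as an insert, update, or delete; all of the names are illustrative.

```python
from delta.tables import DeltaTable

# "changes" is assumed to already exist as a DataFrame of captured changes.
target = DeltaTable.forName(spark, "customers_delta")

(target.alias("t")
    .merge(changes.alias("c"), "t.customer_id = c.customer_id")
    .whenMatchedDelete(condition="c.operation = 'DELETE'")
    .whenMatchedUpdateAll(condition="c.operation = 'UPDATE'")
    .whenNotMatchedInsertAll(condition="c.operation = 'INSERT'")
    .execute())
```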

Jon Bloom

Databricks and Azure Key Vault

In our ongoing Azure Databricks series within Azure Every Day, I’d like to discuss connecting Databricks to Azure Key Vault. If you’re unfamiliar, Azure Key Vault lets you maintain and manage secrets, keys, and certificates, as well as other sensitive information, all stored within the Azure infrastructure.
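As a quick illustration, once a Key Vault-backed secret scope exists, a notebook can read a secret like this (the scope and secret names below are made up):

```python
# Retrieve a secret from a Key Vault-backed secret scope.
jdbc_password = dbutils.secrets.get(scope="my-keyvault-scope", key="sql-db-password")

# Databricks redacts the value in notebook output, but it can still be used in
# code, for example when building a JDBC connection string.
print(jdbc_password)  # prints [REDACTED]
```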

Jon Bloom

Custom Libraries in Databricks

This week’s Databricks post in our mini-series is focused on adding custom code libraries in Databricks. Databricks comes with many curated libraries added to the runtime, so you don’t have to pull them in yourself. The preinstalled Python, R, Java, and Scala libraries are listed in the System Environment section of the Databricks release notes.
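For anything not already in the runtime, one common option is a notebook-scoped install; the package name below is just an example, and libraries can also be attached to the cluster from the Libraries UI.

```python
# In a Databricks notebook cell, %pip installs a library for this notebook's
# Python environment only (the package name is illustrative).
%pip install great-expectations
```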

Jeff Burns

How to Integrate Azure DevOps within Azure Databricks

In this post in our Databricks mini-series, I’d like to talk about integrating Azure DevOps with Azure Databricks. Databricks connects easily with DevOps and requires two primary things. The first is a Git repository, which is where we store our notebooks so we can look back and see how things have changed. The second is the DevOps pipeline, which lets you deploy notebooks to different environments.

Jon Bloom

How to Create an Azure Key Vault in Databricks

Welcome to another edition of our Azure Every Day mini-series on Databricks. In this post, I’ll walk you through creating a key vault and setting it up to work with Databricks. I’ve created a video demo where I show you how to set up a Key Vault, create a notebook, connect to a database, and run a query.

Leslie Andrews