If you’re not familiar with Delta Lake in Databricks, I’ll cover what you need to know here. Delta Lake is a technology developed by the original creators of Apache Spark. It’s designed to bring reliability to your data lakes by providing ACID transactions, scalable metadata handling, and unified streaming and batch data processing.
Let’s begin with some of the challenges of data lakes:
- Data lakes are notoriously messy as everything gets dumped there. Sometimes there’s no rhyme or reason for dumping data there beyond thinking we’ll need it at some later date.
- Much of this mess comes from having a lot of small files and different data types. When many small files are never compacted, reading them efficiently is difficult, if not impossible.
- Data lakes often contain bad data or corrupted data files so you can’t analyze them unless you go back and pretty much start over again.
This is where Delta Lake comes to the rescue! It delivers an open-source storage layer that brings ACID transactions to Apache Spark big data workloads. So, instead of the mess I described above, you have a Delta Lake layer sitting over your data lake. Delta Lake provides ACID transactions through a log that is associated with each Delta table created in your data lake. This log records the history of everything that was ever done to that table or data set, which brings a high level of reliability and stability to your data lake.
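To make the idea of a per-table log more concrete, here is a minimal sketch in plain Python. It is a toy model, not Delta Lake’s actual implementation or file format: the class name ToyDeltaTable, its methods, and the file names are all made up for illustration. It only shows the principle that each transaction is one appended log entry, and that replaying the log up to a given version reconstructs the table as it looked at that point.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToyDeltaTable:
    """Toy table backed by an append-only commit log (illustrative only)."""

    log: list = field(default_factory=list)  # one JSON string per committed version

    def commit(self, operation, add=(), remove=()):
        """Record one transaction as a single log entry; return its version number."""
        entry = {
            "version": len(self.log),
            "operation": operation,
            "add": list(add),        # data files this transaction adds
            "remove": list(remove),  # data files this transaction retires
        }
        # The whole transaction becomes visible through one append, or not at all.
        self.log.append(json.dumps(entry))
        return entry["version"]

    def snapshot(self, version=None):
        """Replay the log up to `version` (latest if None) to list the live files."""
        if version is None:
            version = len(self.log) - 1
        files = set()
        for raw in self.log[: version + 1]:
            entry = json.loads(raw)
            files -= set(entry["remove"])
            files |= set(entry["add"])
        return sorted(files)

    def history(self):
        """Return (version, operation) pairs for every commit in the log."""
        return [(e["version"], e["operation"]) for e in map(json.loads, self.log)]


# Hypothetical usage: two small writes followed by a compaction.
events = ToyDeltaTable()
events.commit("WRITE", add=["part-0.parquet", "part-1.parquet"])
events.commit("WRITE", add=["part-2.parquet"])
events.commit(
    "OPTIMIZE",
    add=["part-compacted.parquet"],
    remove=["part-0.parquet", "part-1.parquet", "part-2.parquet"],
)

print(events.history())            # every operation ever applied to the table
print(events.snapshot())           # the current state of the table
print(events.snapshot(version=0))  # the table as it looked after the first write
```

Because nothing in the log is ever overwritten, the same replay that serves the current state can also serve any earlier version, which is the intuition behind the history and versioning features described in this post.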
Key Features of Delta Lake are:
- ACID Transactions (Atomicity, Consistency, Isolation, Durability) – With Delta you don’t need to write any extra code; every transaction is written to the log automatically. This transaction log is the key, and it represents a single source of truth.
- Scalable Metadata Handling – Handles terabytes or even petabytes of data with ease. Metadata is stored just like data, so Delta puts the full force of Spark against your metadata. You can display the metadata associated with a table using the DESCRIBE DETAIL command.
- Unified Batch & Streaming – There is no longer a need for separate architectures for reading a stream of data versus a batch of data, so it overcomes the limitations of separate streaming and batch systems. A Delta table is both a batch table and a streaming source and sink. You can do concurrent streaming or batch writes to your table and it all gets logged, so it’s safe and sound in your Delta table.
- Schema Enforcement – This is what makes Delta strong in this space. If you put a schema on a Delta table and try to write data that does not conform to it, Delta raises an error and rejects the write. The enforcement reads the schema as part of the metadata, checks every column, data type, etc., and ensures that what you’re writing matches the schema of your Delta table, so you don’t need to worry about writing bad data to it.
- Time Travel (Data Versioning) – Every write creates a new version of the table, so you can query an older snapshot of your data, roll back a bad change, or audit how the data changed over time.
- Upserts and Deletes – These operations are typically hard to do on a data lake without something like Delta. Delta allows you to do upserts, or merges, very easily. Merges work like a SQL MERGE into your Delta table: you can merge data from another DataFrame into your table and perform updates, inserts, and deletes. You can also do a regular update or delete of data with a predicate on a table, something that was almost unheard of before Delta.
- 100% Compatible with Apache Spark
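Schema enforcement and upserts are easier to picture with a small example. The sketch below is plain Python over lists of dictionaries, not the Delta Lake API; the schema, the helper names (enforce_schema, merge, delete_where), and the sample rows are all hypothetical. It only illustrates the behavior described in the list above: a non-conforming batch is rejected as a whole, matching keys are updated, new keys are inserted, and rows can be deleted with a predicate.

```python
# Hypothetical table schema: column name -> expected Python type.
SCHEMA = {"id": int, "name": str, "amount": float}


def enforce_schema(rows, schema):
    """Raise if any row has the wrong columns or value types; otherwise return rows."""
    for row in rows:
        if set(row) != set(schema):
            raise ValueError(f"column mismatch: {sorted(row)} vs {sorted(schema)}")
        for column, expected in schema.items():
            if not isinstance(row[column], expected):
                raise TypeError(f"column {column!r} expects {expected.__name__}")
    return rows


def merge(target, source, key, schema):
    """Upsert: rows in source replace target rows with the same key, others are inserted."""
    enforce_schema(source, schema)  # the whole batch is validated before any change
    merged = {row[key]: row for row in target}
    for row in source:
        merged[row[key]] = row
    return [merged[k] for k in sorted(merged)]


def delete_where(target, predicate):
    """Delete with a predicate: keep only the rows the predicate does not match."""
    return [row for row in target if not predicate(row)]


orders = [
    {"id": 1, "name": "alice", "amount": 10.0},
    {"id": 2, "name": "bob", "amount": 20.0},
]
updates = [
    {"id": 2, "name": "bob", "amount": 25.0},    # existing key: update
    {"id": 3, "name": "carol", "amount": 30.0},  # new key: insert
]

orders = merge(orders, updates, key="id", schema=SCHEMA)
orders = delete_where(orders, lambda row: row["amount"] < 15.0)

# A row whose amount is a string does not conform, so the write is refused.
try:
    merge(orders, [{"id": 4, "name": "dave", "amount": "forty"}], key="id", schema=SCHEMA)
    rejected = False
except TypeError:
    rejected = True

print(orders)
print(rejected)
```

In Delta itself these checks and merges run against the table’s metadata and transaction log rather than in-memory lists, so treat this purely as a mental model of the behavior.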
Delta Lake is really a game changer and I hope you educate yourself more and start using it in your organization. You’ll find a great training resource from the Databricks community at: https://academy.databricks.com/category/self-paced
Or reach out to us at 3Cloud. Our expert team and solution offerings can help your business with any Azure product or service, including Managed Services offerings. Contact us at 888-8AZURE.