Big Data

What is Delta Lake in Databricks?

If you’re not familiar with Delta Lake in Databricks, I’ll cover what you need to know here. Delta Lake is a technology developed by the original creators of Apache Spark. It’s designed to bring reliability to your data lakes by providing ACID transactions, handling metadata at scale, and unifying streaming and batch data processing.

Let’s begin with some of the challenges of data lakes:

  • Data lakes are notoriously messy as everything gets dumped there. Sometimes there’s no rhyme or reason for dumping data there beyond thinking we’ll need it at some later date.
  • Much of this mess comes from the sheer number of small files and mixed data types in the lake. Because many small files are never compacted, trying to read them in any shape or form is difficult, if not impossible.
  • Data lakes often contain bad or corrupted data files, so you can’t analyze the data without going back and essentially starting over.

This is where Delta Lake comes to the rescue! It delivers an open-source storage layer that brings ACID transactions to Apache Spark big data workloads. So, instead of the mess I described above, Delta Lake gives you a reliable layer over your data lake. It provides ACID transactions through a transaction log associated with each Delta table created in your data lake. This log records the history of everything ever done to that table or data set, which brings a high level of reliability and stability to your data lake.
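Here’s a minimal PySpark sketch of what that looks like in practice: creating a Delta table and inspecting its transaction log. The path /tmp/events and the table contents are hypothetical, and the two configs shown are only needed on a plain Spark install with the delta-spark package; on Databricks, Delta is already configured for you.

```python
from pyspark.sql import SparkSession

# These two configs enable Delta Lake on plain Spark with the delta-spark
# package; on Databricks they are already set for you.
spark = (SparkSession.builder
         .appName("delta-log-demo")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Write a small DataFrame as a Delta table. The _delta_log directory
# created next to the data files is the transaction log.
df = spark.range(0, 5).withColumnRenamed("id", "event_id")
df.write.format("delta").mode("overwrite").save("/tmp/events")

# Every commit is recorded in the log and can be inspected at any time.
spark.sql("DESCRIBE HISTORY delta.`/tmp/events`").show(truncate=False)
```

Each row returned by DESCRIBE HISTORY is one commit against the table: the operation, timestamp, and who ran it.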

Key Features of Delta Lake are:

  • ACID Transactions (Atomicity, Consistency, Isolation, Durability) – With Delta you don’t need to write any extra code; every transaction is written to the log automatically. This transaction log is the key, and it represents a single source of truth.
  • Scalable Metadata Handling – Handles terabytes or even petabytes of data with ease. Metadata is stored just like data, and you can display it with the DESCRIBE DETAIL command, which shows all the metadata associated with a table (see the first sketch after this list). Delta puts the full force of Spark behind your metadata.
  • Unified Batch & Streaming – There’s no longer a need for separate architectures to read a stream of data versus a batch of data, which overcomes the limitations of split streaming and batch systems. A Delta table is both a batch and a streaming source and sink. You can do concurrent streaming or batch writes to your table and it all gets logged, so it’s safe and sound in your Delta table (a streaming example follows this list).
  • Schema Enforcement – This is what makes Delta strong in this space: it enforces your schemas. If you put a schema on a Delta table and try to write data that doesn’t conform to it, Delta raises an error and rejects the write, protecting you from bad writes. The enforcement reads the schema as part of the metadata, checks every column and data type, and ensures that what you’re writing matches what the schema says your Delta table holds – no need to worry about writing bad data to your table (see the schema example below).
  • Time Travel (Data Versioning) – You can query an older snapshot of your data, roll back to a previous version, or audit how the data has changed (sketched below).
  • Upserts and Deletes – These operations are typically hard to do without something like Delta. Delta lets you do upserts, or merges, very easily. Merges work like SQL MERGE statements against your Delta table: you can merge data from another DataFrame into your table and perform updates, inserts, and deletes in one step. You can also run a regular update or delete with a predicate on a table – something that was almost unheard of in data lakes before Delta (see the merge example after this list).
  • 100% Compatible with Apache Spark
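First, the metadata. This is a minimal sketch that assumes the hypothetical /tmp/events Delta table and the spark session from the earlier example:

```python
# DESCRIBE DETAIL returns a one-row DataFrame of table metadata:
# format, location, number of files, size in bytes, and more.
spark.sql("DESCRIBE DETAIL delta.`/tmp/events`").show(truncate=False)
```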
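The same table can serve as a streaming source with no separate architecture. Another sketch against the hypothetical /tmp/events table; the console sink and checkpoint path are just for illustration:

```python
# Read the Delta table as a stream: new commits to the table arrive
# as new micro-batches in the stream.
stream = spark.readStream.format("delta").load("/tmp/events")

# Write the stream somewhere; a console sink keeps the demo self-contained.
query = (stream.writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/events_checkpoint")
         .start())
```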
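To see schema enforcement in action, try appending a DataFrame whose column type doesn’t match the table. A sketch under the same assumptions; Delta rejects the write with an error instead of silently corrupting the table:

```python
from pyspark.sql.utils import AnalysisException

# event_id is a string here, but the table stores it as a long.
bad_rows = spark.createDataFrame([("oops",)], ["event_id"])

try:
    bad_rows.write.format("delta").mode("append").save("/tmp/events")
except AnalysisException as err:
    print(f"Write rejected by schema enforcement: {err}")
```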
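Time travel is just a read option. A sketch, assuming the table has at least one commit (version 0):

```python
# Query the table as it looked at version 0; the timestampAsOf
# option works the same way with a point in time.
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("/tmp/events"))
v0.show()
```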
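Finally, upserts and deletes. This sketch uses the DeltaTable API from the delta-spark package; the matching condition and values are hypothetical:

```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/tmp/events")
updates = spark.createDataFrame([(3,), (99,)], ["event_id"])

# Upsert: update rows that match on event_id, insert the ones that don't.
(target.alias("t")
 .merge(updates.alias("u"), "t.event_id = u.event_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Delete with a predicate, directly against the table.
target.delete("event_id = 99")
```

Each of these operations lands as a new commit in the transaction log, which is exactly what makes the time travel and audit features above possible.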

Delta Lake is really a game changer, and I hope you’ll learn more about it and start using it in your organization. You’ll find a great training resource from the Databricks community at: https://academy.databricks.com/category/self-paced

Or reach out to us at 3Cloud. Our expert team and solution offerings can help your business with any Azure product or service, including Managed Services offerings. Contact us at 888-8AZURE or [email protected].

 


Hortonworks and Cloudera Have Merged: How It Impacts You


On October 3rd, Cloudera and Hortonworks announced their merger, a huge and highly significant announcement in the big data space. Big data has slowed down since the peak of its hype, leaving only two big players in the market: Cloudera and Hortonworks. Both companies are known for reducing the complexity of Hadoop and making it easier to implement a Hadoop ecosystem in your organization. They have packaged it up for IT departments that want a big data ecosystem without the hassle of managing open-source Hadoop along the way.


Building a Scalable Application with Azure Functions

Are you interested in learning how to leverage Azure Functions to create an app that can scale to demand? In a recent webinar, Sr. BI Consultant Joshuha Owen discusses and demos how to build a scalable application with Azure Functions and what to consider when scaling an application. The demo walks through not only building an application in Azure, but also creating a Power BI report that uses what the application produces.


4 Mistakes to Avoid in Your Data Analytics Projects

In speaking with customers who are using big data for analytics, we found clients who love it and some who don’t. The latter are struggling to the point of considering abandoning it. In this edition of Azure Every Day, I want to share a few of these challenges so you can identify them in your environment and hopefully avoid them.


The Big Reason You Should Think About Azure SQL DW

There are many reasons why your company should strongly consider using Azure SQL DW. However, the point of this post is to convince you – the developer, consultant, architect, and/or accidental DBA – that you should start getting familiar with Azure SQL DW for one simple and hefty reason: Big Data.
