In today’s post I’ll look at some considerations for choosing between Azure Blob Storage and Azure Data Lake Store when processing data to be loaded into a data warehouse. My basis here is a reference architecture that Microsoft published (see diagram below).
The diagram shows a typical pattern, and what caught my eye was that it suggests loading data from your source system into Azure Blob Storage. On a couple of projects, we are using Azure Data Lake Store instead of Azure Blob Storage. That got me thinking, so here are my thoughts on why you might choose one over the other.
In many respects the two are very similar, and the choice is often the classic ‘it depends’. In most cases you can’t go wrong either way. One difference I see comes down to the type of files each is good at working with.
I think Blob Storage is good at non-text files: database backups, photos, videos and audio files. Data Lake, I feel, is a bit better at large volumes of text data. More often than not, I would choose Data Lake Store if I’m loading text file data into my data warehouse. Of course, you can use Blob Storage for that too, but I feel it is better suited to the non-text data I mentioned above.
There are tradeoffs with both. One thing Azure Blob Storage currently has over Azure Data Lake Store is the availability of geo-redundant storage. You can build this yourself with Data Lake by setting up a job that periodically replicates your Data Lake Store data to another geographic region, but it’s not available out of the box as it is with Blob Storage. If geo-redundant storage is an important feature for you, then Blob Storage is the way to go.
Your data is secure in either Blob Storage or Data Lake, but what Data Lake has over Blob Storage is that it works with Azure Active Directory; Blob Storage currently does not. So, if you’re using Active Directory, it will integrate well with Data Lake from a security perspective. The bottom line is that both are secure; the difference is in how you access the data. With Blob Storage you access your data through keys instead of Active Directory.
Depending on your workload, having your data in Data Lake Store opens up some additional opportunities for analytics, specifically Azure Data Lake Analytics. This gives you the ability to use U-SQL, a SQL-like query language, to do some neat analytics on top of the data in your Data Lake Store, which you couldn’t do in Blob Storage.
How about pricing? Generally, Data Lake will be a bit more expensive, although the two are in close range of each other. Blob Storage has more pricing options, depending on things like how frequently you need to access your data (cool vs. hot access tiers). Data Lake is priced on volume, so the price changes as you reach certain tiers of volume.
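To make the two pricing shapes a little more concrete, here is a rough Python sketch. It assumes Data Lake storage is billed in graduated volume tiers and Blob Storage at a flat per-GB rate that depends on the access tier. Every rate and tier boundary below is a made-up placeholder, not a real Azure price, and the function names are my own. Transaction and data access charges are ignored entirely.

```python
# All rates and tier boundaries are hypothetical placeholders, NOT Azure prices.
# Hypothetical graduated tiers: (upper bound in GB, or None for "no limit", price per GB).
DATA_LAKE_TIERS = [(100.0, 0.04), (None, 0.03)]

# Hypothetical flat per-GB rates for the Blob Storage access tiers.
BLOB_RATES = {"hot": 0.02, "cool": 0.01}


def graduated_cost(volume_gb, tiers):
    """Monthly storage cost when each slice of volume is billed at its tier's rate."""
    cost = 0.0
    lower = 0.0  # lower bound of the tier currently being billed
    for upper, price in tiers:
        if upper is None or volume_gb < upper:
            # The remaining volume ends inside this tier.
            cost += (volume_gb - lower) * price
            break
        # This tier is completely filled; bill it and move to the next one.
        cost += (upper - lower) * price
        lower = upper
    return cost


def blob_cost(volume_gb, access_tier):
    """Monthly storage cost at a flat per-GB rate for the chosen access tier."""
    return volume_gb * BLOB_RATES[access_tier]


if __name__ == "__main__":
    for gb in (50, 150, 1000):
        print(gb, graduated_cost(gb, DATA_LAKE_TIERS),
              blob_cost(gb, "hot"), blob_cost(gb, "cool"))
```

Swap in the current rates from the Azure pricing pages for your region before drawing any conclusions; the point here is only the shape of the calculation.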
Either way, you can’t go wrong, but when Microsoft published this reference architecture, I thought its choice of Blob Storage was an interesting point to discuss. There are many ways to approach this, and I wanted to give my thoughts on using Azure Data Lake Store vs. Azure Blob Storage in a data warehousing scenario.
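If it helps, the considerations in this post can be collected into a small checklist. The Python sketch below is only my rules of thumb written down as code; the `Workload` fields and the function name are hypothetical, and it deliberately lists reasons for each store rather than picking a winner, since the answer really is ‘it depends’.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Workload:
    # Hypothetical fields, one per consideration discussed in this post.
    mostly_text_files: bool = True
    needs_geo_redundancy: bool = False
    uses_active_directory: bool = False
    wants_data_lake_analytics: bool = False


def reasons_for_each_store(w: Workload) -> Dict[str, List[str]]:
    """Collect the reasons from this post that favor each store for a workload."""
    reasons: Dict[str, List[str]] = {
        "Azure Blob Storage": [],
        "Azure Data Lake Store": [],
    }
    if w.mostly_text_files:
        reasons["Azure Data Lake Store"].append("large volumes of text data")
    else:
        reasons["Azure Blob Storage"].append("non-text files (backups, photos, video, audio)")
    if w.needs_geo_redundancy:
        reasons["Azure Blob Storage"].append("geo-redundant storage out of the box")
    if w.uses_active_directory:
        reasons["Azure Data Lake Store"].append("Azure Active Directory integration")
    if w.wants_data_lake_analytics:
        reasons["Azure Data Lake Store"].append("Azure Data Lake Analytics on top of the data")
    return reasons


if __name__ == "__main__":
    workload = Workload(mostly_text_files=True, needs_geo_redundancy=True)
    for store, why in reasons_for_each_store(workload).items():
        print(store, "->", why)
```

Pricing is left out on purpose, since the two are in close range of each other and the other factors will usually decide it.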
If you’d like to learn more about this topic or anything Azure related, we’re here to help. Click the link below or contact us. Our team is ready and excited to help you wherever you are on your Azure journey.