Wednesday, September 27, 2017

Microsoft Ignite 2017: Modernizing ETL with Azure Data Lake with @MikeDoesBigData

Mike has done extensive work on the U-SQL language and framework. The session will focus on modern Data Warehouse architectures and introduce Azure Data Lake.
A traditional Data Warehouse architecture has four foundational components: data sources, Extract-Transform-Load (ETL), the Data Warehouse itself, and BI and analytics.
Today many of the data sources are growing in volume, and current solutions do not scale. In addition, you are getting non-relational data from things like devices, web sensors and social channels.
A Data Lake allows you to store data as it is; it is essentially a very large, scalable file system. From there you can do analysis using Hadoop, Spark and R. A Data Lake is really designed for the questions you don’t know yet, while a Data Warehouse is designed for the questions you do.
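To make the "store it as it is, ask the question later" idea concrete, here is a minimal Python sketch. It is only an illustration, not anything shown in the session: the sample events, the field names and the `read_with_schema` helper are all made up for this example.

```python
import json

# Raw events kept exactly as they arrived, one JSON document per line.
# The records are made-up sample data with deliberately different shapes.
RAW_EVENTS = """\
{"device": "thermostat-1", "temp_c": 21.5}
{"user": "alice", "action": "like", "post": 42}
{"device": "thermostat-2", "temp_c": 19.0, "battery": 0.8}
"""


def read_with_schema(raw, fields):
    """Apply a schema at read time (hypothetical helper).

    Yields one tuple per record that contains every requested field and
    silently skips records that do not fit the question being asked.
    """
    for line in raw.splitlines():
        record = json.loads(line)
        if all(name in record for name in fields):
            yield tuple(record[name] for name in fields)


# A question nobody planned for when the data was stored:
# "what temperature did each device report?"
temps = list(read_with_schema(RAW_EVENTS, ["device", "temp_c"]))
print(temps)
```

Nothing had to be decided about the data's shape when it was written; the schema only appears when a question is asked.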
Azure Data Lake consists of a highly scalable storage area called the ADL Store. It is exposed through an HDFS-compatible REST API, which allows analytic solutions to sit on top and operate at scale.
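As a rough illustration of what "HDFS-compatible REST API" means in practice, the sketch below builds WebHDFS-style URLs for a store account. It assumes the `https://<account>.azuredatalakestore.net/webhdfs/v1/<path>?op=<OPERATION>` pattern, which should be checked against the ADL Store REST reference before you rely on it. The account name `contosoadls`, the file path and the `adls_webhdfs_url` helper are hypothetical, and no request is actually sent (a real call also needs an Azure Active Directory bearer token).

```python
from urllib.parse import quote, urlencode


def adls_webhdfs_url(account, path, op, **params):
    """Build a WebHDFS-style URL for an ADL Store account (hypothetical helper).

    Only string formatting happens here; nothing is sent over the network.
    The host pattern is an assumption to verify against the official docs.
    """
    query = urlencode({"op": op, **params})
    clean_path = quote(path.lstrip("/"))  # keeps "/" and escapes spaces etc.
    return f"https://{account}.azuredatalakestore.net/webhdfs/v1/{clean_path}?{query}"


# OPEN and LISTSTATUS are standard WebHDFS operation names.
open_url = adls_webhdfs_url("contosoadls", "/raw/2017/09/events.csv", "OPEN")
list_url = adls_webhdfs_url("contosoadls", "/raw", "LISTSTATUS")
print(open_url)
print(list_url)
```

Because the operations follow WebHDFS conventions, tools that already speak to HDFS over REST need little more than a different host name and credentials.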
Cloudera and Hortonworks are available from the Azure Marketplace. Microsoft's version of Hadoop is HDInsight. With HDInsight you pay for the cluster whether you use it or not.
Data Lake Analytics is a batch workload analytics engine designed to do analytics at very large scale. With ADL Analytics you pay only for the resources your jobs are running on, rather than spinning up, and paying for, an entire cluster as with HDInsight.
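The difference between the two billing models is easy to see with some back-of-the-envelope arithmetic. The sketch below is only an illustration: every rate and quantity in it is a placeholder chosen to make the math obvious, not an actual Azure price, and the function names are made up.

```python
def always_on_cluster_cost(hourly_rate, hours_provisioned):
    """Cluster model: you pay for every hour the cluster exists, busy or idle."""
    return hourly_rate * hours_provisioned


def per_job_cost(rate_per_unit_hour, units, job_hours, runs):
    """Per-job model: you pay for the units of parallelism a job uses, while it runs."""
    return rate_per_unit_hour * units * job_hours * runs


# Placeholder numbers, NOT real prices:
# a cluster at 4.00 per hour kept up for a 30-day month,
# versus a nightly 30-minute job on 10 units at 2.00 per unit-hour.
cluster = always_on_cluster_cost(hourly_rate=4.0, hours_provisioned=24 * 30)
jobs = per_job_cost(rate_per_unit_hour=2.0, units=10, job_hours=0.5, runs=30)
print(cluster, jobs)
```

With these made-up inputs the always-on cluster costs 2880.0 for the month and the nightly jobs cost 300.0. The gap narrows as the cluster's idle time shrinks, so a workload that keeps a cluster busy around the clock can favor the cluster model.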
You need to understand the Big Data pipeline and data flow in Azure. Data goes from ingestion into the Data Lake Store, and from there it moves into the visualization layer. In Azure you can move data with Azure Data Factory. You can also ingest through Azure Event Hubs.
Azure Data Factory is designed to move data from a variety of data stores to Azure Data Lake. For example, you can take data out of Amazon Redshift and move it to Azure Data Lake Store. Additional information can be found here:
U-SQL is the language framework that provides the scale-out capabilities. It scales out your custom code in .NET, Python and R over your Data Lake. It is called U because it unifies SQL seamlessly across structured and unstructured data.
Microsoft suggests that you query the data where it lives. U-SQL allows you to query and read/write data not just from your Azure Data Lake but also from storage blobs, SQL Server in Azure VMs, Azure SQL Database and Azure SQL Data Warehouse.
There are a few built-in cognitive functions available to you. You can install them in your Azure Data Lake to add cognitive capabilities to your queries.
