Over the past year, I’ve been working with the HDInsight team on making the Hadoop ecosystem available on Windows Server and Windows Azure. We built services that enable Azure users to quickly deploy elastic Hadoop clusters (based on Hortonworks’ Hadoop distribution package for Windows) on Windows Azure. By harnessing the parallel processing power of Hadoop, HDInsight clusters enable users to effectively analyze terabytes and petabytes of data stored in Azure Blob Storage.
We released an installation package that enables developers to quickly and easily install a single-node HDInsight cluster (the HDInsight Emulator) on their dev box.
We developed an open source SDK that includes PowerShell cmdlets (now integrated with Windows Azure PowerShell) and .NET APIs that make it easier to deploy and manage HDInsight clusters and run jobs against them.
To help you get started, this post demonstrates how to:
- Deploy and interact with an HDInsight cluster on Windows Azure.
- Install the HDInsight Emulator on a local dev box.
The HDInsight Service enables users to deploy Hadoop clusters on Azure. These clusters are used to analyze data stored in Azure Blob Storage. To make this work, Azure HDInsight clusters are configured to use the ASV (Azure Storage Vault) implementation of HDFS, which reads, writes and streams data via Azure Blob Storage instead of the local file system.
The practice of using Azure Blob Storage instead of local storage often raises questions regarding network latency and the loss of data locality. Luckily, HDInsight clusters and storage accounts are deployed on Azure Q10 infrastructure that features incredibly low networking overhead. As a result, for up to 50 worker nodes, reading from Azure Blob Storage is just as fast as reading from the local disk.
Storing the data in Azure storage instead of on the workers' local storage has many benefits. In addition to geo-replication and faster writes, the most obvious gain is that the data is not attached to the cluster. This enables you to create and delete clusters without needing to migrate the data. Brad Sarsfield and Denny Lee did a great job explaining ‘Why use Blob Storage with HDInsight on Azure?’.
Creating an HDInsight cluster
Once you have a Windows Azure subscription, deploying an HDInsight cluster is only a couple of clicks away. You can use the management portal to create a storage account (which will store the data to be processed by your HDInsight cluster) and an HDInsight cluster that will be associated with that account.
Once your cluster is ready, it will appear under the HDInsight tab.
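If you prefer scripting over the portal, the same two steps can be sketched with the Windows Azure PowerShell cmdlets. This is only a rough sketch: every name, the location and the cluster size below are placeholders I made up, it assumes your subscription is already selected in the session, and depending on the module version you may be prompted for the cluster's admin credentials.

```powershell
# All values below are hypothetical placeholders; replace them with your own.
$storageAccountName = "mystorageaccount"
$containerName      = "mycluster"
$clusterName        = "mycluster"
$location           = "West US"
$clusterNodes       = 4

# Create the storage account and fetch its primary access key.
New-AzureStorageAccount -StorageAccountName $storageAccountName -Location $location
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary

# Create the blob container that will back the cluster's default file system.
$context = New-AzureStorageContext -StorageAccountName $storageAccountName `
    -StorageAccountKey $storageAccountKey
New-AzureStorageContainer -Name $containerName -Context $context

# Provision the cluster and associate it with the storage account.
New-AzureHDInsightCluster -Name $clusterName -Location $location `
    -DefaultStorageAccountName "$storageAccountName.blob.core.windows.net" `
    -DefaultStorageAccountKey $storageAccountKey `
    -DefaultStorageContainerName $containerName `
    -ClusterSizeInNodes $clusterNodes
```

Provisioning takes a while, and the storage account should be in the same location as the cluster.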
Since your new HDInsight cluster uses its associated storage account as its distributed file system, you will notice that all the files you might be used to seeing in HDFS (if you have used Hadoop in a non-cloud environment) are stored in Blob Storage, under a container with the name of your cluster.
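To see those files for yourself, you can list the blobs in the cluster's container from Windows Azure PowerShell. A minimal sketch, assuming your subscription is already selected in the session; the storage account and container names are placeholders.

```powershell
# Hypothetical names: replace with your own storage account and cluster.
$storageAccountName = "mystorageaccount"
$containerName      = "mycluster"   # the container carries the cluster's name

# Look up the primary access key of the storage account.
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary

# Build a storage context and show the first 20 blobs in the container.
$context = New-AzureStorageContext -StorageAccountName $storageAccountName `
    -StorageAccountKey $storageAccountKey
Get-AzureStorageBlob -Container $containerName -Context $context |
    Select-Object -First 20 -Property Name, Length
```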
Interacting with an HDInsight cluster
Once Windows Azure PowerShell is installed, launch the Windows Azure PowerShell window.
Select your subscription:
PS C:\> $subscriptionName = "Visual Studio Ultimate with MSDN"
PS C:\> Select-AzureSubscription $subscriptionName
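If you are not sure what your subscription is called, you can list the subscriptions known to the current session first. This assumes the subscription details have already been imported into Windows Azure PowerShell.

```powershell
# List the names of the subscriptions registered in this PowerShell session.
Get-AzureSubscription | Select-Object -Property SubscriptionName
```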
Query for available clusters:
Once the appropriate subscription is selected, you can query for a list of your HDInsight clusters by running:
PS C:\> Get-AzureHDInsightCluster
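To inspect a single cluster instead of the whole list, you can pass its name and print all of its properties. The cluster name here is a placeholder.

```powershell
$clusterName = "mycluster"   # hypothetical cluster name

# Show every property of one cluster as a list.
Get-AzureHDInsightCluster -Name $clusterName | Format-List
```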
Run a 10GB GraySort (TeraGen/TeraSort/TeraValidate) job
Since hadoop-examples.jar comes with the HDInsight cluster, you can run any one of the jobs available in that examples package.
Follow the instructions here to run the GraySort mini benchmark, which generates 10GB of data, sorts it and validates the results. Works like a champ!
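For reference, here is roughly what the three stages look like when submitted from Windows Azure PowerShell. Treat it as a sketch under assumptions: the cluster name, the jar location and the data folders are placeholders of mine, the one-hour timeout is arbitrary, and the linked instructions remain the authoritative version.

```powershell
$clusterName = "mycluster"   # hypothetical cluster name

# teragen writes 100-byte rows, so 100,000,000 rows is roughly 10GB.
$rows      = "100000000"
$jar       = "/example/jars/hadoop-examples.jar"   # assumed location of the examples jar
$inputDir  = "/example/data/10GB-sort-input"       # assumed folder names
$outputDir = "/example/data/10GB-sort-out"
$validDir  = "/example/data/10GB-sort-validate"

# Define the three stages: generate, sort, validate.
$teragen = New-AzureHDInsightMapReduceJobDefinition -JarFile $jar `
    -ClassName "teragen" -Arguments $rows, $inputDir
$terasort = New-AzureHDInsightMapReduceJobDefinition -JarFile $jar `
    -ClassName "terasort" -Arguments $inputDir, $outputDir
$teravalidate = New-AzureHDInsightMapReduceJobDefinition -JarFile $jar `
    -ClassName "teravalidate" -Arguments $outputDir, $validDir

# Run the stages in order, waiting for each one before starting the next.
foreach ($definition in @($teragen, $terasort, $teravalidate)) {
    $job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $definition
    Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600 | Out-Null
    # Hadoop writes its progress log to standard error.
    Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $job.JobId -StandardError
}
```

Each stage reads the folder written by the previous one, which is why they have to run sequentially.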
The HDInsight Emulator is a single-node HDInsight deployment that allows developers to develop and debug jobs on their local development box. You can install the HDInsight Emulator via the Web Platform Installer from here. All the missing prerequisites will be detected and installed automatically!
Once the installation is complete, you will notice that all the supported Hadoop services will be running as Windows Services on your local machine.
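One quick way to check on them is from a PowerShell prompt. The filter below is a guess on my part: it assumes the emulator's services have "Hadoop" somewhere in their display names, so adjust the pattern if nothing shows up.

```powershell
# Assumption: the emulator's services include "Hadoop" in their display names.
Get-Service |
    Where-Object { $_.DisplayName -like "*Hadoop*" } |
    Format-Table -Property Name, DisplayName, Status -AutoSize
```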
You are good to go! Follow the instructions here to learn how to run MapReduce/Hive/Pig jobs on your local HDInsight cluster. You will notice that HDFS is configured as the default distributed file system. You can, however, change core-site.xml to point to your Azure Blob Storage account.
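As a rough illustration of that change, the fragment below shows the kind of properties involved. The property names are my assumption for the Hadoop version the emulator ships with, and the account, container and key values are placeholders, so check them against the emulator documentation before relying on them.

```xml
<!-- Sketch only: account, container and key are placeholders. -->
<!-- Assumed property: makes the blob container the default file system. -->
<property>
  <name>fs.default.name</name>
  <value>asv://mycontainer@mystorageaccount.blob.core.windows.net</value>
</property>
<!-- Assumed property: access key for the storage account named in the URI. -->
<property>
  <name>fs.azure.account.key.mystorageaccount.blob.core.windows.net</name>
  <value>YOUR-STORAGE-ACCOUNT-KEY</value>
</property>
```

The Hadoop services most likely need a restart before they pick up the new settings.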