Sunday, January 5, 2014

HDInsight Service (Hadoop on Windows Azure) – Hive Partitions

This post walks you through creating, loading, and querying a partitioned Hive table on HDInsight.

Follow the steps here to create an HDInsight cluster and to install Windows Azure PowerShell.

Upload data to HDInsight Cluster

Download the file w3c_2MB.txt to your local drive, say to ‘C:\w3c_2MB.txt’.

Since HDInsight uses Azure Blob Storage as its distributed file store (unlike non-cloud Hadoop clusters, which use the default HDFS implementation backed by the local file system), you can choose your preferred tool to upload data. We will use Windows Azure PowerShell to upload the sample data.

Open Windows Azure PowerShell.

Run the command below; you will be prompted to log in to your Azure account:

Add-AzureAccount

Paste the following script into the Windows Azure PowerShell console window to run it.

$subscriptionName ="<your-subscription>"
$storageAccountName = "<your-storage-name>"
$containerName = "<your-cluster-name>"

$fileName ="C:\w3c_2MB.txt"
$blobName = "user/hdp/myexample/w3c_2MB.txt"

# Get the storage account key
Select-AzureSubscription $subscriptionName
$storageaccountkey = Get-AzureStorageKey $storageAccountName | %{$_.Primary}

# Create the storage context object
$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageaccountkey

# Copy the file from local workstation to the Blob container        
Set-AzureStorageBlobContent -File $fileName -Container $containerName -Blob $blobName -context $destContext
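
To confirm the upload succeeded, you can list the blob you just wrote. A minimal sketch, reusing the $containerName, $blobName and $destContext variables defined above:

# Verify that the file landed in the container
Get-AzureStorageBlob -Container $containerName -Context $destContext -Blob $blobName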

Initialize Hive session in Windows Azure PowerShell

Since we are already using Windows Azure PowerShell, we will continue to use it to submit Hive jobs. Quite frankly, PowerShell is not the best tool for ad-hoc interaction with Hive. To debug Hive jobs, (as of now) it’s better to connect remotely to the cluster and use the built-in Hive CLI. Even for automation, it’s simpler to submit jobs via the .NET SDK. The Hive support in Windows Azure PowerShell can come in handy if you already have an automated PowerShell script and want to keep everything in that script.

Before you start running Hive queries via Windows Azure PowerShell (using the Invoke-Hive cmdlet), you need to select your cluster. Paste the following script into the Windows Azure PowerShell console window to run it.

$subscriptionName ="<your-subscription>" 
$clusterName = "<your-cluster-name>"

# Select the cluster
Select-AzureSubscription $subscriptionName
Use-AzureHDInsightCluster $clusterName 
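
To verify that the session is pointed at the right cluster, you can submit a trivial query (SHOW TABLES is standard HiveQL; on a fresh cluster the list will be empty):

# Sanity check: list the tables on the cluster
Invoke-Hive -Query "SHOW TABLES;"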

Create Non-Partitioned Hive Table

The script below creates a new Hive table named ‘w3c’ and loads it with the data that we uploaded to the blob in the previous section.

You should know that operations like create/delete execute faster when using New-AzureHDInsightHiveJobDefinition and Start-AzureHDInsightJob instead of Invoke-Hive (more details here). I chose to use Invoke-Hive for all the Hive queries in this post to keep things simple.

# HiveQL query
$queryString = "DROP TABLE w3c;" + 
               "CREATE TABLE w3c(
        logdate string,  logtime string,  c_ip string,  
        cs_username string,  s_ip string, 
        s_port string,  cs_method string, cs_uri_stem string,  
        cs_uri_query string,  sc_status int,  
        sc_bytes int,  cs_bytes int,  time_taken int,  
        cs_agent string,  cs_Referrer string)
        row format delimited
        fields terminated by ' ';"  + 
        "LOAD DATA INPATH 'myexample/w3c_2MB.txt' OVERWRITE INTO TABLE w3c";


# Submit the job to the cluster 
Invoke-Hive -Query $queryString


Since we created an internal (managed) table, the file is removed from its original location and moved into the Hive warehouse.
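
You can observe this move from PowerShell by listing the blobs under the table's warehouse folder. A sketch, assuming the default warehouse location 'hive/warehouse' and reusing the storage variables from the upload script:

# List the files Hive moved into the warehouse folder for the 'w3c' table
# (assumes the default warehouse path; adjust the prefix if your cluster differs)
Get-AzureStorageBlob -Container $containerName -Context $destContext -Prefix "hive/warehouse/w3c"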

Query the Non-Partitioned Hive Table

The script below counts the number of rows where the column cs_uri_stem equals ‘/Default.aspx’.

$queryString = "SELECT COUNT(*) FROM w3c WHERE cs_uri_stem='/Default.aspx';"

# Submit the job to the cluster 
Measure-Command { Invoke-Hive -Query $queryString }

Create Partitioned Hive Table

The script below creates a new, partitioned Hive table named ‘w3c_partitioned’. Notice that, compared to the table ‘w3c’, cs_uri_stem is not a regular column; it’s a partition column.

$queryString = "DROP TABLE w3c_partitioned;" + 
        "CREATE TABLE w3c_partitioned(
        logdate string, logtime string, c_ip string, 
        cs_username string,  s_ip string,  
        s_port string,  cs_method string,  cs_uri_query string,
        sc_status int,  sc_bytes int, cs_bytes int, time_taken int, 
        cs_agent string,  cs_Referrer string)
        PARTITIONED BY (cs_uri_stem string)
        row format delimited
        fields terminated by ' ';"

# Submit the job to the cluster 
Invoke-Hive -Query $queryString 
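
If you want to double-check the schema, DESCRIBE (standard HiveQL) lists the partition column separately from the regular columns:

# Show the table schema, including the partition column
Invoke-Hive -Query "DESCRIBE w3c_partitioned;"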

Insert Data into the Partitioned Table

The following script will populate the partitions with data selected from the table w3c.

Notice that:

  • We’re inserting the rows where the cs_uri_stem value equals ‘/Default.aspx’ into the partition cs_uri_stem='/Default.aspx'.
  • We’re inserting the rows where the cs_uri_stem value equals ‘/Info.aspx’ into the partition cs_uri_stem='/Info.aspx'.
  • We’re inserting the rows where the cs_uri_stem value equals ‘/UserService’ into the partition cs_uri_stem='/UserService'.

$queryString = "FROM w3c 
        INSERT OVERWRITE TABLE w3c_partitioned PARTITION (cs_uri_stem='/Default.aspx')
        SELECT w3c.logdate, w3c.logtime, w3c.c_ip, 
        w3c.cs_username, w3c.s_ip, 
        w3c.s_port, w3c.cs_method, w3c.cs_uri_query, 
        w3c.sc_status, w3c.sc_bytes, w3c.cs_bytes, w3c.time_taken, 
        w3c.cs_agent, w3c.cs_Referrer 
        WHERE cs_uri_stem='/Default.aspx';"


# Submit the job to the cluster 
Invoke-Hive -Query $queryString 

$queryString = "FROM w3c 
        INSERT OVERWRITE TABLE w3c_partitioned PARTITION (cs_uri_stem='/Info.aspx')
        SELECT w3c.logdate, w3c.logtime, w3c.c_ip, 
        w3c.cs_username, w3c.s_ip, 
        w3c.s_port, w3c.cs_method, w3c.cs_uri_query, 
        w3c.sc_status, w3c.sc_bytes, w3c.cs_bytes, w3c.time_taken, 
        w3c.cs_agent, w3c.cs_Referrer 
        WHERE cs_uri_stem='/Info.aspx';"


# Submit the job to the cluster 
Invoke-Hive -Query $queryString 

$queryString = "FROM w3c 
        INSERT OVERWRITE TABLE w3c_partitioned PARTITION (cs_uri_stem='/UserService')
        SELECT w3c.logdate, w3c.logtime, w3c.c_ip, 
        w3c.cs_username, w3c.s_ip, 
        w3c.s_port, w3c.cs_method, w3c.cs_uri_query, 
        w3c.sc_status, w3c.sc_bytes, w3c.cs_bytes, w3c.time_taken, 
        w3c.cs_agent, w3c.cs_Referrer 
        WHERE cs_uri_stem='/UserService';"


# Submit the job to the cluster 
Invoke-Hive -Query $queryString 
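
As a side note, the three static inserts above can be collapsed into a single dynamic-partition insert, where Hive derives the target partition from the last column in the SELECT list. A sketch, assuming dynamic partitioning is enabled via the two SET statements:

$queryString = "SET hive.exec.dynamic.partition=true;" +
        "SET hive.exec.dynamic.partition.mode=nonstrict;" +
        "INSERT OVERWRITE TABLE w3c_partitioned PARTITION (cs_uri_stem)
        SELECT logdate, logtime, c_ip,
        cs_username, s_ip,
        s_port, cs_method, cs_uri_query,
        sc_status, sc_bytes, cs_bytes, time_taken,
        cs_agent, cs_Referrer, cs_uri_stem
        FROM w3c;"

# Submit the job to the cluster
Invoke-Hive -Query $queryString

Unlike the static inserts, this creates a partition for every distinct cs_uri_stem value in w3c, not just the three listed above.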

 


Looking into the partitioned table’s folder in the Hive warehouse, you should notice that a dedicated subfolder has been added for each partition.
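
You can also ask Hive directly for the partition list (standard HiveQL):

# List the partitions of the table
Invoke-Hive -Query "SHOW PARTITIONS w3c_partitioned;"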


Query the Partitioned Hive Table

The script below queries the table w3c_partitioned for the number of rows where cs_uri_stem='/Default.aspx'.

$queryString = "SELECT COUNT(*) FROM w3c_partitioned WHERE cs_uri_stem='/Default.aspx';"

# Submit the job to the cluster 
Invoke-Hive -Query $queryString 


Since the table is partitioned by the column cs_uri_stem, instead of scanning the entire dataset, Hive scans only the partition '/Default.aspx'.
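
If you want to see the pruning for yourself, EXPLAIN EXTENDED (standard HiveQL) prints the query plan, including the partition paths the job will actually read; note that the output is verbose:

# Show the query plan, including which partitions are scanned
Invoke-Hive -Query "EXPLAIN EXTENDED SELECT COUNT(*) FROM w3c_partitioned WHERE cs_uri_stem='/Default.aspx';"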

Delete the Tables

$queryString = "Drop table w3c;Drop table w3c_partitioned;" 

# Submit the job to the cluster 
Invoke-Hive -Query $queryString
