Sunday, May 11, 2014

Testing in Production – Benefits, Risks and Mitigations

Testing in Production (TiP) is the most important mind-shift required for building and operating a successful service at scale. This post outlines the benefits of Testing in Production, walks through the methodologies and explains the practices that can be applied to mitigate the associated risks.

Abstract

As your service grows and becomes more complex, it becomes increasingly difficult to mimic the production environment and predict how the service will be configured and used by users. For these reasons, extensive up-front testing becomes a less and less effective tool for assessing quality, and Testing in Production becomes a necessity.

How much testing should we do in production? Well, the better we get at mitigating the associated risks, the more we can and should do, starting from building telemetry pipelines and running Synthetic Transactions, all the way to Load Testing and Fault Injection in production.

The extra load that we put on production and the habit of presenting users with less (up-front) tested code don’t come without risk: downtime, disappointed customers, SLA breaches, etc. Luckily, since we are building a cloud service, we can control the deployment, change configuration to turn features on/off, and push (or roll back) new bits whenever we want. Testing in Production methodologies include practices like Synthetic Transactions, Canary Deployment, Controlled Test Flight and Data Driven Quality that have proven very useful in mitigating these risks.

Implementing these methodologies is not easy, to say the least. To make this all work, the engineering team must own the production environment (DevOps). The service needs to be separated into small, independently deployable, versioned, backward-compatible and patchable services (a monolithic codebase is discouraged). Finally, the development process must be efficient enough to support rapid yet high-quality releases.

Life Without TiP

Engineering teams that don’t put their testing and engineering focus in production usually find that 1) it takes a lot of time until they detect failures in production (aka MTTD - Mean Time to Detect) and 2) that it takes a lot of time until they are able to develop a fix and deploy it with confidence (aka MTTR – Mean Time to Recover).

MTTF stands for Mean Time to Failure, which is the average amount of time during which the service stays up between failures.

image

Since the formula for availability is Up Time (MTTF) / Total Time (MTTF + MTTD + MTTR), it’s apparent that in order to maximize availability we need to either increase MTTF (i.e. aim for zero failures) or reduce MTTD and MTTR (combined, they are also referred to as MTTH - Mean Time to Heal).
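
To make the formula concrete, here is a minimal sketch with purely hypothetical numbers; even a modest detection and recovery time pulls availability below 99%:

    using System;

    class AvailabilityExample
    {
        static void Main()
        {
            const double mttf = 720; // hours the service stays up between failures (hypothetical)
            const double mttd = 2;   // hours until the failure is detected
            const double mttr = 6;   // hours until a fix is deployed and the service recovers

            // Availability = Up Time (MTTF) / Total Time (MTTF + MTTD + MTTR)
            double availability = mttf / (mttf + mttd + mttr);
            Console.WriteLine("Availability: {0:P2}", availability); // ~98.90%
        }
    }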

A popular way to increase MTTF is by running tests in test labs (a scaled-down version of the production environment). The problem with lab tests is that they require a significant engineering investment, and they tend to fail, most of the time on false positives (test bugs). It’s also difficult to predict how users will use the system, so automated test cases can only get us so far. Lastly, most production environments are very complex; it’s hard (if not impossible) to simulate the configuration, network, load balancers and other (3rd party) hardware and software that make up the production environment.

Since we can’t predict how users will use our service and we can’t mimic the production environment, we practically have no choice: we must Test in Production. However, we also must acknowledge the associated risks and come up with the right mitigations.

The Twist (Engineering Investment)

image

The first thing to change when shifting to TiP is to increase the investment in dev testing, which includes unit testing and continuous integration. These unit tests run in isolation and don’t require a lab. Since they run fast and execute more frequently, they help to improve the so-called ‘dev inner loop’, which means that we get faster feedback on the quality of a given change.

By applying TiP methodologies like Canary Deployment (covered in the next chapter), and with our improved dev inner loop, we can push small and targeted changes directly to the Pre-Production (Dogfood/Beta) environment, potentially bypassing the traditional up-front testing in the lab. The Pre-Production environment is deployed in our production data center. It is used by the engineering team and by internal users. The main tool to assess quality in this environment is Data Mining. In addition to the traffic generated by the real users, we can run Synthetic Tests (potentially re-using the end-to-end lab tests) to trigger new scenarios.

We can use the time that we saved to invest in getting more meaningful insights from the data that we collect, insights that will allow us to assess quality and detect the bugs that impact customers (the ones that actually need fixing). We can also invest more in synthetic tests, alerting, noise reduction, auto-healing, etc.

Methodologies

image 

TiP methodologies are divided into Passive and Active. The Passive methodologies include Data Mining, which is used to assess quality and measure performance. The Active ones are obviously riskier; they include Synthetic Tests and Load Testing.

We separate the methodologies into Crawl, Walk and Run. Crawl methodologies are lower risk. Walk and Run methodologies are riskier and require the tools that we built while crawling. We can use this system to start small and pick the low-hanging fruit, and as we get better, we can mitigate the risks associated with the Walk and Run methodologies.

Data Driven Quality (Passive, Crawl)

The traditional way to assess quality is via manual testing or automated tests. Data Driven Quality is about crawling through the data that the service generates (logs, counters, etc.) in order to assess quality. To enable this, we need to instrument the code such that our service will produce meaningful data. Alan Page does a great job of explaining this mind shift.

Here’s the example from Alan’s blog: say that we are building an AppStore service. The service enables users to locate an application in the store, download it and install it. The traditional approach would be to use test automation to verify the happy path (can I find an app, download and install it?), and to test the failure cases by injecting failures into the system. We would typically use this test automation to assess the quality of the service in order to decide if we can go to production.

The Data Driven approach suggests 1) instrumenting the code so that the service logs all the important events, including failure events, and 2) mining the log data that the service generates to assess quality.
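
As a minimal sketch of what such instrumentation might look like (the event names and the logging sink below are hypothetical, not taken from Alan’s post):

    using System;

    // A hypothetical AppStore service instrumented to emit the events we will later mine.
    public class AppStoreService
    {
        public void DownloadAndInstall(string appId)
        {
            Log("AppDownloadStarted", appId);
            try
            {
                Download(appId);
                Install(appId);
                Log("AppInstallSucceeded", appId);
            }
            catch (Exception exception)
            {
                // Failures are first-class data points; they feed the same pipeline.
                Log("AppInstallFailed", appId, exception.Message);
                throw;
            }
        }

        private static void Log(string eventName, string appId, string details = null)
        {
            // In a real service this would go to a telemetry pipeline; Console is a stand-in.
            Console.WriteLine("{0:o}\t{1}\t{2}\t{3}", DateTime.UtcNow, eventName, appId, details);
        }

        private void Download(string appId) { /* call the store back-end */ }
        private void Install(string appId) { /* install on the device */ }
    }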

Using the appropriate tools, we can analyze the data that’s coming from our preproduction and production environments to generate charts that look like this:

image

If we can ship new bits to our preproduction environment frequently enough (say, every day), we can use the analysis results to decide if the bits are of high enough quality to go to production.

Canary Deployment

Canary deployment is about deploying new code to a small subset of the machines in production, verifying that the new bits didn’t cause regressions (in functionality or performance), and slowly increasing the exposure of the new bits to the rest of the machines in production.

By limiting the exposure to the new bits, we can minimize the impact of failures and can afford to release new bits with less up-front testing.
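
A minimal sketch of the rollout loop (the exposure stages and the health check below are hypothetical): widen the exposure only while the canary machines stay healthy, otherwise roll back.

    using System;

    // A hypothetical canary rollout loop: widen exposure only while the canary stays healthy.
    public class CanaryRollout
    {
        private static readonly int[] ExposureStages = { 1, 5, 25, 100 }; // percent of machines

        public static void Run(Func<bool> isHealthy, Action<int> deployToPercent)
        {
            foreach (int percent in ExposureStages)
            {
                deployToPercent(percent);
                if (!isHealthy())
                {
                    Console.WriteLine("Regression detected at {0}% exposure, rolling back.", percent);
                    deployToPercent(0); // roll the new bits back everywhere
                    return;
                }
                Console.WriteLine("{0}% of the machines look healthy, continuing.", percent);
            }
        }
    }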

image

Controlled Test Flight

With Controlled Test Flight, we separate the users into two or more groups and assign a ‘flight’ to each group via configuration. When a user issues a request, the service may route the user to a different experience depending on the flight to which the user is assigned.

Controlled Test Flight helps to validate that new code works well in production, without exposing all the users to the new code.

Controlled Test Flight is often used in combination with Canary Deployment. We use Canary Deployment to deploy new bits to a subset of the machines, and we use Controlled Test Flight to route only test users to the new bits. We trigger the important scenarios using the test users and verify that the new bits work as expected. If they do (the Canary didn’t die…), we release the bits to the rest of the machines and complete the experiment.
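
Here is a minimal sketch of the routing piece (the flight names and the configuration source are hypothetical): the service looks up the flight assigned to the user and picks an experience accordingly.

    using System.Collections.Generic;

    // A hypothetical flight router: configuration maps users to a flight,
    // and the service picks an experience per request based on that assignment.
    public class FlightRouter
    {
        private readonly IDictionary<string, string> userToFlight;

        public FlightRouter(IDictionary<string, string> userToFlight)
        {
            this.userToFlight = userToFlight; // e.g. loaded from a configuration store
        }

        public string GetExperience(string userId)
        {
            string flight;
            if (userToFlight.TryGetValue(userId, out flight) && flight == "NewBits")
            {
                return "new-experience";   // test users are routed to the canary machines
            }
            return "current-experience";   // everyone else keeps the existing behavior
        }
    }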

image

A/B Testing (Experimentation for Design)

A/B Testing is very similar to Controlled Test Flight in the sense that in both we divide the users into groups and present each group with a different experience. The difference is in the intent. With Controlled Test Flight we try to verify that our new bits work as expected; essentially, we check if we built it right. With A/B Testing we experiment with multiple experiences in order to make sure that we build the right thing.

With A/B Testing we divide the users into two groups. We present each group with a different experience, analyze the user interaction patterns for each group, and based on the results, we pick the best experience. Using this system we can detect ineffective features and cut our losses early, and we can detect successful features and increase the investment accordingly.

In the example below we try to determine which user experience will yield more signups. We present 50% of the users with a UI in which the Signup button is on the right-hand side. The remaining 50% see a UI in which the Signup button is on the left-hand side. We run the experiment for a week, calculate the click rate for each UI, and based on the results, choose the UI that yielded more clicks.
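
A minimal sketch of the decision step (the counts below are made up): compute the click rate per variant and pick the one that performed better.

    using System;

    // Hypothetical results of a one-week A/B experiment on the Signup button placement.
    class AbTestExample
    {
        static void Main()
        {
            const int impressionsA = 50000, clicksA = 1250; // A: Signup button on the right
            const int impressionsB = 50000, clicksB = 1600; // B: Signup button on the left

            double rateA = (double)clicksA / impressionsA;
            double rateB = (double)clicksB / impressionsB;

            Console.WriteLine("A: {0:P2}, B: {1:P2}", rateA, rateB);
            Console.WriteLine("Winner: {0}", rateB > rateA ? "B (left)" : "A (right)");
            // In practice we would also check statistical significance before declaring a winner.
        }
    }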

image

Synthetic Transaction Monitoring

Synthetic tests are automated tests running against the production instances. They can be divided into two groups: API Tests and User Scenario Tests.

API Tests usually run against every component that exposes a public API. In the example below, the service can be used via PowerShell, which calls a region-wide API managed by Azure Traffic Manager. Traffic Manager routes the request to one of the FE (front-end) services, which in turn uses a Highly Available Table Store service.

Since each one of the layers exposes a public API, we run Synthetic Tests against each API. This way, if a failure surfaces in an upper layer, we can peel the onion and find the faulty layer.
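
For illustration, here is a minimal sketch of such a layered synthetic test (the health-check URLs are hypothetical): probe each layer’s public API independently so that a failure can be pinned to the layer that caused it.

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    // A hypothetical layered synthetic test: probe each public API independently.
    class SyntheticApiTests
    {
        static void Main()
        {
            ProbeAllLayers().Wait();
        }

        static async Task ProbeAllLayers()
        {
            var layers = new[]
            {
                Tuple.Create("Traffic Manager", "https://myservice.trafficmanager.net/api/health"),
                Tuple.Create("FE service", "https://myservice-region1.cloudapp.net/api/health"),
                Tuple.Create("Table Store", "https://myservice-store.cloudapp.net/api/health")
            };

            using (var client = new HttpClient())
            {
                foreach (var layer in layers)
                {
                    try
                    {
                        HttpResponseMessage response = await client.GetAsync(layer.Item2);
                        Console.WriteLine("{0}: {1}", layer.Item1, response.StatusCode);
                    }
                    catch (Exception exception)
                    {
                        // A failure here points at this layer (or the one below it).
                        Console.WriteLine("{0}: FAILED ({1})", layer.Item1, exception.Message);
                    }
                }
            }
        }
    }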

image

User Scenario Tests are about testing the service through the same interfaces used by the users. If the users interact with the service via a browser, we run the tests via a browser, and so on.

image

Even for mature services where the important scenarios are already triggered by real users, Synthetic Tests are useful for triggering new scenarios that are not yet discoverable by real users. In addition, they keep the service busy when there’s low engagement from real users.

Friday, May 9, 2014

Azure Table Storage: Writing data in a batch

Azure Table Storage supports writing items in a batch, as long as 1) all the items are in the same partition and 2) the batch is no bigger than 100 items (if it is, you get ‘unexpected response code for operation : 0’).

If you have a big list of items spanning different partitions, you will have to write the logic to split the items by partition and send up to 100 items at a time. This post provides a suggested implementation.

The code+tests can be downloaded from here.

Here’s the main workflow; it does the following:

  1. Create a new Azure table (in the Storage Emulator).
  2. Create two groups of 150 items, each group in a different partition.
  3. Use the BatchCalculator class to split the items into batches.
  4. Write each batch to the table.
        static void Main(string[] args)
        {
            // Make sure that Azure Storage Emulator is running
            
            Console.WriteLine("Creating new table");
            CloudTable table = CreateTable();
            try
            {
                const int numberOfItemsInPartition = 150;
                IEnumerable<ITableEntity> items1 = GetNewPartitionData(numberOfItemsInPartition);
                IEnumerable<ITableEntity> items2 = GetNewPartitionData(numberOfItemsInPartition);

                var all = new List<ITableEntity>(items1);
                all.AddRange(items2);

                Console.WriteLine("Writing {0} items", all.Count);
                BatchDiagnostics batchDiagnostics = WriteBatch(table, all).Result;

                Console.WriteLine("Diagnostic:" + batchDiagnostics);
            }
            catch(Exception exception)
            {
                Console.WriteLine(exception);
            }
            finally
            {
                Console.ReadLine();
                Console.WriteLine("Deleting table");
                table.Delete();
            }
        }

        private static CloudTable CreateTable()
        {
            CloudStorageAccount storageAccount = CloudStorageAccount.DevelopmentStorageAccount;

            // Create the table client.
            CloudTableClient tableClient = storageAccount.CreateCloudTableClient();

            // Create the table if it doesn't exist.
            var table = tableClient.GetTableReference("batchtesttable");
            table.CreateIfNotExists();

            return table;
        }

        private static IEnumerable<ITableEntity> GetNewPartitionData(int numberOfItems)
        {
            var list = new List<ITableEntity>();

            Guid partition = Guid.NewGuid();
            for (int i = 0; i < numberOfItems; i++)
            {
                list.Add(new MyEntity()
                {
                    SomeData = "Data",
                    RowKey = i.ToString(CultureInfo.InvariantCulture),
                    PartitionKey = partition.ToString()
                });
            }
            return list;
        }

        private static async Task<BatchDiagnostics> WriteBatch<T>(CloudTable table, IEnumerable<T> entities) where T : ITableEntity
        {
            BatchDiagnostics diagnostics;
            IEnumerable<TableBatchOperation> batches = BatchCalculator.GetBatches(entities, out diagnostics);

            var tasks = batches.Select(table.ExecuteBatchAsync);
            await Task.WhenAll(tasks);

            return diagnostics;
        }
    }

    public class MyEntity : TableEntity
    {
        // Must be a public property (not a field) so that TableEntity can serialize it.
        public string SomeData { get; set; }
    }

Here’s the implementation of BatchCalculator. This class is responsible for splitting the items into groups, where each group contains no more than 100 items and all the items in a group belong to the same partition.

    public class BatchCalculator
    {
        const int batchMaxSize = 100;

        public static IEnumerable<TableBatchOperation> GetBatches<T>(IEnumerable<T> entities, out BatchDiagnostics diagnostics) where T : ITableEntity
        {
            var list = new List<TableBatchOperation>();

            IGrouping<string, T>[] partitionGroups = entities.GroupBy(arg => arg.PartitionKey).ToArray();
            foreach (IGrouping<string, T> entitiesGroupedByPartition in partitionGroups)
            {
                T[] groupList = entitiesGroupedByPartition.ToArray();
                int pointer = batchMaxSize;
                T[] items = groupList.Take(pointer).ToArray();
                while (items.Any())
                {
                    var tableBatchOperation = new TableBatchOperation();
                    foreach (var item in items)
                    {
                        tableBatchOperation.Add(TableOperation.InsertOrReplace(item));
                    }
                    list.Add(tableBatchOperation);
                    items = groupList.Skip(pointer).Take(batchMaxSize).ToArray();
                    pointer += batchMaxSize;
                }
            }

            diagnostics = new BatchDiagnostics(partitionGroups.Length, list.Count);
            return list;
        }
    }

    public class BatchDiagnostics
    {
        private readonly int partitions;
        private readonly int batches;

        public BatchDiagnostics(int partitions, int batches)
        {
            this.partitions = partitions;
            this.batches = batches;
        }

        public int Partitions
        {
            get { return partitions; }
        }

        public int Batches
        {
            get { return batches; }
        }

        public override string ToString()
        {
            return string.Format("Partitions: {0}, Batches: {1}", partitions, batches);
        }
    }

HTA! Aviad

Sunday, January 5, 2014

HDInsight Service (Hadoop on Windows Azure) – Hive Partitions

This post will walk you through the process of creating, loading and querying a partitioned Hive table via HDInsight.

Follow the steps here to create an HDInsight cluster and to install Windows Azure PowerShell.

Upload data to HDInsight Cluster

Download the file w3c_2MB.txt to your local drive, say to ‘C:\w3c_2MB.txt’

Since HDInsight uses Azure Blob Storage as its distributed file storage (unlike non-cloud Hadoop clusters that use the default HDFS implementation, which is based on the local file system), you can choose your preferred tool to upload data. We will use Windows Azure PowerShell to upload the sample data.

Open Microsoft Azure PowerShell

Run the command below; you will be prompted to log in to your Azure account:

Add-AzureAccount

Paste the following script into the Windows Azure PowerShell console window to run it.

$subscriptionName ="<your-subscription>"
$storageAccountName = "<your-storage-name>"
$containerName = "<your-cluster-name>"

$fileName ="C:\w3c_2MB.txt"
$blobName = "user/hdp/myexample/w3c_2MB.txt"

# Get the storage account key
Select-AzureSubscription $subscriptionName
$storageaccountkey = get-azurestoragekey $storageAccountName | %{$_.Primary}

# Create the storage context object
$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageaccountkey

# Copy the file from local workstation to the Blob container        
Set-AzureStorageBlobContent -File $fileName -Container $containerName -Blob $blobName -context $destContext
image 

Initialize Hive session in Windows Azure PowerShell

Since we are already using Windows Azure PowerShell, we will continue to use it to submit Hive jobs. Quite frankly, PowerShell is not the best tool for ad-hoc interaction with Hive. To debug Hive jobs, (as of now) it’s better to remote connect to the cluster and use the built-in Hive CLI. Even for automation, it’s simpler to submit jobs via the .NET SDK. The Hive support in Windows Azure PowerShell might come in handy if you already have an automated PS script and you want to keep everything in that script.

Before you start running Hive queries via Windows Azure PowerShell (using the Invoke-Hive cmdlet), you need to select your cluster. Paste the following script into the Windows Azure PowerShell console window to run it.

$subscriptionName ="<your-subscription>" 
$clusterName = "<your-cluster-name>"

# Select the cluster
Select-AzureSubscription $subscriptionName
Use-AzureHDInsightCluster $clusterName 

Create Non-Partitioned Hive Table

The script below will create a new Hive table named ‘w3c’ and load it with the data that we uploaded to the blob in the previous section.

You should know that operations like create/delete execute faster when using New-AzureHDInsightHiveJobDefinition and Start-AzureHDInsightJob instead of Invoke-Hive (more details here). I chose to use Invoke-Hive for all the Hive queries in this post to keep things simple.

# HiveQL query
$queryString = "DROP TABLE w3c;" + 
               "CREATE TABLE w3c(
        logdate string,  logtime string,  c_ip string,  
        cs_username string,  s_ip string, 
        s_port string,  cs_method string, cs_uri_stem string,  
        cs_uri_query string,  sc_status int,  
        sc_bytes int,  cs_bytes int,  time_taken int,  
        cs_agent string,  cs_Referrer string)
        row format delimited
        fields terminated by ' ';"  + 
        "LOAD DATA INPATH 'myexample/w3c_2MB.txt' OVERWRITE INTO TABLE w3c";


# Submit the job to the cluster 
Invoke-Hive -Query $queryString

image

Since we created an internal (managed) table, the file will be deleted from its original location and moved to the Hive warehouse.

Query the Non-Partitioned Hive Table

The script below will count the number of rows where the column cs_uri_stem equals ‘/Default.aspx’.

$queryString = "SELECT COUNT(*) FROM w3c WHERE cs_uri_stem='/Default.aspx';"

# Submit the job to the cluster 
Measure-Command { Invoke-Hive -Query $queryString }
image 

Create Partitioned Hive Table

The script below will create a new, partitioned Hive table named ‘w3c_partitioned’. Notice that, compared to the table ‘w3c’, cs_uri_stem is not a regular column; it’s a partition column.

$queryString = "DROP TABLE w3c_partitioned;" + 
        "CREATE TABLE w3c_partitioned(
        logdate string, logtime string, c_ip string, 
        cs_username string,  s_ip string,  
        s_port string,  cs_method string,  cs_uri_query string,
        sc_status int,  sc_bytes int, cs_bytes int, time_taken int, 
        cs_agent string,  cs_Referrer string)
        PARTITIONED BY (cs_uri_stem string)
        row format delimited
        fields terminated by ' ';"

# Submit the job to the cluster 
Invoke-Hive -Query $queryString 

Insert data into the Partitioned Table

The following script will populate the partitions with data selected from the table w3c.

Notice that:

  • We’re inserting the rows where the cs_uri_stem cell equals ‘/Default.aspx’ into the partition cs_uri_stem='/Default.aspx'.
  • We’re inserting the rows where the cs_uri_stem cell equals ‘/Info.aspx’ into the partition cs_uri_stem='/Info.aspx'.
  • We’re inserting the rows where the cs_uri_stem cell equals ‘/UserService’ into the partition cs_uri_stem='/UserService'.
$queryString = "FROM w3c 
        INSERT OVERWRITE TABLE w3c_partitioned PARTITION (cs_uri_stem='/Default.aspx')
        SELECT w3c.logdate, w3c.logtime, w3c.c_ip, 
        w3c.cs_username, w3c.s_ip, 
        w3c.s_port, w3c.cs_method, w3c.cs_uri_query, 
        w3c.sc_status, w3c.sc_bytes, w3c.cs_bytes, w3c.time_taken, 
        w3c.cs_agent, w3c.cs_Referrer 
        WHERE cs_uri_stem='/Default.aspx';"


# Submit the job to the cluster 
Invoke-Hive -Query $queryString 

$queryString = "FROM w3c 
        INSERT OVERWRITE TABLE w3c_partitioned PARTITION (cs_uri_stem='/Info.aspx')
        SELECT w3c.logdate, w3c.logtime, w3c.c_ip, 
        w3c.cs_username, w3c.s_ip, 
        w3c.s_port, w3c.cs_method, w3c.cs_uri_query, 
        w3c.sc_status, w3c.sc_bytes, w3c.cs_bytes, w3c.time_taken, 
        w3c.cs_agent, w3c.cs_Referrer 
        WHERE cs_uri_stem='/Info.aspx';"


# Submit the job to the cluster 
Invoke-Hive -Query $queryString 

$queryString = "FROM w3c 
        INSERT OVERWRITE TABLE w3c_partitioned PARTITION (cs_uri_stem='/UserService')
        SELECT w3c.logdate, w3c.logtime, w3c.c_ip, 
        w3c.cs_username, w3c.s_ip, 
        w3c.s_port, w3c.cs_method, w3c.cs_uri_query, 
        w3c.sc_status, w3c.sc_bytes, w3c.cs_bytes, w3c.time_taken, 
        w3c.cs_agent, w3c.cs_Referrer 
        WHERE cs_uri_stem='/UserService';"


# Submit the job to the cluster 
Invoke-Hive -Query $queryString 

 

image

Looking into the partitioned table in the Hive warehouse, you should notice that a dedicated folder has been created for each partition.

image

Query the Partitioned Hive Table

The script below will query the table w3c_partitioned for the number of rows where cs_uri_stem='/Default.aspx'.

$queryString = "SELECT COUNT(*) FROM w3c_partitioned WHERE cs_uri_stem='/Default.aspx';"

# Submit the job to the cluster 
Invoke-Hive -Query $queryString 

image

Since the table is partitioned by the column cs_uri_stem, Hive scans only the partition '/Default.aspx' instead of the entire data set.

Delete the tables  

$queryString = "Drop table w3c;Drop table w3c_partitioned;" 

# Submit the job to the cluster 
Invoke-Hive -Query $queryString

Friday, January 3, 2014

HDInsight Emulator – Using Sqoop to import/export data from/to SQL Server Express

This post will walk you through the process of importing/exporting data from/to SQL Server Express via Sqoop.

Create dev box in the cloud (optional)

If you don't want to install the HDInsight Emulator on your development machine, you can create a VM in the cloud via the Azure IaaS Virtual Machine service (given that you have an Azure subscription).

Open the Azure Management Portal, click on the Virtual Machine tab, click ‘New’ and select your preferred version of Visual Studio.

image

If you don’t need Visual Studio, you can create a VM with Windows Server instead.

image

Once your VM is ready, it will appear as ‘Running’ in the Virtual Machine Tab. You can Remote Connect to the VM using the link in the portal.

image

For more details on Azure VMs: in this post, Scott Hanselman shows how he uses the Visual Studio image to get work done from his Surface 2 tablet.

Install HDinsight Emulator

The HDInsight Emulator is a single-node HDInsight deployment that allows developers to develop and debug jobs on their local development box. You can install the HDInsight Emulator via the Web Platform Installer from here. All the missing prerequisites will be detected and installed automatically!

image

Once the installation is complete, you will notice that all the supported Hadoop services will be running as Windows Services on your local machine.


Install and configure SQL Server Express

If you have a full SQL Server instance installed on your machine, you can skip to the next step. If you want to use the free SQL Express instance, some extra configuration is needed. Here’s how to install and configure SQL Express to make it work with Sqoop.

  • Open ‘Web Platform Installer’ (install from here)
  • Type ‘sql express ‘ in the search box
  • Click ‘Add’ next to ‘SQL Server Express 2008’ and ‘SQL Server Express 2008 Management Studio’

image

  • Set the password for the admin account (you will need to use it in the next step)

image

  • Open ‘SQL Server Configuration Manager’
image
  • Enable the TCP/IP protocol for SQLExpress.
    • Expand ‘SQL Server Network Configuration’
    • Click on ‘Protocols for SQLExpress’
    • Right click on ‘TCP/IP’ and select ‘Enable’

image

  • Restart SQL Server service.
    • Click on ‘SQL Server Services’
    • Right click on ‘SQL Server (SQLExpress)’ and select ‘Restart’

image 

  • Start ‘SQL Server Browser’ service
    • Click on ‘SQL Server Services’
    • Right click on ‘SQL Server Browser’ and select ‘Properties’
    • Change the ‘Start Mode’ to ‘Automatic’, click Apply and close the properties window.
    • Start the service

image

Create Database and Table

  • Open ‘SQL Server Management Studio’ and Connect to ‘SQL Express’ server (default)

image

  • Create ‘New Query’

image

  • Run the following command to create a new database (sqoop) and a table (w3c) with the following schema:
CREATE DATABASE sqoop
GO

USE sqoop
GO

CREATE TABLE [dbo].[w3c](
    [logdate] nvarchar(150) NULL, [logtime] nvarchar(150) NULL,
    [c_ip] nvarchar(150) NULL, [cs_username] nvarchar(150) NULL,
    [s_ip] nvarchar(150) NULL, [s_port] nvarchar(150) NULL,
    [cs_method] nvarchar(150) NULL, [cs_uri_stem] nvarchar(150) NULL,
    [cs_uri_query] nvarchar(150) NULL, [sc_status] [bigint] NULL,
    [sc_bytes] [bigint] NULL, [cs_bytes] [bigint] NULL,
    [time_taken] [bigint] NULL, [cs_agent] nvarchar(150) NULL,
    [cs_referrer] nvarchar(150) NULL) ON [PRIMARY]

Upload data to HDFS

  • Download the file w3c_2MB.txt 
  • Upload the file from your local drive to HDFS under the folder ‘sqoop’
    • Open the ‘Hadoop Command Prompt’ window (shortcut should be available on your desktop), type:

   > hadoop fs -copyFromLocal w3c_2MB.txt ./sqoop/w3c_2MB.txt

    • Make sure that the file has been uploaded, type:

   > hadoop fs -lsr

image

Export from HDFS to SQL Express

  • Open the ‘Hadoop Command Prompt’ window (shortcut should be available on your desktop)
  • In the script below, replace:
    • <your-sql-password> with the password that you chose when you installed SQL Express
    • <your-user-name> with the name of your machine

If you are using a full SQL Server instance: instead of localhost\sqlexpress, use localhost. Make sure that the specified user (sa) is enabled and has write permissions on your SQL Server. Note that you must use SQL Server Authentication.

   > cd c:\Hadoop\sqoop-1.4.2\bin

   > sqoop export --connect "jdbc:sqlserver://localhost\sqlexpress;databaseName=sqoop;user=sa;password=<your-sql-password>" --table w3c --export-dir /user/<your-user-name>/sqoop -m 1 --input-fields-terminated-by " "

Note:

  • Make sure that the export completed successfully

image

  • Open ‘SQL Server Management Studio’ and Connect to ‘SQL Express’ server (default).
  • Create a ‘New Query’ and run:

USE sqoop
GO

SELECT * FROM w3c
image 

Import from SQL Express back to HDFS

  • Open the ‘Hadoop Command Prompt’ window (shortcut should be available on your desktop)
  • In the script below, replace:
    • <your-sql-password> with the password that you chose when you installed SQL Express
    • <your-user-name> with the name of your machine

   > cd c:\Hadoop\sqoop-1.4.2\bin

   > sqoop import --connect "jdbc:sqlserver://localhost\sqlexpress;databaseName=sqoop;user=sa;password=<your-sql-password>" --table w3c --target-dir /user/<your-user-name>/sqoop/import --fields-terminated-by " " -m 1

  • Make sure that the import completed successfully

    > hadoop fs -ls /user/<your-user-name>/sqoop/import

image

Import data from SQL Server Express directly into a Hive table

  • Open the ‘Hadoop Command Prompt’ window (shortcut should be available on your desktop)
  • In the script below, replace:
    • <your-sql-password> with the password that you chose when you installed SQL Express

   > cd c:\Hadoop\sqoop-1.4.2\bin

   > sqoop import --connect "jdbc:sqlserver://localhost\sqlexpress;databaseName=sqoop;user=sa;password=<your-sql-password>" --table w3c --hive-table w3c_sqoop --create-hive-table --hive-import -m 1

image

  • Make sure that the import completed successfully.
  • Open the Hive CLI:

   > c:\Hadoop\hive-0.9.0\bin\hive

   hive> SELECT COUNT(*) FROM w3c_sqoop;

image