
I'm a Windows Azure MVP, the principal developer for OakLeaf Systems and the author of 30+ books about Microsoft software. The books have more than 1.25 million English copies in print and have been translated into 20+ languages. Forbes Magazine ranked me seventh in their "Who Are The Top 20 Influencers in Big Data?" article of 2/3/2012. Roger has posted 37 posts at DZone.

Analyzing FAA Air Traffic Datasets Using Cloud Numerics

03.20.2012

The U.S. Federal Aviation Administration (FAA) publishes a monthly On-Time Performance dataset for all airlines holding a Department of Transportation (DOT) Air Carrier Certificate. The Bureau of Transportation Statistics (BTS), part of the DOT’s Research and Innovative Technology Administration (RITA), publishes the datasets as prezipped comma-separated value (CSV) files, which Excel can open, here:

image

The BTS also publishes summaries of on-time performance, such as the Percent of Flights On Time (2011-2012) chart highlighted here:

image

Clicking one of the bars displays flight delay details for the past 10 years by default:

image

You can filter the data by airport, carrier, month, and arrival or departure delays. Many travel-related Web sites and some airlines use the TranStats data for consumer-oriented reports.

The FAA On_Time_Performance Database’s Schema and Size

The database file for January 2012, which was the latest available when this post was written, has 486,133 rows of 83 columns, only a few of which are of interest for analyzing on-time performance:

OnTimePerformance2012-01InExcel

The ZIP file includes a Readme.html file with a record layout (schema) containing field names and descriptions.

The size of the extracted CSV file for January 2012 is 213,455 KB (about 213 MB), so a year’s data would have about 5.8 million rows and occupy about 2.5 GB, which borders on qualifying for Big Data status.
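The yearly figures are a straight scale-up of the January file. Here is a quick check in Python, used only as a calculator because the sample solution itself is C#. It assumes every month resembles January 2012 and reads the extracted file size as 213,455 KB; both are assumptions of this sketch, and real months vary.

```python
# Scale the January 2012 figures to a full year, assuming every month
# looks like January (an assumption; real months differ in length and traffic).
JAN_ROWS = 486_133       # rows in the January 2012 file
JAN_SIZE_KB = 213_455    # extracted CSV size, read here as kilobytes


def yearly_estimate(rows_per_month, size_kb_per_month, months=12):
    """Return (rows, size in GB) for `months` months of similar data."""
    rows = rows_per_month * months
    size_gb = size_kb_per_month * months / 1_000_000  # decimal KB -> GB
    return rows, size_gb


rows, size_gb = yearly_estimate(JAN_ROWS, JAN_SIZE_KB)
print(f"{rows:,} rows, about {size_gb:.2f} GB per year")
```

The result lands at roughly 5.8 million rows and a little over 2.5 GB, which matches the estimate in the text.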

Applying a filter to display flights with departure delays >0 shows that 151,762 (31.2%) of the 486,133 flights for the month suffered departure delays of 1 minute or more:

OnTimePerformance2012-01Filtered

You’ll notice that many flights with departure delays had no arrival delays, which means that those flights beat their scheduled durations. Arrival delays are of more concern to passengers, so a filter on arrival delays >0 (149,036 flights, 30.7%) is more appropriate:

image
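The two Excel filters above amount to counting rows whose delay value is greater than zero. The following Python sketch shows the same count on a tiny made-up sample. The column names DepDelay and ArrDelay and all sample rows are assumptions for illustration, so check the Readme.html record layout before running this against a real file.

```python
import csv
import io

# Tiny made-up sample; DepDelay and ArrDelay are assumed column names.
SAMPLE = """FlightDate,Carrier,DepDelay,ArrDelay
2012-01-01,AA,12,-3
2012-01-01,DL,0,5
2012-01-01,UA,-4,-10
2012-01-01,WN,30,25
"""


def delayed_share(csv_text, column):
    """Return (delayed, total, percent) for rows where `column` > 0."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Blank delay values (for example, canceled flights) count as not delayed.
    delayed = sum(1 for r in rows if r[column].strip() and float(r[column]) > 0)
    total = len(rows)
    return delayed, total, 100.0 * delayed / total


print(delayed_share(SAMPLE, "DepDelay"))
print(delayed_share(SAMPLE, "ArrDelay"))

# The January 2012 counts quoted in the article give the quoted percentages.
print(round(100 * 151_762 / 486_133, 1), round(100 * 149_036 / 486_133, 1))
```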

 

The Microsoft Codename “Cloud Numerics” Sample Solution for Analyzing Air Carrier Arrival Delays

Microsoft Codename “Cloud Numerics” is an SQL Azure Labs incubator project for numerical and data analysis by “data scientists, quantitative analysts, and others who write C# applications in Visual Studio. It enables these applications to be scaled out, deployed, and run on Windows Azure” with high-performance computing (HPC) techniques for parallel processing of distributed data arrays. The current version is limited to arrays that fit in the main memory of a Windows Server 2008 R2 cluster. As the “Cloud Numerics” Team observes, disk-based data can be pre-processed by existing “big data” processing tools [such as Hadoop/MapReduce] and ingested into a “Cloud Numerics” application for further processing.

Roope Astala of the Codename “Cloud Numerics” Team described a “Cloud Numerics” Example: Analyzing Air Traffic “On-Time” Data in a 3/8/2012 post to the team’s blog:

You sit at the airport only to witness your departure time get delayed. You wait. Your flight gets delayed again, and you wonder “what’s happening?” Can you predict how long it will take to arrive at your destination? Are there many short delays in front of you or just a few long delays? This example demonstrates how you can use “Cloud Numerics” to sift through and calculate a big enough cross section of air traffic data needed to answer these questions. We will use on-time performance data from the U.S. Department of Transportation to analyze the distribution of delays.

The data is available at http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time. This data set holds data for every scheduled flight in the U.S. from 1987 to 2011 and is —as one would expect— huge! For your convenience, we have uploaded a sample of 32 months—one file per month with about 500,000 flights in each—to Windows Azure Blob Storage at this container URI: http://cloudnumericslab.blob.core.windows.net/flightdata.

You cannot access this URI directly in a browser; you must use a Storage Client C# API … or a REST API query (http://cloudnumericslab.blob.core.windows.net/flightdata?restype=container&comp=list) to return a public blob list, the first three items of which are shown here:

OnTimePerformanceBlobList

Dates of the sample files range from May 2009 to December 2011; months don’t appear sequentially in the list.
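If you want to work with the blob list outside Visual Studio, the XML returned by the REST query can be parsed with a few lines of Python. This is a sketch, not part of the sample solution. The element names (Blobs, Blob, Name, Url) reflect my understanding of the List Blobs response format, and the blob names in the sample document are placeholders, not the real file names. The code parses a local string and makes no network call.

```python
import xml.etree.ElementTree as ET

CONTAINER = "http://cloudnumericslab.blob.core.windows.net/flightdata"
LIST_URL = CONTAINER + "?restype=container&comp=list"

# Hypothetical response body; the blob names are placeholders, not the real ones.
SAMPLE_XML = """<EnumerationResults ContainerName="http://cloudnumericslab.blob.core.windows.net/flightdata">
  <Blobs>
    <Blob>
      <Name>example_2011_12.csv</Name>
      <Url>http://cloudnumericslab.blob.core.windows.net/flightdata/example_2011_12.csv</Url>
    </Blob>
    <Blob>
      <Name>example_2009_05.csv</Name>
      <Url>http://cloudnumericslab.blob.core.windows.net/flightdata/example_2009_05.csv</Url>
    </Blob>
  </Blobs>
</EnumerationResults>"""


def blob_urls(xml_text):
    """Return the Url text of every Blob element, in document order."""
    root = ET.fromstring(xml_text)
    return [blob.findtext("Url") for blob in root.iter("Blob")]


print(LIST_URL)
for url in blob_urls(SAMPLE_XML):
    print(url)
```

The list is returned in document order, which is why the months need not appear sequentially.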

Cloud Numerics’ IParallelReader interface could use four ExtraLarge compute instances having eight cores each to process all 32 months in parallel, if you have obtained permission to exceed the default 20-core limit. Unlike most other SQL Azure Labs incubator projects, Cloud Numerics doesn’t offer free Windows Azure resources to run sample projects. ExtraLarge compute instances cost US$0.96 per hour of deployment, so assigning a core to each month would cost US$3.84 per deployed hour. Roope’s post recommends:

You should use two to four compute nodes when deploying the application to Windows Azure. One node might not have enough memory, and for larger-sized deployments there are not enough files in the sample data set to assign to all distributed I/O processes. You should not attempt to run the application on a local system because of data transfer and memory requirements. [Emphasis added.]

Note!

You specify how many compute nodes are allocated when you use the Cloud Numerics Deployment Utility to configure your Windows Azure cluster (3 is the default). For details, see this section in the Getting Started guide.
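The trade-off between node count, hourly cost and idle I/O processes can be tabulated quickly. The Python sketch below assumes one I/O process per core, which the discussion above implies but does not state outright, and it prices only the ExtraLarge compute nodes at the US$0.96 hourly rate quoted earlier.

```python
import math

EXTRA_LARGE_RATE = 0.96   # US$ per ExtraLarge instance per deployed hour
CORES_PER_NODE = 8
SAMPLE_FILES = 32         # monthly files in the sample container


def plan(compute_nodes):
    """Return (hourly cost, I/O processes, busy processes, max files per process).

    Assumes one I/O process per core; only compute nodes are costed.
    """
    processes = compute_nodes * CORES_PER_NODE
    busy = min(processes, SAMPLE_FILES)
    max_files = math.ceil(SAMPLE_FILES / processes)
    return round(compute_nodes * EXTRA_LARGE_RATE, 2), processes, busy, max_files


for nodes in (2, 3, 4, 5):
    print(nodes, plan(nodes))
```

Under these assumptions, four nodes give every process exactly one file, and a fifth node adds cost while leaving eight processes with nothing to read.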

 

Creating the OnTimePerformance Solution from the Microsoft Cloud Numerics Application Template

The Microsoft Cloud Numerics Application template proposes to create a new MSCloudNumerics1.sln solution with six prebuilt projects:

  • AppConfigure
  • AzureSampleService
  • ComputeNode
  • FrontEnd
  • HeadNode
  • MSCloudNumericsApp

Changes to the template code are required only to the MSCloudNumericsApp project’s Program.cs class file.

To create the AirCarrierOnTimeStats.sln solution and add a required reference, do the following:

1. Launch Visual Studio 2010 Web Developer Express or higher, choose New, Project to open the New Project dialog, select the Microsoft Cloud Numerics Application template, and name the project AirCarrierOnTimeStats:

image

2. Click OK to create the templated solution.

3. Right-click the MSCloudNumericsApp node and choose Add Reference to open the eponymous dialog. Scroll to and select the .NET tab’s Microsoft.WindowsAzure.StorageClient library:

image

4. Click OK to add the reference to the project.

Replacing Template Code with OnTimePerformance-Specific Procedures

1. Re-create the MSCloudNumericsApp project’s prebuilt Program.cs class by replacing all of its code with the following using block and procedure stubs:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using msnl = Microsoft.Numerics.Local;
using msnd = Microsoft.Numerics.Distributed;
using Microsoft.Numerics.Statistics;
using Microsoft.Numerics.Mathematics;
using Microsoft.Numerics.Distributed.IO;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;
namespace FlightOnTime
{ 
   [Serializable]
   public class FlightInfoReader : IParallelReader<double>
   {

   }

   class Program
   {
      static void WriteOutput(string output)
      {

      }

      static void Main()
      {
         // Initialize runtime
         Microsoft.Numerics.NumericsRuntime.Initialize();
         // Shut down runtime
         Microsoft.Numerics.NumericsRuntime.Shutdown();
      }
   }
}

2. Add methods for reading blob data by replacing the public class FlightInfoReader stub with the following code:

[Serializable]
public class FlightInfoReader : IParallelReader<double>
{
    string _containerAddress;

    public FlightInfoReader(string containerAddress)
    {
        _containerAddress = containerAddress;
    }

    public int DistributedDimension
    {
        get { return 0; }
        set { }
    }

    public Object[] ComputeAssignment(int ranks)
    {
        // Get list of flight info files (blobs) from container
        var container = new CloudBlobContainer(_containerAddress);
        var blobs = container.ListBlobs().ToArray();

        // Allocate blobs to workers in round-robin fashion
        List<Uri>[] assignments = new List<Uri>[ranks];
        for (int i = 0; i < ranks; i++)
        {
            assignments[i] = new List<Uri>();
        }

        for (int i = 0; i < blobs.Count(); i++)
        {
            int currentRank = i % ranks;
            assignments[currentRank].Add(blobs[i].Uri);
        }
        return (Object[]) assignments;
    }

    public msnl.NumericDenseArray<double> ReadWorker(Object assignment)
    {
        List<Uri> assignmentUris = (List<Uri>) assignment;

        // If there are no blobs to read, return empty array
        if (assignmentUris.Count == 0)
        {
            return msnl.NumericDenseArrayFactory.Create<double>(new long[] { 0 });
        }

        List<double> arrivalDelays = new List<double>();

        for (int blobCount = 0; blobCount < assignmentUris.Count; blobCount++)
        {
            // Open blob and read text lines
            var blob = new CloudBlob(assignmentUris[blobCount].AbsoluteUri);
            var rows = blob.DownloadText().Split(new char[] { '\n' });
            int nrows = rows.Count();

            // Skip the header row; note that the last row is empty
            for (int i = 1; i < nrows - 1; i++)
            {
                // Remove quotation marks and split row
                var thisRow = rows[i].Replace("\"", String.Empty).Split(new char[] { ',' });

                // Filter out canceled and diverted flights
                if (!thisRow[49].Contains("1") && !thisRow[51].Contains("1"))
                {
                    // Add arrival delay from column 44 to list
                    arrivalDelays.Add(System.Convert.ToDouble(thisRow[44]));
                }
            }
        }
        // Convert list to numeric dense array and return it from reader
        return msnl.NumericDenseArrayFactory.CreateFromSystemArray<double>(arrivalDelays.ToArray());
    }
}
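
The two ideas in the reader, round-robin assignment of blobs to ranks and per-row filtering, are easy to try without the Cloud Numerics runtime. The following Python sketch mirrors the logic of ComputeAssignment and ReadWorker as I read them. It is an illustration, not a port. It keeps the same naive comma split after removing quotation marks, so the column indices 44, 49 and 51 are valid only for rows laid out the way the C# reader expects. The blob names are placeholders.

```python
ARR_DELAY, CANCELLED, DIVERTED = 44, 49, 51  # column indices used by the C# reader


def compute_assignment(blob_uris, ranks):
    """Deal blob URIs to `ranks` workers in round-robin order."""
    assignments = [[] for _ in range(ranks)]
    for i, uri in enumerate(blob_uris):
        assignments[i % ranks].append(uri)
    return assignments


def arrival_delays(csv_text):
    """Return arrival delays for flights that were neither canceled nor diverted."""
    delays = []
    for row in csv_text.split("\n")[1:]:   # skip the header row
        if not row.strip():                # skip the empty trailing row
            continue
        # Remove quotation marks, then split on commas as the C# code does
        fields = row.replace('"', "").split(",")
        if "1" not in fields[CANCELLED] and "1" not in fields[DIVERTED]:
            delays.append(float(fields[ARR_DELAY]))
    return delays


# Five placeholder blob names dealt to two ranks
print(compute_assignment(["m1", "m2", "m3", "m4", "m5"], 2))
```

With five blobs and two ranks, the first rank receives the first, third and fifth blobs and the second rank receives the other two, so the load differs by at most one file per rank.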

 

3. Read the data and implement the statistics algorithm by replacing the static void Main stub with the following:

static void Main()
{
    // Initialize runtime
    Microsoft.Numerics.NumericsRuntime.Initialize();

    // Instantiate StringBuilder for writing output
    StringBuilder output = new StringBuilder();

    // Read flight info
    string containerAddress = @"http://cloudnumericslab.blob.core.windows.net/flightdata/";
    var flightInfoReader = new FlightInfoReader(containerAddress);
    var flightData = Loader.LoadData<double>(flightInfoReader);

    // Compute mean and standard deviation
    var nSamples = flightData.Shape[0];
    var mean = Descriptive.Mean(flightData);
    flightData = flightData - mean;
    var stDev = BasicMath.Sqrt(Descriptive.Mean(flightData * flightData) * ((double)nSamples / (double)(nSamples - 1)));

    output.AppendLine("Mean (minutes), " + mean);
    output.AppendLine("Standard deviation (minutes), " + stDev);

    // Compute how much of the data is below or above 0, 1,...,5 standard deviations
    long nStDev = 6;
    for (long k = 0; k < nStDev; k++)
    {
        double aboveKStDev = 100d * Descriptive.Mean((flightData > k * stDev).ConvertTo<double>());
        double belowKStDev = 100d * Descriptive.Mean((flightData < -k * stDev).ConvertTo<double>());
        output.AppendLine("Samples below and above k standard deviations (percent), " + k + ", " + belowKStDev + ", " + aboveKStDev);
    }

    // Write output to a blob
    WriteOutput(output.ToString());

    // Shut down runtime
    Microsoft.Numerics.NumericsRuntime.Shutdown();
}
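
The statistics in Main are ordinary descriptive measures: the mean, the sample standard deviation (mean of squared deviations scaled by n / (n - 1)), and the percentage of flights more than k standard deviations below or above the mean. The sketch below repeats the calculation on a plain Python list so you can check the logic on a handful of values. The delay values are made up for illustration, and nothing here uses the Cloud Numerics libraries.

```python
import math


def summarize(delays, n_stdev=6):
    """Return (mean, sample standard deviation, report lines) for a list of delays."""
    n = len(delays)
    mean = sum(delays) / n
    centered = [d - mean for d in delays]
    # Sample standard deviation: mean of squares scaled by n / (n - 1)
    st_dev = math.sqrt(sum(c * c for c in centered) / n * (n / (n - 1)))

    lines = [f"Mean (minutes), {mean}",
             f"Standard deviation (minutes), {st_dev}"]
    for k in range(n_stdev):
        # Percentage of samples more than k standard deviations from the mean
        above = 100.0 * sum(c > k * st_dev for c in centered) / n
        below = 100.0 * sum(c < -k * st_dev for c in centered) / n
        lines.append("Samples below and above k standard deviations (percent), "
                     f"{k}, {below}, {above}")
    return mean, st_dev, lines


# Made-up delays in minutes, for illustration only
mean, st_dev, lines = summarize([-5, 0, 5, 10, 40])
print("\n".join(lines))
```

Because delays are skewed by a few very late flights, the below and above percentages are not symmetric, which is what the k-by-k breakdown is meant to expose.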

 

4. Write results to a blob in your Windows Azure storage account by replacing the WriteOutput stub with the following:

static void WriteOutput(string output)
{
    // Write to blob storage
    // Replace "myAccountKey" and "myAccountName" by your own storage account key and name
    string accountKey = "myAccountKey";
    string accountName = "myAccountName";
    // Result blob and container name
    string containerName = "flightdataresult";
    string blobName = "flightdataresult.csv";

    // Create result container and blob
    var storageAccountCredential = new StorageCredentialsAccountAndKey(accountName, accountKey);
    var storageAccount = new CloudStorageAccount(storageAccountCredential, true);
    var blobClient = storageAccount.CreateCloudBlobClient();
    var resultContainer = blobClient.GetContainerReference(containerName);
    resultContainer.CreateIfNotExist();
    var resultBlob = resultContainer.GetBlobReference(blobName);

    // Make result blob publicly readable
    var resultPermissions = new BlobContainerPermissions();
    resultPermissions.PublicAccess = BlobContainerPublicAccessType.Blob;
    resultContainer.SetPermissions(resultPermissions);

    // Upload result to blob
    resultBlob.UploadText(output);
}

 

You will replace myAccountName with the name of your Windows Azure Storage account and myAccountKey with the value of your Windows Azure Storage account’s access key after you complete the cluster configuration process in the next section.

5. Right-click the AppConfigure node and choose Set as StartUp Project:

image

Configuring the AirCarrierOnTimeStats Solution

This section assumes that you have Windows Azure Compute and Storage accounts, which are required to upload and run the solution in Windows Azure. If you don’t have a subscription with these accounts, you can sign up for a Three-Month Free Trial here. The Free Trial includes:

  • Compute Virtual Machine: 750 Small Compute hours per month
  • Relational Database: 1GB Web edition SQL Azure database
  • Storage: 20GB with 1,000,000 storage transactions
  • Content Delivery Network (CDN): 500,000 CDN transactions
  • Data Transfer (Bandwidth): Unlimited inbound / 20GB Outbound

Small Compute nodes have the equivalent of a single 1.6 GHz CPU core, 1.75 GB of RAM, 225 GB of instance storage and “moderate” I/O performance. If you configure the minimum recommended number (two) of Extra Large Compute instances with eight cores each, you will consume 18 Small Compute hours per hour that the solution is deployed. The additional two Small Compute instances are for the Head Node and Web Role (Front End). See step 13 below for more details.
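
The core arithmetic is worth checking before you deploy, because it determines both how many compute nodes fit under the default 20-core limit and how quickly a Free Trial allowance is used up. The Python sketch below assumes the Head Node and Front End take one core each, as the 18-hour figure implies, and that trial hours are consumed at one Small Compute hour per core per deployed hour.

```python
CORES_PER_EXTRA_LARGE = 8
SUPPORT_NODES = 2           # Head Node and Web Role (Front End), one core each
DEFAULT_CORE_LIMIT = 20
TRIAL_SMALL_HOURS = 750     # Small Compute hours per month in the Free Trial


def cores_used(compute_nodes):
    """Total cores for the cluster, including the two support nodes."""
    return compute_nodes * CORES_PER_EXTRA_LARGE + SUPPORT_NODES


def max_compute_nodes(core_limit=DEFAULT_CORE_LIMIT):
    """Largest number of Extra Large compute nodes that fits under the limit."""
    return (core_limit - SUPPORT_NODES) // CORES_PER_EXTRA_LARGE


def trial_hours(compute_nodes):
    """Deployed hours per month covered by the Free Trial allowance."""
    return TRIAL_SMALL_HOURS / cores_used(compute_nodes)


print(cores_used(2), cores_used(3))   # cores for two and three compute nodes
print(max_compute_nodes())            # nodes that fit under the default limit
print(round(trial_hours(2), 1))       # deployed hours covered per month
```

Under these assumptions, the default of three compute nodes needs 26 cores and exceeds the limit, two nodes need 18, and a two-node cluster uses up the monthly trial allowance in a little under 42 deployed hours.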

Tip: See the Deploying “Cloud Numerics” Sample Applications to Windows Azure HPC Clusters post for a similar Cloud Numerics sample application if you want more detailed instructions for deploying the project.

1. Create or use a subscription with an Affinity Group that specifies the North Central US data center, where the data blobs are stored.

2. In Visual Studio, choose Build, Configuration Manager to open the Configuration Manager dialog, select Release in the Active Solution Configurations list, and click OK to change the build configuration from Debug to Release:

image

If you don’t build for Release, the Cloud Numerics job you submit at the end of this post will fail to run to completion.

3. Press F5 to build the solution and start the configuration process with the Cloud Numerics Deployment Utility.

4. Copy the Subscription ID from the management portal and paste it to the Subscription ID text box:

image

5. If you’ve created a Microsoft Cloud Numerics Azure Management Certificate previously, click the Browse button to open the Windows Security dialog and select the certificate in the list:

image 

Otherwise, click the Create button to open the Certificate Name Dialog, accept the Certificate Name, browse to the folder in which to store the *.cer file, and specify the File Name:

image

Click OK to accept the certificate and close the dialog.

6. If you created a new certificate, the following dialog appears.

image

Click OK to confirm either process.

7. Return to the Management Portal, select Management Certificates in the navigation pane, select the appropriate subscription:

image 

8. Click the Certificates group’s Add Certificate button to open the Add New Management Certificate dialog.

9. Click the Browse button to open the Open dialog, browse to the location where you saved the certificate file for the previous or new certificate, and double-click the *.cer file to add it:

image

10. Click OK to complete the certificate addition process, verify that the certificate appears under the subscription node, and return to the Utility dialog.

11. Type a globally unique hosted Service Name, aircarrierstats for this example, and select the North Central US data center in the Location list:

image

The configuration process also will create a Storage Account with the Service Name as its name.

12. Click Next to select the Cluster in Azure tab, type an administrator name and password, and confirm the password:

image

13. By default, the Utility specifies 3 Extra Large Compute nodes which have 8 CPU cores per instance. The default maximum number of cores (without requesting more from Windows Azure Billing Support) is 20, so change the number of Compute nodes to 2, which results in a total of 18 cores, including the Head and Web FrontEnd.

14. Click Next to activate the SQL Azure Server tab, accept the New Server and Administrator defaults and click Configure Cluster to start the configuration process. (You don’t need to complete the Application Code page until you submit a job.) After a few minutes, the cluster configuration process completes:

image

15. Select Storage Accounts in the Windows Azure Portal’s navigation pane to verify that the aircarrierstats storage account has been created:

image

16. Click the Primary Access Key’s View button to open a dialog with primary and secondary key values:

image

17. Click the Clipboard icon to the right of the Primary Key text box to copy the value and click Close to dismiss the dialog.

18. Open the MSCloudNumericsApp project’s Program.cs file, if necessary, and replace myAccountName with aircarrierstats and paste-replace myAccountKey with the Clipboard value.

Deploying the Cluster to Windows Azure and Processing the MSCloudNumericsApp.exe Job

The final steps in the process are to deploy the project to a Windows Azure hosted service and submit (run) it to generate a results blob in the aircarrierstats storage service.

1. Return to the Utility dialog and click Deploy Cluster to create the aircarrierstats hosted service, apply its Service Certificate, and start the deployment process:

image

Copying the HPC package to blob storage is likely to take an hour or more, depending on your Internet connection’s upload speed. Initialization of the four nodes takes an additional

2. Completing deployment enables the Check Cluster Status button:

image 

3. Return to the Windows Azure Portal, select Hosted Services in the navigation pane to display the status of the four nodes and select the Air Carrier On-Time Stats subscription to display its properties:

image

4. When all nodes reach Ready status, reopen the Cloud Numerics Deployment Utility, click the Application Code tab, click the Browse button, navigate to the \MyDocuments\Visual Studio 2010\Projects\AirCarrierOnTimeStats\MSCloudNumericsApp\bin\Release folder, select MSCloudNumericsApp.exe as the executable to run, and click Submit Job:

image

Using the Windows Azure HPC Scheduler Web Portal to Check Job Status

1. When the “Job successfully submitted” message appears, type the URL for the hosted service (https://aircarrierstats.cloudapp.net for this example) and click the Continue to Web Site link to temporarily accept the self-signed certificate and open the Sample Application page with a Certificate Warning in the address bar:

image

2. Click the Certificate Warning to display an Untrusted Certificate popup and click View Certificates to open the Certificate dialog:

image

3. Click the Install Certificate button to start the Certificate Import Wizard, click Next to display the Certificate Store page, and accept the default option:

image

4. Click Next to import the certificate, click Finish, and click OK to dismiss the “The import was successful” message.

5. Type your Cluster Administrator username and password in the Windows Security dialog:

image

6. Click OK to open the Windows Azure HPC Scheduler Web Portal page and click All Jobs to display the submitted job status:

image

If you receive an error message, click the My Jobs link.

7. Click the MSCloudNumericsApp.exe link to open the job details page:

image

Notice that the job completed in about two minutes.

8. Click the View Tasks tab to display execution details:

image

The Help link opens Release Notes for Microsoft HPC Pack 2008 R2 Service Pack 2, which doesn’t provide any information about the HPC Scheduler Web Portal.

Viewing the FlightDataResult Data in Excel

Open flightdataresult.csv in Excel from the aircarrierstats storage account’s flightdataresult container in an Azure storage utility, such as Cerebrata’s Cloud Storage Studio, to display the result of the standard deviation computations:

image

 

Published at DZone with permission of its author, Roger Jennings. (source)
