Getting NYC Taxi Data into Azure Data Lake

I wanted to get a meaningful dataset into Azure Data Lake so that I could test it out. I came across this article, that walks through using the NYC Taxi Dataset with Azure Data Lake:

https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-process-data-lake-walkthrough

The article kind of skips over the whole part of getting the dataset into Azure. Here is how I did it:

  • Spin up a VM on Azure
  • On Server Manager, click on Local Server, next to IE Enhanced Security Configuration click the On link, and at least set Admin to Off (or else you will have to click ok a dozen times a web page)
  • Download the files from the NYC Taxi Trip website to your VM http://www.andresmh.com/nyctaxitrips/
  • Install 7-Zip so that you can unzip the 7z files.
  • Because the files in the trip_data.7z file are larger than 2GB, you cannot upload them using the portal, and you need to use Powershell.
  • You need to install the Azure PowerShell Commandlets – look for the Windows Install link a bit down this page https://azure.microsoft.com/en-us/downloads/
  • You will probably need to restart the VM for the Azure commands to be available in PowerShell
  • Go wild on Azure Data Lake Store using this doc https://github.com/Microsoft/azure-docs/blob/master/articles/data-lake-store/data-lake-store-get-started-powershell.md – here are the key steps:

 # Log in to your Azure account
Login-AzureRmAccount

# List all the subscriptions associated to your account
Get-AzureRmSubscription

# Select a subscription
Set-AzureRmContext -SubscriptionId “xxx-xxx-xxx”

# Register for Azure Data Lake Store
Register-AzureRmResourceProvider -ProviderNamespace “Microsoft.DataLakeStore”

#Verify your ADL account name
Get-AzureRmDataLakeStoreAccount

#Figure out what folder to put the files
Get-AzureRmDataLakeStoreChildItem -AccountName mlspike -Path “/”

NOTE: if you do not want to copy the files one-by-one, you can just copy the whole folder using this format: Import-AzureRmDataLakeStoreItem -AccountName mlspike -Path “C:\Users\Taxi\Desktop\files2\trip_data” -Destination $myrootdir\TaxiDataFiles

Once you have the files uploaded to Azure Data Lake, you can delete the VM.

If you know of a faster way of getting them there (without downloading them to your local machine), I would love to hear it!

Thanks.

Matt

Learn Machine Learning with Datacamp (on the Side)

I have worked with Machine Learning off and on for the last couple of years. Unfortunately, it is not part of my day job, so I need to find ways to integrate machine learning into an already busy day and week.

I stumbled across Datacamp, and so far I love what I see. After creating a free account in about 10 seconds, I was watching my first sub-10 minute video. Yes, no biggie there, but that is where it gets great: Datacamp starts providing an interactive R session, along with detailed instructions and readily available datasets, where you can actually start doing your very own work – no software required. Pretty impressive.

Datacamp

I would love to know more about Datacamp: who are the founders, what is their business model, and how did they get so awesome? If you know the founders, please drop a comment below.

Thanks, and if you end up testing it out let me know what you think!

Matt