I wanted to get a meaningful dataset into Azure Data Lake so that I could test it out. I came across this article that walks through using the NYC Taxi Dataset with Azure Data Lake:
The article mostly skips over the part about getting the dataset into Azure. Here is how I did it:
- Spin up a VM on Azure
- In Server Manager, click Local Server, then click the On link next to IE Enhanced Security Configuration and set it to Off for at least Administrators (otherwise you will have to click OK a dozen times per web page)
- Download the files from the NYC Taxi Trip website to your VM: http://www.andresmh.com/nyctaxitrips/
- Install 7-Zip so that you can extract the 7z files.
- Once you have installed it from http://www.7-zip.org/download.html, go to the install folder (probably C:\Program Files\7-Zip) and right-click the 7z.exe file. Select the 7-Zip > Open archive option, then click the + sign and browse to your downloads folder.
- Because the files in the trip_data.7z file are larger than 2GB, you cannot upload them using the portal, so you need to use PowerShell instead.
- You need to install the Azure PowerShell cmdlets: look for the Windows Install link a bit down this page: https://azure.microsoft.com/en-us/downloads/
- You will probably need to restart the VM for the Azure commands to be available in PowerShell
- Go wild on Azure Data Lake Store using this doc https://github.com/Microsoft/azure-docs/blob/master/articles/data-lake-store/data-lake-store-get-started-powershell.md – here are the key steps:
# Log in to your Azure account
Login-AzureRmAccount
# List all the subscriptions associated to your account
Get-AzureRmSubscription
# Select a subscription
Set-AzureRmContext -SubscriptionId "xxx-xxx-xxx"
# Register for Azure Data Lake Store
Register-AzureRmResourceProvider -ProviderNamespace "Microsoft.DataLakeStore"
# Verify your ADL account name
Get-AzureRmDataLakeStoreAccount
# Figure out what folder to put the files in
Get-AzureRmDataLakeStoreChildItem -AccountName mlspike -Path "/"
NOTE: If you do not want to copy the files one by one, you can copy the whole folder using this format, where $myrootdir holds your Data Lake Store root folder (for example "/"):
Import-AzureRmDataLakeStoreItem -AccountName mlspike -Path "C:\Users\Taxi\Desktop\files2\trip_data" -Destination $myrootdir\TaxiDataFiles
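If you would rather script the extraction and the file-by-file upload instead of clicking through them, something like the following might work. It is only a sketch under a few assumptions: the function names (Expand-SevenZipArchive, Join-DataLakePath, Send-FilesToDataLake) are made up for illustration, 7z.exe is assumed to be in the default install folder mentioned above, and you are assumed to be logged in already with the AzureRM cmdlets available. The download folder in the example calls is hypothetical too.

```powershell
# Sketch only: function names and example paths are hypothetical.

function Expand-SevenZipArchive {
    param(
        [Parameter(Mandatory)] [string] $ArchivePath,
        [Parameter(Mandatory)] [string] $OutputFolder,
        # Default install location; adjust if 7-Zip lives elsewhere.
        [string] $SevenZipExe = 'C:\Program Files\7-Zip\7z.exe'
    )
    # "x" extracts with full paths, "-o" sets the output folder
    # (no space after -o), "-y" answers yes to any prompts.
    & $SevenZipExe x $ArchivePath "-o$OutputFolder" -y
    if ($LASTEXITCODE -ne 0) {
        throw "7-Zip exited with code $LASTEXITCODE for $ArchivePath"
    }
}

function Join-DataLakePath {
    param(
        [Parameter(Mandatory)] [string] $Folder,
        [Parameter(Mandatory)] [string] $FileName
    )
    # Data Lake Store paths use forward slashes; trim any trailing
    # slash on the folder so the result never contains "//".
    return $Folder.TrimEnd('/') + '/' + $FileName
}

function Send-FilesToDataLake {
    param(
        [Parameter(Mandatory)] [string] $AccountName,
        [Parameter(Mandatory)] [string] $LocalFolder,
        [Parameter(Mandatory)] [string] $DestinationFolder
    )
    # Upload each file in the folder individually (subfolders are skipped).
    foreach ($file in Get-ChildItem -Path $LocalFolder -File) {
        $target = Join-DataLakePath -Folder $DestinationFolder -FileName $file.Name
        Write-Host "Uploading $($file.FullName) -> $target"
        Import-AzureRmDataLakeStoreItem -AccountName $AccountName `
            -Path $file.FullName -Destination $target
    }
}

# Example calls (the Downloads path is an assumption, not from the steps above):
# Expand-SevenZipArchive -ArchivePath 'C:\Users\Taxi\Downloads\trip_data.7z' `
#     -OutputFolder 'C:\Users\Taxi\Desktop\files2\trip_data'
# Send-FilesToDataLake -AccountName 'mlspike' `
#     -LocalFolder 'C:\Users\Taxi\Desktop\files2\trip_data' `
#     -DestinationFolder '/TaxiDataFiles'
```

Uploading one file at a time is slower to type than the folder form in the NOTE, but it makes it easier to see which file failed and to rerun just that one.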
Once you have the files uploaded to Azure Data Lake, you can delete the VM.
If you know of a faster way of getting them there (without downloading them to your local machine), I would love to hear it!