Setting Up an Azure Data Lake and Azure Data Factory Using PowerShell

Login-AzureRmAccount

#first ensure that you have an Azure Data Lake that you want to use for ODX
#$resourceGroups = Get-AzureRmResourceGroup
#$azureDataLakeNames = "";
#foreach ($resourceGroup in $resourceGroups) {
#    $azureDataLake = Get-AzureRmDataLakeStoreAccount -ResourceGroupName $resourceGroup.ResourceGroupName
#    $azureDataLake
#    $azureDataLakeName = $azureDataLake.Name
#    $azureDataLakeNameLength = $azureDataLakeName.Length
#    $azureDataLakeNameLength -gt 0
#    if ($azureDataLakeNameLength -gt 0) {
#        $azureDataLakeNames += " " + $azureDataLake.Name + " (resource group: " + $resourceGroup.ResourceGroupName + " & location: " + $resourceGroup.Location + ")"
#    }
#}
#"-----------"
#"DataLakeNames: " + $azureDataLakeNames
#—————————————————————————————-
#—————————————————————————————-
#REQUIRED: you must enter a unique app name, which will be used for the service principal
$appname = "sune"
#OPTIONAL: change the password for the service principal
$password = "Xyzpdq"
#run the commented-out script above and replace the value below with the name/resource group/location of your existing data lake store, or enter a new name to have a data lake created
$dataLakeStoreName = "sunelake"
$odxResourceGroup = "odxDemo"
$dataLakeLocation = "Central US" #Central US, East US 2, North Europe
#recommended to use the same resource group as the data lake for simplicity, but you can use any existing resource group or enter a new name to create one
$dataFactoryResourceGroup = $odxResourceGroup
#specify where you want your data factory - current options are East US, North Europe, West Central US, and West US
$dataFactoryLocation = "West US"
#—————————————————————————————-
#—————————————————————————————-

#create odxResourceGroup, if it does not exist
Get-AzureRmResourceGroup -Name $odxResourceGroup -ErrorVariable notPresent1 -ErrorAction 0
if ($notPresent1)
{
New-AzureRmResourceGroup -Location $dataLakeLocation -Name $odxResourceGroup
}

#create data lake, if it does not exist
Get-AzureRmDataLakeStoreAccount -Name $dataLakeStoreName -ErrorVariable notPresent2 -ErrorAction 0
if ($notPresent2)
{
New-AzureRmDataLakeStoreAccount -Location $dataLakeLocation -Name $dataLakeStoreName -ResourceGroupName $odxResourceGroup
}

$homepage = "https://ODXPS.com/" + $appname

#create the Azure AD application and service principal, if they do not exist
$app = New-AzureRmADApplication -DisplayName $appname -HomePage $homepage -IdentifierUris $homepage -Password $password
$app = Get-AzureRmADApplication -DisplayName $appname

$servicePrincipal = New-AzureRmADServicePrincipal -ApplicationId $app.ApplicationId
Start-Sleep 10 #give Azure AD a moment to propagate the new service principal
New-AzureRmRoleAssignment -RoleDefinitionName "Contributor" -ObjectId $servicePrincipal.Id -ResourceGroupName $odxResourceGroup
New-AzureRmRoleAssignment -RoleDefinitionName "Data Factory Contributor" -ObjectId $servicePrincipal.Id -ResourceGroupName $odxResourceGroup
New-AzureRmRoleAssignment -RoleDefinitionName "Reader" -ObjectId $servicePrincipal.Id -ResourceGroupName $odxResourceGroup

#Set-AzureRmDataLakeStoreItemAclEntry -AccountName $dataLakeStoreName -Path / -AceType User -Id $app.ApplicationId -Permissions All
Set-AzureRmDataLakeStoreItemAclEntry -AccountName $dataLakeStoreName -Path / -AceType User -Id $servicePrincipal.Id -Permissions All
Get-AzureRmDataLakeStoreItem -Account $dataLakeStoreName -Path /ODX -ErrorVariable notPresent3 -ErrorAction 0
if ($notPresent3)
{
New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path /ODX
}
Set-AzureRmDataLakeStoreItemAclEntry -AccountName $dataLakeStoreName -Path /ODX -AceType User -Id $servicePrincipal.Id -Permissions All
#Start-Sleep 60 #there seems to be a lag between when these permissions are added and when they are applied...trying 1 minute to start

$subscription = Get-AzureRmSubscription #assumes a single subscription; if you have several, select one first with Set-AzureRmContext
$subscriptionId = ($subscription).Id
$tenantId = ($subscription).TenantId

#ensure there are permissions
#Get-AzureRmDataLakeStoreItemAclEntry -Account $dataLakeStoreName -Path /

#get information on datalake
$dataLake = Get-AzureRmDataLakeStoreAccount -Name $dataLakeStoreName

#here is a printout
"---------------------------------------------------------------"
"---------------------------------------------------------------"
$text1 = "Azure Data Lake Name: " + $dataLakeStoreName + "`r`n" +
"Tenant ID: " + $tenantId + "`r`n" +
"Client ID: " + $app.ApplicationId + "`r`n" +
"Client Secret: " + $password + "`r`n" +
"Subscription ID: " + $subscriptionId + "`r`n" +
"Resource Group Name: " + $odxResourceGroup + "`r`n" +
"Data Lake URL: adl://" + $dataLake.Endpoint + "`r`n" +
"Location: " + $dataFactoryLocation
"---------------------------------------------------------------"
"---------------------------------------------------------------"

Out-File C:\Users\MattDyor\Desktop\DataLake.ps1 -InputObject $text1
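
The script defines $dataFactoryResourceGroup and $dataFactoryLocation but never actually creates the data factory itself. A minimal sketch of that last step, assuming the AzureRM.DataFactories module is installed and using a placeholder factory name (data factory names must be globally unique), could look like this:

#create the data factory, if it does not exist ($dataFactoryName below is a placeholder - pick your own unique name)
$dataFactoryName = "odxDemoFactory"
Get-AzureRmResourceGroup -Name $dataFactoryResourceGroup -ErrorVariable notPresent4 -ErrorAction 0
if ($notPresent4)
{
New-AzureRmResourceGroup -Location $dataFactoryLocation -Name $dataFactoryResourceGroup
}
New-AzureRmDataFactory -ResourceGroupName $dataFactoryResourceGroup -Name $dataFactoryName -Location $dataFactoryLocation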

Getting NYC Taxi Data into Azure Data Lake

I wanted to get a meaningful dataset into Azure Data Lake so that I could test it out. I came across this article, which walks through using the NYC Taxi Dataset with Azure Data Lake:

https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-process-data-lake-walkthrough

The article kind of skips over the part about getting the dataset into Azure. Here is how I did it:

  • Spin up a VM on Azure
  • In Server Manager, click Local Server, click the On link next to IE Enhanced Security Configuration, and at least set it to Off for Administrators (or else you will have to click OK a dozen times per web page)
  • Download the files from the NYC Taxi Trip website to your VM http://www.andresmh.com/nyctaxitrips/
  • Install 7-Zip so that you can unzip the 7z files.
    • Once you install it from http://www.7-zip.org/download.html, go to the install folder (probably C:\Program Files\7-Zip), right-click the 7z.exe file, select the 7-Zip > Open archive option, then click the + sign and browse to your downloads folder
  • Because the files in the trip_data.7z file are larger than 2GB, you cannot upload them using the portal; you need to use PowerShell.
  • You need to install the Azure PowerShell cmdlets – look for the Windows install link a bit down this page: https://azure.microsoft.com/en-us/downloads/
  • You will probably need to restart the VM for the Azure commands to be available in PowerShell
  • Go wild on Azure Data Lake Store using this doc https://github.com/Microsoft/azure-docs/blob/master/articles/data-lake-store/data-lake-store-get-started-powershell.md – here are the key steps:

 # Log in to your Azure account
Login-AzureRmAccount

# List all the subscriptions associated to your account
Get-AzureRmSubscription

# Select a subscription
Set-AzureRmContext -SubscriptionId "xxx-xxx-xxx"

# Register for Azure Data Lake Store
Register-AzureRmResourceProvider -ProviderNamespace "Microsoft.DataLakeStore"

#Verify your ADL account name
Get-AzureRmDataLakeStoreAccount

#Figure out what folder to put the files
Get-AzureRmDataLakeStoreChildItem -AccountName mlspike -Path “/”

NOTE: if you do not want to copy the files one-by-one, you can just copy the whole folder using this format: Import-AzureRmDataLakeStoreItem -AccountName mlspike -Path "C:\Users\Taxi\Desktop\files2\trip_data" -Destination $myrootdir\TaxiDataFiles
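
If you do want to upload the extracted files individually, here is a minimal sketch (the local path is hypothetical, and $myrootdir is assumed to be set as in the linked walkthrough; adjust the filter to match the extracted file names):

# upload each extracted CSV to a TaxiDataFiles folder in the data lake
$localFolder = "C:\Users\Taxi\Desktop\files2\trip_data"   # hypothetical local path
foreach ($file in Get-ChildItem -Path $localFolder -Filter *.csv)
{
Import-AzureRmDataLakeStoreItem -AccountName mlspike -Path $file.FullName -Destination "$myrootdir\TaxiDataFiles\$($file.Name)"
}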

Once you have the files uploaded to Azure Data Lake, you can delete the VM.

If you know of a faster way of getting them there (without downloading them to your local machine), I would love to hear it!

Thanks.

Matt

From Text File to Map in 30 Minutes

I wanted to see where United States patent attorneys reside. Using data from the United States Patent and Trademark Office and a free account from Microsoft’s Power BI, I was able to create the two visualizations below in about 30 minutes. Pretty amazing. Let me know if you want any details on how to do this – I sketched out the process below.

Steps to Create Power BI Map Visualization

  • Grab data: text file of all active US Patent Attorneys can be found here: https://oedci.uspto.gov/OEDCI/practitionerRoster.jsp
  • Visit Power BI Getting Started Page https://powerbi.microsoft.com/en-us/get-started/ so you can:
    • Download the Software (this allows you to author your visualizations); AND
    • Sign Up for a Power BI account (this allows you to publish your visualizations to the web)
  • Once you have downloaded the Desktop Power BI software, create a new report (File > New).
  • Click on Get Data > File > Text, and point to the file you downloaded (you need to extract it from the zip file if you have not already done that).
  • There are two types of maps in Power BI: maps and filled maps; I used the map, which represents each data point as a bubble.
  • For the top left map, just drag City onto the Location box.
  • For the bottom left map, drag State onto the Location box and drag State onto Size; click on State under Size and make sure Count is selected (not Count Distinct).
  • Finally, add a couple of tables to the right. For the top map, drag over City and Firm Name under Values, and then click on the down arrow for Firm Name and select Count (this will aggregate by city instead of showing all of the firm names along with their city).
    • I should have used registration number instead of Firm Name, because this is actually a count of firms by city, not a count of practitioners. Alas, if that is the worst mistake I make today, I am doing all right.
  • On the desktop, you can now click on the Publish icon in the top right.
  • Now open the web version of Power BI and navigate to your published report.
  • Under File select “Publish to Web”
  • Grab the “HTML you can add to your website” and paste this into a text view of your blog, and you are done. Super easy.

If you have published or stumbled upon some nice Power BI visualizations, drop them in a comment. I am a bit surprised this has not become more common. My prediction: data visualizations will become the norm in 2017, because all of these visualization tools are racing to become the standard, and are breaking down the barriers to adoption that have historically prevented people from jumping in (primarily complexity and cost). Exciting times.

Thanks.

Matt

Visualizing Sports League Standings

I have been thinking about a new visualization for sports league standings. Something that you could use to see teams on the ascent, and teams on their way down. Ideally, this would include all of the teams in a league, but since it takes a bit of time for me to prepare the data, I figured I would start by sharing a micro-set of data, seeing whether this visualization already exists somewhere, and, if not, whether you find it intriguing.
MLS Standings

If you click on the image, you can download the Excel spreadsheet that I used to create the chart.

What you are seeing is the 2016 season of wins and losses for the Seattle Sounders (red) and the Portland Timbers (orange). A win is worth +1, a loss is worth -1, and a tie is worth 0. For the first game, you see that the Sounders lose and the Timbers win.

Some interesting things jump out from this visualization: the teams are in a dead heat from games 8 to 13, the Timbers are on top for most of the rest of the season, and around game 25 the Sounders are in the toilet. But, you have to get hot at the right time, and that is exactly what the Sounders did: advancing from -4 to +1 over the course of the last 9 games.
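
If you want to recreate the chart without the spreadsheet, the underlying math is just a running total: map each result to +1/0/-1 and keep a cumulative sum per team. A small sketch, using a hypothetical result string rather than the real 2016 schedule:

# W = +1, T = 0, L = -1; the chart plots the cumulative sum game by game
$results = "L","W","T","W","L","W"   # hypothetical game-by-game results for one team
$score = 0
$runningTotals = foreach ($game in $results)
{
switch ($game) { "W" { $score += 1 } "L" { $score -= 1 } }
$score
}
$runningTotals -join ", "   # yields -1, 0, 0, 1, 0, 1 for the sample above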

Let me know what you think.

Thanks.

Matt

The problem with Data Discovery

Despite the power of data discovery tools like Qlik, Tableau, and Power BI, the vast majority of businesses are failing in their efforts to become data driven. Qlik estimates that, among its existing customers, there is only a 25% adoption rate (Qonnections 2016). That means that even at companies that have invested in becoming data driven, 75% of employees are just winging it. Without the right data in the right format for each type of business user, these businesses can end up with entire departments that are not benefiting from data discovery.

There are 2 core reasons why this adoption rate is so low: 1) the right data is not available at all within the Qlik environment; and 2) the data that is available within Qlik is not refined in such a way that it can be useful to the 75%’ers.

If you are struggling with these challenges, you should check out TimeXtender (disclaimer: TimeXtender is an awesome company; disclaimer 2: I work for TimeXtender :)). TimeXtender provides software that gets complex data into Qlik, and enables business intelligence (BI) teams to integrate, join, cleanse, denormalize, and otherwise refine the data for each business user. TimeXtender does the job of 3 tools: ETL software (like Informatica and SSIS), database management software (like SQL Server Management Studio), and semantic modeling (like Qlik Data Load Editor). Not only does TimeXtender do the work of these 3 tools, but it does it without requiring BI teams to write a single line of code.

If you are struggling to increase adoption of data discovery tools like Qlik within your organization, consider whether you are providing all of your users with the data they need and in the format that makes sense to them. If not, take a look at TimeXtender.

Regards,

Matt

Getting Click Data from Twitter API

I was trying to figure out whether the Twitter API allows you to get “click” counts. This information is available in the Twitter app (Tweet activity shows it as “Link clicks”), but I did not see it as part of the API.

It looks like Twitter has made an interesting decision to have this as a commercially, but not publicly, available part of the API (e.g., you can get it via Gnip).

REST API Support for Impressions Data.  Short answer: buy it.

I am hoping that there is some way that Twitter has provided to pull this data (perhaps the Analytics API?). But, as a short-term fix I am going to try this:

  • Connect Buffer to Bit.ly so that all of my buffered tweets use a Bitly link: https://blog.bufferapp.com/url-shorteners
  • Pull stats from Buffer and Bitly using Crunch Data’s C-Connect
  • Join these stats on the URL

The Twitter API will give me the retweets and favorites, and the Bitly API will give me the clicks.
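
As a rough illustration of that join (the file names, column names, and data are hypothetical; real Twitter/Buffer and Bitly exports will look different), here is how the two stat sets could be stitched together on the URL in PowerShell:

# twitter-stats.csv: url,retweets,favorites   bitly-stats.csv: url,clicks   (hypothetical exports)
$twitterStats = Import-Csv "twitter-stats.csv"
$bitlyClicks = @{}
foreach ($row in Import-Csv "bitly-stats.csv") { $bitlyClicks[$row.url] = $row.clicks }

$joined = foreach ($tweet in $twitterStats)
{
[pscustomobject]@{
url       = $tweet.url
retweets  = $tweet.retweets
favorites = $tweet.favorites
clicks    = $bitlyClicks[$tweet.url]   # empty if Bitly has no record for this url
}
}
$joined | Format-Table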

I will write a follow-up based on what I learn. I am curious to see whether stitching together this data from multiple APIs introduces some noise, and whether masking links with Bitly decreases clicks (e.g., a greater number of hops means more time to hit the destination page, people may be afraid of masked Bitly links, etc.).

If you have a better way, I would love to hear it.

Thanks.

Matt

Quick Qlik Tip: How to Automatically Identify Relationships in Data Models

Have you ever dreamed of being able to visually explore your data? If you have, Qlik is your dream come true. Qlik, and in particular Qlik Sense, allows you to perform data discovery on just about any data set with relative ease. In this post, I am going to walk you through my favorite way to let Qlik help you create valid relationships between tables in an unfamiliar data set. All told, this article should take 5 minutes to read, and less than 30 minutes if you are following along in the actual Qlik app.

Qlik’s First Question: to Add Data or Not to Add Data

Whenever you create a new Qlik application, you are presented with a choice between “Add data” and “Data load editor.”

Qlik-Data-Load-Editor

The way I think about these differences is that “Add data” is easy, and “Data load editor” is for people who enjoy punishment (they will claim that they prefer greater control over the import process). Both approaches have their strengths, but if you are just looking to jump in with both feet and start exploring your data, I would start with Add data.

If you want to play with a simple dataset, you can download this Excel file that I created to demonstrate some of the fundamentals of adding data to Qlik Sense. I will be using this data for the rest of this post.

Associate Data in Qlik Sense

After you add data, you are going to be slapped in the face with a simple truth: Qlik Sense, which allows you to bring visual meaning to very complex data, has a fairly complex user interface of its own (Qlik is radically improving this in Qlik Sense 3!). Unless you are just dealing with a single set of data, the first thing you need to do after adding data is associate the data you added. Here is the screen you are presented with…do you see how to create the relationships between your data? Of course you do! It is that little tiny “Associations” button at the bottom of the screen. 🙂

Qlik-Data-Associations

Once you have clicked on the Associations tab, you are presented with an option to associate each of the tables together. For each table to table relationship (e.g., Customers – Theaters), you need to select either “No Association” or one of the recommended associations. For the Customers – Theaters combination, there is no relationship, so you select “No Association” and then select the down arrow to advance to the next table.

Qlik-Overjoining

NOTE ABOUT OVER-JOINING: Notice how Qlik recommends the association between Customer and Theater with 100% confidence. There is no real relationship between customers and theaters (a person can purchase a ticket at any theater, and the only proper relationship is through the purchase entity). But, because Customer and Theater both have a City field with the same values, Qlik recommends the relationship with 100% confidence anyway. I still love Qlik for recommending these joins, because it is a lot easier to see a recommendation and determine that it is invalid than to have to imagine what relationships might exist – particularly in a large and complex data model.

Now that we advance to the Purchases – Theaters table, we see some pretty amazing smarts on display by Qlik: we have the purchaser’s name as “name” and the theater’s name as “name” – but a person’s name (e.g., Matt) is not the same as a theater’s name (e.g., Regal 8). Qlik recognizes that the proper join is between Purchase.Theater and Theater.Name and suggests this association with 100% confidence.

Qlik-Smart-Recommendations

After you go through all of the possible associations, you should have a pretty good object model.

Qlik-Data-Model

And if you switch over to the Data Load Editor, you will see a nice clean script that Qlik created for your data.

[Theaters]:
LOAD
[Name] AS [Theater],
[City] AS [Theaters.City],
[State] AS [Theaters.State],
[Rating] AS [Theaters.Rating];
LOAD
[Name],
[City],
[State],
[Rating]
FROM [lib://Desktop/Movies.xlsx]
(ooxml, embedded labels, table is Theaters);

[Customers]:
LOAD
[FirstName] AS [FirstName],
[City] AS [Customers.City],
[State] AS [Customers.State];
LOAD
[FirstName],
[City],
[State]
FROM [lib://Desktop/Movies.xlsx]
(ooxml, embedded labels, table is Customers);

[Purchases]:
LOAD
[Name] AS [FirstName],
[Movie] AS [Title],
[Theater] AS [Theater],
[Price] AS [Price];
LOAD
[Name],
[Movie],
[Theater],
[Price]
FROM [lib://Desktop/Movies.xlsx]
(ooxml, embedded labels, table is Purchases);

[Movies]:
LOAD
[Title] AS [Title],
[Movie Length] AS [Movie Length],
[Rating] AS [Movies.Rating];
LOAD
[Title],
[Movie Length],
[Rating]
FROM [lib://Desktop/Movies.xlsx]
(ooxml, embedded labels, table is Movies);

If you have any questions, drop a comment below.

Thanks.

Matt

Learn Machine Learning with Datacamp (on the Side)

I have worked with Machine Learning off and on for the last couple of years. Unfortunately, it is not part of my day job, so I need to find ways to integrate machine learning into an already busy day and week.

I stumbled across Datacamp, and so far I love what I see. After creating a free account in about 10 seconds, I was watching my first sub-10 minute video. Yes, no biggie there, but that is where it gets great: Datacamp starts providing an interactive R session, along with detailed instructions and readily available datasets, where you can actually start doing your very own work – no software required. Pretty impressive.

Datacamp

I would love to know more about Datacamp: who are the founders, what is their business model, and how did they get so awesome? If you know the founders, please drop a comment below.

Thanks, and if you end up testing it out let me know what you think!

Matt

Deploying TimeXtender on Azure

When I was an IT contractor for the Marine Corps in Washington DC, the summers could get pretty hot. The only respite from the heat was, you guessed it, the server room. Sometimes, I would just fire up one of the servers and do some research on a web browser to bask in that high-powered AC (please do not tell anybody). The biggest problem from my IT worker perspective was that when the AC went down or some other piece of hardware decided to die over the weekend, one of us would have to drive in and fix it.

Fast forward 10 years, and the days of the boutique server room seem to be drawing to a close. Companies are moving to public and private cloud at an accelerated rate. There are a number of benefits to moving to the cloud: shifting capex to opex, just-in-time capacity, and not having to come in on the weekend when the air conditioner goes down.

But what does that mean for on-premise software – software that used to operate inside that soon-to-be-shut-down boutique server room? One trend I am seeing is that companies are shifting on-premise software to the cloud with a Virtual Machine (VM) architecture. Granted, there are differences between a purpose-built cloud offering and a VM-based cloud offering, but it seems that the VM architecture is the optimal approach for companies that have invested time and money creating valuable on-premise software.

With a little bit of creativity, companies can realize all of the key benefits of the cloud (e.g., no on-premise footprint, opex, just-in-time capacity, subscription-/consumption-based pricing, and not having to come in on the weekend when the air conditioner goes down), and these benefits are realized without having to throw away a valuable piece of software that is already delivering value to customers.

Deploying TimeXtender on an Azure VM

Let’s get to the technical part. I assume that you already have an Azure account. The first step is to create a new SQL Server VM (not a SQL Database…that is Azure SQL, Microsoft’s SQL-as-a-Service offering). I chose SQL Server 2014 SP1 Standard on Windows Server 2012 R2.

SqlServer2012

Accept the defaults when you create this VM, but make sure that you write down the username and password you use, because you will log in to the VM with this information. I used the Resource Manager deployment model (a resource group) instead of the Classic VM, because I think that is the direction Microsoft is heading.

For the VM sizing, I chose DS1_V2, mostly from a cost perspective. It costs me $104.16 a month, so I can leave it spinning in the cloud for demos without burning through too much budget. This size is not going to handle monster workloads, but particularly if you are just exploring TimeXtender and building super fast data warehouses/data discovery hubs, it will have enough muscle for your needs.

Azure VM Sizing for TimeXtender
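
If you want to double-check which sizes are available in your region before you pick one, the AzureRM cmdlets can list them; a quick check (the region name is just an example) looks like this:

# list the VM sizes offered in a region and confirm DS1_V2 is among them
Get-AzureRmVMSize -Location "West US" | Where-Object { $_.Name -eq "Standard_DS1_v2" }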

For the SQL Settings, select Public – assuming that you are going to want to consume data from your data discovery hub from other machines (e.g., a Qlik application). For the port, you can stick with 1433 or you can be sneaky (just be sure to remember your sneakiness when trying to connect to your database).
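
Once the VM is up, it is worth confirming that the SQL port is actually reachable from the machine that will consume the data. A quick check from PowerShell (the DNS name below is a placeholder for whatever your VM’s public address is, and the port should match whatever you chose):

# verify the SQL endpoint is reachable on the port you chose (1433 by default)
Test-NetConnection -ComputerName "yourvm.westus.cloudapp.azure.com" -Port 1433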

Install TimeXtender Software

The easiest way to get the software installed on the VM is to download it to your desktop and then drag it onto the VM: https://support.timextender.com/hc/en-us/articles/210439583-Current-Release-of-TX-DWA (choose the 64-bit server version).

For full details on the installation process, see http://go.timextender.com/tx-dwa-trial-dl

Connect to a Data Source + Create Data Structures

Now let’s get started. My preferred approach is to grab a copy of the Data Discovery Hub template, import the template as a new project, and run the Connection Wizard.

Let’s start by importing the project. First, download the Data Discovery Hub template at https://dl.dropboxusercontent.com/u/19691948/DDH_Demo.XML. Place it on the desktop of your VM. Then open TimeXtender and select Import/Export, and then Import Project, like this:

Import TimeXtender Solution Template

As soon as you import the project, run the Connection Wizard. You will see that there are 4 different connections: ODS, Stage, MDW, and Dynamics. Dynamics represents an existing Dynamics database. The other 3 (ODS, Stage, MDW) are part of the data discovery hub – the data repository that we are going to create to store our Dynamics data.

Run TimeXtender Connection Wizard

In short, the data discovery hub includes the ODS (Operational Data Store) where we land non-transformed data right out of the source, the Stage Data Warehouse where we do our transformations and denormalizations, and the MDW (or Modern Data Warehouse) where we house the “business ready” data. Power Users can be granted access to the ODS so that they have access to all data across the enterprise and can conduct unfettered data discovery. The MDW is used to populate Qlik (or other data visualization/front-end systems). The data discovery hub is the most effective way that I have ever seen to get world class data to all employees within a company.

TimeXtender Data Discovery Hub

But I digress…let’s get back to setting up our system. When we run the wizard, it will first ask us where we want to store our ODS. Keep the defaults, but you have to click the “Create” button before hitting OK to actually create this database on your machine. Once you hit “Create” then you can hit OK.

Next, the wizard prompts us to connect to a Dynamics database. You can either connect to your own Dynamics database, or you can connect to a sample database we are hosting on Azure. To connect to our Dynamics database, the server name is txvm.westus.cloudapp.azure.com, the SQL Username is TxSQL, and the SQL Password is TimeXtenderRocks! (with an exclamation point!). Your wizard should look something like the image below.

Connect to Remote Data Source

Notice that there is no option to create this database, because we are pulling data from an existing data source, not creating a repository where we can store data.
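
If you want to confirm you can reach the sample Dynamics source before running the wizard, a quick sanity check from the VM (assuming Invoke-Sqlcmd is available, which it should be on a SQL Server image) could be:

# sanity-check connectivity to the hosted sample Dynamics database using the credentials above
Invoke-Sqlcmd -ServerInstance "txvm.westus.cloudapp.azure.com" -Username "TxSQL" -Password "TimeXtenderRocks!" -Query "SELECT @@VERSION"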

Run through MDW and Stage (do not forget to click Create for ODS, MDW, and Stage!), and then click the Test Connections button in the wizard. Everything should be green, as shown below.

TimeXtender Connection Wizard Green

For starters, I have just created the Data Discovery Hub architecture, and I will leave it to you to pull tables into your ODS, structure tables from your ODS into business ready data in your Stage DW, and then present business ready data to your business users in the MDW.

If you have any problems with any of these, let me know!

Thanks.

Matt
