Boosting SSIS Performance By Replacing Lookup With Script Transform

We recently had a situation where we wanted to load 20 million rows from one database server (on-premise) to another (cloud), and the SSIS package was run on the source server. For privacy reasons, we had to transform some of the the columns (age and postal code) during this load, which we solved using lookups against a couple of destination dimension tables. Normally, I would load data into a staging table at the destination and then perform all lookups when moving from the staging table to the fact table, but in this case it wasn’t acceptable to store sensitive data in the staging table.

This was fine during testing, but didn’t work at all in production, where we had to do an initial loading of about 20 million rows. It turned out our package was only able to load 10 000 rows per 8 minutes, so the whole lot would take 11 days. Of course, the connection was broken after a couple of days.

It turned out that although we had (partial) caching, the package repeatedly queried the destination database with queries like these:

exec sp_executesql N'select * from [Dim_FbfPostnummer] 
 where [Postnummer] = @P1 
 and ([FrånOchMed] <= @P2 or [FrånOchMed] is null) 
 and (@P3 < Till or Till is null)',N'@P1 char(5),@P2 datetime2(3),@P3 datetime2(3)','79495','2007-07-02 15:09:01','2007-07-02 15:09:01' 

I suspect that because the dates were never the same, SSIS could not cache this properly. We had to find another solution.

After some thinking and searching, I found help on Matt Masson’s blog on MSDN: http://blogs.msdn.com/b/mattm/archive/2008/11/25/lookup-pattern-range-lookups.aspx

I adopted some of his code. The first lookup was fairly simple – lookup an age group from age (in years). There should always be a match, and the dimension is constant (not slowly changing). The only real difficulty I found was how to use our existing OLEDB connection. I found another helpful blog post by Matt Masson: http://blogs.msdn.com/b/mattm/archive/2008/08/22/accessing-oledb-connection-managers-in-a-script.aspx

Here is the code for this lookup script. A reference to Microsoft.SQLServer.ManagedDTS must be added.

/* Microsoft SQL Server Integration Services Script Component
*  Write scripts using Microsoft Visual C# 2008.
*  ScriptMain is the entry point class of the script.*/

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.Collections.Generic;
using System.Data.OleDb;

public struct AgeGroup
{
    public int DWKey;
    public int From;
    public int To;

    public AgeGroup(int dwKey, int from, int to)
    {
        this.DWKey = dwKey;
        this.From = from;
        this.To = to;
    }
}

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    OleDbConnection _connection = null;
    List<AgeGroup> _cache = new List<AgeGroup>();
    AgeGroup _default;
    const string Query = "SELECT DWKey, FrånOchMedÅr, TillÅr FROM Dim_Åldersgrupp";

    public override void PreExecute()
    {
        base.PreExecute();
        /*
          Add your code here for preprocessing or remove if not needed
        */
        Microsoft.SqlServer.Dts.Runtime.ConnectionManager cm = Microsoft.SqlServer.Dts.Runtime.DtsConvert.GetWrapper(Connections.DataWarehouse);
        IDTSConnectionManagerDatabaseParameters100 cmParams = cm.InnerObject as IDTSConnectionManagerDatabaseParameters100;
        _connection = cmParams.GetConnectionForSchema() as OleDbConnection;
        FillCache();
    }

    private void FillCache()
    {
        var command = new OleDbCommand(Query, _connection);
        var reader = command.ExecuteReader();
        try
        {
            while (reader.Read())
            {
                if (reader.IsDBNull(1) && reader.IsDBNull(2))
                    _default = new AgeGroup(reader.GetInt32(0), int.MinValue, int.MaxValue);
                else
                {
                    int from = reader.IsDBNull(1) ? int.MinValue : reader.GetInt32(1);
                    int to = reader.IsDBNull(2) ? int.MaxValue : reader.GetInt32(2);
                    var ageGroup = new AgeGroup(reader.GetInt32(0), from, to);
                    _cache.Add(ageGroup);
                }
            }
        }
        finally
        {
            reader.Close();
        }
    }

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        if (Row.AgeInYearsInt_IsNull)
            Row.Åldersgrupp = _default.DWKey;
        else
            Row.Åldersgrupp = _cache.Find(item => item.From <= Row.AgeInYearsInt && Row.AgeInYearsInt < item.To).DWKey;
    }
}

The second one was a little bit harder. This dimension was slowly changing. For an incoming postal code and date, we must look up the correct version of the dimension, because an important parameter is population, and that varies from year to year. In addition, if we don’t get a match in the lookup, we follow a different path in the dataflow, so this transform must have two outputs.

Here is the code I wrote:

/* Microsoft SQL Server Integration Services Script Component
*  Write scripts using Microsoft Visual C# 2008.
*  ScriptMain is the entry point class of the script.*/

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.Collections.Generic;
using System.Data.OleDb;

public class Postnummer
{
    public int DWKey;
    public DateTime From;
    public DateTime To;
    public int? Befolkning;

    public Postnummer(int dwKey, DateTime from, DateTime to, int? befolkning)
    {
        this.DWKey = dwKey;
        this.From = from;
        this.To = to;
        this.Befolkning = befolkning;
    }
}

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    OleDbConnection _connection = null;
    Dictionary<string, List<Postnummer>> _cache = new Dictionary<string, List<Postnummer>>();
    const string Query = "SELECT DWKey, Postnummer, FrånOchMed, Till, Befolkning FROM Dim_FbfPostnummer";

    public override void PreExecute()
    {
        base.PreExecute();
        /*
          Add your code here for preprocessing or remove if not needed
        */
        Microsoft.SqlServer.Dts.Runtime.ConnectionManager cm = Microsoft.SqlServer.Dts.Runtime.DtsConvert.GetWrapper(Connections.DataWarehouse);
        IDTSConnectionManagerDatabaseParameters100 cmParams = cm.InnerObject as IDTSConnectionManagerDatabaseParameters100;
        _connection = cmParams.GetConnectionForSchema() as OleDbConnection;
        FillCache();
    }

    private void FillCache()
    {
        var command = new OleDbCommand(Query, _connection);
        var reader = command.ExecuteReader();
        try
        {
            int rows = 0;
            while (reader.Read())
            {
                int dwKey = reader.GetInt32(0);
                string pnr = reader.GetString(1);
                DateTime from = reader.IsDBNull(2) ? DateTime.MinValue : reader.GetDateTime(2);
                DateTime to = reader.IsDBNull(3) ? DateTime.MaxValue : reader.GetDateTime(3);
                int? befolkning = reader.IsDBNull(4) ? null : (int?)reader.GetInt32(4);
                var postnummer = new Postnummer(dwKey, from, to, befolkning);
                if (_cache.ContainsKey(pnr))
                {
                    var list = _cache[pnr];
                    list.Add(postnummer);
                }
                else
                    _cache.Add(pnr, new List<Postnummer>() { postnummer });
                rows++;
            }
            Log("Cache initialized.", rows, null);
        }
        finally
        {
            reader.Close();
        }
    }

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        /*
          Add your code here
        */
        Postnummer postnummer = null;
        if (Row.NationalRegZipCode != null && _cache.ContainsKey(Row.NationalRegZipCode))
        {
            var list = _cache[Row.NationalRegZipCode];
            postnummer = list.Find(item => item.From <= Row.SignedTimeRounded && Row.SignedTimeRounded < item.To);
        }
        if (postnummer == null)
        {
            Row.FbfPostnummerDWKey_IsNull = true;
            Row.FbfPostnummerBefolkning_IsNull = true;
            Row.DirectRowToLookupNoMatch();
        }
        else
        {
            Row.FbfPostnummerDWKey = postnummer.DWKey;
            if (postnummer.Befolkning == null)
                Row.FbfPostnummerBefolkning_IsNull = true;
            else
                Row.FbfPostnummerBefolkning = postnummer.Befolkning.Value;
            Row.DirectRowToLookupMatch();
        }
    }

}

Migrating Azure Subscriptions to Organization Accounts (Including Visual Studio Online)

We started creating Azure subscriptions before organization accounts were available, but now when even Visual Studio Online supports organization accounts, we wanted to switch from Microsoft accounts. (Microsoft accounts have the obvious drawbacks: more passwords to remember and access does not go away automatically when people quit.)

The subscription owner (service account) was a Microsoft account – let us call this account1@company.se. The subscriptions were linked to a domain that I refer to here as domain1.onmicrosoft.com, and the new domain, which is synced with our Active Directory, I refer to as company.onmicrosoft.com.

Azure Co-Administrators

So here is what we had to do make the switch with co-administrators:

  1. Change the AD account name of account1@company.se to account2@company.se. Why? Because this account was synced to company.onmicrosoft.com, and there can only be one account with the same name in the same Azure Active Directory (AAD). (The Microsoft account is added in step 4 below.)
  2. Go to manage.windowsazure.com and login as account1@company.se.
  3. Create a new directory. Choose custom create. Select Use existing directory.
  4. Log in as a global administrator of the company.onmicrosoft.com domain. What happens now is that account1@company.se (the Microsoft account) is automatically added as a global administrator in company.onmicrosoft.com.
  5. Log out and log in again as account1@company.se. There should now be a new directory (company.onmicrosoft.com) visible.
  6. Go to settings, select a subscription and click Edit Directory at the bottom of the page. Change to company.onmicrosoft.com. You get a warning saying all co-administrators will be deleted. That is OK, because that is what we wish to accomplish.
  7. Add organization accounts as co-administrators as appropriate.

Azure Service Administrators

Now, we also wanted to change the service administrator from account1@company.se (Microsoft account) to account2@company.se (organization account). That is accomplished in the account portal.

  1. Go to account.windowsazure.com and log in.
  2. Select a subscription.
  3. Click Edit subscription details.
  4. Change the service administrator.

Account Administrator

I don’t know how to change the account administrator. Maybe it requires a support ticket.

Visual Studio Online

Visual Studio Online was the most difficult service. You can administer some aspects in the Azure managment portal (manage.windowsazure.com). But when you connect it to the new domain (company.onmicrosoft.com in my case), users not in that domain will loose access. Because the Microsoft accounts probably do not exist in that domain, you will have to create new users in Visual Studio Online, and because the probably have the same names as the corresponding Microsoft accounts, you must first delete these.

But beware! Before you start deleting users and switch domain, you must make sure no user has anything checked out. You might also want to delete all workspaces, because when the “new” users start Visual Studio and log in, they must create new workspaces, because the old workspaces are owned by different users. And you cannot have work folder mappings in two workspaces to the same local folder on the same computer (error: The working folder is already in use by the workspace…)

Use the following command to list all workspaces:

tf workspaces /server:&lt;instance&gt;.visualstudio.com\DefaultCollection /owner:*

or the following to see details on a specific computer:

tf workspaces /server:instance.visualstudio.com\DefaultCollection /computer:&lt;computer&gt;/owner:* /format:detailed

You can delete a workspace with the following command:

tf workspace /delete /collection:&lt;instance&gt;.visualstudio.com\DefaultCollection &quot;&lt;computer&gt;;&lt;user&gt;&quot;

Or simpler, use Team Foundation Sidekicks. In fact, I did not delete my work space before making the switch, and the only way I could delete it afterwards was to use Sidekicks.

  1. Ensure no user has anything checked out.
  2. Delete workspaces for Microsoft accounts that you plan to abandon in favour of organization account.
  3. Delete accounts in VisualStudio Online using User Management (https://<instance&gt;.visualstudio.com/_user).
  4. Go to manage.windowsazure.com and login as an administrator. Select your Visual Studio Online service and click Configure. Click the Connect button and connect to the company.onmicrosoft.com domain.
  5. MSDN subscribers must connect their Microsoft accounts with their organization accounts in the My Account section of MSDN (https://msdn.microsoft.com/subscriptions/manage/hh442900). Under Visual Studio Online they can create this link.
  6. In Visual Studio Online using User Management, add users again. This time, their organization accounts will be added. If they have MSDN, you should choose Eligible MSDN Subscriber as license.
  7. Click the cog wheel in the upper right corner in Visual Studio Online to go to the control panel. Select a project, click on your team and add team members and administrators.
  8. In Visual Studio, users must switch user by clicking Connect to Team Projects in Team Explorer and then Select Team Projects… In the bottom of the Connect to Team Foundation Server dialog, click Switch User. Enter the organization account credentials.
  9. If a user doesn’t see any code, just work items, he or she has probably a stakeholder license, not an MSDN license. They can check this by logging in to <instance>.visualstudio.com. See step 5 above. If it still doesn’t work, try changing license to basic and then back to MSDN.

Generating a Large Excel File from Code

This is one way of generating a large Excel file with random data from code:

  1. Fire up Visual Studio 2013.
  2. Create a new Excel 2013 Workbook project.
  3. Use the following code:
        public partial class ThisWorkbook
        {
            private void ThisWorkbook_Startup(object sender, System.EventArgs e)
            {
                var random = new Random();
                var activeWorksheet = ((Excel.Worksheet)Application.ActiveSheet);
                for (int i = 1; i <= 1024*30; i++)
                {
                    byte[] bytes = new byte[768];
                    random.NextBytes(bytes);
                    string s = Convert.ToBase64String(bytes);
                    var range = activeWorksheet.get_Range("A" + i);
                    range.Value = s;
                }
            }
    
    		// ...
    	}
    

    It will create 1024 * 30 rows of 1024 characters each.

  4. Press F5 to run.

Before Cancelling an Azure Subscription, Remove All Virtual Networks

I recently cancelled an Azure Subscription, and then went on to delete all of its resources (virtual machines, databases, etc.). This went fairly well with a couple of exceptions.

Problem #1: I wasn’t possible to delete my virtual network. I got an error message saying it wasn’t possible because the subscription wasn’t active. I had to submit a support request, and a helpful technician re-enabled my subscription.

Lesson: Delete all resources before you cancel a subscription.

Problem #2: I still cannot delete the associated Active Directory. I get an error message saying

The following issue(s) prevent deletion of this directory:
· Directory has one or more Azure subscriptions.

But that is not true – I changed the subscription to be associated with a different directory. I hope Microsoft Support has a solution for this issue as well…

UPDATE: I discovered there was another subscription, owned by another user, associated with the domain. After changing that, I could delete the domain.

Starting Minecraft Server (or another program) remotely

I wanted a quick way of starting a Java program (Minecraft Server) from client computers, and the first solution that came to mind was PowerShell remoting. While this is great for background processes, Minecraft Server has a user interface, and so does the NoIP dynamic DNS client, which I am using. The solution was to create a task using task scheduler, and starting this task with PowerShell.

image

This action starts powershell.exe with a script as argument: C:\Mojang\Minecraft_server\StartMinecraftServer.ps1

The script looks like this:

$proc = Get-Process | where { $_.Name -eq "DUC40" }
if ($proc -eq $null) { Start-Process "C:\Program Files (x86)\No-IP\DUC40.exe" }

$proc = Get-Process | where { $_.Name -eq "java" -and $_.MainWindowTitle -eq "Minecraft server" }
if ($proc -eq $null) { Start-Process "C:\Program Files\Java\jre7\bin\java.exe" -ArgumentList "-Xms512m", "-Xmx1024m", "-jar", "minecraft_server.1.8.zip" }

As you see, it checks if “DUC40” (NoIP dynamic DNS client) is started, and if not, starts it. The same goes for Minecraft itself.

Then, I have another scheduled task, which runs at 00:00 each night and starts a script which stops these processes if they are running:

$proc = Get-Process | where { $_.Name -eq "java" -and $_.MainWindowTitle -eq "Minecraft server" }
if ($proc -ne $null) { Stop-Process $proc }
$proc = Get-Process | where { $_.Name -eq "DUC40" }
if ($proc -ne $null) { Stop-Process $proc }

On the client side, the following command needs to be run once from an elevated PowerShell console. (I would like to use the server name rather than the IP address, but it didn’t work for me.)

# The following is needed on the client if computers are not domain joined. It must be run from an elevated process.
Set-Item "wsman::localhost\Client\TrustedHosts" "htpc-dator"
Set-Item "wsman::localhost\Client\TrustedHosts" "192.168.0.101"
Restart-Service WinRM

Then, the following PowerShell script is used to start the task on the server:

# On the server, you must run Enable-PSRemoting.
# Invoke-Command -ComputerName "htpc-dator" -ScriptBlock { Start-ScheduledTask -TaskPath "\Henrik" -TaskName StartMinecraftServer }
Invoke-Command -ComputerName "192.168.0.101" -ScriptBlock { Start-ScheduledTask -TaskPath "\Henrik" -TaskName StartMinecraftServer } -Credential htpc-dator\htpc