ARM Template Role Assignment Learnings

ARM templates are one of those things where the learning curve is steep, but once you get there they make your life so much easier that you’re glad you did it. If you’re like me, Google is your friend: whenever you hit an issue with your latest template you resort to searching error messages and hope someone else has not only found the solution, but also been kind enough to write it up. That latter point is where this post comes in. When doing ARM template role assignments, there are a couple of gotchas that I often forget, and when Google doesn’t have any “I’m Feeling Lucky” results, it’s time I try to be a nice person!

Let’s set the scene. Doing permissions in the template rather than after the deployment allows you to use incremental updates, and stops people doing “clicky clicky” changes in the environment. You know those people: “I’ll just do it quickly…”, “I’ll fix it later, honest”. And hell, we’ve all done it, but let’s try to be better than that.

Permissions are all about scope: you can assign to the Resource Group or to the resource itself. The approach is actually different for each, and this is explained perfectly here. There is every chance you’ve stumbled across this post because of this error:

"error": {
"code": "InvalidCreateRoleAssignmentRequest",
"message": "The request to create role assignment '{guid}' is not valid. Role assignment scope '/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/Microsoft.Insights/components/{resourceGroupName}' must match the scope specified on the URI  '/subscriptions/{resourceGroupName}/resourcegroups/{resourceGroupName}'."
  }

After reading the Microsoft documentation you’d think the scope property would help you, but instead I found it was easier to follow the approach the kind people at Stack Overflow explained. So let’s look at some examples:

Parameters & Variables:

    "parameters": {
        "runbookAutomationOperators": {
            "value": [
                "a195af43-xxxx-49fd-xxxx-c1e0de11b118",
                "5deb670c-xxxx-4642-xxxx-de5290266bad",
                "2541d966-xxxx-4d1d-xxxx-8ecc6c2e8a39"
            ]
        }
    },
    "variables": {
        "automationOperatorId": "d3881f73-407a-4167-8283-e981cbba0404",
        "readerId": "acdd72a7-3385-48ef-bd42-f606fba81ae7"
    },

Notes:

  • I’ve put an array of accounts so I can loop through them and assign them
  • I’ve put the built-in Azure roles as variables as they are referenced more than once (you can find the IDs by running Get-AzRoleDefinition)

Resource Group:

        {
            "type": "Microsoft.Authorization/roleAssignments",
            "apiVersion": "2020-04-01-preview",
            "name": "[guid(concat(resourceGroup().id), resourceGroup().name, variables('readerId'), parameters('runbookAutomationOperators')[copyIndex()])]",
            "copy": {
                "name": "resourceGroupReader",
                "count": "[length(parameters('runbookAutomationOperators'))]"
            },
            "dependsOn": [],
            "properties": {
                "roleDefinitionId": "[concat('/subscriptions/', subscription().subscriptionId, '/providers/Microsoft.Authorization/roleDefinitions/', variables('readerId'))]",                        
                "principalId": "[parameters('runbookAutomationOperators')[copyIndex()]]"
            }
        }

Notes:

  • The role assignment is at the top level, as we’re doing a resource group deployment it’s the RG where the rights will be written
  • The name has to be a unique guid, so I’ve included the actual account id in the string used to build the guid. Doing so ensures that if you have multiple assignments (which you will) they are unique, as it’s combining the Resource Group and the account being assigned to the RG
  • I’ve done a copy because I want to run it for the length of the array, in this case 3

Resource:

        {
            "type": "Microsoft.Automation/automationAccounts/providers/roleAssignments",
            "apiVersion": "2020-04-01-preview",
            "name": "[concat(parameters('automationAccountName'), '/Microsoft.Authorization/', guid(concat(resourceGroup().id), resourceId('Microsoft.Automation/automationAccounts', parameters('automationAccountName')),variables('automationOperatorId'), parameters('runbookAutomationOperators')[copyIndex()]))]",
            "copy": {
                "name": "runbookAutomationOperators",
                "count": "[length(parameters('runbookAutomationOperators'))]"
            },
            "dependsOn": [
                "[parameters('automationAccountName')]"
            ],
            "properties": {
                "roleDefinitionId": "[concat('/subscriptions/', subscription().subscriptionId, '/providers/Microsoft.Authorization/roleDefinitions/', variables('automationOperatorId'))]",
                "principalId": "[parameters('runbookAutomationOperators')[copyIndex()]]"
            }
        },

Notes:

  • This example is for an automation account and that is key, because you actually specify the type of resource in the type (something you can easily change for other resources, e.g. Microsoft.Storage/storageAccounts)
  • Similar to the RG, you need a unique guid, so I’ve included the resource itself as well as the account we are allocating, so it once again is unique
  • Again we’ve done a copy so it runs three times, allocating the users to this Operator Role

So I talked about the name having to be a guid in both examples, but it’s a point I’d like to talk about some more. Firstly, if it’s not a guid you’ll get this error:

The role assignment ID must be a GUID

I mean, you can’t blame MS for this error, it is pretty clear. But how do you make a guid? Well, I thought, ARM templates have guid functions, so I jumped over to the MS doco. But on reading it I definitely overthought these two lines of text, the first describing guid and the second newGuid:

  • The returned value isn’t a random string, but rather the result of a hash function on the parameters. The returned value is 36 characters long. It isn’t globally unique. To create a new GUID that isn’t based on that hash value of the parameters, use the newGuid function.
  • Returns a value in the format of a globally unique identifier. This function can only be used in the default value for a parameter.

And if you’ve not had enough coffee you might think: I just want a bloody unique guid and I don’t want to mess around with default values, especially as we’re doing a copy loop so we need some salt to make each one different. But, just to state the obvious, this actually makes sense. You want to create a string that will be the same every single time, because you want to be able to run this incrementally. If it was random, your IAM role assignment page would be a shambles, as every time you ran your pipeline it would come up with a lovely new guid! So the guid function is your friend, just ensure you include all the attributes so that the base string is unique. E.g. if you’re doing an Automation account, you want both that resource and the account you are assigning in the base string. If you don’t include the resource, it could clash with another role assignment of that user in the RG, and if you don’t include the user, well then it’s going to be the same for all users being granted access to that account.
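To make the “same inputs, same guid” idea concrete, here is a small Python sketch. It is only an analogy: ARM’s guid() does its own hashing internally, so the values below will not match what ARM produces, and the namespace constant and the assignment_name helper are names I’ve made up for illustration.

```python
import uuid

# Hypothetical namespace, picked only for this illustration.
# ARM's guid() uses its own internal hashing, so these values will not match ARM's.
ILLUSTRATIVE_NAMESPACE = uuid.UUID("00000000-0000-0000-0000-000000000000")


def assignment_name(*parts: str) -> str:
    """Return a repeatable, GUID-formatted name derived from the given parts."""
    return str(uuid.uuid5(ILLUSTRATIVE_NAMESPACE, "|".join(parts)))


if __name__ == "__main__":
    # Placeholder resource id; the role id is the Automation Operator id from the variables above.
    resource = "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Automation/automationAccounts/<name>"
    role = "d3881f73-407a-4167-8283-e981cbba0404"
    for user in ["user-a", "user-b"]:
        # Same resource + role + user always yields the same name; change any part and it differs.
        print(user, assignment_name(resource, role, user))
```

Run it twice and the output is identical, which is exactly the property that lets an incremental deployment recognise an existing role assignment instead of creating a new one.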

While I appreciate this may be obvious, I do hope it is useful as I really didn’t think the vendor doco was particularly clear. If nothing else it’ll save Future Dave from working this out yet again, because if he had a memory he’d be dangerous…


AKS in a Security Conscious Enterprise

The popularity of containers, and Kubernetes in particular, has been going from strength to strength of late. Azure Kubernetes Service (AKS) is the blue PaaS-like offering of this, where the vendor manages the masters and you just need to maintain the agent nodes. Better still, you only pay for the compute.

But like all things PaaS, while it was convenient at first having things just sitting on the internet so you can quickly adopt and consume them, that’s not exactly palatable to security architects, often found at the larger organisations. And it’s to Microsoft’s credit that they’ve really upped their game of late with ways to let you control connectivity to those lovely PaaS offerings, the ones we as cloud advocates try so hard to get our customers to adopt vs. costly and old school IaaS. Two of the technologies jtwo have been helping deploy of late are:

Service Endpoints: This is where the PaaS infrastructure still has a public IP, but you can restrict access to a VNet/Subnet(s). This only works with Azure originating traffic (you can exempt your on-prem PIP) but uses the Azure backbone network so it provides improved performance and at least stops people on the internet confirming if you really did set the security on that blob correctly.

Private Endpoints: This is where the security team normally stop saying no, and smile when you mention RFC1918. It lets you create an endpoint in your subnet with a private IP address, and that is the target to direct traffic, so ensuring it’s all internal. Of course, to do this you’ll need Private DNS and there is a cost associated, but who said being secure was going to be free!

Anyways, back to AKS. Microsoft has now made it possible to deploy a Private AKS cluster…well I say now, it was a little while back, but when we actually tried to do it in anger we hit lots of bugs…you know, that preview experience, aka beta. Now it works!

We love automation at jtwo, and in an ideal world for Azure we like to use Azure DevOps pipelines to deploy ARM templates. But we also know that if you’re playing with a brand new toy/technology, using the Azure CLI often gets you a bit further along the road and then you template it later. So my initial CLI command was as follows:

az aks create --name $clusterName -g $mainRG --enable-vmss --network-plugin kubenet `
    --load-balancer-sku standard --enable-private-cluster --node-resource-group $nodeRG `
    -k 1.17.9 --docker-bridge-address 172.18.0.1/16 --dns-service-ip 172.17.0.10 `
    --service-cidr 172.17.0.0/16 --pod-cidr 172.16.0.0/16 --vnet-subnet-id $subnetId `
    --outbound-type userDefinedRouting --service-principal $sp.appId --client-secret $sp.password `
    --generate-ssh-keys | ConvertFrom-Json

Some points to call out:

  • We wanted scale sets: --enable-vmss
  • We decided to use kubenet: --network-plugin kubenet. You can use Azure for the more advanced features, but it wasn’t necessary and we’re fans of KISS
  • We wanted a standard load balancer: --load-balancer-sku standard. This is the default for AKS clusters too
  • We wanted to control the managed RG: --node-resource-group $nodeRG. Yeah I know, this is just aesthetics but you might as well keep it clean; you have to run the aks-preview CLI extension for this nicety
  • We wanted to deploy it into an existing VNet/Subnet: --vnet-subnet-id. This isn’t mandatory and if you don’t specify it you’ll get a new VNet as part of CLI execution, but that’s never going to fly with our security friends
  • We wanted to control routing via a FW: --outbound-type userDefinedRouting. It’s good to snoop on internet traffic, right?
  • We then configured a service principal: --service-principal $sp.appId --client-secret $sp.password. In hindsight, however, I’d recommend --enable-managed-identity instead, much more elegant!
  • And finally controlled the SSH for troubleshooting the nodes: --generate-ssh-keys
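With that many address ranges on one command line it’s easy to fat-finger one, so here is a minimal Python sketch of a sanity check you could run first. It assumes two rules from the AKS networking guidance: the pod, service and Docker bridge ranges shouldn’t overlap one another, and the DNS service IP has to sit inside the service CIDR. The check_aks_ranges helper is a made-up name, and it doesn’t look at your VNet ranges at all.

```python
import ipaddress


def check_aks_ranges(pod_cidr, service_cidr, docker_bridge, dns_service_ip):
    """Return a list of human-readable problems; an empty list means none were found."""
    networks = {
        "pod-cidr": ipaddress.ip_network(pod_cidr),
        "service-cidr": ipaddress.ip_network(service_cidr),
        # The bridge is given as a host address plus mask (172.18.0.1/16),
        # so parse it as an interface and take the containing network.
        "docker-bridge-address": ipaddress.ip_interface(docker_bridge).network,
    }
    problems = []
    names = list(networks)
    # Compare every pair of ranges exactly once.
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if networks[first].overlaps(networks[second]):
                problems.append(f"{first} overlaps {second}")
    if ipaddress.ip_address(dns_service_ip) not in networks["service-cidr"]:
        problems.append("dns-service-ip is outside service-cidr")
    return problems


if __name__ == "__main__":
    # Values taken from the az aks create command above.
    print(check_aks_ranges("172.16.0.0/16", "172.17.0.0/16", "172.18.0.1/16", "172.17.0.10"))
```

An empty list means the values hang together; anything else is worth fixing before you wait on a cluster deployment only to watch it fail.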

So that got us up and running and all things were good. But remember that ‘doing things properly’ quip I made earlier? Here is the ARM, albeit heavily parameterised:

"resources": [
    {
        "type": "Microsoft.ContainerService/managedClusters",
        "apiVersion": "2020-04-01",
        "name": "[parameters('clusterName')]",
        "location": "[parameters('location')]",
        "sku": {
            "name": "Basic",
            "tier": "Free"
        },
        "identity": {
            "type": "SystemAssigned"
        },
        "properties": {
            "kubernetesVersion": "1.17.9",
            "dnsPrefix": "[parameters('dnsPrefix')]",
            "agentPoolProfiles": [
                {
                    "name": "nodepool1",
                    "count": 3,
                    "vmSize": "Standard_D2s_v3",
                    "osDiskSizeGB": 60,
                    "vnetSubnetID": "[variables('subnetId')]",
                    "maxPods": 110,
                    "type": "[parameters('clusterType')]",
                    "orchestratorVersion": "1.17.9",
                    "enableNodePublicIP": false,
                    "nodeLabels": {},
                    "mode": "System",
                    "osType": "Linux"
                }
            ],
            "linuxProfile": {
                "adminUsername": "azureuser",
                "ssh": {
                    "publicKeys": [
                        {
                            "keyData": "[parameters('keyData')]"
                        }
                    ]
                }
            },
            "nodeResourceGroup": "[parameters('nodeRG')]",
            "enableRBAC": true,
            "enablePodSecurityPolicy": false,
            "networkProfile": {
                "networkPlugin": "[parameters('networkProfile')]",
                "loadBalancerSku": "[parameters('loadBalancerSku')]",
                "podCidr": "[parameters('podCidr')]",
                "serviceCidr": "[parameters('serviceCidr')]",
                "dnsServiceIP": "[parameters('dnsServiceIP')]",
                "dockerBridgeCidr": "[parameters('dockerBridgeCidr')]",
                "outboundType": "[parameters('outboundType')]"
            },
            "apiServerAccessProfile": {
                "enablePrivateCluster": "[parameters('enablePrivateCluster')]"
            }
        }
    },
    {
        "type": "Microsoft.ContainerService/managedClusters/agentPools",
        "apiVersion": "2020-04-01",
        "name": "[concat(parameters('clusterName'), '/nodepool1')]",
        "dependsOn": [
            "[resourceId('Microsoft.ContainerService/managedClusters', parameters('clusterName'))]"
        ],
        "properties": {
            "count": 3,
            "vmSize": "Standard_D2s_v3",
            "osDiskSizeGB": 60,
            "vnetSubnetID": "[variables('subnetId')]",
            "maxPods": 110,
            "type": "[parameters('clusterType')]",
            "orchestratorVersion": "1.17.9",
            "enableNodePublicIP": false,
            "nodeLabels": {},
            "mode": "System",
            "osType": "Linux"
        }
    }
]

Again, some points to call out:

  • We took our advice from earlier and used system managed identities, further info available here
  • The enablePrivateCluster value should be the boolean true
  • outboundType should be userDefinedRouting, as before
  • networkPlugin will be kubenet as we discussed, KISS etc…
  • vnetSubnetID is the id of an existing Vnet/Subnet so we can integrate with our previously agreed network architecture

And there you have it, an AKS cluster deployed with the control plane on a private IP address, time to crack that tinny for a job well done… But remember that private DNS thing I skimmed over? Well, there are some considerations to that. 168.63.129.16 is the Azure DNS and it’s really important in this scenario, especially if you want on-premises devices to be able to connect. Microsoft do a very good job of explaining this here, but the quick summary is to make sure you have your on-premises DNS server forwarding resolution to Azure DNS, as it knows about the private zones, which stops external resolution being attempted and failing. Point 3 is key, as let’s be honest you’ll normally have at least a DC in Azure with custom DNS configured on the VNet:

1. By default, when a private cluster is provisioned, a private endpoint (1) and a private DNS zone (2) are created in the cluster managed resource group. The cluster uses an A record in the private zone to resolve the IP of the private endpoint for communication to the API server.
2. The private DNS zone is linked only to the VNet that the cluster nodes are attached to (3). This means that the private endpoint can only be resolved by hosts in that linked VNet. In scenarios where no custom DNS is configured on the VNet (default), this works without issue as hosts point at 168.63.129.16 for DNS which can resolve records in the private DNS zone because of the link.
3. In scenarios where the VNet containing your cluster has custom DNS settings (4), cluster deployment fails unless the private DNS zone is linked to the VNet that contains the custom DNS resolvers (5). This link can be created manually after the private zone is created during cluster provisioning or via automation upon detection of creation of the zone using event-based deployment mechanisms (for example, Azure Event Grid and Azure Functions).

Now did someone mention a tinny?