AKS in a Security Conscious Enterprise

The popularity of containers, and Kubernetes in particular, has been going from strength to strength of late. Azure Kubernetes Service (AKS) is the blue PaaS-like offering of this, where the vendor manages the masters and you just need to maintain the agent nodes. Better still, you only pay for the compute.

But like all things PaaS, while it was convenient at first having things just sitting on the internet so you can quickly adopt and consume them, that’s not exactly palatable to the security architects often found at larger organisations. And it’s to Microsoft’s credit that they’ve really upped their game of late with ways to let you control connectivity to those lovely PaaS offerings, the ones we as cloud advocates try so hard to get our customers to adopt vs. costly and old-school IaaS. Two of the technologies jtwo have been helping deploy of late are:

Service Endpoints: This is where the PaaS infrastructure still has a public IP, but you can restrict access to a VNet/Subnet(s). This only works with Azure originating traffic (you can exempt your on-prem PIP) but uses the Azure backbone network so it provides improved performance and at least stops people on the internet confirming if you really did set the security on that blob correctly.

Private Endpoints: This is where the security team normally stop saying no, and smile when you mention RFC1918. It lets you create an endpoint in your subnet with a private IP address, and that becomes the target for your traffic, so it all stays internal. Of course, to do this you’ll need Private DNS and there is a cost associated, but who said being secure was going to be free!

Anyways, back to AKS. Microsoft has now made it possible to deploy a Private AKS cluster…well I say now, it was a little while back, but when we actually tried to do it in anger we hit lots of bugs…you know, that preview experience, aka beta. Now it works!

We love automation at jtwo, and in an ideal world for Azure we like to use Azure DevOps pipelines to deploy ARM templates. But we also know that if you’re playing with a brand new toy/technology, using the Azure CLI often gets you a bit further along the road and then you template it later. So my initial CLI command was as follows:

# PowerShell: the trailing backticks continue the command onto the next line
az aks create --name $clusterName -g $mainRG --enable-vmss --network-plugin kubenet `
  --load-balancer-sku standard --enable-private-cluster --node-resource-group $nodeRG `
  -k 1.17.9 --docker-bridge-address 172.18.0.1/16 --dns-service-ip 172.17.0.10 `
  --service-cidr 172.17.0.0/16 --pod-cidr 172.16.0.0/16 --vnet-subnet-id $subnetId `
  --outbound-type userDefinedRouting --service-principal $sp.appId --client-secret $sp.password `
  --generate-ssh-keys | ConvertFrom-Json

Some points to call out:

  • We wanted scale sets: --enable-vmss
  • We decided to use kubenet: --network-plugin kubenet. You can use Azure CNI (--network-plugin azure) for the more advanced features, but it wasn’t necessary and we’re fans of KISS
  • We wanted a standard load balancer: --load-balancer-sku standard. This is the default for AKS clusters too
  • We wanted to control the managed RG: --node-resource-group $nodeRG. Yeah I know, this is just aesthetics, but you might as well keep it clean. You have to run the aks-preview CLI extension for this nicety
  • We wanted to deploy it into an existing VNet/Subnet: --vnet-subnet-id. This isn’t mandatory, and if you don’t specify it you’ll get a new VNet as part of CLI execution, but that’s never going to fly with our security friends
  • We wanted to control routing via a FW: --outbound-type userDefinedRouting. It’s good to snoop on internet traffic, right?
  • We then configured a service principal: --service-principal $sp.appId --client-secret $sp.password. In hindsight, however, I’d recommend --enable-managed-identity instead, much more elegant!
  • And finally we controlled the SSH keys for troubleshooting the nodes: --generate-ssh-keys
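
The address ranges in that command have to fit together: the DNS service IP needs to sit inside the service CIDR, and the pod, service and Docker bridge ranges shouldn’t overlap each other. Here is a minimal Python sketch for sanity-checking the plan before you run the CLI. The helper name check_aks_address_plan is made up for illustration (it is not part of any Azure tooling), the "don’t use the first address" rule is my reading of the AKS networking guidance, and the values are simply the ones from the command above.

```python
import ipaddress
from itertools import combinations


def check_aks_address_plan(pod_cidr, service_cidr, docker_bridge, dns_service_ip):
    """Return a list of problems in a kubenet address plan (empty list = looks OK).

    Hypothetical helper for illustration only.
    """
    problems = []
    ranges = {
        "pod-cidr": ipaddress.ip_network(pod_cidr),
        "service-cidr": ipaddress.ip_network(service_cidr),
        # The bridge is given as address/prefix (e.g. 172.18.0.1/16),
        # so derive the network it belongs to.
        "docker-bridge-address": ipaddress.ip_interface(docker_bridge).network,
    }

    # No two of the ranges should overlap.
    for (name_a, net_a), (name_b, net_b) in combinations(ranges.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{name_a} {net_a} overlaps {name_b} {net_b}")

    # The DNS service IP must be inside the service CIDR.
    dns_ip = ipaddress.ip_address(dns_service_ip)
    service_net = ranges["service-cidr"]
    if dns_ip not in service_net:
        problems.append(f"dns-service-ip {dns_ip} is not inside {service_net}")
    # Assumption: the first address of the service range is taken by the
    # built-in kubernetes service, so the DNS IP should not reuse it.
    elif dns_ip == service_net.network_address + 1:
        problems.append(f"dns-service-ip {dns_ip} is the first address of {service_net}")

    return problems


# Values taken from the az aks create command above.
problems = check_aks_address_plan(
    pod_cidr="172.16.0.0/16",
    service_cidr="172.17.0.0/16",
    docker_bridge="172.18.0.1/16",
    dns_service_ip="172.17.0.10",
)
print(problems or "address plan looks consistent")
```

It doesn’t check against your VNet or on-premises ranges; you’d want to add those to the ranges dictionary if you’re tight on address space.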

So that got us up and running and all things were good. But remember that ‘doing things properly’ quip I made earlier? Here is the ARM template, albeit heavily parameterised:

"resources": [
  {
    "type": "Microsoft.ContainerService/managedClusters",
    "apiVersion": "2020-04-01",
    "name": "[parameters('clusterName')]",
    "location": "[parameters('location')]",
    "sku": {
      "name": "Basic",
      "tier": "Free"
    },
    "identity": {
      "type": "SystemAssigned"
    },
    "properties": {
      "kubernetesVersion": "1.17.9",
      "dnsPrefix": "[parameters('dnsPrefix')]",
      "agentPoolProfiles": [
        {
          "name": "nodepool1",
          "count": 3,
          "vmSize": "Standard_D2s_v3",
          "osDiskSizeGB": 60,
          "vnetSubnetID": "[variables('subnetId')]",
          "maxPods": 110,
          "type": "[parameters('clusterType')]",
          "orchestratorVersion": "1.17.9",
          "enableNodePublicIP": false,
          "nodeLabels": {},
          "mode": "System",
          "osType": "Linux"
        }
      ],
      "linuxProfile": {
        "adminUsername": "azureuser",
        "ssh": {
          "publicKeys": [
            {
              "keyData": "[parameters('keyData')]"
            }
          ]
        }
      },
      "nodeResourceGroup": "[parameters('nodeRG')]",
      "enableRBAC": true,
      "enablePodSecurityPolicy": false,
      "networkProfile": {
        "networkPlugin": "[parameters('networkProfile')]",
        "loadBalancerSku": "[parameters('loadBalancerSku')]",
        "podCidr": "[parameters('podCidr')]",
        "serviceCidr": "[parameters('serviceCidr')]",
        "dnsServiceIP": "[parameters('dnsServiceIP')]",
        "dockerBridgeCidr": "[parameters('dockerBridgeCidr')]",
        "outboundType": "[parameters('outboundType')]"
      },
      "apiServerAccessProfile": {
        "enablePrivateCluster": "[parameters('enablePrivateCluster')]"
      }
    }
  },
  {
    "type": "Microsoft.ContainerService/managedClusters/agentPools",
    "apiVersion": "2020-04-01",
    "name": "[concat(parameters('clusterName'), '/nodepool1')]",
    "dependsOn": [
      "[resourceId('Microsoft.ContainerService/managedClusters', parameters('clusterName'))]"
    ],
    "properties": {
      "count": 3,
      "vmSize": "Standard_D2s_v3",
      "osDiskSizeGB": 60,
      "vnetSubnetID": "[variables('subnetId')]",
      "maxPods": 110,
      "type": "[parameters('clusterType')]",
      "orchestratorVersion": "1.17.9",
      "enableNodePublicIP": false,
      "nodeLabels": {},
      "mode": "System",
      "osType": "Linux"
    }
  }
]

Again, some points to call out:

  • We took our own advice from earlier and used a system-assigned managed identity, further info available here
  • The enablePrivateCluster value should be boolean true
  • outboundType should be userDefinedRouting, as before
  • networkPlugin will be kubenet as we discussed, KISS etc…
  • vnetSubnetID is the ID of an existing VNet/Subnet so we can integrate with our previously agreed network architecture
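
If you keep those values in a parameters file, a quick check before deployment can catch the classic mistake of passing "true" as a string. The Python sketch below assumes the standard ARM parameters-file layout (a top-level "parameters" object whose entries each carry a "value") and the parameter names used in the template above. The check_private_cluster_parameters helper and the inline sample are illustrative only, not something from our pipeline.

```python
import json

# Expected values for a private, kubenet, UDR-routed cluster, keyed by the
# parameter names used in the template above.
EXPECTED = {
    "enablePrivateCluster": True,          # must be a real boolean
    "outboundType": "userDefinedRouting",
    "networkProfile": "kubenet",           # this parameter feeds networkPlugin
}


def check_private_cluster_parameters(parameters_json):
    """Return a list of mismatches between a parameters file and EXPECTED."""
    params = json.loads(parameters_json).get("parameters", {})
    problems = []
    for name, expected in EXPECTED.items():
        actual = params.get(name, {}).get("value")
        # Compare the type as well as the value so the string "true" is rejected.
        if type(actual) is not type(expected) or actual != expected:
            problems.append(f"{name}: expected {expected!r}, got {actual!r}")
    return problems


# Hypothetical parameters file content with the string/boolean mix-up.
sample = """
{
  "parameters": {
    "enablePrivateCluster": { "value": "true" },
    "outboundType": { "value": "userDefinedRouting" },
    "networkProfile": { "value": "kubenet" }
  }
}
"""
for problem in check_private_cluster_parameters(sample):
    print(problem)
```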

And there you have it, an AKS cluster deployed with the control plane on a private IP address. Time to crack that tinny for a job well done… But remember that private DNS thing I skimmed over? Well, there are some considerations there. 168.63.129.16 is the Azure DNS address and it’s really important in this scenario, especially if you want on-premises devices to be able to connect. Microsoft do a very good job of explaining this here, but the quick summary is to make sure your on-premises DNS server forwards resolution to Azure DNS (in practice via a DNS server inside Azure, as 168.63.129.16 is only reachable from within a VNet). Azure DNS knows about the private zones, which stops external resolution being attempted and failing. Point 3 is key as, let’s be honest, you’ll normally have at least a DC in Azure with custom DNS configured on the VNet:

1. By default, when a private cluster is provisioned, a private endpoint (1) and a private DNS zone (2) are created in the cluster managed resource group. The cluster uses an A record in the private zone to resolve the IP of the private endpoint for communication to the API server.
2. The private DNS zone is linked only to the VNet that the cluster nodes are attached to (3). This means that the private endpoint can only be resolved by hosts in that linked VNet. In scenarios where no custom DNS is configured on the VNet (default), this works without issue as hosts point at 168.63.129.16 for DNS which can resolve records in the private DNS zone because of the link.
3. In scenarios where the VNet containing your cluster has custom DNS settings (4), cluster deployment fails unless the private DNS zone is linked to the VNet that contains the custom DNS resolvers (5). This link can be created manually after the private zone is created during cluster provisioning or via automation upon detection of creation of the zone using event-based deployment mechanisms (for example, Azure Event Grid and Azure Functions).
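
A simple way to confirm the DNS plumbing from any given host (a node, the DC, an on-premises jump box) is to resolve the cluster’s API server name and check that the answer is an RFC1918 address. Here is a minimal Python sketch using only the standard library; the helper names are mine, and you would pass in your own cluster’s private FQDN rather than anything hard-coded here.

```python
import ipaddress
import socket

# The three private IPv4 ranges defined by RFC1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address):
    """True if the given IP address string falls inside an RFC1918 range."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)


def api_server_resolves_privately(fqdn):
    """Resolve fqdn using this host's DNS settings and check the result.

    Returns False if the name does not resolve at all, which is what you
    would typically see when the private DNS zone is not linked or the
    forwarder chain is broken.
    """
    try:
        address = socket.gethostbyname(fqdn)
    except socket.gaierror:
        return False
    return is_rfc1918(address)
```

Call api_server_resolves_privately with the API server FQDN from your kubeconfig on each host that needs to reach the cluster; a False from on-premises but True from inside the VNet usually points at the forwarding or zone link described above.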

Now did someone mention a tinny?