Fault Tolerant Multi-AZ EC2, on a Beer Budget – Live from the AWS Meetup

Filmed on 18th of March 2021 at the Adelaide AWS User Group, where Arran Peterson presented on how to put together best practice (and cheap!) cloud architecture for business continuity. The title:

“Enterprise grade fault tolerant multi-AZ server deployment on a beer budget”

Recording

RATED ‘PG’ – Mild coarse language.

Presenter

Arran Peterson

Arran is an Infrastructure Consultant with a passion for Microsoft Unified Communications and the true flexibility and scalability of cloud-based solutions.
As a Senior Consultant, Arran brings his expertise in enterprise environments to work with clients across the Microsoft Unified Communications product portfolio of Office 365, Exchange and Skype/Teams, along with expertise in transitioning to cloud-based platforms including AWS, Azure and Google.

More Reading

Amazon Elastic Block Store

https://aws.amazon.com/ebs/

AWS Sydney outage prompts architecture rethink

https://www.itnews.com.au/news/aws-sydney-outage-prompts-architecture-rethink-420506

Chalice Framework

https://aws.github.io/chalice/

Adelaide AWS User Group

https://www.meetup.com/en-AU/Amazon-Web-Services-User-Group-Adelaide/events/276728885/


Azure Migrate – Additional Firewall Rules

When deploying Azure Migrate appliances to discover servers, the appliance needs outbound internet access. In many IT environments, servers are denied internet access except to prescribed URL sets. Thankfully, Microsoft have published a list of the URLs they believe the appliance will need whitelisted. This can be found here:

https://docs.microsoft.com/en-us/azure/migrate/migrate-appliance#public-cloud-urls
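Before digging into firewall rules, it can be worth confirming from the appliance itself which endpoints are reachable. A minimal PowerShell sketch, assuming you run it on the appliance and substitute whichever hostnames from the Microsoft list apply to your scenario:

# Quick outbound check against a few of the documented URLs.
# Hostnames here are a sample; substitute entries from the Microsoft list above.
$endpoints = 'login.microsoftonline.com', 'management.azure.com', 'aka.ms'
foreach ($e in $endpoints) {
    Test-NetConnection -ComputerName $e -Port 443 |
        Select-Object ComputerName, RemoteAddress, TcpTestSucceeded
}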

Issue

Once the appliance has booted and you’re on the GUI, you must enter the Azure Migrate project key from your subscription and then authenticate to it. We encountered the following error when attempting to resolve the initial key:

Azure Migrate Error

Failed to connect to the Azure Migrate project. Check the errors details, follow the remediation steps and click on ‘Retry’ button

The Azure Migrate key doesn’t have an expiration on it, so that wasn’t the issue. We had whitelisted the URLs, but on the firewall we were seeing dropped packets:

Time      Action        Protocol  Source            Destination
13:40:41  Default DROP  TCP       10.0.0.10:50860   204.79.197.219:443
13:40:41  Default DROP  TCP       10.0.0.10:50861   204.79.197.219:80
13:40:41  Default DROP  TCP       10.0.0.10:50857   152.199.39.242:443
13:40:42  Default DROP  TCP       10.0.0.10:50862   204.79.197.219:80
13:40:42  Default DROP  TCP       10.0.0.10:50858   104.74.50.201:80
13:40:43  Default DROP  TCP       10.0.0.10:50863   52.152.110.14:443
13:40:44  Default DROP  TCP       10.0.0.10:50860   204.79.197.219:443
13:40:44  Default DROP  TCP       10.0.0.10:50861   204.79.197.219:80
13:40:45  Default DROP  TCP       10.0.0.10:50862   204.79.197.219:80
13:40:46  Default DROP  TCP       10.0.0.10:50863   52.152.110.14:443
13:40:46  Default DROP  TCP       10.0.0.10:50859   204.79.197.219:443
13:40:47  Default DROP  TCP       10.0.0.10:50864   40.90.189.152:443
13:40:47  Default DROP  TCP       10.0.0.10:50865   52.114.36.3:443
13:40:49  Default DROP  TCP       10.0.0.10:50864   40.90.189.152:443
13:40:50  Default DROP  TCP       10.0.0.10:50865   52.114.36.3:443
13:40:50  Default DROP  TCP       10.0.0.10:50860   204.79.197.219:443
13:40:50  Default DROP  TCP       10.0.0.10:50861   204.79.197.219:80
13:40:51  Default DROP  TCP       10.0.0.10:50862   204.79.197.219:80
13:40:52  Default DROP  TCP       10.0.0.10:50863   52.152.110.14:443
Subset of the dropped packets based on IP destination during connection failure

Reviewing the SSL certificates presented by these IP addresses, they are all Microsoft services with multiple SAN entries. We also had a look at the traffic in the browser’s developer tools:

We can see that the browser is trying to start an AAD device login workflow, which is described in the onboarding documentation. Our issue was that the JavaScript loaded inside the browser session wasn’t covered by the whitelisted URLs. Reviewing the SAN entries in the certificates presented by the destination IPs in the table above, we looked for ‘CDN’ or ‘Edge’ URLs.
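If you’d rather script the certificate review than click through a browser, the SAN list can be pulled straight off the endpoint. A sketch using .NET classes from PowerShell; the IP is one of the dropped destinations above, and the validation callback accepts any certificate because we only want to inspect it:

# Open a TLS session to the destination IP and dump the certificate's SAN entries.
$ip  = '152.199.39.242'
$tcp = New-Object System.Net.Sockets.TcpClient($ip, 443)
$ssl = New-Object System.Net.Security.SslStream($tcp.GetStream(), $false, { $true })
$ssl.AuthenticateAsClient($ip)
$cert = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2($ssl.RemoteCertificate)
# OID 2.5.29.17 is the Subject Alternative Name extension.
($cert.Extensions | Where-Object { $_.Oid.Value -eq '2.5.29.17' }).Format($true)
$ssl.Dispose(); $tcp.Dispose()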

The fix

The following URLs were added to the whitelist group for the appliance and the problems went away.

IP address       URL
204.79.197.219   *.azureedge.net
152.199.39.242   *.azureedge.net
152.199.39.242   *.wpc.azureedge.net
152.199.39.242   *.wac.azureedge.net
152.199.39.242   *.adn.azureedge.net
152.199.39.242   *.fms.azureedge.net
152.199.39.242   *.azureedge-test.net
152.199.39.242   *.ec.azureedge.net
152.199.39.242   *.wpc.ec.azureedge.net
152.199.39.242   *.wac.ec.azureedge.net
152.199.39.242   *.adn.ec.azureedge.net
152.199.39.242   *.fms.ec.azureedge.net
152.199.39.242   *.aspnetcdn.com
152.199.39.242   *.azurecomcdn.net
152.199.39.242   cdnads.msads.net
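How you express these entries depends entirely on your firewall. Purely as an illustration, if the edge device were an Azure Firewall, an equivalent FQDN-based application rule might look like the sketch below; the rule, collection, firewall and resource group names are placeholders:

# Sketch only: allow the CDN FQDNs above for the appliance's source address.
$rule = New-AzFirewallApplicationRule -Name 'AzureMigrateCdn' `
    -SourceAddress '10.0.0.10/32' `
    -TargetFqdn '*.azureedge.net', '*.aspnetcdn.com', '*.azurecomcdn.net', 'cdnads.msads.net' `
    -Protocol 'Http:80', 'Https:443'
$collection = New-AzFirewallApplicationRuleCollection -Name 'MigrateAppliance' `
    -Priority 200 -ActionType Allow -Rule $rule
$fw = Get-AzFirewall -Name 'corp-fw' -ResourceGroupName 'network-rg'   # placeholder names
$fw.AddApplicationRuleCollection($collection)
Set-AzFirewall -AzureFirewall $fw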


Disable-CsAdForest – “Cannot remove the Active Directory settings for the domain due to ‘FE’ still being activated”

I’ve spent 15 years deploying on-premises versions of Microsoft Unified Communications, namely OCS, Lync and Skype for Business. During that period I did a lot of installations, but never had I done a full removal of the product. I guess that speaks to the usability of Microsoft voice solutions: once you’re in, the years just roll by like a good marriage. But all good things must come to an end, and with Teams being purely cloud-based, no schema objects need to remain in Active Directory.

When attempting the final cleanup steps of the environment, I was getting the following output from Disable-CsAdDomain:

Disable-CsAdDomain : DomainUnprepareTask execution failed on an unrecoverable error.
At line:1 char:1
+ Disable-CsAdDomain
+ ~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:SourceCollection) [Disable-CsAdDomain], DeploymentException
    + FullyQualifiedErrorId : TaskFailed,Microsoft.Rtc.Management.Deployment.UnprepareDomainCmdlet
WARNING: Disable-CsAdDomain encountered errors. Consult the log file for a detailed analysis, and ensure all errors (2)
 and warnings (0) are addressed before continuing.

The HTML report presented me with some red:

Error: Cannot remove the Active Directory settings for the domain due to “FE” still being activated.

I had a few hours of scratching my head. I’d fully uninstalled the Skype for Business Server software (minus the administrative tools) from the last Front End in the environment. Even the CMS had been deleted, so why did it think it was still active?! No CMS means no jersey for a Front End server. I waited for domain replication but still no change.

Solution

The secret is to remove the last Front End server’s computer object from the domain, install the administrative tools on another machine, and re-run the cmdlet. Simple, but not obvious.
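In script form, the working order of operations looked roughly like this; ‘FE01’ is a placeholder for your last Front End’s computer account, run from a machine with the Active Directory module and the Skype for Business administrative tools installed:

# Remove the stale Front End computer object ('FE01' is a placeholder name).
Remove-ADComputer -Identity 'FE01' -Confirm:$false
# Allow AD replication to complete, then re-run the unprepare steps.
Disable-CsAdDomain
Disable-CsAdForest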

Thanks to Michael for the bright idea on this one.


Azure Bastion – Unable to query Bastion data.

I’ve recently set up Azure Bastion to give external users/vendors access to resources via RDP or SSH, following these instructions:

https://docs.microsoft.com/en-us/azure/bastion/tutorial-create-host-portal

The key permissions outlined in point 3 of the prerequisites are:

  • A virtual network.
  • A Windows virtual machine in the virtual network.
  • The following required roles:
    • Reader role on the virtual machine.
    • Reader role on the NIC with private IP of the virtual machine.
    • Reader role on the Azure Bastion resource.
  • Ports: To connect to the Windows VM, you must have the following ports open on your Windows VM:
    • Inbound ports: RDP (3389)

My scenario is to invite a guest AAD account, add them to a group and grant the group access as per below:

  • Grant Contributor role to the resource group that has the VMs for the application.
  • Grant Reader role to the resource group that has the Bastion host.

This way the guest user logs into the Azure Portal, complying with our conditional access policy, and is presented with only the resources they have read or higher access to. In this scenario, that is the two resource groups outlined above.
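For reference, those group-based assignments can be scripted rather than clicked through. A sketch with the Az PowerShell module, where the group name and resource group names are placeholders for my scenario:

# Placeholder names: 'Vendor-RDP-Users' group, 'app-rg' and 'bastion-rg' resource groups.
$group = Get-AzADGroup -DisplayName 'Vendor-RDP-Users'
New-AzRoleAssignment -ObjectId $group.Id -RoleDefinitionName 'Contributor' `
    -ResourceGroupName 'app-rg'       # resource group holding the application VMs
New-AzRoleAssignment -ObjectId $group.Id -RoleDefinitionName 'Reader' `
    -ResourceGroupName 'bastion-rg'   # resource group holding the Bastion host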

The guest user locates the virtual machine they wish to connect to and chooses Connect > Bastion > Use Bastion, at which point the following error message is presented.

Error Message:

“Unable to query Bastion data”

Initially, working with Microsoft support, we found that granting Reader access at the subscription level gave the user permission to invoke the Bastion service, which simply presents a username and password input.

These permissions were too lax as a workaround and exposed a lot of production data to accounts that didn’t really have any business looking at it.

Workaround

[12/11/2020] The case is ongoing inside Microsoft and I will post a definitive answer when I get the information. In the meantime I’ve done some further investigation into the minimum set of additional ‘Reader‘ permissions required. I’ve found the following are required in my scenario:

  • Reader permissions on the Virtual Network that has the ‘AzureBastionSubnet‘ subnet.
  • Reader permissions on the Virtual Network that has the connected virtual machine network interface.

In my scenario, the virtual machines are located in a development Virtual Network that is peered with the production Virtual Network containing the ‘AzureBastionSubnet‘ subnet, so I had two sets of permissions to add. After applying the permissions you may need to get a coffee and return to the portal, as it took 5-10 minutes to kick in for me.
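The extra Virtual Network permissions can be granted the same way; again a sketch with placeholder names for the two peered VNets and their resource groups:

# 'prod-vnet' contains AzureBastionSubnet; 'dev-vnet' contains the VM NICs.
$group = Get-AzADGroup -DisplayName 'Vendor-RDP-Users'
foreach ($v in @(
        @{ Name = 'prod-vnet'; Rg = 'network-rg' },
        @{ Name = 'dev-vnet';  Rg = 'dev-rg' })) {
    $scope = (Get-AzVirtualNetwork -Name $v.Name -ResourceGroupName $v.Rg).Id
    New-AzRoleAssignment -ObjectId $group.Id -RoleDefinitionName 'Reader' -Scope $scope
}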

Hope this helps someone who has done some googling but is still scratching their head over this error message.