Cost-effective design for SAP High Availability in Microsoft Azure

Be careful with resources across different Azure Availability Zones

Organisations that deploy SAP resources on Microsoft Azure are about to face significant changes in the way they are charged for accessing those resources across different Azure Availability Zones. This is especially important for systems that need a high level of availability, such as SAP systems, because distributing SAP resources across different zones is the accepted best practice to maximise the everyday resilience of business systems.

In this blog, I will discuss the implications for current best practice and explain other options that you might be considering once the changes kick in from 1 July 2022. Note that we’re talking about day-to-day availability and resilience, not disaster recovery if an entire Azure region goes down.

What is the current best practice to maximise the availability of SAP ABAP Platform or SAP NetWeaver within a single region?

If you’re looking to maximise the availability and resilience of your SAP deployment across a single region, Azure Availability Zones can help protect your computing resources.

Availability Zones first arrived on Azure in 2018 and most regions now support them. Each region typically includes three separate zones, each with its own segregated power and other features that allow it to continue functioning independently if another zone fails.

If the entire region goes down then so will all the zones, but barring that catastrophic scenario, Azure Availability Zones are considered the robust choice for building a high-availability SAP deployment within an Azure region.

The Microsoft reference architecture for SAP deployments in Azure shows the use of Availability Zones across both the SAP ABAP Platform (for S/4HANA) and SAP NetWeaver (for ECC/ERP) tiers.

Let’s consider a deployment with the ABAP SAP Central Services (ASCS), primary application (app) server and primary database (DB) server, grouped together in one zone, and the Enqueue Replication Server (ERS), secondary app server and secondary DB server in another.
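
To make that layout concrete, here is a minimal Python sketch of it as a plain data structure, with a check that every high-availability pair is split across two zones. The component names and the zone numbers 1 and 2 are my own labels for illustration; the text only says that the two groups sit in different zones.

```python
# Hypothetical zone numbers: the text only says the two groups are in different zones.
layout = {
    "ASCS": 1, "app-primary": 1, "db-primary": 1,
    "ERS": 2, "app-secondary": 2, "db-secondary": 2,
}

# Each tuple is a component and its high-availability partner.
ha_pairs = [
    ("ASCS", "ERS"),
    ("app-primary", "app-secondary"),
    ("db-primary", "db-secondary"),
]

def pairs_sharing_a_zone(layout, pairs):
    """Return the HA pairs whose two members were placed in the same zone."""
    return [(a, b) for a, b in pairs if layout[a] == layout[b]]

# An empty list means every pair spans two zones.
print(pairs_sharing_a_zone(layout, ha_pairs))
```

If the list is not empty, a single zone failure could take out both halves of the pairs it reports.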

Are there other design choices for deploying high availability within a single region?

Availability Zones are popular but they weren’t available for use with SAP until March 2018.

Before Availability Zones, you would simply double up your resource tiers (ASCS/ERS, 2x app servers and 2x database servers) and use Availability Sets to group your resources and to minimise any downtime caused by physical host issues. This is still an option and is a relatively simple setup.

However, Availability Sets provide minimal control over how your resources are dispersed within the Azure region. Microsoft Azure can deploy them any way it wants.

There is no charge for using Availability Sets, but you must take care to apply Availability Sets accurately to the correct tiers. The way in which resources are deployed also means that any zone can be used for any resource, which could increase network latency.

Another alternative uses the concept of Proximity Placement Groups (PPGs), which Microsoft introduced on Azure in December 2019. Unlike Availability Sets, PPGs allow some level of control over where resources are deployed.

A PPG is simply a method of grouping resources so that Azure will attempt to deploy them as close as possible to each other to reduce network latency. This could be within the same zone, same room, within the same physical rack or even on the same physical host, but you still have no control over the actual placement within a region – this is decided by the Azure Resource Manager.

The aim with PPGs is to ‘align’ within the tiers. When a compute resource is aligned, it is deployed close to the so-called anchor point, which is the first resource allocated inside a PPG. Deploying the anchor in a particular zone will force the aligned resources to also be deployed in that zone.

So far so good, but there are some downsides to PPGs that need serious consideration. In fact, in late 2021, SAP and Microsoft advised against using PPGs unless you have serious latency problems. This is because the requirement for aligned status forces resources to be deployed close to the anchor, but if the capacity (in a rack, for example) is insufficient, you may experience allocation issues.

The situation is not helped by the fact that Microsoft recommends anchor resources should be the least abundant of the compute resources within your PPG (like an M-Series for HANA-related PPGs). This is contrary to some other Microsoft documentation and examples, which mention using the SAP ASCS or SCS Virtual Machines as the anchor points.

Using PPGs by themselves, you do not have the assurance that your resources are running on different physical racks. This means your system is more vulnerable to failures than if you used Availability Sets.

Can we mix Availability Sets and PPGs?

You can create a hybrid architecture by assigning a PPG to an Availability Set. This makes it easy to assign a PPG to all the resources that are created within the Availability Set, which sounds like a good idea in theory, because you would get the availability protection of Availability Sets and the proximity control of a PPG. However, when you look at what happens in practice, this hybrid approach only provides a benefit if you are using multiple Availability Zones.

All the resources are indeed protected with Availability Sets, but they are then grouped in close proximity and in the same zone (you assign the zone during anchor VM creation). Moreover, the use of the PPG means you need twice the quantity of least abundant resources to be able to successfully allocate all your other resources (if you follow the SAP & Microsoft guidance from the SAP notes).

If you have multiple Availability Zones, then the hybrid approach of assigning a PPG to an Availability Set can help control latency within your SAP system and also ensure physical separation, but the Availability Set design is slightly different:

By placing the DB and app Virtual Machines within the same zone into the same Availability Set, Azure guarantees they will not be running on the same physical host, so they are unlikely to fail at the same time (see Update Domains and Fault Domains). The diagram above shows the ASCS and ERS as separate, but they could be included in the same Availability Sets in their respective zones.

What’s happening in July 2022?

Until now, using Availability Zones has been an obvious choice, because there was little to no cost implication from using them. However, on 1 July 2022, Microsoft will start charging for inter-zonal data transfer (ingress and egress) when Availability Zones are used: https://azure.microsoft.com/en-us/pricing/details/bandwidth/#faq

This means that any data transfer between (for example) Zone 1 and Zone 2 would be chargeable at £0.008 per GB.

NOTE: This excludes the persistence of data from app to DB as it's difficult to measure.

That may not sound like much but it will soon add up. Here’s a sample calculation I ran for a customer:

Let’s assume one of our two SAP app servers is pulling 503GB of data per day from a HANA database that is ~2TB in size. 503GB x £0.008 ≈ £4.00 per day, and £4.00 x 365 days = £1,460.00 per year (just for egress).

That’s £1,460.00 per year just for getting the data out of the DB when the DB is in a different zone from the app server.
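
The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch rather than a pricing tool: the function name and its parameters are my own, the rate of £0.008 per GB and the 503GB daily volume come from the example, and the charge is counted in one direction only. Without rounding the daily figure to £4.00, the annual total comes out slightly higher than £1,460.

```python
def annual_transfer_cost(gb_per_day: float, rate_per_gb: float = 0.008, days: int = 365) -> float:
    """Yearly cost in GBP of moving gb_per_day between two zones, one direction only."""
    return gb_per_day * rate_per_gb * days

daily = 503 * 0.008
print(f"Daily cost:  £{daily:.2f}")                         # £4.02
print(f"Annual cost: £{annual_transfer_cost(503):,.2f}")    # £1,468.76
```

Changing `gb_per_day` to your own measured throughput gives a first estimate for a single cross-zone app server.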

I then looked at the DB replication for a small-ish SAP system with ~200 concurrent users:

£255 per year for HANA System Replication (estimated from the transaction log volume transferred to secondary DB)

£127 per year for one cross-zone app server data input into DB (based on 50% of the transaction log throughput)

£1,460 per year for one cross-zone app server data retrieval out of DB (our previous calculation, egress charges only)

If we assume that the second app server is in the same zone as the DB, the total cost comes to £1,842 per annum (counting each transfer in one direction only) for just one system environment. Now imagine you had pre-production and quality assurance using the same architecture, sizing and throughput.
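
As a quick check of that total, and of how it scales, here is a small Python sketch. The three amounts are the figures quoted above; the helper name and the assumption that pre-production and quality assurance cost exactly the same as production are mine, for illustration only.

```python
# Annual one-direction transfer costs in GBP, taken from the figures above.
line_items = {
    "HANA System Replication": 255,
    "cross-zone app server writes into DB": 127,
    "cross-zone app server reads out of DB": 1460,
}

def landscape_cost(items: dict[str, int], environments: int = 1) -> int:
    """Sum the line items and scale by the number of identical environments."""
    return sum(items.values()) * environments

print(landscape_cost(line_items))     # one environment: 1842
print(landscape_cost(line_items, 3))  # production, pre-production and QA: 5526
```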

This is a simple example, but for many SAP implementations, the costing situation will be much more complex. In a worst-case scenario, you could be paying five times over for the same data as it is transferred repeatedly between zones!

This is obviously a rare case and probably not what most organisations will have, but as we have discussed, the use of Availability Zones is the optimal solution for High Availability, so many different architectural designs incorporating Availability Zones may already have been created.

What does this price change mean for your architecture design?

Quite simply, you will now have to pay a cost penalty when designing your architecture for maximum availability.

You will therefore need to check whether your risk acceptance level, your RPO (Recovery Point Objective) and your RTO (Recovery Time Objective) align with your solution design choice and the hosting costs per month.

If we still use Availability Zones, what can we do to minimise inter-zonal data ingress/egress costs?

So far, I've shown images of a misaligned landscape incurring inter-zonal ingress/egress data transfer costs, but there is another possible version in which the tiers are fully aligned.

The fully aligned version effectively reduces transfer costs. You will still need to pay for the HANA System Replication (or whatever DB level replication you use), as well as for some inter-zonal transfer, but the aim is to keep all the SLT replication data transfer (from HANA -> S/4 -> SLT -> HANA) within the same zone.
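
The difference between the two landscapes can be sketched in Python by listing the data flows and summing only those that cross a zone boundary. Everything in this example is an assumption made for illustration: the component names, the zone assignments and the daily volumes are invented, not measured.

```python
def cross_zone_gb(flows, zones):
    """Total daily GB for flows whose source and target sit in different zones."""
    return sum(gb for src, dst, gb in flows if zones[src] != zones[dst])

# Illustrative daily volumes in GB (assumptions, not measurements).
flows = [
    ("hana-primary", "s4-app", 500),          # app server reading from the DB
    ("s4-app", "slt", 100),                   # S/4 feeding SLT
    ("slt", "target-hana", 100),              # SLT writing to its target HANA
    ("hana-primary", "hana-secondary", 90),   # HANA System Replication
]

# Hypothetical zone assignments for the two landscapes.
misaligned = {"hana-primary": 1, "s4-app": 2, "slt": 1, "target-hana": 2, "hana-secondary": 2}
aligned = {"hana-primary": 1, "s4-app": 1, "slt": 1, "target-hana": 1, "hana-secondary": 2}

print(cross_zone_gb(flows, misaligned))  # 790: every flow crosses a zone
print(cross_zone_gb(flows, aligned))     # 90: only the DB replication crosses
```

In the aligned case only the database replication remains chargeable, which matches the point that you cannot avoid paying for HANA System Replication.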

The bad news is that this is unlikely to be feasible in practice. SAP NetWeaver and SAP ABAP Platform are designed to provide load-balancing and high availability, so trying to re-configure them not to provide these things is going to be difficult. You may end up with an infrastructure-level solution to divert the connections, but in the end, it will be difficult to manage.

So what can you do?

First, don’t pull so much data to the app servers.

Since the dawn of computing, programmers have sought to reduce the use of expensive computing operations. They traditionally did this via compression and caching. For instance, an SAP NetWeaver or SAP ABAP Platform environment has a built-in mechanism for caching table data, known as table buffers.

Before fast network connections between the app server and DB, the roundtrip time to the SAP GUI front-end was reduced by caching recently accessed data in memory within the app servers. However, SAP HANA made many of us forget all about this feature since the response time from HANA is quick and network throughput has increased over the years.

You may benefit from getting your BASIS team to correctly size those table buffers and revisit transaction ST02 to see how many swaps you have (each resulting in DB lookups).

As a last resort, make sure that end-users do not have the capability of pulling ridiculous volumes of data via the new SE16* group of transactions. There is never any need to extract all data to a small screen for one person to view. If that's one of your business processes, it may cost you dearly after July 2022.

References:

https://azure.microsoft.com/en-us/pricing/details/bandwidth/#faq

https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/sap-high-availability-architecture-scenarios

https://docs.microsoft.com/en-us/azure/virtual-machines/availability-set-overview

https://docs.microsoft.com/en-us/azure/virtual-machines/co-location

https://docs.microsoft.com/en-us/azure/availability-zones/az-overview?context=/azure/virtual-machines/context/context

https://docs.microsoft.com/en-us/rest/api/resources/subscriptions/check-zone-peers

https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/sap-proximity-placement-scenarios#combine-availability-sets-and-availability-zones-with-proximity-placement-groups

https://docs.microsoft.com/en-us/azure/virtual-machines/availability-set-overview#how-do-availability-sets-work

SAP note 2931465 - when to use PPGs in Azure.
