Live migrating VMs - cool tech for the future that you should avoid using today (Review & Benchmark of GCP E2 Instances)

Eugene Cheah - Dec 16 '19 - - Dev Community

Should I buy summary: NO

Unless you are using GCP E2 specifically for non-CPU-dependent burst workloads (e.g. background tasks), temporary autoscaling, or are planning to use a committed use discount - stay with N1/N2.

For nearly all other workloads, it makes no economic sense to use E2, due to its roughly 50% reduction in CPU performance.

If you want the raw numbers, skip the prelude and go straight to the benchmark section below, or alternatively to our GitHub.

However, after benchmarking it and reviewing the tech behind it, I realized there is a lot more than meets the eye - E2 could potentially be game-changing for cloud providers in the future ... assuming they have the "courage" to use it ...

(read on to learn more)


Sample Uilicious server usage

Prelude: Understanding the VM workload

Before understanding E2, you will need to understand a typical VM workload.

Most server VM workloads are underutilized "most of the time", with the occasional bump in resource consumption during certain hours of the day, or when they appear on Reddit.

This can easily be observed by looking at most VM resource usage charts.

The chart above is a browser test automation server, attached to one of several clients within uilicious.com. Notice how this server spikes twice with a small gap in between.

This is extremely typical for work applications, representing the start of the workday, followed by lunch, and the rest of the workday (after offsetting the time of day for different timezones).

Also note how the system sits idle, under 40% utilization, most of the time (running some background tasks).

What this means is that, for most data centers, thousands of servers are paid for in full but underutilized, burning through gigawatts of electricity.


Sunrise in the clouds

Prelude: The auto-scaling cloud era

For infrastructure providers to lower overall cost, and pass these savings on to their customers as a competitive advantage, new ideas came along on how to safely maximize the usage of idle resources.

First came auto-scaling, where VMs are added only when needed, allowing cloud consumers to allocate only the bare minimum server resources required to handle the idle workload, and buffer request surges while auto-scaling kicks in (there is a time lag).

This significantly cut down on idle resource wastage.
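As a rough sketch of the target-tracking math behind such auto-scaling (this mirrors the well-known Kubernetes HPA formula rather than GCP's exact implementation; the 60% target and fleet limits below are assumptions for illustration):

```python
import math

def desired_instances(current_instances: int, avg_cpu_util: float,
                      target_util: float = 0.6,
                      min_instances: int = 1, max_instances: int = 10) -> int:
    """Target-tracking scaling rule: resize the fleet so that average
    CPU utilization lands back near the target (assumed 60% here)."""
    desired = math.ceil(current_instances * (avg_cpu_util / target_util))
    return max(min_instances, min(max_instances, desired))

# 3 instances running hot at 90% CPU -> scale out to 5
print(desired_instances(3, 0.90))  # 5
# 4 instances idling at 10% CPU -> scale in to 1
print(desired_instances(4, 0.10))  # 1
```

The "time lag" mentioned above is why the minimum fleet still has to absorb the first seconds of any surge: the formula only reacts after utilization has already climbed.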

This change, along with changes in the billing model, started the "cloud revolution".

And it was only the beginning...


Pikachu Dancing - every day I'm shufflin

Prelude: Shuffling short-lived servers

New ideas were subsequently introduced for a new type of server workload, one that can be easily shuffled around to maximize CPU utilization and performance for all users.

The essence of it is simple: slowly pack as many applications as possible onto a physical server until it approaches its limits (say, 80%). Once it crosses the limit, pop an application out and redistribute it to another server - either by immediately terminating it, or by waiting for its execution to complete and redirecting its workload.
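The packing half of that heuristic can be sketched as a simple first-fit rule (a toy illustration, not Google's actual scheduler; the 80% cap and percentage-point loads are assumptions):

```python
def place(vm_load: int, hosts: list[int], capacity: int = 80) -> int:
    """First-fit placement: put the VM on the first host whose
    utilization (in percentage points of host capacity) would stay
    under the cap; otherwise spill over to a fresh host.
    Returns the index of the host the VM landed on."""
    for i, used in enumerate(hosts):
        if used + vm_load <= capacity:
            hosts[i] += vm_load
            return i
    hosts.append(vm_load)  # "pop out" and redistribute to a new host
    return len(hosts) - 1

hosts: list[int] = []
for load in [30, 40, 20, 50]:
    place(load, hosts)
print(hosts)  # [70, 70] -- two hosts, each under the 80% cap
```

The hard part in production is the "pop out" step for stateful workloads, which is exactly the gap that short-lived tasks, preemptible VMs, and (now) live migration each try to close in different ways.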

This is achieved either through extremely short-lived tasks (e.g. Functions as a Service, serverless), or VMs that are designed to be terminated when needed (e.g. preemptible / spot instances).

The downside of both approaches, however, is that the application software may need to be redesigned to support such workload patterns.


Broken Glass, metaphor to broken rules

How E2 changes the rules of cloud shuffling

To quote the E2 announcement page ...

After VMs are placed on a host, we continuously monitor VM performance and wait times so that if the resource demands of the VMs increase, we can use live migration to transparently shift E2 load to other hosts in the data center.

Without getting too deep into the pretty crazy and awesome engineering that goes into writing a new hypervisor: in a nutshell, E2 enables Google to abstract a live running VM from its CPU hardware, performing migrations between physical hosts with near-zero downtime (an assumption on my part).

Your VM can be running on physical server A in the first hour, and on physical server B the next, as resources get used up.

So instead of shuffling around specially designed VM workloads, Google can shuffle around any generic VM workload, allowing much better utilization of resources at lower cost, without reducing the user experience ... or at least it should in theory ...

Onto the benchmarks!

Side note: while live migration of VMs is not new tech, this marks the first time it is integrated into a cloud offering to lower the cost of the product. Also, for those who have used it before, the list of issues is endless - which Google has presumably resolved for their custom E2 instances.


Benchmarking: Show me the numbers!

Full details on the raw benchmark numbers and the steps involved can be found at the GitHub link.

The following benchmarks were performed in the us-central1-f zone, using N1/N2/E2-standard-4 instances, with N1-standard-4 serving as the baseline for comparison.

Covering the following:

  • Sysbench CPU
  • Sysbench Memory
  • Sysbench Mutex
  • Nginx + Wrk
  • Redis-Benchmark

Sysbench CPU

CPU benchmarking summary

While it's no surprise that a new hypervisor designed to share CPUs across multiple workloads would be "slower", a 69% reduction might be too much for most people to stomach.

However, that is not the only detail to keep track of.

E2 cpu run result

E2 also sees the largest variance in request statistics across the min/avg/95th percentile. This is in contrast to the NX series benchmarks (below), where these three numbers are mostly the same.

N1 cpu run result
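For readers reproducing the runs: the min/avg/95th-percentile figures above are just order statistics over per-request latencies. A minimal sketch of how to compute them from raw samples (the sample values below are hypothetical, chosen only to contrast a steady run against a jittery one):

```python
import statistics

def latency_summary(samples_ms: list[float]) -> dict[str, float]:
    """Summarize per-request latencies as min / average / 95th
    percentile, the same trio sysbench reports."""
    ordered = sorted(samples_ms)
    p95_index = max(0, int(len(ordered) * 0.95) - 1)
    return {
        "min": ordered[0],
        "avg": statistics.mean(ordered),
        "p95": ordered[p95_index],
    }

# Hypothetical samples: a steady NX-like run vs a jittery E2-like run
steady = [1.0] * 95 + [1.1] * 5
jittery = [1.0] * 60 + [2.0] * 30 + [5.0] * 10
print(latency_summary(steady))
print(latency_summary(jittery))
```

When min, avg, and p95 sit close together (the steady case), throughput is predictable; a p95 several multiples of the average is exactly the tail-latency symptom the E2 numbers showed.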

Sysbench Memory / Mutex

Memory benchmark summary

On the plus side, it seems like E2 instances, with their newer generation of memory hardware and clock speeds, blow the N1/N2 instances out of the water, by very surprisingly large margins.

So it's +1/-1 for now.

Workload benchmark: Nginx + Wrk, Redis-Benchmark

Workload Benchmark Summary

Unfortunately, despite the much better memory performance, the penalty in CPU performance results in an approximately 50% reduction in workload performance, even for memory-based workloads.


Lies, damned lies and benchmarks

These numbers are just indicators for comparison between equivalent GCP instance types. As your application workload may be a unique snowflake, there will probably be differences, which you may want to benchmark on your own.

Also this is meant for GCP to GCP comparison, and not GCP to other cloud provider comparison.

Note that I was unable to induce a live migration event and benchmark performance under such load. Until we can find a way to get data on this, let's just presume the downtime is in milliseconds? Maybe? (Not that it changes my review.)


Pricing Review: is it worth it ??

sorry but no

In case that was not clear: NO

If it were a poorer performer at a lower price, E2 would make for a compelling offer. This, however, is the confusing thing about the E2 launch.

GCP tweet on E2

While the marketing materials say "up to 30% savings", the reality is much more complicated.

Instance price comparison

Or would I even dare say misleading?

You see, N1/N2 instances receive a sustained use discount that scales from 0 to 30% when the instance runs continuously for a month, while E2 instances get no sustained use discount (it's "built in" to the list price).

So under sustained 24/7 usage, not only is the cost marginally higher, it comes with a much worse performance profile.
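To make that arithmetic concrete, here is a back-of-the-envelope comparison. The hourly figures are approximate us-central1 list prices for the standard-4 shapes at the time of writing - treat them as assumptions, not quoted prices:

```python
# Illustrative list prices (USD/hour, us-central1, approximate at the
# time of writing -- assumptions for this sketch, not quoted prices).
N1_STANDARD_4 = 0.1900
E2_STANDARD_4 = 0.1340
HOURS_PER_MONTH = 730

def monthly_cost(hourly: float, sustained_discount: float = 0.0) -> float:
    """Monthly bill for a VM running 24/7, after any sustained use discount."""
    return hourly * HOURS_PER_MONTH * (1 - sustained_discount)

# N1 running the full month earns the maximum 30% sustained use discount;
# E2 earns none, since the discount is "built in" to its list price.
n1_full_month = monthly_cost(N1_STANDARD_4, sustained_discount=0.30)
e2_full_month = monthly_cost(E2_STANDARD_4)

print(f"N1 24/7 with 30% SUD: ${n1_full_month:.2f}")  # ~$97.09
print(f"E2 24/7 (no SUD):     ${e2_full_month:.2f}")  # ~$97.82
```

With the full sustained use discount applied, N1 lands at essentially the same monthly price as E2 - while delivering roughly double the benchmarked CPU throughput.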

And unfortunately, it is this pricing structure that makes E2 a really cool tech with little to no use case, especially considering that overall VM capacity is expected to suffer an approximately 50% performance penalty.

If it's not clear, I made a table to elaborate.

E2 Usage Guidelines

So unless you have read through this whole review and tested your application's performance on it, stick to the NX series of instances.


Pug looking sad

Overall Conclusion: Wasted opportunity

Internally, for Uilicious' "pokemon collection of cloud providers", we currently see no use for E2, and will be sticking with N1 instances as our main GCP servers.

Despite all that, I really look forward to the next iteration of E2, because as improvements are made to the hypervisor, and if Moore's law holds true, it would take only about 2 more years before it outright replaces the N1 series as the "better choice".

More importantly, this technology opens up a new possibility: a future instance type (E3?) where one pays for raw CPU/RAM usage directly, for any VM workload, making the previous optimizations (preemptible, serverless) potentially obsolete.

This would give even legacy application developers a "have your cake and eat it too" moment, where they can take any existing workload and, with no additional application changes, get the benefits of "preemptible" instances.

If Google Cloud has not realized it yet: done correctly, they could make huge wins in enterprise sales, which they desperately need (a.k.a. the people still running 20+ year old software).

Till then, I will wait for GCP or another cloud provider to make such a change.

~ Happy Testing 🖖🚀


About Uilicious

Uilicious is a simple and robust solution for automating UI testing for web applications. Writing test scripts to validate your web platforms can be as easy as the script below.

Which will run tests like these ...

uilicious demo
Catfeeding: Uilicious testing inboxkittens XD

👊 Test your own web app, with a free trial on us today


Skeptical Baby

(personal opinion rant) On GCP lacking the courage to leverage on E2 tech 😡

The whole purpose of E2 is to create a new dynamically migratable workload.

So why is there even a preemptible option for E2, which makes no sense in almost any scenario when compared to the other preemptible options?

Also, isn't the very point of the E2 series to help long-running, low-CPU-usage workloads? Why does the pricing structure not favor them?

These are just the tip of the iceberg of Google's confusing launch messaging.

If GCP removed the preemptible discount option and made this new lower-performing line of VMs 30% cheaper - so it sat nicely between preemptible N1 workloads and higher-performing sustained N2 workloads - it would have become a serious consideration and contender. Without doing so, however, it's just cool tech desperately trying to find a use case.

Sadly, and frankly, the only reasons I can see for GCP being reluctant to cut prices for a new instance type are that they either:

  • legitimately fear a massive price war with Amazon (which is famously willing to out-bleed its competitors),
  • fear the new lower-priced product will eat into revenue from existing customers,
  • or worse, they just didn't think of it.

Considering that this is the same company that made serverless tech in 2008, way ahead of any of their competitors, and failed to capitalize on it, there is a good chance it's the last option (déjà vu?).

All of this is disappointing for me to see, considering the massive amount of R&D and engineering resources that went into making this happen. It's a very classic Google problem: really strong tech, disconnected from their business goals and users.

Finally, GCP, please stop abusing the phrase "TCO", or "total cost of ownership" - we sysadmins and infrastructure personnel tend to think in terms of months or years (you know, the server's entire potential lifespan and total cost of ownership) ... Some of us actually find it insulting, or at least confusing, when the term is used to imply savings against existing long-running workloads, when you actually meant to compare extremely short-lived workloads instead.

We actually calculate server expenses, and such misleading marketing just leads to wasted time and effort evaluating our options (and writing an article about it in the process).

~ Peace
