The ultimate guide to running Google Kubernetes Engine
A highly opinionated guide to building GKE clusters
I’ve been running Google Cloud Kubernetes clusters in production for some time now, and over the course of the last year I’ve moved nearly every application that can be containerised over to them.
I do this for many reasons: the cost savings, scalability, added security benefits it can provide — but mostly because I’m a Soylent drinking Techbro, and Kubernetes is what the cool kids are doing.
To this end, then, I thought I’d share my ultimate guide to building and running a production-ready GKE cluster**
** This guide is highly opinionated, and some of the advice might not work for your environment. Always consult your threat model first.
P.S. Just skip to the bottom if you’re boring and only want the commands.

Getting a production-ready cluster doesn’t take too long in terms of actual provisioning time; once you’ve navigated all the options and flags, it takes but a moment for GKE to get its act together and build it. That said, the first thing you need to do isn’t even GKE related:
Getting the network right
A lot of people will skip over this bit as it’s “not really relevant”, but I implore you to quickly check these few items before carrying on:
Not on a legacy VPC (you’d know if you are)
Not using the default network
You’ve got a Cloud NAT and Router configured
Not using the default network
GCP is very kind in that it will provision you a healthy and working “default VPC” which you can just get started with. Don’t use it. It carves out large amounts of internal address space for regions you’ll almost never work in. Make your own network and carve out the address space manually (like the good old days).
This is also an easy win should you ever CIS benchmark your environment, as CIS also says this is bad.
If you are following my advice on this, I recommend at least a /16 for the region you’re building in, and to enable “private Google access” (because why wouldn’t you?).
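If you’re doing this from scratch, something like the following does the trick (note the variables; the 10.0.0.0/16 range is just an example, carve out whatever fits your own plan):
gcloud compute networks create "${NETWORK}" --subnet-mode custom

gcloud compute networks subnets create "${SUBNETWORK}" \
--network "${NETWORK}" \
--region "${REGION}" \
--range "10.0.0.0/16" \
--enable-private-ip-google-access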
Configuring a Cloud NAT
The cluster we’re about to create is a “private cluster”, meaning none of the nodes will have external interfaces. This is really good for security, and incidentally, your wallet too, since every public IP address you use is a line item on your invoice.
“Why do they need public IP addresses anyway Dan? Didn’t you say to enable private Google Access?”
Good question. Indeed I did suggest turning on private Google access, meaning all of the GCP APIs you’re interested in (the good ones, that is) are available without needing an external interface — apart from when you want to reach your own container registry.
Well-cultured developer that I am, I use GitLab CI/CD to build and test my images, and to store them — meaning I need to be able to egress to GitLab.com.
Luckily there is nothing new about NAT, and Google’s managed Cloud NAT offering works a charm. It’s also free to run (other than the associated network costs), which is always nice.
Creating a Cloud Router (note the variables)
gcloud compute routers create nat-router \
--network "projects/${PROJECT_ID}/global/networks/${NETWORK}" \
--asn 4201337420 \
--region "${REGION}"
Creating a Cloud NAT (note the variables)
gcloud compute routers nats create k8s-nat \
--router=nat-router \
--auto-allocate-nat-external-ips \
--nat-all-subnet-ip-ranges \
--region "${REGION}" \
--enable-logging
Right. Now that that’s all sorted, onto the actually interesting bits.
Provisioning a cluster
The basic building blocks for any cluster are:
VPC native
Regional
Secure by default
VPC-native clusters are more secure, and provide a nicer integration with the rest of Google Cloud. Routes-based clusters will likely get the Google deprecation treatment sooner or later anyway, so why bother.
Regional clusters are the only form of GKE that’s truly highly available. It basically means that the masters are spread across 3 zones (in said region) and the node pools you provision can be spread across them, leaving you tolerant of a zone failure. This is also the only form of GKE cluster where the masters stay highly available during an update, which is always nice.
Secure by default is more of a guiding principle to keep in mind as we go, but we’re already off to a good start by using a private, VPC native cluster.
We can keep the trend going as well by selecting a release channel, rather than a static version. With this choice, Google will automatically update the masters and node pools of your cluster to keep them up to date — just because we’re doing Kubernetes doesn’t mean there isn’t good hygiene to be done.
Which channel you choose is subjective to your operating model. Personally I like the rapid channel because I’m impatient; however, if you need something a little more “stable” then by all means choose the boring one. The only important thing to keep in mind is to make sure you’re on at least v1.16.13-gke.401.
Version 1.16 is where a lot of the fun features really start to come in, so aim for at least that.
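If you want to see what each channel is currently serving before committing, gcloud will happily tell you:
gcloud container get-server-config --region "${REGION}" --format "yaml(channels)"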
Do it for the networking
I’ll confess I’m not a networking aficionado, and how exactly “IP aliasing” works is beyond me. That said, it still needs configuring.

Since we’ve elected to use a private cluster, we need a small IP range for the masters to sit in. A /28 is all it takes, so any RFC 1918 range will do. That said — if you’re out of RFC 1918 ranges, I’ve found no issues using public ones either (just accept the warning that it’s in beta).
Ensure you allow access to the masters using its external IP address (you’re not getting much CI/CD done without it), and enable master global access along with it.
The IP ranges for the pods and cluster services also need setting. Well, technically they don’t — Google will do it for you — but it’s better to carve out some IP address space you know won’t clash later on. You’re going to want to reserve at least a nice /16 for each. The bigger the range, the more pods you can deploy, but a /16 should be plenty for you (especially if you’re reading this article).
Remember: these ranges, like any others in your VPCs, cannot overlap with any other networking (including peered networks you might want to connect to some time in the future), so pick something sensible. Personally I start the pod range at 172.17.0.0/16 and then just use the next /16 up from there for the services range. Repeat as needed for more clusters.
Master Networks
This is where things get very subjective, and is a good time for you to check your threat model. The Kubernetes API is a very powerful endpoint, it literally runs the cluster, and leaving it exposed to the internet for potential abuse might not be the best of ideas if you’re running a bank.
I don’t run a bank. I don’t use VPNs. I trust Public Key Encryption.
For my CI/CD pipeline to work, I need to be able to access the master API endpoint from anywhere in GitLab’s runner range. I also like to run kubectl from my laptop (without the need for tedious VPNs) for manual tasks and debugging. I don’t need master authorized networks, and so I don’t turn them on.
Google Cloud’s OAuth flows for generating credentials to the cluster endpoint mean that we never need to hardcode the endpoint’s credentials anywhere — and we trust our developers with the power of kubectl. This is within our threat model. It might not be within yours.
Your threat model is different to mine. Please consult and make relevant decisions accordingly.
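For what it’s worth, the day-to-day flow on this model is just the standard OAuth dance, with no static credentials in sight:
gcloud container clusters get-credentials "${CLUSTER_NAME}" --region "${REGION}"
kubectl get nodes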

My First Node Pool
Obviously no Kubernetes cluster is complete without its node pools, so we need to put the first one on the cluster. Sadly the first node pool needs to be kinda boring; we can’t add taints to the nodes, since there are workloads that need to run that will never tolerate them. For this, then, I like to use basic-bitch nodes of the E2 series variety.
You want to make these nodes just about big enough to run workloads that can’t accept taints, plus whatever other daemonsets they need, but not so big that there’s a bunch of wasted compute. Personally I go in for 1 node per zone of e2-standard-2, which covers everything I need, plus a little overhead.
The real magic sauce here is that these nodes should be preemptible. This is where the real power of GKE comes into its own. Preemptible nodes get a silly discount rate (somewhere in the 70–80% range), but Kubernetes will always make sure that the workloads you want to run are available on whatever nodes it has on hand, meaning a node getting chopped out after 24 hours means nothing to you. GKE does all of this chopping and changing for you, and you won’t even feel a thing.
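If you want some extra reassurance that the churn won’t bite, run more than one replica and give your Deployments a PodDisruptionBudget so the scheduler keeps a floor under them during upgrades and repairs. A minimal sketch (the webapp name and label are placeholders; clusters before 1.21 want apiVersion policy/v1beta1):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: webapp
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: webapp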

Following our earlier principles of secure by default, make sure you also turn on “secure boot” and “integrity monitoring” here — because you really have no excuse not to.
The final bit that needs mentioning here is to add some network tags to the nodes you make. This can be handy for grouping and searching, but more than anything, it allows you to set L4 firewall rules on the traffic (the way Google’s network mesh enforces these means they’ll always be applied regardless of where the traffic comes from).
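As a flavour of the kind of rule this enables: say you want Google’s load balancer health checks (130.211.0.0/22 and 35.191.0.0/16) to reach the default NodePort range on your tagged nodes. Something like this does it (the rule name and port range are just an example, adjust to whatever you actually expose):
gcloud compute firewall-rules create allow-lb-health-checks \
--network "${NETWORK}" \
--direction INGRESS \
--action ALLOW \
--rules tcp:30000-32767 \
--source-ranges "130.211.0.0/22,35.191.0.0/16" \
--target-tags "${CLUSTER_NAME}"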
The True Kubernetes Experience™
Next up is to look at enabling relevant features that give us the true Kubernetes experience. Google made Kubernetes, and as such it’s only fitting that Google Cloud is the nicest place to use it (my slight masochism streak forced me to try the others).
The boring one to start with is to enable the “Compute Engine Persistent Disk CSI Driver”. The CSI API is the future direction of stateful applications and storage in Kubernetes; just enable it and move on.
If you’re a bit of a nutcase, you can also enable Google’s service mesh, Istio, here. Frankly, if you’re really in need of a service mesh, just go get Linkerd (he says, as he runs Consul because he has an account with HashiCorp). It’s for this reason that I say to leave “Enable Network Policy” disabled, since that’s what my service mesh is for. If you don’t know what you’re doing, just leave it turned off; firewalls can be enough of a pain as it is without some other random project having a say on the matter.
For those SREs who, like me, don’t have time to be manually scaling applications, enable Vertical Pod Autoscaling to make sure no pod ever feels unloved and out of resources.
Make sure you check Enable HTTP Load Balancing, as this allows you to do fun things with Ingress and HTTP(S) Load Balancers, which quite frankly is black magic (really cool black magic though).
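To give you a taste of that black magic: a bare-bones Ingress like this gets you a fully managed global HTTP(S) load balancer (the webapp Service is a placeholder and needs to be something the load balancer can reach, i.e. NodePort or NEG-backed; clusters before 1.19 use the networking.k8s.io/v1beta1 API instead):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webapp
spec:
  defaultBackend:
    service:
      name: webapp
      port:
        number: 80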
Enable Local DNS Caching if you expect DNS to become a bottleneck (it probably will at a certain critical mass of microservices).
“Enable Intranode Visibility” is fun if you’re relying on GCP’s network fabric for monitoring. I’m not — I use the Elastic stack for all my observability needs. Enable it if you feel like it.
“Shielded GKE Nodes” will be enabled by default in all versions from 1.18+. Turn it on; it adds a good deal of security at literally no cost to you.
Workload Identity
Finally, but absolutely not least, make sure you enable Workload Identity. As far as I’m concerned Google made a deal with a demon for this to work, but I am forever grateful that they did, as it solves one of the biggest security challenges you face when running workloads on Kubernetes: how do you authenticate applications to Google APIs?
Kubernetes workloads run in the context of a service account, and Google Cloud Platform has its own concept of service accounts, which you can grant permissions to access APIs in a sensible, least-privilege way. The two concepts were completely separate from each other — or at least they were until Workload Identity came along.
Workload Identity allows you to “bind” a service account in Kubernetes to a service account in the Google Cloud Project you’re building in, meaning you no longer need to manage static credential files (that you probably never rotate) across containers.
Instead, you can now bind the Kubernetes “webapp” service account to a “webapp” service account in Google Cloud, and grant it permissions to the relevant databases and backends that it needs in order to function.
Turn this on
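Once it’s on, the binding itself is a three-step dance, roughly like this, assuming a “webapp” Kubernetes service account in the default namespace (all the names are placeholders):
gcloud iam service-accounts create webapp

gcloud iam service-accounts add-iam-policy-binding "webapp@${PROJECT_ID}.iam.gserviceaccount.com" \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:${PROJECT_ID}.svc.id.goog[default/webapp]"

kubectl annotate serviceaccount webapp \
iam.gke.io/gcp-service-account="webapp@${PROJECT_ID}.iam.gserviceaccount.com"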
Here’s one I prepared earlier
So if you’ve made it through all of that, what you should land up with is something that looks a little like this (note the variable names):
gcloud beta container clusters create "${CLUSTER_NAME}" --region "${REGION}" \
--no-enable-basic-auth --release-channel "rapid" \
--metadata disable-legacy-endpoints=true --machine-type "e2-standard-2" --image-type "COS_CONTAINERD" \
--disk-type "pd-standard" --disk-size "45" --node-labels security.cto/sandboxed=false,cto/node-pool=default \
--node-taints cto/node-pool=default:PreferNoSchedule \
--scopes=gke-default \
--preemptible --num-nodes "1" --enable-stackdriver-kubernetes --enable-private-nodes \
--master-ipv4-cidr "172.16.0.32/28" --enable-master-global-access --enable-ip-alias \
--network "projects/${PROJECT_ID}/global/networks/${NETWORK}" \
--subnetwork "projects/${PROJECT_ID}/regions/${REGION}/subnetworks/${SUBNETWORK}" \
--cluster-ipv4-cidr "172.17.0.0/16" --services-ipv4-cidr "172.18.0.0/16" --default-max-pods-per-node "110" \
--no-enable-master-authorized-networks \
--addons HorizontalPodAutoscaling,HttpLoadBalancing,NodeLocalDNS,GcePersistentDiskCsiDriver \
--enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 \
--autoscaling-profile optimize-utilization --enable-vertical-pod-autoscaling \
--workload-pool "${PROJECT_ID}.svc.id.goog" \
--enable-shielded-nodes --shielded-secure-boot \
--tags "${CLUSTER_NAME}","control-plane"
Plus one quick fix
Private clusters currently have a slight bug in them — GKE doesn’t write in all the required firewall rules that allow the masters to connect to the nodes. This can be fixed trivially with one quick command:
gcloud compute firewall-rules create allow-gke-masters \
--action ALLOW \
--direction INGRESS \
--network ${NETWORK} \
--source-ranges ${MASTERS_SUBNET_RANGE} \
--rules tcp \
--target-tags ${CLUSTER_NAME}
If you followed the above to create the cluster, you’ll have added network tags to each node which are the same as the cluster name. Leverage these to allow traffic from the masters to the nodes, and problem solved (you’re welcome).
You get a node, and you get a node
So if you’ve made it through all that and you think “aha, I’m ready to run some workloads,” then sorry but you’re mistaken.
Recall that earlier we only made a node pool big enough to run the daemons that GKE requires. We’ve not actually created enough compute to run any meaningful workloads; we require a production node pool.
It’s worth also pointing out that there are some second order benefits to this approach. Keeping the default node pool nice and light-touch, and splitting the rest of the compute into other pools, means that you can chop and change entire node pools with relative safety to cluster health, since the necessary daemons will continue to run on that first pool (which you shouldn’t really need to ever touch now).
The real fun here is that you can start using taints on your nodes. Taints repel workloads, and unless your workload “tolerates” these taints, it won’t schedule. You can use these to make sure powerful compute goes to workloads that need it, and not just to backend cron operations that won’t see any benefit from all that extra cash spent.
There are even more security benefits to be reaped here though. Separate node pools allow us to protect the metadata server (which contains a lot of potentially sensitive data), and enable us to run sandboxed workloads.
Sandboxed workloads are containers that are run in… well, in a sandbox — meaning that untrusted code (looking at you, PaaS provider people) can be run safely in a shared environment. I enable sandboxes on all of my nodes, and have the workloads either run sandboxed (such as edge workloads like proxies) or tolerate the taint it applies to the node.
tolerations:
- effect: NoSchedule
  key: sandbox.gke.io/runtime
  operator: Equal
  value: gvisor
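And for the workloads you actually want sandboxed, it’s just a runtime class on the pod spec; GKE injects the matching node selector and toleration for you (the pod name and image here are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: webapp
spec:
  runtimeClassName: gvisor
  containers:
  - name: webapp
    image: registry.gitlab.com/example/webapp:latest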
Like before, you can make these node pools preemptible and you shouldn’t feel it. The only place I have seen them have an impact is when you’re running stateful applications, such as Elastic Cloud on Kubernetes. That said, distributed clusters such as Elasticsearch are fault tolerant, and a well designed cluster should be able to handle a node going down very gracefully (I know mine does).
Getting the node size right here is more of a science than an art. The bigger the node, the less overhead there is from daemonsets (such as monitoring agents) and the more value for money you get from the node. Larger nodes, however, mean that more pods are disrupted when the node is replaced after 24 hours. Strike a balance between how many pods you can afford to have disrupted at once and how tall you need the VMs to be for said pods.
gcloud beta container node-pools create "production" \
--cluster "${CLUSTER_NAME}" --region "${REGION}" --machine-type "n2-standard-32" \
--image-type "COS_CONTAINERD" --disk-type "pd-standard" --disk-size "50" \
--node-labels elastic.cto/node-type=standard,security.cto/sandboxed=true \
--metadata disable-legacy-endpoints=true \
--scopes=gke-default \
--preemptible --sandbox type=gvisor --num-nodes "1" --enable-autoupgrade --enable-autorepair \
--shielded-secure-boot --tags "${CLUSTER_NAME}" --node-version="${MASTER_VERSION}"
The number of nodes is the number of nodes per zone, meaning in this instance I’m getting 3 of these (1 * 3 quick maths)

So that’s it. That’s my probably-not-quite-ultimate, but reasonably comprehensive, guide to getting a production GKE cluster. If for some reason you copy and paste the commands and it doesn’t work, Google it (how do you think I got here?). If for some reason you copy and paste the commands and it does work, I do take payment in beer.
Feel free to follow me on Twitter if you’d like to complain about any of this advice, or to see me live stream my perpetual struggles with being an SRE.