
Unpacking OpenShift break-glass access

Have you ever been locked out of an OpenShift cluster?

I get locked out every time I build a cluster - deliberately 🙂. I'll explain:

When you first install an OpenShift cluster, the kube-apiserver, kube-controller-manager, kube-scheduler, kubelet, and many other internal components are issued certificates signed by the cluster's own PKI (via the Cluster Machine Approver and the Kubernetes CSR signing controller).

These certificates expire within the first 24 hours and are rotated automatically. But if the cluster nodes are not running when the certificates expire, the rotation won't occur, and you end up with a bunch of pending certificate signing requests on the cluster, like this:

$ oc get csr
NAME        AGE     SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-btq25   36h     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-fwntw   5m17s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-lshgr   5m22s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-ncxfl   35h     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-qkjls   36h     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-whmr5   5m22s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-wntm7   5m5s    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-ww55q   34s     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending

If this happens, several key components in OpenShift (like the web console and the OAuth server) are not available. You can't log in via the web console or with oc login.

I mentioned earlier that I do this deliberately; when I create an OpenShift cluster in the public cloud I don't want to leave it running for 24 hours just to perform a certificate rotation. I want to provision the cluster, then shut down the nodes to save costs, and start it again when I need it.

This means that practically every time I deploy a cluster to the public cloud, the first certificate rotation has not occurred and the OAuth server is not available. I end up in a situation where I need to log in to the cluster using a "break glass" mechanism, approve the certificate signing requests, and bring the cluster services back up.
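As a preview of where we'll end up: once you do have API access via a break-glass method (more on those below), clearing the backlog of pending CSRs is a one-liner. Something like this works - it simply approves every CSR that doesn't yet have a status set:

$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve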

Break glass / emergency access is incredibly important to support the "CIA triad" - a model for information security that refers to confidentiality, integrity and availability. If my OpenShift cluster is broken, and there is no emergency access, I can't remediate it and ensure service availability.

Break glass access is also explicitly called out as a requirement in the Australian Information Security Manual, published by the Australian Signals Directorate. I've included some excerpts here:

[Screenshots: ISM control excerpts covering break glass accounts]

In this article I want to take a look at some of the ways I've seen break glass access configured, and some of the "gotchas". Let's take a look!

Understanding OpenShift authentication

Before diving into "break glass" with OpenShift, I think it's important to understand how OpenShift authentication works.

Ok, before we do that, we need to look at OAuth. OAuth is an industry standard for authorisation. Essentially, OAuth is a way for users to grant websites or applications access to their information without giving away their passwords.

Let's say I build an app that draws moustaches on people in photos. I want to give you, the user, the ability to do this with your Google photos. I could use OAuth to provide delegated access to the app to some of your Google photos - and you could allow the app to authenticate with Google Photos, without giving it your Google password.

Here it is in a diagram:

+----------------+                                   +--------------+
|   Service A    |                                   |   Service B  |
| (the moustache |                                   |  ( Google    |
|       app)     |                                   |    Photos )  |
+----------------+                                   +--------------+
        |                                               ^
        |                                               |
        | 1. Redirect user to login + consent ----------|
        |                                               |
        v                                               |
+------------------+                                    |
|   User + Browser |                                    |
+------------------+                                    |
        |                                               |
        | 2. User logs in & approves access             |
        |---------------------------------------------->|
        |                                               |
        |   3. Auth Code                                |
        |<----------------------------------------------|
        |                                               |
        | 4. Auth Code -> exchanged for Token           |
        |---------------------------------------------->|
        |                                               |
        |   5. Access Token                             |
        |<----------------------------------------------|
        |                                               |
        | 6. Use Access Token to call APIs              |
        |---------------------------------------------->|
        |                                               |
        |        Protected resources returned           |
        |<----------------------------------------------|

So what does this have to do with OpenShift? OpenShift includes a built-in OAuth server. You can think of the "moustache app" and "Google Photos" as separate components within OpenShift - the OpenShift console is like the moustache app, and the OpenShift API server is like Google Photos, providing access to resources.

When you use oc login to access an OpenShift cluster, or login to the OpenShift web console, you're not providing your password directly to the platform. Instead, you're redirected under the hood to the OpenShift OAuth server. The OAuth server authenticates you using whatever identity provider OpenShift is configured with (LDAP, GitHub, Google, htpasswd, etc.), and once you're authenticated, you get an OAuth access token to interact with APIs (via the console / oc).
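You can see this for yourself after a normal login: ask oc for the token it's using, and you'll get an OAuth access token rather than anything derived from your password. The server URL and username below are placeholders:

$ oc login https://api.<cluster-domain>:6443 -u <username>
$ oc whoami -t
sha256~...

That token can also be passed straight back with oc login --token=..., which is handy for scripting - but it's a credential in its own right and should be treated like one.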

BUT - there is an alternative way of authenticating with OpenShift, which bypasses the OAuth server. You can authenticate directly with the OpenShift API server using X.509 certificates!

This is the same mechanism that the kubelet uses to authenticate with the OpenShift / Kubernetes API, inside the cluster. When you log in as a user you're redirected to the OAuth server and provided an access token for the API. But the kubelet has X.509 certificates available on the node, and uses those to authenticate with the API directly.

If you've installed an OpenShift cluster recently, you'll have noticed a file auth/kubeconfig created in the directory where you ran openshift-install. This file contains X.509 certificates that can be used to authenticate to the cluster as the system:admin user, which has cluster-admin privileges.
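If you're curious what that identity actually looks like, you can pull the client certificate out of the kubeconfig and inspect it. A quick sketch - the [0] index assumes the installer's default single-user kubeconfig, so check yours with oc config view first:

$ oc --kubeconfig auth/kubeconfig config view --raw -o jsonpath='{.users[0].user.client-certificate-data}' | base64 -d | openssl x509 -noout -subject -enddate

On the clusters I've checked, the subject comes back as CN = system:admin with O = system:masters - and system:masters is treated by the API server as a superuser group, independent of any RBAC bindings.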

There is now a third mechanism to authenticate, which is new in OpenShift 4.19, and that's using an external OpenID Connect provider to directly authenticate with OpenShift. This bypasses the built-in OAuth server and uses the external identity provider directly. This can be really useful, because it means you are not limited by the capabilities of the built-in OAuth server, but can leverage the advanced capabilities of external OpenID Connect providers (Keycloak, Microsoft Entra, etc).

Direct authentication with an external OpenID Connect provider is currently a technology preview in OpenShift 4.19, and you can read more about it here.

Break-glass with htpasswd

Ok, let's see one of the ways I've seen break-glass access configured, using the htpasswd identity provider supported by the OpenShift OAuth server. This mechanism provides access to the OpenShift console and to the API via oc.

First, create an htpasswd file containing a user temp-admin with the password 1800redhat:

htpasswd -c -B -b htpasswd temp-admin 1800redhat

At this point you can either create a secret and configure the identity provider as code, or create the identity provider via the OpenShift console:

[Screenshots: adding an htpasswd identity provider via the OpenShift console]
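If you go the "as code" route, the shape of it is roughly this - the secret name and identity provider name here are my own choices, so adjust to suit:

$ oc create secret generic break-glass-htpasswd --from-file=htpasswd=htpasswd -n openshift-config

apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: break-glass
    mappingMethod: claim
    type: HTPasswd
    htpasswd:
      fileData:
        name: break-glass-htpasswd

Whichever route you take, the result is the same once the configuration lands.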

Once updated, you'll see the authentication cluster operator roll out the new configuration:

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.19.11   True        True          False      3d6h    OAuthServerDeploymentProgressing: deployment/oauth-openshift.openshift-authentication: 1/3 pods have been updated to the latest generation and 2/3 pods are available
baremetal                                  4.19.11   True        False         False      3d6h
cloud-controller-manager                   4.19.11   True        False         False      3d6h
cloud-credential                           4.19.11   True        False         False      3d6h
cluster-autoscaler                         4.19.11   True        False         False      3d6h
config-operator                            4.19.11   True        False         False      3d6h

And once complete, the new option to login is shown in the OpenShift console:

[Screenshot: the new htpasswd login option on the OpenShift console login page]

Before we test logging in with this break-glass mechanism, we should grant our temp-admin user a cluster role, so that they have the right access for emergency situations:

$ oc adm policy add-cluster-role-to-user cluster-admin temp-admin
Warning: User 'temp-admin' not found
clusterrole.rbac.authorization.k8s.io/cluster-admin added: "temp-admin"

The warning message is shown because our user hasn't yet logged in for the first time, and been created on the cluster. This role binding will work fine though.
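If you prefer to manage this declaratively, the equivalent ClusterRoleBinding looks something like this (the binding name is arbitrary):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: break-glass-temp-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: temp-admin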

Let's test it out. Try logging in using the htpasswd-based "break glass" identity provider, and see that the user has access.

[Screenshot: logging in to the OpenShift console as temp-admin via the htpasswd identity provider]

Great! Now we have a mechanism for "break glass" access to the OpenShift console.

There is one huge issue with this emergency access mechanism though: it relies on the OpenShift OAuth server being available. Let's replicate a situation where the OAuth server is not available, like my example at the start of this article where the certificate rotation has not been performed.

$ oc get deploy -n openshift-oauth-apiserver
NAME        READY   UP-TO-DATE   AVAILABLE   AGE
apiserver   3/3     3            3           3d7h

$ oc scale deploy/apiserver -n openshift-oauth-apiserver --replicas 0
deployment.apps/apiserver scaled

$ oc get pods -n openshift-oauth-apiserver
NAME                        READY   STATUS        RESTARTS   AGE
apiserver-ddb5559d9-6jk2l   0/1     Terminating   2          3d6h
apiserver-ddb5559d9-7rl4g   0/1     Pending       0          20s
apiserver-ddb5559d9-jnggg   0/1     Terminating   2          3d6h
apiserver-ddb5559d9-p4tql   0/1     Pending       0          20s
apiserver-ddb5559d9-r5gm9   0/1     Terminating   2          3d6h
apiserver-ddb5559d9-s2fth   0/1     Pending       0          20s

We have to be pretty quick here, as the operator immediately starts scaling the pods back up. But if you catch it in time and try to log in via htpasswd, you'll see the following:

{"error":"server_error","error_description":"The authorization server encountered an unexpected condition that prevented it from fulfilling the request.","state":"025bb0fbf40dd280d8accbc359ff98c8"}

Hmm - well that's not great. Our "emergency" access method should work in emergency situations, and that's clearly not the case here.

I mentioned earlier in this article that direct authentication with an external OpenID Connect provider is now in technology preview. For "break glass" purposes I'd put it in the same category as the built-in OAuth server - it's still reliant on an external IdP, and if that IdP is unavailable, using it for "emergency" access is fundamentally flawed.

Let's look at another method for break-glass access that is natively available with OpenShift.

Break-glass with X.509 certificates

Clearly there are some challenges with htpasswd-based "break glass" access - or any mechanism that relies on the built-in OAuth server or an external OpenID Connect provider. If the OpenShift OAuth server or the external OIDC provider is not available, our "break glass" mechanism is not available either - which is not good for an 'emergency' access mechanism.

Another option is X.509-based break-glass access. This is natively available inside OpenShift - in fact, it's also how the kubelet interacts with the OpenShift / Kubernetes API! It's also very simple to configure, because everything is done for you at install time.

If you remember, at the start of this article I said that I deliberately get "locked out" of OpenShift when I provision a cluster, because I shut down the nodes and the certificate rotation does not occur. This X.509 mechanism is how I can authenticate and approve all the pending certificate signing requests (which requires access to the API).

Let's take a closer look at X.509 authentication to the OpenShift API. When you create an OpenShift cluster using openshift-install, you will see a folder created that looks like this:

drwxr-x--- 2 auth
-rw-r----- 1 metadata.json
-rw-r----- 1 terraform.platform.auto.tfvars.json
-rw-r----- 1 terraform.tfvars.json
drwxr-x--- 2 tls
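The auth directory is the interesting one for break-glass purposes. On my installs it contains two files - a kubeconfig holding X.509 client certificates, and the initial kubeadmin password:

$ ls auth/
kubeconfig  kubeadmin-password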

To authenticate using X.509 certificates you can simply run export KUBECONFIG=auth/kubeconfig, and select the correct context via oc config:

$ export KUBECONFIG=auth/kubeconfig
$ oc config get-contexts
CURRENT   NAME                                                           CLUSTER                                    AUTHINFO                                              NAMESPACE
          admin                                                          cluster1                                   admin                                       
          default/api-cluster1-sandbox285-opentlc-com:6443/kube:admin    api-cluster1-sandbox285-opentlc-com:6443   kube:admin/api-cluster1-sandbox285-opentlc-com:6443   default
*         policies/api-cluster1-sandbox285-opentlc-com:6443/kube:admin   api-cluster1-sandbox285-opentlc-com:6443   kube:admin/api-cluster1-sandbox285-opentlc-com:6443   policies

In this example I have already logged in as the kube:admin user, and need to select the admin context:

$ oc config use-context admin
Switched to context "admin".
$ oc whoami
system:admin

Great! I've logged in as the system:admin user that was created during installation, using X.509 certificates to authenticate directly with the API and bypass the OpenShift OAuth server. Now for the big test - let's kill off the OAuth server pods, and see if this auth mechanism still works.

$ oc scale deploy/apiserver -n openshift-oauth-apiserver --replicas 0
deployment.apps/apiserver scaled

$ oc get pods -n openshift-oauth-apiserver
NAME                        READY   STATUS        RESTARTS   AGE
apiserver-ddb5559d9-4hh2h   0/1     Pending       0          7s
apiserver-ddb5559d9-6xkf6   0/1     Pending       0          7s
apiserver-ddb5559d9-7rl4g   1/1     Terminating   0          13m
apiserver-ddb5559d9-lzjcw   0/1     Pending       0          7s
apiserver-ddb5559d9-p4tql   1/1     Terminating   0          13m
apiserver-ddb5559d9-s2fth   1/1     Terminating   0          13m

Does it still work, even when the OAuth server pods are not available?

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.19.11   False       False         False      14s     APIServerDeploymentAvailable: no apiserver.openshift-oauth-apiserver pods available on any node....
baremetal                                  4.19.11   True        False         False      3d7h
cloud-controller-manager                   4.19.11   True        False         False      3d7h
cloud-credential                           4.19.11   True        False         False      3d7h
cluster-autoscaler                         4.19.11   True        False         False      3d7h
config-operator                            4.19.11   True        False         False      3d7h

Great! Even though the OAuth server is unavailable, I can still access the API.

I think this is far better than using htpasswd - it is easy to configure, easy to use, and works when the OAuth server is unavailable.

PS. If you want to look at some other ways you can use X.509 certificates for user access in OpenShift, take a look at my article here.

Wrapping up

This was a pretty brief intro to break-glass / emergency access in OpenShift. I looked at how authentication works in OpenShift, the limitations of configuring emergency access via htpasswd and the built-in OpenShift OAuth server, and a better mechanism for emergency access using X.509 certificates that bypasses the OAuth server.

There's still a few outstanding issues though. Many organisations require emergency credentials to be rotated after use, and access to these credentials needs to be monitored and audited. In a future article I'll take a closer look at governance around emergency credentials usage, and how you can bring this into an OpenShift model for "break-glass" access.

Thanks for reading!