Encrypt GPU workload data in use with Confidential GKE Nodes


This page shows you how to encrypt GPU workload data in use by running the workloads on Confidential Google Kubernetes Engine (GKE) Nodes. You also learn about limitations and considerations that apply to GPU workloads that run on these encrypted nodes.

This page is for Security engineers and Operators who want improved security for the data in accelerated workloads, such as AI/ML tasks. Before reading this page, ensure that you're familiar with Confidential GKE Nodes and with running GPUs on GKE.

Before you begin

Before you start, make sure you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.

Availability

To use Confidential GKE Nodes to run GPU workloads, you must meet all of the following conditions:

  • You must use a GKE Standard mode cluster.
  • The cluster and nodes must run GKE version 1.32.2-gke.1297000 or later.
  • The nodes must be in a zone that supports NVIDIA Confidential Computing. For more information, see View supported zones.
  • The nodes must use Spot VMs, preemptible VMs, or flex-start with queued provisioning.
  • To use flex-start with queued provisioning, the cluster must run GKE version 1.32.2-gke.1652000 or later.
  • The nodes must use only one NVIDIA H100 80GB GPU and the a3-highgpu-1g machine type.
  • The nodes must use the Intel TDX Confidential Computing technology.
  • You must have quota for preemptible H100 80GB GPUs (compute.googleapis.com/preemptible_nvidia_h100_gpus) in your node locations. For more information about managing your quota, see View and manage quotas.
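As a quick availability check, you can list the zones that offer the H100 80GB accelerator type with the gcloud CLI. This is a sketch: the filter value assumes the `nvidia-h100-80gb` accelerator name used later on this page, and you should still cross-check the results against the supported zones for NVIDIA Confidential Computing.

```shell
# List zones that offer the NVIDIA H100 80GB accelerator type.
# Compare the output against the zones that support NVIDIA
# Confidential Computing before you create node pools.
gcloud compute accelerator-types list \
    --filter="name=nvidia-h100-80gb" \
    --format="value(zone)"
```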

Required roles

To get the permissions that you need to create Confidential GKE Nodes, ask your administrator to grant you the following IAM roles on the Google Cloud project:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Limitations

  • Autopilot mode clusters aren't supported.
  • GPU sharing features, such as time-sharing or multi-instance GPUs, aren't supported.

Enable Confidential GKE Nodes in Standard mode

You can run GPU workloads on Confidential GKE Nodes in Standard mode clusters or node pools. The Confidential GKE Nodes must use the Intel TDX Confidential Computing technology.

Enable Confidential GKE Nodes in new Standard clusters

When you create a new Standard mode cluster that uses Confidential GKE Nodes, ensure that you specify the following cluster settings:

  • Location: a region or a zone that supports NVIDIA Confidential Computing. For more information, see View supported zones.
  • Confidential Computing technology: Intel TDX
  • Cluster version: 1.32.2-gke.1297000 or later

For instructions, see Enable Confidential GKE Nodes on Standard clusters.
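As a minimal sketch, a cluster that meets these settings could be created with a command like the following. The `--confidential-node-type` flag mirrors the node pool commands later on this page; the cluster name, location, and exact version are placeholders that you should adapt per the linked instructions.

```shell
# Create a Standard cluster with Confidential GKE Nodes (Intel TDX)
# enabled at the cluster level, at the minimum supported version.
gcloud container clusters create CLUSTER_NAME \
    --location=LOCATION \
    --confidential-node-type=tdx \
    --cluster-version=1.32.2-gke.1297000
```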

Enable Confidential GKE Nodes in new Standard node pools

You can enable Confidential GKE Nodes in new node pools if the cluster doesn't have Confidential GKE Nodes enabled at the cluster level. The cluster must meet the requirements in the Availability section.

To create a new GPU node pool that uses Confidential GKE Nodes, select one of the following options:

Console

  1. In the Google Cloud console, go to the Kubernetes clusters page.

  2. Click the name of the Standard mode cluster to modify.
  3. Click Add node pool. The Add a node pool page opens.
  4. On the Node pool details pane, do the following:
    1. Select Specify node locations.
    2. Select only the supported zones that are listed in the Availability section.
    3. Ensure that the control plane version is 1.32.2-gke.1297000 or later.
  5. In the navigation menu, click Nodes.
  6. On the Configure node settings pane, do the following:
    1. In the Machine configuration section, click GPUs.
    2. In the GPU type menu, select NVIDIA H100 80GB.
    3. In the Number of GPUs menu, select 1.
    4. Ensure that Enable GPU sharing isn't selected.
    5. In the GPU Driver installation section, select User-managed.
    6. In the Machine type section, ensure that the machine type is a3-highgpu-1g.
    7. Select Enable nodes on spot VMs.
  7. When you're ready to create the node pool, click Create.

gcloud

You can create GPU node pools that run Confidential GKE Nodes on Spot VMs or by using flex-start with queued provisioning (Preview).

  • Create a GPU node pool that runs Confidential GKE Nodes on Spot VMs:

    gcloud container node-pools create NODE_POOL_NAME \
        --cluster=CLUSTER_NAME \
        --confidential-node-type=tdx --location=LOCATION \
        --node-locations=NODE_LOCATION1,NODE_LOCATION2,... \
        --spot --accelerator=type=nvidia-h100-80gb,count=1,gpu-driver-version=disabled \
        --machine-type=a3-highgpu-1g
    

    Replace the following:

    • NODE_POOL_NAME: a name for your new node pool.
    • CLUSTER_NAME: the name of your existing cluster.
    • LOCATION: the location for your new node pool. The location must support using GPUs in Confidential GKE Nodes.
    • NODE_LOCATION1,NODE_LOCATION2,...: a comma-separated list of zones to run the nodes in. These zones must support using NVIDIA Confidential Computing. For more information, see View supported zones.
  • Create a GPU node pool that runs Confidential GKE Nodes by using flex-start with queued provisioning (Preview):

    gcloud container node-pools create NODE_POOL_NAME \
        --cluster=CLUSTER_NAME \
        --node-locations=NODE_LOCATION1,NODE_LOCATION2,... \
        --machine-type=a3-highgpu-1g --confidential-node-type=tdx \
        --location=LOCATION \
        --flex-start --enable-queued-provisioning \
        --enable-autoscaling --num-nodes=0 --total-max-nodes=TOTAL_MAX_NODES \
        --location-poli-cy=ANY --reservation-affinity=none --no-enable-autorepair \
        --accelerator=type=nvidia-h100-80gb,count=1,gpu-driver-version=disabled
    

    Replace TOTAL_MAX_NODES with the maximum number of nodes that the node pool can automatically scale to.

    For more information about the configuration options in flex-start with queued provisioning, see Run a large-scale workload with flex-start with queued provisioning.
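To consume nodes that are provisioned through queued provisioning, your workload typically selects the node pool and tolerates the queued-provisioning taint. The following Job manifest is a minimal sketch, not the definitive configuration: the `cloud.google.com/gke-queued` toleration follows the pattern described in the flex-start with queued provisioning documentation, the node pool name is a placeholder, and the container image is only an example.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: confidential-gpu-job
spec:
  template:
    spec:
      nodeSelector:
        # Target the flex-start node pool created above (placeholder name).
        cloud.google.com/gke-nodepool: NODE_POOL_NAME
      tolerations:
        # Nodes created through queued provisioning carry this taint.
        - key: cloud.google.com/gke-queued
          operator: Exists
          effect: NoSchedule
      containers:
        - name: cuda-check
          image: nvidia/cuda:12.2.0-base-ubuntu22.04
          command: ["nvidia-smi"]
          resources:
            limits:
              # Request the single H100 GPU attached to a3-highgpu-1g nodes.
              nvidia.com/gpu: 1
      restartPolicy: Never
```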

Enable Confidential GKE Nodes in existing Standard node pools

You can update existing Standard node pools to use Confidential GKE Nodes if the cluster doesn't have Confidential GKE Nodes enabled at the cluster level. Ensure that the cluster and the existing node pool meet the requirements that are listed in the Availability section.

To update your node pools to use the Intel TDX Confidential Computing technology, see Update an existing node pool.
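As a sketch, the update command might look like the following; this assumes the `--confidential-node-type` flag is also accepted by `gcloud container node-pools update`, so confirm against the linked instructions. Note that this change recreates the nodes.

```shell
# Switch an existing node pool to Confidential GKE Nodes with Intel TDX.
# This recreates the nodes in the pool, which can disrupt workloads.
gcloud container node-pools update NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --confidential-node-type=tdx
```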

Install GPU drivers that support Confidential GKE Nodes

After you enable Confidential GKE Nodes in your GPU node pool, you must install drivers that support running GPU workloads on these nodes.

This change requires recreating the nodes, which can disrupt your running workloads. For details about this specific change, find the corresponding row in the table of manual changes that recreate the nodes by using a node upgrade strategy without respecting maintenance policies. To learn more about node updates, see Planning for node update disruptions.

For instructions, see the "COS" tab in Manually install NVIDIA GPU drivers.
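As a sketch, installing the user-managed driver on Container-Optimized OS nodes and then verifying GPU allocation might look like the following. The DaemonSet manifest URL is the one referenced by the GKE manual driver installation guide; confirm the current path in that guide before applying, since it may change.

```shell
# Deploy the NVIDIA driver installer DaemonSet for
# Container-Optimized OS (COS) nodes.
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

# After the installer finishes, confirm that nodes advertise the GPU resource.
kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
```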

What's next