Use Kubeflow TensorBoards

TensorFlow's visualization toolkit, TensorBoard, is a powerful dashboard for visualizing machine learning experiments. It allows you to track metrics like loss and accuracy, visualize the model graph, view histograms of weights and biases, and much more.

Kubeflow provides a native way to spawn TensorBoard instances directly within your Kubernetes cluster, pointing them to existing logs stored on Persistent Volume Claims (PVCs) or Object Storage (S3, MinIO).

Prerequisites

Before creating a TensorBoard instance, ensure that your training jobs are writing logs to a location accessible by the cluster.

  • PVC: If your training job writes logs to a Persistent Volume, note the PVC name and the path within it.
  • Object Storage: If your training job writes logs to S3/MinIO, ensure you have the necessary credentials (often configured via PodDefaults) and the bucket URI (e.g., s3://my-bucket/logs/experiment-1).
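As a minimal sketch of the PVC case: a training job simply writes its event files to a directory on the mounted volume, and that same path is later supplied as the Mount Path when creating the TensorBoard instance. The `logs/experiment-1` path below is hypothetical; in a real pod it would be a path under the PVC's mount point (e.g. `/logs`).

```python
import os
from torch.utils.tensorboard import SummaryWriter

# Hypothetical layout: "logs" stands in for the PVC mount point inside
# the training pod, and "experiment-1" is the run subdirectory that you
# would later enter as the Mount Path for the TensorBoard instance.
log_dir = os.path.join("logs", "experiment-1")
writer = SummaryWriter(log_dir=log_dir)
writer.add_scalar("loss", 0.5, global_step=0)
writer.close()
```

TensorBoard discovers runs by scanning the directory for event files, so keeping one subdirectory per experiment makes runs easy to select later.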

Generating Logs with PyTorch

To visualize your training metrics, your PyTorch code must write events to a log directory. The SummaryWriter class is the main entry point for logging data for consumption by TensorBoard.

import torch
import torchvision
from torch.utils.tensorboard import SummaryWriter
from torchvision import datasets, transforms

# Writer will output to ./runs/<timestamp>_<hostname>/ by default
writer = SummaryWriter()

transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
trainset = datasets.MNIST('mnist_train', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
model = torchvision.models.resnet50(weights=None)  # untrained ResNet-50
# Have ResNet model take in grayscale rather than RGB
model.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
images, labels = next(iter(trainloader))

grid = torchvision.utils.make_grid(images)
writer.add_image('images', grid, 0)
writer.add_graph(model, images)
writer.close()
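The snippet above logs an image grid and the model graph. To populate the Scalars tab as well, call add_scalar from inside your training loop. The following is a self-contained sketch with a toy model and synthetic data (the `runs/scalar-demo` run name and the loop itself are illustrative, not part of the example above).

```python
import torch
from torch.utils.tensorboard import SummaryWriter

# Hypothetical run directory; any path under your log root works.
writer = SummaryWriter("runs/scalar-demo")

# Toy model and synthetic data, just to produce a loss curve.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for step in range(100):
    x = torch.randn(32, 10)
    y = torch.randn(32, 1)
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # One point on the Scalars chart per training step
    writer.add_scalar("train/loss", loss.item(), step)

writer.close()
```

The tag passed to add_scalar ("train/loss") becomes the chart name in the Scalars tab; a slash groups related charts into a collapsible section.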

Create a TensorBoard Instance

  1. Access the Kubeflow Dashboard: Navigate to the TensorBoards section in the Kubeflow central dashboard.

  2. New TensorBoard: Click the New TensorBoard button.

  3. Configure the Instance:

    • Name: Enter a unique name for your TensorBoard instance (e.g., experiment-1-viz).
    • PVC Source:
      • Check this box if your logs are on a PVC.
      • PVC Name: Select the PVC from the dropdown.
      • Mount Path: Specify the path inside the PVC where logs are stored (e.g., /logs/run1).
    • Object Storage Source:
      • Check this box if your logs are in cloud storage.
      • Object Store Link: Provide the full URI to the log directory (e.g., s3://my-bucket/my-model/logs/).
      • Configuration: Select a configuration (PodDefault) if your bucket requires credentials.
  4. Create: Click Create. The TensorBoard instance will be provisioned as a Pod in your namespace.

Accessing the Dashboard

Once the status of your TensorBoard instance changes to Running:

  1. Click Connect next to the instance name.
  2. The TensorBoard UI will open in a new tab.
  3. You can now explore the scalars, graphs, distributions, and other visualizations generated by your training run.

Usage Scenarios

Visualizing Training Metrics

Use the Scalars tab to view plots of accuracy, loss, and learning rate over time. This helps diagnose if your model is overfitting or if the learning rate needs adjustment.

Comparing Runs

If you point TensorBoard to a parent directory containing subdirectories for multiple runs (e.g., run1, run2), TensorBoard will automatically overlay the metrics from these runs, allowing you to compare performance across different hyperparameters.
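The layout this relies on can be sketched as follows: each run writes to its own subdirectory of a common parent, and the TensorBoard instance is pointed at the parent. The run names and the decaying "loss" values below are synthetic, purely to produce two overlapping curves.

```python
from torch.utils.tensorboard import SummaryWriter

# Two runs under one parent directory ("runs/compare"). A TensorBoard
# pointed at the parent overlays the "loss" curves from both runs.
for run, lr in [("run1", 0.1), ("run2", 0.01)]:
    writer = SummaryWriter(f"runs/compare/{run}")
    for step in range(50):
        # Synthetic loss that decays at a rate depending on lr
        writer.add_scalar("loss", 1.0 / (1 + lr * step), step)
    writer.close()
```

In the UI, each subdirectory appears as a selectable run in the left-hand runs list, so you can toggle individual runs on and off while comparing.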

Debugging Model Architecture

Use the Graphs tab to visualize the computational graph of your model. This ensures that the model is built as expected and helps identify structural issues.

Cleanup

TensorBoard instances consume cluster resources (CPU/Memory). When you are finished analyzing your experiments:

  1. Go back to the TensorBoards list.
  2. Click the Delete (trash icon) button next to your instance.
  3. Confirm the deletion. This removes the visualization server but does not delete your training logs or models stored on the PVC or Object Storage.