Use Kubeflow TensorBoards
TensorFlow's visualization toolkit, TensorBoard, is a powerful dashboard for visualizing machine learning experiments. It allows you to track metrics like loss and accuracy, visualize the model graph, view histograms of weights and biases, and much more.
Kubeflow provides a native way to spawn TensorBoard instances directly within your Kubernetes cluster, pointing them to existing logs stored on Persistent Volume Claims (PVCs) or Object Storage (S3, MinIO).
TOC
- Prerequisites
- Generating Logs with PyTorch
- Create a TensorBoard Instance
- Accessing the Dashboard
- Usage Scenarios
  - Visualizing Training Metrics
  - Comparing Runs
  - Debugging Model Architecture
- Cleanup
Prerequisites
Before creating a TensorBoard instance, ensure that your training jobs are writing logs to a location accessible by the cluster.
- PVC: If your training job writes logs to a Persistent Volume, note the PVC name and the path within it.
- Object Storage: If your training job writes logs to S3/MinIO, ensure you have the necessary credentials (often configured via PodDefaults) and the bucket URI (e.g., s3://my-bucket/logs/experiment-1).
Generating Logs with PyTorch
To visualize your training metrics, your PyTorch code must write events to a log directory. The SummaryWriter class is the main entry point for logging data for consumption by TensorBoard.
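A minimal sketch of logging scalars with SummaryWriter is shown below. The log directory and metric values are placeholders; in a real job, the directory should live on the PVC or object store that the TensorBoard instance will later point to.

```python
from torch.utils.tensorboard import SummaryWriter

# Hypothetical log directory -- in a Kubeflow job this would be a path
# on the mounted PVC, e.g. /logs/run1.
writer = SummaryWriter(log_dir="/tmp/logs/run1")

for step in range(100):
    loss = 1.0 / (step + 1)      # placeholder metric
    accuracy = step / 100.0      # placeholder metric
    writer.add_scalar("Loss/train", loss, step)
    writer.add_scalar("Accuracy/train", accuracy, step)

# close() flushes the event files that TensorBoard reads.
writer.close()
```

After training, the directory contains `events.out.tfevents.*` files, which are what the TensorBoard instance consumes.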
Create a TensorBoard Instance
1. Access the Kubeflow Dashboard: Navigate to the TensorBoards section in the Kubeflow central dashboard.
2. New TensorBoard: Click the New TensorBoard button.
3. Configure the Instance:
   - Name: Enter a unique name for your TensorBoard instance (e.g., experiment-1-viz).
   - PVC Source:
     - Check this box if your logs are on a PVC.
     - PVC Name: Select the PVC from the dropdown.
     - Mount Path: Specify the path inside the PVC where logs are stored (e.g., /logs/run1).
   - Object Storage Source:
     - Check this box if your logs are in cloud storage.
     - Object Store Link: Provide the full URI to the log directory (e.g., s3://my-bucket/my-model/logs/).
     - Configuration: Select a configuration (PodDefault) if your bucket requires credentials.
4. Create: Click Create. The TensorBoard instance will be provisioned as a Pod in your namespace.
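If you prefer a declarative workflow, the same instance can also be described as a Tensorboard custom resource and applied with kubectl. The exact schema varies across Kubeflow versions, so treat the field names below (notably spec.logspath and the pvc:// scheme) as assumptions to verify against the CRD installed in your cluster:

```yaml
# Sketch of a Tensorboard custom resource -- field names are assumptions;
# check `kubectl explain tensorboard` against your Kubeflow version.
apiVersion: tensorboard.kubeflow.org/v1alpha1
kind: Tensorboard
metadata:
  name: experiment-1-viz   # example name
  namespace: my-namespace  # example namespace
spec:
  logspath: pvc://my-pvc/logs/run1  # example PVC and path
```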
Accessing the Dashboard
Once the status of your TensorBoard instance changes to Running:
- Click Connect next to the instance name.
- The TensorBoard UI will open in a new tab.
- You can now explore the scalars, graphs, distributions, and other visualizations generated by your training run.
Usage Scenarios
Visualizing Training Metrics
Use the Scalars tab to view plots of accuracy, loss, and learning rate over time. This helps diagnose if your model is overfitting or if the learning rate needs adjustment.
Comparing Runs
If you point TensorBoard to a parent directory containing subdirectories for multiple runs (e.g., run1, run2), TensorBoard will automatically overlay the metrics from these runs, allowing you to compare performance across different hyperparameters.
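The layout above can be produced by giving each run its own SummaryWriter subdirectory under a shared parent. The directory names, learning rates, and toy loss curve below are illustrative; pointing the TensorBoard instance at the parent directory (/tmp/logs here) overlays both runs.

```python
from torch.utils.tensorboard import SummaryWriter

# Hypothetical sweep: one subdirectory per run under a shared parent,
# so TensorBoard overlays the curves when pointed at the parent.
for run, lr in [("run1", 0.1), ("run2", 0.01)]:
    writer = SummaryWriter(log_dir=f"/tmp/logs/{run}")
    for step in range(50):
        # Toy loss whose shape depends on the learning rate.
        writer.add_scalar("Loss/train", 1.0 / (1 + lr * step), step)
    writer.close()
```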
Debugging Model Architecture
Use the Graphs tab to visualize the computational graph of your model. This ensures that the model is built as expected and helps identify structural issues.
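With the PyTorch SummaryWriter, the graph is recorded via add_graph, which traces the model on a sample input. The tiny model and log path below are purely illustrative:

```python
import torch
from torch import nn
from torch.utils.tensorboard import SummaryWriter

# Minimal model purely for illustration.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

writer = SummaryWriter(log_dir="/tmp/logs/graph-demo")  # example path
# add_graph traces the model with the sample input and records the
# computational graph for the Graphs tab.
writer.add_graph(model, torch.zeros(1, 4))
writer.close()
```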
Cleanup
TensorBoard instances consume cluster resources (CPU/Memory). When you are finished analyzing your experiments:
- Go back to the TensorBoards list.
- Click the Delete (trash icon) button next to your instance.
- Confirm the deletion. This removes the visualization server but does not delete your training logs or models stored on the PVC or Object Storage.