
Troubleshooting

This page covers common issues and solutions when working with containers on the Euler cluster.

Common Issues

Container Build Issues

Docker daemon not running

Cannot connect to the Docker daemon at unix:///var/run/docker.sock

Solution: Start Docker service

sudo systemctl start docker
# or on macOS:
open -a Docker

Apptainer/Singularity not found

bash: apptainer: command not found

Solution: Install Apptainer (v1.2.5 recommended)

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y apptainer

# From source (recommended if you need to pin an exact version)
wget https://github.com/apptainer/apptainer/releases/download/v1.2.5/apptainer-1.2.5.tar.gz
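
The tarball alone is not enough; below is a minimal sketch of the remaining steps from the standard Apptainer source install, assuming Go and the usual build tools are already available:

# Build and install from the downloaded tarball (requires Go)
tar -xzf apptainer-1.2.5.tar.gz
cd apptainer-1.2.5
./mconfig
make -C builddir
sudo make -C builddir install

# Confirm the installed version
apptainer --version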

Transfer Issues

Connection timeout during SCP

ssh: connect to host euler.ethz.ch port 22: Connection timed out

Solution: Euler's login nodes are normally only reachable from the ETH network, so connect to the ETH VPN if you are off campus. Verify basic connectivity before retrying, and prefer a resumable transfer for large files (see below).
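
A quick connectivity check and a resumable alternative to scp; the destination path follows the earlier examples and assumes your local $USER matches your cluster username:

# Verify that the login node answers on port 22
nc -vz euler.ethz.ch 22

# Resume an interrupted transfer instead of restarting it
rsync -avP my-app.tar euler.ethz.ch:/cluster/work/rsl/$USER/containers/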

Insufficient space during transfer

scp: write: No space left on device

Solution: Check quotas and clean up

# Check your quotas
lquota

# Clean old containers
rm /cluster/work/rsl/$USER/containers/old-*.tar
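
If that is not enough, it helps to see what is actually using the space; the path follows the earlier examples:

# List the largest items under the containers directory
du -ah /cluster/work/rsl/$USER/containers/ | sort -rh | head -20

# Find tarballs older than 30 days that may be safe to remove
find /cluster/work/rsl/$USER/containers/ -name '*.tar' -mtime +30 -ls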

SLURM Job Issues

Job stays pending

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
12345    gpu.4h container   user PD       0:00      1 (Resources)

Solution: The reason (Resources) means the scheduler is waiting for nodes with enough free resources. Either wait, or lower the request (fewer GPUs, less memory, a shorter time limit) so the job can be scheduled sooner; the commands below show the estimated start time and partition availability.
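
Standard SLURM commands for this; the partition name gpu.4h is taken from the example output above:

# Estimated start time for your pending jobs
squeue -u $USER --start

# Node availability in the requested partition
sinfo -p gpu.4h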

Container extraction fails

tar: my-app.sif: Cannot open: No such file or directory

Solution: Verify container path

# List available containers
ls -la /cluster/work/rsl/$USER/containers/

# Check if extraction completed
ls -la $TMPDIR/
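
To catch this earlier, the job script can fail fast before the extraction step; a sketch with a hypothetical tarball name:

# Abort the job immediately if the tarball is missing
TARBALL=/cluster/work/rsl/$USER/containers/my-app.tar
if [ ! -f "$TARBALL" ]; then
    echo "Container tarball not found: $TARBALL" >&2
    exit 1
fi
tar -xf "$TARBALL" -C "$TMPDIR"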

GPU not detected

CUDA available: False

Solution:

  1. Check job allocation:
    echo $CUDA_VISIBLE_DEVICES  # Should show GPU ID
    nvidia-smi  # Should list GPUs
    
  2. Verify that the --nv flag is passed to the singularity command
  3. Check CUDA version compatibility:
    singularity exec --nv container.sif nvidia-smi
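
If the GPU is visible on the host but not inside the container, comparing the host driver with what the container sees usually narrows it down; the torch check assumes PyTorch is installed in the container:

# Host driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# CUDA toolkit and availability as seen inside the container
singularity exec --nv $TMPDIR/container.sif \
    python3 -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"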
    

Runtime Issues

Out of memory (OOM)

RuntimeError: CUDA out of memory

Solution:

  1. Reduce the batch size, model size, or input resolution.
  2. Request a GPU with more memory (check the cluster documentation for the exact --gpus selection syntax).
  3. Confirm that nothing else on the allocated GPU is holding memory; the monitoring sketch below helps verify this.
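
A quick way to watch GPU memory while the job runs (standard nvidia-smi query flags; run this on the allocated node):

# Refresh GPU memory usage every 2 seconds
watch -n 2 'nvidia-smi --query-gpu=memory.used,memory.total --format=csv'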

Permission denied

Permission denied: '/output/results.txt'

Solution: The .sif image is read-only at runtime, so the application must write to a path that is bind-mounted from the host. Write to a host directory (for example under $TMPDIR) and bind it to the path the application expects, as sketched below.
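
A minimal sketch, assuming the application writes to /output as in the error message; train.py stands in for your actual entry point:

# Create a writable directory on the host and bind it into the container
mkdir -p $TMPDIR/output
singularity exec --nv --bind $TMPDIR/output:/output \
    $TMPDIR/container.sif python3 train.py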

Slow I/O performance

Solution: Always use local scratch

# Good - use $TMPDIR
cp /cluster/scratch/$USER/data.tar $TMPDIR/
tar -xf $TMPDIR/data.tar -C $TMPDIR/

# Bad - network I/O
tar -xf /cluster/scratch/$USER/data.tar -C /cluster/work/$USER/
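
Because $TMPDIR lives on the node's local scratch and is cleaned up when the job ends, copy results back to permanent storage before the job script exits; the paths below follow the earlier examples and the results directory name is illustrative:

# Stage results back to permanent storage at the end of the job script
mkdir -p /cluster/work/rsl/$USER/results/$SLURM_JOB_ID
cp -r "$TMPDIR"/output /cluster/work/rsl/$USER/results/$SLURM_JOB_ID/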

Debugging Techniques

Interactive Debugging

Start an interactive session:

# Request resources
srun --gpus=1 --mem=16G --tmp=50G --time=1:00:00 --pty bash

# Extract container
tar -xf /cluster/work/rsl/$USER/containers/debug.tar -C $TMPDIR

# Enter container interactively
singularity shell --nv $TMPDIR/debug.sif

# Test commands manually
python3 -c "import torch; print(torch.cuda.is_available())"
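
To debug a batch job that is already running, a shell can also be attached to its existing allocation (requires a SLURM version that supports --overlap; replace <jobid> with the ID shown by squeue):

# Attach an interactive shell to a running job's allocation
srun --jobid=<jobid> --overlap --pty bash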

Verbose Output

Add debugging to job scripts:

#!/bin/bash
#SBATCH --job-name=debug-job

# Enable bash debugging
set -x

# Print environment
echo "=== Environment ==="
env | grep -E "(CUDA|SINGULARITY|SLURM)" | sort

# Check allocations
echo "=== Allocations ==="
echo "GPUs: $CUDA_VISIBLE_DEVICES"
echo "CPUs: $SLURM_CPUS_PER_TASK"
echo "Memory: $SLURM_MEM_PER_CPU MB per CPU"
echo "Tmp space: $(df -h $TMPDIR | tail -1)"

# Time each step
echo "=== Extraction ==="
time tar -xf container.tar -C $TMPDIR

echo "=== Container Info ==="
singularity inspect $TMPDIR/container.sif
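
It also helps to give each run a predictable log file so output from different jobs does not get mixed up; these directives go near the top of the job script, and %x and %j are standard sbatch filename patterns for the job name and job ID:

# Write stdout/stderr to per-job files (the logs/ directory must exist before submission)
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err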

Common Debug Commands

# Check job details
scontrol show job $SLURM_JOB_ID

# Monitor resource usage
watch -n 1 'sstat -j $SLURM_JOB_ID --format=JobID,MaxRSS,MaxDiskRead,MaxDiskWrite'

# Check GPU usage on the node running the job (node name from squeue)
NODE=$(squeue -j $SLURM_JOB_ID -h -o %N)
ssh $NODE 'nvidia-smi -l 1'

# View detailed job info after completion
sacct -j $SLURM_JOB_ID --format=JobID,JobName,Partition,State,ExitCode,Elapsed,MaxRSS,AllocGRES

# Check why job failed
scontrol show job $SLURM_JOB_ID | grep -E "(Reason|ExitCode)"

Performance Optimization

Container Size Optimization

Reduce container size:

# Multi-stage build
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 AS builder
RUN apt-get update && apt-get install -y build-essential
# Build steps...

FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
COPY --from=builder /app/bin /app/bin
# Minimal runtime dependencies only
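
To verify the effect, compare the final image size with the single-stage build (standard docker images format flags; the image name my-app is hypothetical):

# Show the size of the built image
docker images my-app --format '{{.Repository}}:{{.Tag}}  {{.Size}}'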

Data Loading Optimization

# Use local scratch for datasets
import os
import shutil

from torch.utils.data import DataLoader

# Copy dataset to local scratch at job start
if os.environ.get('SLURM_JOB_ID'):
    local_data = f"{os.environ['TMPDIR']}/dataset"
    if not os.path.exists(local_data):
        shutil.copytree(f"/cluster/scratch/{os.environ['USER']}/dataset", local_data)
    data_path = local_data
else:
    data_path = './dataset'

# Use multiple workers for data loading, matching the allocated CPU count
num_workers = int(os.environ.get('SLURM_CPUS_PER_TASK', 8))
dataloader = DataLoader(dataset,  # dataset constructed from data_path
                        batch_size=32,
                        num_workers=num_workers,
                        pin_memory=True,
                        persistent_workers=True)

Getting Help

RSL-Specific Support

Cluster Support

Community Resources

