This guide covers interactive sessions and batch job submission on the Euler cluster.
Interactive sessions on Euler allow you to work directly on compute nodes with allocated resources. This is essential for development, debugging, and testing before submitting batch jobs.
The basic command to request an interactive session is:
srun --pty bash
This gives you a basic session with default resources. For more control, specify your requirements:
# 2 hours, 8 CPUs, 32GB RAM
srun --time=2:00:00 --cpus-per-task=8 --mem=32G --pty bash
# 4 hours, 1 GPU, 16 CPUs, 64GB RAM, 100GB local scratch
srun --time=4:00:00 --gpus=1 --cpus-per-task=16 --mem=64G --tmp=100G --pty bash
# 2 hours, 4 GPUs, 32 CPUs, 128GB RAM
srun --time=2:00:00 --gpus=4 --cpus-per-task=32 --mem=128G --pty bash
# 1 hour, 4 CPUs, 256GB RAM
srun --time=1:00:00 --cpus-per-task=4 --mem=256G --pty bash
Once your session starts, you'll be on a compute node:
# Check your allocated resources
echo "Hostname: $(hostname)"
echo "CPUs: $(nproc)"
echo "Memory: $(free -h | grep Mem | awk '{print $2}')"
echo "GPUs: $CUDA_VISIBLE_DEVICES"
echo "Local scratch: $TMPDIR"
# Load necessary modules
module load eth_proxy
# Activate your conda environment
conda activate myenv
# For GPU sessions, verify CUDA
nvidia-smi
# Request GPU session
srun --gpus=1 --mem=32G --time=2:00:00 --pty bash
# Navigate to your code
cd /cluster/home/$USER/my_project
# Run and debug
python train.py --debug
# In your interactive session
module load eth_proxy
conda activate myenv
# Start IPython with GPU support
ipython
# In IPython:
# >>> import torch
# >>> torch.cuda.is_available()
# >>> # Interactive debugging here
# Request session with local scratch
srun --gpus=1 --tmp=50G --mem=32G --pty bash
# Extract container to local scratch
tar -xf /cluster/work/rsl/$USER/containers/dev.tar -C $TMPDIR
# Enter container interactively
singularity shell --nv \
--bind /cluster/project/rsl/$USER:/project \
$TMPDIR/dev.sif
# Now you're inside the container for testing
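Before starting longer tests, it can help to confirm the GPU is visible inside the container; a quick one-off check, assuming the same dev.sif extracted above:
# Verify GPU visibility inside the container without opening a shell
singularity exec --nv $TMPDIR/dev.sif nvidia-smi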
ETH provides JupyterHub access to Euler; you can also start a Jupyter server yourself from an interactive session:
# In an interactive session
srun --gpus=1 --mem=32G --time=4:00:00 --pty bash
# Load modules
module load eth_proxy
# Start Jupyter (note the token in output)
jupyter notebook --no-browser --ip=$(hostname -i)
# From your local machine, create SSH tunnel:
# ssh -L 8888:compute-node:8888 euler
# Then open http://localhost:8888 in your browser
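If you prefer not to look up the compute node by hand, a small helper run on your local machine can build the tunnel automatically; this is a sketch that assumes exactly one running job and the port 8888 used above:
# On your local machine: find the node of your running job and open the tunnel
NODE=$(ssh euler "squeue -u \$USER -h -t RUNNING -o %N | head -n 1")
ssh -L 8888:$NODE:8888 euler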
You can use VSCode directly on Euler nodes:
# First, request an interactive session
srun --gpus=1 --mem=32G --time=4:00:00 --pty bash
# Note the compute node name (e.g., eu-g1-001)
hostname
# From your local VSCode:
# 1. Install "Remote - SSH" extension
# 2. Connect to: ssh username@euler
# 3. Then SSH to the compute node from terminal
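As an alternative to step 3, you can let SSH hop through the login node for you with a ProxyJump entry in your local ~/.ssh/config; this is a sketch, with username and the eu-* node pattern as placeholders:
# ~/.ssh/config on your local machine
Host euler
    HostName euler.ethz.ch
    User username

# Reach allocated compute nodes (e.g. eu-g1-001) through the login node
Host eu-*
    ProxyJump euler
    User username
With this in place, VSCode's Remote - SSH can connect straight to the node name reported by hostname.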
Use --tmp for I/O-intensive work - local scratch is much faster.
Session won't start (pending)
# Check queue status
squeue -u $USER
# Check available resources
sinfo -p gpu
# Try requesting fewer resources or different partition
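squeue can also show when the scheduler expects a pending job to start, which helps you decide between waiting and resubmitting with smaller requests:
# Estimated start time for your pending jobs
squeue -u $USER --start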
Disconnection from interactive session
# Prevent with screen/tmux
screen -S mysession
srun --gpus=1 --pty bash
# Detach: Ctrl+A, D
# Reattach: screen -r mysession
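tmux works the same way if you prefer it over screen:
# Same pattern with tmux
tmux new -s mysession
srun --gpus=1 --pty bash
# Detach: Ctrl+B, D
# Reattach: tmux attach -t mysession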
Out of memory in session
# Monitor memory usage
watch -n 1 free -h
# Check your process memory
ps aux | grep $USER
# Request more memory next time
| Task | Command |
|---|---|
| Basic session | srun --pty bash |
| GPU session | srun --gpus=1 --pty bash |
| Specific time | srun --time=4:00:00 --pty bash |
| More memory | srun --mem=64G --pty bash |
| Local scratch | srun --tmp=100G --pty bash |
| Check allocation | scontrol show job $SLURM_JOB_ID |
| Exit session | exit or Ctrl+D |
SLURM batch scripts allow you to submit jobs that run without manual intervention. Here are tested examples for common use cases on the Euler cluster.
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --output=logs/job_%j.out
#SBATCH --error=logs/job_%j.err
#SBATCH --time=04:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=4G
# Load required modules
module load eth_proxy
# Job info
echo "Job started on $(hostname) at $(date)"
echo "Job ID: $SLURM_JOB_ID"
# Your commands here
cd /cluster/home/$USER/my_project
python my_script.py
echo "Job completed at $(date)"
Submit with: sbatch my_job.sh
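If you want to script around submissions, sbatch --parsable prints only the job ID, which you can reuse for monitoring; a small sketch:
# Capture the job ID for follow-up commands
JOBID=$(sbatch --parsable my_job.sh)
squeue -j $JOBID
# After the job finishes:
seff $JOBID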
# Request any available GPU
#SBATCH --gpus=1
# Request specific GPU model
#SBATCH --gpus=nvidia_geforce_rtx_4090:1 # RTX 4090 (24GB VRAM)
#SBATCH --gpus=nvidia_geforce_rtx_3090:1 # RTX 3090 (24GB VRAM)
#SBATCH --gpus=nvidia_a100_80gb_pcie:1 # A100 (80GB VRAM)
#SBATCH --gpus=nvidia_a100-pcie-40gb:1 # A100 (40GB VRAM)
#SBATCH --gpus=tesla_v100-sxm2-32gb:1 # V100 (32GB VRAM)
# Request multiple GPUs of same type
#SBATCH --gpus=nvidia_geforce_rtx_4090:4 # 4x RTX 4090
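To check which GPU types are actually configured on the nodes (and confirm a GRES name before using it), you can query sinfo; the exact output depends on the cluster configuration:
# List nodes with their generic resources (GRES), including GPU types
sinfo -o "%N %G" | grep -i gpu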
# System memory (RAM) - shared by CPUs
#SBATCH --mem=64G # Total memory for job
#SBATCH --mem-per-cpu=8G # Memory per CPU core
# GPU memory (VRAM) is fixed by GPU type:
# RTX 4090: 24GB VRAM
# RTX 3090: 24GB VRAM
# A100: 40GB or 80GB VRAM
# V100: 32GB VRAM
# RTX 2080 Ti: 11GB VRAM
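Inside a running GPU job you can confirm the model and VRAM you actually received, which is useful when requesting any available GPU:
# Show GPU model plus total and used VRAM for the allocated device(s)
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv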
#!/bin/bash
#SBATCH --job-name=llm-training
#SBATCH --gpus=nvidia_a100_80gb_pcie:1 # Need 80GB VRAM for model
#SBATCH --cpus-per-task=32 # Many CPUs for data loading
#SBATCH --mem=256G # Large system RAM for dataset
#SBATCH --time=72:00:00
#SBATCH --tmp=500G # Local scratch for dataset
module load eth_proxy
# The A100 80GB GPU allows loading larger models
# System RAM (256GB) is for CPU operations and data loading
# GPU VRAM (80GB) is for model weights and activations
#!/bin/bash
#SBATCH --job-name=gpu-training
#SBATCH --output=logs/train_%j.out
#SBATCH --error=logs/train_%j.err
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=4G
#SBATCH --gpus=1 # Request any available GPU
#SBATCH --tmp=100G
# For specific GPU types, use one of these instead:
# #SBATCH --gpus=nvidia_geforce_rtx_4090:1 # RTX 4090 (24GB)
# #SBATCH --gpus=nvidia_geforce_rtx_3090:1 # RTX 3090 (24GB)
# #SBATCH --gpus=nvidia_a100_80gb_pcie:1 # A100 80GB
# #SBATCH --gpus=nvidia_a100-pcie-40gb:1 # A100 40GB
# #SBATCH --gpus=tesla_v100-sxm2-32gb:1 # V100 32GB
# Load modules
module load eth_proxy
# Job information
echo "========================================="
echo "SLURM Job ID: $SLURM_JOB_ID"
echo "Running on: $(hostname)"
echo "Starting at: $(date)"
echo "GPU allocation: $CUDA_VISIBLE_DEVICES"
echo "========================================="
# Copy dataset to local scratch for faster I/O
echo "Copying dataset to local scratch..."
cp -r /cluster/scratch/$USER/datasets/my_dataset $TMPDIR/
# Activate conda environment
source /cluster/project/rsl/$USER/miniconda3/bin/activate
conda activate ml_env
# Run training
cd /cluster/home/$USER/my_ml_project
python train.py \
--data-dir $TMPDIR/my_dataset \
--output-dir /cluster/project/rsl/$USER/results/$SLURM_JOB_ID \
--epochs 100 \
--batch-size 64 \
--lr 0.001
# Copy final results back
echo "Copying results..."
cp -r $TMPDIR/checkpoints/* /cluster/project/rsl/$USER/checkpoints/
echo "Job completed at $(date)"
#!/bin/bash
#SBATCH --job-name=distributed-training
#SBATCH --output=logs/distributed_%j.out
#SBATCH --error=logs/distributed_%j.err
#SBATCH --time=48:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem-per-cpu=8G
#SBATCH --gpus=4
#SBATCH --tmp=200G
module load eth_proxy
echo "Multi-GPU training on $(hostname)"
echo "GPUs: $CUDA_VISIBLE_DEVICES"
echo "Number of GPUs: $(echo $CUDA_VISIBLE_DEVICES | tr ',' '\n' | wc -l)"
# Prepare data on local scratch
tar -xf /cluster/scratch/$USER/datasets/imagenet.tar -C $TMPDIR/
# Activate environment
source /cluster/project/rsl/$USER/miniconda3/bin/activate
conda activate pytorch_env
# Set distributed training environment variables
export MASTER_ADDR=$(hostname)
export MASTER_PORT=29500
export WORLD_SIZE=4
# Run distributed training
cd /cluster/home/$USER/vision_project
python -m torch.distributed.run \
--nproc_per_node=4 \
--master_addr=$MASTER_ADDR \
--master_port=$MASTER_PORT \
train_distributed.py \
--data $TMPDIR/imagenet \
--output /cluster/project/rsl/$USER/results/$SLURM_JOB_ID \
--sync-bn \
--amp
echo "Training completed at $(date)"
#!/bin/bash
#SBATCH --job-name=param-sweep
#SBATCH --output=logs/array_%A_%a.out
#SBATCH --error=logs/array_%A_%a.err
#SBATCH --time=02:00:00
#SBATCH --array=1-25                  # 5 learning rates x 5 batch sizes = 25 tasks
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=4G
#SBATCH --gpus=1
module load eth_proxy
# Array job information
echo "Array Job ID: $SLURM_ARRAY_JOB_ID"
echo "Array Task ID: $SLURM_ARRAY_TASK_ID"
echo "Running on: $(hostname)"
# Define parameter arrays
learning_rates=(0.001 0.0001 0.00001 0.01 0.1)
batch_sizes=(16 32 64 128 256)
# Calculate indices for 2D parameter grid
lr_index=$(( ($SLURM_ARRAY_TASK_ID - 1) / ${#batch_sizes[@]} ))
bs_index=$(( ($SLURM_ARRAY_TASK_ID - 1) % ${#batch_sizes[@]} ))
LR=${learning_rates[$lr_index]}
BS=${batch_sizes[$bs_index]}
echo "Testing LR=$LR, Batch Size=$BS"
# Activate environment
source /cluster/project/rsl/$USER/miniconda3/bin/activate
conda activate ml_env
# Run experiment
cd /cluster/home/$USER/hyperparameter_search
python train.py \
--lr $LR \
--batch-size $BS \
--epochs 20 \
--output /cluster/project/rsl/$USER/hp_search/lr${LR}_bs${BS} \
--seed $SLURM_ARRAY_TASK_ID
Submit array job: sbatch --array=1-25 array_job.sh
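To sanity-check the index arithmetic before submitting, you can reproduce the mapping locally; with 5 learning rates and 5 batch sizes, task IDs 1-25 cover the full grid:
# Print the (LR, batch size) pair each array task would receive
learning_rates=(0.001 0.0001 0.00001 0.01 0.1)
batch_sizes=(16 32 64 128 256)
for task_id in $(seq 1 25); do
    lr_index=$(( (task_id - 1) / ${#batch_sizes[@]} ))
    bs_index=$(( (task_id - 1) % ${#batch_sizes[@]} ))
    echo "Task $task_id: LR=${learning_rates[$lr_index]}, BS=${batch_sizes[$bs_index]}"
done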
#!/bin/bash
#SBATCH --job-name=container-job
#SBATCH --output=logs/container_%j.out
#SBATCH --error=logs/container_%j.err
#SBATCH --time=12:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=8G
#SBATCH --gpus=2
#SBATCH --tmp=150G
module load eth_proxy
echo "Container job started on $(hostname)"
echo "Extracting container to local scratch..."
# Extract container (much faster than running from /cluster/work)
time tar -xf /cluster/work/rsl/$USER/containers/ml_stack.tar -C $TMPDIR
# Prepare data
echo "Preparing data..."
mkdir -p $TMPDIR/data
cp -r /cluster/scratch/$USER/datasets/train_data $TMPDIR/data/
# Run training in container
echo "Starting training..."
singularity exec \
--nv \
--bind $TMPDIR/data:/data:ro \
--bind /cluster/project/rsl/$USER/results/$SLURM_JOB_ID:/output \
--bind /cluster/project/rsl/$USER/checkpoints:/checkpoints \
$TMPDIR/ml_stack.sif \
python /app/train.py \
--data /data/train_data \
--output /output \
--checkpoint-dir /checkpoints \
--resume-from latest
echo "Job completed at $(date)"
# Submit job
sbatch my_job.sh
# Check job status
squeue -u $USER
# Detailed job info
scontrol show job <job_id>
# Cancel job
scancel <job_id>
# Cancel all your jobs
scancel -u $USER
# View job efficiency after completion
seff <job_id>
# Monitor job in real-time
watch -n 10 squeue -u $USER
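For jobs that have already finished, sacct reports accounting data such as state, elapsed time, and peak memory:
# Accounting data for a completed job (replace <job_id>)
sacct -j <job_id> --format=JobID,JobName,State,Elapsed,MaxRSS,ReqMem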
#!/bin/bash
#SBATCH --job-name=debug-job
#SBATCH --output=logs/debug_%j.out
#SBATCH --error=logs/debug_%j.err
#SBATCH --time=00:30:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=4G
#SBATCH --gpus=1
# Enable bash debugging
set -e # Exit on error
set -u # Exit on undefined variable
set -x # Print commands as they execute
module load eth_proxy
# Print environment for debugging
echo "=== Environment ==="
env | grep SLURM
echo "=== GPU Info ==="
nvidia-smi
echo "=== Memory Info ==="
free -h
echo "=== Disk Space ==="
df -h $TMPDIR
# Your actual commands with error checking
if ! python --version; then
echo "Python not found!"
exit 1
fi
# Run with explicit error handling
python my_script.py || {
echo "Script failed with exit code $?"
echo "Current directory: $(pwd)"
echo "Files present: $(ls -la)"
exit 1
}
Run mkdir -p logs before submitting.
Before submitting your job, verify:
- The script starts with #!/bin/bash
- The logs directory exists (mkdir -p logs)
- eth_proxy is loaded
- Paths use the $USER variable
- $TMPDIR is used for I/O-intensive work
We provide test scripts for all computing scenarios:
To run the tests:
# Test basic job submission
sbatch test_cpu_job.sh
# Test GPU allocation
sbatch test_gpu_job.sh
# Test specific GPU request
sbatch test_gpu_specific.sh
# Test array jobs (creates 6 tasks)
sbatch test_array_job.sh