This guide covers setting up Python environments on Euler and provides a complete machine learning training workflow.
Python virtual environments are commonly used on Euler to manage dependencies in a clean, reproducible way. A convenient tool for this is Miniconda, a minimal distribution of the conda package manager that ships far less than the full Anaconda installation.
To install Miniconda3, follow these steps:
# Create a directory for Miniconda
mkdir -p /cluster/project/rsl/$USER/miniconda3
# Download the installer
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /cluster/project/rsl/$USER/miniconda3/miniconda.sh
# Run the installer silently
bash /cluster/project/rsl/$USER/miniconda3/miniconda.sh -b -u -p /cluster/project/rsl/$USER/miniconda3/
# Clean up the installer
rm /cluster/project/rsl/$USER/miniconda3/miniconda.sh
# Initialize conda for bash
/cluster/project/rsl/$USER/miniconda3/bin/conda init bash
# Prevent auto-activation of the base environment
conda config --set auto_activate_base false
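After opening a fresh shell (so the changes made by conda init take effect), you can verify the installation:
# Confirm conda is on your PATH and working
which conda
conda --version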
Installing Miniconda in your home directory (~/miniconda3) is not recommended due to storage and inode limits. Instead, install it in your project directory:
/cluster/project/rsl/$USER/miniconda3
If you have already installed Miniconda in your home directory, you can move it to your project directory and create a symbolic link:
mv ~/miniconda3 /cluster/project/rsl/$USER/
ln -s /cluster/project/rsl/$USER/miniconda3 ~/miniconda3
This way, conda commands referencing ~/miniconda3 will still work, while the files reside in a directory with more storage and inode capacity.
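To confirm the symlink is in place and keep an eye on your quotas, you can run the following (lquota is the quota tool documented for Euler; if it is unavailable, ask cluster support for the equivalent):
# The symlink should point at the project directory
ls -l ~/miniconda3
# Show storage and inode usage for your directories
lquota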
Once Miniconda is installed and configured, you can create a new conda environment like this:
conda create -n myenv python=3.10
To activate the environment:
conda activate myenv
You can install packages as needed:
conda install numpy pandas matplotlib
To deactivate the environment:
conda deactivate
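Two more commands that come in handy when managing environments:
# List all environments
conda env list
# Delete an environment you no longer need
conda remove -n myenv --all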
In general, keep your conda installation and environments under /cluster/project/rsl/$USER/miniconda3 to avoid exhausting the inode quota of your home directory.
This section provides a complete end-to-end workflow for machine learning training on Euler using conda environments. First, connect to Euler and set up a project directory:
ssh euler
cd /cluster/home/$USER
mkdir -p ml_project/{scripts,data,configs,logs}
cd ml_project
# Request interactive session for installation
srun --time=1:00:00 --mem=16G --cpus-per-task=8 --pty bash
# Create environment with specific Python version
conda create -n ml_training python=3.10 -y
# Activate environment
conda activate ml_training
# Install PyTorch with CUDA support
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y
# Install additional ML packages
conda install numpy pandas scikit-learn matplotlib seaborn jupyterlab -y
pip install wandb tensorboard transformers datasets
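# Optional sanity check: confirm PyTorch can see a GPU. This prints False on
# CPU-only nodes; request a GPU node (e.g. srun --gpus=1 ...) to see True.
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"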
# Save environment
conda env export > environment.yml
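The exported environment.yml makes the setup reproducible; you (or a collaborator) can rebuild the same environment later with:
conda env create -f environment.yml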
Create prepare_data.py:
#!/usr/bin/env python3
"""Prepare and preprocess a dataset for training."""
import os
import argparse

import h5py
from torchvision import datasets, transforms


def prepare_dataset(args):
    """Download and preprocess the dataset."""
    # Resize CIFAR-10 images to 224x224 and normalize with ImageNet statistics
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    # Download dataset
    print(f"Downloading dataset to {args.data_dir}")
    train_dataset = datasets.CIFAR10(
        root=args.data_dir,
        train=True,
        download=True,
        transform=transform,
    )
    val_dataset = datasets.CIFAR10(
        root=args.data_dir,
        train=False,
        download=True,
        transform=transform,
    )

    # Convert to HDF5 for faster loading
    if args.convert_hdf5:
        print("Converting to HDF5 format...")
        train_file = os.path.join(args.output_dir, 'train.h5')
        val_file = os.path.join(args.output_dir, 'val.h5')

        # Save training data
        with h5py.File(train_file, 'w') as f:
            images = f.create_dataset('images', (len(train_dataset), 3, 224, 224), dtype='f')
            labels = f.create_dataset('labels', (len(train_dataset),), dtype='i')
            for idx, (img, label) in enumerate(train_dataset):
                if idx % 1000 == 0:
                    print(f"Processing training sample {idx}/{len(train_dataset)}")
                images[idx] = img.numpy()
                labels[idx] = label
        print(f"Saved training data to {train_file}")
        # Validation data can be written to val_file in the same way

    return len(train_dataset), len(val_dataset)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-dir', type=str, required=True)
    parser.add_argument('--output-dir', type=str, required=True)
    parser.add_argument('--convert-hdf5', action='store_true')
    args = parser.parse_args()

    os.makedirs(args.output_dir, exist_ok=True)
    train_size, val_size = prepare_dataset(args)
    print(f"Dataset prepared: {train_size} training, {val_size} validation samples")
Create a comprehensive training script train.py that includes checkpointing, logging, and the other features listed at the end of this guide.
[Full training script available in the complete example below]
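For orientation, here is a minimal, hedged skeleton of train.py. The command-line flags mirror the invocation in train_job.sh below; the model construction, optimizer choice, and checkpoint logic are illustrative assumptions that the full script replaces with proper logging (W&B/TensorBoard under --output-dir) and evaluation:
#!/usr/bin/env python3
"""Minimal train.py skeleton; the full script adds logging and evaluation."""
import argparse
import os

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models


def main():
    parser = argparse.ArgumentParser()
    # Flags match the invocation in train_job.sh
    parser.add_argument('--data-dir', type=str, required=True)
    parser.add_argument('--output-dir', type=str, required=True)
    parser.add_argument('--checkpoint-dir', type=str, required=True)
    parser.add_argument('--model', type=str, default='resnet18')
    parser.add_argument('--epochs', type=int, default=100)
    parser.add_argument('--batch-size', type=int, default=128)
    parser.add_argument('--lr', type=float, default=0.001)
    parser.add_argument('--num-workers', type=int, default=8)
    parser.add_argument('--log-interval', type=int, default=50)
    parser.add_argument('--save-interval', type=int, default=10)
    args = parser.parse_args()

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Dataset was already downloaded by prepare_data.py, hence download=False
    transform = transforms.Compose([transforms.Resize((224, 224)),
                                    transforms.ToTensor()])
    train_set = datasets.CIFAR10(root=args.data_dir, train=True,
                                 download=False, transform=transform)
    loader = DataLoader(train_set, batch_size=args.batch_size,
                        shuffle=True, num_workers=args.num_workers)

    # Build the model by name (torchvision >= 0.14); CIFAR-10 has 10 classes
    model = models.get_model(args.model, num_classes=10).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=args.lr)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(args.epochs):
        model.train()
        for step, (images, labels) in enumerate(loader):
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            if step % args.log_interval == 0:
                print(f"epoch {epoch} step {step} loss {loss.item():.4f}")
        # Periodically checkpoint the model state
        if (epoch + 1) % args.save_interval == 0:
            os.makedirs(args.checkpoint_dir, exist_ok=True)
            torch.save({'epoch': epoch, 'model_state': model.state_dict()},
                       os.path.join(args.checkpoint_dir, f'ckpt_epoch{epoch + 1}.pt'))


if __name__ == '__main__':
    main()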
Create train_job.sh:
#!/bin/bash
#SBATCH --job-name=ml-training
#SBATCH --output=logs/train_%j.out
#SBATCH --error=logs/train_%j.err
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=4G
#SBATCH --gpus=1
#SBATCH --tmp=100G
# Load modules
module load eth_proxy
# Job information
echo "========================================="
echo "SLURM Job ID: $SLURM_JOB_ID"
echo "Running on: $(hostname)"
echo "Starting at: $(date)"
echo "GPU: $CUDA_VISIBLE_DEVICES"
echo "========================================="
# Set up environment variables
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export WANDB_DIR=/cluster/scratch/$USER/wandb
export WANDB_CACHE_DIR=$WANDB_DIR/cache
export WANDB_CONFIG_DIR=$WANDB_DIR/config
mkdir -p $WANDB_DIR
# Activate conda environment
source /cluster/project/rsl/$USER/miniconda3/bin/activate
conda activate ml_training
# Copy dataset to local scratch for faster I/O
echo "Copying dataset to local scratch..."
cp -r /cluster/scratch/$USER/datasets/cifar10 $TMPDIR/
echo "Dataset copied successfully"
# Set paths
DATA_DIR=$TMPDIR/cifar10
OUTPUT_DIR=/cluster/project/rsl/$USER/results/$SLURM_JOB_ID
CHECKPOINT_DIR=/cluster/project/rsl/$USER/checkpoints/$SLURM_JOB_ID
mkdir -p $OUTPUT_DIR $CHECKPOINT_DIR
# Run training
cd /cluster/home/$USER/ml_project
python train.py \
--data-dir $DATA_DIR \
--output-dir $OUTPUT_DIR \
--checkpoint-dir $CHECKPOINT_DIR \
--model resnet18 \
--epochs 100 \
--batch-size 128 \
--lr 0.001 \
--num-workers 8 \
--log-interval 50 \
--save-interval 10
# Copy final results (create the target directory if it does not exist yet)
echo "Copying results..."
mkdir -p /cluster/project/rsl/$USER/final_results
cp -r $OUTPUT_DIR/* /cluster/project/rsl/$USER/final_results/
echo "Job completed at $(date)"
Submit the job with sbatch train_job.sh. You can then monitor it:
# Check job status
squeue -u $USER
# Watch GPU utilization
ssh euler
squeue -u $USER -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %N"
# Get node name, then:
ssh <node_name> nvidia-smi -l 1
# View training logs in real-time
tail -f logs/train_*.out
# Check tensorboard logs
cd /cluster/project/rsl/$USER/results/<job_id>/tensorboard
tensorboard --logdir . --port 8888
# Then create SSH tunnel: ssh -L 8888:localhost:8888 euler
Performance tips:
- Use local scratch ($TMPDIR) for datasets to avoid slow I/O on shared filesystems.
- Watch nvidia-smi during training to confirm the GPU is actually utilized.
- For multi-GPU training, consider torch.distributed.
# 1. Prepare environment and data
ssh euler
cd /cluster/home/$USER/ml_project
sbatch prepare_data_job.sh
# 2. Test with a small run (activate the environment inside the session first)
srun --gpus=1 --mem=16G --time=0:30:00 --pty bash
conda activate ml_training
python train.py --epochs 2 --batch-size 32 ...
# 3. Submit full training
sbatch train_job.sh
# 4. Monitor progress
watch -n 60 squeue -u $USER
tail -f logs/train_*.out
# 5. Analyze results
python analyze_results.py --results-dir /cluster/project/rsl/$USER/results/<job_id>
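Once a job finishes, SLURM's standard accounting tool gives a quick efficiency check (adjust the format fields to taste):
# 6. Check resource usage after completion
sacct -j <job_id> --format=JobID,Elapsed,MaxRSS,State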
We provide complete test scripts to verify the full ML workflow:
# Copy the test scripts to your home directory
cp fake_train.py /cluster/home/$USER/
cp test_full_training_job.sh /cluster/home/$USER/
# Submit the test job
sbatch test_full_training_job.sh
For a complete training script example with all features (checkpointing, logging, etc.), see the Computing Guide or download our template from the Scripts Library.