euler-cluster-guide

Data Management on Euler

Effective data management is critical when working on the Euler Cluster, particularly for machine learning workflows that involve large datasets and model outputs. This section explains the available storage options and their proper usage.


πŸ“ Home Directory (/cluster/home/$USER)


⚑ Scratch Directory (/cluster/scratch/$USER or $SCRATCH)


πŸ“¦ Project Directory (/cluster/project/rsl/$USER)


πŸ“‚ Work Directory (/cluster/work/rsl/$USER)

In exceptional cases, more storage space can be approved. To request this, ask your supervisor to contact patelm@ethz.ch.

πŸ“‚ Local Scratch Directory ($TMPDIR)

❗ Quota Violations:

🎯 FAQ: What is the difference between the Project and Work Directories and why is it necessary to make use of both?

Both Project and Work are persistent storage (data is not deleted automatically), but their use cases differ. Collections of many small files, such as conda environments, belong in the Project directory, which has a much higher inode limit. Larger files such as model checkpoints, Singularity containers, and results belong in the Work directory, which has the larger storage capacity.
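
As a concrete sketch of the Project-directory use case, conda can be configured to create its environments and package cache there instead of in your home directory. The `conda` subfolder and environment name below are placeholders; adjust them to your own setup.

```bash
# Sketch: keep conda environments (many small files, i.e. many inodes)
# in the Project directory rather than the home directory.
# Assumes /cluster/project/rsl/$USER exists; folder names are placeholders.
mkdir -p /cluster/project/rsl/$USER/conda/envs /cluster/project/rsl/$USER/conda/pkgs

conda config --add envs_dirs /cluster/project/rsl/$USER/conda/envs
conda config --add pkgs_dirs /cluster/project/rsl/$USER/conda/pkgs

# New environments are now created under the Project directory.
conda create -n my_env python=3.10
```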

🎯 FAQ: What is the Local Scratch Directory ($TMPDIR)?

Whenever you run a compute job, you can also request a certain amount of local scratch space ($TMPDIR), which allocates space on a hard drive local to the compute node. The main advantage of local scratch is that it is attached directly to the compute node rather than accessed over the network. It is therefore highly recommended to copy your Singularity container and datasets to $TMPDIR and run your training from there. Detailed training workflows are provided later in this guide.
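
As a rough sketch of this pattern, the job script below requests local scratch space, stages a container and dataset into $TMPDIR, and runs the training from there. The resource requests, file names, and the train.py command are placeholders, not the exact workflow described later in this guide.

```bash
#!/bin/bash
#SBATCH --job-name=train_example
#SBATCH --cpus-per-task=8
#SBATCH --gpus=1
#SBATCH --time=04:00:00
#SBATCH --tmp=100G               # requested local scratch space ($TMPDIR)

# Stage the container and dataset from network storage to local scratch.
# Paths and file names are placeholders; adjust them to your setup.
cp /cluster/work/rsl/$USER/containers/my_image.sif "$TMPDIR/"
cp /cluster/scratch/$USER/datasets/my_dataset.tar "$TMPDIR/"
tar -xf "$TMPDIR/my_dataset.tar" -C "$TMPDIR"

# Run the training from local scratch for fast I/O.
singularity exec --nv "$TMPDIR/my_image.sif" \
    python train.py --data-dir "$TMPDIR/my_dataset"
```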


πŸ“Š Summary Table of Storage Locations

| Storage Location | Max Inodes | Max Size per User | Purged | Recommended Use Case |
|---|---|---|---|---|
| /cluster/home/$USER | ~450,000 | 45 GB | No | Code, config, small files |
| /cluster/scratch/$USER | 1,000,000 | 2.5 TB | Yes (files older than 15 days) | Datasets, training data, temporary usage |
| /cluster/project/rsl/$USER | 2,500,000 | 75 GB | No | Conda envs, software packages |
| /cluster/work/rsl/$USER | 50,000 | 200 GB | No | Large result files, model checkpoints, Singularity containers |
| $TMPDIR | Very high | Up to 800 GB | Yes (at end of job) | Training datasets, Singularity images |

πŸ§ͺ Test Scripts

To verify your storage setup and check quotas:

Submit the script as a job to also test $TMPDIR:

sbatch test_storage_quotas.sh
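
The contents of test_storage_quotas.sh are not reproduced here; a minimal sketch of such a script could look like the following, using lquota for the network file systems and a small write test for $TMPDIR (the requested values are examples only).

```bash
#!/bin/bash
#SBATCH --job-name=test_storage_quotas
#SBATCH --time=00:10:00
#SBATCH --tmp=1G                 # small local scratch allocation for the test

# Show quotas and current usage for the network file systems.
lquota

# Verify that local scratch was allocated and is writable.
echo "TMPDIR is: $TMPDIR"
df -h "$TMPDIR"
touch "$TMPDIR/write_test" && echo "Local scratch is writable."
```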

πŸ’‘ Best Practices

  1. Use the right storage for the right purpose - Don’t waste home directory space on large files
  2. Compress datasets - Use tar/zip to reduce inode usage (see the example after this list)
  3. Clean up regularly - Remove old data from scratch before it’s auto-deleted
  4. Monitor your usage - Check quotas regularly with lquota
  5. Use $TMPDIR for active jobs - Copy data to local scratch for faster I/O during computation
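
For example, a dataset made up of many small files can be packed into a single archive before training; this uses one inode instead of thousands and is much faster to copy into $TMPDIR. The directory names below are placeholders.

```bash
# Count how many files (and therefore inodes) the dataset currently uses.
find /cluster/scratch/$USER/my_dataset -type f | wc -l

# Pack the dataset into a single compressed archive (one inode instead of many).
tar -czf /cluster/scratch/$USER/my_dataset.tar.gz -C /cluster/scratch/$USER my_dataset

# Inside a job, extract it into local scratch (see the $TMPDIR workflow above):
#   tar -xzf /cluster/scratch/$USER/my_dataset.tar.gz -C "$TMPDIR"
```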