Single-node Slurm deployment

An assignment for my distributed systems course requires modifying the Slurm source code. This post records how to deploy Slurm on Ubuntu.

Install munge for authentication

$ sudo apt install munge
$ sudo create-munge-key 
The munge key /etc/munge/munge.key already exists
Do you want to overwrite it? (y/N) N
$ sudo ls -l /etc/munge/munge.key
-r-------- 1 munge munge 1024 11月 30 20:31 /etc/munge/munge.key
$ sudo service munge start
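
To confirm that munged is running and the key is usable, one option is to encode and decode a credential locally. The sketch below assumes the munge and unmunge client tools were installed by the munge package above; the helper name check_munge is hypothetical.

```shell
# check_munge: round-trip a credential through the local munged daemon.
# Returns 0 on success, 1 if the round trip fails, 2 if the tools are missing.
check_munge() {
    if ! command -v munge >/dev/null 2>&1 || ! command -v unmunge >/dev/null 2>&1; then
        echo "munge/unmunge not found in PATH" >&2
        return 2
    fi
    # munge -n encodes an empty payload; unmunge decodes and validates it
    if munge -n | unmunge >/dev/null; then
        echo "munge OK"
    else
        echo "munge round trip failed (is munged running?)" >&2
        return 1
    fi
}
```

Calling check_munge right after starting the service should print "munge OK" if the daemon and key are set up correctly.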

Download the source, version 21.08.4

Follow the INSTALL file in the source tree

$ tar --bzip2 -x -f slurm-21.08.4.tar.bz2
$ cd slurm-21.08.4/
$ ./configure --with-hdf5=no  # avoids make errors caused by missing HDF5 libraries
$ make
$ sudo make install

Open file:///home/cstar/project/slurm-21.08.4/doc/html/configurator.easy.html in a browser and fill in the form following the link. Change the user name to slurm (a slurm user has to be created on the Linux system first).

$ sudo mkdir -p /var/spool/slurm-llnl
$ sudo touch /var/log/slurm_jobacct.log
$ sudo chown root:root /var/spool/slurm-llnl /var/log/slurm_jobacct.log

Click Submit and copy the generated content to /usr/local/etc/slurm.conf
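
For orientation, here is a rough sketch of what a minimal single-node configuration might contain. It is not the configurator's actual output, which has more settings. Every value is an assumption for illustration: the cluster name, the slurmd log path, CPUs=1, and the sample file name slurm.conf.sample. The state directory must be writable by the user named in SlurmUser.

```shell
# Write a minimal single-node slurm.conf sketch into the current directory.
# All values are illustrative assumptions; compare with the configurator output.
host=$(uname -n)   # node name as reported by the kernel (same as `hostname`)
cat > slurm.conf.sample <<EOF
ClusterName=cluster
SlurmctldHost=${host}
SlurmUser=slurm
AuthType=auth/munge
StateSaveLocation=/var/spool/slurm-llnl
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log
NodeName=${host} CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=${host} Default=YES MaxTime=INFINITE State=UP
EOF
```

Writing to a local sample file rather than /usr/local/etc/slurm.conf keeps the sketch from overwriting a real configuration.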

slurm.conf can now be read, but the daemon fails to start. Looking at /var/log/slurmctld.log shows that the munge libraries are missing.

Fix it as described in this issue:

$ sudo apt install libmunge-dev libmunge2
~/project/slurm-21.08.4$ sudo make uninstall
$ make distclean
$ ./configure --with-hdf5=no
$ make
$ sudo make install
~/project/slurm-21.08.4$ sudo cp etc/slurmd.service /etc/systemd/system
~/project/slurm-21.08.4$ sudo cp etc/slurmctld.service /etc/systemd/system
$ sudo systemctl daemon-reload 
$ sudo systemctl start slurmctld
$ sudo systemctl start slurmd
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle cstar-Linux-Server
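
Since the first failure came from building without the munge development libraries, it can help to check that the rebuild actually produced the auth/munge plugin. The sketch below assumes the default install prefix, so plugins land in /usr/local/lib/slurm; the helper name has_munge_plugin is hypothetical.

```shell
# has_munge_plugin [DIR]: succeed if auth_munge.so exists in the plugin directory.
# /usr/local/lib/slurm is assumed (default prefix); pass another DIR if --prefix was used.
has_munge_plugin() {
    plugin_dir=${1:-/usr/local/lib/slurm}
    [ -f "$plugin_dir/auth_munge.so" ]
}

if has_munge_plugin; then
    echo "auth_munge.so found"
else
    echo "auth_munge.so missing: install libmunge-dev and rebuild"
fi
```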

Using Slurm

Run a job in the foreground with srun (-l prefixes each output line with its task number):

$ srun -l /path/to/bin/app

Use sbatch to submit a script that runs in the background; the output file is set with --output

myjob.sbatch

#!/bin/bash
#SBATCH --job-name=sugarjob
#SBATCH --output=./test.log

pwd; date
./app

Submit the script with sbatch. While the job is running, intermediate results sometimes do not appear in the output, so use squeue to check the job's state.

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 8     debug sugarjob    cstar  R       0:03      1 cstar-Linux-Server

Once the job finishes, the program's output can be read in test.log
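
The submit-then-poll flow above can be scripted. This is a minimal sketch, assuming sbatch and squeue are on PATH; the helper names job_id_from_parsable and submit_and_wait are hypothetical. It relies on sbatch --parsable, which prints just the job ID, optionally followed by ";" and the cluster name.

```shell
# job_id_from_parsable OUTPUT: strip an optional ";cluster" suffix
# from `sbatch --parsable` output, leaving only the job ID.
job_id_from_parsable() {
    printf '%s\n' "${1%%;*}"
}

# submit_and_wait SCRIPT: submit SCRIPT, then poll squeue until the job is gone.
submit_and_wait() {
    out=$(sbatch --parsable "$1") || return 1
    jobid=$(job_id_from_parsable "$out")
    echo "submitted job $jobid"
    # squeue -h drops the header; empty output means the job is no longer pending or running
    while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
        sleep 1
    done
    echo "job $jobid finished"
}
```

For example, `submit_and_wait myjob.sbatch && cat test.log` would print the log once the job has left the queue.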

Use sattach to attach to the standard I/O of a job step

In the first terminal, submit a job with srun

$ srun app
Hello world
Sleep for 1 seconds
Sleep for 2 seconds
Sleep for 3 seconds
Sleep for 4 seconds
....

In the second terminal, look up this job's step information

$ scontrol
scontrol: show step
StepId=15.0 UserId=1000 StartTime=2021-12-16T18:27:25 TimeLimit=UNLIMITED
   State=RUNNING Partition=debug NodeList=cstar-Linux-Server
   Nodes=1 CPUs=1 Tasks=1 Name=app Network=(null)
   TRES=cpu=1,mem=512M,node=1
   ResvPorts=(null)
   CPUFreqReq=Default Dist=Block
   SrunHost:Pid=cstar-Linux-Server:3291982

scontrol: exit
$ sattach 15.0
Hello world
Sleep for 1 seconds
Sleep for 2 seconds
Sleep for 3 seconds
...
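
Instead of reading the step ID off the interactive scontrol session by eye, it can be pulled out of the `scontrol show step` output. This is a small sketch; the helper name step_ids is hypothetical, and it assumes each step record starts with a StepId= field as in the output shown above.

```shell
# step_ids: read `scontrol show step` output on stdin and print each StepId value.
step_ids() {
    sed -n 's/^StepId=\([^ ]*\).*/\1/p'
}

# List the IDs of all current steps when scontrol is available.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show step | step_ids
fi
```

One of the printed IDs can then be passed to sattach, for example `sattach 15.0`.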