Slurm单机部署
分布式系统的作业,需要修改slurm源程序。此博文记录在ubuntu环境下部署slurm的方法
安装munge用于鉴权
$ sudo apt install munge
$ sudo create-munge-key
The munge key /etc/munge/munge.key already exists
Do you want to overwrite it? (y/N) N
$ sudo ls -l /etc/munge/munge.key
-r-------- 1 munge munge 1024 11月 30 20:31 /etc/munge/munge.key
$ sudo service munge start
源码下载, 版本:21.08.4
参考INSTALL
$tar --bzip -x -f slurm-21.08.4.tar.bz2
$ cd slurm-21.08.4/
$ ./configure --with-hdf5=no # 防止make时缺少相关库报错
$ make
$ sudo make install
打开 file:///home/cstar/project/slurm-21.08.4/doc/html/configurator.easy.html,按照link填写 UserName 改为 slurm(需要在linux下新建一个slurm用户)
$ sudo mkdir -p /var/spool/slurm-llnl
$ sudo touch /var/log/slurm_jobacct.log
$ sudo chown root:root /var/spool/slurm-llnl /var/log/slurm_jobacct.log
将submit的内容复制到/usr/local/etc/slurm.conf
可以读取slurm.conf了,但是无法启动,通过查看cat /var/log/slurmctld.log
,发现缺少munge的相关库
按照这个issue修改
$ sudo apt install libmunge-dev libmunge2
~/project/slurm-21.08.4$ make uninstall
$ make distclean
$ ./configure --with-hdf5=no
$ make
$ sudo make install
~/project/slurm-21.08.4$ sudo cp etc/slurmd.service /etc/systemd/system
~/project/slurm-21.08.4$ sudo cp etc/slurmctld.service /etc/systemd/system
$ sudo systemctl daemon-reload
$ sudo systemctl start slurmctld
$ sudo systemctl start slurmd
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 idle cstar-Linux-Server
Slurm使用
$srun -l /path/to/bin/app
在前台运行一个job
使用sbatch提交脚本任务在后台执行,输出结果由--output
指定
myjob.sbatch
#!/bin/bash
#SBATCH --job-name=sugarjob
#SBATCH --output=./test.log
pwd; date
./app
用sbatch
提交脚本进行执行,在这个job执行过程中,中间结果有时候是不输出的,可以用squeue
来查看任务执行的状态。
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8 debug sugarjob cstar R 0:03 1 cstar-Linux-Server
最后可以在test.log中看程序的输出
使用sattach对接job step的标准io
第一个终端使用srun提交一个job
$ srun app
Hello world
Sleep for 1 seconds
Sleep for 2 seconds
Sleep for 3 seconds
Sleep for 4 seconds
....
第二个终端查看此job的信息
$ scontrol
scontrol: show step
StepId=15.0 UserId=1000 StartTime=2021-12-16T18:27:25 TimeLimit=UNLIMITED
State=RUNNING Partition=debug NodeList=cstar-Linux-Server
Nodes=1 CPUs=1 Tasks=1 Name=app Network=(null)
TRES=cpu=1,mem=512M,node=1
ResvPorts=(null)
CPUFreqReq=Default Dist=Block
SrunHost:Pid=cstar-Linux-Server:3291982
scontrol: exit
(base) cstar@cstar-Linux-Server:~$ sattach 15.0
Hello world
Sleep for 1 seconds
Sleep for 2 seconds
Sleep for 3 seconds
...