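Configuring Condor
Notes from installing HTCondor to make good use of the several multi-core CentOS 5.8 machines we have as compute resources.
I set this up without fully understanding it, so some of the options may be inappropriate. It works for now, so I will run with it for the time being and revise as needed.
About the machines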
hostname of central manager | tesla |
hostnames of execute nodes | fermi, kepler |
domain name | ult-local |
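For the purposes of this explanation there are two execute nodes.
Before configuring anything, check how iptables and similar services are set up. (That turned out to be the cause of an annoying state in which some execute nodes were visible in the pool but jobs would not run on them. Then again, servers for this kind of numerical computation are probably best kept inside an isolated local network, with overhead such as iptables removed as far as possible.)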
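As a rough sketch of that check, the following lists the active iptables rules on one machine. The helper name show_firewall is made up for this note, and the script assumes it is run as root on a CentOS-style system with iptables at /sbin/iptables; elsewhere it only prints a message. HTCondor's collector listens on port 9618 by default, so rules that reject traffic between the pool machines on that port are the first thing to look for.

```shell
#!/bin/sh
# show_firewall: hypothetical helper, not part of HTCondor.
# Prints the current iptables rules, or a short message when they cannot be read.
show_firewall() {
    if [ -x /sbin/iptables ]; then
        # -L lists the rules, -n skips DNS lookups, --line-numbers numbers each rule
        /sbin/iptables -L -n --line-numbers 2>/dev/null \
            || echo "could not list iptables rules (not root?)"
    else
        echo "iptables not found on this machine"
    fi
}

show_firewall
```

Running it on tesla, fermi and kepler and comparing the output should show whether only some machines carry REJECT or DROP rules, which would be consistent with the symptom described above.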
Installing the condor rpm package
Download condor-8.2.2-265643.rhel5.10.x86_64.rpm from http://research.cs.wisc.edu/htcondor/ and install it on every server that will take part:
yum localinstall condor-8.2.2-265643.rhel5.10.x86_64.rpm --nogpgcheck
After installing the rpm package, edit /etc/condor/{condor_config, condor_config.local} on the central manager machine and on the execute node machines as shown below. Then run /etc/init.d/condor start on each machine.
Central manager machine (machine name: tesla)
/etc/condor/condor_config:
RELEASE_DIR = /usr
LOCAL_DIR = /var
LOCAL_CONFIG_FILE = /etc/condor/condor_config.local
LOCAL_CONFIG_DIR = /etc/condor/config.d
use SECURITY : HOST_BASED
ALLOW_WRITE = *
FLOCK_FROM = fermi, kepler
FLOCK_TO = tesla
RUN = $(LOCAL_DIR)/run/condor
LOG = $(LOCAL_DIR)/log/condor
LOCK = $(LOCAL_DIR)/lock/condor
SPOOL = $(LOCAL_DIR)/lib/condor/spool
EXECUTE = $(LOCAL_DIR)/lib/condor/execute
BIN = $(RELEASE_DIR)/bin
LIB = $(RELEASE_DIR)/lib64/condor
INCLUDE = $(RELEASE_DIR)/include/condor
SBIN = $(RELEASE_DIR)/sbin
LIBEXEC = $(RELEASE_DIR)/libexec/condor
SHARE = $(RELEASE_DIR)/share/condor
PROCD_ADDRESS = $(RUN)/procd_pipe
/etc/condor/condor_config.local:
COLLECTOR_NAME = Personal Condor at $(FULL_HOSTNAME)
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
FILESYSTEM_DOMAIN = ult-local
UID_DOMAIN = ult-local
ALLOW_WRITE = *.ult-local
ALLOW_READ = *.ult-local
ALLOW_NEGOTIATOR = *.ult-local
ALLOW_NEGOTIATOR_SCHEDD = *.ult-local
Execute node machines (machine names: fermi, kepler)
/etc/condor/condor_config:
RELEASE_DIR = /usr
LOCAL_DIR = /var
LOCAL_CONFIG_FILE = /etc/condor/condor_config.local
LOCAL_CONFIG_DIR = /etc/condor/config.d
use SECURITY : HOST_BASED
FLOCK_TO = tesla
RUN = $(LOCAL_DIR)/run/condor
LOG = $(LOCAL_DIR)/log/condor
LOCK = $(LOCAL_DIR)/lock/condor
SPOOL = $(LOCAL_DIR)/lib/condor/spool
EXECUTE = $(LOCAL_DIR)/lib/condor/execute
BIN = $(RELEASE_DIR)/bin
LIB = $(RELEASE_DIR)/lib64/condor
INCLUDE = $(RELEASE_DIR)/include/condor
SBIN = $(RELEASE_DIR)/sbin
LIBEXEC = $(RELEASE_DIR)/libexec/condor
SHARE = $(RELEASE_DIR)/share/condor
PROCD_ADDRESS = $(RUN)/procd_pipe
/etc/condor/condor_config.local:
CONDOR_HOST = tesla.ult-local
COLLECTOR_HOST = tesla.ult-local
COLLECTOR_NAME = Pool at $(FULL_HOSTNAME)
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE
DAEMON_LIST = MASTER, SCHEDD, STARTD
HOSTALLOW_ADMINISTRATOR = *
HOSTALLOW_OWNER = *
HOSTALLOW_READ = *
HOSTALLOW_WRITE = *
HOSTALLOW_NEGOTIATOR_SCHEDD = *
FILESYSTEM_DOMAIN = ult-local
UID_DOMAIN = ult-local
TRUST_UID_DOMAIN = TRUE
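Once the daemons have been started on all three machines, a quick way to see whether the execute nodes actually joined the pool is to query the collector. The sketch below wraps two standard HTCondor tools, condor_config_val and condor_status. The function name pool_check is only an illustration, and the script assumes the HTCondor binaries are on PATH.

```shell
#!/bin/sh
# pool_check: hypothetical helper for this note, not part of HTCondor.
pool_check() {
    if ! command -v condor_status >/dev/null 2>&1; then
        echo "condor_status not found; is HTCondor installed?"
        return 1
    fi
    # Which central manager does this machine report to?
    # On fermi and kepler this should print tesla.ult-local.
    condor_config_val COLLECTOR_HOST
    # Which daemons will condor_master start on this machine?
    condor_config_val DAEMON_LIST
    # List the slots known to the collector; with the settings above,
    # slots from tesla, fermi and kepler should all appear.
    condor_status
}

pool_check || echo "pool check failed"
```

If a node is missing from the condor_status output, or is listed but never picks up jobs, the firewall settings mentioned earlier are worth checking again.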
Number of CPUs on an execute node
To have a node contribute only n of its CPUs to the pool, add the following line to its condor_config.local:
NUM_CPUS=n
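For example, to offer 4 of a node's cores the line would read NUM_CPUS = 4. The sketch below appends such a line to a configuration file. set_num_cpus is a made-up helper name, and the demonstration writes to a temporary file rather than to /etc/condor/condor_config.local so that it can be tried safely. As far as I know, NUM_CPUS is only read when the startd starts, so a restart (for instance /etc/init.d/condor restart) rather than a plain reconfig is assumed to be needed afterwards.

```shell
#!/bin/sh
# set_num_cpus FILE N: hypothetical helper that appends "NUM_CPUS = N" to FILE.
set_num_cpus() {
    conf=$1
    n=$2
    # Accept only positive integers.
    case $n in
        ''|*[!0-9]*) echo "usage: set_num_cpus FILE N" >&2; return 1 ;;
    esac
    [ "$n" -gt 0 ] || { echo "N must be at least 1" >&2; return 1; }
    printf 'NUM_CPUS = %s\n' "$n" >> "$conf"
}

# Demonstration on a scratch file instead of the real condor_config.local.
demo=$(mktemp /tmp/num_cpus_demo.XXXXXX)
set_num_cpus "$demo" 4
cat "$demo"
rm -f "$demo"
```

After editing the real file, condor_config_val NUM_CPUS on that node shows the value the configuration files now yield, and condor_status shows how many slots the node advertises once the daemons have been restarted.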