Condor and EnergyPlus on a small cluster

I thought a good way to open the blog would be a post on some work I’ve done as part of my research at Loughborough. Hopefully this will be of some use to others running building energy simulations (and likely other CPU intensive jobs); I at least will be using it as a record of what I did in case I need it again. I expect some of my terminology won’t be quite right as I’m not an expert in high throughput computing, but if you can excuse that it might still be useful as a practical guide.

We’ve been running experiments using a building performance simulation package called EnergyPlus. It takes a text file as an input (this contains the building design and some other parameters), runs for a while, then writes out a text file as an output. We’ve mostly been using E+ as the fitness function in a genetic algorithm, with the idea that we evolve building designs that minimise things like operational energy use or construction cost. We have 7 desktop PCs which were purchased for my project or acquired from other ones. Originally we ran experiments on each machine separately, and as a run of the GA would take a few thousand simulations of a couple of minutes each, a run would take a day or two. That’s fine, but it was a little inflexible. Genetic algorithms work with “populations”, which means that a batch of simulations needs to run all at once. Each of the machines had 4-8 CPU cores (with one of them having 24), meaning that simulations could be run 4-8 at a time per machine. Typically we have batches of 30, and with a new problem we’ll be looking at there could be thousands to run in one go, so really it would be better if we could distribute the simulation runs across all the machines. That way we could do up to about 60 runs at once.

Years ago I’d have written a whole framework to do this myself, just for the fun of it. However, these days I prefer the much more sensible approach of seeing if there’s an existing application that does the job. Some researchers I’d worked with in the past had used HTCondor to create a cluster out of a set of computing student lab PCs (working something like the SETI@Home project – jobs run in the background when the machines aren’t busy). Having had a go at setting it up for this project, my conclusion is that it’s pretty good. I’m sure there are plenty of alternatives, but it does the job, so here’s an overview of setting it up and getting EnergyPlus to run on it.

We’ll start with installing Condor on a couple of machines, then look at how to run a job on the pool, and finish with some hints on admin for regular use.

Installation

First task: get a copy of HTCondor from their website. Installers are available for Windows and the usual Linux flavours. Condor works by having a central manager, plus a set of nodes that associate themselves with the manager to form a pool. All the machines in the pool have a number of slots on them, one for each CPU core (or virtual core with hyperthreading). The manager’s job is to receive job submissions, match jobs with suitable available slots in the pool, and keep track of jobs until they’re done and the results have been copied back to the submitting machine. Any machine in the pool can be set up to submit jobs to the manager. Before you start, work out which machine should be the manager (I went for “TSB1”) – ideally one with higher than average CPU power for the pool.

Setting up the machines

On each, run the installer. On windows it’s just your usual install wizard. The questions it’ll ask are (with the answers I used):

  • Do I agree with the Apache software licence? Well, of course I do.
  • Install type:
    • On the manager, choose “Create a new Condor Pool” and enter a value for “Name of new pool”: I chose “TSB”
    • On the client machines, choose “Join an existing Condor Pool”, and enter the IP address of your manager machine
  • Submit jobs to condor pool? This means that the machine is allowed to submit jobs. I have control over all the machines in my pool so I allowed this on all of them. If you’ve got a lot of slave machines that you don’t want people submitting jobs from (e.g. student lab computers) then leave this unchecked. On the manager, I’d expect this would usually be checked to allow submission.
  • When should condor run jobs? Options are:
    • Do not run jobs on this machine (good idea if you have a big pool – you don’t want the manager getting bogged down by jobs running on it; for a smaller pool you might as well enable job running on the manager too, to get some more CPUs. In my pool of 64 CPUs, the manager copes just fine while running jobs too)
    • Always run jobs and never suspend (good idea if the machines in your pool are dedicated just to you)
    • when keyboard has been idle for 15 minutes / when keyboard and CPU are idle – these options are more suited to machines where you want the jobs to take up otherwise idle time because they are usually used for other things too (such as lab PCs)
  • accounting domain – leave blank
  • hostname of smtp server and email address of administrator – you can leave these blank; they’re used to allow condor to send debug emails to you
  • path to Java virtual machine – only needed if you’re likely to be running java jobs
  • hosts with read access: can be host names with *s as wildcards, or a set of IP addresses (or a subnet using slash notation). I left this as * – meaning that anyone can find out what jobs are running on this machine.
  • hosts with write access: I set this to the range of IP addresses on my network – that is, only machines on my network can submit jobs and do other things related to admin (all the machines in the pool need both read and write access, so make sure anything in the read and write access boxes doesn’t exclude any of your machines)
  • leave any other settings at their default values (the answers mostly just end up as settings in Condor’s configuration files; there’s a rough sketch after this list)
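
For reference, the installer is really just writing macros into Condor’s plain-text configuration files (condor_config and condor_config.local in the install directory), so you can change your answers later by editing those and restarting the Condor service. The sketch below is roughly what a worker machine’s local configuration might contain given the choices above; it’s illustrative rather than something to copy verbatim, the values are example ones, and older Condor releases use HOSTALLOW_READ / HOSTALLOW_WRITE in place of ALLOW_READ / ALLOW_WRITE.

# condor_config.local on a worker that can both run and submit jobs (example values)
CONDOR_HOST = 192.168.1.10            # the manager machine (TSB1 in my case)
COLLECTOR_NAME = TSB                  # the pool name chosen on the manager
DAEMON_LIST = MASTER, SCHEDD, STARTD  # SCHEDD lets this machine submit jobs, STARTD lets it run them
START = TRUE                          # "always run jobs and never suspend"
ALLOW_READ = *                        # anyone can query what this machine is doing
ALLOW_WRITE = 192.168.1.*             # only machines on my network can submit jobs / administer
CONDOR_ADMIN = admin@example.ac.uk    # where Condor sends its debug emails (optional)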

Now repeat this for all the other client machines in the pool. Once you’re done, on one of the machines in the pool, open a console window and type condor_status to get a list of all the machines in the pool and what they’re doing. You can see my pool of 64 cores below…


C:\sb\condor_demo>condor_status

Name OpSys Arch State Activity LoadAv Mem ActvtyTime

slot1@7-cvaeib-375 WINDOWS INTEL Owner Idle 0.370 732 0+03:35:04
slot2@7-cvaeib-375 WINDOWS INTEL Owner Idle 0.000 732 0+03:40:10
slot3@7-cvaeib-375 WINDOWS INTEL Owner Idle 0.000 732 0+03:35:06
slot4@7-cvaeib-375 WINDOWS INTEL Owner Idle 0.000 732 0+03:35:07
slot1@7-cvaeib-1.l WINDOWS X86_64 Unclaimed Idle 0.000 1022 0+00:03:11
slot2@7-cvaeib-1.l WINDOWS X86_64 Unclaimed Idle 0.000 1022 0+01:00:05
slot3@7-cvaeib-1.l WINDOWS X86_64 Unclaimed Idle 0.000 1022 0+01:00:06
slot4@7-cvaeib-1.l WINDOWS X86_64 Unclaimed Idle 0.060 1022 0+01:00:07
slot5@7-cvaeib-1.l WINDOWS X86_64 Unclaimed Idle 0.000 1022 0+00:59:46
slot6@7-cvaeib-1.l WINDOWS X86_64 Unclaimed Idle 0.000 1022 0+01:00:09
slot7@7-cvaeib-1.l WINDOWS X86_64 Unclaimed Idle 0.000 1022 0+01:00:10
slot8@7-cvaeib-1.l WINDOWS X86_64 Unclaimed Idle 0.000 1022 0+00:55:03
slot1@7-cvaeib-2.l WINDOWS X86_64 Unclaimed Idle 0.000 1022 0+01:05:04
slot2@7-cvaeib-2.l WINDOWS X86_64 Unclaimed Idle 0.200 1022 0+01:05:05
slot3@7-cvaeib-2.l WINDOWS X86_64 Unclaimed Idle 0.000 1022 0+01:05:06
slot4@7-cvaeib-2.l WINDOWS X86_64 Unclaimed Idle 0.000 1022 0+01:04:39
slot5@7-cvaeib-2.l WINDOWS X86_64 Unclaimed Idle 0.000 1022 0+01:00:08
slot6@7-cvaeib-2.l WINDOWS X86_64 Unclaimed Idle 0.000 1022 0+01:00:09
slot7@7-cvaeib-2.l WINDOWS X86_64 Unclaimed Idle 0.000 1022 0+01:00:10
slot8@7-cvaeib-2.l WINDOWS X86_64 Unclaimed Idle 0.000 1022 0+01:05:03
slot1@7-cvaeib-3.l WINDOWS X86_64 Claimed Busy 0.000 1022 0+00:00:04
slot2@7-cvaeib-3.l WINDOWS X86_64 Claimed Busy 0.300 1022 0+00:00:05
slot3@7-cvaeib-3.l WINDOWS X86_64 Claimed Busy 0.020 1022 0+00:00:06
slot4@7-cvaeib-3.l WINDOWS X86_64 Claimed Busy 0.010 1022 0+00:00:07
slot5@7-cvaeib-3.l WINDOWS X86_64 Claimed Busy 0.000 1022 0+00:00:08
slot6@7-cvaeib-3.l WINDOWS X86_64 Claimed Busy 0.000 1022 0+00:00:08
slot7@7-cvaeib-3.l WINDOWS X86_64 Claimed Busy 1.070 1022 0+00:00:09
slot8@7-cvaeib-3.l WINDOWS X86_64 Claimed Busy 0.000 1022 0+00:00:02
slot1@7-cvjaw-3593 WINDOWS X86_64 Claimed Idle 0.000 1022 0+00:00:04
slot2@7-cvjaw-3593 WINDOWS X86_64 Claimed Idle 0.000 1022 0+00:00:05
slot3@7-cvjaw-3593 WINDOWS X86_64 Claimed Idle 0.000 1022 0+00:00:06
slot4@7-cvjaw-3593 WINDOWS X86_64 Claimed Busy 0.000 1022 0+00:00:01
slot5@7-cvjaw-3593 WINDOWS X86_64 Unclaimed Idle 0.000 1022 0+01:05:09
slot6@7-cvjaw-3593 WINDOWS X86_64 Unclaimed Idle 0.000 1022 0+01:05:10
slot7@7-cvjaw-3593 WINDOWS X86_64 Unclaimed Idle 0.000 1022 0+01:05:11
slot8@7-cvjaw-3593 WINDOWS X86_64 Claimed Idle 0.000 1022 0+00:00:03
slot10@7-cvjaw-416 WINDOWS X86_64 Claimed Busy 0.010 1023 0+00:00:05
slot11@7-cvjaw-416 WINDOWS X86_64 Claimed Busy 0.000 1023 0+00:00:05
slot12@7-cvjaw-416 WINDOWS X86_64 Claimed Busy 0.020 1023 0+00:00:07
slot13@7-cvjaw-416 WINDOWS X86_64 Claimed Busy 0.020 1023 0+00:00:07
slot14@7-cvjaw-416 WINDOWS X86_64 Claimed Busy 0.020 1023 0+00:00:08
slot15@7-cvjaw-416 WINDOWS X86_64 Unclaimed Idle 0.000 1023 0+01:30:10
slot17@7-cvjaw-416 WINDOWS X86_64 Claimed Busy 0.000 1023 0+00:00:03
slot18@7-cvjaw-416 WINDOWS X86_64 Claimed Busy 0.000 1023 0+00:00:04
slot19@7-cvjaw-416 WINDOWS X86_64 Claimed Busy 0.000 1023 0+00:00:05
slot1@7-cvjaw-4160 WINDOWS X86_64 Claimed Busy 0.010 1023 0+00:00:03
slot20@7-cvjaw-416 WINDOWS X86_64 Claimed Busy 0.000 1023 0+00:00:06
slot21@7-cvjaw-416 WINDOWS X86_64 Claimed Busy 1.020 1023 0+00:00:08
slot22@7-cvjaw-416 WINDOWS X86_64 Unclaimed Idle 0.000 1023 0+01:25:09
slot23@7-cvjaw-416 WINDOWS X86_64 Unclaimed Idle 0.000 1023 0+01:25:10
slot24@7-cvjaw-416 WINDOWS X86_64 Unclaimed Idle 0.000 1023 0+01:25:03
slot2@7-cvjaw-4160 WINDOWS X86_64 Unclaimed Idle 0.160 1023 0+01:24:43
slot3@7-cvjaw-4160 WINDOWS X86_64 Unclaimed Idle 0.000 1023 0+01:25:06
slot4@7-cvjaw-4160 WINDOWS X86_64 Unclaimed Idle 0.000 1023 0+01:25:07
slot5@7-cvjaw-4160 WINDOWS X86_64 Unclaimed Idle 0.000 1023 0+01:25:08
slot6@7-cvjaw-4160 WINDOWS X86_64 Unclaimed Idle 0.000 1023 0+01:25:09
slot7@7-cvjaw-4160 WINDOWS X86_64 Unclaimed Idle 0.000 1023 0+01:25:10
slot8@7-cvjaw-4160 WINDOWS X86_64 Unclaimed Idle 0.000 1023 0+01:25:03
slot9@7-cvjaw-4160 WINDOWS X86_64 Unclaimed Idle 0.000 1023 0+01:25:04
slot1@W7-CVTY-DELL WINDOWS X86_64 Unclaimed Idle 0.360 1023 0+01:02:22
slot2@W7-CVTY-DELL WINDOWS X86_64 Unclaimed Idle 0.000 1023 0+01:02:18
slot3@W7-CVTY-DELL WINDOWS X86_64 Unclaimed Idle 0.000 1023 0+00:11:32
slot4@W7-CVTY-DELL WINDOWS X86_64 Unclaimed Idle 0.000 1023 0+01:02:25
                Total Owner Claimed Unclaimed Matched Preempting Backfill

  INTEL/WINDOWS     4     4       0         0       0          0        0
 X86_64/WINDOWS    60     0      24        34       0          0        0

          Total    64     4      24        34       0          0        0

“Claimed” means a slot has been taken, and “Busy” means a job is actually running on it – so I’ve obviously got a few jobs running at the moment. The 4 Intel CPU slots are my laptop, which is marked as “Owner” because I’m using it, and which is configured to only run jobs when I’ve left the keyboard alone for a while.

You’ll also need to set up EnergyPlus with the appropriate weather files and other data on each machine in the pool. This I accept is a pain; I’ve been experimenting with running E+ from a shared network folder. Having got tantalisingly close I’ve come up against a brick wall to do with permissions which hopefully I’ll solve and follow up on later.

Finally, you need to store your password on the machine you’ll be submitting jobs from, so that Condor can run jobs as if it were you. This is quite safe: unless you have a much fancier setup, the password is stored in an encrypted form on your local machine only. To store your password, open a command prompt and type:


condor_store_cred add

Then enter your password when asked for it.
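
If you want to double-check that the credential was stored, the same tool has a query action:

condor_store_cred query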

Running on the pool

Condor jobs are pretty simple. You make a text file with some basic settings for the job, then run condor_submit to submit it. Condor will take care of transferring files for the job, monitoring how it’s doing, and copying back any output. It gets a bit more interesting with E+ because we need to run RunEPlus.bat, and I found this to be quite problematic when trying to run it via Condor: it makes quite a lot of assumptions about where things are that seem to be quite hard to override. Also, sometimes we want to run things like the parametric preprocessor before it. So my solution was to make a batch file to run EnergyPlus, then make that the job for Condor to execute. The machine we’re working on is referred to as the “local machine”, as opposed to the remote machines in the Condor pool.

Batch file, run.bat:


c:\EnergyPlusV6-0-0\PreProcess\parametricpreprocessor.exe FiveZoneMixedMode.idf GBR_Birmingham-Elmdon.035340_CIBSE-DSY
c:\EnergyPlusV6-0-0\RunEPlus.bat FiveZoneMixedMode-A GBR_Birmingham-Elmdon.035340_CIBSE-DSY

Condor job file, condorjob.cmd:


universe = vanilla
executable = run.bat
output = run.out
error = run.err
log = run.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = FiveZoneMixedMode.idf
Requirements = (Arch == "INTEL" || Arch == "X86_64") && (Opsys == "WINDOWS") && (Disk >= 200)
queue

Some explanation:

  • universe is always vanilla unless you want to get into rebuilding the source code of EnergyPlus
  • executable is the program on the local machine to be run. In our case, it’s a windows batch file.
  • output is a text file on the local machine which any output (specifically STDOUT) will be written to; essentially what would have appeared in the console window had we run cmd.exe and called run.bat from there
  • error is a text file on the local machine which any errors (specifically STDERR) will be written to
  • should_transfer_files is needed to get Condor to move files along with the executable, and to copy output files back when the job’s done.
  • when_to_transfer_output tells Condor when to copy output files from the job back to the local machine. ON_EXIT waits until EnergyPlus has finished running on the remote machine before copying anything back.
  • transfer_input_files is a comma separated list of files to transfer along with the job. In our case, the only one needed is the EnergyPlus IDF
  • Requirements – by default, Condor tries to allocate jobs to slots that match the machine the job was submitted from. So if, as I was, your local machine is running 32 bit Windows, but the machines in the pool are mostly 64 bit, Condor won’t be able to find available slots to run your job on. The Requirements field overrides this, and lets you set other conditions for the slots your job will be allocated to. Here I’ve specified that the job can run on either 32 bit or 64 bit CPUs (INTEL or X86_64), the machine needs to be running Windows rather than another OS, and it needs to have some free disk space so there’s room for the E+ output files (one caveat: Condor’s Disk attribute is measured in KB, so Disk >= 200 only asks for 200KB; use a larger number if you want to guarantee more space). If you’re not sure what else you can test against, the sketch after this list shows how to find out what the slots advertise.
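
Condor will tell you what attributes the slots actually advertise, which helps when writing Requirements expressions. These are standard condor_status options; the constraint shown is just an illustration (Memory is reported in MB):

condor_status -long > slot_attributes.txt
condor_status -constraint "Memory > 1000"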

To submit a job to the pool, open a command window in the directory holding your IDF, run.bat and condorjob.cmd, and run condor_submit like this:


C:\sb\condor_demo>condor_submit condorjob.cmd
Submitting job(s).
1 job(s) submitted to cluster 29154.

You’ll see that condor has responded confirming that the job has been submitted, and giving the number allocated to it. That number’s useful as it’s used to administer the job. You can use the program condor_wait to watch the log file and tell you when it’s done (what happens is that condor_wait blocks until the job is finished, then it simply exits saying “All jobs done” – though you can’t see the time delay below).


C:\sb\condor_demo>condor_wait run.log
All jobs done.

Now all you need to do is make a bit of code to generate the job files automatically for the batch of IDFs you have, and to call condor_submit on each one. I’ve done mine in Java, and I’ll maybe share how that works at some point too. The way that worked was that I created about 50 threads: each thread would take an IDF, generate a job file and submit it to the pool, then call condor_wait to work out when the job was done; the thread would then move on to a new IDF. A rough sketch of the idea is below.
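
This sketch isn’t the exact code from the project: the IDF names, the per-IDF batch files (assumed to have been generated alongside each IDF) and the thread count are placeholders. But it shows the shape of it: a fixed pool of threads, each writing a submit file, handing it to condor_submit, and blocking on condor_wait until its job finishes.

import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

public class CondorBatchRunner {

    // Build a minimal submit description for one IDF (mirrors condorjob.cmd above).
    static String jobFileFor(String name) {
        return String.join("\n",
            "universe = vanilla",
            "executable = run_" + name + ".bat",   // per-IDF batch file, generated elsewhere
            "output = " + name + ".out",
            "error = " + name + ".err",
            "log = " + name + ".log",
            "should_transfer_files = YES",
            "when_to_transfer_output = ON_EXIT",
            "transfer_input_files = " + name + ".idf",
            "Requirements = (Arch == \"INTEL\" || Arch == \"X86_64\") && (Opsys == \"WINDOWS\") && (Disk >= 200)",
            "queue", "");
    }

    // Run an external command (condor_submit / condor_wait) and wait for it to finish.
    static void run(String... command) throws IOException, InterruptedException {
        new ProcessBuilder(command).inheritIO().start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        List<String> idfs = Arrays.asList("house001", "house002", "house003"); // base names, no .idf
        ExecutorService pool = Executors.newFixedThreadPool(50);               // ~50 jobs in flight at once

        for (String name : idfs) {
            pool.submit(() -> {
                try {
                    Files.write(Paths.get(name + ".cmd"), jobFileFor(name).getBytes());
                    run("condor_submit", name + ".cmd");  // hand the job to the pool
                    run("condor_wait", name + ".log");    // block until this job is done
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(7, TimeUnit.DAYS);
    }
}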

One tip, if you want to monitor the progress of the jobs with condor_q, make the name of run.bat something more helpful like run_house465.bat – as that’s what’ll show up in the listing that condor provides.

Admin

A few commands to help with keeping on top of your condor jobs. All run from the command line.

  • condor_q : displays the current queue of jobs submitted to the pool from your machine
  • condor_q -analyze : displays more detail, such as why a job has been held or is failing to run
  • condor_q -global : displays the global queue for the pool
  • condor_status : shows all the slots in the condor pool, with whether or not something is running on them
  • condor_rm 2344 : deletes job number 2344
  • condor_rm -all : deletes all jobs submitted from your machine to the pool
  • condor_release -all : if some of your jobs are showing up as “Held” in condor_q or condor_q -analyze, this will give them a kick and hopefully they’ll resume. Often this has happened for me because I’ve had a file open on my machine which Condor was trying to write to (say the results of a previously failed run).

Some things to note:

EnergyPlus version 7 onwards has some multithreading built in. This offers some speedup when you’re running a single job on a multi-CPU machine, but when running several jobs in parallel it can actually result in slower performance, as the CPU has to juggle even more threads. The solution is to disable E+ multithreading: just add this to the IDF:

ProgramControl,
    1;    !- Number of Threads Allowed

I also had an issue with Condor binding to the wrong network adapter. Essentially, it was always trying to connect via wifi on my laptop, even when I was docked and using an ethernet cable with wifi disabled. This resulted in failures to submit jobs to the cluster, “couldn’t connect to local master” errors when I tried running condor_restart, and “failed to fetch ads from 169.254.27.125” from condor_q. (Hint: 169.254.* addresses are often used by Windows when a network connection is down.) Apparently Condor just picks the first adapter it can find, so the solution is to change your Windows configuration so that the network adapter ordering is different, in my case with the wired network connection first. See here for an explanation of Condor’s behaviour, and here for how to change the Windows network configuration.
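
Another option that may be worth knowing about (I went with the adapter-reordering fix myself) is Condor’s NETWORK_INTERFACE configuration macro, which pins the daemons to a particular IP address:

# in the Condor configuration on the affected machine (example value)
NETWORK_INTERFACE = 192.168.1.42    # the IP address of the wired adapter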

I hope that was helpful. Have fun watching those jobs being eaten alive by the pool instead of overheating your own CPU!

