VDT Office Hours 2 February 2006
Happy Groundhog Day!
Attendees
- Alain Roy (VDT)
- Tim Cartwright (VDT)
- Nate Mueller (VDT)
- Burt Holzman (Fermilab)
- Leigh Grundhoefer (Indiana)
- John Weigand (Fermilab)
- Stu Martin (Argonne)
- Vikram Andem (Fermilab)
- Anand Padmanabhan (Iowa)
- John Hover (BNL)
- Rob Quick (Indiana)
- Mark Green (Buffalo)
Gridmonitor problem
People have become aware of a problem with the Condor-G gridmonitor.
If someone submits a job from Condor 6.7.8 or earlier (this includes
OSG 0.2.x) to a gatekeeper running GT 4.0 pre-web services GRAM, the
Condor-G gridmonitor won't work. Note that neither the batch system
on the gatekeeper nor any version of Condor installed on the
gatekeeper matters: only the version of Condor used to submit the
job matters.
The result is a higher-than-expected load on the gatekeeper, because
the gridmonitor is not used.
There are two possible fixes:
- People submitting jobs can upgrade to a newer version of
Condor. If they prefer, they can upgrade just the gridmonitor.
- Site administrators can fix the problem by making the Globus
tmp directory world-writable. We are not yet sure of the
implications of this change: Stu Martin from Globus is looking
into it and will report back on whether it is safe.
Pre-WS scalability enhancement
As part of a larger effort to get the VDT and the TeraGrid software
stacks to use identical versions of Globus and other software, we
would like to consider adopting an enhancement to pre-web services
GRAM that TeraGrid already uses.
It is a back-port of a small piece of web services GRAM (WS-GRAM). It
changes the pre-web services jobmanagers so that, instead of
repeatedly querying the batch system for the state of each job, they
consult the batch system logs. It is claimed that this reduces
jobmanager load by ninety percent.
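As a rough illustration of the idea only (this is not the TeraGrid or Globus code), the sketch below assumes a made-up log format with one "timestamp job_id STATE" event per line. Rather than running a batch-system status query for every job, a single reader scans the log and keeps the most recent state seen for each job; the function name and log format are hypothetical.

```python
# Hypothetical log format (an assumption, not any real batch system's):
#   <timestamp> <job_id> <STATE>
# e.g. "1138867200 1234.pbs RUNNING"

def job_states_from_log(lines):
    """Return the latest state seen for each job id in the log lines."""
    states = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip malformed or unrelated lines
        _timestamp, job_id, state = fields
        states[job_id] = state  # later events overwrite earlier ones
    return states


log = [
    "1138867200 1234.pbs PENDING",
    "1138867260 1234.pbs RUNNING",
    "1138867300 5678.pbs PENDING",
    "1138867900 1234.pbs DONE",
]
print(job_states_from_log(log))
# prints {'1234.pbs': 'DONE', '5678.pbs': 'PENDING'}
```

The saving comes from reading one shared log instead of issuing one status query per job per polling interval; the cost, as noted below, is that the log must be readable where this reader runs.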
There are two possible downsides:
- The batch system logs must be accessible from the gatekeeper
node. WS-GRAM already requires this, but pre-WS GRAM did not.
Some people (such as Mark Green) prefer not to make the logs
accessible via NFS or on the gatekeeper. He is willing to make
it work on his system, and no one else objected.
- It requires running an extra process that translates the batch
system logs into a form the jobmanager can understand. This
process must be able to read the logs, so at some sites it will
need to be setuid.
Overall, there were no real objections to this modification, and
people agreed that it would be a good change. Alain will bring it up
at today's OSG integration meeting to run it by a larger group.