Why doesn't my job run?
Help for Users

This is a draft document that, while amazingly helpful, is not yet complete. It will eventually be moved to the opensciencegrid.org domain.

Introduction

This document is meant for people whose jobs are failing when submitted to Open Science Grid (OSG) sites. It addresses only the basic running of jobs, not data transfer or storage. There is a companion document for system administrators who are helping users debug why jobs won't run at their site.

First step: Let us know!

A number of users have had problems getting jobs running at sites. We don't want you to suffer, and we consider fixing this sort of problem to be very high priority. Please contact our trained operators who are standing by waiting to assist you via a handy web form. If you can do any of the debugging steps that follow, please describe the results: they will be invaluable.

The GOC (Grid Operations Center) will work with you and the system administrators of the problem sites to debug the problem as quickly as possible. The GOC may involve members of the OSG troubleshooting team.

Summary of Debugging

The suggestions that follow are:

vdt-version
condor_version
globus-version
grid-proxy-info
voms-proxy-info -all
globusrun -a -r sitename
globus-job-run sitename/jobmanager-fork /bin/hostname

These commands are discussed in much greater detail below.

1. What software are you using?

What software are you using? If you are using the VDT, please run the following command to get specific information:

vdt-version

If you are using Condor-G, please run this command:

condor_version

If you are using Globus, please run this command:

globus-version
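
If you would like to gather all three answers in one go for your report, something like the following sketch may help. The helper name report_version is made up for this illustration; it runs each of the commands named above only when it is found on your PATH, and says so when it is not.

```shell
# Hypothetical helper: run a version command only if it is installed.
report_version() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $1 =="
        "$@" 2>&1
    else
        echo "$1: not found on PATH"
    fi
}

# Collect version information for whichever packages are present.
for tool in vdt-version condor_version globus-version; do
    report_version "$tool"
done
```

You can paste the combined output into the web form along with your description of the problem.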

2. What is your identity?

If you are using grid-proxy-init to create your grid proxy, please run the following command (note that the output is example output only):

> grid-proxy-info
subject  : /DC=org/DC=doegrids/OU=People/CN=Alain Roy 424511/CN=801637001
issuer   : /DC=org/DC=doegrids/OU=People/CN=Alain Roy 424511
identity : /DC=org/DC=doegrids/OU=People/CN=Alain Roy 424511
type     : Proxy draft (pre-RFC) compliant impersonation proxy
strength : 512 bits
path     : /tmp/x509up_u8471
timeleft : 11:59:54

If you are using voms-proxy-init to create your grid proxy, please run the following command (note that the output is example output only):

> voms-proxy-info -all
WARNING: Unable to verify signature! Server certificate possibly not installed.
Error: Cannot find certificate of AC issuer for vo nanohub
subject   : /DC=org/DC=doegrids/OU=People/CN=Alain Roy 424511/CN=proxy
issuer    : /DC=org/DC=doegrids/OU=People/CN=Alain Roy 424511
identity  : /DC=org/DC=doegrids/OU=People/CN=Alain Roy 424511
type      : proxy
strength  : 512 bits
path      : /tmp/x509up_u8471
timeleft  : 11:59:51
=== VO nanohub extension information ===
VO        : nanohub
subject   : /DC=org/DC=doegrids/OU=People/CN=Alain Roy 424511
issuer    : /DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov
attribute : /nanohub/Role=NULL/Capability=NULL
timeleft  : 11:59:51

Don't worry: you can ignore the "WARNING" and "Error" lines that voms-proxy-info printed. They are not a problem here.

Hint: Some OSG sites require that you use a VOMS proxy instead of a basic grid proxy. For example, Fermilab requires either a VOMS proxy or special permission to use a grid proxy. In addition, as of this writing (28-April-28), Fermilab cannot recognize VOMS proxies derived from certificates granted by the Purdue CA.
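
An expired or nearly expired proxy is easy to miss in the output above. As a rough sketch, the "timeleft" line can be converted to seconds so that a script can check it. The helper name timeleft_seconds is an assumption made for this example, and it expects the "timeleft : HH:MM:SS" layout shown in the example output.

```shell
# Hypothetical helper: read proxy info text on stdin and print the
# first "timeleft" value converted to seconds.
timeleft_seconds() {
    awk -F' : ' '/^timeleft/ {
        split($2, t, ":")
        print t[1] * 3600 + t[2] * 60 + t[3]
        exit
    }'
}

# Example usage (assumes a proxy already exists):
#   grid-proxy-info | timeleft_seconds
#   voms-proxy-info -all | timeleft_seconds
```

With voms-proxy-info -all, only the first timeleft line (the proxy itself, not the VO extension) is used.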

3. Can you connect?

Can you connect to the gatekeeper using telnet? (Port 2119, used in the example, is the usual GRAM gatekeeper port.) If the gatekeeper isn't listening, you will see a message like this:

> telnet vdt-fc4-ia32 2119
Trying 192.168.0.205...
telnet: connect to address 192.168.0.205: Connection refused
telnet: Unable to connect to remote host: Connection refused

If it connects and appears to "hang", you've learned that the gatekeeper is alive and listening. In rare circumstances, you might see an error message printed instead. This happens if xinetd is unable to start the gatekeeper process at all: perhaps the executable is missing, or it cannot find some shared libraries. If you do see such a message, include it in your report: it is invaluable.
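
If telnet is not installed, or you want a check that does not hang, a sketch like the following may do the same job. It assumes that bash (with its /dev/tcp feature) and the coreutils timeout command are available on your machine; the helper name gatekeeper_reachable is invented for this example. It reports only whether the port accepts a connection, so it will not show the rare xinetd error messages that telnet would.

```shell
# Hypothetical helper: succeed if a TCP connection to host:port can be
# opened within five seconds. The port defaults to 2119, the port used
# in the telnet example. Assumes bash and coreutils "timeout" exist.
gatekeeper_reachable() {
    host=$1
    port=${2:-2119}
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example usage:
#   if gatekeeper_reachable vdt-fc4-ia32 2119; then
#       echo "gatekeeper is listening"
#   else
#       echo "could not connect"
#   fi
```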

4. Can you be authenticated and authorized?

Try to run the following command, which contacts Globus GRAM 2 (the software that accepts job submissions) and asks for authorization but doesn't actually run a job.

Command:
globusrun -a -r sitename

Example:
> globusrun -a -r vdt-rhas3-ia32.cs.wisc.edu

GRAM Authentication test successful

If it fails, you might see output like this:

> globusrun -a -r vdt-rhas3-ia32

GRAM Authentication test failure: connecting to the job manager
failed.  Possible reasons: job terminated, invalid job contact,
network problems, ...

Common reasons for failure include:

If the remote site has a GridFTP server on the same computer as the gatekeeper, you can often get better error messages for authentication and authorization failures from globus-url-copy. For example, if the remote site doesn't recognize your CA:

> globus-url-copy gsiftp://vdt-rhas3-ia32/etc/motd file:///home/roy/motd

error: globus_ftp_control_client.c:globus_l_ftp_control_send_cmd_cb:2748:
gss_init_sec_context failed
GSS Major Status: Authentication Failed
init_sec_context.c:gss_init_sec_context:190:
SSLv3 handshake problems
globus_i_gsi_gss_utils.c:globus_i_gsi_gss_handshake:889:
Unable to verify remote side's credentials
globus_i_gsi_gss_utils.c:globus_i_gsi_gss_handshake:862:
SSLv3 handshake problems: Couldn't do ssl handshake
OpenSSL Error: s3_clnt.c:842: in library: SSL routines, function SSL3_GET_SERVER_CERTIFICATE: certificate verify failed
globus_gsi_callback.c:globus_gsi_callback_handshake_callback:531:
Could not verify credential
globus_gsi_callback.c:globus_i_gsi_callback_cred_verify:681:
Can't get the local trusted CA certificate: Cannot find issuer certificate for local credential with subject: /DC=org/DC=doegrids/OU=Services/CN=vdt-rhas3-ia32.cs.wisc.edu

If you aren't in the grid-mapfile:

> globus-url-copy gsiftp://vdt-rhas3-ia32/etc/motd file:///home/roy/motd

error:
globus_ftp_client_state.c:globus_l_ftp_client_connection_error:4105:
the server responded with an error
530 530-Login incorrect. :
gridmap.c:globus_l_gss_assist_gridmap_lookup:2035:
530-Gridmap lookup failure: Could not map /DC=org/DC=doegrids/OU=People/CN=Alain Roy 424511
530-
530 End.

If you are mapped to a user that doesn't exist:

> globus-url-copy gsiftp://vdt-rhas3-ia32/etc/motd file:///home/roy/motd

error: globus_ftp_client_state.c:globus_l_ftp_client_connection_error:4105:
the server responded with an error
530 530-Login incorrect. : globus_i_gfs_data.c:globus_l_gfs_data_authorize:1050:
530-Mapped user 'royy' is invalid.
530 End.
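
The three failures above can be told apart by a few key phrases in the error text. The following sketch looks only for those phrases, as they appear in the example output; the helper name classify_gsiftp_error and the short labels it prints are assumptions made for this illustration, and other versions of the software may word their errors differently.

```shell
# Hypothetical helper: read globus-url-copy error output on stdin and
# name the likely cause, using only the three messages shown above.
classify_gsiftp_error() {
    output=$(cat)
    case "$output" in
        *"Gridmap lookup failure"*)
            echo "not in grid-mapfile" ;;
        *"Mapped user"*"is invalid"*)
            echo "mapped to a user that does not exist" ;;
        *"Cannot find issuer certificate"*)
            echo "a CA certificate is not trusted" ;;
        *)
            echo "unrecognized error" ;;
    esac
}

# Example usage:
#   globus-url-copy gsiftp://sitename/etc/motd file:///tmp/motd 2>&1 | classify_gsiftp_error
```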

5. Can you run a simple fork job?

Try to run a simple GRAM 2 (a.k.a. pre-web services) job that runs on the gatekeeper. A successful run will look something like this:

Command:
globus-job-run sitename/jobmanager-fork /bin/hostname

Example:
> globus-job-run vdt-rhas3-ia32/jobmanager-fork /bin/hostname
vdt-rhas3-ia32

Common reasons for failure include:

6. Can you run a simple batch job?

Try to run a simple GRAM 2 (a.k.a. pre-web services) job that runs on a worker node. A successful run will look something like this:

Command:
globus-job-run sitename/<jobmanager> /bin/hostname

Example:
> globus-job-run vdt-rhas3-ia32/jobmanager-condor /bin/hostname
node-0018

An example failure caused by Condor being down at the remote site:

> globus-job-run vdt-rhas3-ia32/jobmanager-condor /bin/hostname

ERROR: Can't find address of local schedd
GRAM Job failed because the job failed when the job manager attempted to run it (error code 17)
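
Steps 5 and 6 are most informative when run in order: if the fork job fails, the batch job is unlikely to tell you anything new. A minimal sketch of that sequence follows. The helper name try_jobmanagers and the GLOBUS_JOB_RUN variable are both invented for this example; the variable exists only so that the sketch can be tried out without a real site.

```shell
# Hypothetical helper: run /bin/hostname through each job manager in
# turn and stop at the first failure. GLOBUS_JOB_RUN is an invented
# override; by default the real globus-job-run command is used.
try_jobmanagers() {
    site=$1
    shift
    for jm in "$@"; do
        if out=$("${GLOBUS_JOB_RUN:-globus-job-run}" "$site/$jm" /bin/hostname 2>&1); then
            echo "$jm: ok, ran on $out"
        else
            echo "$jm: FAILED"
            echo "$out"
            return 1
        fi
    done
}

# Example usage:
#   try_jobmanagers vdt-rhas3-ia32 jobmanager-fork jobmanager-condor
```

The output of a failed step is printed in full, since that is the text most worth including in your report.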

Common reasons for failure include:

7. Does Condor-G fail when Globus commands succeed?

Condor-G exercises the GRAM protocol more than most globus-job-run or globusrun invocations. In particular, it requires that more network ports be open: if either your site or the remote site has a firewall blocking these ports, Condor-G will fail when the Globus commands succeed. If you find that you can use globus-job-run but not Condor-G, this is likely to be your problem.

The cause of the problem is that Globus on the remote site needs to contact Condor-G, and a firewall that blocks incoming connections prevents that connection from being made. (It's also possible that a firewall at the remote site blocks outgoing connections, but that's much less likely.) This firewall might be on the local computer (perhaps with iptables), or it might be "in the network".

To solve the problem, you need to punch a hole in the firewall. You need a port range with at least three ports per user that might use Condor-G. For example, if you have ten people that might submit jobs via Condor-G, you'll need at least thirty ports in the port range. (Note that a user is a unique DN, not a unique user id.)

Once you have the port range, you need to change the Condor configuration for the condor_gridmanager component. For example, if the port range is 8000-8029, you would set:

GRIDMANAGER.IN_LOWPORT = 8000
GRIDMANAGER.IN_HIGHPORT = 8029

After changing this configuration, restart Condor.
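
The port arithmetic is easy to get off by one, so here is a small sketch that sizes the range at three ports per user, the minimum described above, and prints the two settings. The helper name gridmanager_port_config is made up for this example, and the starting port is whatever your firewall administrator has opened for you.

```shell
# Hypothetical helper: print a port range sized at three ports per user.
# Arguments: first port in the range, number of users (unique DNs).
gridmanager_port_config() {
    low=$1
    users=$2
    high=$(( low + 3 * users - 1 ))
    echo "GRIDMANAGER.IN_LOWPORT = $low"
    echo "GRIDMANAGER.IN_HIGHPORT = $high"
}

# Ten users starting at port 8000 gives the 8000-8029 range used above.
gridmanager_port_config 8000 10
```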