This is a draft document that, while amazingly helpful, is not yet complete. It will eventually be moved to the opensciencegrid.org domain.
This document is meant for people whose jobs are failing when submitted to Open Science Grid (OSG) sites. It addresses only the basic running of jobs, not data transfer or storage. There is a companion document for system administrators who are helping users debug why jobs won't run at their site.
A number of users have had problems getting jobs running at sites. We don't want you to suffer, and we consider fixing this sort of problem to be very high priority. Please contact our trained operators who are standing by waiting to assist you via a handy web form. If you can do any of the debugging steps that follow, please describe the results: they will be invaluable.
The Grid Operations Center (GOC) will work with you and the system administrators of the problem sites to debug the problem as quickly as possible. The GOC may involve members of the OSG troubleshooting team.
The suggestions that follow are:
    vdt-version
    condor_version
    globus-version
    grid-proxy-info
    voms-proxy-info -all
    globusrun -a -r sitename
    globus-job-run sitename/jobmanager-fork /bin/hostname
These commands are discussed in much greater detail below.
What software are you using? If you are using the VDT, please run the following command to get specific information:
    vdt-version
If you are using Condor-G, please run this command:
    condor_version
If you are using Globus, please run this command:
    globus-version
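To gather all three answers in one go for a support request, the commands above can be wrapped in a small helper. This is only a sketch: the function name `collect_versions` is made up for illustration, and it assumes nothing beyond the three commands already named above, skipping any that are not installed.

```shell
# Run each version command named above if it is installed, and label the output.
# collect_versions is a hypothetical helper name, not part of any toolkit.
collect_versions() {
    for tool in vdt-version condor_version globus-version; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "== $tool =="
            "$tool" 2>&1 || echo "($tool exited with an error)"
        else
            echo "== $tool: not installed =="
        fi
    done
}
```

Running `collect_versions > versions.txt` produces a file that can be pasted into the report.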
If you are using grid-proxy-init to create your grid proxy, please run the following command (note that the output is example output only):
    > grid-proxy-info
    subject  : /DC=org/DC=doegrids/OU=People/CN=Alain Roy 424511/CN=801637001
    issuer   : /DC=org/DC=doegrids/OU=People/CN=Alain Roy 424511
    identity : /DC=org/DC=doegrids/OU=People/CN=Alain Roy 424511
    type     : Proxy draft (pre-RFC) compliant impersonation proxy
    strength : 512 bits
    path     : /tmp/x509up_u8471
    timeleft : 11:59:54
If you are using voms-proxy-init to create your grid proxy, please run the following command (note that the output is example output only):
    > voms-proxy-info -all
    WARNING: Unable to verify signature! Server certificate possibly not installed.
    Error: Cannot find certificate of AC issuer for vo nanohub
    subject   : /DC=org/DC=doegrids/OU=People/CN=Alain Roy 424511/CN=proxy
    issuer    : /DC=org/DC=doegrids/OU=People/CN=Alain Roy 424511
    identity  : /DC=org/DC=doegrids/OU=People/CN=Alain Roy 424511
    type      : proxy
    strength  : 512 bits
    path      : /tmp/x509up_u8471
    timeleft  : 11:59:51
    === VO nanohub extension information ===
    VO        : nanohub
    subject   : /DC=org/DC=doegrids/OU=People/CN=Alain Roy 424511
    issuer    : /DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov
    attribute : /nanohub/Role=NULL/Capability=NULL
    timeleft  : 11:59:51
Note the warning and error that voms-proxy-info printed at the top of the example output: they are not a worry here.
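One thing worth checking in either output is the timeleft line, since an expired proxy will make every later step fail. The sketch below converts that line to seconds so a script can compare it against a threshold. The function name `timeleft_seconds` is hypothetical, and it assumes the `timeleft : HH:MM:SS` layout shown in the example output; when there are two timeleft lines, as in the voms-proxy-info output, it reads only the first.

```shell
# Read proxy-info output on stdin and print the first timeleft value in seconds.
# Assumes the "timeleft : HH:MM:SS" layout shown in the example output.
timeleft_seconds() {
    awk '$1 ~ /^timeleft/ {
        split($NF, t, ":")
        print t[1] * 3600 + t[2] * 60 + t[3]
        exit
    }'
}
```

For example, `grid-proxy-info | timeleft_seconds` would print 43194 for the example output with 11:59:54 remaining.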
Can you connect to the gatekeeper using telnet? If the gatekeeper isn't listening, you see a message like this:
    > telnet vdt-fc4-ia32 2119
    Trying 192.168.0.205...
    telnet: connect to address 192.168.0.205: Connection refused
    telnet: Unable to connect to remote host: Connection refused
If it connects and appears to "hang", you've learned that the gatekeeper is alive and listening. In rare circumstances you might instead see an error message printed out. This happens if xinetd is unable to start up the gatekeeper process at all: perhaps the executable is missing, or it cannot find some shared libraries. If you do see such a message, that information is invaluable.
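If telnet is not installed, or you want to script the check, a rough equivalent can be written with bash's built-in /dev/tcp redirection. This is a sketch under assumptions: the function name `port_open` is made up, it needs bash and the coreutils `timeout` command, and the five-second limit is arbitrary. Unlike telnet, it only reports whether the port accepts a connection and will not show any error message the gatekeeper prints.

```shell
# Usage: port_open <host> <port>
# Returns 0 if a TCP connection can be opened within 5 seconds, non-zero otherwise.
# Relies on bash's /dev/tcp pseudo-device; host and port are passed as $0 and $1.
port_open() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

For the gatekeeper port used in the telnet example: `port_open vdt-fc4-ia32 2119 && echo listening || echo "not reachable"`.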
Try to run the following command, which contacts Globus GRAM 2 (the software that accepts job submissions) and asks for authorization but doesn't actually run a job.
Command:
    globusrun -a -r sitename
Example:
    > globusrun -a -r vdt-rhas3-ia32.cs.wisc.edu
    GRAM Authentication test successful
If it fails, you might see output like this:
    > globusrun -a -r vdt-rhas3-ia32
    GRAM Authentication test failure: connecting to the job manager failed. Possible reasons: job terminated, invalid job contact, network problems, ...
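When testing several sites, it can help to reduce the output above to a yes/no answer. The sketch below checks for the success message shown in the example rather than relying on the exit status of globusrun, which this document does not describe. The function name `gram_auth_succeeded` is hypothetical.

```shell
# Read globusrun output on stdin; succeed only if it contains the success
# message shown in the example above.
gram_auth_succeeded() {
    grep -q "GRAM Authentication test successful"
}
```

A possible use is `globusrun -a -r sitename 2>&1 | gram_auth_succeeded && echo "sitename: ok"`.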
Common reasons for failure include:
If the remote site has a GridFTP server on the same computer as the gatekeeper, you can often get better error messages for authentication and authorization failures from globus-url-copy. For example, if the remote site doesn't recognize your CA:
    > globus-url-copy gsiftp://vdt-rhas3-ia32/etc/motd file:///home/roy/motd
    error: globus_ftp_control_client.c:globus_l_ftp_control_send_cmd_cb:2748:
    gss_init_sec_context failed
    GSS Major Status: Authentication Failed
    init_sec_context.c:gss_init_sec_context:190:
    SSLv3 handshake problems
    globus_i_gsi_gss_utils.c:globus_i_gsi_gss_handshake:889:
    Unable to verify remote side's credentials
    globus_i_gsi_gss_utils.c:globus_i_gsi_gss_handshake:862:
    SSLv3 handshake problems: Couldn't do ssl handshake
    OpenSSL Error: s3_clnt.c:842: in library: SSL routines, function SSL3_GET_SERVER_CERTIFICATE: certificate verify failed
    globus_gsi_callback.c:globus_gsi_callback_handshake_callback:531:
    Could not verify credential
    globus_gsi_callback.c:globus_i_gsi_callback_cred_verify:681:
    Can't get the local trusted CA certificate: Cannot find issuer certificate for local credential with subject: /DC=org/DC=doegrids/OU=Services/CN=vdt-rhas3-ia32.cs.wisc.edu
If you aren't in the grid-mapfile:
    > globus-url-copy gsiftp://vdt-rhas3-ia32/etc/motd file:///home/roy/motd
    error: globus_ftp_client_state.c:globus_l_ftp_client_connection_error:4105:
    the server responded with an error
    530 530-Login incorrect. : gridmap.c:globus_l_gss_assist_gridmap_lookup:2035:
    530-Gridmap lookup failure: Could not map /DC=org/DC=doegrids/OU=People/CN=Alain Roy 424511
    530-
    530 End.
If you are mapped to a user that doesn't exist:
    > globus-url-copy gsiftp://vdt-rhas3-ia32/etc/motd file:///home/roy/motd
    error: globus_ftp_client_state.c:globus_l_ftp_client_connection_error:4105:
    the server responded with an error
    530 530-Login incorrect. : globus_i_gfs_data.c:globus_l_gfs_data_authorize:1050:
    530-Mapped user 'royy' is invalid.
    530 End.
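Because these error dumps are long, it may help to pick out the one phrase that distinguishes each case. The sketch below maps the three example outputs above to the three causes named in the text. The function name `diagnose_gridftp_error` is made up, and the patterns are taken only from the examples above, so other versions of the software may word their messages differently.

```shell
# Read globus-url-copy error output on stdin and name the likely cause,
# using only the phrases that appear in the three examples above.
diagnose_gridftp_error() {
    output=$(cat)
    case "$output" in
        *"Cannot find issuer certificate"*)
            echo "CA not recognized" ;;
        *"Gridmap lookup failure"*)
            echo "not in the grid-mapfile" ;;
        *"Mapped user"*"is invalid"*)
            echo "mapped to a user that does not exist" ;;
        *)
            echo "unrecognized error" ;;
    esac
}
```

A possible use is `globus-url-copy gsiftp://sitename/etc/motd file:///tmp/motd 2>&1 | diagnose_gridftp_error`.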
Try to run a simple GRAM 2 (a.k.a. pre-web services) job that runs on the gatekeeper. A successful run will look something like this:
Command:
    globus-job-run sitename/jobmanager-fork /bin/hostname
Example:
    > globus-job-run vdt-rhas3-ia32/jobmanager-fork /bin/hostname
    vdt-rhas3-ia32
Common reasons for failure include:
    > globus-job-run vdt-rhas3-ia32/jobmanager-fork /bin/hostname
    GRAM Job submission failed because the gatekeeper failed to run the job manager (error code 47)

    > globus-job-run vdt-rhas3-ia32/jobmanager-fork /bin/hostname
    GRAM Job submission failed because the job manager failed to create an internal script argument file (error code 22)

    > globus-job-run vdt-fc3-ia32/jobmanager-fork /bin/nonexistent_program
    GRAM Job failed because the executable does not exist (error code 5)
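Each of these failures ends with a numeric GRAM error code, which is the most useful thing to include in a problem report. As a small sketch, the code can be pulled out of the output like this. The function name `gram_error_code` is hypothetical, and it assumes the `(error code N)` wording shown in the examples above.

```shell
# Read globus-job-run output on stdin and print the numeric GRAM error code,
# if the output contains one in the "(error code N)" form shown above.
gram_error_code() {
    sed -n 's/.*(error code \([0-9][0-9]*\)).*/\1/p'
}
```

A possible use is `globus-job-run sitename/jobmanager-fork /bin/hostname 2>&1 | gram_error_code`, which prints nothing when the job succeeds.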
Try to run a simple GRAM 2 (a.k.a. pre-web services) job that runs on a worker node. A successful run will look something like this:
Command:
    globus-job-run sitename/<jobmanager> /bin/hostname
Example:
    > globus-job-run vdt-rhas3-ia32/jobmanager-condor /bin/hostname
    node-0018
An example failure caused by Condor being down at the remote site:
    > globus-job-run vdt-rhas3-ia32/jobmanager-condor /bin/hostname
    ERROR: Can't find address of local schedd
    GRAM Job failed because the job failed when the job manager attempted to run it (error code 17)
Common reasons for failure include:
    globus_rsl=(queue=somequeuename)
There currently isn't a nice way to get a list of available queues at a site, so you will probably need to contact the site directly. Don't worry about Condor sites: they don't use separately named queues.
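To show where the globus_rsl line sits, here is a sketch that prints a minimal Condor-G submit description for a /bin/hostname test job. Several things here are assumptions rather than anything stated above: the function name `write_submit_file`, the output file names, the use of `universe = grid` with a `gt2` grid_resource (older Condor releases used a different syntax), and the `jobmanager-pbs` name in the usage line. Check the submit commands against your Condor version before relying on it.

```shell
# Usage: write_submit_file <sitename/jobmanager> <queue name>
# Prints a minimal Condor-G submit description that passes the queue name
# to the remote batch system through globus_rsl.
write_submit_file() {
    cat <<EOF
universe = grid
grid_resource = gt2 $1
globus_rsl = (queue=$2)
executable = /bin/hostname
transfer_executable = false
output = hostname.out
error = hostname.err
log = hostname.log
queue
EOF
}
```

For example, `write_submit_file sitename/jobmanager-pbs somequeuename > hostname.sub` followed by `condor_submit hostname.sub`.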
Condor-G exercises the GRAM protocol more than most globus-job-run or globusrun invocations. In particular, it requires that more network ports be open: if either your site or the remote site has a firewall blocking these ports, Condor-G will fail even though the Globus commands succeed. If you find that you can use globus-job-run but not Condor-G, this is likely to be your problem.
The cause of the problem is that Globus on the remote site needs to contact Condor-G, and a firewall that blocks incoming connections prevents that connection from being made. (It's also possible that a firewall at the remote site blocks outgoing connections, but that's much less likely.) This firewall might be on the local computer (perhaps with iptables), or it might be "in the network".
To solve the problem, you need to punch a hole in the firewall. You need a port range with at least three ports per user that might use Condor-G. For example, if you have ten people that might submit jobs via Condor-G, you'll need at least thirty ports in the port range. (Note that a user is a unique DN, not a unique user id.)
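The arithmetic is simple enough to script. The sketch below computes the range from a starting port and a user count at three ports per user, and prints (without running) an iptables rule that would accept incoming TCP connections on that range. The function names are made up, the starting port 8000 in the usage line is just an illustration, and the rule assumes a plain INPUT chain on the submit machine; a firewall "in the network" has to be changed by whoever administers it.

```shell
# Usage: gridmanager_ports <low port> <number of users (unique DNs)>
# Prints "<low> <high>", allowing three ports per user.
gridmanager_ports() {
    low=$1
    users=$2
    high=$((low + 3 * users - 1))
    echo "$low $high"
}

# Usage: print_iptables_rule <low port> <number of users>
# Prints, but does not execute, a rule opening that port range for incoming TCP.
print_iptables_rule() {
    set -- $(gridmanager_ports "$1" "$2")
    echo "iptables -A INPUT -p tcp --dport $1:$2 -j ACCEPT"
}
```

With ten users starting at port 8000, `gridmanager_ports 8000 10` prints `8000 8029`, a range of thirty ports.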
Once you have the port range, you need to change the Condor configuration for the condor_gridmanager component. For example, if the port range is 8000-8029, you would set:
    GRIDMANAGER.IN_LOWPORT = 8000
    GRIDMANAGER.IN_HIGHPORT = 8029
After changing this configuration, restart Condor.