
How to use GXP in your batch queueing environments

Introduction

With GXP, you can easily use machines accessible via SSH. You can also use traditional rsh in place of SSH, or sh to run multiple processes on a multiprocessor/multicore host; it is simply a matter of giving a suitable name to the 'gxpc use' command (e.g., gxpc use rsh ...).

While SSH and the like are enough for using Linux clusters of your own, many clusters are shared among many users, and access to them is allowed only via a designated batch queueing command (most often, 'qsub'). GXP's mechanism for choosing and customizing such underlying 'remote-exec' commands is flexible enough to incorporate them. Once you acquire hosts, you can run commands on them through exactly the same interface (e.g., the 'e' command), regardless of whether they were acquired via SSH, qsub, or anything else.

A simple case (just 'gxpc use torque ...')

First consider the simplest case, in which you have the Torque batch scheduler (an open source implementation of PBS)

and assume that

  qsub script

just works without a need to specify anything else (e.g., queue names).

Then acquiring hosts via qsub is just a matter of:

  gxpc use torque REQUEST_HOST WHATEVER

where REQUEST_HOST should be the name of the host on which you run the qsub command, and WHATEVER can be any name. For example,

  gxpc use torque node000 hoge

Then you acquire resources by

  gxpc explore hoge 20

This runs the qsub command 20 times (which normally gives you 20 CPUs or 20 hosts, depending on the queue configuration of your site).

Knowing how GXP invokes the underlying remote exec commands

It is often the case that you need to supply additional arguments to qsub, such as your group name or the name of the queue to submit your jobs to. Thus, you need to customize the command line GXP uses to acquire resources with a particular command.

Before talking about customization, let's look at how GXP invokes underlying remote-exec commands, and what kind of remote-exec commands are available.

 $ gxpc rsh

lists all remote-exec names GXP is currently familiar with. By default, it will give you something like:

 $ gxpc rsh
 fujitsu, hitachi, n1ge, n1ge_host, nqs_fujitsu, nqs_hitachi, \
 qrsh, qrsh_host, rsh, rsh_as, sge, sge_host, sh, ssh, ssh_as, \
 torque, torque_host

We have already talked about ssh, rsh, sh, and torque. Let's call each of them an rsh-method.

 $ gxpc rsh RSH-METHOD

shows how GXP invokes the specified rsh-method. For example,

 $ gxpc rsh ssh
 ssh : ssh -o 'StrictHostKeyChecking no' \
          -o 'PreferredAuthentications hostbased,publickey' \
          -A %target% %cmd%

This line shows the command line GXP uses to invoke a command on a target host. In the actual command line, %cmd% and %target% are replaced with the command line to be executed and the target host, respectively.
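The placeholder substitution described above can be sketched in a few lines of Python. This is purely illustrative (the template string is copied from the ssh example above; the function name is made up), not GXP's actual code:

```python
# Illustrative sketch of %target%/%cmd% substitution; not GXP's code.
TEMPLATE = ("ssh -o 'StrictHostKeyChecking no' "
            "-o 'PreferredAuthentications hostbased,publickey' "
            "-A %target% %cmd%")

def build_cmdline(template, target, cmd):
    # Replace the two placeholders with the target host and the command.
    return template.replace("%target%", target).replace("%cmd%", cmd)

print(build_cmdline(TEMPLATE, "node001", "hostname"))
```

Running this prints the ssh command line that would log in to node001 and run 'hostname' there.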

What about the torque method used above?

 $ gxpc rsh torque
 torque : qsub_wrap --sys torque %cmd%

Well, unlike ssh, it does not invoke 'qsub' directly. It instead invokes a command (you won't be familiar with) called 'qsub_wrap.' This is a simple wrapper command bundled with GXP. There are reasons why qsub cannot be used as is and qsub_wrap must be used instead:

  1. qsub, unlike ssh, is a totally non-interactive command, which does not connect the remote process's standard I/O to that of the local process.
  2. qsub, unlike ssh, will wait forever if the submitted job does not start for a while.

qsub_wrap is a simple wrapper that makes qsub behave more like ssh (if you are familiar with the qrsh command of Sun Grid Engine, think of qsub_wrap as a general wrapper that makes qsub behave like qrsh). To get a feel for how it works, try:

 <GXP_DIR>/gxpbin/qsub_wrap --sys torque hostname

where GXP_DIR is GXP's top directory (the one in which you find the gxpc command). When it succeeds, it should print the hostname of the host on which the command happened to be executed.
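The core trick that lets a wrapper like qsub_wrap make a batch job look interactive can be sketched with a plain socket: the submitting side listens, and the submitted job connects back and streams its output. The Python below is a hypothetical, self-contained sketch of that idea (the thread stands in for the batch job), not qsub_wrap's actual implementation:

```python
import socket
import threading

# Sketch of the connect-back idea: the local side listens on a socket,
# the "remote job" connects back, and its output is relayed locally.
# Illustrative only; not qsub_wrap's actual code.
def run_forwarder():
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))       # pick a free port on the local side
    srv.listen(1)
    port = srv.getsockname()[1]

    def remote_job():
        # Stands in for the job script the wrapper would submit via qsub.
        c = socket.create_connection(("127.0.0.1", port))
        c.sendall(b"hello from the compute node\n")
        c.close()

    t = threading.Thread(target=remote_job)
    t.start()
    conn, _ = srv.accept()
    chunks = []
    while True:
        data = conn.recv(4096)
        if not data:
            break
        chunks.append(data)          # relay the remote output locally
    t.join()
    conn.close()
    srv.close()
    return b"".join(chunks).decode()

print(run_forwarder(), end="")
```

In the real setting, the connecting side runs on a compute node allocated by the batch scheduler, which is also why the --host option described later exists.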

Customizing how GXP invokes the underlying remote exec commands

Now you are ready to customize it. Let's say you need to give the additional arguments '-q q0123 -g g4567' to qsub. Then do:

 $ gxpc rsh torque qsub_wrap --sys torque %cmd% -- -q q0123 -g g4567

Here, '--' tells the qsub_wrap command that it should pass whatever follows to the qsub command. For other options to qsub_wrap, simply run it without arguments:

 <GXP_DIR>/gxpbin/qsub_wrap
 ... < show help > ...

In general, GXP can use any underlying command which:

  • takes the command line to execute in its arguments,
  • and runs it with its stdin, stdout, and stderr connected to those of the local process (e.g., the ssh client, or qsub_wrap).
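The two-point contract above is small enough to state as code. The following Python sketch (an assumed helper name, not part of GXP) shows what an rsh-method boils down to from GXP's point of view:

```python
import subprocess
import sys

def remote_exec(argv):
    # Run argv with stdin/stdout/stderr inherited from this process --
    # the contract GXP expects of any rsh-method. Illustrative sketch only.
    return subprocess.run(argv).returncode

# The 'sh' method is essentially this; ssh and qsub_wrap satisfy the same
# contract across a network connection.
rc = remote_exec([sys.executable, "-c", "print('hello')"])
```

Any command satisfying this contract, local or remote, can be plugged in as a new rsh-method.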

You can create an entirely new rsh-method with an arbitrary name. For example, let's say you have a host foo.myserver.net that runs an SSH server on a non-default port (say, 2222). Then you may add a new rsh-method, say ssh2222, as follows.

 $ gxpc rsh ssh2222 ssh -p 2222 -o 'StrictHostKeyChecking no' \
          -o 'PreferredAuthentications hostbased,publickey' \
          -A %target% %cmd%

Now you can use ssh2222 as the name given to 'gxpc use' command.

 $ gxpc use ssh2222 . foo.myserver.net
 $ gxpc explore foo.myserver.net

Which batch queueing systems are supported (qsub_wrap's --sys option)?

qsub_wrap is meant to be a general wrapper that can 'wrap' the qsub command (or the like) of various batch queueing systems. The --sys option of qsub_wrap specifies which one should be assumed. Here is the list of acceptable system names and the systems they support:

  • torque : Torque, an open source implementation of PBS
  • sge : Sun Grid Engine
  • n1ge : N1GE on Titech's Tsubame system (a variant of SGE)
  • nqs_hitachi : NQS on T2K Tokyo's HA8000 (a Hitachi variant of NQS)
  • nqs_fujitsu : NQS on T2K Kyoto's system (a Fujitsu variant of NQS)

They are all similar, but still different enough to make it difficult to work with all of them transparently. Among other things, they differ in:

  • qsub's flag for specifying where the standard output/error of the job should go
  • qsub's output format showing the id of the submitted job
  • qsub's flag for specifying the executing shell
  • qstat's flag for specifying the job whose status is of interest
  • whether qstat's exit status indicates the existence of the specified job, and if not, the output format of qstat that shows it
  • the command names of qsub, qstat, and qdel (some installations change them)

qsub_wrap currently absorbs these differences by defining a class for each system, with fields describing its behavior.
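A per-system class of this kind might look like the following. This is a hypothetical sketch; the field names and the concrete flag values and regexes are illustrative, not taken from qsub_wrap's source:

```python
import re
from dataclasses import dataclass

# Hypothetical sketch of a per-system description class; the field names
# and values are illustrative, not qsub_wrap's actual code.
@dataclass
class BatchSystem:
    qsub_cmd: str       # command name for job submission
    qstat_cmd: str      # command name for status queries
    qdel_cmd: str       # command name for job deletion
    stdout_flag: str    # qsub flag choosing where stdout goes
    shell_flag: str     # qsub flag choosing the executing shell
    jobid_regex: str    # how to parse the job id out of qsub's output

torque = BatchSystem("qsub", "qstat", "qdel", "-o", "-S", r"^(\d+)")
sge = BatchSystem("qsub", "qstat", "qdel", "-o", "-S", r"Your job (\d+)")

# Torque's qsub prints something like '12345.node000'; extract the id.
m = re.match(torque.jobid_regex, "12345.node000")
print(m.group(1))
```

With such descriptions in place, the wrapper's submit/poll/delete logic can be written once against the fields instead of once per system.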

Adding support for a new batch queueing environment to GXP amounts to adding it to qsub_wrap, which currently means adding a small class to its source code.

The developer of GXP welcomes requests for support of a new batch queueing system, or (even better) a patch to support a new batch queueing system.

qsub_wrap's other options

--timeout T (default 150)
specifies how many seconds qsub_wrap should wait for the remote process to come up. When the timeout expires, it deletes the request from the queue using the qdel command.
--display_progress_interval T (default 20)
qsub_wrap reports, once every T seconds until the remote job comes up or the timeout is reached, that the request is still in the queue.
--host HOSTNAME (default: none)
qsub_wrap uses a socket to connect the standard I/O of the remote process to that of the local one. If the --host option is given, the remote process tries to connect to this HOSTNAME (with address resolution if necessary). This option is normally unnecessary, but in circumstances where the host you run qsub_wrap on has multiple hostnames (e.g., one exposed to the Internet and one private to your cluster) and the compute nodes cannot connect to one of them, you must specify the one the compute nodes can connect to.
--dbg 0/1/2 (default 0)
giving 1 or 2 makes qsub_wrap more verbose.
--qsub PATH (default qsub)
--qstat PATH (default qstat)
--qdel PATH (default qdel)
specify the paths to qsub, qstat, and qdel, respectively.
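The --timeout behaviour described above is a poll-until-started loop with a cancel action on expiry. The following Python sketch captures that logic with injected callbacks (the function and callback names are assumptions for illustration, not qsub_wrap's code; in the real wrapper, is_running would call qstat and cancel would call qdel):

```python
import time

def wait_until_started(is_running, cancel, timeout=150.0, interval=1.0):
    # Poll is_running() until it returns True or the timeout expires;
    # on timeout, call cancel() to remove the queued request.
    # Hypothetical sketch of the --timeout behaviour, not qsub_wrap's code.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if is_running():
            return True
        time.sleep(interval)
    cancel()
    return False

# Simulate a job that never starts: the cancel callback fires.
cancelled = []
ok = wait_until_started(lambda: False, lambda: cancelled.append(True),
                        timeout=0.05, interval=0.01)
print(ok, cancelled)
```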

Passing options from explore to ssh, qsub_wrap, etc.

Suppose you want to change the timeout parameter passed to qsub_wrap. One way to do this is simply to customize the qsub_wrap command line via 'gxpc rsh':

 gxpc rsh torque qsub_wrap --timeout 200 --sys torque %cmd% 

But if you want to experiment with many timeout values and do not want to issue a complex gxpc rsh command every time, you can parameterize the line by including %parameter% in your 'gxpc rsh' line and then supplying the actual value with '-a parameter=value' in your explore command.

For example, the above example can be written as

 gxpc rsh torque qsub_wrap --timeout %timeout% --sys torque %cmd% 

and

 gxpc explore -a timeout=250 ...

Diagnosing explore problems

From time to time, explore may fail. Most often it is simply that the target nodes are down or otherwise inaccessible, and you can diagnose the problem simply by looking at the output of explore.

But if the output is not very informative and you feel you need to look into why it fails, here are the steps you might follow to diagnose the problem.

Let's assume you are trying to explore a host called "node001" from host "node000".

  • First, run 'gxpc rsh ssh' to see the command line GXP will use. Replace %target% with the target and %cmd% with something as simple as 'hostname', and run it. That is,
     $ gxpc rsh ssh
     ssh : ssh -o 'StrictHostKeyChecking no' \
               -o 'PreferredAuthentications hostbased,publickey' \
               -A %target% %cmd%
    and then invoke
     ssh -o 'StrictHostKeyChecking no' \
         -o 'PreferredAuthentications hostbased,publickey' \
         -A node001 hostname
    This will hopefully give you more information about what went wrong. Play with different parameters if any of the above seems problematic.
  • If the above succeeds but you are still not able to explore node001, then make sure GXP is actually using the "right" rsh-method. To check this, give the '--verbosity 1' option to explore.
 $ gxpc explore --verbosity 1 node001
 gxpc : finding explorable pairs
 gxpc : ssh node000-tau-2008-07-14-12-59-48-2882 -> node001
 gxpc : found 1 explorable pairs (with 0 NG pairs 0 NG nodes) in 0.010 sec
  ...

This shows that 'ssh' is used by node000 to explore node001. Make sure what is happening here is what you intended. Depending on how you used the 'gxpc use' command, it may happen that

  • a wrong node (one that actually cannot log in to node001) is trying to log in to node001
  • a wrong rsh-method (something other than ssh) is used

If either might be the case, check your use clauses by

 $ gxpc use
 0 : use ssh node000 node001

This will show the use clauses GXP currently recognizes.


© 2007 Taura lab. All Rights Reserved.
Last-modified: 2016-07-12 (Tue) 15:12:44