Adding details for checking error messages.

This commit is contained in:
Matt Graham 2017-02-13 17:25:49 +00:00
parent dc9a472658
commit b2e1e231ff


@ -34,7 +34,7 @@ The data files for the course (i.e. all of the files available at `/afs/inf.ed.a
To submit a job to the cluster from the head node you need to use the `qsub` command. This has many optional arguments - we will only cover the most basic usage here. To see a full description you can view the manual page for the command by running `man qsub` or search online for tutorials on `grid engine qsub`.
The main argument to `qsub` needs to be a script that can be executed directly in a shell. One option is to create a wrapper `.sh` shell script which sets up the requisite environment variables and then executes a second Python script using the Python binary in `/disk/scratch/mlp/miniconda2/bin`. For example, this could be done by creating a file `mlp-job.sh` in your home directory on the cluster file system with the following contents
```
#!/bin/sh
@ -46,10 +46,10 @@ MLP_DATA_DIR='/disk/scratch/mlp/data'
where `[path-to-python-script]` is the path to the Python script you wish to submit as a job e.g. `$HOME/train-model.py`. The script can then be submitted to the cluster using
```
qsub -q cpu $HOME/mlp-job.sh
```
assuming the `mlp-job.sh` script is in your home directory on the cluster file system. The `-q` option specifies which queue to submit the job to; for MLP you should run jobs on the `cpu` queue.
The scheduler will allocate the job to one of the CPU nodes. You can check on the status of submitted jobs using `qstat` - again `man qstat` can be used to give details of the output of this command and various optional arguments.
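As a rough sketch of filtering `qstat` output by job state (the sample output below is illustrative, not captured from the cluster; in plain `qstat` output the state is the fifth column):

```
# Count jobs still waiting in the queue (state 'qw').
# The variable stands in for a real `qstat` call; on the head
# node you would pipe `qstat` itself into the same awk filter.
qstat_output='job-ID  prior   name        user      state submit/start at
-----------------------------------------------------------------
  12345 0.55500 mlp-job.sh  s1234567  r     02/13/2017 17:30:00
  12346 0.00000 mlp-job.sh  s1234567  qw    02/13/2017 17:31:00'
echo "$qstat_output" | awk 'NR > 2 && $5 == "qw" { n++ } END { print n + 0 }'
```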
@ -110,7 +110,9 @@ If the job is successfully submitted you should see a message
Your job [job-id] ("example-tf-mnist-train-job.py") has been submitted
```
printed to the terminal, where `[job-id]` is an integer ID which identifies the job to the scheduler (this can be used, for example, to delete one of your running jobs using `qdel [job-id]`). You can use `qstat` to view the status of all of your currently submitted jobs. Typically straight after a job is submitted it will show its state as `qw`, which means the job is waiting in the queue to be run. Once the job is in progress on one of the nodes this will change to `r`, which indicates the job is running.
An `E` in the job state indicates there has been an error. Running `qstat -j [job-id] | grep error` may help diagnose the issue. You should also run `qdel [job-id]` to remove the job once you have read any error message.
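One way to spot jobs stuck in an error state is to filter the state column for an `E`; a minimal sketch, again using illustrative sample output in place of a real `qstat` call:

```
# Print the IDs of jobs whose state contains 'E' (e.g. 'Eqw'),
# so they can be inspected with `qstat -j` and removed with `qdel`.
qstat_output='job-ID  prior   name        user      state submit/start at
-----------------------------------------------------------------
  12345 0.55500 mlp-job.sh  s1234567  r     02/13/2017 17:30:00
  12346 0.00000 mlp-job.sh  s1234567  Eqw   02/13/2017 17:31:00'
echo "$qstat_output" | awk 'NR > 2 && $5 ~ /E/ { print $1 }'
```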
By default the stdout and stderr output from your script will be written to files `example-tf-mnist-train-job.py.o[job-id]` and `example-tf-mnist-train-job.py.e[job-id]` respectively in your cluster home directory while the job is running (this default behaviour can be changed using the optional `-o` and `-e` options to `qsub`). If you display the contents of the `example-tf-mnist-train-job.py.o[job-id]` file by running e.g.