HPC Job runs slower on random submission
I am running a parallel computation, using MPI for parallelism, on a cluster that uses IBM LSF for job scheduling. Frustratingly, when I submit a job, it sometimes runs slower by a factor of about 2 or more, and at other times it runs as expected. At first I thought a set of faulty nodes was slowing the simulation, but I haven't been able to identify which ones, if any.
I am at a loss as to where I should even start looking to determine the problem without resorting to painstaking trial and error. I am very confident the problem is with the cluster and not with my simulations themselves.
Could someone suggest how to go about debugging this issue?
parallelism high-performance platform-lsf
It seems unlikely that someone here will know the problem better than your cluster administrator. It could be anything from competing jobs to system maintenance to job scheduling re-prioritizations that I don't know about.
– Jeff Schaller
Sep 1 '17 at 0:39
Thanks for your comment. Yes, I realise this, and I have already flagged it. The nodes I have requested are exclusive to my job, so can I rule out any competition? I am not sure what job re-prioritisation could mean.
– Dipole
Sep 1 '17 at 18:47
I'm not familiar with the job scheduler, but I imagine it could re-prioritize (renice, pause, de-schedule, etc.) your job.
– Jeff Schaller
Sep 1 '17 at 18:48
OK, I see. However, I don't think this is the likely explanation. I should have mentioned that my job runs thousands of iterations in a loop. The time per loop iteration is twice as long on average for the slow simulation.
– Dipole
Sep 1 '17 at 19:32
If LSF does anything to the job, you'll see it in the output of 'bhist -l <jobid>'. Is the hardware different? Are other jobs running? There could be some cache effects, or other jobs could be using too much memory.
– Michael Closson
Sep 2 '17 at 0:47
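One way to narrow down the "faulty nodes" hypothesis without pure trial and error is to record, for every submission, which hosts the job ran on and the average time per loop iteration, and then check which hosts show up disproportionately in the slow runs. The following is only a minimal sketch under assumptions: the function name `rank_hosts`, the `slow_factor` threshold, and the input layout are all hypothetical, and the host lists are assumed to have been collected by hand per job (for example from the scheduler's job history). The fastest observed run is taken as the reference for "expected" speed.

```python
from collections import defaultdict


def rank_hosts(runs, slow_factor=1.5):
    """Rank hosts by how often they took part in slow runs.

    runs: list of (hosts, seconds_per_iteration) tuples, one per job.
          This layout is an assumption for illustration.
    slow_factor: a run counts as slow if it is this many times slower
          than the fastest observed run (hypothetical threshold).
    Returns a list of (host, slow_runs, total_runs), worst host first.
    """
    # Use the fastest run as the baseline for "expected" performance.
    baseline = min(t for _, t in runs)
    counts = defaultdict(lambda: [0, 0])  # host -> [slow runs, total runs]
    for hosts, t in runs:
        slow = t > slow_factor * baseline
        for host in set(hosts):  # count each host once per job
            counts[host][1] += 1
            if slow:
                counts[host][0] += 1
    # Sort by fraction of slow runs (descending), then by host name.
    return sorted(
        ((h, s, n) for h, (s, n) in counts.items()),
        key=lambda r: (-r[1] / r[2], r[0]),
    )


if __name__ == "__main__":
    # Made-up example data: host names and timings are not from a real cluster.
    example = [
        (["node01", "node02"], 1.0),
        (["node02", "node03"], 2.1),
        (["node01", "node04"], 1.1),
        (["node03", "node04"], 2.0),
        (["node01", "node02"], 0.9),
    ]
    for host, slow, total in rank_hosts(example):
        print(f"{host}: slow in {slow} of {total} runs")
```

A host that appears in every slow run and in no fast run is a candidate to exclude from the next submission as a test. If no host stands out, the slowdown is more likely tied to something other than a fixed set of nodes, which is also useful to report to the cluster administrator.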
edited Sep 13 at 11:48
Rui F Ribeiro
asked Aug 31 '17 at 23:54
Dipole