HPC Job runs slower on random submission
























I am running a parallel computation with MPI on a cluster that uses IBM LSF for job scheduling. Frustratingly, when I submit a job, it sometimes runs slower by a factor of 2 or more, and at other times it runs as expected. At first I thought a set of faulty nodes was slowing the simulation down, but I haven't been able to identify which ones, if any.



I am at a loss as to where I should even start looking to determine the problem without resorting to painstaking trial and error. I am very confident the problem lies with the cluster and not with my simulations themselves.



Could someone give me some suggestions of how to go about debugging this issue?
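One way to test the faulty-node hypothesis is to record which hosts each job ran on together with its time per iteration, and then rank hosts by how often they appear in slow runs. The following is only a minimal sketch: the function name `suspect_hosts`, the `slow_factor` threshold and the sample data are hypothetical, and the host lists would have to be logged per job (for example from the execution hosts that LSF reports for the job).

```python
from collections import defaultdict


def suspect_hosts(runs, slow_factor=1.5):
    """Rank hosts by the fraction of their runs that were slow.

    runs: list of (hosts, seconds_per_iteration) tuples, one per job.
    A run counts as slow if it exceeds slow_factor times the fastest run.
    """
    baseline = min(t for _, t in runs)
    slow = defaultdict(int)
    total = defaultdict(int)
    for hosts, t in runs:
        is_slow = t > slow_factor * baseline
        for host in set(hosts):
            total[host] += 1
            if is_slow:
                slow[host] += 1
    fractions = [(host, slow[host] / total[host]) for host in total]
    # Highest slow fraction first; ties broken by host name.
    return sorted(fractions, key=lambda item: (-item[1], item[0]))


# Hypothetical log of four jobs: (execution hosts, seconds per iteration).
runs = [
    (["node01", "node02"], 1.0),
    (["node02", "node03"], 2.1),
    (["node01", "node04"], 1.1),
    (["node03", "node04"], 2.0),
]

for host, frac in suspect_hosts(runs):
    print(f"{host}: slow in {frac:.0%} of its runs")
```

A host that is slow in every run it takes part in, and never appears in a fast run, is a candidate to exclude from the next submission. With only a handful of jobs the ranking is weak evidence, so it is worth collecting more runs before drawing conclusions.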





























  • It seems less likely that someone here will know the problem better than your cluster administrator. Could be anything from competing jobs to system maintenance to unknown-to-me job scheduling re-prioritizations.
    – Jeff Schaller
    Sep 1 '17 at 0:39










  • Thanks for your comment. Yes, I realise this, although I have flagged it already. The nodes I have requested are exclusive to my job, so can I rule out any competition that way? I am not sure what job reprioritisation could mean.
    – Dipole
    Sep 1 '17 at 18:47










  • I'm not familiar with the job scheduler, but I imagine it could re-prioritize (renice, pause, de-schedule, etc.) your job.
    – Jeff Schaller
    Sep 1 '17 at 18:48










  • OK, I see. However, I don't think this is the likely explanation. I should have mentioned that my job runs thousands of iterations in a loop. The time per loop iteration is twice as long on average for the slow simulation.
    – Dipole
    Sep 1 '17 at 19:32










  • If LSF does anything to the job, you'll see it in the output of 'bhist -l <jobid>'. Is the hardware different? Are other jobs running? There could be some cache effects, or other jobs could be using too much memory.
    – Michael Closson
    Sep 2 '17 at 0:47
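
Since the slowdown shows up in the time per loop iteration, it may help to check whether every iteration is slower or only some of them. The sketch below assumes per-iteration wall-clock times have been logged for one fast and one slow run; the function name `slowdown_profile`, the threshold `factor` and the sample timings are hypothetical.

```python
from statistics import median


def slowdown_profile(fast_times, slow_times, factor=1.5):
    """Compare per-iteration times of a slow run against a fast run.

    Returns (ratio of medians, fraction of slow-run iterations that
    exceed factor times the fast run's median).
    """
    ref = median(fast_times)
    ratio = median(slow_times) / ref
    frac_slow = sum(t > factor * ref for t in slow_times) / len(slow_times)
    return ratio, frac_slow


# Hypothetical per-iteration timings in seconds.
fast = [1.0, 1.1, 0.9, 1.0, 1.0, 1.0]
steady = [2.0, 2.1, 1.9, 2.0, 2.0, 2.1]   # every iteration about 2x slower
bursty = [1.0, 1.0, 5.0, 1.0, 1.1, 5.0]   # a few very slow iterations

for name, times in (("steady", steady), ("bursty", bursty)):
    ratio, frac = slowdown_profile(fast, times)
    print(f"{name}: median ratio {ratio:.2f}, {frac:.0%} of iterations slow")
```

If nearly all iterations are slower by a similar factor, a persistent cause such as different hardware or process placement seems more plausible. If only occasional iterations are very slow, intermittent interference is a better guess. Either reading is a starting point for talking to the cluster administrator, not a diagnosis.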














parallelism high-performance platform-lsf














edited Sep 13 at 11:48 by Rui F Ribeiro

asked Aug 31 '17 at 23:54 by Dipole








