HPC Job runs slower on random submission

I am running a parallel computation using MPI on a cluster that uses IBM LSF for job scheduling. Frustratingly, some submissions of the job run slower by a factor of about 2 or more, while others run as expected. At first I thought a set of faulty nodes was slowing the simulation, but I haven't been able to locate which ones, if any.



I am at a loss as to where I should even start looking in order to determine the problem without resorting to painstaking trial and error. I am very confident the problem lies with the cluster and not with my simulations themselves.



Could someone give me some suggestions of how to go about debugging this issue?










  • 1




    It seems less likely that someone here will know the problem better than your cluster administrator. Could be anything from competing jobs to system maintenance to unknown-to-me job scheduling re-prioritizations.
    – Jeff Schaller
    Sep 1 '17 at 0:39










  • Thanks for your comment. Yes, I realise this, and I have already flagged it. The nodes I requested are exclusive to my job, so can I rule out competition that way? I am not sure what job re-prioritisation could mean.
    – Dipole
    Sep 1 '17 at 18:47










  • I'm not familiar with the job scheduler, but I imagine it could re-prioritize (renice, pause, de-schedule, etc.) your job.
    – Jeff Schaller
    Sep 1 '17 at 18:48










  • OK, I see. However, I don't think this is the likely explanation. I should have mentioned that my job runs thousands of iterations in a loop. The time per loop iteration is, on average, twice as long in the slow simulation.
    – Dipole
    Sep 1 '17 at 19:32










  • If LSF does anything to the job, you'll see it in the output of 'bhist -l <jobid>'. Is the hardware different? Are other jobs running? There could be some cache effects, or other jobs could be using too much memory.
    – Michael Closson
    Sep 2 '17 at 0:47
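
One way to follow up on the "is the hardware different?" question is to record which hosts each job ran on together with its average time per iteration, and then compare hosts across several runs. The following is only a minimal sketch, not a definitive method. It assumes LSF sets the `LSB_HOSTS` environment variable inside the job (it may be absent for very large allocations), and the function names and the sample records are hypothetical.

```python
import os
from collections import defaultdict
from statistics import mean


def allocated_hosts(env=os.environ):
    """Return the unique hosts of this job, read from LSB_HOSTS.

    Assumption: LSB_HOSTS holds a space-separated host list with one
    entry per slot, so names repeat and are de-duplicated here.
    """
    return sorted(set(env.get("LSB_HOSTS", "").split()))


def mean_time_per_host(runs):
    """Average the per-iteration time of every run that used each host.

    runs: iterable of (hosts, seconds_per_iteration), one pair per job.
    """
    times = defaultdict(list)
    for hosts, seconds in runs:
        for host in hosts:
            times[host].append(seconds)
    return {host: mean(values) for host, values in times.items()}


# Hypothetical records collected from four submissions of the same job.
runs = [
    (["node01", "node02"], 1.0),
    (["node02", "node03"], 2.1),
    (["node01", "node04"], 1.0),
    (["node03", "node04"], 2.0),
]

stats = mean_time_per_host(runs)
# List hosts from slowest to fastest average.
for host, avg in sorted(stats.items(), key=lambda kv: -kv[1]):
    print(f"{host}: {avg:.2f} s/iteration")
```

A host that appears only in slow runs is a candidate to report to the cluster administrator. If no host stands out, the cause is more likely something other than a fixed set of faulty nodes.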














parallelism high-performance platform-lsf














edited Sep 13 at 11:48 by Rui F Ribeiro

asked Aug 31 '17 at 23:54 by Dipole





















