HPC Job runs slower on random submission

I am running a parallel computation that uses MPI on a cluster with IBM LSF as the job scheduler. Frustratingly, some of my submissions run slower by a factor of about 2 or more, while others run as expected. At first I thought a set of faulty nodes was slowing the simulation down, but I haven't been able to identify which ones, if any.

I am at a loss as to where I should even start looking to determine the problem without resorting to painstaking trial and error. I am very confident the problem lies with the cluster and not with my simulations themselves.

Could someone give me some suggestions of how to go about debugging this issue?

edited Sep 13 at 11:48

Rui F Ribeiro

36.8k1273117

asked Aug 31 '17 at 23:54

Dipole

1063

It seems less likely that someone here will know the problem better than your cluster administrator. Could be anything from competing jobs to system maintenance to unknown-to-me job scheduling re-prioritizations.
– Jeff Schaller
Sep 1 '17 at 0:39

Thanks for your comment. Yes, I realise this, and I have flagged it already. The nodes I requested are exclusive to my job, so can I rule out competition that way? I am not sure what job reprioritisation could mean.
– Dipole
Sep 1 '17 at 18:47

I'm not familiar with the job scheduler, but I imagine it could re-prioritize (renice, pause, de-schedule, etc.) your job.
– Jeff Schaller
Sep 1 '17 at 18:48

OK, I see. However, I don't think this is the likely explanation. I should have mentioned that my job runs thousands of iterations in a loop. The time per loop iteration is, on average, twice as long for the slow simulation.
– Dipole
Sep 1 '17 at 19:32
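
Because the symptom is a roughly uniform slowdown per iteration, one option is to time a fixed reference workload on each allocated host and compare the numbers between fast and slow submissions. The following is only a minimal sketch, assuming Python is available on the compute nodes; the function name `time_reference_kernel` and the toy loop are hypothetical stand-ins and are not part of the original simulation.

```python
import socket
import time


def time_reference_kernel(n=200_000, repeats=5):
    """Return the best wall-clock time (seconds) of a fixed CPU-bound loop."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        total = 0.0
        for i in range(n):
            total += i * 0.5  # fixed amount of floating-point work
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # One line per host: hostname followed by the best observed time.
    print(f"{socket.gethostname()} {time_reference_kernel():.4f}")
```

Running this once on every host of a job, at the start of both a fast and a slow submission, would show whether a particular host is consistently slower at identical work. It only exercises a single core, so it would not reveal interconnect or memory-bandwidth problems.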

If LSF does anything to the job, you'll see it in the output of 'bhist -l <jobid>'. Is the hardware different? Are other jobs running? There could be some cache effects, or other jobs could be using too much memory.
– Michael Closson
Sep 2 '17 at 0:47
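
To follow up on the faulty-node suspicion from the question, one approach is to note which hosts each run landed on (for example from the `bhist -l <jobid>` output, or from the `LSB_HOSTS` environment variable inside the job), together with its time per iteration, and then rank hosts by how often they appear in slow runs. The sketch below assumes that bookkeeping has been collected by hand; the function name, host names, and timings are hypothetical and do not come from the original post.

```python
from collections import defaultdict
from statistics import median


def rank_suspect_hosts(runs, slow_factor=1.5):
    """Rank hosts by the fraction of their runs that were slow.

    runs: list of (seconds_per_iteration, [host, ...]) tuples, one per job.
    A run counts as slow if it exceeds slow_factor * the median time.
    """
    baseline = median(t for t, _ in runs)
    slow = defaultdict(int)
    total = defaultdict(int)
    for seconds, hosts in runs:
        for host in set(hosts):  # count each host once per run
            total[host] += 1
            if seconds > slow_factor * baseline:
                slow[host] += 1
    # Highest slow fraction first; ties broken by host name for stable output.
    return sorted(
        ((host, slow[host] / total[host]) for host in total),
        key=lambda item: (-item[1], item[0]),
    )


# Hypothetical example data: time per iteration and the hosts of each job.
example_runs = [
    (1.0, ["node01", "node02"]),
    (2.1, ["node02", "node03"]),
    (1.1, ["node01", "node04"]),
    (2.0, ["node03", "node04"]),
    (1.0, ["node02", "node04"]),
]

for host, fraction in rank_suspect_hosts(example_runs):
    print(f"{host}: slow in {fraction:.0%} of runs")
```

In this made-up data, `node03` appears only in slow runs and so ranks first. The median baseline only works if fewer than half of the recorded runs are slow; otherwise a known-good reference time would be a better threshold.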

parallelism high-performance platform-lsf
