Rocks cluster slave spontaneous reset?

I am 'administering' a small cluster (4 nodes) based on Rocks Cluster. After a recent restart, the slave nodes all appear to have spontaneously reinstalled their operating systems, wiping their entire configuration: InfiniBand support, installed software, and so on.



I cannot fathom why the system would do this, and it is quite unhelpful. Has anyone had this happen before? What could have caused it?



And for the kicker: since I am probably resigned to rebuilding the nodes to the spec they should have had, how does one back up the slaves once they are in a working state?



Additional info:



The head node also seems largely unable to reach the internet, based on attempted pings. It cannot ping the local DNS address (192.168.0.1) either.
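In case it is useful, a minimal set of checks to narrow this down from the head node (standard Linux tools only; the external address below is just a well-known example, not something from my setup):

    # Show the default route the head node is actually using
    ip route show
    # Show which resolvers are configured
    cat /etc/resolv.conf
    # Test the presumed local gateway/DNS box directly
    ping -c 3 192.168.0.1
    # Test a well-known external IP to separate routing problems from DNS problems
    ping -c 3 8.8.8.8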










cluster






asked Apr 28 '15 at 17:12 by J Collins

1 Answer






It turns out that, at least in some cases, the default is for Rocks to reinstall itself on the slave nodes at every boot. Presumably the intent is that clusters are always on, so a restart probably means changes were made that would benefit from a reinstall. For a casually used system this is not appropriate, as it is unlikely that all of the post-install scripts are configured to complete a full reinstallation. The way to avoid the reinstall is to execute:



          rocks run host compute "chkconfig rocks-grub off"


This runs the command on every slave node in the 'compute' group, disabling the reinstall-on-boot functionality.
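As a complementary check, assuming the standard Rocks command set, the frontend's per-node boot action can also be inspected and pinned to the installed OS, so that a stray PXE boot does not pick up a reinstall instruction:

    # Show the boot action the frontend will hand each node on its next PXE boot
    rocks list host boot
    # Have the compute nodes boot their installed OS rather than reinstall
    rocks set host boot compute action=os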



In my case the slave nodes were set to boot from the local drive first, which normally avoids the auto-reinstall. I believe the trigger was a forced power-off corrupting the local disk, so that on the next boot the corrupted disk failed to boot and fell through to PXE, which picked up a reinstall instruction from the head node. The forced power-off was needed because something unknown was interrupting shutdown now on the slaves; a physical power-off was the only thing that would shut them down. I now use shutdown -h now, which seems to get past whatever interrupted the vanilla shutdown.
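The shutdown sequence I use now looks roughly like this (a sketch, reusing the same rocks run host syntax as above):

    # Halt every compute node cleanly from the head node
    rocks run host compute "shutdown -h now"
    # Once the compute nodes report down, halt the head node itself
    shutdown -h now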






answered May 4 '15 at 18:03 by J Collins


























