Rocks cluster slave spontaneous reset?

I am 'administering' a small cluster (4 nodes) based on Rocks Cluster. After a recent restart it appears that the slave nodes have all decided to spontaneously reinstall their operating systems, wiping their whole configuration, InfiniBand support, installed software, etc.

I cannot fathom why the system might have done this, and it is quite unhelpful. Has anyone had this happen before? What could have caused it?

And for the kicker, since I'm probably resigned to rebuilding the nodes to the spec they should have had, how does one back up the slaves once they're in a working state?

Additional info:

The head node also seems to be largely incapable of reaching the internet, based on attempted pings. It also cannot seem to ping the local DNS address (192.168.0.1).

cluster

edited Mar 18 at 4:05
Rui F Ribeiro

asked Apr 28 '15 at 17:12
J Collins

1 Answer

It turns out that, at least in some cases, the default is for Rocks to reinstall the slave nodes' operating system at every boot (1). Presumably the intent is that clusters are always on, so a restart probably means that changes were made which would benefit from a reinstall. For a casually used system this is not appropriate, since it is unlikely that all of the post-install scripts are configured to complete a full reinstallation. The way to avoid this reinstall is to execute:

rocks run host compute "chkconfig rocks-grub off"


This runs the chkconfig command on every slave node in the 'compute' group, disabling the reinstall-on-boot behaviour.
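
To confirm the change took effect (a minimal check, assuming the compute nodes use the chkconfig-based init setup that stock Rocks ships with), you can list the service's runlevel settings on every node from the head node:

# Show rocks-grub's runlevel settings on each compute node
rocks run host compute "chkconfig --list rocks-grub"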



In my case the slave nodes were set to boot from the local drive first, which normally avoids the auto-reinstall. I believe what triggered it was a forced power-off corrupting the local disk, so that on the next boot the corrupted disk would not start and control fell through to PXE boot, which received a reinstall instruction from the head node. The forced power-off happened because something unknown was interrupting "shutdown now" when it was run on the slaves, and physically cutting the power was all that would shut them down. I now use "shutdown -h now", which seems to overcome whatever interrupted the vanilla shutdown.
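
For reference, run from the head node and assuming the slaves all sit in the default 'compute' appliance group, the clean shutdown sequence would look roughly like this:

# Halt every compute node cleanly first
rocks run host compute "shutdown -h now"
# Then halt the head node itself once the slaves have powered off
shutdown -h now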






answered May 4 '15 at 18:03
J Collins
