mpd daemon prematurely ending jobs
















I am trying to configure mpirun and mpiexec to run software called Materials Studio on a cluster with 1 node, 2 processors, and 12 cores. Jobs are submitted through PBS. With some help I had everything set up properly: I could submit jobs and they ran well. After a few days, however, I started getting errors like this:



mpiexec_server.org: cannot connect to local mpd (/tmp/mpd2.console_user); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)



It seemed that the mpd daemon had been started at some point but eventually terminated. I had some luck adding this (bold part) to my submission script:



export PATH=/data1/opt/MD/Linux-x86_64/IntelMPI/bin:$PATH
export LD_LIBRARY_PATH=/data1/opt/MD/Linux-x86_64/IntelMPI/lib:/data1/opt/MD/Linux-x86_64/IntelMPI/bin:/data1/opt/MD/Linux-x86_64/IntelMKL/lib
**mpdboot -n 1 -f ~/mpd.hosts**
nohup mpd &
/data1/opt/MD/Linux-x86_64/IntelMPI/bin/mpiexec -n 6 /data1/opt/MD/2.0/TaskServer/Tools/vasp5.3.3/Linux-x86_64/vasp_parallel


The job now submits and runs properly but times out after 30 minutes or so. I tried adding -r ssh to the end of the mpdboot line, but I am not sure that is the right strategy. I am also a little confused about why I need to start this daemon in the script and why I need to pass a hosts file; I thought PBS creates one when the job starts. Could anyone please give me some advice on where to go next? Basically, how can I prevent a running job from quitting because of something to do with the MPI daemon?
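On the hosts-file point, here is a minimal sketch of how the node list that PBS provides could be turned into the file that mpdboot expects. It assumes a Torque-style `$PBS_NODEFILE` (typically one line per allocated core, so hostnames repeat) and an MPD-based MPI that provides `mpdboot`, `mpiexec`, and `mpdallexit`. The helper name `write_mpd_hosts` and the `mpd.hosts` location under `$PBS_O_WORKDIR` are hypothetical, chosen only for illustration. Since mpdboot itself starts an mpd on each host it is given, the separate `nohup mpd &` line is probably not needed, though I have not verified that against this particular installation.

```shell
#!/bin/sh
# Sketch: build an mpd hosts file from the node list PBS provides.
# write_mpd_hosts is a hypothetical helper name, not part of PBS or MPD.
write_mpd_hosts() {
    nodefile=$1    # e.g. "$PBS_NODEFILE"; hostnames may repeat per core
    hostsfile=$2   # file to hand to "mpdboot -f"
    # Keep each hostname once.
    sort -u "$nodefile" > "$hostsfile"
    # Print the number of distinct hosts, suitable for "mpdboot -n".
    wc -l < "$hostsfile" | tr -d ' '
}

# Inside a PBS job this could be used roughly as follows. The lines are left
# as comments so the sketch also runs on machines without MPD installed:
#   nhosts=$(write_mpd_hosts "$PBS_NODEFILE" "$PBS_O_WORKDIR/mpd.hosts")
#   mpdboot -n "$nhosts" -f "$PBS_O_WORKDIR/mpd.hosts" -r ssh
#   mpiexec -n 6 /data1/opt/MD/2.0/TaskServer/Tools/vasp5.3.3/Linux-x86_64/vasp_parallel
#   mpdallexit   # shut the ring down when the job is done
```

Booting the ring inside the job and tearing it down with `mpdallexit` at the end means each job gets its own daemon instead of depending on one started days earlier.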



EDIT: Could anyone shed light on what is involved in running the mpiexec on the last line? If I point to the folder where it lives, do I still need to run a boot command? I must admit I am confused about why I need to run mpdboot/mpd when the whole point of mpiexec is to eliminate the need for mpd (at least according to the mpiexec website).
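Part of the confusion may be that several different programs are called mpiexec. The one in an MPD-based MPI needs a running mpd, while newer Intel MPI and MPICH releases also ship a Hydra launcher, `mpiexec.hydra`, which does not use an mpd ring. Whether the Intel MPI bundled under `/data1/opt/MD` includes it, and whether this VASP binary works with it, are assumptions that would need checking. A minimal sketch of that check, with `pick_launcher` as a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: choose a launcher from an MPI bin directory.
# pick_launcher is a hypothetical helper, not an Intel MPI command.
pick_launcher() {
    bindir=$1
    if [ -x "$bindir/mpiexec.hydra" ]; then
        # Hydra starts its own helper processes; no mpd ring is required.
        echo "$bindir/mpiexec.hydra"
    else
        # MPD-based mpiexec: an mpd must already be running (see mpdboot).
        echo "$bindir/mpiexec"
    fi
}

launcher=$(pick_launcher /data1/opt/MD/Linux-x86_64/IntelMPI/bin)
echo "would launch with: $launcher"
```

If the function falls back to plain `mpiexec`, the boot command is still required before launching.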










  • I guess I am a little confused about why I need to run mpdboot and mpd in the first place. It seems like only the latest and greatest Intel compiler suggests doing this. Is there a way to revert to the previous behavior, as in, say, MPI 3.2, which I am told this code was compiled against? Thanks again!

    – sjensen
    Jun 10 '13 at 0:44















job-control cluster timeout mpi






edited Jan 13 at 22:13 by Rui F Ribeiro
asked Jun 8 '13 at 13:37 by sjensen






















1 Answer














I'm running an MD simulation, but when I try to run it in DL_POLY the simulation does not start. I used these commands:



$ ps aux | grep mpd
$ nohup mpd > mpd.out 2> mpd.err < /dev/null/ &
$ mpiexec -n 4 DLPOLY.X >> job.out 2> job.err < /dev/null &
$ top


When I use the last command to look at the processes, DL_POLY does not appear. Meanwhile, the ll command shows that mpd.out has a size of zero. I don't know why.
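One thing worth checking, offered as a possible cause rather than a confirmed diagnosis: the mpd line redirects input from `/dev/null/` with a trailing slash. If that is what was actually typed and not a typo in the post, the shell cannot open that path, so mpd is never started, yet mpd.out has already been created and stays empty. The sketch below reproduces the pattern with `echo` standing in for mpd; the temporary directory and file names are made up for the demonstration.

```shell
#!/bin/sh
# Sketch: a failed input redirection keeps the command from running,
# but the output file named earlier on the line is still created, empty.
workdir=$(mktemp -d)

# "echo started" stands in for mpd; the redirections follow the same
# order as in the mpd command line above.
if echo "started" > "$workdir/out.txt" 2> "$workdir/err.txt" < /dev/null/; then
    status="ran"
else
    status="not run"
fi

echo "command status: $status"
if [ -s "$workdir/out.txt" ]; then
    echo "out.txt has content"
else
    echo "out.txt is empty"
fi
```

On a typical Linux system this should report that the command did not run and that out.txt is empty, which matches the symptom described. Running `ps aux | grep mpd` after starting the daemon, not before, would show whether mpd really came up before mpiexec is called.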














edited May 29 '14 at 11:43 by slm
answered May 29 '14 at 11:20 by Majid


























