Intermittent System Lockup with TensorFlow / GPU

I've been experiencing an intermittent system lockup when using tensorflow-gpu. Unfortunately, I'm not seeing anything in the syslog that tells me what's failing. I'm looking for ideas on where to check, or what debugging I might enable, to track down the issue.
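To give an idea of what I've been checking so far: after a reboot I look back at the previous boot's error-level messages, roughly like the sketch below (a Python wrapper around journalctl; the -b -1 / -p err options are standard, but this only shows anything if persistent journaling is enabled, i.e. /var/log/journal exists).

    import subprocess

    # Dump error-priority messages from the previous boot (the one that locked up).
    # "-b -1" selects the previous boot; "-p err" keeps only err and above.
    log = subprocess.run(
        ["journalctl", "-b", "-1", "-p", "err", "--no-pager"],
        capture_output=True, text=True, check=False,
    )
    print(log.stdout or "(no error-level messages recorded for the previous boot)")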



I was having the issue on Ubuntu 17.10, and I checked the hardware as best I could (swapped video cards and memory, ran memory tests, checked for SSD errors, etc.) but didn't find any problems or a configuration that eliminated the lockups.



I've now moved to 18.04 and the problem initially appeared to go away; however, I've recently seen it happen a few times again, so the issue obviously isn't fully resolved.



I'm running an Intel 14-core CPU, a Titan X (Maxwell), and 64 GB of RAM, with no overclocking. The lockup leaves the screen intact but frozen, with no mouse or keyboard response, and if audio was playing, the sound goes into a continuous short loop. After a few seconds the system reboots.



I have plenty of available RAM when the lockup occurs. I've seen it happen while just running the standard TensorFlow "hello world" test as well as in the middle of training large nets. I haven't seen it when running CPU stress tests.
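For reference, the "hello world" test I mean is essentially the snippet below (this assumes the TensorFlow 1.x API that tensorflow-gpu currently ships with); on a GPU build even this much initializes CUDA and grabs GPU memory.

    import tensorflow as tf

    # Minimal TensorFlow 1.x "hello world"; with tensorflow-gpu, creating the
    # Session initializes CUDA and allocates GPU memory, which is enough to
    # trigger the lockup in some cases.
    hello = tf.constant("Hello, TensorFlow!")
    with tf.Session() as sess:
        print(sess.run(hello))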



From the behavior I've seen, I'm assuming this is an NVIDIA driver / CUDA / TensorFlow issue, but it's possible that's just the heavy stressor that brings out the problem.
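To at least narrow that down, something like the sketch below could log GPU temperature, power, and utilization to disk while training, so the last sample before a lockup survives the reboot (the nvidia-smi query fields are standard ones; the one-second interval and the log path are just example choices on my part).

    import os
    import subprocess
    import time

    # Append one GPU health sample per second; flush/fsync so the last line
    # survives a hard reset. The log path below is arbitrary.
    FIELDS = "timestamp,temperature.gpu,power.draw,utilization.gpu,memory.used"
    with open("/var/tmp/gpu_watch.csv", "a") as f:
        while True:
            sample = subprocess.run(
                ["nvidia-smi", "--query-gpu=" + FIELDS, "--format=csv,noheader"],
                capture_output=True, text=True, check=False,
            )
            f.write(sample.stdout)
            f.flush()
            os.fsync(f.fileno())
            time.sleep(1)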



I'd appreciate any ideas for places to check, or debugging I could enable, that would help me identify the specific point of failure.
Tags: ubuntu nvidia debugging machine-learning