Intermittent System Lockup with Tensorflow / GPU

I've been experiencing an intermittent system lockup when using tensorflow-gpu. Unfortunately I'm not seeing anything in the syslog that tells me what's failing. I'm looking for some ideas on where I might look, or what debugging I might enable to track down the issue.



I first had the issue on Ubuntu 17.10. I checked the hardware as best I could (swapped video cards and memory, ran a memory test, checked for SSD errors, etc.) and didn't find any faults, or any configuration that eliminated the lockups.



I've now moved to 18.04 and the problem initially appeared to go away. However, I've recently seen it happen a few more times, so the issue clearly isn't fully resolved.



I'm running a 14-core Intel CPU, a Titan X (Maxwell) and 64 GB of RAM. No overclocking. The lockup leaves the screen intact but frozen, with no mouse or keyboard response, and if audio was playing, the sound goes into a continuous short loop. After a few seconds the system reboots.



I have plenty of available RAM when the lockup occurs. I've seen it happen when just running the standard tensorflow "hello world" test and in the middle of training large nets. I haven't seen it when running CPU stress tests.



From the behavior I've seen, I'm assuming this is an NVIDIA driver / CUDA / TensorFlow issue, but it's possible that the GPU work is just the heavy stressor that brings out the problem.



I'd appreciate any ideas for places I could check or debugging information that I could enable that would help me identify the specific point of failure.
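
For reference, here is a minimal sketch of the kind of logging I could leave running while TensorFlow is active. It polls `nvidia-smi` once a second and fsyncs every sample to disk, so the last readings before a lockup should survive the reboot. The field list, the log file name `gpu_lockup.log` and the helper names are my own assumptions for illustration, not anything recommended by NVIDIA or TensorFlow, and it can only show the GPU state just before the freeze, not the cause.

```python
import os
import subprocess
import time

# Fields passed to nvidia-smi --query-gpu (an assumed selection; adjust as needed).
FIELDS = "timestamp,temperature.gpu,power.draw,utilization.gpu,clocks.sm"


def read_gpu_sample():
    """Return one CSV line of GPU stats as reported by nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + FIELDS, "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
        timeout=10,
    )
    return result.stdout.strip()


def append_durably(path, line):
    """Append a line and force it to disk so it survives a hard reset."""
    with open(path, "a") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())


def monitor(path="gpu_lockup.log", interval=1.0):
    """Log one GPU sample per interval until interrupted (hypothetical helper)."""
    while True:
        append_durably(path, read_gpu_sample())
        time.sleep(interval)
```

I would call `monitor()` from a separate terminal before starting a training run and then check the tail of the log after the next reboot.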









Tags: ubuntu, nvidia, debugging, machine-learning





asked 4 mins ago by bivouac0

























