Intermittent System Lockup with Tensorflow / GPU
I've been experiencing an intermittent system lockup when using tensorflow-gpu. Unfortunately, I'm not seeing anything in the syslog that tells me what's failing. I'm looking for ideas on where I might look, or what debugging I might enable, to track down the issue.
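In case it matters, by "not seeing anything in the syslog" I mean checks along these lines after each reboot (a rough sketch; it assumes persistent journald storage is enabled, otherwise the pre-crash journal doesn't survive the reset):

import subprocess

# Kernel messages from the boot before the crash. This assumes persistent
# journald storage (Storage=persistent in /etc/systemd/journald.conf, or an
# existing /var/log/journal directory); without it, the pre-crash log is lost.
result = subprocess.run(
    ["journalctl", "-k", "-b", "-1", "--no-pager"],
    stdout=subprocess.PIPE, universal_newlines=True,
)
# Print just the tail, which is where any oops or Xid message would land.
print("\n".join(result.stdout.splitlines()[-50:]))

So far nothing relevant shows up near the end of the previous boot's log.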
I was having the issue on Ubuntu 17.10 and checked the hardware as best I could (swapped video cards and memory, ran a memory test, checked for SSD errors, etc.) and didn't find any faults or a configuration that eliminated the lockups.
I've now moved to 18.04 and the issue initially appeared to go away; however, I've recently seen it happen a few times again, so it obviously isn't fully resolved.
I'm running a 14-core Intel CPU, a Titan X (Maxwell), and 64GB of RAM, with no overclocking. The lockup leaves the screen intact but frozen, with no mouse or keyboard response, and if audio was playing, the sound goes into a continuous short loop. After a few seconds the system reboots.
I have plenty of available RAM when the lockup occurs. I've seen it happen both when running the standard TensorFlow "hello world" test and in the middle of training large nets. I haven't seen it when running CPU stress tests.
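For reference, the kind of minimal test that has triggered it is roughly this (a sketch using the TF 1.x API, not my exact script; device-placement logging is turned on so you can see the op land on the GPU):

import tensorflow as tf

# Tiny GPU-eligible op; log_device_placement prints which device each op
# runs on, so the GPU should show up in the output if it's visible at all.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.matmul(a, a)
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(b))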
From the behavior I've seen, I'm assuming this is an NVIDIA driver / CUDA / TensorFlow issue, but it's possible that's just the heavy stressor that brings out an underlying problem.
I'd appreciate any ideas for places I could check or debugging information that I could enable that would help me identify the specific point of failure.
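One piece of debugging I'm considering adding myself, in case anyone has a better suggestion: a small watcher that snapshots GPU temperature, power, and utilization to disk every second, so the last reading before the hard lockup survives the reset. A rough sketch (the nvidia-smi query fields are standard; the log path is just an example):

import os
import subprocess
import time

LOG_PATH = "/var/tmp/gpu_watch.log"  # example path; anything persistent works

with open(LOG_PATH, "a") as log:
    while True:
        # One CSV line per second: timestamp, temp, power, utilization, memory.
        result = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=timestamp,temperature.gpu,power.draw,utilization.gpu,memory.used",
             "--format=csv,noheader"],
            stdout=subprocess.PIPE, universal_newlines=True,
        )
        log.write(result.stdout)
        log.flush()
        os.fsync(log.fileno())  # force it to disk so the crash doesn't eat it
        time.sleep(1)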
ubuntu nvidia debugging machine-learning
asked 4 mins ago
bivouac0