Intermittent System Lockup with Tensorflow / GPU
I've been experiencing an intermittent system lockup when using tensorflow-gpu. Unfortunately, I'm not seeing anything in the syslog that tells me what's failing. I'm looking for ideas on where I might look, or what debugging I might enable, to track down the issue.
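In case it matters, by "not seeing anything in the syslog" I mean checks along these lines after each reboot (a rough sketch; it assumes persistent journald storage is enabled, otherwise the pre-crash journal doesn't survive the reset):

import subprocess

# Kernel messages from the boot before the crash. This assumes persistent
# journald storage (Storage=persistent in /etc/systemd/journald.conf, or an
# existing /var/log/journal directory); without it, the pre-crash log is lost.
result = subprocess.run(
    ["journalctl", "-k", "-b", "-1", "--no-pager"],
    stdout=subprocess.PIPE, universal_newlines=True,
)
# Print just the tail, which is where any oops or Xid message would land.
print("\n".join(result.stdout.splitlines()[-50:]))

So far nothing relevant shows up near the end of the previous boot's log.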
I was having the issue on Ubuntu 17.10 and checked the hardware as best I could (swapped video cards and memory, ran a memory test, checked for SSD errors, etc.) and didn't find any faults or a configuration that eliminated the lockups.
I've now moved to 18.04 and the issue initially appeared to go away; however, I've recently seen it happen a few times again, so it obviously isn't fully resolved.
I'm running a 14-core Intel CPU, a Titan X (Maxwell), and 64GB of RAM, with no overclocking. The lockup leaves the screen intact but frozen, with no mouse or keyboard response, and if audio was playing, the sound goes into a continuous short loop. After a few seconds the system reboots.
I have plenty of available RAM when the lockup occurs. I've seen it happen both when running the standard TensorFlow "hello world" test and in the middle of training large nets. I haven't seen it when running CPU stress tests.
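For reference, the kind of minimal test that has triggered it is roughly this (a sketch using the TF 1.x API, not my exact script; device-placement logging is turned on so you can see the op land on the GPU):

import tensorflow as tf

# Tiny GPU-eligible op; log_device_placement prints which device each op
# runs on, so the GPU should show up in the output if it's visible at all.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.matmul(a, a)
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(b))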
From the behavior I've seen, I'm assuming this is an NVIDIA driver / CUDA / TensorFlow issue, but it's possible that's just the heavy stressor that brings out an underlying problem.
I'd appreciate any ideas for places I could check or debugging information that I could enable that would help me identify the specific point of failure.
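One piece of debugging I'm considering adding myself, in case anyone has a better suggestion: a small watcher that snapshots GPU temperature, power, and utilization to disk every second, so the last reading before the hard lockup survives the reset. A rough sketch (the nvidia-smi query fields are standard; the log path is just an example):

import os
import subprocess
import time

LOG_PATH = "/var/tmp/gpu_watch.log"  # example path; anything persistent works

with open(LOG_PATH, "a") as log:
    while True:
        # One CSV line per second: timestamp, temp, power, utilization, memory.
        result = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=timestamp,temperature.gpu,power.draw,utilization.gpu,memory.used",
             "--format=csv,noheader"],
            stdout=subprocess.PIPE, universal_newlines=True,
        )
        log.write(result.stdout)
        log.flush()
        os.fsync(log.fileno())  # force it to disk so the crash doesn't eat it
        time.sleep(1)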
ubuntu nvidia debugging machine-learning
asked 4 mins ago
bivouac0