What does the C++ compiler do to ensure that different but adjacent memory locations are safe to be used on different threads?

Let's say I have a struct:



struct Foo {
    char a; // read and written to by thread 1 only
    char b; // read and written to by thread 2 only
};


Now, from what I understand, the C++ standard guarantees that the above is safe when the two threads operate on the two different memory locations.



I would think, though, that since char a and char b fall within the same cache line, the compiler has to do extra syncing.



What exactly happens here?

c++ multithreading thread-safety

asked Nov 19 at 14:32 by Nathan Doromal

  • On a lot of platforms (for example, x86), it doesn't have to do anything. It just works (meaning the hardware does the necessary extra work). – geza, Nov 19 at 14:35

  • Yes. But the exact hit could vary across different generations/vendors of CPU. Do a search on "false sharing". – geza, Nov 19 at 14:40

  • This is handled by the hardware, not the compiler, as far as I am aware. This is called false sharing. – NathanOliver, Nov 19 at 14:40

  • I think the only CPUs that C++ has actually been implemented on where the compiler would have to do anything special to support the C++ memory model are early Alpha CPUs, which lacked instructions that could atomically set a single byte (or 16-bit) memory location. See Peter Cordes' answer to a related question for details: stackoverflow.com/a/46818162/3826372 As far as I know, no compiler implementation has been updated to support the C++11 memory model on these long-obsolete Alpha CPUs. – Ross Ridge, Nov 19 at 19:12

  • @RossRidge - well, there's the even more obsolete TMS9900 that has the same issue (only with 8-bit values, as it assumes 16-bit alignment) -- there's a port of gcc 3.4 to the architecture, but I don't know whether any particular C++ variant is supported. I'm also not aware of any extant dual-processor TMS9900 machines that would ever actually see this issue. The TMS9900 has instructions that operate on single bytes, but the bus implementation always fetches both bytes of a 16-bit word and rewrites the unchanged one. – Jules, Nov 20 at 9:40

2 Answers

Accepted answer (30 votes), answered Nov 19 at 15:11 by SergeyA, edited Nov 19 at 21:26:

This is hardware-dependent. On hardware I am familiar with, C++ doesn't have to do anything special, because from a hardware perspective, accessing different bytes, even within the same cache line, is handled 'transparently'. From the hardware's point of view, this situation is not really different from



char a[2];
// or
char a, b;


In the cases above, we are talking about two adjacent objects, which are guaranteed to be independently accessible.



However, I've put 'transparently' in quotes for a reason. When you really have a case like that, you can suffer (performance-wise) from 'false sharing', which happens when two (or more) threads access adjacent memory simultaneously and it ends up cached in several CPUs' caches, leading to constant cache-line invalidation. In real life, care should be taken to prevent this from happening when possible.
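
To make the effect concrete, here is a minimal sketch (my own construction, not code from the answer; the struct name and iteration count are arbitrary): two threads hammer adjacent bytes of one struct, exactly the layout from the question. Timing it against a variant where the two bytes sit on different cache lines typically shows a large difference on multi-core hardware.

#include <thread>

struct Shared {
    char a; // written by thread 1 only
    char b; // written by thread 2 only -- same cache line as a
};

int main() {
    Shared s{};
    auto bump = [](volatile char* p) {
        // volatile keeps the per-iteration load/store from being optimized away
        for (long i = 0; i < 100000000; ++i)
            *p = *p + 1;
    };
    std::thread t1(bump, &s.a); // no data race: a and b are distinct memory locations
    std::thread t2(bump, &s.b); // ...but the cache line ping-pongs between cores
    t1.join();
    t2.join();
}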






  • "Care should be taken to prevent this from happening when possible." How would you suggest one go about doing that? – ArtB, Nov 19 at 18:52

  • @ArtB there is no hard and fast rule. Designing the program correctly from scratch is always the best approach. You can also try profiling tools, such as valgrind, and analyze the number of cache misses. – SergeyA, Nov 19 at 18:58

  • @ArtB - it can be done by adding padding members to structs to separate fields into different cache lines. There are plenty of papers/blog posts/etc. out there discussing how to measure whether you've got a problem and then how to ameliorate it. – davidbak, Nov 19 at 21:23

  • @ArtB: C++17 provides interference sizes to help guide such design. – Davis Herring, Nov 20 at 1:42
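
To illustrate the last two comments, here is a hedged sketch of the padding/alignment approach applied to the question's struct. It assumes a standard library that actually ships the C++17 constant std::hardware_destructive_interference_size (from <new>); where it is unavailable, 64 is a common hard-coded substitute.

#include <new> // std::hardware_destructive_interference_size (C++17)

struct Foo {
    // Over-aligning each member pushes a and b onto different cache
    // lines, trading a little memory for freedom from false sharing.
    alignas(std::hardware_destructive_interference_size) char a;
    alignas(std::hardware_destructive_interference_size) char b;
};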

















Answer (18 votes), answered Nov 19 at 15:55 by Arne Vogel:

As others have explained, nothing in particular happens on common hardware. However, there is a catch: the compiler must refrain from performing certain optimizations unless it can prove that other threads don't access the memory locations in question, e.g.:



#include <array>
#include <cstdint>

std::array<std::uint8_t, 8u> c;

void f()
{
    c[0] ^= 0xfa;
    c[3] ^= 0x10;
    c[6] ^= 0x8b;
    c[7] ^= 0x92;
}



Here, in a single-threaded memory model, the compiler could emit code like the following (pseudo-assembly; assumes little-endian hardware):



load r0, *(std::uint64_t *) &c[0]
xor r0, 0x928b0000100000fa
store r0, *(std::uint64_t *) &c[0]


This is likely to be faster on common hardware than xor'ing the individual bytes. However, it reads and writes the unaffected (and unmentioned) elements of c at indices 1, 2, 4 and 5. If other threads are writing to these memory locations concurrently, these changes could be overwritten.



For this reason, optimizations like these are often unusable in a multi-threaded memory model. As long as the compiler performs only loads and stores of matching length, or merges accesses only when there is no gap (e.g. the accesses to c[6] and c[7] can still be merged), the hardware commonly already provides the necessary guarantees for correct execution.
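
For contrast, here is a hedged sketch (my own; the function name f_merged_tail is hypothetical, and it reuses the c declared above) of the gap-free merge that remains legal: c[6] and c[7] are adjacent bytes this function already owns, so combining them into one 16-bit read-modify-write invents no reads or writes of bytes another thread might be using. The 0x928b constant assumes little-endian layout, as the pseudo-assembly above does.

#include <cstring> // std::memcpy

void f_merged_tail()
{
    std::uint16_t w;
    std::memcpy(&w, &c[6], sizeof w); // one 16-bit load covering c[6] and c[7]
    w ^= 0x928b;                      // low byte 0x8b -> c[6], high byte 0x92 -> c[7]
    std::memcpy(&c[6], &w, sizeof w); // one 16-bit store back
}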



(That said, there are/have been some architectures with weak and counterintuitive memory-ordering guarantees; e.g., DEC Alpha does not track pointers as a data dependency in the way other architectures do, so in some low-level code it is necessary to introduce an explicit memory barrier. There is a somewhat well-known little rant by Linus Torvalds on this issue. However, a conforming C++ implementation is expected to shield you from such issues.)






  • It's not the optimizations that are unsafe in a multithreaded model, but the code itself. Unless the whole thing is protected by a mutex, code that reads and writes those locations from different threads is already invalid, and whatever happens happens. This means that the compiler is indeed free to optimize this code into a single read-write operation, because no valid code would be able to tell the difference. – 6502, Nov 21 at 7:24

  • @6502 As I said, I'm assuming another thread could access other elements of c, i.e. c[i] where i is one of 1, 2, 4, 5. You seem to claim that any access to c from another thread is a data race, but that is not so. To quote [intro.memory]/3: "A memory location is either an object of scalar type or a maximal sequence of adjacent bit-fields all having nonzero width." I.e., c as a whole is not a memory location, because it is neither a scalar nor a bit-field; each element is its own memory location. – Arne Vogel, yesterday

  • An element of an array of scalar type is an object of scalar type. A thread reading elements 0 and 1 has no concurrency problem with a thread writing element 2, even if it uses a single machine instruction. A thread reading elements 0 and 1 does have a problem with a thread writing element 1, even using separate instructions, if there is no mutex. Of course, code reading elements 0 and 1 cannot use a single instruction that reads, e.g., 4 elements and then drops the extra data, because this could indeed introduce bad reads when another thread writes to the part that is going to be discarded. – 6502, yesterday
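
A small sketch of the granularity these comments argue about (my own example; the array name arr is arbitrary): distinct elements of an array are distinct memory locations, so these two threads do not race, although, as the accepted answer notes, they may still false-share a cache line.

#include <array>
#include <cstdint>
#include <thread>

std::array<std::uint8_t, 8u> arr;

int main() {
    std::thread t1([] { arr[0] ^= 0xfa; }); // touches only arr[0]
    std::thread t2([] { arr[3] ^= 0x10; }); // touches only arr[3]
    t1.join();  // well-defined: no two threads access the same
    t2.join();  // memory location without synchronization
}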









