Concurrent quicksort in C++

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP











up vote
1
down vote

favorite












I have a task to write a concurrent version of quicksort algorithm, to speed it up. This is what I wrote:



template <class T>
void quicksort::sort(T *data, int len)
thread_count = 0;
int p = 0;
int r = len - 1;
_sort(data, p, r);


template <class T>
void quicksort::_swap(T *data, int first, int second)
auto temp = data[first];
data[first] = data[second];
data[second] = temp;


template <class T>
void quicksort::_sort(T *data, int p, int r)
thread_count++;
if (p < r)
auto q = _partition(data, p, r);
if (thread_count >= thread_limit)
_sort(data, p, q - 1);
_sort(data, q + 1, r);
else
std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
std::thread upper(&quicksort::_sort<T>, this, data, q + 1, r);
lower.join();
upper.join();


thread_count--;


template <class T>
int quicksort::_partition(T *data, int p, int r)
auto first = data[p];
auto last = data[r];
T pivot = 0;
// Median of three
auto middle = data[(p + 2) / 2];
if (first > middle)
if (middle > last)
pivot = middle;
else
pivot = last;

else
if (middle > last)
pivot = std::max(first, last);
else
pivot = middle;


auto i = p - 1;
for (auto j = p; j < r; ++j)
if (data[j] <= pivot)
_swap(data, ++i, j);


_swap(data, i + 1, r);
return i + 1;



Those are methods from class quicksort, which also has std::atomic<int> thread_limit and std::atomic<int> thread_count.



So f.e I cap the thread limit to f.e maximum 1000 threads.



The part I am particularly not sure of is the part in _sort() function. Basically, I don't know if my way of starting those threads and then joining them is the right way to do it.



I have a benchmark method, and I benchmarked my quicksort implementation like this: I took the average of 100 benchmarks for every thread_limit from 200 to 2000, stepping by 100 (so 200, 300, 400) and for single-threaded quicksort of mine. And this was for 1.000.000-sized arrays (every time I was creating a new one - this overhead was not measured). The result was shocking for me. It occurred, that of course, even with 2000 threads it was faster than with only 1 thread, but still, the fastest one was with thread_limit = 200. I thought that the optimum would be somewhere nearer the middle.



Sometimes f.e 700 threads were faster than 600 threads, but still, the general rule was: the more threads, the slower.



Do you guys see any flaws in my code or in my understanding of the benchmark I made?










share|improve this question







New contributor




Frynio is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.



















  • Having more threads than are present in your hardware (likely 4 or 8 for typical laptops, possibly 12 for desktop) is not beneficial.
    – 1201ProgramAlarm
    5 hours ago










  • Okay. I have two cores and 4 threads in my laptop. Does it mean, that I actually should cap on 4 threads? Now I actually measured for thread_limit ranging from 20 to 800, and I still see that the more threads, the slower. 20 threads are faster than 40 etc.
    – Frynio
    4 hours ago










  • @Frynio, more threads than hardware supports means more context switching. Context switches mean slower performance, and perhaps something else I'm unaware of (haven't designed or studied in depth any OS yet).
    – Incomputable
    4 hours ago










  • Every switch adds overhead. So if the overhead takes longer than the parallellisation improves, it's not helping.
    – Mast
    2 hours ago










  • You benchmarked your code, good. Did you verify it's output as well? As in, you know how long it takes, but does it work correctly?
    – Mast
    2 hours ago














up vote
1
down vote

favorite












I have a task to write a concurrent version of quicksort algorithm, to speed it up. This is what I wrote:



template <class T>
void quicksort::sort(T *data, int len)
thread_count = 0;
int p = 0;
int r = len - 1;
_sort(data, p, r);


template <class T>
void quicksort::_swap(T *data, int first, int second)
auto temp = data[first];
data[first] = data[second];
data[second] = temp;


template <class T>
void quicksort::_sort(T *data, int p, int r)
thread_count++;
if (p < r)
auto q = _partition(data, p, r);
if (thread_count >= thread_limit)
_sort(data, p, q - 1);
_sort(data, q + 1, r);
else
std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
std::thread upper(&quicksort::_sort<T>, this, data, q + 1, r);
lower.join();
upper.join();


thread_count--;


template <class T>
int quicksort::_partition(T *data, int p, int r)
auto first = data[p];
auto last = data[r];
T pivot = 0;
// Median of three
auto middle = data[(p + 2) / 2];
if (first > middle)
if (middle > last)
pivot = middle;
else
pivot = last;

else
if (middle > last)
pivot = std::max(first, last);
else
pivot = middle;


auto i = p - 1;
for (auto j = p; j < r; ++j)
if (data[j] <= pivot)
_swap(data, ++i, j);


_swap(data, i + 1, r);
return i + 1;



Those are methods from class quicksort, which also has std::atomic<int> thread_limit and std::atomic<int> thread_count.



So f.e I cap the thread limit to f.e maximum 1000 threads.



The part I am particularly not sure of is the part in _sort() function. Basically, I don't know if my way of starting those threads and then joining them is the right way to do it.



I have a benchmark method, and I benchmarked my quicksort implementation like this: I took the average of 100 benchmarks for every thread_limit from 200 to 2000, stepping by 100 (so 200, 300, 400) and for single-threaded quicksort of mine. And this was for 1.000.000-sized arrays (every time I was creating a new one - this overhead was not measured). The result was shocking for me. It occurred, that of course, even with 2000 threads it was faster than with only 1 thread, but still, the fastest one was with thread_limit = 200. I thought that the optimum would be somewhere nearer the middle.



Sometimes f.e 700 threads were faster than 600 threads, but still, the general rule was: the more threads, the slower.



Do you guys see any flaws in my code or in my understanding of the benchmark I made?










share|improve this question







New contributor




Frynio is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.



















  • Having more threads than are present in your hardware (likely 4 or 8 for typical laptops, possibly 12 for desktop) is not beneficial.
    – 1201ProgramAlarm
    5 hours ago










  • Okay. I have two cores and 4 threads in my laptop. Does it mean, that I actually should cap on 4 threads? Now I actually measured for thread_limit ranging from 20 to 800, and I still see that the more threads, the slower. 20 threads are faster than 40 etc.
    – Frynio
    4 hours ago










  • @Frynio, more threads than hardware supports means more context switching. Context switches mean slower performance, and perhaps something else I'm unaware of (haven't designed or studied in depth any OS yet).
    – Incomputable
    4 hours ago










  • Every switch adds overhead. So if the overhead takes longer than the parallellisation improves, it's not helping.
    – Mast
    2 hours ago










  • You benchmarked your code, good. Did you verify it's output as well? As in, you know how long it takes, but does it work correctly?
    – Mast
    2 hours ago












up vote
1
down vote

favorite









up vote
1
down vote

favorite











I have a task to write a concurrent version of quicksort algorithm, to speed it up. This is what I wrote:



template <class T>
void quicksort::sort(T *data, int len)
thread_count = 0;
int p = 0;
int r = len - 1;
_sort(data, p, r);


template <class T>
void quicksort::_swap(T *data, int first, int second)
auto temp = data[first];
data[first] = data[second];
data[second] = temp;


template <class T>
void quicksort::_sort(T *data, int p, int r)
thread_count++;
if (p < r)
auto q = _partition(data, p, r);
if (thread_count >= thread_limit)
_sort(data, p, q - 1);
_sort(data, q + 1, r);
else
std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
std::thread upper(&quicksort::_sort<T>, this, data, q + 1, r);
lower.join();
upper.join();


thread_count--;


template <class T>
int quicksort::_partition(T *data, int p, int r)
auto first = data[p];
auto last = data[r];
T pivot = 0;
// Median of three
auto middle = data[(p + 2) / 2];
if (first > middle)
if (middle > last)
pivot = middle;
else
pivot = last;

else
if (middle > last)
pivot = std::max(first, last);
else
pivot = middle;


auto i = p - 1;
for (auto j = p; j < r; ++j)
if (data[j] <= pivot)
_swap(data, ++i, j);


_swap(data, i + 1, r);
return i + 1;



Those are methods from class quicksort, which also has std::atomic<int> thread_limit and std::atomic<int> thread_count.



So f.e I cap the thread limit to f.e maximum 1000 threads.



The part I am particularly not sure of is the part in _sort() function. Basically, I don't know if my way of starting those threads and then joining them is the right way to do it.



I have a benchmark method, and I benchmarked my quicksort implementation like this: I took the average of 100 benchmarks for every thread_limit from 200 to 2000, stepping by 100 (so 200, 300, 400) and for single-threaded quicksort of mine. And this was for 1.000.000-sized arrays (every time I was creating a new one - this overhead was not measured). The result was shocking for me. It occurred, that of course, even with 2000 threads it was faster than with only 1 thread, but still, the fastest one was with thread_limit = 200. I thought that the optimum would be somewhere nearer the middle.



Sometimes f.e 700 threads were faster than 600 threads, but still, the general rule was: the more threads, the slower.



Do you guys see any flaws in my code or in my understanding of the benchmark I made?










share|improve this question







New contributor




Frynio is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.











I have a task to write a concurrent version of quicksort algorithm, to speed it up. This is what I wrote:



template <class T>
void quicksort::sort(T *data, int len)
thread_count = 0;
int p = 0;
int r = len - 1;
_sort(data, p, r);


template <class T>
void quicksort::_swap(T *data, int first, int second)
auto temp = data[first];
data[first] = data[second];
data[second] = temp;


template <class T>
void quicksort::_sort(T *data, int p, int r)
thread_count++;
if (p < r)
auto q = _partition(data, p, r);
if (thread_count >= thread_limit)
_sort(data, p, q - 1);
_sort(data, q + 1, r);
else
std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
std::thread upper(&quicksort::_sort<T>, this, data, q + 1, r);
lower.join();
upper.join();


thread_count--;


template <class T>
int quicksort::_partition(T *data, int p, int r)
auto first = data[p];
auto last = data[r];
T pivot = 0;
// Median of three
auto middle = data[(p + 2) / 2];
if (first > middle)
if (middle > last)
pivot = middle;
else
pivot = last;

else
if (middle > last)
pivot = std::max(first, last);
else
pivot = middle;


auto i = p - 1;
for (auto j = p; j < r; ++j)
if (data[j] <= pivot)
_swap(data, ++i, j);


_swap(data, i + 1, r);
return i + 1;



Those are methods from class quicksort, which also has std::atomic<int> thread_limit and std::atomic<int> thread_count.



So f.e I cap the thread limit to f.e maximum 1000 threads.



The part I am particularly not sure of is the part in _sort() function. Basically, I don't know if my way of starting those threads and then joining them is the right way to do it.



I have a benchmark method, and I benchmarked my quicksort implementation like this: I took the average of 100 benchmarks for every thread_limit from 200 to 2000, stepping by 100 (so 200, 300, 400) and for single-threaded quicksort of mine. And this was for 1.000.000-sized arrays (every time I was creating a new one - this overhead was not measured). The result was shocking for me. It occurred, that of course, even with 2000 threads it was faster than with only 1 thread, but still, the fastest one was with thread_limit = 200. I thought that the optimum would be somewhere nearer the middle.



Sometimes f.e 700 threads were faster than 600 threads, but still, the general rule was: the more threads, the slower.



Do you guys see any flaws in my code or in my understanding of the benchmark I made?







c++ multithreading quick-sort






share|improve this question







New contributor




Frynio is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.











share|improve this question







New contributor




Frynio is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









share|improve this question




share|improve this question






New contributor




Frynio is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









asked 5 hours ago









Frynio

1083




1083




New contributor




Frynio is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.





New contributor





Frynio is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.






Frynio is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.











  • Having more threads than are present in your hardware (likely 4 or 8 for typical laptops, possibly 12 for desktop) is not beneficial.
    – 1201ProgramAlarm
    5 hours ago










  • Okay. I have two cores and 4 threads in my laptop. Does it mean, that I actually should cap on 4 threads? Now I actually measured for thread_limit ranging from 20 to 800, and I still see that the more threads, the slower. 20 threads are faster than 40 etc.
    – Frynio
    4 hours ago










  • @Frynio, more threads than hardware supports means more context switching. Context switches mean slower performance, and perhaps something else I'm unaware of (haven't designed or studied in depth any OS yet).
    – Incomputable
    4 hours ago










  • Every switch adds overhead. So if the overhead takes longer than the parallellisation improves, it's not helping.
    – Mast
    2 hours ago










  • You benchmarked your code, good. Did you verify it's output as well? As in, you know how long it takes, but does it work correctly?
    – Mast
    2 hours ago
















  • Having more threads than are present in your hardware (likely 4 or 8 for typical laptops, possibly 12 for desktop) is not beneficial.
    – 1201ProgramAlarm
    5 hours ago










  • Okay. I have two cores and 4 threads in my laptop. Does it mean, that I actually should cap on 4 threads? Now I actually measured for thread_limit ranging from 20 to 800, and I still see that the more threads, the slower. 20 threads are faster than 40 etc.
    – Frynio
    4 hours ago










  • @Frynio, more threads than hardware supports means more context switching. Context switches mean slower performance, and perhaps something else I'm unaware of (haven't designed or studied in depth any OS yet).
    – Incomputable
    4 hours ago










  • Every switch adds overhead. So if the overhead takes longer than the parallellisation improves, it's not helping.
    – Mast
    2 hours ago










  • You benchmarked your code, good. Did you verify it's output as well? As in, you know how long it takes, but does it work correctly?
    – Mast
    2 hours ago















Having more threads than are present in your hardware (likely 4 or 8 for typical laptops, possibly 12 for desktop) is not beneficial.
– 1201ProgramAlarm
5 hours ago




Having more threads than are present in your hardware (likely 4 or 8 for typical laptops, possibly 12 for desktop) is not beneficial.
– 1201ProgramAlarm
5 hours ago












Okay. I have two cores and 4 threads in my laptop. Does it mean, that I actually should cap on 4 threads? Now I actually measured for thread_limit ranging from 20 to 800, and I still see that the more threads, the slower. 20 threads are faster than 40 etc.
– Frynio
4 hours ago




Okay. I have two cores and 4 threads in my laptop. Does it mean, that I actually should cap on 4 threads? Now I actually measured for thread_limit ranging from 20 to 800, and I still see that the more threads, the slower. 20 threads are faster than 40 etc.
– Frynio
4 hours ago












@Frynio, more threads than hardware supports means more context switching. Context switches mean slower performance, and perhaps something else I'm unaware of (haven't designed or studied in depth any OS yet).
– Incomputable
4 hours ago




@Frynio, more threads than hardware supports means more context switching. Context switches mean slower performance, and perhaps something else I'm unaware of (haven't designed or studied in depth any OS yet).
– Incomputable
4 hours ago












Every switch adds overhead. So if the overhead takes longer than the parallellisation improves, it's not helping.
– Mast
2 hours ago




Every switch adds overhead. So if the overhead takes longer than the parallellisation improves, it's not helping.
– Mast
2 hours ago












You benchmarked your code, good. Did you verify it's output as well? As in, you know how long it takes, but does it work correctly?
– Mast
2 hours ago




You benchmarked your code, good. Did you verify it's output as well? As in, you know how long it takes, but does it work correctly?
– Mast
2 hours ago










2 Answers
2






active

oldest

votes

















up vote
1
down vote



accepted










Partition - Pivot selection



There are 2 bugs in the pivot selection code, in combination allowing for $mathcalO(n^2)$ worst case runtime.



Bug #1: Wrong average calculation



I guess this is just a typo:



auto middle = data[(p + 2) / 2];


Most likely, (p + r) / 2 was intended instead, and that would fix this problem (disregarding overflow, but that could only ever happen if $p = r = textINT_MAX$. Maybe add a check for p == r and return early?).



Bug #2: Missing a case



if (first > middle) 
if (middle > last)
pivot = middle;
else
pivot = last;




This snippet doesn't handle the case correctly if $middle < first < last$. Even though first should be assigned as pivot, last will be.



The fix is simple: Change pivot = last; to pivot = std::min(first, last);.



Consequences



These two bugs allow for runtime complexity to degrade to $mathcalO(n^2)$, because they introduce a huge bias towards last as pivot value:



For any call with p >= 4, middle will have the value of an element to the left of data[p], thus being guaranteed to be smaller than first or last due to previous runs of _partition. So `last will be chosen as pivot, due to bug #2.



For any call with p == 2 or p == 3, middle will be assigned the value of data[1], which is equal to first. In that case, pivot will always be assigned the value of middle/first.



For any call with p == 0 or p == 1, the median calculation will work with a slight bias towards last (due to bug #2).



Partition - Moving values



There seems to be an assumption that data[r] == pivot. This might have been a good assumption before due to bug #2, but that isn't necessarily the case once that bug is fixed. There a multiple ways to fix that.



Also, technically it isn't necessary to move values around that are equal to the pivot value: It doesn't actually matter whether they are left or right of the final pivot position.



Multithreading



The current multithreading solution isn't actually that efficient at using threads: Nearly half the threads at a time are waiting for the others to finish. (The first thread is waiting on threads #2 and #3 it spawned, they then are waiting on threads #4-7, ...)



As mentioned in the comments, this could be fixed by letting the original thread keep working on one of the subsections before waiting, i.e. changing this:



 std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
std::thread upper(&quicksort::_sort<T>, this, data, q + 1, r);
lower.join();
upper.join();


to:



 std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
_sort(data, q + 1, r);
lower.join();


Another point to make is that it rarely benefits performance to have a number of threads much larger than the number of available (logical) cores. The final number is likely between #cores and 2 * #cores, but to find the best amount you'll likely need to measure.



Bug #3: No protection for concurrent calls



Since we're already in the multithreaded environment, what happens if a second thread calls quicksort::sort after the first thread, but before that call finished?



It resets thread_count, thus allowing for the creation of more threads than intended. Once they are finished, thread_count will be negative! This might cause problems if that case isn't expected.



Solution: Either don't reset thread_count, or wait until it reaches zero before starting to work on the next dataset.






share|improve this answer




















  • If you're after the best quicksort performance, you might want to look into sentinel-based partition (e.g. this talk, around minute 28) and/or parallel partition algorithms.
    – hoffmale
    48 mins ago











  • Jeez, thanks for the input. And about Bug #3 - I call sort() in a loop, yes, but if I join every thread, then I don't think there will be situation, where I call another sort() if the previous did not finish
    – Frynio
    10 mins ago






  • 1




    @Frynio: A loop on the same thread doesn't matter (one call will have finished before the next starts). However, if there ever would be two threads calling sort concurrently, then you would get issues.
    – hoffmale
    7 mins ago










  • Yes, that's true. Fortunately that's not the case
    – Frynio
    5 mins ago

















up vote
1
down vote













  • Thread creation is an expensive operation, and you do it way more than needed. Even though the number of threads running at any given time is capped, the threads are constantly joined and recreated over and over again (I run your code with cap of 10, over a 1000000-strong array, and counted 50 thread creations). Use a thread pool instead.


  • There is no need to start two threads. The calling thread may very well continue sorting one of the halves.


  • As said in comments, using more threads than hardware supports will cause slowdown. The software threads are beneficial only when they have natural blocking points (due to IO or synchronization). Sorting is not the case here.






share|improve this answer




















  • Great points! So after fixing the first two problems you mentioned, should I actually limit the threads only to match my hardware threads?
    – Frynio
    1 hour ago










Your Answer




StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);






Frynio is a new contributor. Be nice, and check out our Code of Conduct.









 

draft saved


draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f206438%2fconcurrent-quicksort-in-c%23new-answer', 'question_page');

);

Post as a guest






























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
1
down vote



accepted










Partition - Pivot selection



There are 2 bugs in the pivot selection code, in combination allowing for $mathcalO(n^2)$ worst case runtime.



Bug #1: Wrong average calculation



I guess this is just a typo:



auto middle = data[(p + 2) / 2];


Most likely, (p + r) / 2 was intended instead, and that would fix this problem (disregarding overflow, but that could only ever happen if $p = r = textINT_MAX$. Maybe add a check for p == r and return early?).



Bug #2: Missing a case



if (first > middle) 
if (middle > last)
pivot = middle;
else
pivot = last;




This snippet doesn't handle the case correctly if $middle < first < last$. Even though first should be assigned as pivot, last will be.



The fix is simple: Change pivot = last; to pivot = std::min(first, last);.



Consequences



These two bugs allow for runtime complexity to degrade to $mathcalO(n^2)$, because they introduce a huge bias towards last as pivot value:



For any call with p >= 4, middle will have the value of an element to the left of data[p], thus being guaranteed to be smaller than first or last due to previous runs of _partition. So `last will be chosen as pivot, due to bug #2.



For any call with p == 2 or p == 3, middle will be assigned the value of data[1], which is equal to first. In that case, pivot will always be assigned the value of middle/first.



For any call with p == 0 or p == 1, the median calculation will work with a slight bias towards last (due to bug #2).



Partition - Moving values



There seems to be an assumption that data[r] == pivot. This might have been a good assumption before due to bug #2, but that isn't necessarily the case once that bug is fixed. There a multiple ways to fix that.



Also, technically it isn't necessary to move values around that are equal to the pivot value: It doesn't actually matter whether they are left or right of the final pivot position.



Multithreading



The current multithreading solution isn't actually that efficient at using threads: Nearly half the threads at a time are waiting for the others to finish. (The first thread is waiting on threads #2 and #3 it spawned, they then are waiting on threads #4-7, ...)



As mentioned in the comments, this could be fixed by letting the original thread keep working on one of the subsections before waiting, i.e. changing this:



 std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
std::thread upper(&quicksort::_sort<T>, this, data, q + 1, r);
lower.join();
upper.join();


to:



 std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
_sort(data, q + 1, r);
lower.join();


Another point to make is that it rarely benefits performance to have a number of threads much larger than the number of available (logical) cores. The final number is likely between #cores and 2 * #cores, but to find the best amount you'll likely need to measure.



Bug #3: No protection for concurrent calls



Since we're already in the multithreaded environment, what happens if a second thread calls quicksort::sort after the first thread, but before that call finished?



It resets thread_count, thus allowing for the creation of more threads than intended. Once they are finished, thread_count will be negative! This might cause problems if that case isn't expected.



Solution: Either don't reset thread_count, or wait until it reaches zero before starting to work on the next dataset.






share|improve this answer




















  • If you're after the best quicksort performance, you might want to look into sentinel-based partition (e.g. this talk, around minute 28) and/or parallel partition algorithms.
    – hoffmale
    48 mins ago











  • Jeez, thanks for the input. And about Bug #3 - I call sort() in a loop, yes, but if I join every thread, then I don't think there will be situation, where I call another sort() if the previous did not finish
    – Frynio
    10 mins ago






  • 1




    @Frynio: A loop on the same thread doesn't matter (one call will have finished before the next starts). However, if there ever would be two threads calling sort concurrently, then you would get issues.
    – hoffmale
    7 mins ago










  • Yes, that's true. Fortunately that's not the case
    – Frynio
    5 mins ago














up vote
1
down vote



accepted










Partition - Pivot selection



There are 2 bugs in the pivot selection code, in combination allowing for $mathcalO(n^2)$ worst case runtime.



Bug #1: Wrong average calculation



I guess this is just a typo:



auto middle = data[(p + 2) / 2];


Most likely, (p + r) / 2 was intended instead, and that would fix this problem (disregarding overflow, but that could only ever happen if $p = r = textINT_MAX$. Maybe add a check for p == r and return early?).



Bug #2: Missing a case



if (first > middle) 
if (middle > last)
pivot = middle;
else
pivot = last;




This snippet doesn't handle the case correctly if $middle < first < last$. Even though first should be assigned as pivot, last will be.



The fix is simple: Change pivot = last; to pivot = std::min(first, last);.



Consequences



These two bugs allow for runtime complexity to degrade to $mathcalO(n^2)$, because they introduce a huge bias towards last as pivot value:



For any call with p >= 4, middle will have the value of an element to the left of data[p], thus being guaranteed to be smaller than first or last due to previous runs of _partition. So `last will be chosen as pivot, due to bug #2.



For any call with p == 2 or p == 3, middle will be assigned the value of data[1], which is equal to first. In that case, pivot will always be assigned the value of middle/first.



For any call with p == 0 or p == 1, the median calculation will work with a slight bias towards last (due to bug #2).



Partition - Moving values



There seems to be an assumption that data[r] == pivot. This might have been a good assumption before due to bug #2, but that isn't necessarily the case once that bug is fixed. There a multiple ways to fix that.



Also, technically it isn't necessary to move values around that are equal to the pivot value: It doesn't actually matter whether they are left or right of the final pivot position.



Multithreading



The current multithreading solution isn't actually that efficient at using threads: Nearly half the threads at a time are waiting for the others to finish. (The first thread is waiting on threads #2 and #3 it spawned, they then are waiting on threads #4-7, ...)



As mentioned in the comments, this could be fixed by letting the original thread keep working on one of the subsections before waiting, i.e. changing this:



 std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
std::thread upper(&quicksort::_sort<T>, this, data, q + 1, r);
lower.join();
upper.join();


to:



 std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
_sort(data, q + 1, r);
lower.join();


Another point to make is that it rarely benefits performance to have a number of threads much larger than the number of available (logical) cores. The final number is likely between #cores and 2 * #cores, but to find the best amount you'll likely need to measure.



Bug #3: No protection for concurrent calls



Since we're already in the multithreaded environment, what happens if a second thread calls quicksort::sort after the first thread, but before that call finished?



It resets thread_count, thus allowing for the creation of more threads than intended. Once they are finished, thread_count will be negative! This might cause problems if that case isn't expected.



Solution: Either don't reset thread_count, or wait until it reaches zero before starting to work on the next dataset.






share|improve this answer




















  • If you're after the best quicksort performance, you might want to look into sentinel-based partition (e.g. this talk, around minute 28) and/or parallel partition algorithms.
    – hoffmale
    48 mins ago











  • Jeez, thanks for the input. And about Bug #3 - I call sort() in a loop, yes, but if I join every thread, then I don't think there will be situation, where I call another sort() if the previous did not finish
    – Frynio
    10 mins ago






  • 1




    @Frynio: A loop on the same thread doesn't matter (one call will have finished before the next starts). However, if there ever would be two threads calling sort concurrently, then you would get issues.
    – hoffmale
    7 mins ago










  • Yes, that's true. Fortunately that's not the case
    – Frynio
    5 mins ago












up vote
1
down vote



accepted







up vote
1
down vote



accepted






Partition - Pivot selection



There are 2 bugs in the pivot selection code, in combination allowing for $mathcalO(n^2)$ worst case runtime.



Bug #1: Wrong average calculation



I guess this is just a typo:



auto middle = data[(p + 2) / 2];


Most likely, (p + r) / 2 was intended instead, and that would fix this problem (disregarding overflow, but that could only ever happen if $p = r = textINT_MAX$. Maybe add a check for p == r and return early?).



Bug #2: Missing a case



if (first > middle) 
if (middle > last)
pivot = middle;
else
pivot = last;




This snippet doesn't handle the case correctly if $middle < first < last$. Even though first should be assigned as pivot, last will be.



The fix is simple: Change pivot = last; to pivot = std::min(first, last);.



Consequences



These two bugs allow for runtime complexity to degrade to $mathcalO(n^2)$, because they introduce a huge bias towards last as pivot value:



For any call with p >= 4, middle will have the value of an element to the left of data[p], thus being guaranteed to be smaller than first or last due to previous runs of _partition. So `last will be chosen as pivot, due to bug #2.



For any call with p == 2 or p == 3, middle will be assigned the value of data[1], which is equal to first. In that case, pivot will always be assigned the value of middle/first.



For any call with p == 0 or p == 1, the median calculation will work with a slight bias towards last (due to bug #2).



Partition - Moving values



There seems to be an assumption that data[r] == pivot. This might have been a good assumption before due to bug #2, but that isn't necessarily the case once that bug is fixed. There a multiple ways to fix that.



Also, technically it isn't necessary to move values around that are equal to the pivot value: It doesn't actually matter whether they are left or right of the final pivot position.



Multithreading



The current multithreading solution isn't actually that efficient at using threads: Nearly half the threads at a time are waiting for the others to finish. (The first thread is waiting on threads #2 and #3 it spawned, they then are waiting on threads #4-7, ...)



As mentioned in the comments, this could be fixed by letting the original thread keep working on one of the subsections before waiting, i.e. changing this:



 std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
std::thread upper(&quicksort::_sort<T>, this, data, q + 1, r);
lower.join();
upper.join();


to:



 std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
_sort(data, q + 1, r);
lower.join();


Another point to make is that it rarely benefits performance to have a number of threads much larger than the number of available (logical) cores. The final number is likely between #cores and 2 * #cores, but to find the best amount you'll likely need to measure.



Bug #3: No protection for concurrent calls



Since we're already in the multithreaded environment, what happens if a second thread calls quicksort::sort after the first thread, but before that call finished?



It resets thread_count, thus allowing for the creation of more threads than intended. Once they are finished, thread_count will be negative! This might cause problems if that case isn't expected.



Solution: Either don't reset thread_count, or wait until it reaches zero before starting to work on the next dataset.






share|improve this answer












Partition - Pivot selection



There are 2 bugs in the pivot selection code, in combination allowing for $mathcalO(n^2)$ worst case runtime.



Bug #1: Wrong average calculation



I guess this is just a typo:



auto middle = data[(p + 2) / 2];


Most likely, (p + r) / 2 was intended instead, and that would fix this problem (disregarding overflow, but that could only ever happen if $p = r = textINT_MAX$. Maybe add a check for p == r and return early?).



Bug #2: Missing a case



if (first > middle) 
if (middle > last)
pivot = middle;
else
pivot = last;




This snippet doesn't handle the case correctly if $middle < first < last$. Even though first should be assigned as pivot, last will be.



The fix is simple: Change pivot = last; to pivot = std::min(first, last);.



Consequences



These two bugs allow for runtime complexity to degrade to $mathcalO(n^2)$, because they introduce a huge bias towards last as pivot value:



For any call with p >= 4, middle will have the value of an element to the left of data[p], thus being guaranteed to be smaller than first or last due to previous runs of _partition. So `last will be chosen as pivot, due to bug #2.



For any call with p == 2 or p == 3, middle will be assigned the value of data[1], which is equal to first. In that case, pivot will always be assigned the value of middle/first.



For any call with p == 0 or p == 1, the median calculation will work with a slight bias towards last (due to bug #2).



Partition - Moving values



There seems to be an assumption that data[r] == pivot. This might have been a good assumption before due to bug #2, but that isn't necessarily the case once that bug is fixed. There a multiple ways to fix that.



Also, technically it isn't necessary to move values around that are equal to the pivot value: It doesn't actually matter whether they are left or right of the final pivot position.



Multithreading



The current multithreading solution isn't actually that efficient at using threads: Nearly half the threads at a time are waiting for the others to finish. (The first thread is waiting on threads #2 and #3 it spawned, they then are waiting on threads #4-7, ...)



As mentioned in the comments, this could be fixed by letting the original thread keep working on one of the subsections before waiting, i.e. changing this:



 std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
std::thread upper(&quicksort::_sort<T>, this, data, q + 1, r);
lower.join();
upper.join();


to:



 std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
_sort(data, q + 1, r);
lower.join();


Another point to make is that it rarely benefits performance to have a number of threads much larger than the number of available (logical) cores. The final number is likely between #cores and 2 * #cores, but to find the best amount you'll likely need to measure.



Bug #3: No protection for concurrent calls



Since we're already in the multithreaded environment, what happens if a second thread calls quicksort::sort after the first thread, but before that call finished?



It resets thread_count, thus allowing for the creation of more threads than intended. Once they are finished, thread_count will be negative! This might cause problems if that case isn't expected.



Solution: Either don't reset thread_count, or wait until it reaches zero before starting to work on the next dataset.







share|improve this answer












share|improve this answer



share|improve this answer










answered 55 mins ago









hoffmale

5,342835




5,342835











  • If you're after the best quicksort performance, you might want to look into sentinel-based partition (e.g. this talk, around minute 28) and/or parallel partition algorithms.
    – hoffmale
    48 mins ago











  • Jeez, thanks for the input. And about Bug #3 - I call sort() in a loop, yes, but if I join every thread, then I don't think there will be situation, where I call another sort() if the previous did not finish
    – Frynio
    10 mins ago






  • 1




    @Frynio: A loop on the same thread doesn't matter (one call will have finished before the next starts). However, if there ever would be two threads calling sort concurrently, then you would get issues.
    – hoffmale
    7 mins ago










  • Yes, that's true. Fortunately that's not the case
    – Frynio
    5 mins ago
















  • If you're after the best quicksort performance, you might want to look into sentinel-based partition (e.g. this talk, around minute 28) and/or parallel partition algorithms.
    – hoffmale
    48 mins ago











  • Jeez, thanks for the input. And about Bug #3 - I call sort() in a loop, yes, but if I join every thread, then I don't think there will be situation, where I call another sort() if the previous did not finish
    – Frynio
    10 mins ago






  • 1




    @Frynio: A loop on the same thread doesn't matter (one call will have finished before the next starts). However, if there ever would be two threads calling sort concurrently, then you would get issues.
    – hoffmale
    7 mins ago










  • Yes, that's true. Fortunately that's not the case
    – Frynio
    5 mins ago















If you're after the best quicksort performance, you might want to look into sentinel-based partition (e.g. this talk, around minute 28) and/or parallel partition algorithms.
– hoffmale
48 mins ago





If you're after the best quicksort performance, you might want to look into sentinel-based partition (e.g. this talk, around minute 28) and/or parallel partition algorithms.
– hoffmale
48 mins ago













Jeez, thanks for the input. And about Bug #3 - I call sort() in a loop, yes, but if I join every thread, then I don't think there will be situation, where I call another sort() if the previous did not finish
– Frynio
10 mins ago




Jeez, thanks for the input. And about Bug #3 - I call sort() in a loop, yes, but if I join every thread, then I don't think there will be situation, where I call another sort() if the previous did not finish
– Frynio
10 mins ago




1




1




@Frynio: A loop on the same thread doesn't matter (one call will have finished before the next starts). However, if there ever would be two threads calling sort concurrently, then you would get issues.
– hoffmale
7 mins ago




@Frynio: A loop on the same thread doesn't matter (one call will have finished before the next starts). However, if there ever would be two threads calling sort concurrently, then you would get issues.
– hoffmale
7 mins ago












Yes, that's true. Fortunately that's not the case
– Frynio
5 mins ago




Yes, that's true. Fortunately that's not the case
– Frynio
5 mins ago












up vote
1
down vote













  • Thread creation is an expensive operation, and you do it way more than needed. Even though the number of threads running at any given time is capped, the threads are constantly joined and recreated over and over again (I run your code with cap of 10, over a 1000000-strong array, and counted 50 thread creations). Use a thread pool instead.


  • There is no need to start two threads. The calling thread may very well continue sorting one of the halves.


  • As said in comments, using more threads than hardware supports will cause slowdown. The software threads are beneficial only when they have natural blocking points (due to IO or synchronization). Sorting is not the case here.






share|improve this answer




















  • Great points! So after fixing the first two problems you mentioned, should I actually limit the threads only to match my hardware threads?
    – Frynio
    1 hour ago














up vote
1
down vote













  • Thread creation is an expensive operation, and you do it way more than needed. Even though the number of threads running at any given time is capped, the threads are constantly joined and recreated over and over again (I run your code with cap of 10, over a 1000000-strong array, and counted 50 thread creations). Use a thread pool instead.


  • There is no need to start two threads. The calling thread may very well continue sorting one of the halves.


  • As said in comments, using more threads than hardware supports will cause slowdown. The software threads are beneficial only when they have natural blocking points (due to IO or synchronization). Sorting is not the case here.






share|improve this answer




















  • Great points! So after fixing the first two problems you mentioned, should I actually limit the threads only to match my hardware threads?
    – Frynio
    1 hour ago












up vote
1
down vote










up vote
1
down vote









  • Thread creation is an expensive operation, and you do it way more than needed. Even though the number of threads running at any given time is capped, the threads are constantly joined and recreated over and over again (I run your code with cap of 10, over a 1000000-strong array, and counted 50 thread creations). Use a thread pool instead.


  • There is no need to start two threads. The calling thread may very well continue sorting one of the halves.


  • As said in comments, using more threads than hardware supports will cause slowdown. The software threads are beneficial only when they have natural blocking points (due to IO or synchronization). Sorting is not the case here.






share|improve this answer












  • Thread creation is an expensive operation, and you do it way more than needed. Even though the number of threads running at any given time is capped, the threads are constantly joined and recreated over and over again (I run your code with cap of 10, over a 1000000-strong array, and counted 50 thread creations). Use a thread pool instead.


  • There is no need to start two threads. The calling thread may very well continue sorting one of the halves.


  • As said in comments, using more threads than hardware supports will cause slowdown. The software threads are beneficial only when they have natural blocking points (due to IO or synchronization). Sorting is not the case here.







share|improve this answer












share|improve this answer



share|improve this answer










answered 1 hour ago









vnp

37.7k13096




37.7k13096











  • Great points! So after fixing the first two problems you mentioned, should I actually limit the threads only to match my hardware threads?
    – Frynio
    1 hour ago
















  • Great points! So after fixing the first two problems you mentioned, should I actually limit the threads only to match my hardware threads?
    – Frynio
    1 hour ago















Great points! So after fixing the first two problems you mentioned, should I actually limit the threads only to match my hardware threads?
– Frynio
1 hour ago




Great points! So after fixing the first two problems you mentioned, should I actually limit the threads only to match my hardware threads?
– Frynio
1 hour ago










Frynio is a new contributor. Be nice, and check out our Code of Conduct.









 

draft saved


draft discarded


















Frynio is a new contributor. Be nice, and check out our Code of Conduct.












Frynio is a new contributor. Be nice, and check out our Code of Conduct.











Frynio is a new contributor. Be nice, and check out our Code of Conduct.













 


draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f206438%2fconcurrent-quicksort-in-c%23new-answer', 'question_page');

);

Post as a guest













































































Popular posts from this blog

How to check contact read email or not when send email to Individual?

Displaying single band from multi-band raster using QGIS

How many registers does an x86_64 CPU actually have?