Concurrent quicksort in C++
I have a task to write a concurrent version of the quicksort algorithm, to speed it up. This is what I wrote:
```cpp
template <class T>
void quicksort::sort(T *data, int len)
{
    thread_count = 0;
    int p = 0;
    int r = len - 1;
    _sort(data, p, r);
}

template <class T>
void quicksort::_swap(T *data, int first, int second)
{
    auto temp = data[first];
    data[first] = data[second];
    data[second] = temp;
}

template <class T>
void quicksort::_sort(T *data, int p, int r)
{
    thread_count++;
    if (p < r)
    {
        auto q = _partition(data, p, r);
        if (thread_count >= thread_limit)
        {
            _sort(data, p, q - 1);
            _sort(data, q + 1, r);
        }
        else
        {
            std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
            std::thread upper(&quicksort::_sort<T>, this, data, q + 1, r);
            lower.join();
            upper.join();
        }
    }
    thread_count--;
}

template <class T>
int quicksort::_partition(T *data, int p, int r)
{
    auto first = data[p];
    auto last = data[r];
    T pivot = 0;
    // Median of three
    auto middle = data[(p + 2) / 2];
    if (first > middle)
    {
        if (middle > last)
            pivot = middle;
        else
            pivot = last;
    }
    else
    {
        if (middle > last)
            pivot = std::max(first, last);
        else
            pivot = middle;
    }
    auto i = p - 1;
    for (auto j = p; j < r; ++j)
        if (data[j] <= pivot)
            _swap(data, ++i, j);
    _swap(data, i + 1, r);
    return i + 1;
}
```
Those are methods from a class quicksort, which also has std::atomic<int> members thread_limit and thread_count. So, for example, I cap the thread limit at a maximum of 1000 threads.
The part I am particularly unsure of is the _sort() function. Basically, I don't know if my way of starting those threads and then joining them is the right way to do it.
I have a benchmark method, and I benchmarked my quicksort implementation like this: I took the average of 100 runs for every thread_limit from 200 to 2000, stepping by 100 (so 200, 300, 400, ...), and for my single-threaded quicksort. This was for arrays of 1,000,000 elements (every time I created a new one; this overhead was not measured). The result was shocking to me. It turned out that, of course, even with 2000 threads it was faster than with only 1 thread, but still, the fastest run was with thread_limit = 200. I thought the optimum would be somewhere nearer the middle.
Sometimes, for example, 700 threads were faster than 600 threads, but the general rule was: the more threads, the slower.
Do you see any flaws in my code or in my understanding of the benchmark I made?
c++ multithreading quick-sort
Having more threads than are present in your hardware (likely 4 or 8 for typical laptops, possibly 12 for a desktop) is not beneficial.
– 1201ProgramAlarm, 5 hours ago
Okay. I have two cores and 4 threads in my laptop. Does that mean I actually should cap at 4 threads? I have now measured for thread_limit ranging from 20 to 800, and I still see that the more threads, the slower. 20 threads are faster than 40, etc.
– Frynio, 4 hours ago
@Frynio, more threads than the hardware supports means more context switching. Context switches mean slower performance, and perhaps something else I'm unaware of (I haven't designed or studied any OS in depth yet).
– Incomputable, 4 hours ago
Every switch adds overhead. So if the overhead takes longer than the parallelisation saves, it's not helping.
– Mast, 2 hours ago
You benchmarked your code, good. Did you verify its output as well? As in, you know how long it takes, but does it work correctly?
– Mast, 2 hours ago
asked 5 hours ago by Frynio
2 Answers
Accepted answer
Partition - Pivot selection
There are two bugs in the pivot selection code, in combination allowing for $\mathcal{O}(n^2)$ worst-case runtime.
Bug #1: Wrong average calculation
I guess this is just a typo:
```cpp
auto middle = data[(p + 2) / 2];
```
Most likely, (p + r) / 2 was intended instead, and that would fix this problem (disregarding overflow, but that could only ever happen if $p = r = \text{INT\_MAX}$; maybe add a check for p == r and return early?).
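As a side note, writing the midpoint as p + (r - p) / 2 sidesteps the overflow case entirely; a small sketch (the helper name is illustrative, not from the original code):

```cpp
#include <climits>

// Overflow-safe midpoint: p + (r - p) / 2 stays within [p, r]
// even when p + r would overflow int.
int midpoint(int p, int r) {
    return p + (r - p) / 2;
}
```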
Bug #2: Missing a case
```cpp
if (first > middle)
    if (middle > last)
        pivot = middle;
    else
        pivot = last;
```
This snippet doesn't handle the case middle < first < last correctly: even though first should be assigned as the pivot, last will be.
The fix is simple: change pivot = last; to pivot = std::min(first, last);.
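Alternatively, the whole case analysis can be collapsed into a min/max identity; a hypothetical helper (not part of the original class) sketching the idea:

```cpp
#include <algorithm>

// Median of three via the identity
//   median(a, b, c) == max(min(a, b), min(max(a, b), c)),
// which covers all six orderings without branching on cases.
template <class T>
T median_of_three(const T &a, const T &b, const T &c) {
    return std::max(std::min(a, b), std::min(std::max(a, b), c));
}
```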
Consequences
These two bugs allow the runtime complexity to degrade to $\mathcal{O}(n^2)$, because they introduce a huge bias towards last as the pivot value:
- For any call with p >= 4, middle will have the value of an element to the left of data[p], thus being guaranteed to be smaller than first or last due to previous runs of _partition. So last will be chosen as the pivot, due to bug #2.
- For any call with p == 2 or p == 3, middle will be assigned the value of data[1], which is equal to first. In that case, pivot will always be assigned the value of middle/first.
- For any call with p == 0 or p == 1, the median calculation will work with a slight bias towards last (due to bug #2).
Partition - Moving values
There seems to be an assumption that data[r] == pivot. That might have been a safe assumption before, due to bug #2, but it isn't necessarily the case once that bug is fixed. There are multiple ways to fix that.
Also, technically it isn't necessary to move values that are equal to the pivot value: it doesn't actually matter whether they end up left or right of the final pivot position.
Multithreading
The current multithreading solution isn't actually that efficient at using threads: nearly half the threads at any time are waiting for the others to finish (the first thread is waiting on threads #2 and #3 it spawned, those are waiting on threads #4-#7, and so on).
As mentioned in the comments, this could be fixed by letting the original thread keep working on one of the subsections before waiting, i.e. changing this:
```cpp
std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
std::thread upper(&quicksort::_sort<T>, this, data, q + 1, r);
lower.join();
upper.join();
```
to:
```cpp
std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
_sort(data, q + 1, r);
lower.join();
```
Another point: it rarely benefits performance to have many more threads than the number of available (logical) cores. The best value is likely somewhere between #cores and 2 * #cores, but to find it you'll need to measure.
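The number of logical cores can be queried at runtime as a starting point for thread_limit; a minimal sketch (the function name is illustrative, and note the standard allows the query to return 0 when the count cannot be determined):

```cpp
#include <thread>

// Suggested cap: the number of logical cores, falling back to 2 when
// std::thread::hardware_concurrency() cannot determine it (it may return 0).
unsigned suggested_thread_limit() {
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 2 : n;
}
```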
Bug #3: No protection for concurrent calls
Since we're already in a multithreaded environment, what happens if a second thread calls quicksort::sort after the first thread, but before that call finished?
It resets thread_count, thus allowing the creation of more threads than intended. Once they are finished, thread_count will be negative! This might cause problems if that case isn't expected.
Solution: either don't reset thread_count, or wait until it reaches zero before starting to work on the next dataset.
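A sketch of what guarding against concurrent top-level calls could look like; here std::sort stands in for the recursive _sort, and the mutex member is an assumption, not part of the original class:

```cpp
#include <algorithm>
#include <atomic>
#include <mutex>

// Serializing top-level sort() calls means thread_count can never be
// reset while workers from a previous call are still in flight.
class quicksort_guarded {
public:
    template <class T>
    void sort(T *data, int len) {
        std::lock_guard<std::mutex> lock(sort_mutex_);  // one dataset at a time
        thread_count = 0;                 // now safe: no other sort() is running
        std::sort(data, data + len);      // stand-in for _sort(data, 0, len - 1)
    }
    std::atomic<int> thread_count{0};

private:
    std::mutex sort_mutex_;
};
```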
If you're after the best quicksort performance, you might want to look into sentinel-based partition (e.g. this talk, around minute 28) and/or parallel partition algorithms.
– hoffmale, 48 mins ago
Jeez, thanks for the input. And about bug #3 - I call sort() in a loop, yes, but if I join every thread, then I don't think there will be a situation where I call another sort() before the previous one has finished.
– Frynio, 10 mins ago
@Frynio: A loop on the same thread doesn't matter (one call will have finished before the next starts). However, if there ever were two threads calling sort concurrently, then you would get issues.
– hoffmale, 7 mins ago
Yes, that's true. Fortunately that's not the case.
– Frynio, 5 mins ago
Thread creation is an expensive operation, and you do it far more often than needed. Even though the number of threads running at any given time is capped, threads are constantly joined and recreated over and over again (I ran your code with a cap of 10 over a 1000000-element array and counted 50 thread creations). Use a thread pool instead.
There is no need to start two threads: the calling thread may very well continue sorting one of the halves.
As said in the comments, using more threads than the hardware supports will cause a slowdown. Extra software threads are beneficial only when they have natural blocking points (due to IO or synchronization), which is not the case for sorting.
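For reference, a minimal fixed-size thread pool could look roughly like this (all names are illustrative; production code would also want exception handling and a way to wait for individual tasks):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal fixed-size thread pool sketch. Workers are created once and
// reused for every submitted task, avoiding per-subrange thread creation.
class thread_pool {
public:
    explicit thread_pool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~thread_pool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto &w : workers_) w.join();  // drains remaining tasks first
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (done_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // execute outside the lock
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};
```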
Great points! So after fixing the first two problems you mentioned, should I actually limit the threads to match my hardware threads?
– Frynio, 1 hour ago
add a comment |Â
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
1
down vote
accepted
Partition - Pivot selection
There are 2 bugs in the pivot selection code, in combination allowing for $mathcalO(n^2)$ worst case runtime.
Bug #1: Wrong average calculation
I guess this is just a typo:
auto middle = data[(p + 2) / 2];
Most likely, (p + r) / 2
was intended instead, and that would fix this problem (disregarding overflow, but that could only ever happen if $p = r = textINT_MAX$. Maybe add a check for p == r
and return early?).
Bug #2: Missing a case
if (first > middle)
if (middle > last)
pivot = middle;
else
pivot = last;
This snippet doesn't handle the case correctly if $middle < first < last$. Even though first
should be assigned as pivot, last
will be.
The fix is simple: Change pivot = last;
to pivot = std::min(first, last);
.
Consequences
These two bugs allow for runtime complexity to degrade to $mathcalO(n^2)$, because they introduce a huge bias towards last
as pivot value:
For any call with p >= 4
, middle
will have the value of an element to the left of data[p]
, thus being guaranteed to be smaller than first
or last
due to previous runs of _partition
. So `last will be chosen as pivot, due to bug #2.
For any call with p == 2
or p == 3
, middle
will be assigned the value of data[1]
, which is equal to first
. In that case, pivot
will always be assigned the value of middle
/first
.
For any call with p == 0
or p == 1
, the median calculation will work with a slight bias towards last
(due to bug #2).
Partition - Moving values
There seems to be an assumption that data[r] == pivot
. This might have been a good assumption before due to bug #2, but that isn't necessarily the case once that bug is fixed. There a multiple ways to fix that.
Also, technically it isn't necessary to move values around that are equal to the pivot value: It doesn't actually matter whether they are left or right of the final pivot position.
Multithreading
The current multithreading solution isn't actually that efficient at using threads: Nearly half the threads at a time are waiting for the others to finish. (The first thread is waiting on threads #2 and #3 it spawned, they then are waiting on threads #4-7, ...)
As mentioned in the comments, this could be fixed by letting the original thread keep working on one of the subsections before waiting, i.e. changing this:
std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
std::thread upper(&quicksort::_sort<T>, this, data, q + 1, r);
lower.join();
upper.join();
to:
std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
_sort(data, q + 1, r);
lower.join();
Another point to make is that it rarely benefits performance to have a number of threads much larger than the number of available (logical) cores. The final number is likely between #cores and 2 * #cores, but to find the best amount you'll likely need to measure.
Bug #3: No protection for concurrent calls
Since we're already in the multithreaded environment, what happens if a second thread calls quicksort::sort
after the first thread, but before that call finished?
It resets thread_count
, thus allowing for the creation of more threads than intended. Once they are finished, thread_count
will be negative! This might cause problems if that case isn't expected.
Solution: Either don't reset thread_count
, or wait until it reaches zero before starting to work on the next dataset.
If you're after the best quicksort performance, you might want to look into sentinel-based partition (e.g. this talk, around minute 28) and/or parallel partition algorithms.
â hoffmale
48 mins ago
Jeez, thanks for the input. And about Bug #3 - I callsort()
in a loop, yes, but if I join every thread, then I don't think there will be situation, where I call anothersort()
if the previous did not finish
â Frynio
10 mins ago
1
@Frynio: A loop on the same thread doesn't matter (one call will have finished before the next starts). However, if there ever would be two threads callingsort
concurrently, then you would get issues.
â hoffmale
7 mins ago
Yes, that's true. Fortunately that's not the case
â Frynio
5 mins ago
add a comment |Â
up vote
1
down vote
accepted
Partition - Pivot selection
There are 2 bugs in the pivot selection code, in combination allowing for $mathcalO(n^2)$ worst case runtime.
Bug #1: Wrong average calculation
I guess this is just a typo:
auto middle = data[(p + 2) / 2];
Most likely, (p + r) / 2
was intended instead, and that would fix this problem (disregarding overflow, but that could only ever happen if $p = r = textINT_MAX$. Maybe add a check for p == r
and return early?).
Bug #2: Missing a case
if (first > middle)
if (middle > last)
pivot = middle;
else
pivot = last;
This snippet doesn't handle the case correctly if $middle < first < last$. Even though first
should be assigned as pivot, last
will be.
The fix is simple: Change pivot = last;
to pivot = std::min(first, last);
.
Consequences
These two bugs allow for runtime complexity to degrade to $mathcalO(n^2)$, because they introduce a huge bias towards last
as pivot value:
For any call with p >= 4
, middle
will have the value of an element to the left of data[p]
, thus being guaranteed to be smaller than first
or last
due to previous runs of _partition
. So `last will be chosen as pivot, due to bug #2.
For any call with p == 2
or p == 3
, middle
will be assigned the value of data[1]
, which is equal to first
. In that case, pivot
will always be assigned the value of middle
/first
.
For any call with p == 0
or p == 1
, the median calculation will work with a slight bias towards last
(due to bug #2).
Partition - Moving values
There seems to be an assumption that data[r] == pivot
. This might have been a good assumption before due to bug #2, but that isn't necessarily the case once that bug is fixed. There a multiple ways to fix that.
Also, technically it isn't necessary to move values around that are equal to the pivot value: It doesn't actually matter whether they are left or right of the final pivot position.
Multithreading
The current multithreading solution isn't actually that efficient at using threads: Nearly half the threads at a time are waiting for the others to finish. (The first thread is waiting on threads #2 and #3 it spawned, they then are waiting on threads #4-7, ...)
As mentioned in the comments, this could be fixed by letting the original thread keep working on one of the subsections before waiting, i.e. changing this:
std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
std::thread upper(&quicksort::_sort<T>, this, data, q + 1, r);
lower.join();
upper.join();
to:
std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
_sort(data, q + 1, r);
lower.join();
Another point to make is that it rarely benefits performance to have a number of threads much larger than the number of available (logical) cores. The final number is likely between #cores and 2 * #cores, but to find the best amount you'll likely need to measure.
Bug #3: No protection for concurrent calls
Since we're already in the multithreaded environment, what happens if a second thread calls quicksort::sort
after the first thread, but before that call finished?
It resets thread_count
, thus allowing for the creation of more threads than intended. Once they are finished, thread_count
will be negative! This might cause problems if that case isn't expected.
Solution: Either don't reset thread_count
, or wait until it reaches zero before starting to work on the next dataset.
If you're after the best quicksort performance, you might want to look into sentinel-based partition (e.g. this talk, around minute 28) and/or parallel partition algorithms.
â hoffmale
48 mins ago
Jeez, thanks for the input. And about Bug #3 - I callsort()
in a loop, yes, but if I join every thread, then I don't think there will be situation, where I call anothersort()
if the previous did not finish
â Frynio
10 mins ago
1
@Frynio: A loop on the same thread doesn't matter (one call will have finished before the next starts). However, if there ever would be two threads callingsort
concurrently, then you would get issues.
â hoffmale
7 mins ago
Yes, that's true. Fortunately that's not the case
â Frynio
5 mins ago
add a comment |Â
up vote
1
down vote
accepted
up vote
1
down vote
accepted
Partition - Pivot selection
There are 2 bugs in the pivot selection code, in combination allowing for $mathcalO(n^2)$ worst case runtime.
Bug #1: Wrong average calculation
I guess this is just a typo:
auto middle = data[(p + 2) / 2];
Most likely, (p + r) / 2
was intended instead, and that would fix this problem (disregarding overflow, but that could only ever happen if $p = r = textINT_MAX$. Maybe add a check for p == r
and return early?).
Bug #2: Missing a case
if (first > middle)
if (middle > last)
pivot = middle;
else
pivot = last;
This snippet doesn't handle the case correctly if $middle < first < last$. Even though first
should be assigned as pivot, last
will be.
The fix is simple: Change pivot = last;
to pivot = std::min(first, last);
.
Consequences
These two bugs allow for runtime complexity to degrade to $mathcalO(n^2)$, because they introduce a huge bias towards last
as pivot value:
For any call with p >= 4
, middle
will have the value of an element to the left of data[p]
, thus being guaranteed to be smaller than first
or last
due to previous runs of _partition
. So `last will be chosen as pivot, due to bug #2.
For any call with p == 2
or p == 3
, middle
will be assigned the value of data[1]
, which is equal to first
. In that case, pivot
will always be assigned the value of middle
/first
.
For any call with p == 0
or p == 1
, the median calculation will work with a slight bias towards last
(due to bug #2).
Partition - Moving values
There seems to be an assumption that data[r] == pivot
. This might have been a good assumption before due to bug #2, but that isn't necessarily the case once that bug is fixed. There a multiple ways to fix that.
Also, technically it isn't necessary to move values around that are equal to the pivot value: It doesn't actually matter whether they are left or right of the final pivot position.
Multithreading
The current multithreading solution isn't actually that efficient at using threads: Nearly half the threads at a time are waiting for the others to finish. (The first thread is waiting on threads #2 and #3 it spawned, they then are waiting on threads #4-7, ...)
As mentioned in the comments, this could be fixed by letting the original thread keep working on one of the subsections before waiting, i.e. changing this:
std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
std::thread upper(&quicksort::_sort<T>, this, data, q + 1, r);
lower.join();
upper.join();
to:
std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
_sort(data, q + 1, r);
lower.join();
Another point to make is that it rarely benefits performance to have a number of threads much larger than the number of available (logical) cores. The final number is likely between #cores and 2 * #cores, but to find the best amount you'll likely need to measure.
Bug #3: No protection for concurrent calls
Since we're already in the multithreaded environment, what happens if a second thread calls quicksort::sort
after the first thread, but before that call finished?
It resets thread_count
, thus allowing for the creation of more threads than intended. Once they are finished, thread_count
will be negative! This might cause problems if that case isn't expected.
Solution: Either don't reset thread_count
, or wait until it reaches zero before starting to work on the next dataset.
Partition - Pivot selection
There are 2 bugs in the pivot selection code, in combination allowing for $mathcalO(n^2)$ worst case runtime.
Bug #1: Wrong average calculation
I guess this is just a typo:
auto middle = data[(p + 2) / 2];
Most likely, (p + r) / 2 was intended instead, and that would fix this problem (disregarding overflow, which could only ever happen if $p = r = \text{INT\_MAX}$; maybe add a check for p == r and return early?).
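A common overflow-safe way to write that midpoint, shown here as a hypothetical free function (not part of the original class):

```cpp
#include <cassert>

// Overflow-safe midpoint: equivalent to (p + r) / 2 for 0 <= p <= r,
// but p + (r - p) / 2 cannot overflow, since r - p fits in int.
int midpoint(int p, int r)
{
    return p + (r - p) / 2;
}
```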
Bug #2: Missing a case
if (first > middle)
if (middle > last)
pivot = middle;
else
pivot = last;
This snippet doesn't handle the case $middle < first < last$ correctly: even though first should be assigned as pivot, last will be.
The fix is simple: change pivot = last; to pivot = std::min(first, last);.
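With that fix in place, the whole if/else ladder can also be collapsed into a min/max composition; a sketch (the function name is mine, not from the original class):

```cpp
#include <algorithm>
#include <cassert>

// Median of three values without the nested-if ladder:
// max(min(a, b), min(max(a, b), c)) is a classic branch-light
// formulation of the median.
template <class T>
T median_of_three(T a, T b, T c)
{
    return std::max(std::min(a, b), std::min(std::max(a, b), c));
}
```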
Consequences
These two bugs allow the runtime complexity to degrade to $\mathcal{O}(n^2)$, because they introduce a huge bias towards last as the pivot value:
For any call with p >= 3, middle will have the value of an element to the left of data[p], and thus is guaranteed not to be larger than first or last due to previous runs of _partition. So last will be chosen as pivot, due to bug #2.
For any call with p == 1 or p == 2, middle will be assigned the value of data[p], which is equal to first. In that case, pivot will always be assigned the value of middle/first.
For any call with p == 0, middle will be data[1], and the median calculation will work with only a slight bias towards last (due to bug #2).
Partition - Moving values
There seems to be an assumption that data[r] == pivot. This might have been a valid assumption before, due to bug #2, but that isn't necessarily the case once the bug is fixed. There are multiple ways to fix that.
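One of those ways, sketched below: pick the pivot index first, swap that element to data[r], and then the existing Lomuto-style scan works unchanged. The function name and the pivot_index parameter are hypothetical, not part of the original class:

```cpp
#include <cassert>
#include <utility>

// Lomuto partition that first moves the element chosen as pivot to the
// end, so the final swap into position i + 1 is always correct.
template <class T>
int partition_with_pivot_at_end(T* data, int p, int r, int pivot_index)
{
    std::swap(data[pivot_index], data[r]); // establish data[r] == pivot
    const T pivot = data[r];
    int i = p - 1;
    for (int j = p; j < r; ++j)
        if (data[j] <= pivot)
            std::swap(data[++i], data[j]);
    std::swap(data[i + 1], data[r]); // put the pivot in its final slot
    return i + 1;
}
```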
Also, technically it isn't necessary to move values around that are equal to the pivot value: It doesn't actually matter whether they are left or right of the final pivot position.
Multithreading
The current multithreading solution isn't actually that efficient at using threads: nearly half the threads at any time are waiting for the others to finish. (The first thread waits on threads #2 and #3 that it spawned; those wait on threads #4-7, and so on.)
As mentioned in the comments, this could be fixed by letting the original thread keep working on one of the subsections before waiting, i.e. changing this:
std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
std::thread upper(&quicksort::_sort<T>, this, data, q + 1, r);
lower.join();
upper.join();
to:
std::thread lower(&quicksort::_sort<T>, this, data, p, q - 1);
_sort(data, q + 1, r);
lower.join();
Another point: it rarely benefits performance to have a number of threads much larger than the number of available (logical) cores. The best value is likely between #cores and 2 * #cores, but to find it you'll need to measure.
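As a starting point for that measurement, std::thread::hardware_concurrency() can seed the limit; a minimal sketch (the helper name and the 2x factor are my assumptions, not a recommendation from the original code):

```cpp
#include <thread>

// Derive a starting thread limit from the hardware.
// hardware_concurrency() may return 0 when the value is not computable,
// so fall back to a small default; tune the factor by benchmarking.
unsigned default_thread_limit()
{
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0)
        cores = 2; // fallback when the hint is unavailable
    return 2 * cores;
}
```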
Bug #3: No protection for concurrent calls
Since we're already in a multithreaded environment, what happens if a second thread calls quicksort::sort after the first thread, but before that call finished?
It resets thread_count, thus allowing for the creation of more threads than intended. Once they are finished, thread_count will be negative! This might cause problems if that case isn't expected.
Solution: either don't reset thread_count, or wait until it reaches zero before starting to work on the next dataset.
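The second option can be approximated by simply serializing whole calls behind a mutex; a minimal sketch (the class and member names are mine, and std::sort stands in for the original reset-thread_count-and-_sort body):

```cpp
#include <algorithm>
#include <mutex>

// Sketch: serialize whole-array sorts behind a mutex so concurrent
// callers can't corrupt the shared thread_count. The sorting body here
// is a stand-in; in the real class it would be the existing
// reset-and-_sort() sequence.
class serialized_sorter
{
public:
    template <class T>
    void sort(T* data, int len)
    {
        std::lock_guard<std::mutex> lock(mutex_); // one dataset at a time
        std::sort(data, data + len);
    }

private:
    std::mutex mutex_;
};
```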
answered 55 mins ago
hoffmale
If you're after the best quicksort performance, you might want to look into sentinel-based partition (e.g. this talk, around minute 28) and/or parallel partition algorithms.
– hoffmale
48 mins ago
Jeez, thanks for the input. And about Bug #3 - I call sort() in a loop, yes, but if I join every thread, then I don't think there will be a situation where I call another sort() if the previous did not finish
– Frynio
10 mins ago
@Frynio: A loop on the same thread doesn't matter (one call will have finished before the next starts). However, if there ever would be two threads calling sort concurrently, then you would get issues.
– hoffmale
7 mins ago
Yes, that's true. Fortunately that's not the case
– Frynio
5 mins ago
Thread creation is an expensive operation, and you do it way more than needed. Even though the number of threads running at any given time is capped, threads are constantly joined and recreated over and over again (I ran your code with a cap of 10 over a 1000000-element array and counted 50 thread creations). Use a thread pool instead.
There is no need to start two threads: the calling thread may very well continue sorting one of the halves.
As said in the comments, using more threads than the hardware supports will cause a slowdown. Extra software threads are beneficial only when they have natural blocking points (due to IO or synchronization); that is not the case for sorting.
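One lightweight way to get the last two points (short of a full thread pool) is a depth cap: the calling thread keeps one half, and only the other half may go to a new thread, up to a fixed recursion depth. A sketch, not the original class; std::nth_element stands in for the original partition step:

```cpp
#include <algorithm>
#include <future>
#include <iterator>

// Depth-limited parallel quicksort sketch: the calling thread keeps
// sorting the right half; only the left half may spawn a helper, and
// only while `depth` levels remain, bounding total thread creations.
template <class It>
void parallel_sort(It first, It last, int depth)
{
    if (std::distance(first, last) < 2)
        return;
    It mid = first + std::distance(first, last) / 2;
    std::nth_element(first, mid, last); // partition around the median
    if (depth > 0)
    {
        auto left = std::async(std::launch::async,
                               parallel_sort<It>, first, mid, depth - 1);
        parallel_sort(mid, last, depth - 1); // this thread keeps working
        left.get();
    }
    else
    {
        std::sort(first, last); // sequential fallback once depth is spent
    }
}
```

With depth d, at most 2^d - 1 helper threads are ever created, so choosing d so that 2^d is near the core count keeps creation overhead small.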
Great points! So after fixing the first two problems you mentioned, should I actually limit the threads only to match my hardware threads?
– Frynio
1 hour ago
answered 1 hour ago
vnp
Frynio is a new contributor. Be nice, and check out our Code of Conduct.
Having more threads than are present in your hardware (likely 4 or 8 for typical laptops, possibly 12 for a desktop) is not beneficial.
– 1201ProgramAlarm
5 hours ago
Okay. I have two cores and 4 threads in my laptop. Does it mean that I actually should cap at 4 threads? Now I actually measured for thread_limit ranging from 20 to 800, and I still see that the more threads, the slower. 20 threads are faster than 40, etc.
– Frynio
4 hours ago
@Frynio, more threads than the hardware supports means more context switching. Context switches mean slower performance, and perhaps something else I'm unaware of (haven't designed or studied in depth any OS yet).
– Incomputable
4 hours ago
Every switch adds overhead. So if the overhead takes longer than the parallelisation improves, it's not helping.
– Mast
2 hours ago
You benchmarked your code, good. Did you verify its output as well? As in, you know how long it takes, but does it work correctly?
– Mast
2 hours ago