Median Calculation of List of Integers without using heap
Clash Royale CLAN TAG#URR8PPP
up vote
7
down vote
favorite
References
Using those references:
- rolling median implementations benchmark
- FAST RUNNING MEDIAN USING AN INDEXABLE SKIPLIST
- Calculate the median of a billion numbers
- Find running median from a stream of integers
Code
A code was written to calculate the Median with the SortedContainers library:
from itertools import islice
from sortedcontainers import SortedList
import random
import time
start_time = time.time()
class Median(object):
def __init__(self, iterable):
self._iterable = islice(iterable, None)
self._sortedlist = SortedList(self._iterable)
def __iter__(self):
self_sortedlist = self._sortedlist
# print(self_sortedlist)
length = len(self_sortedlist)
half = length // 2
if length % 2 == 0:
yield (self_sortedlist[half] + self_sortedlist[half - 1]) // 2
elif length % 2 == 1:
yield self_sortedlist[half]
def main():
m, n = 1000, 1500000
data = [random.randrange(m) for i in range(n)]
# print("Random Data: ", data)
result = list(Median(data))
print("Result: ", result)
if __name__ == "__main__":
main()
print("--- %s seconds ---" % (time.time() - start_time))
Explanation
Random Number Generator
The following code generates data within the Range m
and quantity n
.
m, n = 1000, 15000000
data = [random.randrange(m) for i in range(n)]
Median
The Median class sorts the list of numbers and if n
is odd, returns the middle item with yield self_sortedlist[half]
. Or if n
is even, returns the mean of two middle itens of the list with yield (self_sortedlist[half] + self_sortedlist[half - 1]) // 2
Question
How do I improve the code performance? Because for a large list (100 milion), it takes --- 186.7168517112732 seconds---
on my computer.
python python-3.x statistics
 |Â
show 2 more comments
up vote
7
down vote
favorite
References
Using those references:
- rolling median implementations benchmark
- FAST RUNNING MEDIAN USING AN INDEXABLE SKIPLIST
- Calculate the median of a billion numbers
- Find running median from a stream of integers
Code
A code was written to calculate the Median with the SortedContainers library:
from itertools import islice
from sortedcontainers import SortedList
import random
import time
start_time = time.time()
class Median(object):
def __init__(self, iterable):
self._iterable = islice(iterable, None)
self._sortedlist = SortedList(self._iterable)
def __iter__(self):
self_sortedlist = self._sortedlist
# print(self_sortedlist)
length = len(self_sortedlist)
half = length // 2
if length % 2 == 0:
yield (self_sortedlist[half] + self_sortedlist[half - 1]) // 2
elif length % 2 == 1:
yield self_sortedlist[half]
def main():
m, n = 1000, 1500000
data = [random.randrange(m) for i in range(n)]
# print("Random Data: ", data)
result = list(Median(data))
print("Result: ", result)
if __name__ == "__main__":
main()
print("--- %s seconds ---" % (time.time() - start_time))
Explanation
Random Number Generator
The following code generates data within the Range m
and quantity n
.
m, n = 1000, 15000000
data = [random.randrange(m) for i in range(n)]
Median
The Median class sorts the list of numbers and if n
is odd, returns the middle item with yield self_sortedlist[half]
. Or if n
is even, returns the mean of two middle itens of the list with yield (self_sortedlist[half] + self_sortedlist[half - 1]) // 2
Question
How do I improve the code performance? Because for a large list (100 milion), it takes --- 186.7168517112732 seconds---
on my computer.
python python-3.x statistics
sorting has time complexityO(n log n)
. You may want to look into Medians of Medians (scroll down for a python implementation) which has linear time complexity.
â Andre Holzner
Aug 15 at 19:20
Why do you use integer division for a list with even length? Shouldn't it be a regular division of the two center values in the sorted list?
â Daniel Lenz
Aug 15 at 21:34
1
I think there is a misunderstanding; the use of heaps and such clever data structures is needed to maintain a rolling median, i.e. you have to return the current median after every number in the input. If you want to call it only once on the final list, then a simple quickselect is enough - that gets you to O(n) time. (If you were to use this algorithm for the rolling mean case, you would need O(1+2+..+n) = O(n^2) time, which is why heaps and other structures are used.)
â Ant
Aug 15 at 22:39
1
Timsort is a sorting algorithm, and it takes O(n log n) time. Quickselect is not a sorting algorithm, and it takes O(n) time. Quickselect is much faster than timsort for this use case; the key observation is that to get the median you don't need to sort the whole vector, just as if you want to compute the minimum you don't need to sort the whole vector, you can just do one pass over the elements. With the median (and in general with order statistics) it's more complicated, so you use quickselect, but it still has a O(n) performance, much better than your sorting approach which is O(n log n)
â Ant
Aug 16 at 11:58
1
@DanielLenz Oh, It was to reuse the number as Integer. Just as the data type of the input. However, if the precision of the program is needed for decimal numbers, then it can be done.
â danieltakeshi
Aug 16 at 16:33
 |Â
show 2 more comments
up vote
7
down vote
favorite
up vote
7
down vote
favorite
References
Using those references:
- rolling median implementations benchmark
- FAST RUNNING MEDIAN USING AN INDEXABLE SKIPLIST
- Calculate the median of a billion numbers
- Find running median from a stream of integers
Code
A code was written to calculate the Median with the SortedContainers library:
from itertools import islice
from sortedcontainers import SortedList
import random
import time
start_time = time.time()
class Median(object):
def __init__(self, iterable):
self._iterable = islice(iterable, None)
self._sortedlist = SortedList(self._iterable)
def __iter__(self):
self_sortedlist = self._sortedlist
# print(self_sortedlist)
length = len(self_sortedlist)
half = length // 2
if length % 2 == 0:
yield (self_sortedlist[half] + self_sortedlist[half - 1]) // 2
elif length % 2 == 1:
yield self_sortedlist[half]
def main():
m, n = 1000, 1500000
data = [random.randrange(m) for i in range(n)]
# print("Random Data: ", data)
result = list(Median(data))
print("Result: ", result)
if __name__ == "__main__":
main()
print("--- %s seconds ---" % (time.time() - start_time))
Explanation
Random Number Generator
The following code generates data within the Range m
and quantity n
.
m, n = 1000, 15000000
data = [random.randrange(m) for i in range(n)]
Median
The Median class sorts the list of numbers and if n
is odd, returns the middle item with yield self_sortedlist[half]
. Or if n
is even, returns the mean of two middle itens of the list with yield (self_sortedlist[half] + self_sortedlist[half - 1]) // 2
Question
How do I improve the code performance? Because for a large list (100 milion), it takes --- 186.7168517112732 seconds---
on my computer.
python python-3.x statistics
References
Using those references:
- rolling median implementations benchmark
- FAST RUNNING MEDIAN USING AN INDEXABLE SKIPLIST
- Calculate the median of a billion numbers
- Find running median from a stream of integers
Code
A code was written to calculate the Median with the SortedContainers library:
from itertools import islice
from sortedcontainers import SortedList
import random
import time
start_time = time.time()
class Median(object):
def __init__(self, iterable):
self._iterable = islice(iterable, None)
self._sortedlist = SortedList(self._iterable)
def __iter__(self):
self_sortedlist = self._sortedlist
# print(self_sortedlist)
length = len(self_sortedlist)
half = length // 2
if length % 2 == 0:
yield (self_sortedlist[half] + self_sortedlist[half - 1]) // 2
elif length % 2 == 1:
yield self_sortedlist[half]
def main():
m, n = 1000, 1500000
data = [random.randrange(m) for i in range(n)]
# print("Random Data: ", data)
result = list(Median(data))
print("Result: ", result)
if __name__ == "__main__":
main()
print("--- %s seconds ---" % (time.time() - start_time))
Explanation
Random Number Generator
The following code generates data within the Range m
and quantity n
.
m, n = 1000, 15000000
data = [random.randrange(m) for i in range(n)]
Median
The Median class sorts the list of numbers and if n
is odd, returns the middle item with yield self_sortedlist[half]
. Or if n
is even, returns the mean of two middle itens of the list with yield (self_sortedlist[half] + self_sortedlist[half - 1]) // 2
Question
How do I improve the code performance? Because for a large list (100 milion), it takes --- 186.7168517112732 seconds---
on my computer.
python python-3.x statistics
python python-3.x statistics
asked Aug 15 at 13:08
danieltakeshi
1585
1585
sorting has time complexityO(n log n)
. You may want to look into Medians of Medians (scroll down for a python implementation) which has linear time complexity.
â Andre Holzner
Aug 15 at 19:20
Why do you use integer division for a list with even length? Shouldn't it be a regular division of the two center values in the sorted list?
â Daniel Lenz
Aug 15 at 21:34
1
I think there is a misunderstanding; the use of heaps and such clever data structures is needed to maintain a rolling median, i.e. you have to return the current median after every number in the input. If you want to call it only once on the final list, then a simple quickselect is enough - that gets you to O(n) time. (If you were to use this algorithm for the rolling mean case, you would need O(1+2+..+n) = O(n^2) time, which is why heaps and other structures are used.)
â Ant
Aug 15 at 22:39
1
Timsort is a sorting algorithm, and it takes O(n log n) time. Quickselect is not a sorting algorithm, and it takes O(n) time. Quickselect is much faster than timsort for this use case; the key observation is that to get the median you don't need to sort the whole vector, just as if you want to compute the minimum you don't need to sort the whole vector, you can just do one pass over the elements. With the median (and in general with order statistics) it's more complicated, so you use quickselect, but it still has a O(n) performance, much better than your sorting approach which is O(n log n)
â Ant
Aug 16 at 11:58
1
@DanielLenz Oh, It was to reuse the number as Integer. Just as the data type of the input. However, if the precision of the program is needed for decimal numbers, then it can be done.
â danieltakeshi
Aug 16 at 16:33
 |Â
show 2 more comments
sorting has time complexityO(n log n)
. You may want to look into Medians of Medians (scroll down for a python implementation) which has linear time complexity.
â Andre Holzner
Aug 15 at 19:20
Why do you use integer division for a list with even length? Shouldn't it be a regular division of the two center values in the sorted list?
â Daniel Lenz
Aug 15 at 21:34
1
I think there is a misunderstanding; the use of heaps and such clever data structures is needed to maintain a rolling median, i.e. you have to return the current median after every number in the input. If you want to call it only once on the final list, then a simple quickselect is enough - that gets you to O(n) time. (If you were to use this algorithm for the rolling mean case, you would need O(1+2+..+n) = O(n^2) time, which is why heaps and other structures are used.)
â Ant
Aug 15 at 22:39
1
Timsort is a sorting algorithm, and it takes O(n log n) time. Quickselect is not a sorting algorithm, and it takes O(n) time. Quickselect is much faster than timsort for this use case; the key observation is that to get the median you don't need to sort the whole vector, just as if you want to compute the minimum you don't need to sort the whole vector, you can just do one pass over the elements. With the median (and in general with order statistics) it's more complicated, so you use quickselect, but it still has a O(n) performance, much better than your sorting approach which is O(n log n)
â Ant
Aug 16 at 11:58
1
@DanielLenz Oh, It was to reuse the number as Integer. Just as the data type of the input. However, if the precision of the program is needed for decimal numbers, then it can be done.
â danieltakeshi
Aug 16 at 16:33
sorting has time complexity
O(n log n)
. You may want to look into Medians of Medians (scroll down for a python implementation) which has linear time complexity.â Andre Holzner
Aug 15 at 19:20
sorting has time complexity
O(n log n)
. You may want to look into Medians of Medians (scroll down for a python implementation) which has linear time complexity.â Andre Holzner
Aug 15 at 19:20
Why do you use integer division for a list with even length? Shouldn't it be a regular division of the two center values in the sorted list?
â Daniel Lenz
Aug 15 at 21:34
Why do you use integer division for a list with even length? Shouldn't it be a regular division of the two center values in the sorted list?
â Daniel Lenz
Aug 15 at 21:34
1
1
I think there is a misunderstanding; the use of heaps and such clever data structures is needed to maintain a rolling median, i.e. you have to return the current median after every number in the input. If you want to call it only once on the final list, then a simple quickselect is enough - that gets you to O(n) time. (If you were to use this algorithm for the rolling mean case, you would need O(1+2+..+n) = O(n^2) time, which is why heaps and other structures are used.)
â Ant
Aug 15 at 22:39
I think there is a misunderstanding; the use of heaps and such clever data structures is needed to maintain a rolling median, i.e. you have to return the current median after every number in the input. If you want to call it only once on the final list, then a simple quickselect is enough - that gets you to O(n) time. (If you were to use this algorithm for the rolling mean case, you would need O(1+2+..+n) = O(n^2) time, which is why heaps and other structures are used.)
â Ant
Aug 15 at 22:39
1
1
Timsort is a sorting algorithm, and it takes O(n log n) time. Quickselect is not a sorting algorithm, and it takes O(n) time. Quickselect is much faster than timsort for this use case; the key observation is that to get the median you don't need to sort the whole vector, just as if you want to compute the minimum you don't need to sort the whole vector, you can just do one pass over the elements. With the median (and in general with order statistics) it's more complicated, so you use quickselect, but it still has a O(n) performance, much better than your sorting approach which is O(n log n)
â Ant
Aug 16 at 11:58
Timsort is a sorting algorithm, and it takes O(n log n) time. Quickselect is not a sorting algorithm, and it takes O(n) time. Quickselect is much faster than timsort for this use case; the key observation is that to get the median you don't need to sort the whole vector, just as if you want to compute the minimum you don't need to sort the whole vector, you can just do one pass over the elements. With the median (and in general with order statistics) it's more complicated, so you use quickselect, but it still has a O(n) performance, much better than your sorting approach which is O(n log n)
â Ant
Aug 16 at 11:58
1
1
@DanielLenz Oh, It was to reuse the number as Integer. Just as the data type of the input. However, if the precision of the program is needed for decimal numbers, then it can be done.
â danieltakeshi
Aug 16 at 16:33
@DanielLenz Oh, It was to reuse the number as Integer. Just as the data type of the input. However, if the precision of the program is needed for decimal numbers, then it can be done.
â danieltakeshi
Aug 16 at 16:33
 |Â
show 2 more comments
4 Answers
4
active
oldest
votes
up vote
5
down vote
accepted
Performance
I instrumented your code differently:
def main():
m, n = 1000, 15000000
start_time = time.time()
data = [random.randrange(m) for i in range(n)]
print("--- %s seconds to generate" % (time.time() - start_time))
# print("Random Data: ", data)
start_time = time.time()
result = SortedList(data)
print("--- %s seconds to sort" % (time.time() - start_time))
start_time = time.time()
result = list(Median(data))
print("Result: ", result)
print("--- %s seconds ---" % (time.time() - start_time))
For me, the results were:
--- 7.407598257064819 seconds to generate
--- 4.535749673843384 seconds to sort
Result: [500]
--- 5.01109504699707 seconds ---
This shows that the generation of the random input takes longer than finding the median, and that 90% of Median()
is spent sorting (probably most of the rest is caused by converting between lists and iterators). You're unlikely to get large gains through modifications to your own part of the code.
I got better results by using Python's built-in sorted()
. We don't need any of the extra functionality of SortedList
(maintaining the invariant through adds and deletes), and a single sort (mostly in native code) gives these results:
def median(values):
sortedlist = sorted(values)
length = len(sortedlist)
half = length // 2
# return not yield; see "General review"
if length % 2 == 0:
return (sortedlist[half] + sortedlist[half - 1]) // 2
else:
return sortedlist[half]
def main():
m, n = 1000, 15000000
start_time = time.time()
data = [random.randrange(m) for i in range(n)]
print("--- %s seconds to generate" % (time.time() - start_time))
# print("Random Data: ", data)
start_time = time.time()
result = sorted(data)
print("--- %s seconds to sort" % (time.time() - start_time))
start_time = time.time()
result = median(data)
print("Result: ", result)
print("--- %s seconds ---" % (time.time() - start_time))
--- 7.638948202133179 seconds to generate
--- 3.118924617767334 seconds to sort
Result: 500
--- 3.3397886753082275 seconds ---
General review
I don't see why you yield
a single value, rather than simply returning. Is there a need for Median()
to be iterable?
Given that length
is an integer, if length % 2
is not 0, it must be 1 - so the elif
could easily be else
.
You are right on the General Review, yield is only necessary for the running median, so i changed toreturn
and the result toresult = Median(data)
. Else was used instead of elif. Is there any algorithm that generates the list on an improved way, faster?
â danieltakeshi
Aug 15 at 13:52
Yes, use Python's built-in sort for a one-time sorting of a list - see edit.
â Toby Speight
Aug 15 at 14:20
add a comment |Â
up vote
3
down vote
If you want to time something, you should time just that. When I ran your original code, I got 6.023478031158447 seconds. When I instead did
start_time = time.time()
result = list(Median(data))
end_time = time.time()
print("Result: ", result, end_time-start_time)
I got 1.8368241786956787. Note that calling time.time()
before the print
statement means that the execution of the print
statement isn't included in your calculation of elapsed time.
I also tried
import statistics
start_time = time.time()
result = statistics.median(data)
end_time = time.time()
print("Result: ", result, end_time-start_time)
and got 0.7475762367248535.
What do mean by "not using heap", and why do you want that? What makes you think SortedList doesn't use a heap? If you're going to use pre-made functions, why not just use the pre-made median
function?
Just taking some challenges to learn Python. It was without heap, because there are many answers and duplicates. Thanks for the tip for measuring the time.
â danieltakeshi
Aug 15 at 19:56
add a comment |Â
up vote
2
down vote
Thanks to Toby now the program takes:
--- 7.169409990310669 seconds to finish the program---
instead of --- 186.7168517112732 seconds---
for 100 milion numbers.
One other improvement I made was to add an extra optimization on the random generating algorithm, using numpy. So numpy.sort()
was used to sort the numpy array at the start of median()
, because it is faster for numpy arrays.
Code
import time
import numpy
start_time0 = time.time()
def median(values):
sortedlist = numpy.sort(values)
length = len(sortedlist)
half = length // 2
if length % 2 == 0:
return (sortedlist[half] + sortedlist[half - 1]) // 2
else:
return sortedlist[half]
def main():
m, n = 1000, 100000000
# Random Number Generator
start_time = time.time()
data = numpy.random.randint(1, m, n)
print("--- %s seconds to numpy random---" % (time.time() - start_time))
# Median
start_time = time.time()
result = median(data)
print("Result: ", result)
print("--- %s seconds to find the Median---" % (time.time() - start_time))
if __name__ == "__main__":
main()
print("--- %s seconds to finish the program---" % (time.time() - start_time0))
In this case, you might want to consider posting another question if you want another review? As it is, your newly refactored code isn't more helpful than the already accepted answer :)
â IEatBagels
Aug 15 at 19:44
3
Oh forgive us - it was difficult to see the implicit change. Hopefully my edit makes it easier for others to see
â Sá´Âá´ Oná´Âá´Âá´Â
Aug 15 at 19:47
add a comment |Â
up vote
0
down vote
just a note :
you can enhance just a tidbit by defining the name in the list comprehension
from
data = [random.randrange(m) for i in range(n)]
to
data = [random.randrange(m) for num in range(n)]
add a comment |Â
4 Answers
4
active
oldest
votes
4 Answers
4
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
5
down vote
accepted
Performance
I instrumented your code differently:
def main():
m, n = 1000, 15000000
start_time = time.time()
data = [random.randrange(m) for i in range(n)]
print("--- %s seconds to generate" % (time.time() - start_time))
# print("Random Data: ", data)
start_time = time.time()
result = SortedList(data)
print("--- %s seconds to sort" % (time.time() - start_time))
start_time = time.time()
result = list(Median(data))
print("Result: ", result)
print("--- %s seconds ---" % (time.time() - start_time))
For me, the results were:
--- 7.407598257064819 seconds to generate
--- 4.535749673843384 seconds to sort
Result: [500]
--- 5.01109504699707 seconds ---
This shows that the generation of the random input takes longer than finding the median, and that 90% of Median()
is spent sorting (probably most of the rest is caused by converting between lists and iterators). You're unlikely to get large gains through modifications to your own part of the code.
I got better results by using Python's built-in sorted()
. We don't need any of the extra functionality of SortedList
(maintaining the invariant through adds and deletes), and a single sort (mostly in native code) gives these results:
def median(values):
sortedlist = sorted(values)
length = len(sortedlist)
half = length // 2
# return not yield; see "General review"
if length % 2 == 0:
return (sortedlist[half] + sortedlist[half - 1]) // 2
else:
return sortedlist[half]
def main():
m, n = 1000, 15000000
start_time = time.time()
data = [random.randrange(m) for i in range(n)]
print("--- %s seconds to generate" % (time.time() - start_time))
# print("Random Data: ", data)
start_time = time.time()
result = sorted(data)
print("--- %s seconds to sort" % (time.time() - start_time))
start_time = time.time()
result = median(data)
print("Result: ", result)
print("--- %s seconds ---" % (time.time() - start_time))
--- 7.638948202133179 seconds to generate
--- 3.118924617767334 seconds to sort
Result: 500
--- 3.3397886753082275 seconds ---
General review
I don't see why you yield
a single value, rather than simply returning. Is there a need for Median()
to be iterable?
Given that length
is an integer, if length % 2
is not 0, it must be 1 - so the elif
could easily be else
.
You are right on the General Review, yield is only necessary for the running median, so i changed toreturn
and the result toresult = Median(data)
. Else was used instead of elif. Is there any algorithm that generates the list on an improved way, faster?
â danieltakeshi
Aug 15 at 13:52
Yes, use Python's built-in sort for a one-time sorting of a list - see edit.
â Toby Speight
Aug 15 at 14:20
add a comment |Â
up vote
5
down vote
accepted
Performance
I instrumented your code differently:
def main():
m, n = 1000, 15000000
start_time = time.time()
data = [random.randrange(m) for i in range(n)]
print("--- %s seconds to generate" % (time.time() - start_time))
# print("Random Data: ", data)
start_time = time.time()
result = SortedList(data)
print("--- %s seconds to sort" % (time.time() - start_time))
start_time = time.time()
result = list(Median(data))
print("Result: ", result)
print("--- %s seconds ---" % (time.time() - start_time))
For me, the results were:
--- 7.407598257064819 seconds to generate
--- 4.535749673843384 seconds to sort
Result: [500]
--- 5.01109504699707 seconds ---
This shows that the generation of the random input takes longer than finding the median, and that 90% of Median()
is spent sorting (probably most of the rest is caused by converting between lists and iterators). You're unlikely to get large gains through modifications to your own part of the code.
I got better results by using Python's built-in sorted()
. We don't need any of the extra functionality of SortedList
(maintaining the invariant through adds and deletes), and a single sort (mostly in native code) gives these results:
def median(values):
sortedlist = sorted(values)
length = len(sortedlist)
half = length // 2
# return not yield; see "General review"
if length % 2 == 0:
return (sortedlist[half] + sortedlist[half - 1]) // 2
else:
return sortedlist[half]
def main():
m, n = 1000, 15000000
start_time = time.time()
data = [random.randrange(m) for i in range(n)]
print("--- %s seconds to generate" % (time.time() - start_time))
# print("Random Data: ", data)
start_time = time.time()
result = sorted(data)
print("--- %s seconds to sort" % (time.time() - start_time))
start_time = time.time()
result = median(data)
print("Result: ", result)
print("--- %s seconds ---" % (time.time() - start_time))
--- 7.638948202133179 seconds to generate
--- 3.118924617767334 seconds to sort
Result: 500
--- 3.3397886753082275 seconds ---
General review
I don't see why you yield
a single value, rather than simply returning. Is there a need for Median()
to be iterable?
Given that length
is an integer, if length % 2
is not 0, it must be 1 - so the elif
could easily be else
.
You are right on the General Review, yield is only necessary for the running median, so i changed toreturn
and the result toresult = Median(data)
. Else was used instead of elif. Is there any algorithm that generates the list on an improved way, faster?
â danieltakeshi
Aug 15 at 13:52
Yes, use Python's built-in sort for a one-time sorting of a list - see edit.
â Toby Speight
Aug 15 at 14:20
add a comment |Â
up vote
5
down vote
accepted
up vote
5
down vote
accepted
Performance
I instrumented your code differently:
def main():
m, n = 1000, 15000000
start_time = time.time()
data = [random.randrange(m) for i in range(n)]
print("--- %s seconds to generate" % (time.time() - start_time))
# print("Random Data: ", data)
start_time = time.time()
result = SortedList(data)
print("--- %s seconds to sort" % (time.time() - start_time))
start_time = time.time()
result = list(Median(data))
print("Result: ", result)
print("--- %s seconds ---" % (time.time() - start_time))
For me, the results were:
--- 7.407598257064819 seconds to generate
--- 4.535749673843384 seconds to sort
Result: [500]
--- 5.01109504699707 seconds ---
This shows that the generation of the random input takes longer than finding the median, and that 90% of Median()
is spent sorting (probably most of the rest is caused by converting between lists and iterators). You're unlikely to get large gains through modifications to your own part of the code.
I got better results by using Python's built-in sorted()
. We don't need any of the extra functionality of SortedList
(maintaining the invariant through adds and deletes), and a single sort (mostly in native code) gives these results:
def median(values):
sortedlist = sorted(values)
length = len(sortedlist)
half = length // 2
# return not yield; see "General review"
if length % 2 == 0:
return (sortedlist[half] + sortedlist[half - 1]) // 2
else:
return sortedlist[half]
def main():
m, n = 1000, 15000000
start_time = time.time()
data = [random.randrange(m) for i in range(n)]
print("--- %s seconds to generate" % (time.time() - start_time))
# print("Random Data: ", data)
start_time = time.time()
result = sorted(data)
print("--- %s seconds to sort" % (time.time() - start_time))
start_time = time.time()
result = median(data)
print("Result: ", result)
print("--- %s seconds ---" % (time.time() - start_time))
--- 7.638948202133179 seconds to generate
--- 3.118924617767334 seconds to sort
Result: 500
--- 3.3397886753082275 seconds ---
General review
I don't see why you yield
a single value, rather than simply returning. Is there a need for Median()
to be iterable?
Given that length
is an integer, if length % 2
is not 0, it must be 1 - so the elif
could easily be else
.
Performance
I instrumented your code differently:
def main():
m, n = 1000, 15000000
start_time = time.time()
data = [random.randrange(m) for i in range(n)]
print("--- %s seconds to generate" % (time.time() - start_time))
# print("Random Data: ", data)
start_time = time.time()
result = SortedList(data)
print("--- %s seconds to sort" % (time.time() - start_time))
start_time = time.time()
result = list(Median(data))
print("Result: ", result)
print("--- %s seconds ---" % (time.time() - start_time))
For me, the results were:
--- 7.407598257064819 seconds to generate
--- 4.535749673843384 seconds to sort
Result: [500]
--- 5.01109504699707 seconds ---
This shows that the generation of the random input takes longer than finding the median, and that 90% of Median()
is spent sorting (probably most of the rest is caused by converting between lists and iterators). You're unlikely to get large gains through modifications to your own part of the code.
I got better results by using Python's built-in sorted()
. We don't need any of the extra functionality of SortedList
(maintaining the invariant through adds and deletes), and a single sort (mostly in native code) gives these results:
def median(values):
sortedlist = sorted(values)
length = len(sortedlist)
half = length // 2
# return not yield; see "General review"
if length % 2 == 0:
return (sortedlist[half] + sortedlist[half - 1]) // 2
else:
return sortedlist[half]
def main():
m, n = 1000, 15000000
start_time = time.time()
data = [random.randrange(m) for i in range(n)]
print("--- %s seconds to generate" % (time.time() - start_time))
# print("Random Data: ", data)
start_time = time.time()
result = sorted(data)
print("--- %s seconds to sort" % (time.time() - start_time))
start_time = time.time()
result = median(data)
print("Result: ", result)
print("--- %s seconds ---" % (time.time() - start_time))
--- 7.638948202133179 seconds to generate
--- 3.118924617767334 seconds to sort
Result: 500
--- 3.3397886753082275 seconds ---
General review
I don't see why you yield
a single value, rather than simply returning. Is there a need for Median()
to be iterable?
Given that length
is an integer, if length % 2
is not 0, it must be 1 - so the elif
could easily be else
.
edited Aug 15 at 14:19
answered Aug 15 at 13:39
Toby Speight
19k43697
19k43697
You are right on the General Review, yield is only necessary for the running median, so i changed toreturn
and the result toresult = Median(data)
. Else was used instead of elif. Is there any algorithm that generates the list on an improved way, faster?
â danieltakeshi
Aug 15 at 13:52
Yes, use Python's built-in sort for a one-time sorting of a list - see edit.
â Toby Speight
Aug 15 at 14:20
add a comment |Â
You are right on the General Review, yield is only necessary for the running median, so i changed toreturn
and the result toresult = Median(data)
. Else was used instead of elif. Is there any algorithm that generates the list on an improved way, faster?
â danieltakeshi
Aug 15 at 13:52
Yes, use Python's built-in sort for a one-time sorting of a list - see edit.
â Toby Speight
Aug 15 at 14:20
You are right on the General Review, yield is only necessary for the running median, so i changed to
return
and the result to result = Median(data)
. Else was used instead of elif. Is there any algorithm that generates the list on an improved way, faster?â danieltakeshi
Aug 15 at 13:52
You are right on the General Review, yield is only necessary for the running median, so i changed to
return
and the result to result = Median(data)
. Else was used instead of elif. Is there any algorithm that generates the list on an improved way, faster?â danieltakeshi
Aug 15 at 13:52
Yes, use Python's built-in sort for a one-time sorting of a list - see edit.
â Toby Speight
Aug 15 at 14:20
Yes, use Python's built-in sort for a one-time sorting of a list - see edit.
â Toby Speight
Aug 15 at 14:20
add a comment |Â
up vote
3
down vote
If you want to time something, you should time just that. When I ran your original code, I got 6.023478031158447 seconds. When I instead did
start_time = time.time()
result = list(Median(data))
end_time = time.time()
print("Result: ", result, end_time-start_time)
I got 1.8368241786956787. Note that calling time.time()
before the print
statement means that the execution of the print
statement isn't included in your calculation of elapsed time.
I also tried
import statistics
start_time = time.time()
result = statistics.median(data)
end_time = time.time()
print("Result: ", result, end_time-start_time)
and got 0.7475762367248535.
What do mean by "not using heap", and why do you want that? What makes you think SortedList doesn't use a heap? If you're going to use pre-made functions, why not just use the pre-made median
function?
Just taking some challenges to learn Python. It was without heap, because there are many answers and duplicates. Thanks for the tip for measuring the time.
â danieltakeshi
Aug 15 at 19:56
add a comment |Â
up vote
3
down vote
If you want to time something, you should time just that. When I ran your original code, I got 6.023478031158447 seconds. When I instead did
start_time = time.time()
result = list(Median(data))
end_time = time.time()
print("Result: ", result, end_time-start_time)
I got 1.8368241786956787. Note that calling time.time()
before the print
statement means that the execution of the print
statement isn't included in your calculation of elapsed time.
I also tried
import statistics
start_time = time.time()
result = statistics.median(data)
end_time = time.time()
print("Result: ", result, end_time-start_time)
and got 0.7475762367248535.
What do mean by "not using heap", and why do you want that? What makes you think SortedList doesn't use a heap? If you're going to use pre-made functions, why not just use the pre-made median
function?
Just taking some challenges to learn Python. It was without heap, because there are many answers and duplicates. Thanks for the tip for measuring the time.
â danieltakeshi
Aug 15 at 19:56
add a comment |Â
up vote
3
down vote
up vote
3
down vote
If you want to time something, you should time just that. When I ran your original code, I got 6.023478031158447 seconds. When I instead did
start_time = time.time()
result = list(Median(data))
end_time = time.time()
print("Result: ", result, end_time-start_time)
I got 1.8368241786956787. Note that calling time.time()
before the print
statement means that the execution of the print
statement isn't included in your calculation of elapsed time.
I also tried
import statistics
start_time = time.time()
result = statistics.median(data)
end_time = time.time()
print("Result: ", result, end_time-start_time)
and got 0.7475762367248535.
What do mean by "not using heap", and why do you want that? What makes you think SortedList doesn't use a heap? If you're going to use pre-made functions, why not just use the pre-made median
function?
If you want to time something, you should time just that. When I ran your original code, I got 6.023478031158447 seconds. When I instead did
start_time = time.time()
result = list(Median(data))
end_time = time.time()
print("Result: ", result, end_time-start_time)
I got 1.8368241786956787. Note that calling time.time()
before the print
statement means that the execution of the print
statement isn't included in your calculation of elapsed time.
I also tried
import statistics
start_time = time.time()
result = statistics.median(data)
end_time = time.time()
print("Result: ", result, end_time-start_time)
and got 0.7475762367248535.
What do mean by "not using heap", and why do you want that? What makes you think SortedList doesn't use a heap? If you're going to use pre-made functions, why not just use the pre-made median
function?
answered Aug 15 at 19:47
Acccumulation
1,01915
1,01915
Just taking some challenges to learn Python. It was without heap, because there are many answers and duplicates. Thanks for the tip for measuring the time.
â danieltakeshi
Aug 15 at 19:56
add a comment |Â
Just taking some challenges to learn Python. It was without heap, because there are many answers and duplicates. Thanks for the tip for measuring the time.
â danieltakeshi
Aug 15 at 19:56
Just taking some challenges to learn Python. It was without heap, because there are many answers and duplicates. Thanks for the tip for measuring the time.
â danieltakeshi
Aug 15 at 19:56
Just taking some challenges to learn Python. It was without heap, because there are many answers and duplicates. Thanks for the tip for measuring the time.
â danieltakeshi
Aug 15 at 19:56
add a comment |Â
up vote
2
down vote
Thanks to Toby now the program takes:
--- 7.169409990310669 seconds to finish the program---
instead of --- 186.7168517112732 seconds---
for 100 milion numbers.
One other improvement I made was to add an extra optimization on the random generating algorithm, using numpy. So numpy.sort()
was used to sort the numpy array at the start of median()
, because it is faster for numpy arrays.
Code
import time
import numpy
start_time0 = time.time()
def median(values):
sortedlist = numpy.sort(values)
length = len(sortedlist)
half = length // 2
if length % 2 == 0:
return (sortedlist[half] + sortedlist[half - 1]) // 2
else:
return sortedlist[half]
def main():
m, n = 1000, 100000000
# Random Number Generator
start_time = time.time()
data = numpy.random.randint(1, m, n)
print("--- %s seconds to numpy random---" % (time.time() - start_time))
# Median
start_time = time.time()
result = median(data)
print("Result: ", result)
print("--- %s seconds to find the Median---" % (time.time() - start_time))
if __name__ == "__main__":
main()
print("--- %s seconds to finish the program---" % (time.time() - start_time0))
In this case, you might want to consider posting another question if you want another review? As it is, your newly refactored code isn't more helpful than the already accepted answer :)
â IEatBagels
Aug 15 at 19:44
3
Oh forgive us - it was difficult to see the implicit change. Hopefully my edit makes it easier for others to see
â Sá´Âá´ Oná´Âá´Âá´Â
Aug 15 at 19:47
add a comment |Â
up vote
2
down vote
Thanks to Toby now the program takes:
--- 7.169409990310669 seconds to finish the program---
instead of --- 186.7168517112732 seconds---
for 100 milion numbers.
One other improvement I made was to add an extra optimization on the random generating algorithm, using numpy. So numpy.sort()
was used to sort the numpy array at the start of median()
, because it is faster for numpy arrays.
Code
import time
import numpy
start_time0 = time.time()
def median(values):
sortedlist = numpy.sort(values)
length = len(sortedlist)
half = length // 2
if length % 2 == 0:
return (sortedlist[half] + sortedlist[half - 1]) // 2
else:
return sortedlist[half]
def main():
m, n = 1000, 100000000
# Random Number Generator
start_time = time.time()
data = numpy.random.randint(1, m, n)
print("--- %s seconds to numpy random---" % (time.time() - start_time))
# Median
start_time = time.time()
result = median(data)
print("Result: ", result)
print("--- %s seconds to find the Median---" % (time.time() - start_time))
if __name__ == "__main__":
main()
print("--- %s seconds to finish the program---" % (time.time() - start_time0))
In this case, you might want to consider posting another question if you want another review? As it is, your newly refactored code isn't more helpful than the already accepted answer :)
â IEatBagels
Aug 15 at 19:44
3
Oh forgive us - it was difficult to see the implicit change. Hopefully my edit makes it easier for others to see
â Sá´Âá´ Oná´Âá´Âá´Â
Aug 15 at 19:47
add a comment |Â
up vote
2
down vote
up vote
2
down vote
Thanks to Toby now the program takes:
--- 7.169409990310669 seconds to finish the program---
instead of --- 186.7168517112732 seconds---
for 100 milion numbers.
One other improvement I made was to add an extra optimization on the random generating algorithm, using numpy. So numpy.sort()
was used to sort the numpy array at the start of median()
, because it is faster for numpy arrays.
Code
import time
import numpy
start_time0 = time.time()
def median(values):
sortedlist = numpy.sort(values)
length = len(sortedlist)
half = length // 2
if length % 2 == 0:
return (sortedlist[half] + sortedlist[half - 1]) // 2
else:
return sortedlist[half]
def main():
m, n = 1000, 100000000
# Random Number Generator
start_time = time.time()
data = numpy.random.randint(1, m, n)
print("--- %s seconds to numpy random---" % (time.time() - start_time))
# Median
start_time = time.time()
result = median(data)
print("Result: ", result)
print("--- %s seconds to find the Median---" % (time.time() - start_time))
if __name__ == "__main__":
main()
print("--- %s seconds to finish the program---" % (time.time() - start_time0))
Thanks to Toby now the program takes:
--- 7.169409990310669 seconds to finish the program---
instead of --- 186.7168517112732 seconds---
for 100 milion numbers.
One other improvement I made was to add an extra optimization on the random generating algorithm, using numpy. So numpy.sort()
was used to sort the numpy array at the start of median()
, because it is faster for numpy arrays.
Code
import time
import numpy
start_time0 = time.time()
def median(values):
sortedlist = numpy.sort(values)
length = len(sortedlist)
half = length // 2
if length % 2 == 0:
return (sortedlist[half] + sortedlist[half - 1]) // 2
else:
return sortedlist[half]
def main():
m, n = 1000, 100000000
# Random Number Generator
start_time = time.time()
data = numpy.random.randint(1, m, n)
print("--- %s seconds to numpy random---" % (time.time() - start_time))
# Median
start_time = time.time()
result = median(data)
print("Result: ", result)
print("--- %s seconds to find the Median---" % (time.time() - start_time))
if __name__ == "__main__":
main()
print("--- %s seconds to finish the program---" % (time.time() - start_time0))
edited Aug 15 at 19:50
Sá´Âá´ Oná´Âá´Âá´Â
6,86361645
6,86361645
answered Aug 15 at 17:34
danieltakeshi
1585
1585
In this case, you might want to consider posting another question if you want another review? As it is, your newly refactored code isn't more helpful than the already accepted answer :)
â IEatBagels
Aug 15 at 19:44
3
Oh forgive us - it was difficult to see the implicit change. Hopefully my edit makes it easier for others to see
â Sá´Âá´ Oná´Âá´Âá´Â
Aug 15 at 19:47
add a comment |Â
In this case, you might want to consider posting another question if you want another review? As it is, your newly refactored code isn't more helpful than the already accepted answer :)
â IEatBagels
Aug 15 at 19:44
3
Oh forgive us - it was difficult to see the implicit change. Hopefully my edit makes it easier for others to see
â Sá´Âá´ Oná´Âá´Âá´Â
Aug 15 at 19:47
In this case, you might want to consider posting another question if you want another review? As it is, your newly refactored code isn't more helpful than the already accepted answer :)
â IEatBagels
Aug 15 at 19:44
In this case, you might want to consider posting another question if you want another review? As it is, your newly refactored code isn't more helpful than the already accepted answer :)
â IEatBagels
Aug 15 at 19:44
3
3
Oh forgive us - it was difficult to see the implicit change. Hopefully my edit makes it easier for others to see
â Sá´Âá´ Oná´Âá´Âá´Â
Aug 15 at 19:47
Oh forgive us - it was difficult to see the implicit change. Hopefully my edit makes it easier for others to see
â Sá´Âá´ Oná´Âá´Âá´Â
Aug 15 at 19:47
add a comment |Â
up vote
0
down vote
just a note :
you can enhance just a tidbit by defining the name in the list comprehension
from
data = [random.randrange(m) for i in range(n)]
to
data = [random.randrange(m) for num in range(n)]
add a comment |Â
up vote
0
down vote
just a note :
you can enhance just a tidbit by defining the name in the list comprehension
from
data = [random.randrange(m) for i in range(n)]
to
data = [random.randrange(m) for num in range(n)]
add a comment |Â
up vote
0
down vote
up vote
0
down vote
just a note :
you can enhance just a tidbit by defining the name in the list comprehension
from
data = [random.randrange(m) for i in range(n)]
to
data = [random.randrange(m) for num in range(n)]
just a note :
you can enhance just a tidbit by defining the name in the list comprehension
from
data = [random.randrange(m) for i in range(n)]
to
data = [random.randrange(m) for num in range(n)]
answered Aug 16 at 19:51
Abdur-Rahmaan Janhangeer
21811
21811
add a comment |Â
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f201727%2fmedian-calculation-of-list-of-integers-without-using-heap%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
sorting has time complexity
O(n log n)
. You may want to look into Medians of Medians (scroll down for a python implementation) which has linear time complexity.â Andre Holzner
Aug 15 at 19:20
Why do you use integer division for a list with even length? Shouldn't it be a regular division of the two center values in the sorted list?
â Daniel Lenz
Aug 15 at 21:34
1
I think there is a misunderstanding; the use of heaps and such clever data structures is needed to maintain a rolling median, i.e. you have to return the current median after every number in the input. If you want to call it only once on the final list, then a simple quickselect is enough - that gets you to O(n) time. (If you were to use this algorithm for the rolling mean case, you would need O(1+2+..+n) = O(n^2) time, which is why heaps and other structures are used.)
â Ant
Aug 15 at 22:39
1
Timsort is a sorting algorithm, and it takes O(n log n) time. Quickselect is not a sorting algorithm, and it takes O(n) time. Quickselect is much faster than timsort for this use case; the key observation is that to get the median you don't need to sort the whole vector, just as if you want to compute the minimum you don't need to sort the whole vector, you can just do one pass over the elements. With the median (and in general with order statistics) it's more complicated, so you use quickselect, but it still has a O(n) performance, much better than your sorting approach which is O(n log n)
â Ant
Aug 16 at 11:58
1
@DanielLenz Oh, It was to reuse the number as Integer. Just as the data type of the input. However, if the precision of the program is needed for decimal numbers, then it can be done.
â danieltakeshi
Aug 16 at 16:33