Median Calculation of List of Integers without using heap

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP











up vote
7
down vote

favorite












References



Using those references:



  • rolling median implementations benchmark

  • FAST RUNNING MEDIAN USING AN INDEXABLE SKIPLIST

  • Calculate the median of a billion numbers

  • Find running median from a stream of integers

Code



A code was written to calculate the Median with the SortedContainers library:



from itertools import islice
from sortedcontainers import SortedList
import random
import time

start_time = time.time()


class Median(object):

def __init__(self, iterable):
self._iterable = islice(iterable, None)
self._sortedlist = SortedList(self._iterable)

def __iter__(self):
self_sortedlist = self._sortedlist
# print(self_sortedlist)
length = len(self_sortedlist)
half = length // 2
if length % 2 == 0:
yield (self_sortedlist[half] + self_sortedlist[half - 1]) // 2
elif length % 2 == 1:
yield self_sortedlist[half]


def main():
m, n = 1000, 1500000
data = [random.randrange(m) for i in range(n)]
# print("Random Data: ", data)
result = list(Median(data))
print("Result: ", result)


if __name__ == "__main__":

main()
print("--- %s seconds ---" % (time.time() - start_time))


Explanation



Random Number Generator



The following code generates data within the Range m and quantity n.



m, n = 1000, 15000000
data = [random.randrange(m) for i in range(n)]


Median



The Median class sorts the list of numbers and if n is odd, returns the middle item with yield self_sortedlist[half]. Or if n is even, returns the mean of two middle itens of the list with yield (self_sortedlist[half] + self_sortedlist[half - 1]) // 2



Question



How do I improve the code performance? Because for a large list (100 milion), it takes --- 186.7168517112732 seconds--- on my computer.










share|improve this question





















  • sorting has time complexity O(n log n). You may want to look into Medians of Medians (scroll down for a python implementation) which has linear time complexity.
    – Andre Holzner
    Aug 15 at 19:20











  • Why do you use integer division for a list with even length? Shouldn't it be a regular division of the two center values in the sorted list?
    – Daniel Lenz
    Aug 15 at 21:34






  • 1




    I think there is a misunderstanding; the use of heaps and such clever data structures is needed to maintain a rolling median, i.e. you have to return the current median after every number in the input. If you want to call it only once on the final list, then a simple quickselect is enough - that gets you to O(n) time. (If you were to use this algorithm for the rolling mean case, you would need O(1+2+..+n) = O(n^2) time, which is why heaps and other structures are used.)
    – Ant
    Aug 15 at 22:39







  • 1




    Timsort is a sorting algorithm, and it takes O(n log n) time. Quickselect is not a sorting algorithm, and it takes O(n) time. Quickselect is much faster than timsort for this use case; the key observation is that to get the median you don't need to sort the whole vector, just as if you want to compute the minimum you don't need to sort the whole vector, you can just do one pass over the elements. With the median (and in general with order statistics) it's more complicated, so you use quickselect, but it still has a O(n) performance, much better than your sorting approach which is O(n log n)
    – Ant
    Aug 16 at 11:58







  • 1




    @DanielLenz Oh, It was to reuse the number as Integer. Just as the data type of the input. However, if the precision of the program is needed for decimal numbers, then it can be done.
    – danieltakeshi
    Aug 16 at 16:33














up vote
7
down vote

favorite












References



Using those references:



  • rolling median implementations benchmark

  • FAST RUNNING MEDIAN USING AN INDEXABLE SKIPLIST

  • Calculate the median of a billion numbers

  • Find running median from a stream of integers

Code



A code was written to calculate the Median with the SortedContainers library:



from itertools import islice
from sortedcontainers import SortedList
import random
import time

start_time = time.time()


class Median(object):

def __init__(self, iterable):
self._iterable = islice(iterable, None)
self._sortedlist = SortedList(self._iterable)

def __iter__(self):
self_sortedlist = self._sortedlist
# print(self_sortedlist)
length = len(self_sortedlist)
half = length // 2
if length % 2 == 0:
yield (self_sortedlist[half] + self_sortedlist[half - 1]) // 2
elif length % 2 == 1:
yield self_sortedlist[half]


def main():
m, n = 1000, 1500000
data = [random.randrange(m) for i in range(n)]
# print("Random Data: ", data)
result = list(Median(data))
print("Result: ", result)


if __name__ == "__main__":

main()
print("--- %s seconds ---" % (time.time() - start_time))


Explanation



Random Number Generator



The following code generates data within the Range m and quantity n.



m, n = 1000, 15000000
data = [random.randrange(m) for i in range(n)]


Median



The Median class sorts the list of numbers and if n is odd, returns the middle item with yield self_sortedlist[half]. Or if n is even, returns the mean of two middle itens of the list with yield (self_sortedlist[half] + self_sortedlist[half - 1]) // 2



Question



How do I improve the code performance? Because for a large list (100 milion), it takes --- 186.7168517112732 seconds--- on my computer.










share|improve this question





















  • sorting has time complexity O(n log n). You may want to look into Medians of Medians (scroll down for a python implementation) which has linear time complexity.
    – Andre Holzner
    Aug 15 at 19:20











  • Why do you use integer division for a list with even length? Shouldn't it be a regular division of the two center values in the sorted list?
    – Daniel Lenz
    Aug 15 at 21:34






  • 1




    I think there is a misunderstanding; the use of heaps and such clever data structures is needed to maintain a rolling median, i.e. you have to return the current median after every number in the input. If you want to call it only once on the final list, then a simple quickselect is enough - that gets you to O(n) time. (If you were to use this algorithm for the rolling mean case, you would need O(1+2+..+n) = O(n^2) time, which is why heaps and other structures are used.)
    – Ant
    Aug 15 at 22:39







  • 1




    Timsort is a sorting algorithm, and it takes O(n log n) time. Quickselect is not a sorting algorithm, and it takes O(n) time. Quickselect is much faster than timsort for this use case; the key observation is that to get the median you don't need to sort the whole vector, just as if you want to compute the minimum you don't need to sort the whole vector, you can just do one pass over the elements. With the median (and in general with order statistics) it's more complicated, so you use quickselect, but it still has a O(n) performance, much better than your sorting approach which is O(n log n)
    – Ant
    Aug 16 at 11:58







  • 1




    @DanielLenz Oh, It was to reuse the number as Integer. Just as the data type of the input. However, if the precision of the program is needed for decimal numbers, then it can be done.
    – danieltakeshi
    Aug 16 at 16:33












up vote
7
down vote

favorite









up vote
7
down vote

favorite











References



Using those references:



  • rolling median implementations benchmark

  • FAST RUNNING MEDIAN USING AN INDEXABLE SKIPLIST

  • Calculate the median of a billion numbers

  • Find running median from a stream of integers

Code



A code was written to calculate the Median with the SortedContainers library:



from itertools import islice
from sortedcontainers import SortedList
import random
import time

start_time = time.time()


class Median(object):

def __init__(self, iterable):
self._iterable = islice(iterable, None)
self._sortedlist = SortedList(self._iterable)

def __iter__(self):
self_sortedlist = self._sortedlist
# print(self_sortedlist)
length = len(self_sortedlist)
half = length // 2
if length % 2 == 0:
yield (self_sortedlist[half] + self_sortedlist[half - 1]) // 2
elif length % 2 == 1:
yield self_sortedlist[half]


def main():
m, n = 1000, 1500000
data = [random.randrange(m) for i in range(n)]
# print("Random Data: ", data)
result = list(Median(data))
print("Result: ", result)


if __name__ == "__main__":

main()
print("--- %s seconds ---" % (time.time() - start_time))


Explanation



Random Number Generator



The following code generates data within the Range m and quantity n.



m, n = 1000, 15000000
data = [random.randrange(m) for i in range(n)]


Median



The Median class sorts the list of numbers and if n is odd, returns the middle item with yield self_sortedlist[half]. Or if n is even, returns the mean of two middle itens of the list with yield (self_sortedlist[half] + self_sortedlist[half - 1]) // 2



Question



How do I improve the code performance? Because for a large list (100 milion), it takes --- 186.7168517112732 seconds--- on my computer.










share|improve this question













References



Using those references:



  • rolling median implementations benchmark

  • FAST RUNNING MEDIAN USING AN INDEXABLE SKIPLIST

  • Calculate the median of a billion numbers

  • Find running median from a stream of integers

Code



A code was written to calculate the Median with the SortedContainers library:



from itertools import islice
from sortedcontainers import SortedList
import random
import time

start_time = time.time()


class Median(object):

def __init__(self, iterable):
self._iterable = islice(iterable, None)
self._sortedlist = SortedList(self._iterable)

def __iter__(self):
self_sortedlist = self._sortedlist
# print(self_sortedlist)
length = len(self_sortedlist)
half = length // 2
if length % 2 == 0:
yield (self_sortedlist[half] + self_sortedlist[half - 1]) // 2
elif length % 2 == 1:
yield self_sortedlist[half]


def main():
m, n = 1000, 1500000
data = [random.randrange(m) for i in range(n)]
# print("Random Data: ", data)
result = list(Median(data))
print("Result: ", result)


if __name__ == "__main__":

main()
print("--- %s seconds ---" % (time.time() - start_time))


Explanation



Random Number Generator



The following code generates data within the Range m and quantity n.



m, n = 1000, 15000000
data = [random.randrange(m) for i in range(n)]


Median



The Median class sorts the list of numbers and if n is odd, returns the middle item with yield self_sortedlist[half]. Or if n is even, returns the mean of two middle itens of the list with yield (self_sortedlist[half] + self_sortedlist[half - 1]) // 2



Question



How do I improve the code performance? Because for a large list (100 milion), it takes --- 186.7168517112732 seconds--- on my computer.







python python-3.x statistics






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Aug 15 at 13:08









danieltakeshi

1585




1585











  • sorting has time complexity O(n log n). You may want to look into Medians of Medians (scroll down for a python implementation) which has linear time complexity.
    – Andre Holzner
    Aug 15 at 19:20











  • Why do you use integer division for a list with even length? Shouldn't it be a regular division of the two center values in the sorted list?
    – Daniel Lenz
    Aug 15 at 21:34






  • 1




    I think there is a misunderstanding; the use of heaps and such clever data structures is needed to maintain a rolling median, i.e. you have to return the current median after every number in the input. If you want to call it only once on the final list, then a simple quickselect is enough - that gets you to O(n) time. (If you were to use this algorithm for the rolling mean case, you would need O(1+2+..+n) = O(n^2) time, which is why heaps and other structures are used.)
    – Ant
    Aug 15 at 22:39







  • 1




    Timsort is a sorting algorithm, and it takes O(n log n) time. Quickselect is not a sorting algorithm, and it takes O(n) time. Quickselect is much faster than timsort for this use case; the key observation is that to get the median you don't need to sort the whole vector, just as if you want to compute the minimum you don't need to sort the whole vector, you can just do one pass over the elements. With the median (and in general with order statistics) it's more complicated, so you use quickselect, but it still has a O(n) performance, much better than your sorting approach which is O(n log n)
    – Ant
    Aug 16 at 11:58







  • 1




    @DanielLenz Oh, It was to reuse the number as Integer. Just as the data type of the input. However, if the precision of the program is needed for decimal numbers, then it can be done.
    – danieltakeshi
    Aug 16 at 16:33
















  • sorting has time complexity O(n log n). You may want to look into Medians of Medians (scroll down for a python implementation) which has linear time complexity.
    – Andre Holzner
    Aug 15 at 19:20











  • Why do you use integer division for a list with even length? Shouldn't it be a regular division of the two center values in the sorted list?
    – Daniel Lenz
    Aug 15 at 21:34






  • 1




    I think there is a misunderstanding; the use of heaps and such clever data structures is needed to maintain a rolling median, i.e. you have to return the current median after every number in the input. If you want to call it only once on the final list, then a simple quickselect is enough - that gets you to O(n) time. (If you were to use this algorithm for the rolling mean case, you would need O(1+2+..+n) = O(n^2) time, which is why heaps and other structures are used.)
    – Ant
    Aug 15 at 22:39







  • 1




    Timsort is a sorting algorithm, and it takes O(n log n) time. Quickselect is not a sorting algorithm, and it takes O(n) time. Quickselect is much faster than timsort for this use case; the key observation is that to get the median you don't need to sort the whole vector, just as if you want to compute the minimum you don't need to sort the whole vector, you can just do one pass over the elements. With the median (and in general with order statistics) it's more complicated, so you use quickselect, but it still has a O(n) performance, much better than your sorting approach which is O(n log n)
    – Ant
    Aug 16 at 11:58







  • 1




    @DanielLenz Oh, It was to reuse the number as Integer. Just as the data type of the input. However, if the precision of the program is needed for decimal numbers, then it can be done.
    – danieltakeshi
    Aug 16 at 16:33















sorting has time complexity O(n log n). You may want to look into Medians of Medians (scroll down for a python implementation) which has linear time complexity.
– Andre Holzner
Aug 15 at 19:20





sorting has time complexity O(n log n). You may want to look into Medians of Medians (scroll down for a python implementation) which has linear time complexity.
– Andre Holzner
Aug 15 at 19:20













Why do you use integer division for a list with even length? Shouldn't it be a regular division of the two center values in the sorted list?
– Daniel Lenz
Aug 15 at 21:34




Why do you use integer division for a list with even length? Shouldn't it be a regular division of the two center values in the sorted list?
– Daniel Lenz
Aug 15 at 21:34




1




1




I think there is a misunderstanding; the use of heaps and such clever data structures is needed to maintain a rolling median, i.e. you have to return the current median after every number in the input. If you want to call it only once on the final list, then a simple quickselect is enough - that gets you to O(n) time. (If you were to use this algorithm for the rolling mean case, you would need O(1+2+..+n) = O(n^2) time, which is why heaps and other structures are used.)
– Ant
Aug 15 at 22:39





I think there is a misunderstanding; the use of heaps and such clever data structures is needed to maintain a rolling median, i.e. you have to return the current median after every number in the input. If you want to call it only once on the final list, then a simple quickselect is enough - that gets you to O(n) time. (If you were to use this algorithm for the rolling mean case, you would need O(1+2+..+n) = O(n^2) time, which is why heaps and other structures are used.)
– Ant
Aug 15 at 22:39





1




1




Timsort is a sorting algorithm, and it takes O(n log n) time. Quickselect is not a sorting algorithm, and it takes O(n) time. Quickselect is much faster than timsort for this use case; the key observation is that to get the median you don't need to sort the whole vector, just as if you want to compute the minimum you don't need to sort the whole vector, you can just do one pass over the elements. With the median (and in general with order statistics) it's more complicated, so you use quickselect, but it still has a O(n) performance, much better than your sorting approach which is O(n log n)
– Ant
Aug 16 at 11:58





Timsort is a sorting algorithm, and it takes O(n log n) time. Quickselect is not a sorting algorithm, and it takes O(n) time. Quickselect is much faster than timsort for this use case; the key observation is that to get the median you don't need to sort the whole vector, just as if you want to compute the minimum you don't need to sort the whole vector, you can just do one pass over the elements. With the median (and in general with order statistics) it's more complicated, so you use quickselect, but it still has a O(n) performance, much better than your sorting approach which is O(n log n)
– Ant
Aug 16 at 11:58





1




1




@DanielLenz Oh, It was to reuse the number as Integer. Just as the data type of the input. However, if the precision of the program is needed for decimal numbers, then it can be done.
– danieltakeshi
Aug 16 at 16:33




@DanielLenz Oh, It was to reuse the number as Integer. Just as the data type of the input. However, if the precision of the program is needed for decimal numbers, then it can be done.
– danieltakeshi
Aug 16 at 16:33










4 Answers
4






active

oldest

votes

















up vote
5
down vote



accepted










Performance



I instrumented your code differently:



def main():
m, n = 1000, 15000000
start_time = time.time()
data = [random.randrange(m) for i in range(n)]
print("--- %s seconds to generate" % (time.time() - start_time))
# print("Random Data: ", data)
start_time = time.time()
result = SortedList(data)
print("--- %s seconds to sort" % (time.time() - start_time))
start_time = time.time()
result = list(Median(data))
print("Result: ", result)
print("--- %s seconds ---" % (time.time() - start_time))


For me, the results were:



--- 7.407598257064819 seconds to generate
--- 4.535749673843384 seconds to sort
Result: [500]
--- 5.01109504699707 seconds ---


This shows that the generation of the random input takes longer than finding the median, and that 90% of Median() is spent sorting (probably most of the rest is caused by converting between lists and iterators). You're unlikely to get large gains through modifications to your own part of the code.



I got better results by using Python's built-in sorted(). We don't need any of the extra functionality of SortedList (maintaining the invariant through adds and deletes), and a single sort (mostly in native code) gives these results:



def median(values):
sortedlist = sorted(values)
length = len(sortedlist)
half = length // 2
# return not yield; see "General review"
if length % 2 == 0:
return (sortedlist[half] + sortedlist[half - 1]) // 2
else:
return sortedlist[half]

def main():
m, n = 1000, 15000000
start_time = time.time()
data = [random.randrange(m) for i in range(n)]
print("--- %s seconds to generate" % (time.time() - start_time))
# print("Random Data: ", data)
start_time = time.time()
result = sorted(data)
print("--- %s seconds to sort" % (time.time() - start_time))
start_time = time.time()
result = median(data)
print("Result: ", result)
print("--- %s seconds ---" % (time.time() - start_time))


--- 7.638948202133179 seconds to generate
--- 3.118924617767334 seconds to sort
Result: 500
--- 3.3397886753082275 seconds ---



General review



I don't see why you yield a single value, rather than simply returning. Is there a need for Median() to be iterable?



Given that length is an integer, if length % 2 is not 0, it must be 1 - so the elif could easily be else.






share|improve this answer






















  • You are right on the General Review, yield is only necessary for the running median, so i changed to return and the result to result = Median(data). Else was used instead of elif. Is there any algorithm that generates the list on an improved way, faster?
    – danieltakeshi
    Aug 15 at 13:52










  • Yes, use Python's built-in sort for a one-time sorting of a list - see edit.
    – Toby Speight
    Aug 15 at 14:20

















up vote
3
down vote













If you want to time something, you should time just that. When I ran your original code, I got 6.023478031158447 seconds. When I instead did



start_time = time.time()
result = list(Median(data))
end_time = time.time()
print("Result: ", result, end_time-start_time)


I got 1.8368241786956787. Note that calling time.time() before the print statement means that the execution of the print statement isn't included in your calculation of elapsed time.



I also tried



import statistics
start_time = time.time()
result = statistics.median(data)
end_time = time.time()
print("Result: ", result, end_time-start_time)


and got 0.7475762367248535.



What do mean by "not using heap", and why do you want that? What makes you think SortedList doesn't use a heap? If you're going to use pre-made functions, why not just use the pre-made median function?






share|improve this answer




















  • Just taking some challenges to learn Python. It was without heap, because there are many answers and duplicates. Thanks for the tip for measuring the time.
    – danieltakeshi
    Aug 15 at 19:56

















up vote
2
down vote













Thanks to Toby now the program takes:



--- 7.169409990310669 seconds to finish the program--- instead of --- 186.7168517112732 seconds--- for 100 milion numbers.



One other improvement I made was to add an extra optimization on the random generating algorithm, using numpy. So numpy.sort() was used to sort the numpy array at the start of median(), because it is faster for numpy arrays.



Code



import time
import numpy

start_time0 = time.time()


def median(values):
sortedlist = numpy.sort(values)
length = len(sortedlist)
half = length // 2

if length % 2 == 0:
return (sortedlist[half] + sortedlist[half - 1]) // 2
else:
return sortedlist[half]


def main():

m, n = 1000, 100000000
# Random Number Generator
start_time = time.time()
data = numpy.random.randint(1, m, n)
print("--- %s seconds to numpy random---" % (time.time() - start_time))
# Median
start_time = time.time()
result = median(data)
print("Result: ", result)
print("--- %s seconds to find the Median---" % (time.time() - start_time))


if __name__ == "__main__":

main()
print("--- %s seconds to finish the program---" % (time.time() - start_time0))





share|improve this answer






















  • In this case, you might want to consider posting another question if you want another review? As it is, your newly refactored code isn't more helpful than the already accepted answer :)
    – IEatBagels
    Aug 15 at 19:44






  • 3




    Oh forgive us - it was difficult to see the implicit change. Hopefully my edit makes it easier for others to see
    – Sᴀᴍ Onᴇᴌᴀ
    Aug 15 at 19:47


















up vote
0
down vote













just a note :
you can enhance just a tidbit by defining the name in the list comprehension



from



data = [random.randrange(m) for i in range(n)]


to



data = [random.randrange(m) for num in range(n)]





share|improve this answer




















    Your Answer




    StackExchange.ifUsing("editor", function ()
    return StackExchange.using("mathjaxEditing", function ()
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
    );
    );
    , "mathjax-editing");

    StackExchange.ifUsing("editor", function ()
    StackExchange.using("externalEditor", function ()
    StackExchange.using("snippets", function ()
    StackExchange.snippets.init();
    );
    );
    , "code-snippets");

    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "196"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    convertImagesToLinks: false,
    noModals: false,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );













     

    draft saved


    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f201727%2fmedian-calculation-of-list-of-integers-without-using-heap%23new-answer', 'question_page');

    );

    Post as a guest






























    4 Answers
    4






    active

    oldest

    votes








    4 Answers
    4






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    5
    down vote



    accepted










    Performance



    I instrumented your code differently:



    def main():
    m, n = 1000, 15000000
    start_time = time.time()
    data = [random.randrange(m) for i in range(n)]
    print("--- %s seconds to generate" % (time.time() - start_time))
    # print("Random Data: ", data)
    start_time = time.time()
    result = SortedList(data)
    print("--- %s seconds to sort" % (time.time() - start_time))
    start_time = time.time()
    result = list(Median(data))
    print("Result: ", result)
    print("--- %s seconds ---" % (time.time() - start_time))


    For me, the results were:



    --- 7.407598257064819 seconds to generate
    --- 4.535749673843384 seconds to sort
    Result: [500]
    --- 5.01109504699707 seconds ---


    This shows that the generation of the random input takes longer than finding the median, and that 90% of Median() is spent sorting (probably most of the rest is caused by converting between lists and iterators). You're unlikely to get large gains through modifications to your own part of the code.



    I got better results by using Python's built-in sorted(). We don't need any of the extra functionality of SortedList (maintaining the invariant through adds and deletes), and a single sort (mostly in native code) gives these results:



    def median(values):
    sortedlist = sorted(values)
    length = len(sortedlist)
    half = length // 2
    # return not yield; see "General review"
    if length % 2 == 0:
    return (sortedlist[half] + sortedlist[half - 1]) // 2
    else:
    return sortedlist[half]

    def main():
    m, n = 1000, 15000000
    start_time = time.time()
    data = [random.randrange(m) for i in range(n)]
    print("--- %s seconds to generate" % (time.time() - start_time))
    # print("Random Data: ", data)
    start_time = time.time()
    result = sorted(data)
    print("--- %s seconds to sort" % (time.time() - start_time))
    start_time = time.time()
    result = median(data)
    print("Result: ", result)
    print("--- %s seconds ---" % (time.time() - start_time))


    --- 7.638948202133179 seconds to generate
    --- 3.118924617767334 seconds to sort
    Result: 500
    --- 3.3397886753082275 seconds ---



    General review



    I don't see why you yield a single value, rather than simply returning. Is there a need for Median() to be iterable?



    Given that length is an integer, if length % 2 is not 0, it must be 1 - so the elif could easily be else.






    share|improve this answer






















    • You are right on the General Review, yield is only necessary for the running median, so i changed to return and the result to result = Median(data). Else was used instead of elif. Is there any algorithm that generates the list on an improved way, faster?
      – danieltakeshi
      Aug 15 at 13:52










    • Yes, use Python's built-in sort for a one-time sorting of a list - see edit.
      – Toby Speight
      Aug 15 at 14:20














    up vote
    5
    down vote



    accepted










    Performance



    I instrumented your code differently:



    def main():
    m, n = 1000, 15000000
    start_time = time.time()
    data = [random.randrange(m) for i in range(n)]
    print("--- %s seconds to generate" % (time.time() - start_time))
    # print("Random Data: ", data)
    start_time = time.time()
    result = SortedList(data)
    print("--- %s seconds to sort" % (time.time() - start_time))
    start_time = time.time()
    result = list(Median(data))
    print("Result: ", result)
    print("--- %s seconds ---" % (time.time() - start_time))


    For me, the results were:



    --- 7.407598257064819 seconds to generate
    --- 4.535749673843384 seconds to sort
    Result: [500]
    --- 5.01109504699707 seconds ---


    This shows that the generation of the random input takes longer than finding the median, and that 90% of Median() is spent sorting (probably most of the rest is caused by converting between lists and iterators). You're unlikely to get large gains through modifications to your own part of the code.



    I got better results by using Python's built-in sorted(). We don't need any of the extra functionality of SortedList (maintaining the invariant through adds and deletes), and a single sort (mostly in native code) gives these results:



    def median(values):
    sortedlist = sorted(values)
    length = len(sortedlist)
    half = length // 2
    # return not yield; see "General review"
    if length % 2 == 0:
    return (sortedlist[half] + sortedlist[half - 1]) // 2
    else:
    return sortedlist[half]

    def main():
    m, n = 1000, 15000000
    start_time = time.time()
    data = [random.randrange(m) for i in range(n)]
    print("--- %s seconds to generate" % (time.time() - start_time))
    # print("Random Data: ", data)
    start_time = time.time()
    result = sorted(data)
    print("--- %s seconds to sort" % (time.time() - start_time))
    start_time = time.time()
    result = median(data)
    print("Result: ", result)
    print("--- %s seconds ---" % (time.time() - start_time))


    --- 7.638948202133179 seconds to generate
    --- 3.118924617767334 seconds to sort
    Result: 500
    --- 3.3397886753082275 seconds ---



    General review



    I don't see why you yield a single value, rather than simply returning. Is there a need for Median() to be iterable?



    Given that length is an integer, if length % 2 is not 0, it must be 1 - so the elif could easily be else.






    share|improve this answer






















    • You are right on the General Review, yield is only necessary for the running median, so i changed to return and the result to result = Median(data). Else was used instead of elif. Is there any algorithm that generates the list on an improved way, faster?
      – danieltakeshi
      Aug 15 at 13:52










    • Yes, use Python's built-in sort for a one-time sorting of a list - see edit.
      – Toby Speight
      Aug 15 at 14:20












    up vote
    5
    down vote



    accepted







    up vote
    5
    down vote



    accepted






    Performance



    I instrumented your code differently:



    def main():
    m, n = 1000, 15000000
    start_time = time.time()
    data = [random.randrange(m) for i in range(n)]
    print("--- %s seconds to generate" % (time.time() - start_time))
    # print("Random Data: ", data)
    start_time = time.time()
    result = SortedList(data)
    print("--- %s seconds to sort" % (time.time() - start_time))
    start_time = time.time()
    result = list(Median(data))
    print("Result: ", result)
    print("--- %s seconds ---" % (time.time() - start_time))


    For me, the results were:



    --- 7.407598257064819 seconds to generate
    --- 4.535749673843384 seconds to sort
    Result: [500]
    --- 5.01109504699707 seconds ---


    This shows that the generation of the random input takes longer than finding the median, and that 90% of Median() is spent sorting (probably most of the rest is caused by converting between lists and iterators). You're unlikely to get large gains through modifications to your own part of the code.



    I got better results by using Python's built-in sorted(). We don't need any of the extra functionality of SortedList (maintaining the invariant through adds and deletes), and a single sort (mostly in native code) gives these results:



    def median(values):
    sortedlist = sorted(values)
    length = len(sortedlist)
    half = length // 2
    # return not yield; see "General review"
    if length % 2 == 0:
    return (sortedlist[half] + sortedlist[half - 1]) // 2
    else:
    return sortedlist[half]

    def main():
    m, n = 1000, 15000000
    start_time = time.time()
    data = [random.randrange(m) for i in range(n)]
    print("--- %s seconds to generate" % (time.time() - start_time))
    # print("Random Data: ", data)
    start_time = time.time()
    result = sorted(data)
    print("--- %s seconds to sort" % (time.time() - start_time))
    start_time = time.time()
    result = median(data)
    print("Result: ", result)
    print("--- %s seconds ---" % (time.time() - start_time))


    --- 7.638948202133179 seconds to generate
    --- 3.118924617767334 seconds to sort
    Result: 500
    --- 3.3397886753082275 seconds ---



    General review



    I don't see why you yield a single value, rather than simply returning. Is there a need for Median() to be iterable?



    Given that length is an integer, if length % 2 is not 0, it must be 1 - so the elif could easily be else.






    share|improve this answer














    Performance



    I instrumented your code differently:



    def main():
    m, n = 1000, 15000000
    start_time = time.time()
    data = [random.randrange(m) for i in range(n)]
    print("--- %s seconds to generate" % (time.time() - start_time))
    # print("Random Data: ", data)
    start_time = time.time()
    result = SortedList(data)
    print("--- %s seconds to sort" % (time.time() - start_time))
    start_time = time.time()
    result = list(Median(data))
    print("Result: ", result)
    print("--- %s seconds ---" % (time.time() - start_time))


    For me, the results were:



    --- 7.407598257064819 seconds to generate
    --- 4.535749673843384 seconds to sort
    Result: [500]
    --- 5.01109504699707 seconds ---


    This shows that the generation of the random input takes longer than finding the median, and that 90% of Median() is spent sorting (probably most of the rest is caused by converting between lists and iterators). You're unlikely to get large gains through modifications to your own part of the code.



    I got better results by using Python's built-in sorted(). We don't need any of the extra functionality of SortedList (maintaining the invariant through adds and deletes), and a single sort (mostly in native code) gives these results:



    def median(values):
    sortedlist = sorted(values)
    length = len(sortedlist)
    half = length // 2
    # return not yield; see "General review"
    if length % 2 == 0:
    return (sortedlist[half] + sortedlist[half - 1]) // 2
    else:
    return sortedlist[half]

    def main():
    m, n = 1000, 15000000
    start_time = time.time()
    data = [random.randrange(m) for i in range(n)]
    print("--- %s seconds to generate" % (time.time() - start_time))
    # print("Random Data: ", data)
    start_time = time.time()
    result = sorted(data)
    print("--- %s seconds to sort" % (time.time() - start_time))
    start_time = time.time()
    result = median(data)
    print("Result: ", result)
    print("--- %s seconds ---" % (time.time() - start_time))


    --- 7.638948202133179 seconds to generate
    --- 3.118924617767334 seconds to sort
    Result: 500
    --- 3.3397886753082275 seconds ---



    General review



    I don't see why you yield a single value, rather than simply returning. Is there a need for Median() to be iterable?



    Given that length is an integer, if length % 2 is not 0, it must be 1 - so the elif could easily be else.







    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Aug 15 at 14:19

























    answered Aug 15 at 13:39









    Toby Speight

    19k43697




    19k43697











    • You are right on the General Review, yield is only necessary for the running median, so i changed to return and the result to result = Median(data). Else was used instead of elif. Is there any algorithm that generates the list on an improved way, faster?
      – danieltakeshi
      Aug 15 at 13:52










    • Yes, use Python's built-in sort for a one-time sorting of a list - see edit.
      – Toby Speight
      Aug 15 at 14:20
















    • You are right on the General Review, yield is only necessary for the running median, so i changed to return and the result to result = Median(data). Else was used instead of elif. Is there any algorithm that generates the list on an improved way, faster?
      – danieltakeshi
      Aug 15 at 13:52










    • Yes, use Python's built-in sort for a one-time sorting of a list - see edit.
      – Toby Speight
      Aug 15 at 14:20















    You are right on the General Review, yield is only necessary for the running median, so i changed to return and the result to result = Median(data). Else was used instead of elif. Is there any algorithm that generates the list on an improved way, faster?
    – danieltakeshi
    Aug 15 at 13:52




    You are right on the General Review, yield is only necessary for the running median, so i changed to return and the result to result = Median(data). Else was used instead of elif. Is there any algorithm that generates the list on an improved way, faster?
    – danieltakeshi
    Aug 15 at 13:52












    Yes, use Python's built-in sort for a one-time sorting of a list - see edit.
    – Toby Speight
    Aug 15 at 14:20




    Yes, use Python's built-in sort for a one-time sorting of a list - see edit.
    – Toby Speight
    Aug 15 at 14:20












    up vote
    3
    down vote













    If you want to time something, you should time just that. When I ran your original code, I got 6.023478031158447 seconds. When I instead did



    start_time = time.time()
    result = list(Median(data))
    end_time = time.time()
    print("Result: ", result, end_time-start_time)


    I got 1.8368241786956787. Note that calling time.time() before the print statement means that the execution of the print statement isn't included in your calculation of elapsed time.



    I also tried



    import statistics
    start_time = time.time()
    result = statistics.median(data)
    end_time = time.time()
    print("Result: ", result, end_time-start_time)


    and got 0.7475762367248535.



    What do mean by "not using heap", and why do you want that? What makes you think SortedList doesn't use a heap? If you're going to use pre-made functions, why not just use the pre-made median function?






    share|improve this answer




















    • Just taking some challenges to learn Python. It was without heap, because there are many answers and duplicates. Thanks for the tip for measuring the time.
      – danieltakeshi
      Aug 15 at 19:56














    up vote
    3
    down vote













    If you want to time something, you should time just that. When I ran your original code, I got 6.023478031158447 seconds. When I instead did



    start_time = time.time()
    result = list(Median(data))
    end_time = time.time()
    print("Result: ", result, end_time-start_time)


    I got 1.8368241786956787. Note that calling time.time() before the print statement means that the execution of the print statement isn't included in your calculation of elapsed time.



    I also tried



    import statistics
    start_time = time.time()
    result = statistics.median(data)
    end_time = time.time()
    print("Result: ", result, end_time-start_time)


    and got 0.7475762367248535.



    What do mean by "not using heap", and why do you want that? What makes you think SortedList doesn't use a heap? If you're going to use pre-made functions, why not just use the pre-made median function?






    share|improve this answer




















    • Just taking some challenges to learn Python. It was without heap, because there are many answers and duplicates. Thanks for the tip for measuring the time.
      – danieltakeshi
      Aug 15 at 19:56












    up vote
    3
    down vote










    up vote
    3
    down vote









    If you want to time something, you should time just that. When I ran your original code, I got 6.023478031158447 seconds. When I instead did



    start_time = time.time()
    result = list(Median(data))
    end_time = time.time()
    print("Result: ", result, end_time-start_time)


    I got 1.8368241786956787. Note that calling time.time() before the print statement means that the execution of the print statement isn't included in your calculation of elapsed time.



    I also tried



    import statistics
    start_time = time.time()
    result = statistics.median(data)
    end_time = time.time()
    print("Result: ", result, end_time-start_time)


    and got 0.7475762367248535.



    What do mean by "not using heap", and why do you want that? What makes you think SortedList doesn't use a heap? If you're going to use pre-made functions, why not just use the pre-made median function?






    share|improve this answer












    If you want to time something, you should time just that. When I ran your original code, I got 6.023478031158447 seconds. When I instead did



    start_time = time.time()
    result = list(Median(data))
    end_time = time.time()
    print("Result: ", result, end_time-start_time)


    I got 1.8368241786956787. Note that calling time.time() before the print statement means that the execution of the print statement isn't included in your calculation of elapsed time.



    I also tried



    import statistics
    start_time = time.time()
    result = statistics.median(data)
    end_time = time.time()
    print("Result: ", result, end_time-start_time)


    and got 0.7475762367248535.



    What do mean by "not using heap", and why do you want that? What makes you think SortedList doesn't use a heap? If you're going to use pre-made functions, why not just use the pre-made median function?







    share|improve this answer












    share|improve this answer



    share|improve this answer










    answered Aug 15 at 19:47









    Acccumulation

    1,01915




    1,01915











    • Just taking some challenges to learn Python. It was without heap, because there are many answers and duplicates. Thanks for the tip for measuring the time.
      – danieltakeshi
      Aug 15 at 19:56
















    • Just taking some challenges to learn Python. It was without heap, because there are many answers and duplicates. Thanks for the tip for measuring the time.
      – danieltakeshi
      Aug 15 at 19:56















    Just taking some challenges to learn Python. It was without heap, because there are many answers and duplicates. Thanks for the tip for measuring the time.
    – danieltakeshi
    Aug 15 at 19:56




    Just taking some challenges to learn Python. It was without heap, because there are many answers and duplicates. Thanks for the tip for measuring the time.
    – danieltakeshi
    Aug 15 at 19:56










    up vote
    2
    down vote













    Thanks to Toby now the program takes:



    --- 7.169409990310669 seconds to finish the program--- instead of --- 186.7168517112732 seconds--- for 100 milion numbers.



    One other improvement I made was to add an extra optimization on the random generating algorithm, using numpy. So numpy.sort() was used to sort the numpy array at the start of median(), because it is faster for numpy arrays.



    Code



    import time
    import numpy

    start_time0 = time.time()


    def median(values):
    sortedlist = numpy.sort(values)
    length = len(sortedlist)
    half = length // 2

    if length % 2 == 0:
    return (sortedlist[half] + sortedlist[half - 1]) // 2
    else:
    return sortedlist[half]


    def main():

    m, n = 1000, 100000000
    # Random Number Generator
    start_time = time.time()
    data = numpy.random.randint(1, m, n)
    print("--- %s seconds to numpy random---" % (time.time() - start_time))
    # Median
    start_time = time.time()
    result = median(data)
    print("Result: ", result)
    print("--- %s seconds to find the Median---" % (time.time() - start_time))


    if __name__ == "__main__":

    main()
    print("--- %s seconds to finish the program---" % (time.time() - start_time0))





    share|improve this answer






















    • In this case, you might want to consider posting another question if you want another review? As it is, your newly refactored code isn't more helpful than the already accepted answer :)
      – IEatBagels
      Aug 15 at 19:44






    • 3




      Oh forgive us - it was difficult to see the implicit change. Hopefully my edit makes it easier for others to see
      – Sᴀᴍ Onᴇᴌᴀ
      Aug 15 at 19:47















    up vote
    2
    down vote













    Thanks to Toby now the program takes:



    --- 7.169409990310669 seconds to finish the program--- instead of --- 186.7168517112732 seconds--- for 100 milion numbers.



    One other improvement I made was to add an extra optimization on the random generating algorithm, using numpy. So numpy.sort() was used to sort the numpy array at the start of median(), because it is faster for numpy arrays.



    Code



    import time
    import numpy

    start_time0 = time.time()


    def median(values):
    sortedlist = numpy.sort(values)
    length = len(sortedlist)
    half = length // 2

    if length % 2 == 0:
    return (sortedlist[half] + sortedlist[half - 1]) // 2
    else:
    return sortedlist[half]


    def main():

    m, n = 1000, 100000000
    # Random Number Generator
    start_time = time.time()
    data = numpy.random.randint(1, m, n)
    print("--- %s seconds to numpy random---" % (time.time() - start_time))
    # Median
    start_time = time.time()
    result = median(data)
    print("Result: ", result)
    print("--- %s seconds to find the Median---" % (time.time() - start_time))


    if __name__ == "__main__":

    main()
    print("--- %s seconds to finish the program---" % (time.time() - start_time0))





    share|improve this answer






















    • In this case, you might want to consider posting another question if you want another review? As it is, your newly refactored code isn't more helpful than the already accepted answer :)
      – IEatBagels
      Aug 15 at 19:44






    • 3




      Oh forgive us - it was difficult to see the implicit change. Hopefully my edit makes it easier for others to see
      – Sᴀᴍ Onᴇᴌᴀ
      Aug 15 at 19:47













    up vote
    2
    down vote










    up vote
    2
    down vote









    Thanks to Toby now the program takes:



    --- 7.169409990310669 seconds to finish the program--- instead of --- 186.7168517112732 seconds--- for 100 milion numbers.



    One other improvement I made was to add an extra optimization on the random generating algorithm, using numpy. So numpy.sort() was used to sort the numpy array at the start of median(), because it is faster for numpy arrays.



    Code



    import time
    import numpy

    start_time0 = time.time()


    def median(values):
    sortedlist = numpy.sort(values)
    length = len(sortedlist)
    half = length // 2

    if length % 2 == 0:
    return (sortedlist[half] + sortedlist[half - 1]) // 2
    else:
    return sortedlist[half]


    def main():

    m, n = 1000, 100000000
    # Random Number Generator
    start_time = time.time()
    data = numpy.random.randint(1, m, n)
    print("--- %s seconds to numpy random---" % (time.time() - start_time))
    # Median
    start_time = time.time()
    result = median(data)
    print("Result: ", result)
    print("--- %s seconds to find the Median---" % (time.time() - start_time))


    if __name__ == "__main__":

    main()
    print("--- %s seconds to finish the program---" % (time.time() - start_time0))





    share|improve this answer














    Thanks to Toby now the program takes:



    --- 7.169409990310669 seconds to finish the program--- instead of --- 186.7168517112732 seconds--- for 100 milion numbers.



    One other improvement I made was to add an extra optimization on the random generating algorithm, using numpy. So numpy.sort() was used to sort the numpy array at the start of median(), because it is faster for numpy arrays.



    Code



    import time
    import numpy

    start_time0 = time.time()


    def median(values):
    sortedlist = numpy.sort(values)
    length = len(sortedlist)
    half = length // 2

    if length % 2 == 0:
    return (sortedlist[half] + sortedlist[half - 1]) // 2
    else:
    return sortedlist[half]


    def main():

    m, n = 1000, 100000000
    # Random Number Generator
    start_time = time.time()
    data = numpy.random.randint(1, m, n)
    print("--- %s seconds to numpy random---" % (time.time() - start_time))
    # Median
    start_time = time.time()
    result = median(data)
    print("Result: ", result)
    print("--- %s seconds to find the Median---" % (time.time() - start_time))


    if __name__ == "__main__":

    main()
    print("--- %s seconds to finish the program---" % (time.time() - start_time0))






    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Aug 15 at 19:50









    Sᴀᴍ Onᴇᴌᴀ

    6,86361645




    6,86361645










    answered Aug 15 at 17:34









    danieltakeshi

    1585




    1585











    • In this case, you might want to consider posting another question if you want another review? As it is, your newly refactored code isn't more helpful than the already accepted answer :)
      – IEatBagels
      Aug 15 at 19:44






    • 3




      Oh forgive us - it was difficult to see the implicit change. Hopefully my edit makes it easier for others to see
      – Sᴀᴍ Onᴇᴌᴀ
      Aug 15 at 19:47

















    • In this case, you might want to consider posting another question if you want another review? As it is, your newly refactored code isn't more helpful than the already accepted answer :)
      – IEatBagels
      Aug 15 at 19:44






    • 3




      Oh forgive us - it was difficult to see the implicit change. Hopefully my edit makes it easier for others to see
      – Sᴀᴍ Onᴇᴌᴀ
      Aug 15 at 19:47
















    In this case, you might want to consider posting another question if you want another review? As it is, your newly refactored code isn't more helpful than the already accepted answer :)
    – IEatBagels
    Aug 15 at 19:44




    In this case, you might want to consider posting another question if you want another review? As it is, your newly refactored code isn't more helpful than the already accepted answer :)
    – IEatBagels
    Aug 15 at 19:44




    3




    3




    Oh forgive us - it was difficult to see the implicit change. Hopefully my edit makes it easier for others to see
    – Sᴀᴍ Onᴇᴌᴀ
    Aug 15 at 19:47





    Oh forgive us - it was difficult to see the implicit change. Hopefully my edit makes it easier for others to see
    – Sᴀᴍ Onᴇᴌᴀ
    Aug 15 at 19:47











    up vote
    0
    down vote













    just a note :
    you can enhance just a tidbit by defining the name in the list comprehension



    from



    data = [random.randrange(m) for i in range(n)]


    to



    data = [random.randrange(m) for num in range(n)]





    share|improve this answer
























      up vote
      0
      down vote













      just a note :
      you can enhance just a tidbit by defining the name in the list comprehension



      from



      data = [random.randrange(m) for i in range(n)]


      to



      data = [random.randrange(m) for num in range(n)]





      share|improve this answer






















        up vote
        0
        down vote










        up vote
        0
        down vote









        just a note :
        you can enhance just a tidbit by defining the name in the list comprehension



        from



        data = [random.randrange(m) for i in range(n)]


        to



        data = [random.randrange(m) for num in range(n)]





        share|improve this answer












        just a note :
        you can enhance just a tidbit by defining the name in the list comprehension



        from



        data = [random.randrange(m) for i in range(n)]


        to



        data = [random.randrange(m) for num in range(n)]






        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Aug 16 at 19:51









        Abdur-Rahmaan Janhangeer

        21811




        21811



























             

            draft saved


            draft discarded















































             


            draft saved


            draft discarded














            StackExchange.ready(
            function ()
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f201727%2fmedian-calculation-of-list-of-integers-without-using-heap%23new-answer', 'question_page');

            );

            Post as a guest













































































            Popular posts from this blog

            How to check contact read email or not when send email to Individual?

            Displaying single band from multi-band raster using QGIS

            How many registers does an x86_64 CPU actually have?