Coding gotchas: The python needle in a haystack problem.

Hey everybody! I will start mixing in slightly different posts from now on that deal with my other passion: Programming and optimization.

For this post I am going to look at a very common python oversight that causes 8000 times slower execution: Using the in operator to check whether an element exists within a list or not. Let's consider the following simple test case of looking whether we have a particular needle in our haystack:

def list_in(iterations):
    mylist = []
    for i in range(iterations):
        mylist.append(i)

    for i in range(iterations):
        if i in mylist: # O(n)
            continue
        else:
            return False
    return True


def set_in(iterations):
    myset = set()
    for i in range(iterations):
        myset.add(i)

    for i in range(iterations):
        if i in myset: # O(1)
            continue
        else:
            return False
    return True

Both of the functions do exactly the same thing, but with different datastructures: The list version uses a linked list to check whether a number is contained within a list, while the set version uses, well, surprise a set. The two datastructures obviously have different use cases and functionality, but is very important to note the time complexity of each them. The in operator takes O(1) time in the average case, whereas for the linked list it is O(n), which is a huge difference! To illustrate it, I wrote a small piece of code that tests the two functions with 200, 20000 and 200000 elements:

import gc
from timeit import default_timer
if __name__ == '__main__':
    gc.disable() #Disable garbage collection so that results are not influenced

    start_list_200 = default_timer()
    list_in(200)
    end_list_200 = default_timer()
    print("List with 200 elements took " + "{0:.6f}".format(end_list_200 - start_list_200) + " seconds.")

    start_set_200 = default_timer()
    set_in(200)
    end_set_200 = default_timer()
    print(" Set with 200 elements took " + "{0:.6f}".format(end_set_200 - start_set_200) + " seconds.")

    start_list_20k = default_timer()
    list_in(20000)
    end_list_20k = default_timer()
    print("List with 20000 elements took " + "{0:.6f}".format(end_list_20k - start_list_20k) + " seconds.")

    start_set_20k = default_timer()
    set_in(20000)
    end_set_20k = default_timer()
    print(" Set with 20000 elements took " + "{0:.6f}".format(end_set_20k - start_set_20k) + " seconds.")

    start_list_200k = default_timer()
    list_in(200000)
    end_list_200k = default_timer()
    print("List with 200000 elements took " + "{0:.6f}".format(end_list_200k - start_list_200k) + " seconds.")

    start_set_200k = default_timer()
    set_in(200000)
    end_set_200k = default_timer()
    print(" Set with 200000 elements took " + "{0:.6f}".format(end_set_200k - start_set_200k) + " seconds.")

The results are not surprising: The set version is about 8000 times faster when we increase the number of elements in the list to 20000.

% python python_gotcha.py 
List with 200 elements took 0.000243 seconds.
 Set with 200 elements took 0.000018 seconds.
List with 20000 elements took 2.171486 seconds.
 Set with 20000 elements took 0.003255 seconds.
List with 200000 elements took 242.570072 seconds.
 Set with 200000 elements took 0.031962 seconds.

You may think that this is a synthetic example and does not occur in practice, but stackoverflow is full questions like this where lists are used incorrectly in place of sets and users struggle to find why their code performs poorly.

The reason for the huge performance difference is that set in python is implemented similarly to a dictionary, that is to say a hash table with constant time lookup. With lists on the other hand, every single element needs to be traversed until it can be concluded whether an element is found within the list or not, which gets really slow with very long lists.

Sets make the needle in our haystack stand out and easy to locate.

I would argue that python should not allow the in syntactic sugar for lists, because it can be very frequently misused. List already implements the index method, which suggests it is a bad idea to traverse a list in order to see if something is in it.

tl;dr
If you have a list in python and you want to see if certain element is contained in the list, it's usually better to use a set OR convert your list to a set. The cost of conversion is tiny compared to the cost of traversing a list thousands of times.

That's it from me, expect more programming gotchas in the future.

Nick

Image sources: pixabay pixabay
code

Coding gotchas: The python needle in a haystack problem.

Nikolay Bogoychev

Nikolay Bogoychev

Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting

OpusCleaner and OpusTrainer: Machine translation training made easy

Efficient machine translation

Coding gotchas: Of references and copies

Hanzi of the day! 風