Find the longest word in set D that is a subsequence of string S

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP












6












$begingroup$



Given a string S and a set of words D, find the longest word in D that is a subsequence of S.



Word W is a subsequence of S if some number of characters, possibly zero, can be deleted from S to form W, without reordering the remaining characters.



Note: D can appear in any format (list, hash table, prefix tree, etc.)



For example, given the input of S = "abppplee" and D = "able", "ale", "apple", "bale", "kangaroo" the correct output would be "apple".



  • The words "able" and "ale" are both subsequences of S, but they are shorter than "apple".

  • The word "bale" is not a subsequence of S because even though S has all the right letters, they are not in the right order.

  • The word "kangaroo" is the longest word in D, but it isn't a subsequence of S.



I am currently trying to learn how to become a better programmer. I am taking the Google Tech Dev Guide recommended path and this is the first problem. I came up with my own solution, and read Google's first recommended answer is using brute force, and a slight optimization is to at least ensure the dictionary is represented by a hash table or prefix tree so that lookups are efficient.



This is my first time using a dictionary, and I also don't feel confident about using objects and classes. If you could review and recommend better ways to improve my code, maybe by using more (or better) objects, I would appreciate it.



class Main:
def create_dictionary(string):
dictionary =
index = 0

for letter in string:
if letter in dictionary:
dictionary[letter].append(index)
else:
dictionary[letter] = [index]
index += 1
return(dictionary)

def get_word_is_substring(word, dictionary):
index_of_last_letter_found = None
for letter in word:
if letter in dictionary and (index_of_last_letter_found is None or dictionary[letter][-1] > index_of_last_letter_found):
index = 0
while index < len(dictionary[letter]):
if index_of_last_letter_found is None or index_of_last_letter_found < dictionary[letter][index]:
index_of_last_letter_found = dictionary[letter][index]
break
else:
index += 1

else:
return False
return True

def replace_word_if_necessary(word, longest_word):
if (longest_word is None) or (len(word) > len(longest_word)):
longest_word = word
return longest_word

def get_longest_word(s, d):
dictionary = Main.create_dictionary(s)
longest_word = None

for word in d:
word_is_substring = Main.get_word_is_substring(word, dictionary)
if word_is_substring:
longest_word = Main.replace_word_if_necessary(word, longest_word)
print(longest_word)

Main.get_longest_word("abppplee", "ale", "bale", "able", "apple", "kangaroo")









share|improve this question











$endgroup$











  • $begingroup$
    For the person who raised the close vote, why? If you flag a question as "unclear what you're asking", consider asking the OP for explanations on what you don't undestand. This post seems very well written, especially for a first post.
    $endgroup$
    – IEatBagels
    Feb 27 at 20:20






  • 2




    $begingroup$
    @IEatBagels - yes, OP was asked to clarify, and did so: look at the edit history to see what changed. I retracted my close vote after the edit, but it looks like the other person didn't revisit. No worry; it will expire, in time.
    $endgroup$
    – Toby Speight
    Feb 27 at 22:03






  • 1




    $begingroup$
    Food for thought if this wasn't a test question: it may be worth prioritizing the order you check strings: if ale isn't in the string, then neither is able, apple and bale, and it's pointless to check them. In a small sequence this is not really beneficial, but if you're checking large lists building a dependency tree can very quickly pay dividends... I think they hint at this when they include it may be supplied as a prefix tree
    $endgroup$
    – TemporalWolf
    Feb 27 at 23:49
















6












$begingroup$



Given a string S and a set of words D, find the longest word in D that is a subsequence of S.



Word W is a subsequence of S if some number of characters, possibly zero, can be deleted from S to form W, without reordering the remaining characters.



Note: D can appear in any format (list, hash table, prefix tree, etc.)



For example, given the input of S = "abppplee" and D = "able", "ale", "apple", "bale", "kangaroo" the correct output would be "apple".



  • The words "able" and "ale" are both subsequences of S, but they are shorter than "apple".

  • The word "bale" is not a subsequence of S because even though S has all the right letters, they are not in the right order.

  • The word "kangaroo" is the longest word in D, but it isn't a subsequence of S.



I am currently trying to learn how to become a better programmer. I am taking the Google Tech Dev Guide recommended path and this is the first problem. I came up with my own solution, and read Google's first recommended answer is using brute force, and a slight optimization is to at least ensure the dictionary is represented by a hash table or prefix tree so that lookups are efficient.



This is my first time using a dictionary, and I also don't feel confident about using objects and classes. If you could review and recommend better ways to improve my code, maybe by using more (or better) objects, I would appreciate it.



class Main:
def create_dictionary(string):
dictionary =
index = 0

for letter in string:
if letter in dictionary:
dictionary[letter].append(index)
else:
dictionary[letter] = [index]
index += 1
return(dictionary)

def get_word_is_substring(word, dictionary):
index_of_last_letter_found = None
for letter in word:
if letter in dictionary and (index_of_last_letter_found is None or dictionary[letter][-1] > index_of_last_letter_found):
index = 0
while index < len(dictionary[letter]):
if index_of_last_letter_found is None or index_of_last_letter_found < dictionary[letter][index]:
index_of_last_letter_found = dictionary[letter][index]
break
else:
index += 1

else:
return False
return True

def replace_word_if_necessary(word, longest_word):
if (longest_word is None) or (len(word) > len(longest_word)):
longest_word = word
return longest_word

def get_longest_word(s, d):
dictionary = Main.create_dictionary(s)
longest_word = None

for word in d:
word_is_substring = Main.get_word_is_substring(word, dictionary)
if word_is_substring:
longest_word = Main.replace_word_if_necessary(word, longest_word)
print(longest_word)

Main.get_longest_word("abppplee", "ale", "bale", "able", "apple", "kangaroo")









share|improve this question











$endgroup$











  • $begingroup$
    For the person who raised the close vote, why? If you flag a question as "unclear what you're asking", consider asking the OP for explanations on what you don't undestand. This post seems very well written, especially for a first post.
    $endgroup$
    – IEatBagels
    Feb 27 at 20:20






  • 2




    $begingroup$
    @IEatBagels - yes, OP was asked to clarify, and did so: look at the edit history to see what changed. I retracted my close vote after the edit, but it looks like the other person didn't revisit. No worry; it will expire, in time.
    $endgroup$
    – Toby Speight
    Feb 27 at 22:03






  • 1




    $begingroup$
    Food for thought if this wasn't a test question: it may be worth prioritizing the order you check strings: if ale isn't in the string, then neither is able, apple and bale, and it's pointless to check them. In a small sequence this is not really beneficial, but if you're checking large lists building a dependency tree can very quickly pay dividends... I think they hint at this when they include it may be supplied as a prefix tree
    $endgroup$
    – TemporalWolf
    Feb 27 at 23:49














6












6








6





$begingroup$



Given a string S and a set of words D, find the longest word in D that is a subsequence of S.



Word W is a subsequence of S if some number of characters, possibly zero, can be deleted from S to form W, without reordering the remaining characters.



Note: D can appear in any format (list, hash table, prefix tree, etc.)



For example, given the input of S = "abppplee" and D = "able", "ale", "apple", "bale", "kangaroo" the correct output would be "apple".



  • The words "able" and "ale" are both subsequences of S, but they are shorter than "apple".

  • The word "bale" is not a subsequence of S because even though S has all the right letters, they are not in the right order.

  • The word "kangaroo" is the longest word in D, but it isn't a subsequence of S.



I am currently trying to learn how to become a better programmer. I am taking the Google Tech Dev Guide recommended path and this is the first problem. I came up with my own solution, and read Google's first recommended answer is using brute force, and a slight optimization is to at least ensure the dictionary is represented by a hash table or prefix tree so that lookups are efficient.



This is my first time using a dictionary, and I also don't feel confident about using objects and classes. If you could review and recommend better ways to improve my code, maybe by using more (or better) objects, I would appreciate it.



class Main:
def create_dictionary(string):
dictionary =
index = 0

for letter in string:
if letter in dictionary:
dictionary[letter].append(index)
else:
dictionary[letter] = [index]
index += 1
return(dictionary)

def get_word_is_substring(word, dictionary):
index_of_last_letter_found = None
for letter in word:
if letter in dictionary and (index_of_last_letter_found is None or dictionary[letter][-1] > index_of_last_letter_found):
index = 0
while index < len(dictionary[letter]):
if index_of_last_letter_found is None or index_of_last_letter_found < dictionary[letter][index]:
index_of_last_letter_found = dictionary[letter][index]
break
else:
index += 1

else:
return False
return True

def replace_word_if_necessary(word, longest_word):
if (longest_word is None) or (len(word) > len(longest_word)):
longest_word = word
return longest_word

def get_longest_word(s, d):
dictionary = Main.create_dictionary(s)
longest_word = None

for word in d:
word_is_substring = Main.get_word_is_substring(word, dictionary)
if word_is_substring:
longest_word = Main.replace_word_if_necessary(word, longest_word)
print(longest_word)

Main.get_longest_word("abppplee", "ale", "bale", "able", "apple", "kangaroo")









share|improve this question











$endgroup$





Given a string S and a set of words D, find the longest word in D that is a subsequence of S.



Word W is a subsequence of S if some number of characters, possibly zero, can be deleted from S to form W, without reordering the remaining characters.



Note: D can appear in any format (list, hash table, prefix tree, etc.)



For example, given the input of S = "abppplee" and D = "able", "ale", "apple", "bale", "kangaroo" the correct output would be "apple".



  • The words "able" and "ale" are both subsequences of S, but they are shorter than "apple".

  • The word "bale" is not a subsequence of S because even though S has all the right letters, they are not in the right order.

  • The word "kangaroo" is the longest word in D, but it isn't a subsequence of S.



I am currently trying to learn how to become a better programmer. I am taking the Google Tech Dev Guide recommended path and this is the first problem. I came up with my own solution, and read Google's first recommended answer is using brute force, and a slight optimization is to at least ensure the dictionary is represented by a hash table or prefix tree so that lookups are efficient.



This is my first time using a dictionary, and I also don't feel confident about using objects and classes. If you could review and recommend better ways to improve my code, maybe by using more (or better) objects, I would appreciate it.



class Main:
def create_dictionary(string):
dictionary =
index = 0

for letter in string:
if letter in dictionary:
dictionary[letter].append(index)
else:
dictionary[letter] = [index]
index += 1
return(dictionary)

def get_word_is_substring(word, dictionary):
index_of_last_letter_found = None
for letter in word:
if letter in dictionary and (index_of_last_letter_found is None or dictionary[letter][-1] > index_of_last_letter_found):
index = 0
while index < len(dictionary[letter]):
if index_of_last_letter_found is None or index_of_last_letter_found < dictionary[letter][index]:
index_of_last_letter_found = dictionary[letter][index]
break
else:
index += 1

else:
return False
return True

def replace_word_if_necessary(word, longest_word):
if (longest_word is None) or (len(word) > len(longest_word)):
longest_word = word
return longest_word

def get_longest_word(s, d):
dictionary = Main.create_dictionary(s)
longest_word = None

for word in d:
word_is_substring = Main.get_word_is_substring(word, dictionary)
if word_is_substring:
longest_word = Main.replace_word_if_necessary(word, longest_word)
print(longest_word)

Main.get_longest_word("abppplee", "ale", "bale", "able", "apple", "kangaroo")






python python-3.x






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Feb 27 at 18:23







K. Kretz

















asked Feb 27 at 18:09









K. KretzK. Kretz

334




334











  • $begingroup$
    For the person who raised the close vote, why? If you flag a question as "unclear what you're asking", consider asking the OP for explanations on what you don't undestand. This post seems very well written, especially for a first post.
    $endgroup$
    – IEatBagels
    Feb 27 at 20:20






  • 2




    $begingroup$
    @IEatBagels - yes, OP was asked to clarify, and did so: look at the edit history to see what changed. I retracted my close vote after the edit, but it looks like the other person didn't revisit. No worry; it will expire, in time.
    $endgroup$
    – Toby Speight
    Feb 27 at 22:03






  • 1




    $begingroup$
    Food for thought if this wasn't a test question: it may be worth prioritizing the order you check strings: if ale isn't in the string, then neither is able, apple and bale, and it's pointless to check them. In a small sequence this is not really beneficial, but if you're checking large lists building a dependency tree can very quickly pay dividends... I think they hint at this when they include it may be supplied as a prefix tree
    $endgroup$
    – TemporalWolf
    Feb 27 at 23:49

















  • $begingroup$
    For the person who raised the close vote, why? If you flag a question as "unclear what you're asking", consider asking the OP for explanations on what you don't undestand. This post seems very well written, especially for a first post.
    $endgroup$
    – IEatBagels
    Feb 27 at 20:20






  • 2




    $begingroup$
    @IEatBagels - yes, OP was asked to clarify, and did so: look at the edit history to see what changed. I retracted my close vote after the edit, but it looks like the other person didn't revisit. No worry; it will expire, in time.
    $endgroup$
    – Toby Speight
    Feb 27 at 22:03






  • 1




    $begingroup$
    Food for thought if this wasn't a test question: it may be worth prioritizing the order you check strings: if ale isn't in the string, then neither is able, apple and bale, and it's pointless to check them. In a small sequence this is not really beneficial, but if you're checking large lists building a dependency tree can very quickly pay dividends... I think they hint at this when they include it may be supplied as a prefix tree
    $endgroup$
    – TemporalWolf
    Feb 27 at 23:49
















$begingroup$
For the person who raised the close vote, why? If you flag a question as "unclear what you're asking", consider asking the OP for explanations on what you don't undestand. This post seems very well written, especially for a first post.
$endgroup$
– IEatBagels
Feb 27 at 20:20




$begingroup$
For the person who raised the close vote, why? If you flag a question as "unclear what you're asking", consider asking the OP for explanations on what you don't undestand. This post seems very well written, especially for a first post.
$endgroup$
– IEatBagels
Feb 27 at 20:20




2




2




$begingroup$
@IEatBagels - yes, OP was asked to clarify, and did so: look at the edit history to see what changed. I retracted my close vote after the edit, but it looks like the other person didn't revisit. No worry; it will expire, in time.
$endgroup$
– Toby Speight
Feb 27 at 22:03




$begingroup$
@IEatBagels - yes, OP was asked to clarify, and did so: look at the edit history to see what changed. I retracted my close vote after the edit, but it looks like the other person didn't revisit. No worry; it will expire, in time.
$endgroup$
– Toby Speight
Feb 27 at 22:03




1




1




$begingroup$
Food for thought if this wasn't a test question: it may be worth prioritizing the order you check strings: if ale isn't in the string, then neither is able, apple and bale, and it's pointless to check them. In a small sequence this is not really beneficial, but if you're checking large lists building a dependency tree can very quickly pay dividends... I think they hint at this when they include it may be supplied as a prefix tree
$endgroup$
– TemporalWolf
Feb 27 at 23:49





$begingroup$
Food for thought if this wasn't a test question: it may be worth prioritizing the order you check strings: if ale isn't in the string, then neither is able, apple and bale, and it's pointless to check them. In a small sequence this is not really beneficial, but if you're checking large lists building a dependency tree can very quickly pay dividends... I think they hint at this when they include it may be supplied as a prefix tree
$endgroup$
– TemporalWolf
Feb 27 at 23:49











4 Answers
4






active

oldest

votes


















7












$begingroup$

First, the good news: your code looks pretty good. It's indented properly, spelled properly, spaced properly, capitalized properly, and uses the right Python style. Good work!



Here are a couple of nits:




  1. This is a "program." (As opposed to just a module, or package, or library.) So use the standard python idiom for programs at the bottom:



    if __name__ == '__main__':
    Main.get_longest_word("abppplee", "ale", "bale", "able", "apple", "kangaroo")


    This mechanism is to allow your code to be loaded (via import myprogram) without having the main entry point automatically get invoked. That lets you load it up in the interpreter and browse around before you call it.




  2. Use docstrings! Write down the problem you are trying to solve, and any notes about inputs or outputs for functions. Especially data formats.



    For coding puzzle type problems, the docstring is a great place to copy the puzzle specification. This lets you refer back to it, lets you remember what you were doing when you wrote this code, and lets you paste it up at CodeReview with ease!



    #!/usr/bin/env python3
    """ Given a string S and a set of words D, find the longest word in D that is a subsequence of S.

    Word W is a subsequence of S if some number of characters, possibly zero, can be deleted from S to form W, without reordering the remaining characters.

    Note: D can appear in any format (list, hash table, prefix tree, etc.)

    For example, given the input of S = "abppplee" and D = "able", "ale", "apple", "bale", "kangaroo" the correct output would be "apple".

    - The words "able" and "ale" are both subsequences of S, but they are shorter than "apple".
    - The word "bale" is not a subsequence of S because even though S has all the right letters, they are not in the right order.
    - The word "kangaroo" is the longest word in D, but it isn't a subsequence of S.
    """


Now here are some possible improvements:



In create_dictionary



def create_dictionary(string):
dictionary =
index = 0

for letter in string:
if letter in dictionary:
dictionary[letter].append(index)
else:
dictionary[letter] = [index]
index += 1
return(dictionary)


First, congratulations! You have written the hand-coded version of how to manage a dictionary whose values are lists. You got it right, your code works, and you should never do that again because it's boring. Instead, use collections.defaultdict(list). A defaultdict remembers a factory function that it will call whenever it is asked about a key but doesn't have a corresponding value.



The word list is not just the name of a Python data type, it's also the function to call when you want to construct one! (Just like you use the name of every class as it's constructor: my_obj = MyClass(1, 2, "hello")) So when you want a dictionary that looks up lists, it's much easier to assume that there will always be a list in the dictionary, but it might be empty. That's what defaultdict(list) will get you:



import collections # somewhere at top of file

def create_dictionary(s):
dictionary = collections.defaultdict(list)
index = 0

for letter in string:
dictionary[letter].append(index) # defaultdict FTW!
index += 1
return(dictionary)


Next, give up your love affair with integers! Ned Batchelder gave a nice talk on the subject, so here's a link to that: https://www.youtube.com/watch?v=EnSu9hHGq5o



The idea is that many (most?) python loops don't need to use integers to index over things. And when you do need an integer index (and in this case you do!) there are better ways than maintaining your own integer index variable. Here's one such way: the enumerate() built-in function.



With enumerate, you can write a loop that iterates over the values from some iterable along with an automatically-associated integer:



 # No index=0 here
for index, letter in enumerate(string):
dictionary[letter].append(index)
# No index+=1 here!


The index, string pair is called a tuple, and it is a built-in type just like list. The act of assigning or iterating using multiple target variables that take their values from a single tuple is called tuple unpacking. (Remember that phrase, you'll need it when you want to ask for help on the subject.)



In get_longest_word



The issue I have with this function is not one of Python, but rather of design. You are given some words, d. You want to find the longest word that is a subsequence of the string s. How do you do it?



In your case, the answer is "look at every single word in d, ignore the ones that are not subsequences of s, pick the longest one that remains."



There are a couple of better (read: faster, more efficient) ways of doing that job. Let me make one simple suggestion: sort the words!



In Python, there are Iterables and Sequences. An iterable is something you can iterate. A sequence is something you can access using s[i]. It's possible to have an infinite iterable, by writing an infinite generator function. It's not possible to have an infinite sequence, since you'll run out of memory trying to store it.



In this particular case it seems okay to assume that d is going to be a sequence: a finite list or tuple. So the fastest way to find the "longest word" is to start by looking at long words first. Because once you find a long word that is a subsequence, you can stop - there are no shorter ones!



The way to sort things in Python is the sorted built-in function. (Yes, it takes an iterable. No, it won't sort an infinite one. Yeesh!) By default, it sorts things by comparing the items using the "native" comparison. You can specify a key function, however, to sort using some different mechanism. Let's use the length of the words, which is the len function
(function len(x) -> int returns the length of the list/string/whatever). And let's reverse the order, so that the big ones come first:



d = sorted(d, key=len, reverse=True)


Now, instead of needing to check the length and update the longest-word-so-far variable, you can just return immediately once you find a subsequence.



In get_word_is_substring



Let's talk about default values. You say:



index_of_last_letter_found = None


But :! grep letter_found % in Vim tells me:



 index_of_last_letter_found = None
if letter in dictionary and (index_of_last_letter_found is None or dictionary[letter][-1] > index_of_last_letter_found):
if index_of_last_letter_found is None or index_of_last_letter_found < dictionary[letter][index]:
index_of_last_letter_found = dictionary[letter][index]


You're spending a lot of keystrokes checking for None. And all you do is compare using <, and assign new values. Why not just set the default value to some value that you know is going to be "too low"? Since string index values start at 0, maybe -1 would make sense:



index_of_last_letter_found = -1


While were at it, shorten that variable name. Names should be as long as they need to be, and no longer!



def get_word_is_substring(word, dictionary):
last_index = -1 # Index of last letter found

for letter in word:
if letter in dictionary and dictionary[letter][-1] > last_index:
index = 0
while index < len(dictionary[letter]):
if last_index < dictionary[letter][index]:
last_index = dictionary[letter][index]
break
else:
index += 1
else:
return False
return True


That's a lot more readable, since there are fewer tests and fewer characters.



Next, let's go back and work on your fetish for simple integer arithmetic. Now that you've watched Ned Batchelder's talk, you can see that index += 1 is not the way!



for index in dictionary[letter]:
if last_index < index:
last_index = index
break


(There are some other ways to find the first element in an iterable that matches a condition. See here for many of them. But this is nice and clear, and works.)



Some words on "style"



The above code may be somewhat more difficult to understand because I changed the meaning of index. In your original code, index was an index into the list of indices that pointed to where letters occur in the master sequence. In my updated version index means the actual indices from the list. I have removed one level of indirection.



This is an example of why "too short" variable names are actually a good thing. You'll find a lot of people use very small names, like ch or i, to represent loop variables. This is because most of the time, loop variables are not a "concept" or a "noun". Instead, they are a subscript. We write a[i] because the original Teletype devices hand-carved out of wood by our pioneer forbears would not allow writing 𝓐ᵢ.



But we don't really care about i, or about a[i]. We care about the thing stored there, and its properties. So, I encourage you to use short little 1-letter and 2-letter names when you are indexing a sequence, or when you're stuck in some other language that doesn't provide the nice looping options that Python does. Then you could say:



for i ...
index = s_dict[letter][i]


And there would be no confusion between index and that-loop-variable.



Next, the structuring of your escape clauses: I see a lot of people being taught to code in a way that puts exit conditions last. You're doing that in your loop:



 for letter in word:
if letter in dictionary and dictionary[letter][-1] > last_index:
# much code here
else:
return False


But moving the exit condition to the top (by reversing the sense of the if statement) makes it clear what the alternatives are (if this is true, leave, otherwise ...) and reduces the indentation level of your code (by removing the else indent). Some people would have you use two-space tabs to address the indentation problem. Those people are all young with good eyesight. They're also quite wrong. :-)



 for letter in word:
if letter not in dictionary or dictionary[letter][-1] <= last_index:
return False

# much code here


Lastly, there's the name. You called it get_word_is_substring, but we want subsequences not substring, and there's no reason to say get_ at the beginning. In fact, the word can be assumed, since you're passing it in, so:



def is_subsequence(word, s_dict):
last_index = -1 # Index (in s) of last letter found

for letter in word:
if letter not in s_dict or s_dict[letter][-1] <= last_index:
return False

for index in s_dict[letter]:
if last_index < index:
last_index = index
break

return True


(Note: Because I suggested s_dict be a defaultdict, you could eliminate the if test entirely and rely on the for ... else construct, which is rather obscure. I don't recommend it, simply because it's non-intuitive, hard to understand, and because spelling out the failure conditions is better than leaving them to the language. ("explicit is better than implicit"))



The class



Finally, let's talk about the class:



class Main:


But first! Here's another video for you to watch, Jack Diederich's talk at Pycon 2012 entitled: "Stop Writing Classes!": https://www.youtube.com/watch?v=o9pEzgHorH0



As Jack says, "The signature of 'this shouldn't be a class' is that it has two methods, one of which is __init__."



In this case, you've created a class, but you don't store any data and you only call one method from the outside. It should be a function.






share|improve this answer











$endgroup$












  • $begingroup$
    I take issue with the suggestion of, as a first step, sorting the list of words. You have turned an $O(n)$ search operation into an $O(n log n)$ operation, which can be quite expensive if you have a very long list of words to check. Pre-filtering may be a more apt method. After finding the first entry “able” is a subsequence of S, you can skip “ale” because, even if it is a valid subsequence, it is too short. Since “apple” is longer, you do need to check it, and find it is a subsequence. “bale” is too short, and can be skipped. Finally, you do need to check and reject “kangaroo”.
    $endgroup$
    – AJNeufeld
    Feb 28 at 19:15










  • $begingroup$
    First of all, thanks for your time! This is exactly what I was looking for, a way to take my code to the next level. I am going to have to re-read your suggestion a few times and do some studying. All of your suggestions make sense and I really appreciate it!
    $endgroup$
    – K. Kretz
    Feb 28 at 22:14


















3












$begingroup$

Use defaultdict and enumerate



Python offers the defaultdict class, which works like a regular dictonary but returns a default value (here: empty list) if the requested element is not present. Also use enumerate if you want an index in a loop.



from collections import defaultdict

def create_dictionary(string):
dictionary = defaultdict(list)
for index, letter in enumerate(string):
dictionary[letter].append(index)
return dictionary


In the next method, work with lists of indices to leverage the defaultdict. Also, use a list comprehension to filter for valid indices because that also defaults to an empty list, so you don't need any special case handling.



def get_word_is_substring(word, dictionary):
indices = [-1]
for letter in word:
indices = [index for index in dictionary[letter] if index > indices[0]]
if not indices:
return False
return True





share|improve this answer









$endgroup$












  • $begingroup$
    Thanks so much for your help! As mentioned in the answer above, I will definitely study what you suggested. I really appreciate it.
    $endgroup$
    – K. Kretz
    Feb 28 at 22:17


















2












$begingroup$

I know this is a code review and I should improve the code rather than write a new one, I'd like to share with you how I would have done it.



def longest_matching_word(s, d):
def is_valid(word, s=s):
if not word and not s:
return True
if not s and word:
return False
if word and word[0]==s[0]:
return is_valid(word[1:], s[1:])
return is_valid(word, s[1:])
return max(filter(is_valid, d), key=len)


S = "abppplee" 
D = "ale", "bale", "able", "apple", "kangaroo"

longest_matching_word(S, D) # -> 'apple'


In the way it's done, it's easier to debug and maintain (no state or variable, so less things that can go wrong).



Thanks to the lazy filter object, for large sets, it will use nearly no memory at all.



The complexity is O(n). One interesting things for large sets is that multiprocessing.map can be applied to more efficiently use the computer's resources.



And if you think about it, it's not that complicated to understand:



Visualisation of the algorithm



It is also very readable, there is no mysterious variables, questionable loops, everything has a purpose that can be deduced.



I hope it gives you ideas...






share|improve this answer











$endgroup$












  • $begingroup$
    @Ludisposed Thanks for your interest in my solution, my answer was mainly to suggest an alternative way, not to provide a full answer. But I provided some explanation as you asked and even a gif (or jif). Feel free to give me additional feedback on my answer.
    $endgroup$
    – Benoît Pilatte
    Feb 28 at 12:22










  • $begingroup$
    You are passing D to longest_matching_word(S, D), where it becomes known as d locally, and is never used. Instead you filter(is_valid, D) using D from the global context. Not the best example of “how I would have done it”.
    $endgroup$
    – AJNeufeld
    Feb 28 at 17:28










  • $begingroup$
    Additionally, recursion is a terrible looping construct, especially given that Python does not do tail-call-optimization. A simple loop would have sufficed.
    $endgroup$
    – AJNeufeld
    Feb 28 at 17:43










  • $begingroup$
    I indeed misspelled D instead of d. Concerning recursion, I will not agree with you, I tested my function on large sets and although it is usually 2x slower than other solutions regardless of the sets given, it requires less memory and can be easily scaled if really needed. I consider not using recursion-because of python's implementation premature optimisation. This of course will not be true with strings that are hundreds of char long. AS I HAVE ALREADY SAID: my objective was to suggest an alternative way.
    $endgroup$
    – Benoît Pilatte
    Feb 28 at 18:34











  • $begingroup$
    Thanks for your suggestion. I like the gif as it makes things much easier for me being a new programmer. I appreciate the time and effort!
    $endgroup$
    – K. Kretz
    Feb 28 at 23:07


















2












$begingroup$

Clarify requirements



What result is desired if there's a tie for longest word? Accepting the first of equal-length words is a valid choice, but we'll want to document that we've made a choice where the spec was vague. Similarly, returning None when there are no matches is another valid choice that should be documented.



I'd lean toward returning a collection of results; that allows the caller to decide whether to print one or more, and streamlines the handling of the empty result.



Reduce computational complexity



The code presented tests every candidate against the entire input, in turn. This scales as O(MN), where M is the size of the text and N is the number of candidate words. We can reduce that complexity by grouping the words according to the next unseen letter in the word; then walk through the text, examining just the set of words that can be advanced at each step. That scales as O(M+Nn), where Nn is the total length of all the candidate words.



It should be obvious that this really lends itself to a prefix-tree representation, but we can make it work for plain lists, too.



Here's a quickly knocked up implementation of that idea, with some small improvements (e.g. finish early if we match the longest candidate). Please forgive a C++ coder's approach to Python:



from collections import defaultdict

class Word:
def __init__(self, s):
self.chars = s
self.index = 0

def __repr__(self):
return self.chars + "(" + str(self.index) + ")"

class Matcher:
def __init__(self, words):
self.candidates = defaultdict(list)
self.longest = 0
self.results =
for w in words:
self.insert(Word(w))

def insert(self, word):
length = len(word.chars)
if length < self.longest:
return
elif word.index < length:
self.candidates[word.chars[word.index]].append(word)
elif length == self.longest:
self.results.append(word.chars)
elif length > self.longest:
self.longest = length
self.results = [word.chars]
for i in self.candidates.values():
i[:] = filter(lambda x: len(x.chars) >= length, i)

def step(self, char):
words = self.candidates[char]
del self.candidates[char]
for w in words:
w.index += 1
self.insert(w)
if not any(self.candidates.values()):
raise StopIteration

def read(self, it):
try:
for c in it:
self.step(c)
except StopIteration as e:
pass
return self.results


Testing this on a large problem set gives results around 20✕ faster (after removing the print() from your code, to give a fair comparison):



import itertools
import random
import string
import time
import timeit

def make_random_words(min_len, max_len):
"""An infinite generator of words containing min_len to max_len letters"""
while True:
length = random.randint(min_len, max_len)
yield ''.join([random.choice(string.ascii_letters) for n in range(length)])

if __name__ == "__main__":
# try ten-thousand random words and a long text
words = list(itertools.islice(make_random_words(5, 15), 10000))
mobydick = open("moby-dick.txt").read()

print(timeit.timeit("Main.get_longest_word(mobydick, words)",
number=10, timer=time.process_time, globals=globals()))

print(timeit.timeit("Matcher(words).read(mobydick)",
number=10, timer=time.process_time, globals=globals()))


9.290360821999998
0.510936933


These results might be more dramatic if we read directly from the file, as stopping early will have greater benefit in that case.






share|improve this answer











$endgroup$












  • $begingroup$
    Thanks for the explination! The time complexity difference is insane! Thanks for the advice, and as I have told the others, I will try to study up on your recommendations. I appreciate it!
    $endgroup$
    – K. Kretz
    Feb 28 at 23:12











Your Answer





StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f214408%2ffind-the-longest-word-in-set-d-that-is-a-subsequence-of-string-s%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























4 Answers
4






active

oldest

votes








4 Answers
4






active

oldest

votes









active

oldest

votes






active

oldest

votes









7












$begingroup$

First, the good news: your code looks pretty good. It's indented properly, spelled properly, spaced properly, capitalized properly, and uses the right Python style. Good work!



Here are a couple of nits:




  1. This is a "program." (As opposed to just a module, or package, or library.) So use the standard python idiom for programs at the bottom:



    if __name__ == '__main__':
    Main.get_longest_word("abppplee", "ale", "bale", "able", "apple", "kangaroo")


    This mechanism is to allow your code to be loaded (via import myprogram) without having the main entry point automatically get invoked. That lets you load it up in the interpreter and browse around before you call it.




  2. Use docstrings! Write down the problem you are trying to solve, and any notes about inputs or outputs for functions. Especially data formats.



    For coding puzzle type problems, the docstring is a great place to copy the puzzle specification. This lets you refer back to it, lets you remember what you were doing when you wrote this code, and lets you paste it up at CodeReview with ease!



    #!/usr/bin/env python3
    """ Given a string S and a set of words D, find the longest word in D that is a subsequence of S.

    Word W is a subsequence of S if some number of characters, possibly zero, can be deleted from S to form W, without reordering the remaining characters.

    Note: D can appear in any format (list, hash table, prefix tree, etc.)

    For example, given the input of S = "abppplee" and D = "able", "ale", "apple", "bale", "kangaroo" the correct output would be "apple".

    - The words "able" and "ale" are both subsequences of S, but they are shorter than "apple".
    - The word "bale" is not a subsequence of S because even though S has all the right letters, they are not in the right order.
    - The word "kangaroo" is the longest word in D, but it isn't a subsequence of S.
    """


Now here are some possible improvements:



In create_dictionary



def create_dictionary(string):
dictionary =
index = 0

for letter in string:
if letter in dictionary:
dictionary[letter].append(index)
else:
dictionary[letter] = [index]
index += 1
return(dictionary)


First, congratulations! You have written the hand-coded version of how to manage a dictionary whose values are lists. You got it right, your code works, and you should never do that again because it's boring. Instead, use collections.defaultdict(list). A defaultdict remembers a factory function that it will call whenever it is asked about a key but doesn't have a corresponding value.



The word list is not just the name of a Python data type, it's also the function to call when you want to construct one! (Just like you use the name of every class as it's constructor: my_obj = MyClass(1, 2, "hello")) So when you want a dictionary that looks up lists, it's much easier to assume that there will always be a list in the dictionary, but it might be empty. That's what defaultdict(list) will get you:



import collections # somewhere at top of file

def create_dictionary(s):
dictionary = collections.defaultdict(list)
index = 0

for letter in string:
dictionary[letter].append(index) # defaultdict FTW!
index += 1
return(dictionary)


Next, give up your love affair with integers! Ned Batchelder gave a nice talk on the subject, so here's a link to that: https://www.youtube.com/watch?v=EnSu9hHGq5o



The idea is that many (most?) python loops don't need to use integers to index over things. And when you do need an integer index (and in this case you do!) there are better ways than maintaining your own integer index variable. Here's one such way: the enumerate() built-in function.



With enumerate, you can write a loop that iterates over the values from some iterable along with an automatically-associated integer:



 # No index=0 here
for index, letter in enumerate(string):
dictionary[letter].append(index)
# No index+=1 here!


The index, string pair is called a tuple, and it is a built-in type just like list. The act of assigning or iterating using multiple target variables that take their values from a single tuple is called tuple unpacking. (Remember that phrase, you'll need it when you want to ask for help on the subject.)



In get_longest_word



The issue I have with this function is not one of Python, but rather of design. You are given some words, d. You want to find the longest word that is a subsequence of the string s. How do you do it?



In your case, the answer is "look at every single word in d, ignore the ones that are not subsequences of s, pick the longest one that remains."



There are a couple of better (read: faster, more efficient) ways of doing that job. Let me make one simple suggestion: sort the words!



In Python, there are Iterables and Sequences. An iterable is something you can iterate. A sequence is something you can access using s[i]. It's possible to have an infinite iterable, by writing an infinite generator function. It's not possible to have an infinite sequence, since you'll run out of memory trying to store it.



In this particular case it seems okay to assume that d is going to be a sequence: a finite list or tuple. So the fastest way to find the "longest word" is to start by looking at long words first. Because once you find a long word that is a subsequence, you can stop - there are no shorter ones!



The way to sort things in Python is the sorted built-in function. (Yes, it takes an iterable. No, it won't sort an infinite one. Yeesh!) By default, it sorts things by comparing the items using the "native" comparison. You can specify a key function, however, to sort using some different mechanism. Let's use the length of the words, which is the len function
(function len(x) -> int returns the length of the list/string/whatever). And let's reverse the order, so that the big ones come first:



d = sorted(d, key=len, reverse=True)


Now, instead of needing to check the length and update the longest-word-so-far variable, you can just return immediately once you find a subsequence.



In get_word_is_substring



Let's talk about default values. You say:



index_of_last_letter_found = None


But :! grep letter_found % in Vim tells me:



 index_of_last_letter_found = None
if letter in dictionary and (index_of_last_letter_found is None or dictionary[letter][-1] > index_of_last_letter_found):
if index_of_last_letter_found is None or index_of_last_letter_found < dictionary[letter][index]:
index_of_last_letter_found = dictionary[letter][index]


You're spending a lot of keystrokes checking for None. And all you do is compare using <, and assign new values. Why not just set the default value to some value that you know is going to be "too low"? Since string index values start at 0, maybe -1 would make sense:



index_of_last_letter_found = -1


While were at it, shorten that variable name. Names should be as long as they need to be, and no longer!



def get_word_is_substring(word, dictionary):
last_index = -1 # Index of last letter found

for letter in word:
if letter in dictionary and dictionary[letter][-1] > last_index:
index = 0
while index < len(dictionary[letter]):
if last_index < dictionary[letter][index]:
last_index = dictionary[letter][index]
break
else:
index += 1
else:
return False
return True


That's a lot more readable, since there are fewer tests and fewer characters.



Next, let's go back and work on your fetish for simple integer arithmetic. Now that you've watched Ned Batchelder's talk, you can see that index += 1 is not the way!



for index in dictionary[letter]:
if last_index < index:
last_index = index
break


(There are some other ways to find the first element in an iterable that matches a condition. See here for many of them. But this is nice and clear, and works.)



Some words on "style"



The above code may be somewhat more difficult to understand because I changed the meaning of index. In your original code, index was an index into the list of indices that pointed to where letters occur in the master sequence. In my updated version index means the actual indices from the list. I have removed one level of indirection.



This is an example of why "too short" variable names are actually a good thing. You'll find a lot of people use very small names, like ch or i, to represent loop variables. This is because most of the time, loop variables are not a "concept" or a "noun". Instead, they are a subscript. We write a[i] because the original Teletype devices hand-carved out of wood by our pioneer forbears would not allow writing 𝓐ᵢ.



But we don't really care about i, or about a[i]. We care about the thing stored there, and its properties. So, I encourage you to use short little 1-letter and 2-letter names when you are indexing a sequence, or when you're stuck in some other language that doesn't provide the nice looping options that Python does. Then you could say:



for i ...
index = s_dict[letter][i]


And there would be no confusion between index and that-loop-variable.



Next, the structuring of your escape clauses: I see a lot of people being taught to code in a way that puts exit conditions last. You're doing that in your loop:



 for letter in word:
if letter in dictionary and dictionary[letter][-1] > last_index:
# much code here
else:
return False


But moving the exit condition to the top (by reversing the sense of the if statement) makes it clear what the alternatives are (if this is true, leave, otherwise ...) and reduces the indentation level of your code (by removing the else indent). Some people would have you use two-space tabs to address the indentation problem. Those people are all young with good eyesight. They're also quite wrong. :-)



 for letter in word:
if letter not in dictionary or dictionary[letter][-1] <= last_index:
return False

# much code here


Lastly, there's the name. You called it get_word_is_substring, but we want subsequences not substring, and there's no reason to say get_ at the beginning. In fact, the word can be assumed, since you're passing it in, so:



def is_subsequence(word, s_dict):
last_index = -1 # Index (in s) of last letter found

for letter in word:
if letter not in s_dict or s_dict[letter][-1] <= last_index:
return False

for index in s_dict[letter]:
if last_index < index:
last_index = index
break

return True


(Note: Because I suggested s_dict be a defaultdict, you could eliminate the if test entirely and rely on the for ... else construct, which is rather obscure. I don't recommend it, simply because it's non-intuitive, hard to understand, and because spelling out the failure conditions is better than leaving them to the language. ("explicit is better than implicit"))



The class



Finally, let's talk about the class:



class Main:


But first! Here's another video for you to watch, Jack Diederich's talk at Pycon 2012 entitled: "Stop Writing Classes!": https://www.youtube.com/watch?v=o9pEzgHorH0



As Jack says, "The signature of 'this shouldn't be a class' is that it has two methods, one of which is __init__."



In this case, you've created a class, but you don't store any data and you only call one method from the outside. It should be a function.






share|improve this answer











$endgroup$












  • $begingroup$
    I take issue with the suggestion of, as a first step, sorting the list of words. You have turned an $O(n)$ search operation into an $O(n log n)$ operation, which can be quite expensive if you have a very long list of words to check. Pre-filtering may be a more apt method. After finding the first entry “able” is a subsequence of S, you can skip “ale” because, even if it is a valid subsequence, it is too short. Since “apple” is longer, you do need to check it, and find it is a subsequence. “bale” is too short, and can be skipped. Finally, you do need to check and reject “kangaroo”.
    $endgroup$
    – AJNeufeld
    Feb 28 at 19:15










  • $begingroup$
    First of all, thanks for your time! This is exactly what I was looking for, a way to take my code to the next level. I am going to have to re-read your suggestion a few times and do some studying. All of your suggestions make sense and I really appreciate it!
    $endgroup$
    – K. Kretz
    Feb 28 at 22:14















7












$begingroup$

First, the good news: your code looks pretty good. It's indented properly, spelled properly, spaced properly, capitalized properly, and uses the right Python style. Good work!



Here are a couple of nits:




  1. This is a "program." (As opposed to just a module, or package, or library.) So use the standard python idiom for programs at the bottom:



    if __name__ == '__main__':
    Main.get_longest_word("abppplee", "ale", "bale", "able", "apple", "kangaroo")


    This mechanism is to allow your code to be loaded (via import myprogram) without having the main entry point automatically get invoked. That lets you load it up in the interpreter and browse around before you call it.




  2. Use docstrings! Write down the problem you are trying to solve, and any notes about inputs or outputs for functions. Especially data formats.



    For coding puzzle type problems, the docstring is a great place to copy the puzzle specification. This lets you refer back to it, lets you remember what you were doing when you wrote this code, and lets you paste it up at CodeReview with ease!



    #!/usr/bin/env python3
    """ Given a string S and a set of words D, find the longest word in D that is a subsequence of S.

    Word W is a subsequence of S if some number of characters, possibly zero, can be deleted from S to form W, without reordering the remaining characters.

    Note: D can appear in any format (list, hash table, prefix tree, etc.)

    For example, given the input of S = "abppplee" and D = "able", "ale", "apple", "bale", "kangaroo" the correct output would be "apple".

    - The words "able" and "ale" are both subsequences of S, but they are shorter than "apple".
    - The word "bale" is not a subsequence of S because even though S has all the right letters, they are not in the right order.
    - The word "kangaroo" is the longest word in D, but it isn't a subsequence of S.
    """


Now here are some possible improvements:



In create_dictionary



def create_dictionary(string):
dictionary =
index = 0

for letter in string:
if letter in dictionary:
dictionary[letter].append(index)
else:
dictionary[letter] = [index]
index += 1
return(dictionary)


First, congratulations! You have written the hand-coded version of how to manage a dictionary whose values are lists. You got it right, your code works, and you should never do that again because it's boring. Instead, use collections.defaultdict(list). A defaultdict remembers a factory function that it will call whenever it is asked about a key but doesn't have a corresponding value.



The word list is not just the name of a Python data type, it's also the function to call when you want to construct one! (Just like you use the name of every class as it's constructor: my_obj = MyClass(1, 2, "hello")) So when you want a dictionary that looks up lists, it's much easier to assume that there will always be a list in the dictionary, but it might be empty. That's what defaultdict(list) will get you:



import collections # somewhere at top of file

def create_dictionary(s):
dictionary = collections.defaultdict(list)
index = 0

for letter in string:
dictionary[letter].append(index) # defaultdict FTW!
index += 1
return(dictionary)


Next, give up your love affair with integers! Ned Batchelder gave a nice talk on the subject, so here's a link to that: https://www.youtube.com/watch?v=EnSu9hHGq5o



The idea is that many (most?) python loops don't need to use integers to index over things. And when you do need an integer index (and in this case you do!) there are better ways than maintaining your own integer index variable. Here's one such way: the enumerate() built-in function.



With enumerate, you can write a loop that iterates over the values from some iterable along with an automatically-associated integer:



 # No index=0 here
for index, letter in enumerate(string):
dictionary[letter].append(index)
# No index+=1 here!


The index, string pair is called a tuple, and it is a built-in type just like list. The act of assigning or iterating using multiple target variables that take their values from a single tuple is called tuple unpacking. (Remember that phrase, you'll need it when you want to ask for help on the subject.)



In get_longest_word



The issue I have with this function is not one of Python, but rather of design. You are given some words, d. You want to find the longest word that is a subsequence of the string s. How do you do it?



In your case, the answer is "look at every single word in d, ignore the ones that are not subsequences of s, pick the longest one that remains."



There are a couple of better (read: faster, more efficient) ways of doing that job. Let me make one simple suggestion: sort the words!



In Python, there are Iterables and Sequences. An iterable is something you can iterate. A sequence is something you can access using s[i]. It's possible to have an infinite iterable, by writing an infinite generator function. It's not possible to have an infinite sequence, since you'll run out of memory trying to store it.



In this particular case it seems okay to assume that d is going to be a sequence: a finite list or tuple. So the fastest way to find the "longest word" is to start by looking at long words first. Because once you find a long word that is a subsequence, you can stop - there are no shorter ones!



The way to sort things in Python is the sorted built-in function. (Yes, it takes an iterable. No, it won't sort an infinite one. Yeesh!) By default, it sorts things by comparing the items using the "native" comparison. You can specify a key function, however, to sort using some different mechanism. Let's use the length of the words, which is the len function
(function len(x) -> int returns the length of the list/string/whatever). And let's reverse the order, so that the big ones come first:



d = sorted(d, key=len, reverse=True)


Now, instead of needing to check the length and update the longest-word-so-far variable, you can just return immediately once you find a subsequence.



In get_word_is_substring



Let's talk about default values. You say:



index_of_last_letter_found = None


But :! grep letter_found % in Vim tells me:



 index_of_last_letter_found = None
if letter in dictionary and (index_of_last_letter_found is None or dictionary[letter][-1] > index_of_last_letter_found):
if index_of_last_letter_found is None or index_of_last_letter_found < dictionary[letter][index]:
index_of_last_letter_found = dictionary[letter][index]


You're spending a lot of keystrokes checking for None. And all you do is compare using <, and assign new values. Why not just set the default value to some value that you know is going to be "too low"? Since string index values start at 0, maybe -1 would make sense:



index_of_last_letter_found = -1


While were at it, shorten that variable name. Names should be as long as they need to be, and no longer!



def get_word_is_substring(word, dictionary):
last_index = -1 # Index of last letter found

for letter in word:
if letter in dictionary and dictionary[letter][-1] > last_index:
index = 0
while index < len(dictionary[letter]):
if last_index < dictionary[letter][index]:
last_index = dictionary[letter][index]
break
else:
index += 1
else:
return False
return True


That's a lot more readable, since there are fewer tests and fewer characters.



Next, let's go back and work on your fetish for simple integer arithmetic. Now that you've watched Ned Batchelder's talk, you can see that index += 1 is not the way!



for index in dictionary[letter]:
if last_index < index:
last_index = index
break


(There are some other ways to find the first element in an iterable that matches a condition. See here for many of them. But this is nice and clear, and works.)



Some words on "style"



The above code may be somewhat more difficult to understand because I changed the meaning of index. In your original code, index was an index into the list of indices that pointed to where letters occur in the master sequence. In my updated version index means the actual indices from the list. I have removed one level of indirection.



This is an example of why "too short" variable names are actually a good thing. You'll find a lot of people use very small names, like ch or i, to represent loop variables. This is because most of the time, loop variables are not a "concept" or a "noun". Instead, they are a subscript. We write a[i] because the original Teletype devices hand-carved out of wood by our pioneer forbears would not allow writing 𝓐ᵢ.



But we don't really care about i, or about a[i]. We care about the thing stored there, and its properties. So, I encourage you to use short little 1-letter and 2-letter names when you are indexing a sequence, or when you're stuck in some other language that doesn't provide the nice looping options that Python does. Then you could say:



for i ...
index = s_dict[letter][i]


And there would be no confusion between index and that-loop-variable.



Next, the structuring of your escape clauses: I see a lot of people being taught to code in a way that puts exit conditions last. You're doing that in your loop:



 for letter in word:
if letter in dictionary and dictionary[letter][-1] > last_index:
# much code here
else:
return False


But moving the exit condition to the top (by reversing the sense of the if statement) makes it clear what the alternatives are (if this is true, leave, otherwise ...) and reduces the indentation level of your code (by removing the else indent). Some people would have you use two-space tabs to address the indentation problem. Those people are all young with good eyesight. They're also quite wrong. :-)



 for letter in word:
if letter not in dictionary or dictionary[letter][-1] <= last_index:
return False

# much code here


Lastly, there's the name. You called it get_word_is_substring, but we want subsequences not substring, and there's no reason to say get_ at the beginning. In fact, the word can be assumed, since you're passing it in, so:



def is_subsequence(word, s_dict):
last_index = -1 # Index (in s) of last letter found

for letter in word:
if letter not in s_dict or s_dict[letter][-1] <= last_index:
return False

for index in s_dict[letter]:
if last_index < index:
last_index = index
break

return True


(Note: Because I suggested s_dict be a defaultdict, you could eliminate the if test entirely and rely on the for ... else construct, which is rather obscure. I don't recommend it, simply because it's non-intuitive, hard to understand, and because spelling out the failure conditions is better than leaving them to the language. ("explicit is better than implicit"))



The class



Finally, let's talk about the class:



class Main:


But first! Here's another video for you to watch, Jack Diederich's talk at Pycon 2012 entitled: "Stop Writing Classes!": https://www.youtube.com/watch?v=o9pEzgHorH0



As Jack says, "The signature of 'this shouldn't be a class' is that it has two methods, one of which is __init__."



In this case, you've created a class, but you don't store any data and you only call one method from the outside. It should be a function.






share|improve this answer











$endgroup$












  • $begingroup$
    I take issue with the suggestion of, as a first step, sorting the list of words. You have turned an $O(n)$ search operation into an $O(n log n)$ operation, which can be quite expensive if you have a very long list of words to check. Pre-filtering may be a more apt method. After finding the first entry “able” is a subsequence of S, you can skip “ale” because, even if it is a valid subsequence, it is too short. Since “apple” is longer, you do need to check it, and find it is a subsequence. “bale” is too short, and can be skipped. Finally, you do need to check and reject “kangaroo”.
    $endgroup$
    – AJNeufeld
    Feb 28 at 19:15










  • $begingroup$
    First of all, thanks for your time! This is exactly what I was looking for, a way to take my code to the next level. I am going to have to re-read your suggestion a few times and do some studying. All of your suggestions make sense and I really appreciate it!
    $endgroup$
    – K. Kretz
    Feb 28 at 22:14













7












7








7





$begingroup$

First, the good news: your code looks pretty good. It's indented properly, spelled properly, spaced properly, capitalized properly, and uses the right Python style. Good work!



Here are a couple of nits:




  1. This is a "program." (As opposed to just a module, or package, or library.) So use the standard python idiom for programs at the bottom:



    if __name__ == '__main__':
    Main.get_longest_word("abppplee", "ale", "bale", "able", "apple", "kangaroo")


    This mechanism is to allow your code to be loaded (via import myprogram) without having the main entry point automatically get invoked. That lets you load it up in the interpreter and browse around before you call it.




  2. Use docstrings! Write down the problem you are trying to solve, and any notes about inputs or outputs for functions. Especially data formats.



    For coding puzzle type problems, the docstring is a great place to copy the puzzle specification. This lets you refer back to it, lets you remember what you were doing when you wrote this code, and lets you paste it up at CodeReview with ease!



    #!/usr/bin/env python3
    """ Given a string S and a set of words D, find the longest word in D that is a subsequence of S.

    Word W is a subsequence of S if some number of characters, possibly zero, can be deleted from S to form W, without reordering the remaining characters.

    Note: D can appear in any format (list, hash table, prefix tree, etc.)

    For example, given the input of S = "abppplee" and D = "able", "ale", "apple", "bale", "kangaroo" the correct output would be "apple".

    - The words "able" and "ale" are both subsequences of S, but they are shorter than "apple".
    - The word "bale" is not a subsequence of S because even though S has all the right letters, they are not in the right order.
    - The word "kangaroo" is the longest word in D, but it isn't a subsequence of S.
    """


Now here are some possible improvements:



In create_dictionary



def create_dictionary(string):
dictionary =
index = 0

for letter in string:
if letter in dictionary:
dictionary[letter].append(index)
else:
dictionary[letter] = [index]
index += 1
return(dictionary)


First, congratulations! You have written the hand-coded version of how to manage a dictionary whose values are lists. You got it right, your code works, and you should never do that again because it's boring. Instead, use collections.defaultdict(list). A defaultdict remembers a factory function that it will call whenever it is asked about a key but doesn't have a corresponding value.



The word list is not just the name of a Python data type, it's also the function to call when you want to construct one! (Just like you use the name of every class as it's constructor: my_obj = MyClass(1, 2, "hello")) So when you want a dictionary that looks up lists, it's much easier to assume that there will always be a list in the dictionary, but it might be empty. That's what defaultdict(list) will get you:



import collections # somewhere at top of file

def create_dictionary(s):
dictionary = collections.defaultdict(list)
index = 0

for letter in string:
dictionary[letter].append(index) # defaultdict FTW!
index += 1
return(dictionary)


Next, give up your love affair with integers! Ned Batchelder gave a nice talk on the subject, so here's a link to that: https://www.youtube.com/watch?v=EnSu9hHGq5o



The idea is that many (most?) python loops don't need to use integers to index over things. And when you do need an integer index (and in this case you do!) there are better ways than maintaining your own integer index variable. Here's one such way: the enumerate() built-in function.



With enumerate, you can write a loop that iterates over the values from some iterable along with an automatically-associated integer:



 # No index=0 here
for index, letter in enumerate(string):
dictionary[letter].append(index)
# No index+=1 here!


The index, string pair is called a tuple, and it is a built-in type just like list. The act of assigning or iterating using multiple target variables that take their values from a single tuple is called tuple unpacking. (Remember that phrase, you'll need it when you want to ask for help on the subject.)



In get_longest_word



The issue I have with this function is not one of Python, but rather of design. You are given some words, d. You want to find the longest word that is a subsequence of the string s. How do you do it?



In your case, the answer is "look at every single word in d, ignore the ones that are not subsequences of s, pick the longest one that remains."



There are a couple of better (read: faster, more efficient) ways of doing that job. Let me make one simple suggestion: sort the words!



In Python, there are Iterables and Sequences. An iterable is something you can iterate. A sequence is something you can access using s[i]. It's possible to have an infinite iterable, by writing an infinite generator function. It's not possible to have an infinite sequence, since you'll run out of memory trying to store it.



In this particular case it seems okay to assume that d is going to be a sequence: a finite list or tuple. So the fastest way to find the "longest word" is to start by looking at long words first. Because once you find a long word that is a subsequence, you can stop - there are no shorter ones!



The way to sort things in Python is the sorted built-in function. (Yes, it takes an iterable. No, it won't sort an infinite one. Yeesh!) By default, it sorts things by comparing the items using the "native" comparison. You can specify a key function, however, to sort using some different mechanism. Let's use the length of the words, which is the len function
(function len(x) -> int returns the length of the list/string/whatever). And let's reverse the order, so that the big ones come first:



d = sorted(d, key=len, reverse=True)


Now, instead of needing to check the length and update the longest-word-so-far variable, you can just return immediately once you find a subsequence.



In get_word_is_substring



Let's talk about default values. You say:



index_of_last_letter_found = None


But :! grep letter_found % in Vim tells me:



 index_of_last_letter_found = None
if letter in dictionary and (index_of_last_letter_found is None or dictionary[letter][-1] > index_of_last_letter_found):
if index_of_last_letter_found is None or index_of_last_letter_found < dictionary[letter][index]:
index_of_last_letter_found = dictionary[letter][index]


You're spending a lot of keystrokes checking for None. And all you do is compare using <, and assign new values. Why not just set the default value to some value that you know is going to be "too low"? Since string index values start at 0, maybe -1 would make sense:



index_of_last_letter_found = -1


While were at it, shorten that variable name. Names should be as long as they need to be, and no longer!



def get_word_is_substring(word, dictionary):
last_index = -1 # Index of last letter found

for letter in word:
if letter in dictionary and dictionary[letter][-1] > last_index:
index = 0
while index < len(dictionary[letter]):
if last_index < dictionary[letter][index]:
last_index = dictionary[letter][index]
break
else:
index += 1
else:
return False
return True


That's a lot more readable, since there are fewer tests and fewer characters.



Next, let's go back and work on your fetish for simple integer arithmetic. Now that you've watched Ned Batchelder's talk, you can see that index += 1 is not the way!



for index in dictionary[letter]:
if last_index < index:
last_index = index
break


(There are some other ways to find the first element in an iterable that matches a condition. See here for many of them. But this is nice and clear, and works.)



Some words on "style"



The above code may be somewhat more difficult to understand because I changed the meaning of index. In your original code, index was an index into the list of indices that pointed to where letters occur in the master sequence. In my updated version index means the actual indices from the list. I have removed one level of indirection.



This is an example of why "too short" variable names are actually a good thing. You'll find a lot of people use very small names, like ch or i, to represent loop variables. This is because most of the time, loop variables are not a "concept" or a "noun". Instead, they are a subscript. We write a[i] because the original Teletype devices hand-carved out of wood by our pioneer forbears would not allow writing 𝓐ᵢ.



But we don't really care about i, or about a[i]. We care about the thing stored there, and its properties. So, I encourage you to use short little 1-letter and 2-letter names when you are indexing a sequence, or when you're stuck in some other language that doesn't provide the nice looping options that Python does. Then you could say:



for i ...
index = s_dict[letter][i]


And there would be no confusion between index and that-loop-variable.



Next, the structuring of your escape clauses: I see a lot of people being taught to code in a way that puts exit conditions last. You're doing that in your loop:



 for letter in word:
if letter in dictionary and dictionary[letter][-1] > last_index:
# much code here
else:
return False


But moving the exit condition to the top (by reversing the sense of the if statement) makes it clear what the alternatives are (if this is true, leave, otherwise ...) and reduces the indentation level of your code (by removing the else indent). Some people would have you use two-space tabs to address the indentation problem. Those people are all young with good eyesight. They're also quite wrong. :-)



 for letter in word:
if letter not in dictionary or dictionary[letter][-1] <= last_index:
return False

# much code here


Lastly, there's the name. You called it get_word_is_substring, but we want subsequences not substring, and there's no reason to say get_ at the beginning. In fact, the word can be assumed, since you're passing it in, so:



def is_subsequence(word, s_dict):
last_index = -1 # Index (in s) of last letter found

for letter in word:
if letter not in s_dict or s_dict[letter][-1] <= last_index:
return False

for index in s_dict[letter]:
if last_index < index:
last_index = index
break

return True


(Note: Because I suggested s_dict be a defaultdict, you could eliminate the if test entirely and rely on the for ... else construct, which is rather obscure. I don't recommend it, simply because it's non-intuitive, hard to understand, and because spelling out the failure conditions is better than leaving them to the language. ("explicit is better than implicit"))



The class



Finally, let's talk about the class:



class Main:


But first! Here's another video for you to watch, Jack Diederich's talk at Pycon 2012 entitled: "Stop Writing Classes!": https://www.youtube.com/watch?v=o9pEzgHorH0



As Jack says, "The signature of 'this shouldn't be a class' is that it has two methods, one of which is __init__."



In this case, you've created a class, but you don't store any data and you only call one method from the outside. It should be a function.






share|improve this answer











$endgroup$



First, the good news: your code looks pretty good. It's indented properly, spelled properly, spaced properly, capitalized properly, and uses the right Python style. Good work!



Here are a couple of nits:




  1. This is a "program." (As opposed to just a module, or package, or library.) So use the standard python idiom for programs at the bottom:



    if __name__ == '__main__':
    Main.get_longest_word("abppplee", "ale", "bale", "able", "apple", "kangaroo")


    This mechanism is to allow your code to be loaded (via import myprogram) without having the main entry point automatically get invoked. That lets you load it up in the interpreter and browse around before you call it.




  2. Use docstrings! Write down the problem you are trying to solve, and any notes about inputs or outputs for functions. Especially data formats.



    For coding puzzle type problems, the docstring is a great place to copy the puzzle specification. This lets you refer back to it, lets you remember what you were doing when you wrote this code, and lets you paste it up at CodeReview with ease!



    #!/usr/bin/env python3
    """ Given a string S and a set of words D, find the longest word in D that is a subsequence of S.

    Word W is a subsequence of S if some number of characters, possibly zero, can be deleted from S to form W, without reordering the remaining characters.

    Note: D can appear in any format (list, hash table, prefix tree, etc.)

    For example, given the input of S = "abppplee" and D = "able", "ale", "apple", "bale", "kangaroo" the correct output would be "apple".

    - The words "able" and "ale" are both subsequences of S, but they are shorter than "apple".
    - The word "bale" is not a subsequence of S because even though S has all the right letters, they are not in the right order.
    - The word "kangaroo" is the longest word in D, but it isn't a subsequence of S.
    """


Now here are some possible improvements:



In create_dictionary



def create_dictionary(string):
dictionary =
index = 0

for letter in string:
if letter in dictionary:
dictionary[letter].append(index)
else:
dictionary[letter] = [index]
index += 1
return(dictionary)


First, congratulations! You have written the hand-coded version of how to manage a dictionary whose values are lists. You got it right, your code works, and you should never do that again because it's boring. Instead, use collections.defaultdict(list). A defaultdict remembers a factory function that it will call whenever it is asked about a key but doesn't have a corresponding value.



The word list is not just the name of a Python data type, it's also the function to call when you want to construct one! (Just like you use the name of every class as it's constructor: my_obj = MyClass(1, 2, "hello")) So when you want a dictionary that looks up lists, it's much easier to assume that there will always be a list in the dictionary, but it might be empty. That's what defaultdict(list) will get you:



import collections # somewhere at top of file

def create_dictionary(s):
dictionary = collections.defaultdict(list)
index = 0

for letter in string:
dictionary[letter].append(index) # defaultdict FTW!
index += 1
return(dictionary)


Next, give up your love affair with integers! Ned Batchelder gave a nice talk on the subject, so here's a link to that: https://www.youtube.com/watch?v=EnSu9hHGq5o



The idea is that many (most?) python loops don't need to use integers to index over things. And when you do need an integer index (and in this case you do!) there are better ways than maintaining your own integer index variable. Here's one such way: the enumerate() built-in function.



With enumerate, you can write a loop that iterates over the values from some iterable along with an automatically-associated integer:



 # No index=0 here
for index, letter in enumerate(string):
dictionary[letter].append(index)
# No index+=1 here!


The index, string pair is called a tuple, and it is a built-in type just like list. The act of assigning or iterating using multiple target variables that take their values from a single tuple is called tuple unpacking. (Remember that phrase, you'll need it when you want to ask for help on the subject.)



In get_longest_word



The issue I have with this function is not one of Python, but rather of design. You are given some words, d. You want to find the longest word that is a subsequence of the string s. How do you do it?



In your case, the answer is "look at every single word in d, ignore the ones that are not subsequences of s, pick the longest one that remains."



There are a couple of better (read: faster, more efficient) ways of doing that job. Let me make one simple suggestion: sort the words!



In Python, there are Iterables and Sequences. An iterable is something you can iterate. A sequence is something you can access using s[i]. It's possible to have an infinite iterable, by writing an infinite generator function. It's not possible to have an infinite sequence, since you'll run out of memory trying to store it.



In this particular case it seems okay to assume that d is going to be a sequence: a finite list or tuple. So the fastest way to find the "longest word" is to start by looking at long words first. Because once you find a long word that is a subsequence, you can stop - there are no shorter ones!



The way to sort things in Python is the sorted built-in function. (Yes, it takes an iterable. No, it won't sort an infinite one. Yeesh!) By default, it sorts things by comparing the items using the "native" comparison. You can specify a key function, however, to sort using some different mechanism. Let's use the length of the words, which is the len function
(function len(x) -> int returns the length of the list/string/whatever). And let's reverse the order, so that the big ones come first:



d = sorted(d, key=len, reverse=True)


Now, instead of needing to check the length and update the longest-word-so-far variable, you can just return immediately once you find a subsequence.



In get_word_is_substring



Let's talk about default values. You say:



index_of_last_letter_found = None


But :! grep letter_found % in Vim tells me:



 index_of_last_letter_found = None
if letter in dictionary and (index_of_last_letter_found is None or dictionary[letter][-1] > index_of_last_letter_found):
if index_of_last_letter_found is None or index_of_last_letter_found < dictionary[letter][index]:
index_of_last_letter_found = dictionary[letter][index]


You're spending a lot of keystrokes checking for None. And all you do is compare using <, and assign new values. Why not just set the default value to some value that you know is going to be "too low"? Since string index values start at 0, maybe -1 would make sense:



index_of_last_letter_found = -1


While were at it, shorten that variable name. Names should be as long as they need to be, and no longer!



def get_word_is_substring(word, dictionary):
last_index = -1 # Index of last letter found

for letter in word:
if letter in dictionary and dictionary[letter][-1] > last_index:
index = 0
while index < len(dictionary[letter]):
if last_index < dictionary[letter][index]:
last_index = dictionary[letter][index]
break
else:
index += 1
else:
return False
return True


That's a lot more readable, since there are fewer tests and fewer characters.



Next, let's go back and work on your fetish for simple integer arithmetic. Now that you've watched Ned Batchelder's talk, you can see that index += 1 is not the way!



for index in dictionary[letter]:
if last_index < index:
last_index = index
break


(There are some other ways to find the first element in an iterable that matches a condition. See here for many of them. But this is nice and clear, and works.)



Some words on "style"



The above code may be somewhat more difficult to understand because I changed the meaning of index. In your original code, index was an index into the list of indices that pointed to where letters occur in the master sequence. In my updated version index means the actual indices from the list. I have removed one level of indirection.



This is an example of why "too short" variable names are actually a good thing. You'll find a lot of people use very small names, like ch or i, to represent loop variables. This is because most of the time, loop variables are not a "concept" or a "noun". Instead, they are a subscript. We write a[i] because the original Teletype devices hand-carved out of wood by our pioneer forbears would not allow writing 𝓐ᵢ.



But we don't really care about i, or about a[i]. We care about the thing stored there, and its properties. So, I encourage you to use short little 1-letter and 2-letter names when you are indexing a sequence, or when you're stuck in some other language that doesn't provide the nice looping options that Python does. Then you could say:



for i ...
index = s_dict[letter][i]


And there would be no confusion between index and that-loop-variable.



Next, the structuring of your escape clauses: I see a lot of people being taught to code in a way that puts exit conditions last. You're doing that in your loop:



 for letter in word:
if letter in dictionary and dictionary[letter][-1] > last_index:
# much code here
else:
return False


But moving the exit condition to the top (by reversing the sense of the if statement) makes it clear what the alternatives are (if this is true, leave, otherwise ...) and reduces the indentation level of your code (by removing the else indent). Some people would have you use two-space tabs to address the indentation problem. Those people are all young with good eyesight. They're also quite wrong. :-)



 for letter in word:
if letter not in dictionary or dictionary[letter][-1] <= last_index:
return False

# much code here


Lastly, there's the name. You called it get_word_is_substring, but we want subsequences not substring, and there's no reason to say get_ at the beginning. In fact, the word can be assumed, since you're passing it in, so:



def is_subsequence(word, s_dict):
last_index = -1 # Index (in s) of last letter found

for letter in word:
if letter not in s_dict or s_dict[letter][-1] <= last_index:
return False

for index in s_dict[letter]:
if last_index < index:
last_index = index
break

return True


(Note: Because I suggested s_dict be a defaultdict, you could eliminate the if test entirely and rely on the for ... else construct, which is rather obscure. I don't recommend it, simply because it's non-intuitive, hard to understand, and because spelling out the failure conditions is better than leaving them to the language. ("explicit is better than implicit"))



The class



Finally, let's talk about the class:



class Main:


But first! Here's another video for you to watch, Jack Diederich's talk at Pycon 2012 entitled: "Stop Writing Classes!": https://www.youtube.com/watch?v=o9pEzgHorH0



As Jack says, "The signature of 'this shouldn't be a class' is that it has two methods, one of which is __init__."



In this case, you've created a class, but you don't store any data and you only call one method from the outside. It should be a function.







share|improve this answer














share|improve this answer



share|improve this answer








edited Feb 27 at 20:17

























answered Feb 27 at 19:55









Austin HastingsAustin Hastings

7,5071233




7,5071233











  • $begingroup$
    I take issue with the suggestion of, as a first step, sorting the list of words. You have turned an $O(n)$ search operation into an $O(n log n)$ operation, which can be quite expensive if you have a very long list of words to check. Pre-filtering may be a more apt method. After finding the first entry “able” is a subsequence of S, you can skip “ale” because, even if it is a valid subsequence, it is too short. Since “apple” is longer, you do need to check it, and find it is a subsequence. “bale” is too short, and can be skipped. Finally, you do need to check and reject “kangaroo”.
    $endgroup$
    – AJNeufeld
    Feb 28 at 19:15










  • $begingroup$
    First of all, thanks for your time! This is exactly what I was looking for, a way to take my code to the next level. I am going to have to re-read your suggestion a few times and do some studying. All of your suggestions make sense and I really appreciate it!
    $endgroup$
    – K. Kretz
    Feb 28 at 22:14
















  • $begingroup$
    I take issue with the suggestion of, as a first step, sorting the list of words. You have turned an $O(n)$ search operation into an $O(n log n)$ operation, which can be quite expensive if you have a very long list of words to check. Pre-filtering may be a more apt method. After finding the first entry “able” is a subsequence of S, you can skip “ale” because, even if it is a valid subsequence, it is too short. Since “apple” is longer, you do need to check it, and find it is a subsequence. “bale” is too short, and can be skipped. Finally, you do need to check and reject “kangaroo”.
    $endgroup$
    – AJNeufeld
    Feb 28 at 19:15










  • $begingroup$
    First of all, thanks for your time! This is exactly what I was looking for, a way to take my code to the next level. I am going to have to re-read your suggestion a few times and do some studying. All of your suggestions make sense and I really appreciate it!
    $endgroup$
    – K. Kretz
    Feb 28 at 22:14















$begingroup$
I take issue with the suggestion of, as a first step, sorting the list of words. You have turned an $O(n)$ search operation into an $O(n log n)$ operation, which can be quite expensive if you have a very long list of words to check. Pre-filtering may be a more apt method. After finding the first entry “able” is a subsequence of S, you can skip “ale” because, even if it is a valid subsequence, it is too short. Since “apple” is longer, you do need to check it, and find it is a subsequence. “bale” is too short, and can be skipped. Finally, you do need to check and reject “kangaroo”.
$endgroup$
– AJNeufeld
Feb 28 at 19:15




$begingroup$
I take issue with the suggestion of, as a first step, sorting the list of words. You have turned an $O(n)$ search operation into an $O(n log n)$ operation, which can be quite expensive if you have a very long list of words to check. Pre-filtering may be a more apt method. After finding the first entry “able” is a subsequence of S, you can skip “ale” because, even if it is a valid subsequence, it is too short. Since “apple” is longer, you do need to check it, and find it is a subsequence. “bale” is too short, and can be skipped. Finally, you do need to check and reject “kangaroo”.
$endgroup$
– AJNeufeld
Feb 28 at 19:15












$begingroup$
First of all, thanks for your time! This is exactly what I was looking for, a way to take my code to the next level. I am going to have to re-read your suggestion a few times and do some studying. All of your suggestions make sense and I really appreciate it!
$endgroup$
– K. Kretz
Feb 28 at 22:14




$begingroup$
First of all, thanks for your time! This is exactly what I was looking for, a way to take my code to the next level. I am going to have to re-read your suggestion a few times and do some studying. All of your suggestions make sense and I really appreciate it!
$endgroup$
– K. Kretz
Feb 28 at 22:14













3












$begingroup$

Use defaultdict and enumerate



Python offers the defaultdict class, which works like a regular dictonary but returns a default value (here: empty list) if the requested element is not present. Also use enumerate if you want an index in a loop.



from collections import defaultdict

def create_dictionary(string):
dictionary = defaultdict(list)
for index, letter in enumerate(string):
dictionary[letter].append(index)
return dictionary


In the next method, work with lists of indices to leverage the defaultdict. Also, use a list comprehension to filter for valid indices because that also defaults to an empty list, so you don't need any special case handling.



def get_word_is_substring(word, dictionary):
indices = [-1]
for letter in word:
indices = [index for index in dictionary[letter] if index > indices[0]]
if not indices:
return False
return True





share|improve this answer









$endgroup$












  • $begingroup$
    Thanks so much for your help! As mentioned in the answer above, I will definitely study what you suggested. I really appreciate it.
    $endgroup$
    – K. Kretz
    Feb 28 at 22:17















3












$begingroup$

Use defaultdict and enumerate



Python offers the defaultdict class, which works like a regular dictonary but returns a default value (here: empty list) if the requested element is not present. Also use enumerate if you want an index in a loop.



from collections import defaultdict

def create_dictionary(string):
dictionary = defaultdict(list)
for index, letter in enumerate(string):
dictionary[letter].append(index)
return dictionary


In the next method, work with lists of indices to leverage the defaultdict. Also, use a list comprehension to filter for valid indices because that also defaults to an empty list, so you don't need any special case handling.



def get_word_is_substring(word, dictionary):
indices = [-1]
for letter in word:
indices = [index for index in dictionary[letter] if index > indices[0]]
if not indices:
return False
return True





share|improve this answer









$endgroup$












  • $begingroup$
    Thanks so much for your help! As mentioned in the answer above, I will definitely study what you suggested. I really appreciate it.
    $endgroup$
    – K. Kretz
    Feb 28 at 22:17













3












3








3





$begingroup$

Use defaultdict and enumerate



Python offers the defaultdict class, which works like a regular dictonary but returns a default value (here: empty list) if the requested element is not present. Also use enumerate if you want an index in a loop.



from collections import defaultdict

def create_dictionary(string):
dictionary = defaultdict(list)
for index, letter in enumerate(string):
dictionary[letter].append(index)
return dictionary


In the next method, work with lists of indices to leverage the defaultdict. Also, use a list comprehension to filter for valid indices because that also defaults to an empty list, so you don't need any special case handling.



def get_word_is_substring(word, dictionary):
indices = [-1]
for letter in word:
indices = [index for index in dictionary[letter] if index > indices[0]]
if not indices:
return False
return True





share|improve this answer









$endgroup$



Use defaultdict and enumerate



Python offers the defaultdict class, which works like a regular dictonary but returns a default value (here: empty list) if the requested element is not present. Also use enumerate if you want an index in a loop.



from collections import defaultdict

def create_dictionary(string):
dictionary = defaultdict(list)
for index, letter in enumerate(string):
dictionary[letter].append(index)
return dictionary


In the next method, work with lists of indices to leverage the defaultdict. Also, use a list comprehension to filter for valid indices because that also defaults to an empty list, so you don't need any special case handling.



def get_word_is_substring(word, dictionary):
indices = [-1]
for letter in word:
indices = [index for index in dictionary[letter] if index > indices[0]]
if not indices:
return False
return True






share|improve this answer












share|improve this answer



share|improve this answer










answered Feb 27 at 19:57









Rainer P.Rainer P.

1,261414




1,261414











  • $begingroup$
    Thanks so much for your help! As mentioned in the answer above, I will definitely study what you suggested. I really appreciate it.
    $endgroup$
    – K. Kretz
    Feb 28 at 22:17
















  • $begingroup$
    Thanks so much for your help! As mentioned in the answer above, I will definitely study what you suggested. I really appreciate it.
    $endgroup$
    – K. Kretz
    Feb 28 at 22:17















$begingroup$
Thanks so much for your help! As mentioned in the answer above, I will definitely study what you suggested. I really appreciate it.
$endgroup$
– K. Kretz
Feb 28 at 22:17




$begingroup$
Thanks so much for your help! As mentioned in the answer above, I will definitely study what you suggested. I really appreciate it.
$endgroup$
– K. Kretz
Feb 28 at 22:17











2












$begingroup$

I know this is a code review and I should improve the code rather than write a new one, I'd like to share with you how I would have done it.



def longest_matching_word(s, d):
def is_valid(word, s=s):
if not word and not s:
return True
if not s and word:
return False
if word and word[0]==s[0]:
return is_valid(word[1:], s[1:])
return is_valid(word, s[1:])
return max(filter(is_valid, d), key=len)


S = "abppplee" 
D = "ale", "bale", "able", "apple", "kangaroo"

longest_matching_word(S, D) # -> 'apple'


In the way it's done, it's easier to debug and maintain (no state or variable, so less things that can go wrong).



Thanks to the lazy filter object, for large sets, it will use nearly no memory at all.



The complexity is O(n). One interesting things for large sets is that multiprocessing.map can be applied to more efficiently use the computer's resources.



And if you think about it, it's not that complicated to understand:



Visualisation of the algorithm



It is also very readable, there is no mysterious variables, questionable loops, everything has a purpose that can be deduced.



I hope it gives you ideas...






share|improve this answer











$endgroup$












  • $begingroup$
    @Ludisposed Thanks for your interest in my solution, my answer was mainly to suggest an alternative way, not to provide a full answer. But I provided some explanation as you asked and even a gif (or jif). Feel free to give me additional feedback on my answer.
    $endgroup$
    – Benoît Pilatte
    Feb 28 at 12:22










  • $begingroup$
    You are passing D to longest_matching_word(S, D), where it becomes known as d locally, and is never used. Instead you filter(is_valid, D) using D from the global context. Not the best example of “how I would have done it”.
    $endgroup$
    – AJNeufeld
    Feb 28 at 17:28










  • $begingroup$
    Additionally, recursion is a terrible looping construct, especially given that Python does not do tail-call-optimization. A simple loop would have sufficed.
    $endgroup$
    – AJNeufeld
    Feb 28 at 17:43










  • $begingroup$
    I indeed misspelled D instead of d. Concerning recursion, I will not agree with you, I tested my function on large sets and although it is usually 2x slower than other solutions regardless of the sets given, it requires less memory and can be easily scaled if really needed. I consider not using recursion-because of python's implementation premature optimisation. This of course will not be true with strings that are hundreds of char long. AS I HAVE ALREADY SAID: my objective was to suggest an alternative way.
    $endgroup$
    – Benoît Pilatte
    Feb 28 at 18:34











  • $begingroup$
    Thanks for your suggestion. I like the gif as it makes things much easier for me being a new programmer. I appreciate the time and effort!
    $endgroup$
    – K. Kretz
    Feb 28 at 23:07















2












$begingroup$

I know this is a code review and I should improve the code rather than write a new one, I'd like to share with you how I would have done it.



def longest_matching_word(s, d):
def is_valid(word, s=s):
if not word and not s:
return True
if not s and word:
return False
if word and word[0]==s[0]:
return is_valid(word[1:], s[1:])
return is_valid(word, s[1:])
return max(filter(is_valid, d), key=len)


S = "abppplee" 
D = "ale", "bale", "able", "apple", "kangaroo"

longest_matching_word(S, D) # -> 'apple'


In the way it's done, it's easier to debug and maintain (no state or variable, so less things that can go wrong).



Thanks to the lazy filter object, for large sets, it will use nearly no memory at all.



The complexity is O(n). One interesting things for large sets is that multiprocessing.map can be applied to more efficiently use the computer's resources.



And if you think about it, it's not that complicated to understand:



Visualisation of the algorithm



It is also very readable, there is no mysterious variables, questionable loops, everything has a purpose that can be deduced.



I hope it gives you ideas...






share|improve this answer











$endgroup$












  • $begingroup$
    @Ludisposed Thanks for your interest in my solution, my answer was mainly to suggest an alternative way, not to provide a full answer. But I provided some explanation as you asked and even a gif (or jif). Feel free to give me additional feedback on my answer.
    $endgroup$
    – Benoît Pilatte
    Feb 28 at 12:22










  • $begingroup$
    You are passing D to longest_matching_word(S, D), where it becomes known as d locally, and is never used. Instead you filter(is_valid, D) using D from the global context. Not the best example of “how I would have done it”.
    $endgroup$
    – AJNeufeld
    Feb 28 at 17:28










  • $begingroup$
    Additionally, recursion is a terrible looping construct, especially given that Python does not do tail-call-optimization. A simple loop would have sufficed.
    $endgroup$
    – AJNeufeld
    Feb 28 at 17:43










  • $begingroup$
    I indeed misspelled D instead of d. Concerning recursion, I will not agree with you, I tested my function on large sets and although it is usually 2x slower than other solutions regardless of the sets given, it requires less memory and can be easily scaled if really needed. I consider not using recursion-because of python's implementation premature optimisation. This of course will not be true with strings that are hundreds of char long. AS I HAVE ALREADY SAID: my objective was to suggest an alternative way.
    $endgroup$
    – Benoît Pilatte
    Feb 28 at 18:34











  • $begingroup$
    Thanks for your suggestion. I like the gif as it makes things much easier for me being a new programmer. I appreciate the time and effort!
    $endgroup$
    – K. Kretz
    Feb 28 at 23:07













2












2








2





$begingroup$

I know this is a code review and I should improve the code rather than write a new one, I'd like to share with you how I would have done it.



def longest_matching_word(s, d):
def is_valid(word, s=s):
if not word and not s:
return True
if not s and word:
return False
if word and word[0]==s[0]:
return is_valid(word[1:], s[1:])
return is_valid(word, s[1:])
return max(filter(is_valid, d), key=len)


S = "abppplee" 
D = "ale", "bale", "able", "apple", "kangaroo"

longest_matching_word(S, D) # -> 'apple'


In the way it's done, it's easier to debug and maintain (no state or variable, so less things that can go wrong).



Thanks to the lazy filter object, for large sets, it will use nearly no memory at all.



The complexity is O(n). One interesting things for large sets is that multiprocessing.map can be applied to more efficiently use the computer's resources.



And if you think about it, it's not that complicated to understand:



Visualisation of the algorithm



It is also very readable, there is no mysterious variables, questionable loops, everything has a purpose that can be deduced.



I hope it gives you ideas...






share|improve this answer











$endgroup$



I know this is a code review and I should improve the code rather than write a new one, I'd like to share with you how I would have done it.



def longest_matching_word(s, d):
def is_valid(word, s=s):
if not word and not s:
return True
if not s and word:
return False
if word and word[0]==s[0]:
return is_valid(word[1:], s[1:])
return is_valid(word, s[1:])
return max(filter(is_valid, d), key=len)


S = "abppplee" 
D = "ale", "bale", "able", "apple", "kangaroo"

longest_matching_word(S, D) # -> 'apple'


In the way it's done, it's easier to debug and maintain (no state or variable, so less things that can go wrong).



Thanks to the lazy filter object, for large sets, it will use nearly no memory at all.



The complexity is O(n). One interesting things for large sets is that multiprocessing.map can be applied to more efficiently use the computer's resources.



And if you think about it, it's not that complicated to understand:



Visualisation of the algorithm



It is also very readable, there is no mysterious variables, questionable loops, everything has a purpose that can be deduced.



I hope it gives you ideas...







share|improve this answer














share|improve this answer



share|improve this answer








edited Feb 28 at 17:45

























answered Feb 27 at 22:58









Benoît PilatteBenoît Pilatte

63911




63911











  • $begingroup$
    @Ludisposed Thanks for your interest in my solution, my answer was mainly to suggest an alternative way, not to provide a full answer. But I provided some explanation as you asked and even a gif (or jif). Feel free to give me additional feedback on my answer.
    $endgroup$
    – Benoît Pilatte
    Feb 28 at 12:22










  • $begingroup$
    You are passing D to longest_matching_word(S, D), where it becomes known as d locally, and is never used. Instead you filter(is_valid, D) using D from the global context. Not the best example of “how I would have done it”.
    $endgroup$
    – AJNeufeld
    Feb 28 at 17:28










  • $begingroup$
    Additionally, recursion is a terrible looping construct, especially given that Python does not do tail-call-optimization. A simple loop would have sufficed.
    $endgroup$
    – AJNeufeld
    Feb 28 at 17:43










  • $begingroup$
    I indeed misspelled D instead of d. Concerning recursion, I will not agree with you, I tested my function on large sets and although it is usually 2x slower than other solutions regardless of the sets given, it requires less memory and can be easily scaled if really needed. I consider not using recursion-because of python's implementation premature optimisation. This of course will not be true with strings that are hundreds of char long. AS I HAVE ALREADY SAID: my objective was to suggest an alternative way.
    $endgroup$
    – Benoît Pilatte
    Feb 28 at 18:34











  • $begingroup$
    Thanks for your suggestion. I like the gif as it makes things much easier for me being a new programmer. I appreciate the time and effort!
    $endgroup$
    – K. Kretz
    Feb 28 at 23:07
















  • $begingroup$
    @Ludisposed Thanks for your interest in my solution, my answer was mainly to suggest an alternative way, not to provide a full answer. But I provided some explanation as you asked and even a gif (or jif). Feel free to give me additional feedback on my answer.
    $endgroup$
    – Benoît Pilatte
    Feb 28 at 12:22










  • $begingroup$
    You are passing D to longest_matching_word(S, D), where it becomes known as d locally, and is never used. Instead you filter(is_valid, D) using D from the global context. Not the best example of “how I would have done it”.
    $endgroup$
    – AJNeufeld
    Feb 28 at 17:28










  • $begingroup$
    Additionally, recursion is a terrible looping construct, especially given that Python does not do tail-call-optimization. A simple loop would have sufficed.
    $endgroup$
    – AJNeufeld
    Feb 28 at 17:43










  • $begingroup$
    I indeed misspelled D instead of d. Concerning recursion, I will not agree with you, I tested my function on large sets and although it is usually 2x slower than other solutions regardless of the sets given, it requires less memory and can be easily scaled if really needed. I consider not using recursion-because of python's implementation premature optimisation. This of course will not be true with strings that are hundreds of char long. AS I HAVE ALREADY SAID: my objective was to suggest an alternative way.
    $endgroup$
    – Benoît Pilatte
    Feb 28 at 18:34











  • $begingroup$
    Thanks for your suggestion. I like the gif as it makes things much easier for me being a new programmer. I appreciate the time and effort!
    $endgroup$
    – K. Kretz
    Feb 28 at 23:07















$begingroup$
@Ludisposed Thanks for your interest in my solution, my answer was mainly to suggest an alternative way, not to provide a full answer. But I provided some explanation as you asked and even a gif (or jif). Feel free to give me additional feedback on my answer.
$endgroup$
– Benoît Pilatte
Feb 28 at 12:22




$begingroup$
@Ludisposed Thanks for your interest in my solution, my answer was mainly to suggest an alternative way, not to provide a full answer. But I provided some explanation as you asked and even a gif (or jif). Feel free to give me additional feedback on my answer.
$endgroup$
– Benoît Pilatte
Feb 28 at 12:22












$begingroup$
You are passing D to longest_matching_word(S, D), where it becomes known as d locally, and is never used. Instead you filter(is_valid, D) using D from the global context. Not the best example of “how I would have done it”.
$endgroup$
– AJNeufeld
Feb 28 at 17:28




$begingroup$
You are passing D to longest_matching_word(S, D), where it becomes known as d locally, and is never used. Instead you filter(is_valid, D) using D from the global context. Not the best example of “how I would have done it”.
$endgroup$
– AJNeufeld
Feb 28 at 17:28












$begingroup$
Additionally, recursion is a terrible looping construct, especially given that Python does not do tail-call-optimization. A simple loop would have sufficed.
$endgroup$
– AJNeufeld
Feb 28 at 17:43




$begingroup$
Additionally, recursion is a terrible looping construct, especially given that Python does not do tail-call-optimization. A simple loop would have sufficed.
$endgroup$
– AJNeufeld
Feb 28 at 17:43












$begingroup$
I indeed misspelled D instead of d. Concerning recursion, I will not agree with you, I tested my function on large sets and although it is usually 2x slower than other solutions regardless of the sets given, it requires less memory and can be easily scaled if really needed. I consider not using recursion-because of python's implementation premature optimisation. This of course will not be true with strings that are hundreds of char long. AS I HAVE ALREADY SAID: my objective was to suggest an alternative way.
$endgroup$
– Benoît Pilatte
Feb 28 at 18:34





$begingroup$
I indeed misspelled D instead of d. Concerning recursion, I will not agree with you, I tested my function on large sets and although it is usually 2x slower than other solutions regardless of the sets given, it requires less memory and can be easily scaled if really needed. I consider not using recursion-because of python's implementation premature optimisation. This of course will not be true with strings that are hundreds of char long. AS I HAVE ALREADY SAID: my objective was to suggest an alternative way.
$endgroup$
– Benoît Pilatte
Feb 28 at 18:34













$begingroup$
Thanks for your suggestion. I like the gif as it makes things much easier for me being a new programmer. I appreciate the time and effort!
$endgroup$
– K. Kretz
Feb 28 at 23:07




$begingroup$
Thanks for your suggestion. I like the gif as it makes things much easier for me being a new programmer. I appreciate the time and effort!
$endgroup$
– K. Kretz
Feb 28 at 23:07











2












$begingroup$

Clarify requirements



What result is desired if there's a tie for longest word? Accepting the first of equal-length words is a valid choice, but we'll want to document that we've made a choice where the spec was vague. Similarly, returning None when there are no matches is another valid choice that should be documented.



I'd lean toward returning a collection of results; that allows the caller to decide whether to print one or more, and streamlines the handling of the empty result.



Reduce computational complexity



The code presented tests every candidate against the entire input, in turn. This scales as O(MN), where M is the size of the text and N is the number of candidate words. We can reduce that complexity by grouping the words according to the next unseen letter in the word; then walk through the text, examining just the set of words that can be advanced at each step. That scales as O(M+Nn), where Nn is the total length of all the candidate words.



It should be obvious that this really lends itself to a prefix-tree representation, but we can make it work for plain lists, too.



Here's a quickly knocked up implementation of that idea, with some small improvements (e.g. finish early if we match the longest candidate). Please forgive a C++ coder's approach to Python:



from collections import defaultdict

class Word:
def __init__(self, s):
self.chars = s
self.index = 0

def __repr__(self):
return self.chars + "(" + str(self.index) + ")"

class Matcher:
def __init__(self, words):
self.candidates = defaultdict(list)
self.longest = 0
self.results =
for w in words:
self.insert(Word(w))

def insert(self, word):
length = len(word.chars)
if length < self.longest:
return
elif word.index < length:
self.candidates[word.chars[word.index]].append(word)
elif length == self.longest:
self.results.append(word.chars)
elif length > self.longest:
self.longest = length
self.results = [word.chars]
for i in self.candidates.values():
i[:] = filter(lambda x: len(x.chars) >= length, i)

def step(self, char):
words = self.candidates[char]
del self.candidates[char]
for w in words:
w.index += 1
self.insert(w)
if not any(self.candidates.values()):
raise StopIteration

def read(self, it):
try:
for c in it:
self.step(c)
except StopIteration as e:
pass
return self.results


Testing this on a large problem set gives results around 20✕ faster (after removing the print() from your code, to give a fair comparison):



import itertools
import random
import string
import time
import timeit

def make_random_words(min_len, max_len):
"""An infinite generator of words containing min_len to max_len letters"""
while True:
length = random.randint(min_len, max_len)
yield ''.join([random.choice(string.ascii_letters) for n in range(length)])

if __name__ == "__main__":
# try ten-thousand random words and a long text
words = list(itertools.islice(make_random_words(5, 15), 10000))
mobydick = open("moby-dick.txt").read()

print(timeit.timeit("Main.get_longest_word(mobydick, words)",
number=10, timer=time.process_time, globals=globals()))

print(timeit.timeit("Matcher(words).read(mobydick)",
number=10, timer=time.process_time, globals=globals()))


9.290360821999998
0.510936933


These results might be more dramatic if we read directly from the file, as stopping early will have greater benefit in that case.






share|improve this answer











$endgroup$












  • $begingroup$
    Thanks for the explination! The time complexity difference is insane! Thanks for the advice, and as I have told the others, I will try to study up on your recommendations. I appreciate it!
    $endgroup$
    – K. Kretz
    Feb 28 at 23:12
















2












$begingroup$

Clarify requirements



What result is desired if there's a tie for longest word? Accepting the first of equal-length words is a valid choice, but we'll want to document that we've made a choice where the spec was vague. Similarly, returning None when there are no matches is another valid choice that should be documented.



I'd lean toward returning a collection of results; that allows the caller to decide whether to print one or more, and streamlines the handling of the empty result.



Reduce computational complexity



The code presented tests every candidate against the entire input, in turn. This scales as O(MN), where M is the size of the text and N is the number of candidate words. We can reduce that complexity by grouping the words according to the next unseen letter in the word; then walk through the text, examining just the set of words that can be advanced at each step. That scales as O(M+Nn), where Nn is the total length of all the candidate words.



It should be obvious that this really lends itself to a prefix-tree representation, but we can make it work for plain lists, too.



Here's a quickly knocked up implementation of that idea, with some small improvements (e.g. finish early if we match the longest candidate). Please forgive a C++ coder's approach to Python:



from collections import defaultdict

class Word:
def __init__(self, s):
self.chars = s
self.index = 0

def __repr__(self):
return self.chars + "(" + str(self.index) + ")"

class Matcher:
def __init__(self, words):
self.candidates = defaultdict(list)
self.longest = 0
self.results =
for w in words:
self.insert(Word(w))

def insert(self, word):
length = len(word.chars)
if length < self.longest:
return
elif word.index < length:
self.candidates[word.chars[word.index]].append(word)
elif length == self.longest:
self.results.append(word.chars)
elif length > self.longest:
self.longest = length
self.results = [word.chars]
for i in self.candidates.values():
i[:] = filter(lambda x: len(x.chars) >= length, i)

def step(self, char):
words = self.candidates[char]
del self.candidates[char]
for w in words:
w.index += 1
self.insert(w)
if not any(self.candidates.values()):
raise StopIteration

def read(self, it):
try:
for c in it:
self.step(c)
except StopIteration as e:
pass
return self.results


Testing this on a large problem set gives results around 20✕ faster (after removing the print() from your code, to give a fair comparison):



import itertools
import random
import string
import time
import timeit

def make_random_words(min_len, max_len):
"""An infinite generator of words containing min_len to max_len letters"""
while True:
length = random.randint(min_len, max_len)
yield ''.join([random.choice(string.ascii_letters) for n in range(length)])

if __name__ == "__main__":
# try ten-thousand random words and a long text
words = list(itertools.islice(make_random_words(5, 15), 10000))
mobydick = open("moby-dick.txt").read()

print(timeit.timeit("Main.get_longest_word(mobydick, words)",
number=10, timer=time.process_time, globals=globals()))

print(timeit.timeit("Matcher(words).read(mobydick)",
number=10, timer=time.process_time, globals=globals()))


9.290360821999998
0.510936933


These results might be more dramatic if we read directly from the file, as stopping early will have greater benefit in that case.






share|improve this answer











$endgroup$












  • $begingroup$
    Thanks for the explination! The time complexity difference is insane! Thanks for the advice, and as I have told the others, I will try to study up on your recommendations. I appreciate it!
    $endgroup$
    – K. Kretz
    Feb 28 at 23:12














2












2








2





$begingroup$

Clarify requirements



What result is desired if there's a tie for longest word? Accepting the first of equal-length words is a valid choice, but we'll want to document that we've made a choice where the spec was vague. Similarly, returning None when there are no matches is another valid choice that should be documented.



I'd lean toward returning a collection of results; that allows the caller to decide whether to print one or more, and streamlines the handling of the empty result.



Reduce computational complexity



The code presented tests every candidate against the entire input, in turn. This scales as O(MN), where M is the size of the text and N is the number of candidate words. We can reduce that complexity by grouping the words according to the next unseen letter in the word; then walk through the text, examining just the set of words that can be advanced at each step. That scales as O(M+Nn), where Nn is the total length of all the candidate words.



It should be obvious that this really lends itself to a prefix-tree representation, but we can make it work for plain lists, too.



Here's a quickly knocked up implementation of that idea, with some small improvements (e.g. finish early if we match the longest candidate). Please forgive a C++ coder's approach to Python:



from collections import defaultdict

class Word:
def __init__(self, s):
self.chars = s
self.index = 0

def __repr__(self):
return self.chars + "(" + str(self.index) + ")"

class Matcher:
def __init__(self, words):
self.candidates = defaultdict(list)
self.longest = 0
self.results =
for w in words:
self.insert(Word(w))

def insert(self, word):
length = len(word.chars)
if length < self.longest:
return
elif word.index < length:
self.candidates[word.chars[word.index]].append(word)
elif length == self.longest:
self.results.append(word.chars)
elif length > self.longest:
self.longest = length
self.results = [word.chars]
for i in self.candidates.values():
i[:] = filter(lambda x: len(x.chars) >= length, i)

def step(self, char):
words = self.candidates[char]
del self.candidates[char]
for w in words:
w.index += 1
self.insert(w)
if not any(self.candidates.values()):
raise StopIteration

def read(self, it):
try:
for c in it:
self.step(c)
except StopIteration as e:
pass
return self.results


Testing this on a large problem set gives results around 20✕ faster (after removing the print() from your code, to give a fair comparison):



import itertools
import random
import string
import time
import timeit

def make_random_words(min_len, max_len):
"""An infinite generator of words containing min_len to max_len letters"""
while True:
length = random.randint(min_len, max_len)
yield ''.join([random.choice(string.ascii_letters) for n in range(length)])

if __name__ == "__main__":
# try ten-thousand random words and a long text
words = list(itertools.islice(make_random_words(5, 15), 10000))
mobydick = open("moby-dick.txt").read()

print(timeit.timeit("Main.get_longest_word(mobydick, words)",
number=10, timer=time.process_time, globals=globals()))

print(timeit.timeit("Matcher(words).read(mobydick)",
number=10, timer=time.process_time, globals=globals()))


9.290360821999998
0.510936933


These results might be more dramatic if we read directly from the file, as stopping early will have greater benefit in that case.






share|improve this answer











$endgroup$



Clarify requirements



What result is desired if there's a tie for longest word? Accepting the first of equal-length words is a valid choice, but we'll want to document that we've made a choice where the spec was vague. Similarly, returning None when there are no matches is another valid choice that should be documented.



I'd lean toward returning a collection of results; that allows the caller to decide whether to print one or more, and streamlines the handling of the empty result.



Reduce computational complexity



The code presented tests every candidate against the entire input, in turn. This scales as O(MN), where M is the size of the text and N is the number of candidate words. We can reduce that complexity by grouping the words according to the next unseen letter in the word; then walk through the text, examining just the set of words that can be advanced at each step. That scales as O(M+Nn), where Nn is the total length of all the candidate words.



It should be obvious that this really lends itself to a prefix-tree representation, but we can make it work for plain lists, too.



Here's a quickly knocked up implementation of that idea, with some small improvements (e.g. finish early if we match the longest candidate). Please forgive a C++ coder's approach to Python:



from collections import defaultdict

class Word:
def __init__(self, s):
self.chars = s
self.index = 0

def __repr__(self):
return self.chars + "(" + str(self.index) + ")"

class Matcher:
def __init__(self, words):
self.candidates = defaultdict(list)
self.longest = 0
self.results =
for w in words:
self.insert(Word(w))

def insert(self, word):
length = len(word.chars)
if length < self.longest:
return
elif word.index < length:
self.candidates[word.chars[word.index]].append(word)
elif length == self.longest:
self.results.append(word.chars)
elif length > self.longest:
self.longest = length
self.results = [word.chars]
for i in self.candidates.values():
i[:] = filter(lambda x: len(x.chars) >= length, i)

def step(self, char):
words = self.candidates[char]
del self.candidates[char]
for w in words:
w.index += 1
self.insert(w)
if not any(self.candidates.values()):
raise StopIteration

def read(self, it):
try:
for c in it:
self.step(c)
except StopIteration as e:
pass
return self.results


Testing this on a large problem set gives results around 20✕ faster (after removing the print() from your code, to give a fair comparison):



import itertools
import random
import string
import time
import timeit

def make_random_words(min_len, max_len):
"""An infinite generator of words containing min_len to max_len letters"""
while True:
length = random.randint(min_len, max_len)
yield ''.join([random.choice(string.ascii_letters) for n in range(length)])

if __name__ == "__main__":
# try ten-thousand random words and a long text
words = list(itertools.islice(make_random_words(5, 15), 10000))
mobydick = open("moby-dick.txt").read()

print(timeit.timeit("Main.get_longest_word(mobydick, words)",
number=10, timer=time.process_time, globals=globals()))

print(timeit.timeit("Matcher(words).read(mobydick)",
number=10, timer=time.process_time, globals=globals()))


9.290360821999998
0.510936933


These results might be more dramatic if we read directly from the file, as stopping early will have greater benefit in that case.







share|improve this answer














share|improve this answer



share|improve this answer








edited Feb 28 at 18:04

























answered Feb 28 at 9:54









Toby SpeightToby Speight

26.6k742118




26.6k742118











  • $begingroup$
    Thanks for the explination! The time complexity difference is insane! Thanks for the advice, and as I have told the others, I will try to study up on your recommendations. I appreciate it!
    $endgroup$
    – K. Kretz
    Feb 28 at 23:12

















  • $begingroup$
    Thanks for the explination! The time complexity difference is insane! Thanks for the advice, and as I have told the others, I will try to study up on your recommendations. I appreciate it!
    $endgroup$
    – K. Kretz
    Feb 28 at 23:12
















$begingroup$
Thanks for the explination! The time complexity difference is insane! Thanks for the advice, and as I have told the others, I will try to study up on your recommendations. I appreciate it!
$endgroup$
– K. Kretz
Feb 28 at 23:12





$begingroup$
Thanks for the explination! The time complexity difference is insane! Thanks for the advice, and as I have told the others, I will try to study up on your recommendations. I appreciate it!
$endgroup$
– K. Kretz
Feb 28 at 23:12


















draft saved

draft discarded
















































Thanks for contributing an answer to Code Review Stack Exchange!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f214408%2ffind-the-longest-word-in-set-d-that-is-a-subsequence-of-string-s%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown






Popular posts from this blog

How to check contact read email or not when send email to Individual?

Displaying single band from multi-band raster using QGIS

How many registers does an x86_64 CPU actually have?