In the fast-paced world of software development, creating efficient and responsive autocomplete systems is a common challenge. Whether you're building a search bar for a search engine or an input field for a form, a well-designed autocomplete feature can greatly enhance the user experience. In this article, we will explore the technical underpinnings of creating a lightning-fast autocomplete search system using Redis sorted sets and delve into performance considerations and advanced techniques for building scalable solutions.
Understanding Redis Sorted Sets
What Are Redis Sorted Sets?
Redis Sorted Sets (ZSETs) are a powerful data structure provided by Redis. Unlike regular sets, which are unordered collections of unique elements, sorted sets maintain a specific order based on a score assigned to each element. Each element in a ZSET has two components:
- Value: The actual data or word we are storing.
- Score: A floating-point number that determines the order of the elements.
Elements in a ZSET are sorted by their scores. If multiple elements share the same score, they are ordered lexicographically.
Example:
redis> zadd zset 0 foo
redis> zadd zset 0 bar
redis> zadd zset 0 x
redis> zadd zset 0 z
redis> zadd zset 0 a
redis> zrange zset 0 -1
Output:
1. "a"
2. "bar"
3. "foo"
4. "x"
5. "z"
Here, all elements have the same score of 0, so they are sorted alphabetically.
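This equal-score, lexicographical ordering is easy to reproduce outside Redis. The sketch below is a pure-Python stand-in (no Redis server required): a plain sorted list of strings behaves exactly like a ZSET whose members all share score 0.

```python
import bisect

# Minimal in-memory stand-in for "ZADD zset 0 <member>": with equal
# scores, Redis orders members lexicographically, which a sorted list
# of strings reproduces exactly.
zset = []
for member in ["foo", "bar", "x", "z", "a"]:
    bisect.insort(zset, member)  # keep the list sorted on every insert

print(zset)  # ['a', 'bar', 'foo', 'x', 'z']
```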
Performance Characteristics of Sorted Sets
One of the standout features of Redis Sorted Sets is their performance. ZRANK runs in O(log(N)) time, and ZRANGE in O(log(N)+M), where M is the number of elements returned. This logarithmic behavior keeps operations efficient even as the ZSET grows to millions of elements.
Example of the ZRANK command:
redis> zrank zset bar
Output:
(integer) 1
The ZRANK command returns the rank of the element "bar", which sits at index 1.
Example of the ZRANGE command:
redis> zrange zset 2 -1
Output:
1. "foo"
2. "x"
3. "z"
ZRANGE retrieves elements from position 2 through the end of the ZSET.
Completing Words by Lexicographical Order
Current Approach
To complete a word based on a prefix, we look up the prefix in the ZSET and retrieve all subsequent words in lexicographical order. However, this method only provides a list of words starting from the prefix, not a complete solution for real-time autocomplete.
Solution: Adding All Prefixes
To improve autocomplete functionality, we can enhance the ZSET by including all prefixes of each word. We also add a special character (*) to denote complete words.
Implementation Steps:
- Add Prefixes: For each word, add all prefixes to the ZSET.
- Add Full Word with Special Character: Append * to the full word to distinguish it from the prefixes.
Example:
For the words "foo", "bar", and "foobar", we would add:
- For "foo": prefixes f, fo, foo; full word foo*
- For "bar": prefixes b, ba, bar; full word bar*
- For "foobar": prefixes f, fo, foo, foob, fooba, foobar; full word foobar*
Resulting ZSET:
redis> zrange zset 0 -1
1. "b"
2. "ba"
3. "bar"
4. "bar*"
5. "f"
6. "fo"
7. "foo"
8. "foo*"
9. "foob"
10. "fooba"
11. "foobar"
12. "foobar*"
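The expansion above can be sketched in Python as a small helper that, for each word, yields every prefix plus the '*'-terminated full word (a pure in-memory sketch; the function name is illustrative, not part of any Redis API):

```python
def autocomplete_entries(word):
    """Return every ZSET member a word contributes: all of its
    prefixes plus the full word terminated with '*'."""
    prefixes = [word[:i] for i in range(1, len(word) + 1)]
    return prefixes + [word + "*"]

# Build the same member set as the ZSET shown above.
members = set()
for w in ["foo", "bar", "foobar"]:
    members.update(autocomplete_entries(w))

print(sorted(members))  # the 12 entries listed above, in order
```

Note that overlapping prefixes ("f", "fo", "foo" are shared by "foo" and "foobar") are stored only once, which is exactly the memory saving discussed below.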
Finding Complete Words: If a user types "fo", follow these steps:
- Find Prefix Position:
redis> zrank zset fo
Output:
(integer) 5
- Retrieve Subsequent Elements:
redis> zrange zset 6 -1
Output:
1. "foo"
2. "foo*"
3. "foob"
4. "fooba"
5. "foobar"
6. "foobar*"
- Filter for Complete Words:
Extract only the elements with the * character to identify complete words:
- "foo*" -> "foo"
- "foobar*" -> "foobar"
Thus, the autocomplete suggestions for "fo" are "foo" and "foobar".
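The three steps above (find the prefix rank, range over subsequent members, keep only '*'-marked entries) can be mimicked end to end with a sorted list, where bisect_left plays the role of ZRANK and a bounded slice plays the role of ZRANGE. This is a pure-Python sketch, not Redis client code:

```python
import bisect

# The ZSET from the example, as a sorted list of members.
zset = sorted(["b", "ba", "bar", "bar*", "f", "fo", "foo", "foo*",
               "foob", "fooba", "foobar", "foobar*"])

def complete(zset, prefix, limit=50):
    """Mimic ZRANK + ZRANGE + '*' filtering on a sorted list."""
    start = bisect.bisect_left(zset, prefix)   # rank of the prefix (ZRANK)
    results = []
    for member in zset[start:start + limit]:   # bounded range (ZRANGE)
        if not member.startswith(prefix):      # left the prefix's range: stop
            break
        if member.endswith("*"):               # '*' marks a complete word
            results.append(member[:-1])
    return results

print(complete(zset, "fo"))  # ['foo', 'foobar']
```

In the real system the limit (50 here) bounds how many members each round trip fetches; if fewer than the desired number of completions are found, the client issues another ZRANGE continuing from where the previous one stopped.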
Performance and Memory Requirements
Performance (Big O)
Both ZRANK and ZRANGE (with a fixed range of 50 elements) are O(log(N)) operations. This efficiency allows handling very large word lists, containing millions of elements, without performance issues.
Memory Requirements
In the worst-case scenario, each word contributes M+1 elements to the ZSET, where M is the length of the word. For N words, this results in a requirement of N*(Ma+1) elements, where Ma is the average length of the words. In practice, many prefixes will overlap, reducing the total number of elements.
Example: With 1 million words, each 10 characters long on average, the memory requirement would be approximately 11 million elements (1M*(10+1)) in the ZSET.
Advanced Query Predictions
Lexicographical Completion
Currently, the completion of words based on lexicographical order is suitable for online dictionary features. This approach is effective for applications where the goal is to provide a list of suggestions in alphabetical order.
Frequency-Based Completion
In some scenarios, you may want to complete words based on search frequency, similar to Google Instant Search. For instance, if a user types "ne", you might want to suggest "netflix", "news", and "new york times" based on their popularity.
Solution: Sorted Sets for Each Prefix
Approach:
- Create a ZSET for Each Prefix: For each user query, calculate all prefixes and use each as a key for a ZSET.
- Update Scores: Use the ZINCRBY command to increase the score of the word in each corresponding ZSET.
Example: For the query "news", the prefixes are "n", "ne", "new", and "news":
ZINCRBY n 1 news
ZINCRBY ne 1 news
ZINCRBY new 1 news
ZINCRBY news 1 news
Completing Words Based on Prefix Frequency:
To complete words based on frequency, perform ZREVRANGE on the ZSET of the prefix, so the highest-scored (most frequent) words come first:
ZREVRANGE <prefix> 0 4
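The per-prefix scoring scheme can be sketched with one counter per prefix, where incrementing a count stands in for ZINCRBY and most_common stands in for a reverse range query (a pure in-memory sketch; function names are illustrative):

```python
from collections import defaultdict, Counter

# One "ZSET" per prefix: here a Counter mapping word -> score.
index = defaultdict(Counter)

def record_query(query):
    """ZINCRBY <prefix> 1 <query> for every prefix of the query."""
    for i in range(1, len(query) + 1):
        index[query[:i]][query] += 1

def top_completions(prefix, n=5):
    """ZREVRANGE <prefix> 0 n-1: highest-scored words first."""
    return [word for word, _ in index[prefix].most_common(n)]

for q in ["news", "news", "netflix", "new york times"]:
    record_query(q)

print(top_completions("ne"))  # 'news' first (score 2), then the rest
```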
Handling Long Lists: To keep memory bounded, cap each prefix's ZSET at a few hundred elements (e.g., 300) rather than storing every word ever seen; a list of that size is enough to give a good estimate of the true top suggestions.
Updating the ZSET:
If the ZSET for a prefix has not reached the maximum number of elements (300), increment the score with ZINCRBY. If it has reached the maximum, remove the element with the lowest score and add the new element with an incremented score.
Simpler Variant:
Another approach is to increase the score of the current element with ZINCRBY, then sample three random elements and remove the one with the lowest score.
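The capped-update policy can be sketched as follows (a pure in-memory sketch of the "evict the lowest score" variant; the cap of 3 below is only to keep the example small, standing in for the 300 of the article):

```python
def bump(scores, word, cap=300):
    """Increment a word's score; if the per-prefix set is full and the
    word is new, first evict the lowest-scored member (the ZSET
    equivalent of ZRANGE <prefix> 0 0 followed by ZREM)."""
    if word not in scores and len(scores) >= cap:
        lowest = min(scores, key=scores.get)
        del scores[lowest]
    scores[word] = scores.get(word, 0) + 1  # ZINCRBY <prefix> 1 <word>

scores = {}
for w in ["a", "b", "c"]:
    bump(scores, w, cap=3)
bump(scores, "a", cap=3)   # "a" now has score 2
bump(scores, "d", cap=3)   # set is full: one score-1 member is evicted
print(sorted(scores, key=scores.get, reverse=True))
```

The simpler random-sampling variant trades exactness for speed: evicting the lowest of three random members avoids scanning for the global minimum, while frequently-searched words still survive in expectation.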
Conclusion
These techniques ensure that autocomplete functionality is both efficient and effective. By leveraging sorted sets and understanding the nuances of performance and memory requirements, we can build a robust system that scales with user demands. Whether you need lexicographical ordering or frequency-based suggestions, these methods provide a solid foundation for creating a high-performing autocomplete feature.