Similarity of Text Using Trigram Matching

In this article, we will discuss the similarity of text using trigram matching. Trigram matching is a technique to find the similarity between two strings. It is a simple and effective way to find the similarity between two strings. Trigram matching is widely used in various applications such as spell checking, plagiarism detection, and information retrieval.

What is Trigram Matching?

Trigram matching is a technique to find the similarity between two strings by comparing the trigrams of the strings. A trigram is a sequence of three consecutive characters in a string. For example, the trigrams of the string "hi mom" are "hi ", "i m", " mo", "mom". Trigram matching works by comparing the trigrams of two strings and calculating the similarity between them.

How Trigram Matching Works?

Trigram matching works by comparing the trigrams of two strings and calculating the similarity between them. The similarity between two strings is calculated by counting the number of common trigrams between them. The similarity score is calculated using the following formula:

similarity = (2 * common_trigrams) / (total_trigrams_string1 + total_trigrams_string2)

Where common_trigrams is the number of common trigrams between the two strings, total_trigrams_string1 is the total number of trigrams in string 1, and total_trigrams_string2 is the total number of trigrams in string 2.

Example

Let's take an example to understand how trigram matching works. Consider two strings "hi mom" and "my mom is pretty". The trigrams of the strings are as follows:

String 1: "hi mom"
Trigrams: "hi ", "i m", " mo", "mom"

String 2: "my mom is pretty"
Trigrams: "my ", "y m", " mo", "mom", "om ", "m i", " is", "is ", "s p", " pr", "pre", "ret", "ett", "tty"

The common trigrams between the two strings are " mo" and "mom". The total number of trigrams in string 1 is 4, and the total number of trigrams in string 2 is 14. Therefore, the similarity score between the two strings is:

similarity = (2 * 2) / (4 + 14) = 0.25

The similarity score between the two strings is 0.25, which indicates that the two strings are not very similar.

Code Example

We'll create a table has two columns: id and text and insert some sample data into the table.

CREATE TABLE texts (
    id SERIAL PRIMARY KEY,
    text TEXT
);
INSERT INTO texts (text) VALUES
    ('hi mom'),
    ('my mom is pretty'),
    ('hello world'),
    ('hi dad'),
    ('my dad is cool');

Next, i use pg_trgm extension to calculate the similarity between the strings.

CREATE EXTENSION IF NOT EXISTS pg_trgm;
 
SELECT
    t1.text AS text1,
    t2.text AS text2,
    similarity(t1.text, t2.text) AS similarity,
    t1.text <-> t2.text AS distance
FROM
    texts t1
JOIN
    texts t2
ON
    t1.id < t2.id
ORDER BY
    similarity DESC;

The result will be:

| text1             | text2            | similarity | distance |
|-------------------|------------------|------------|----------|
| hi mom            | hi dad           | 0.272727   | 0.727273 |
| my mom is pretty  | my dad is cool   | 0.24       | 0.76     |
| hi dad            | my dad is cool   | 0.222222   | 0.777778 |
| hi mom            | my mom is pretty | 0.210526   | 0.789474 |
| hello world       | hi dad           | 0.0555556  | 0.944444 |
| hi mom            | hello world      | 0.0555556  | 0.944444 |
| hi mom            | my dad is cool   | 0.047619   | 0.952381 |
| my mom is pretty  | hello world      | 0          | 1        |
| my mom is pretty  | hi dad           | 0          | 1        |
| hello world       | my dad is cool   | 0          | 1        |

So, when i want to find "hi mom" in the table, i can use the following query:

SELECT
    text
FROM
    texts
WHERE
    similarity(text, 'hi mom') > 0.5
ORDER BY
    similarity(text, 'hi mom') DESC;

The result will be:

| text             |
|------------------|
| hi mom           |

With the same way, i can find the most similar string with "my mom is pretty" by using the following query:

SELECT
    text
FROM
    texts
WHERE
    similarity(text, 'my mom is pretty') >= 0.2
ORDER BY
    similarity(text, 'my mom is pretty') DESC;

And result for "my mom is pretty" will be:

| text               |
|--------------------|
| my mom is pretty   |
| my dad is cool     |
| hi mom             |

In there, we can see "my dad is cool" is the most similar string with "my mom is pretty". Because they have 0.24 similarity score.

Performance considerations

Trigram matching is a simple and effective way to find the similarity between two strings. However, it can be computationally expensive for large datasets. To improve performance, you can use the pg_trgm extension in PostgreSQL, which provides optimized trigram matching functions. You can also use indexing to speed up trigram matching queries. Here are some tips to improve the performance of trigram matching:

Indexing: Using pg_trgm extension, you can create a trigram index on the text column to speed up trigram matching queries.

CREATE INDEX trigram_index ON texts USING gin (text gin_trgm_ops);

Limit the number of comparisons: If you have a large dataset, you can limit the number of comparisons by using filters or indexes to reduce the number of comparisons.

Conclusion

pg_trgm extension in PostgreSQL offer a versatile set of tools for text processing and searching. We went over the basics of the extension, including how to enable it and how to use it for fuzzy string matching and proximity searches.

Next time, i will apply trigram matching to search for Vietnamese words in the database. Stay tuned!