All posts

SQL trick with many-to-many tables

📅 Created 2 months ago 👁️ 160

🏷️ #programming #javascript #typescript #sql

So here was my issue:

Given a many-to-many table (e.g., tagged_posts),
Select all posts which have all given tags.

A bit of context: PostgreSQL1; the Typescript backend; no ORM, just a pg package.
On the SQL side, I deliberately went for classic SQL many-to-many modelling way.

SQL many-to-many
posts {
  id:         PK
  -- ...
}

tags {
  tag:        PK
}

tagged_posts {
  post_id:    FK ref(post.id)
  tag:        FK ref(tags.tag)
  ---
  PK          (post_id, tag)
  INDEX ON    post_id
  INDEX ON    tag
}

Despite it’s possible to use JSON(B) array/object as tagged_posts.tags, I decided not to go that way due to more complicated queries and issues with managing tags integrity (e.g., global tag deletion or renaming).


At first glance

At first glance , I’d do something like this:

SELECT tp1.post_id
FROM tagged_posts AS tp1
JOIN tagged_posts AS tp2
  ON tp2.post_id = tp1.post_id AND tp2.tag = :tag2
JOIN tagged_posts AS tp3
  ON tp3.post_id = tp1.post_id AND tp3.tag = :tag3
--- and so on
WHERE tp1.tag = :tag1

…or even rewriting it in WITH (...) SELECT syntax.

💩 However, self-joins don’t feel right. Especially when number of tags becomes 4+: this way is too greedy on RAM.

Slightly better way

SELECT DISTINCT tp.post_id FROM tagged_posts AS tp
WHERE
  EXISTS (SELECT 1 FROM tagged_posts WHERE post_id = tp.post_id AND tag = :tag1)
  EXISTS (SELECT 1 FROM tagged_posts WHERE post_id = tp.post_id AND tag = :tag2)
  -- and so on

👍 Slightly better on RAM (the EXIST subqueries return very little).
💩 Scans for tags still N times.

The best way

SELECT post_id FROM tagged_posts AS tp
WHERE tag IN (:tag1, :tag2, -- ...)
GROUP BY post_id
HAVING count(DISTINCT tag) = :number_of_tags

👍 Rescanning the tag index only once. The GROUP BY operation is way less RAM-greedy.
🤔 Extra count() operation and extra HAVING filter — a small price to pay: relatively cheap operation + severely less data to process.

📌Canonical WHEREHAVING example

This query shows the difference between these two ways of filtering very vividly:

  • The WHERE filters normal SELECT FROM output; the filtering works with “real” columns.
  • The HAVING works with virtual columns, which appear only after grouping. In the example above, the HAVING filter is applied to count() which does not exist without grouping.

References

  1. I used Posrgress but it does not matter: most of RDBMS follow the principles referred in the article.

Once you see it... LLM owns the Code; Developer should own the Thinking