SQL trick with many-to-many tables ⨉ Roman Melnyk's blog

So here was my issue:

Given a many-to-many table (e.g., tagged_posts),
Select all posts which have all given tags.

A bit of context: PostgreSQL¹; the Typescript backend; no ORM, just a pg package.
On the SQL side, I deliberately went for classic SQL many-to-many modelling way.

SQL many-to-many

posts {
  id:         PK
  -- ...
}

tags {
  tag:        PK
}

tagged_posts {
  post_id:    FK ref(post.id)
  tag:        FK ref(tags.tag)
  ---
  PK          (post_id, tag)
  INDEX ON    post_id
  INDEX ON    tag
}

Despite it’s possible to use JSON(B) array/object as tagged_posts.tags, I decided not to go that way due to more complicated queries and issues with managing tags integrity (e.g., global tag deletion or renaming).

At first glance

At first glance , I’d do something like this:

SELECT tp1.post_id
FROM tagged_posts AS tp1
JOIN tagged_posts AS tp2
  ON tp2.post_id = tp1.post_id AND tp2.tag = :tag2
JOIN tagged_posts AS tp3
  ON tp3.post_id = tp1.post_id AND tp3.tag = :tag3
--- and so on
WHERE tp1.tag = :tag1

…or even rewriting it in WITH (...) SELECT syntax.

💩 However, self-joins don’t feel right. Especially when number of tags becomes 4+: this way is too greedy on RAM.

Slightly better way

SELECT DISTINCT tp.post_id FROM tagged_posts AS tp
WHERE
  EXISTS (SELECT 1 FROM tagged_posts WHERE post_id = tp.post_id AND tag = :tag1)
  EXISTS (SELECT 1 FROM tagged_posts WHERE post_id = tp.post_id AND tag = :tag2)
  -- and so on

👍 Slightly better on RAM (the EXIST subqueries return very little).
💩 Scans for tags still N times.

The best way

SELECT post_id FROM tagged_posts AS tp
WHERE tag IN (:tag1, :tag2, -- ...)
GROUP BY post_id
HAVING count(DISTINCT tag) = :number_of_tags

👍 Rescanning the tag index only once. The GROUP BY operation is way less RAM-greedy.
🤔 Extra count() operation and extra HAVING filter — a small price to pay: relatively cheap operation + severely less data to process.

Canonical WHERE ↔ HAVING example

This query shows the difference between these two ways of filtering very vividly:

The WHERE filters normal SELECT FROM output; the filtering works with “real” columns.
The HAVING works with virtual columns, which appear only after grouping. In the example above, the HAVING filter is applied to count() which does not exist without grouping.

I used Posrgress but it does not matter: most of RDBMS follow the principles referred in the article. ↩

SQL trick with many-to-many tables

At first glance

Slightly better way

The best way

References