So here was my issue:
Given a many-to-many table (e.g.,
tagged_posts),
Select all posts which have all given tags.
A bit of context: PostgreSQL1; the Typescript backend; no ORM, just a pg package.
On the SQL side, I deliberately went for classic SQL many-to-many modelling way.
SQL many-to-many
posts {
id: PK
-- ...
}
tags {
tag: PK
}
tagged_posts {
post_id: FK ref(post.id)
tag: FK ref(tags.tag)
---
PK (post_id, tag)
INDEX ON post_id
INDEX ON tag
}
Despite it’s possible to use JSON(B) array/object as tagged_posts.tags, I decided not to go that way due to more complicated queries and issues with managing tags integrity (e.g., global tag deletion or renaming).
At first glance
At first glance , I’d do something like this:
SELECT tp1.post_id
FROM tagged_posts AS tp1
JOIN tagged_posts AS tp2
ON tp2.post_id = tp1.post_id AND tp2.tag = :tag2
JOIN tagged_posts AS tp3
ON tp3.post_id = tp1.post_id AND tp3.tag = :tag3
--- and so on
WHERE tp1.tag = :tag1
…or even rewriting it in WITH (...) SELECT syntax.
💩 However, self-joins don’t feel right. Especially when number of tags becomes 4+: this way is too greedy on RAM.
Slightly better way
SELECT DISTINCT tp.post_id FROM tagged_posts AS tp
WHERE
EXISTS (SELECT 1 FROM tagged_posts WHERE post_id = tp.post_id AND tag = :tag1)
EXISTS (SELECT 1 FROM tagged_posts WHERE post_id = tp.post_id AND tag = :tag2)
-- and so on
👍 Slightly better on RAM (the EXIST subqueries return very little).
💩 Scans for tags still N times.
The best way
SELECT post_id FROM tagged_posts AS tp
WHERE tag IN (:tag1, :tag2, -- ...)
GROUP BY post_id
HAVING count(DISTINCT tag) = :number_of_tags
👍 Rescanning the tag index only once. The GROUP BY operation is way less RAM-greedy.
🤔 Extra count() operation and extra HAVING filter — a small price to pay: relatively cheap operation + severely less data to process.
📌Canonical WHERE ↔ HAVING example
This query shows the difference between these two ways of filtering very vividly:
- The
WHEREfilters normalSELECT FROMoutput; the filtering works with “real” columns. - The
HAVINGworks with virtual columns, which appear only after grouping. In the example above, theHAVINGfilter is applied tocount()which does not exist without grouping.
References
-
I used Posrgress but it does not matter: most of RDBMS follow the principles referred in the article. ↩