
Conceptual question: Are individual queries faster than joins, or: Should I try to squeeze every piece of information I want on the client side into one SELECT statement, or just use as many as seems convenient?

TL;DR: If my joined query takes longer than running individual queries, is this my fault or is this to be expected?

First off, I am not very database savvy, so it may be just me, but I have noticed that when I have to get information from multiple tables, it is "often" faster to get this information via multiple queries on individual tables (maybe containing a simple inner join) and patch the data together on the client side than to try to write a (complex) joined query where I can get all the data in one query.

I have tried to put one extremely simple example together:

SQL Fiddle

Schema Setup:

CREATE TABLE MASTER 
( ID INT NOT NULL
, NAME VARCHAR2(42 CHAR) NOT NULL
, CONSTRAINT PK_MASTER PRIMARY KEY (ID)
);

CREATE TABLE DATA
( ID INT NOT NULL
, MASTER_ID INT NOT NULL
, VALUE NUMBER
, CONSTRAINT PK_DATA PRIMARY KEY (ID)
, CONSTRAINT FK_DATA_MASTER FOREIGN KEY (MASTER_ID) REFERENCES MASTER (ID)
);

INSERT INTO MASTER values (1, 'One');
INSERT INTO MASTER values (2, 'Two');
INSERT INTO MASTER values (3, 'Three');

CREATE SEQUENCE SEQ_DATA_ID;

INSERT INTO DATA values (SEQ_DATA_ID.NEXTVAL, 1, 1.3);
INSERT INTO DATA values (SEQ_DATA_ID.NEXTVAL, 1, 1.5);
INSERT INTO DATA values (SEQ_DATA_ID.NEXTVAL, 1, 1.7);
INSERT INTO DATA values (SEQ_DATA_ID.NEXTVAL, 2, 2.3);
INSERT INTO DATA values (SEQ_DATA_ID.NEXTVAL, 3, 3.14);
INSERT INTO DATA values (SEQ_DATA_ID.NEXTVAL, 3, 3.7);

Query A:

select NAME from MASTER
where ID = 1

Results:

| NAME |
--------
|  One |

Query B:

select ID, VALUE from DATA
where MASTER_ID = 1

Results:

| ID | VALUE |
--------------
|  1 |   1.3 |
|  2 |   1.5 |
|  3 |   1.7 |

Query C:

select M.NAME, D.ID, D.VALUE 
from MASTER M INNER JOIN DATA D ON M.ID=D.MASTER_ID
where M.ID = 1

Results:

| NAME | ID | VALUE |
---------------------
|  One |  1 |   1.3 |
|  One |  2 |   1.5 |
|  One |  3 |   1.7 |

Of course, I didn't measure any performance with these, but one may observe:

  • Query A+B returns the same amount of usable information as Query C.
  • A+B has to return 1 + 2×3 = 7 "Data Cells" to the client (one NAME cell from A, plus three rows of two cells each from B).
  • C has to return 3×3 = 9 "Data Cells" to the client, because with the join I naturally include some redundancy in the result set.

Generalizing from this (as far-fetched as it is):

A joined query always has to return more data than the individual queries that retrieve the same information. Because the database has to cobble the data together, one might assume that for large datasets a single joined query means more work for the database than the individual queries would, since (at least) it has to return more data to the client.

Would it follow from this that, when I observe that splitting a client-side query into multiple queries yields better performance, this is just the way to go, or would it rather mean that I messed up the joined query?

  • Comments are not for extended discussion; this conversation has been moved to chat. Commented Mar 4, 2017 at 19:22
  • I ran a benchmark and posted the results in an article on Medium. I would have added an answer here, but already did it on another question, and posting the same answer to multiple questions is frowned upon. Commented Jan 9, 2019 at 18:02
  • @BenMorel I don't think your article takes into account the case where you're left joining on unrelated associations. E.g. if you query a->b->c, joining is fine, however long the chain. If you query c->a<-b, you now return c*b records for each a vs. c+b for each a. Also, I don't think your article accounts for bandwidth or network latency. It's going to take longer to transport more data even if the database crunches that data faster (though, in my own benchmarks, IN queries and temp tables perform roughly the same as JOINs in the aforementioned case, just with fewer rows returned) Commented Apr 24, 2024 at 7:19

4 Answers


Are individual queries faster than joins, or: Should I try to squeeze every piece of information I want on the client side into one SELECT statement, or just use as many as seems convenient?

In any performance scenario, you have to test and measure the solutions to see which is faster.

That said, it's almost always the case that a joined result set from a properly tuned database will be faster and scale better than returning the source rows to the client and then joining them there. In particular, if the input sets are large and the result set is small -- think about the following query in the context of both strategies: join together two tables that are 5 GB each, with a result set of 100 rows. That's an extreme, but you see my point.
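
To make the shape of that concrete with the question's toy schema (the filter here is only a stand-in for whatever selective predicate a real query would have), the server does all the matching and ships just the few qualifying rows:

-- Hypothetical selective join: however large DATA grows, only the rows
-- passing the filter cross the network to the client.
SELECT M.NAME, D.VALUE
FROM MASTER M INNER JOIN DATA D ON M.ID = D.MASTER_ID
WHERE D.VALUE > 3.5;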

I have noticed that when I have to get information from multiple tables, it is "often" faster to get this information via multiple queries on individual tables (maybe containing a simple inner join) and patch the data together on the client side than to try to write a (complex) joined query where I can get all the data in one query.

It's highly likely that the database schema or indexes could be improved to better serve the queries you're throwing at it.
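
For example, in the schema above the join key DATA.MASTER_ID has no index at all (only the primary keys do), so one plausible first tuning step, sketched here rather than prescribed, would be:

-- Index the foreign-key column used as the join key; without it the
-- engine may have to scan DATA to find the matching MASTER_ID values.
CREATE INDEX IX_DATA_MASTER_ID ON DATA (MASTER_ID);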

A joined query always has to return more data than the individual queries that retrieve the same information.

Usually this is not the case. Most of the time even if the input sets are large, the result set will be much smaller than the sum of the inputs.
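
As a sketch of that using the question's own schema: an aggregating join returns at most one row per master no matter how many DATA rows feed into it, so the inputs can grow arbitrarily large while the result set stays tiny:

-- The result set here is at most one row per MASTER row, regardless of
-- how many DATA rows are aggregated on the server.
SELECT M.NAME, COUNT(*) AS ROW_COUNT, AVG(D.VALUE) AS AVG_VALUE
FROM MASTER M INNER JOIN DATA D ON M.ID = D.MASTER_ID
GROUP BY M.NAME;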

Depending on the application, very large query result sets being returned to the client are an immediate red flag: what is the client doing with such a large set of data that can't be done closer to the database? Displaying 1,000,000 rows to a user is highly suspect to say the least. Network bandwidth is also a finite resource.

Because the database has to cobble the data together, one might assume that for large datasets a single joined query means more work for the database than the individual queries would, since (at least) it has to return more data to the client.

Not necessarily. If the data is indexed correctly, the join is likely to be performed efficiently at the database without scanning a large quantity of data. Moreover, relational database engines are specially optimized at a low level for joining; client stacks are not.
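
If you want to see what the engine actually decided to do, you can inspect the execution plan; in the Oracle dialect of the question's fiddle that looks like this (other engines have an EXPLAIN equivalent):

-- Ask the optimizer for its plan for the joined query, then display it.
EXPLAIN PLAN FOR
SELECT M.NAME, D.ID, D.VALUE
FROM MASTER M INNER JOIN DATA D ON M.ID = D.MASTER_ID
WHERE M.ID = 1;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);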

Would it follow from this that, when I observe that splitting a client-side query into multiple queries yields better performance, this is just the way to go, or would it rather mean that I messed up the joined query?

Since you said you're inexperienced when it comes to databases, I would suggest learning more about database design and performance tuning. I'm pretty sure that's where the problem lies here. Inefficiently-written SQL queries are possible, too, but with a simple schema that's less likely to be a problem.

Now, that's not to say there aren't other ways to improve performance. There are scenarios where you might choose to scan a medium-to-large set of data and return it to the client if the intention is to use some sort of caching mechanism. Caching can be great, but it introduces complexity in your design. Caching may not even be appropriate for your application.

One thing that hasn't been mentioned anywhere is maintaining consistency in the data that's returned from the database. If separate queries are used, it's more likely (due to many factors) to have inconsistent data returned, unless a form of snapshot isolation is used for every set of queries.
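
A minimal sketch of that in the question's Oracle dialect: a read-only transaction makes every query inside it read from the same consistent snapshot (other engines expose this as REPEATABLE READ or SNAPSHOT isolation):

-- Both SELECTs see the database as of the moment the transaction began,
-- so the two result sets cannot contradict each other.
SET TRANSACTION READ ONLY;

SELECT NAME FROM MASTER WHERE ID = 1;
SELECT ID, VALUE FROM DATA WHERE MASTER_ID = 1;

COMMIT;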

  • OP is saying that JOINed result sets are always bigger. > A joined query always has to return more data than the individual queries. I think this is objectively true (for >=), e.g. the result sets differ in size, so more data goes over the wire. Do you have an example where this isn't true? If I join Authors -> Posts and Authors has a field called "biography" which is a 1 MB JSON field, then for an Author of 100 Posts I'll transmit roughly 100 MB over the wire vs. 1 MB. Is this wrong? Commented Jun 25, 2019 at 12:01

Of course, I didn't measure any performance with these

You put together some good sample code. Did you look at the timing in SQL Fiddle? Even some brief, unscientific performance testing will show that query C in your demonstration takes about the same amount of time to run as either query A or query B separately. Combined, A and B take about twice as long as C, and that is before any client-side join is performed.

As you increase the data volume, the timings of queries A and B would diverge from C's, but the database join would still be faster.
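
If you want to verify that yourself, here is a rough way to grow the sample data in the fiddle's Oracle dialect (the row count is an arbitrary choice; a set-based insert would be faster, but this keeps the sketch simple):

-- Bulk up DATA with rows spread across the three existing masters.
BEGIN
  FOR i IN 1 .. 100000 LOOP
    INSERT INTO DATA VALUES (SEQ_DATA_ID.NEXTVAL, MOD(i, 3) + 1, i / 100);
  END LOOP;
  COMMIT;
END;
/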

You should also consider what happens when the inner join eliminates data.
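
For instance, a master row with no detail rows vanishes from the joined result entirely, whereas the two-query approach would still return its name from query A:

INSERT INTO MASTER VALUES (4, 'Four');

SELECT M.NAME, D.ID, D.VALUE
FROM MASTER M INNER JOIN DATA D ON M.ID = D.MASTER_ID
WHERE M.ID = 4;
-- no rows selected: the inner join eliminates masters without data
-- (a LEFT JOIN would keep NAME and return NULLs for D.ID and D.VALUE)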


The query optimiser should be considered, too. Its role is to take your declarative SQL and translate it into procedural steps. To find the most efficient combination of procedural steps, it will examine combinations of index usage, sorts, caching of intermediate result sets, and all sorts of other things besides. The number of permutations can get exceedingly large even with what look like quite simple queries.

Much of the calculation done to find the best plan is driven by the distribution of data within the tables. These distributions are sampled and stored as statistics objects. If these are wrong, they lead the optimiser to make poor choices. Poor choices early in the plan lead to even poorer choices later on in a snowball effect.
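
As a sketch of how those statistics get refreshed in the Oracle dialect of the question's fiddle (other engines use ANALYZE or similar):

-- Re-sample the data distributions the optimiser bases its estimates on.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'MASTER');
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'DATA');
END;
/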

It's not unknown for a medium-sized query returning modest amounts of data to take minutes to run. Correct indexing and good statistics can then reduce this to milliseconds.


Multiple queries ARE the way to go. In simple scenarios like this, the cost overhead of the query optimizer is a factor. With more data, the network inefficiency of the join (redundant rows) comes into play. Only with a lot more data does the join become the more efficient option.

In the end, what you're experiencing is something many developers see. The DBAs always say "no, make a join", but the reality is: it IS faster to make multiple simple selects in this case.

  • There's no "network inefficiency" in a join - it all happens on the database server, so there's no network involved (unless you're joining over a db link!) Commented May 24, 2013 at 15:11
  • You might like to consider whether the network layer has compression or not. Oracle's SQL*Net does, in that values repeating in the same column are efficiently compressed. Commented May 24, 2013 at 16:24
  • @TomTom you may have a point or not (as David Aldridge points out, compression matters) but your wording is confusing. "Network inefficiency of the join"? Really, fix that so it is obvious what you mean. Commented May 24, 2013 at 16:39
  • @ChrisSaxon sure there is; imagine you have tables for a report "title->base->table-rows" and you need all the rows, so you inner join these 3 tables. Each table has long varchars, so for every row you are repeating these long varchars. The application layer needs to allocate memory for all of these strings and then group them for your model. So I think that is what he means: there is more data sent Commented Jun 1, 2018 at 4:02
  • @MIKE that depends on the expressions you select, not the join. And there may be network compression. In Oracle Database SQL*Net removes repeated duplicate values nicetheory.io/2018/01/11/… Commented Jun 3, 2018 at 17:01
