Best way to Index and or Key massive datasets [closed]

Question

I have two tables related to each other, each with roughly 200M records.

CREATE TABLE [dbo].[AS_tblTBCDEF](
    [CDEF_SOC_NUM] [numeric](5, 0) NULL,
    [CDEF_EFF_DATE] [date] NULL,
    [CDEF_TYP_BUS] [nvarchar](1) NULL,
    [CDEF_CLASS_NUM] [smallint] NULL,
    [CDEF_GROUP] [smallint] NULL,
    [CDEF_COV_EXP_TYP] [nvarchar](1) NULL,
    [CDEF_SCHEDULE] [nvarchar](9) NULL,
    [CDEF_LIMIT] [numeric](9, 2) NULL,
    [CDEF_LIMIT_PCTILE] [nvarchar](2) NULL,
    [CDEF_WHY_NOT_COV] [smallint] NULL,
    [CDEF_PROVIEW_GRP] [smallint] NULL,
    [CDEF_BAS_ADJ_IND] [nvarchar](1) NULL,
    [CDEF_BAS_ADJ_AMT] [numeric](9, 2) NULL,
    [CDEF_DEF_TYPE] [nvarchar](1) NULL
) ON [PRIMARY]
GO

CREATE TABLE [dbo].[AS_tblTBCDEFD](
    [CDEF_DESC_SOC_NUM] [numeric](5, 0) NULL,
    [CDEF_DESC_EFF_DATE] [date] NULL,
    [CDEF_DESC_TYP_BUS] [nvarchar](1) NULL,
    [CDEF_DESC_CLASS] [smallint] NULL,
    [CDEF_DESC_GROUP] [smallint] NULL,
    [CDEF_DESC_TEXT] [nvarchar](77) NULL
) ON [PRIMARY]
GO

They are joined like this:

FROM [dbo].[AS_tblTBCDEF] GC_TBCDEF
    LEFT JOIN [dbo].[AS_tblTBCDEFD] GC_TBCDEFD 
        ON (GC_TBCDEF.CDEF_GROUP = GC_TBCDEFD.CDEF_DESC_GROUP) 
        AND (GC_TBCDEF.CDEF_CLASS_NUM = GC_TBCDEFD.CDEF_DESC_CLASS) 
        AND (GC_TBCDEF.CDEF_TYP_BUS = GC_TBCDEFD.CDEF_DESC_TYP_BUS) 
        AND (GC_TBCDEF.CDEF_EFF_DATE = GC_TBCDEFD.CDEF_DESC_EFF_DATE) 
        AND (GC_TBCDEF.CDEF_SOC_NUM = GC_TBCDEFD.CDEF_DESC_SOC_NUM)

These two tables get re-created monthly via an ETL script that is run by a different department. I'm trying to figure out the best way to key/index these tables so that it will actually return data. Right now, it just times out and returns nothing.

I have the following lines I run which add indexes, but it's clearly not enough. I'm looking for suggestions to optimize the join.

IF NOT EXISTS (SELECT * FROM sys.indexes WHERE object_id = object_id('[dbo].[AS_tblTBCDEF]') AND NAME ='idx_Soc_Num')
CREATE INDEX idx_Soc_Num
ON [dbo].[AS_tblTBCDEF] (CDEF_SOC_NUM);

IF NOT EXISTS (SELECT * FROM sys.indexes WHERE object_id = object_id('[dbo].[AS_tblTBCDEF]') AND NAME ='idx_Class_Num')
CREATE INDEX idx_Class_Num
ON [dbo].[AS_tblTBCDEF] (CDEF_CLASS_NUM);

IF NOT EXISTS (SELECT * FROM sys.indexes WHERE object_id = object_id('[dbo].[AS_tblTBCDEF]') AND NAME ='idx_Eff_Date')
CREATE INDEX idx_Eff_Date
ON [dbo].[AS_tblTBCDEF] (CDEF_EFF_DATE);

IF NOT EXISTS (SELECT * FROM sys.indexes WHERE object_id = object_id('[dbo].[AS_tblTBCDEF]') AND NAME ='idx_Typ_Bus')
CREATE INDEX idx_Typ_Bus
ON [dbo].[AS_tblTBCDEF] (CDEF_TYP_BUS);

IF NOT EXISTS (SELECT * FROM sys.indexes WHERE object_id = object_id('[dbo].[AS_tblTBCDEF]') AND NAME ='idx_Group')
CREATE INDEX idx_Group
ON [dbo].[AS_tblTBCDEF] (CDEF_GROUP);

IF NOT EXISTS (SELECT * FROM sys.indexes WHERE object_id = object_id('[dbo].[AS_tblTBCDEFD]') AND NAME ='idx_Soc_Num')
CREATE INDEX idx_Soc_Num
ON [dbo].[AS_tblTBCDEFD] (CDEF_DESC_SOC_NUM);

IF NOT EXISTS (SELECT * FROM sys.indexes WHERE object_id = object_id('[dbo].[AS_tblTBCDEFD]') AND NAME ='idx_Class_Num')
CREATE INDEX idx_Class_Num
ON [dbo].[AS_tblTBCDEFD] (CDEF_DESC_CLASS);

IF NOT EXISTS (SELECT * FROM sys.indexes WHERE object_id = object_id('[dbo].[AS_tblTBCDEFD]') AND NAME ='idx_Eff_Date')
CREATE INDEX idx_Eff_Date
ON [dbo].[AS_tblTBCDEFD] (CDEF_DESC_EFF_DATE);

IF NOT EXISTS (SELECT * FROM sys.indexes WHERE object_id = object_id('[dbo].[AS_tblTBCDEFD]') AND NAME ='idx_Typ_Bus')
CREATE INDEX idx_Typ_Bus
ON [dbo].[AS_tblTBCDEFD] (CDEF_DESC_TYP_BUS);

IF NOT EXISTS (SELECT * FROM sys.indexes WHERE object_id = object_id('[dbo].[AS_tblTBCDEFD]') AND NAME ='idx_Group')
CREATE INDEX idx_Group
ON [dbo].[AS_tblTBCDEFD] (CDEF_DESC_GROUP);

@ScottHunter Why is this closed? The DDL is all there, it's empirically proven to result in a poor execution plan.
– Charlieface, Commented Oct 16 at 15:02
@Charlieface I don't see an "empirically proven ... poor execution plan" (I agree but your comment fails to make that point). I think the question was poorly worded though, formulations like "the best way" and "I'm looking for suggestions" tend to stand out in close-vote review. It could've also done a better job at providing a minimal reproducible example by including INSERT statements. Fixing such things before submitting it to re-open review generally works a lot better than just expressing disagreement in a comment.
– user4157124, Commented Oct 16 at 17:31
This seems like a duplicate of any question about multicolumns index, e.g. stackoverflow.com/questions/28475877/…
– pascal, Commented Oct 20 at 17:24

Charlieface · Accepted Answer · 2025-10-16 14:27:17Z

All your indexes are a complete waste of time for this query: they are single-column indexes with no INCLUDE columns, so they are mostly only useful for a point lookup on that one column. They are not going to help a giant five-column join. The optimizer will ignore them and fall back to a hash match or a sort plus merge join over the whole tables, because that is still faster than a naive nested loop without proper indexing.

Drop all those indexes. Instead, create a single multi-column index on each table, covering all five join columns. Best to make them unique and clustered, assuming the combination of columns really is unique in each table. Even better, make them the primary key, although that won't work while the columns are nullable (why are they nullable anyway??)

CREATE UNIQUE CLUSTERED INDEX idx_1 ON dbo.AS_tblTBCDEF
  (CDEF_SOC_NUM, CDEF_CLASS_NUM, CDEF_EFF_DATE, CDEF_TYP_BUS, CDEF_GROUP);

CREATE UNIQUE CLUSTERED INDEX idx_1 ON dbo.AS_tblTBCDEFD
  (CDEF_DESC_SOC_NUM, CDEF_DESC_CLASS, CDEF_DESC_EFF_DATE, CDEF_DESC_TYP_BUS, CDEF_DESC_GROUP);
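The old single-column indexes are best dropped before the CREATE statements above are run. A minimal sketch of that cleanup, using the index names from the question and assuming SQL Server 2016 or later for DROP INDEX IF EXISTS (on older versions, an IF EXISTS check against sys.indexes like the one in the question would be needed instead):

```sql
-- Sketch only: index and table names are taken from the question.
-- DROP INDEX IF EXISTS assumes SQL Server 2016+.
DROP INDEX IF EXISTS idx_Soc_Num   ON dbo.AS_tblTBCDEF;
DROP INDEX IF EXISTS idx_Class_Num ON dbo.AS_tblTBCDEF;
DROP INDEX IF EXISTS idx_Eff_Date  ON dbo.AS_tblTBCDEF;
DROP INDEX IF EXISTS idx_Typ_Bus   ON dbo.AS_tblTBCDEF;
DROP INDEX IF EXISTS idx_Group     ON dbo.AS_tblTBCDEF;

DROP INDEX IF EXISTS idx_Soc_Num   ON dbo.AS_tblTBCDEFD;
DROP INDEX IF EXISTS idx_Class_Num ON dbo.AS_tblTBCDEFD;
DROP INDEX IF EXISTS idx_Eff_Date  ON dbo.AS_tblTBCDEFD;
DROP INDEX IF EXISTS idx_Typ_Bus   ON dbo.AS_tblTBCDEFD;
DROP INDEX IF EXISTS idx_Group     ON dbo.AS_tblTBCDEFD;
```

Dropping them first also saves work: when a clustered index is built on a heap that already has nonclustered indexes, SQL Server has to rebuild every one of those nonclustered indexes.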

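The column ordering should ideally go from most selective (most distinct values) to least selective. But if you have other queries which only join or filter by some of the columns then put those columns first. Whatever order you pick, use the same order of corresponding columns on both tables, so that the merge join can read both indexes in matching order without a sort.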
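One way to compare selectivity is to count the distinct values in each join column. This is only a sketch: it scans the whole table, so on roughly 200M rows it may run for a while, and the result column aliases are made up for illustration.

```sql
-- Sketch: count distinct values per join column to compare selectivity.
-- COUNT_BIG(DISTINCT ...) ignores NULLs, so nullable columns may be undercounted by one.
SELECT
    COUNT_BIG(*)                       AS total_rows,
    COUNT_BIG(DISTINCT CDEF_SOC_NUM)   AS distinct_soc_num,
    COUNT_BIG(DISTINCT CDEF_CLASS_NUM) AS distinct_class_num,
    COUNT_BIG(DISTINCT CDEF_EFF_DATE)  AS distinct_eff_date,
    COUNT_BIG(DISTINCT CDEF_TYP_BUS)   AS distinct_typ_bus,
    COUNT_BIG(DISTINCT CDEF_GROUP)     AS distinct_group
FROM dbo.AS_tblTBCDEF;
```

Columns with the highest distinct counts would be the candidates for the leading positions in the index.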

You can see from this fiddle that this produces a much more efficient merge join with no sort.

The ETL team doesn't make our lives easy. :-) They're actually a bit lazy and tend to just use defaults a lot. I've added some code to change the columns to NOT NULL and I'm running that now. I'll give this a spin once that completes.
Additional ask while I'm waiting for this to finish: Would that look like ALTER TABLE [dbo].[AS_tblTBCDEF] ADD CONSTRAINT [PK_Soc_Class_Date_Bus_Grp] PRIMARY KEY CLUSTERED (CDEF_SOC_NUM ASC, CDEF_CLASS_NUM ASC, CDEF_EFF_DATE ASC, CDEF_TYP_BUS ASC, CDEF_GROUP ASC)
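On the primary key question: that ALTER TABLE ... ADD CONSTRAINT ... PRIMARY KEY CLUSTERED form is valid T-SQL. A rough sketch of the whole sequence for one table follows. It assumes the key columns contain no NULLs and no duplicate combinations, neither of which is confirmed in this thread, so it checks both first. A table can have only one clustered index, so this primary key would take the place of the idx_1 unique clustered index rather than sit alongside it.

```sql
-- 1. Check for NULLs in the key columns (the ALTER COLUMN steps fail if any exist).
SELECT COUNT_BIG(*) AS rows_with_null_key
FROM dbo.AS_tblTBCDEF
WHERE CDEF_SOC_NUM IS NULL OR CDEF_CLASS_NUM IS NULL OR CDEF_EFF_DATE IS NULL
   OR CDEF_TYP_BUS IS NULL OR CDEF_GROUP IS NULL;

-- 2. Check for duplicate key combinations (the primary key fails if any exist).
SELECT TOP (10)
    CDEF_SOC_NUM, CDEF_CLASS_NUM, CDEF_EFF_DATE, CDEF_TYP_BUS, CDEF_GROUP,
    COUNT_BIG(*) AS row_count
FROM dbo.AS_tblTBCDEF
GROUP BY CDEF_SOC_NUM, CDEF_CLASS_NUM, CDEF_EFF_DATE, CDEF_TYP_BUS, CDEF_GROUP
HAVING COUNT_BIG(*) > 1;
GO

-- 3. Make the key columns NOT NULL, keeping the data types from the original DDL.
ALTER TABLE dbo.AS_tblTBCDEF ALTER COLUMN CDEF_SOC_NUM   numeric(5, 0) NOT NULL;
ALTER TABLE dbo.AS_tblTBCDEF ALTER COLUMN CDEF_CLASS_NUM smallint      NOT NULL;
ALTER TABLE dbo.AS_tblTBCDEF ALTER COLUMN CDEF_EFF_DATE  date          NOT NULL;
ALTER TABLE dbo.AS_tblTBCDEF ALTER COLUMN CDEF_TYP_BUS   nvarchar(1)   NOT NULL;
ALTER TABLE dbo.AS_tblTBCDEF ALTER COLUMN CDEF_GROUP     smallint      NOT NULL;
GO

-- 4. Add the clustered primary key (constraint name taken from the comment above).
ALTER TABLE dbo.AS_tblTBCDEF
    ADD CONSTRAINT PK_Soc_Class_Date_Bus_Grp
    PRIMARY KEY CLUSTERED
        (CDEF_SOC_NUM, CDEF_CLASS_NUM, CDEF_EFF_DATE, CDEF_TYP_BUS, CDEF_GROUP);
GO
```

The same steps would apply to dbo.AS_tblTBCDEFD with its CDEF_DESC_* columns. Since the ETL recreates both tables every month, the script would have to be rerun after each load.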
