
We have a #ValidCode table with a list of valid codes such as 'A', 'B', 'C', etc. Another table, #SourceData, holds the input data, which arrives as a combination of valid and invalid tokens (sometimes with duplicates).

Ex:

  • 'A;B;C' (valid)
  • 'A;A;A;A;A;B' (valid)
  • 'ad;df;A;B' (invalid)

I'm trying to find an optimal query to process these strings and identify the valid rows in #SourceData. See the example below:

DROP TABLE IF EXISTS #ValidCode
GO
CREATE TABLE #ValidCode
(
      ID        INT             IDENTITY(1,1)
    , Code      CHAR(1)
)
INSERT INTO #ValidCode (Code) VALUES ('A'), ('B'), ('C')
GO
DROP TABLE IF EXISTS #SourceData 
GO
CREATE TABLE #SourceData 
(
      ID        INT             IDENTITY(1,1)
    , Codes     VARCHAR(500)
    , Is_Valid  BIT
    , Is_Split  BIT
)

INSERT INTO #SourceData (Codes) 
VALUES    ('A;B;C')
        , ('B;A')
        , ('B;B;B;C;C;A;A;B')
        , ('B;Z;1')
        , ('B;ss;asd')


SELECT * FROM #ValidCode
SELECT * FROM #SourceData

The query should process the data in the #SourceData table and update the Is_Valid flag, so that the rows can be consumed by a subsequent process.

Rules:

  • Every token must be valid for the row to be valid (rows 1 to 3)
  • If even one token is invalid, the entire row is invalid (rows 4 and 5)

So, this is the preferred output:

ID  Codes            Is_Valid
1   A;B;C            1
2   B;A              1
3   B;B;B;C;C;A;A;B  1
4   B;Z;1            0
5   B;ss;asd         0

Current approach: loop through each row in #SourceData, split it on the delimiter ';', and compare the tokens to the #ValidCode table. If every token is valid, mark the row in #SourceData as valid (Is_Valid flag); otherwise mark it as invalid. The WHILE loop approach works, but it is slow.

#SourceData could have up to 3 million rows, and a row may contain duplicated valid tokens ('A;A;A;A') as well as a mix of valid and invalid tokens ('A;as;sdf;B').

Is there a better approach?

Thanks!

  • Storing delimited data in a column is generally a bad decision; a normalized approach to storing the data would save a lot of processing power.
    – nbk
    Commented Nov 19, 2024 at 23:15
  • @nbk the source data comes in as delimited. It cannot be changed.
    – ToC
    Commented Nov 20, 2024 at 14:19

2 Answers


One relational way you can do this is by splitting your #SourceData first (fortunately you have access to the STRING_SPLIT() function despite being on an outdated version of SQL Server), then getting the rows that don't match your #ValidCode table, and finally using those rows to determine the Is_Valid value for each row of the original #SourceData table.

Here's an example of how to do that:

;WITH _BadData AS
(
    SELECT DISTINCT SD.ID
    FROM #SourceData AS SD
    CROSS APPLY STRING_SPLIT(SD.Codes, ';') AS SS
    WHERE NOT EXISTS
    (
        SELECT 1 AS RowExists
        FROM #ValidCode AS VC
        WHERE SS.[value] = VC.Code
    )
)

SELECT
    SD.ID,
    SD.Codes,
    ISNULL(SD.Is_Valid, IIF(BD.ID IS NULL, 1, 0)) AS Is_Valid,
    SD.Is_Split
FROM #SourceData AS SD
LEFT JOIN _BadData AS BD
    ON SD.ID = BD.ID;

Here's a dbfiddle.uk repo demonstrating that code.

Note that, depending on the size of #SourceData and the generated execution plan, you may want to materialize the results of the STRING_SPLIT() function, the whole CTE itself, or both, to a temp table first, before using it in the second half of the above query for the final LEFT JOIN. Either way, I'd expect this to be measurably faster than looping over your rows one by one.
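
The materialization suggested above might look roughly like the following. This is only a sketch, assuming SQL Server 2016 or later (database compatibility level 130+, which STRING_SPLIT() requires) and the #ValidCode and #SourceData tables from the question. The temp table name #BadRow and the index name are hypothetical. Because the question asks for the flag to be persisted, the sketch ends with an UPDATE instead of a SELECT.

```sql
-- Hypothetical temp table: IDs of source rows that contain at least one
-- token with no match in #ValidCode.
DROP TABLE IF EXISTS #BadRow;

SELECT DISTINCT SD.ID
INTO #BadRow
FROM #SourceData AS SD
CROSS APPLY STRING_SPLIT(SD.Codes, ';') AS SS
WHERE NOT EXISTS
(
    SELECT 1
    FROM #ValidCode AS VC
    WHERE VC.Code = SS.[value]
);

-- Index the materialized IDs so the UPDATE below can probe them cheaply.
CREATE UNIQUE CLUSTERED INDEX IX_BadRow_ID ON #BadRow (ID);

-- Persist the flag: a row is valid only if its ID is absent from #BadRow.
UPDATE SD
SET SD.Is_Valid = CASE WHEN BR.ID IS NULL THEN 1 ELSE 0 END
FROM #SourceData AS SD
LEFT JOIN #BadRow AS BR
    ON BR.ID = SD.ID;
```

Two edge cases are worth checking against your rules. STRING_SPLIT() returns no rows for a NULL input, so a row with NULL Codes would be flagged as valid here. An empty token (for example from 'A;;B' or a trailing ';') matches none of the sample codes, so such a row would be flagged as invalid.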

  • Thanks for the idea, I'll try this too. For each row, I need to preserve the "Is_Valid" flag i.e. irrespective of if a row is valid or not, I need to know and preserve that validation result. The CTE above seems to combine all tokens into one table -- which may result in not being able to capture the validity flag for each row of #SourceData
    – ToC
    Commented Nov 20, 2024 at 14:23
  • @ToC Gotcha, your sample data didn't include that, so you may want to consider updating your post to clarify that, as I believe you may have the same issue with the other answer that was provided too. Please see my updated answer that preserves the original Is_Valid flag as well.
    – J.D.
    Commented Nov 20, 2024 at 15:43
  • This is such a creative idea. I cannot believe it works !! I'll keep playing with it to see if it covers all the scenarios. This is a solid foundation for me to add other components. Thanks !!
    – ToC
    Commented Nov 20, 2024 at 20:43
  • @ToC Great, no problem! Best of luck!
    – J.D.
    Commented Nov 20, 2024 at 22:45

-- first thing that comes to mind:

SELECT 
    sd.ID
    , sd.Codes
    , CASE 
        WHEN NOT EXISTS (
            SELECT x.[value] 
            FROM string_split(sd.Codes, ';') as x
            LEFT OUTER JOIN #ValidCode as vc
                ON x.[value] = vc.Code
            WHERE vc.Code IS NULL
        )
        THEN 1
        ELSE 0
      END as is_valid
FROM #SourceData as sd

It might be more optimal to create a child table for the SourceData codes split out into rows.
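
A rough sketch of that child-table idea, assuming SQL Server 2016+ for STRING_SPLIT() and the temp tables from the question. The table name #SourceToken, its column names, and the index name are hypothetical. DISTINCT collapses repeated tokens such as 'A;A;A;A' into a single row per source ID, so the validity check touches each distinct token only once.

```sql
-- Hypothetical child table: one row per distinct (source row, token) pair.
DROP TABLE IF EXISTS #SourceToken;

CREATE TABLE #SourceToken
(
      SourceID  INT           NOT NULL
    , Token     VARCHAR(500)  NOT NULL
);

INSERT INTO #SourceToken (SourceID, Token)
SELECT DISTINCT sd.ID, x.[value]
FROM #SourceData AS sd
CROSS APPLY STRING_SPLIT(sd.Codes, ';') AS x;

CREATE CLUSTERED INDEX IX_SourceToken ON #SourceToken (SourceID, Token);

-- A row is invalid if any of its tokens has no match in #ValidCode.
UPDATE sd
SET sd.Is_Valid = CASE WHEN EXISTS
    (
        SELECT 1
        FROM #SourceToken AS st
        WHERE st.SourceID = sd.ID
          AND NOT EXISTS
          (
              SELECT 1
              FROM #ValidCode AS vc
              WHERE vc.Code = st.Token
          )
    ) THEN 0 ELSE 1 END
FROM #SourceData AS sd;
```

Whether this beats the single-statement version depends on the data and the execution plan, so it is worth testing both against a realistic row count.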

  • Interesting approach! I'll try it.
    – ToC
    Commented Nov 19, 2024 at 21:31
