SQL Server: Query to parse and validate codes

Question

We have a #ValidCode table with a list of valid codes such as 'A', 'B', 'C', etc. Another table, #SourceData, holds the input data, which arrives as a combination of valid and invalid tokens (sometimes with duplicates).

Ex:

'A;B;C' (valid)
'A;A;A;A;A;B' (valid)
'ad;df;A;B' (invalid)

I'm trying to find an optimal query approach to process these strings and identify the valid rows in #SourceData. See the example below:

DROP TABLE IF EXISTS #ValidCode
GO
CREATE TABLE #ValidCode
(
      ID        INT             IDENTITY(1,1)
    , Code      CHAR(1)
)
INSERT INTO #ValidCode (Code) VALUES ('A'), ('B'), ('C')
GO
DROP TABLE IF EXISTS #SourceData 
GO
CREATE TABLE #SourceData 
(
      ID        INT             IDENTITY(1,1)
    , Codes     VARCHAR(500)
    , Is_Valid  BIT
    , Is_Split  BIT
)

INSERT INTO #SourceData (Codes) 
VALUES    ('A;B;C')
        , ('B;A')
        , ('B;B;B;C;C;A;A;B')
        , ('B;Z;1')
        , ('B;ss;asd')


SELECT * FROM #ValidCode
SELECT * FROM #SourceData

The query should process the data in the #SourceData table and update the Is_Valid flag, so the rows can be consumed by a subsequent process.

Rules:

Every token must be valid for the row's Codes value to be valid (rows 1 to 3).
If even one token is invalid, the entire row is invalid (rows 4 and 5).

So, this is the preferred output:

ID	Codes	Is_Valid
1	A;B;C	1
2	B;A	1
3	B;B;B;C;C;A;A;B	1
4	B;Z;1	0
5	B;ss;asd	0

Current approach: loop through each row in #SourceData, split it on the ';' delimiter, and compare the tokens to the #ValidCode table. If all tokens are individually valid, mark the row in #SourceData as valid (Is_Valid flag); otherwise, mark it as invalid. The WHILE loop approach works, but it is slow.

#SourceData could have up to 3 million rows, with each row containing a mix of duplicated valid values ('A;A;A;A') and invalid values ('A;as;sdf;B').

Is there a better approach?

Thanks!

Storing delimited data in a column is generally a bad decision; a normalized approach to storing the data would save a lot of processing power. – nbk, Commented Nov 19, 2024 at 23:15
@nbk The source data comes in as delimited. It cannot be changed. – ToC, Commented Nov 20, 2024 at 14:19

J.D. · Accepted Answer · 2024-11-20 15:44:12Z

1

One relational way to do this is to split #SourceData first (fortunately you have access to the STRING_SPLIT() function despite being on an outdated version of SQL Server), then find the rows containing tokens that don't match #ValidCode, and finally use those rows to determine Is_Valid for each row of the original #SourceData table.

Here's an example of how to do that:

;WITH _BadData AS
(
    SELECT DISTINCT SD.ID
    FROM #SourceData AS SD
    CROSS APPLY STRING_SPLIT(SD.Codes, ';') AS SS
    WHERE NOT EXISTS
    (
        SELECT 1 AS RowExists
        FROM #ValidCode AS VC
        WHERE SS.[value] = VC.Code
    )
)

SELECT
    SD.ID,
    SD.Codes,
    ISNULL(SD.Is_Valid, IIF(BD.ID IS NULL, 1, 0)) AS Is_Valid,
    SD.Is_Split
FROM #SourceData AS SD
LEFT JOIN _BadData AS BD
    ON SD.ID = BD.ID;

Here's a dbfiddle.uk repo demonstrating that code.

Note that, depending on the size of #SourceData and the generated execution plan, you may want to materialize the results of the STRING_SPLIT() function, the whole CTE, or both to a temp table before using them in the final LEFT JOIN in the second half of the query above. Either way, I'd expect this to be measurably faster than looping over your rows one by one.
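
As a rough sketch of that materialization idea, assuming the same #SourceData and #ValidCode tables from the question: the temp table name #BadRow and its primary key are hypothetical choices, not part of the original answer. It stores the IDs of rows with at least one invalid token, then writes the flag back with a single set-based UPDATE, since the question asks for Is_Valid to be updated rather than only selected.

```sql
-- Hypothetical temp table: the ID of every row with at least one invalid token.
DROP TABLE IF EXISTS #BadRow;

CREATE TABLE #BadRow
(
    ID INT NOT NULL PRIMARY KEY
);

INSERT INTO #BadRow (ID)
SELECT DISTINCT SD.ID
FROM #SourceData AS SD
CROSS APPLY STRING_SPLIT(SD.Codes, ';') AS SS
WHERE NOT EXISTS
(
    SELECT 1
    FROM #ValidCode AS VC
    WHERE VC.Code = SS.[value]
);

-- Persist the result: 0 when the row appears in #BadRow, otherwise 1.
-- Rows that already carry a flag are left untouched, mirroring the ISNULL() logic.
UPDATE SD
SET Is_Valid = IIF(BR.ID IS NULL, 1, 0)
FROM #SourceData AS SD
LEFT JOIN #BadRow AS BR
    ON BR.ID = SD.ID
WHERE SD.Is_Valid IS NULL;
```

Whether the extra temp table pays off depends on the actual plan and data volume, so it is worth comparing both forms against a realistic sample before committing to one.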

edited Nov 20, 2024 at 15:44

answered Nov 20, 2024 at 13:30

Thanks for the idea, I'll try this too. For each row, I need to preserve the "Is_Valid" flag, i.e. whether or not a row is valid, I need to know and keep that validation result. The CTE above seems to combine all tokens into one table, which may make it impossible to capture the validity flag for each row of #SourceData.
– ToC
Commented Nov 20, 2024 at 14:23
@ToC Gotcha, your sample data didn't include that, so you may want to consider updating your post to clarify that, as I believe you may have the same issue with the other answer that was provided too. Please see my updated answer that preserves the original Is_Valid flag as well.
– J.D.
Commented Nov 20, 2024 at 15:43
This is such a creative idea. I cannot believe it works!! I'll keep playing with it to see if it covers all the scenarios. This is a solid foundation for me to add other components. Thanks!!
– ToC
Commented Nov 20, 2024 at 20:43
@ToC Great, no problem! Best of luck!
– J.D.
Commented Nov 20, 2024 at 22:45

Doug Hills · Accepted Answer · 2024-11-19 20:49:25Z

1

-- first thing that comes to mind:

SELECT 
    sd.ID
    , sd.Codes
    , CASE 
        WHEN NOT EXISTS (
            SELECT x.[value] 
            FROM string_split(sd.Codes, ';') as x
            LEFT OUTER JOIN #ValidCode as vc
                ON x.[value] = vc.Code
            WHERE vc.Code IS NULL
        )
        THEN 1
        ELSE 0
      END as is_valid
FROM #SourceData as sd

It might be more efficient to create a child table that holds the #SourceData codes split out into rows.
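
One way the child-table idea could look, as a sketch only: #SourceCode, its column names, and the clustered index are assumptions made for illustration. Each distinct token is stored once per source row, which also collapses duplicates such as 'A;A;A;A' before the lookup against #ValidCode.

```sql
-- Hypothetical child table: one row per distinct token per #SourceData row.
DROP TABLE IF EXISTS #SourceCode;

CREATE TABLE #SourceCode
(
      SourceID  INT             NOT NULL
    , Code      VARCHAR(500)    NOT NULL
);

INSERT INTO #SourceCode (SourceID, Code)
SELECT DISTINCT sd.ID, x.[value]
FROM #SourceData AS sd
CROSS APPLY STRING_SPLIT(sd.Codes, ';') AS x;

-- Index chosen as an assumption; check the actual execution plan before keeping it.
CREATE CLUSTERED INDEX IX_SourceCode ON #SourceCode (SourceID, Code);

-- A source row is invalid when any of its tokens is missing from #ValidCode.
SELECT
    sd.ID
    , sd.Codes
    , CASE
        WHEN EXISTS (
            SELECT 1
            FROM #SourceCode AS sc
            WHERE sc.SourceID = sd.ID
              AND NOT EXISTS (
                    SELECT 1
                    FROM #ValidCode AS vc
                    WHERE vc.Code = sc.Code
              )
        )
        THEN 0
        ELSE 1
      END AS is_valid
FROM #SourceData AS sd;
```

The child table costs one extra pass over the data, but it can be reused if later steps also need the individual tokens.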

edited Nov 19, 2024 at 20:49

answered Nov 19, 2024 at 20:44

Interesting approach! I'll try it.
– ToC
Commented Nov 19, 2024 at 21:31
