I’m migrating SQL Server tables to Snowflake using Azure Data Factory. Direct Copy from SQL Server to Snowflake works, but I can’t use it because of the 100 MB row size limit. That forces me to take an intermediate staging approach.

Approach 1: Direct Copy (works, no precision issues)

SQL Server → (ADF Direct Copy → internal TEXT file) → Snowflake

The internal TEXT format generated by ADF preserves numeric precision exactly as shown in SSMS results.

Approach 2: Indirect Copy via Staging (causes precision rounding)

SQL Server → CSV in Azure Blob → COPY INTO Snowflake

With CSV, some numeric columns get rounded, unlike the internal TEXT output used by Direct Copy. I cannot use Parquet because Parquet cannot handle ancient dates (e.g., date values in the 1600s).

Problem: If ADF Direct Copy can internally generate a TEXT file with full precision preserved, there should be a way to stage the same data (via file format) without losing precision.

Question: How can I export data from SQL Server to a staging file (CSV or another supported format) without precision loss, achieving the same numeric accuracy that Direct Copy produces internally?

SQL Server Table: [image]

Direct Copy (Internal TXT file): [image]
Note: Please look at the first column values alone.

Stage Copy CSV file: [image]

Data Comparison: [image]


CREATE TABLE dbo.DATATYPE_FLOAT_TEST (
    MY_FLOAT24 FLOAT(24)
);

GO

INSERT INTO dbo.DATATYPE_FLOAT_TEST VALUES
    ( CAST(         -9876.54321             AS float(24) ) ),
    ( CAST(            -0.0000123456        AS float(24) ) ),
    ( CAST(             0.0000003456789     AS float(24) ) ),
    ( CAST(             0.1                 AS float(24) ) ),
    ( CAST(             1.99999995          AS float(24) ) ),
    ( CAST(             3.1415926245        AS float(24) ) ),
    ( CAST(           123.456789123         AS float(24) ) ),
    ( CAST(           123.999999999999999   AS float(24) ) ),
    ( CAST(     123456789.1234567           AS float(24) ) ),
    ( CAST( 1000000000000.123               AS float(24) ) );

GO
  • Can Parquet really not handle dates in the 1600s? Commented Nov 26 at 17:11
  • Is SQL really holding the data as float(24), aka 32-bit reals? It looks to me like the SQL values are consistent with double(53) precision 64-bit storage. -1.2345600225671660[snip]e-5 is the nearest float(24) representation of -1.23456e-5 (see base convert). You need 64-bit double precision to get any closer, and the nearest value there is -1.23455999999999991635[snip]e-5. The SQL approximation to pi of 3.141593 is more puzzling. The closest float(24) approximation to pi is 3.1415927410125732421875, and that is the value stored in the BLOB. Commented Nov 26 at 17:20
  • Exactly how did you export CSV from Azure SQL? There are a variety of ways, all with their own different nuances. Commented Nov 26 at 18:37
  • @MattGibson Parquet's "NANOS" unit for Date/Time values has a lower bound of 1677-09-21. However, the solution is simple: don't use nanosecond precision. Commented Nov 26 at 18:41
  • I do not know the exact mechanism that you used to create the CSV file, but if you have control of the query, you might be able to use CONVERT(VARCHAR, SomeValue, 3) to explicitly convert your float values to character strings with enough precision designed for lossless conversion. See this db<>fiddle that shows the different results when converting to and from text. Commented Nov 26 at 19:07

1 Answer

The displayed SQL-side approximation to pi of 3.141593 is very suspicious. It is most definitely not the best float(24) (32-bit) approximation to pi.

                     True Value                  SQL Value [7 digits]   Float(24) 32-bit             Error
pi 25 digits shown   3.14159265358979323846264                                                       <1e-25
pi                   3.1415926535                3.14159274             3.1415927410125732421875     +8.74e-8
                     3.141593                    3.141593               3.14159297943115234375       +3.464e-7

It looks to me as if the first method (Direct Copy) is the one actually corrupting the data, possibly following a convention in the SQL world of transferring 32-bit float(24) values as text rounded to only seven significant decimal digits. Forcing it to use 8 decimal digits would help a lot.

This is actually the key to the problem. The values in the left-hand column are rounded and shown to only 7 decimal digits of text. That is insufficient to represent every possible float(24) in a form that is exactly reproducible when the string is converted back to a float. To be sure of exactly what is going on, you would need to display the various float(24) representations in hexadecimal form.
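
One way to look at the raw representation is sketched below. This is only an illustration: the helper names float_bits and show_bits are made up, and the code assumes that a C float is the same IEEE 754 32-bit type that SQL Server uses for float(24).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Return the raw 32-bit pattern of a float (assumed IEEE 754 binary32).
   memcpy avoids the undefined behaviour of pointer type punning. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Print a value with 9 significant digits next to its bit pattern. */
void show_bits(float f)
{
    printf("%-14.9g 0x%08lX\n", f, (unsigned long)float_bits(f));
}
```

Calling show_bits(3.14159274f) and show_bits(3.141593f) should print bit patterns that differ in the last hexadecimal digit, which makes the one-step difference between the two stored values visible.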

In the range from 1.0 to 2.0, roughly 7 in 8 of the 2^23 possible FP values are corrupted by a round trip to a 7 decimal digit string representation and back. Only about 1 in 8 survive (see the counts for %15.7g in the output below).

I seem to recall that a minimum of 8 decimal digits is essential to preserve most FP32 values reliably, and that 9 significant decimal digits are needed to catch the awkward edge cases across the full range of FP32 numbers (9 is enough for every IEEE 754 32-bit value). There is a website with a few samples of awkward-to-convert float string constants somewhere, but the URL eludes me at the moment.
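
A small check of the digit-count argument, as a sketch only: survives_round_trip is a hypothetical helper, not part of any ADF or SQL Server tooling, and it assumes a C float is an IEEE 754 32-bit value. In C11, the constant FLT_DECIMAL_DIG from <float.h> gives the number of digits needed for a guaranteed round trip, which is 9 for this type.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Format f with the given number of significant digits, parse the text
   back, and report whether the original value survived unchanged. */
int survives_round_trip(float f, int digits)
{
    char text[64];
    int n = snprintf(text, sizeof text, "%.*g", digits, f);
    assert(n > 0 && (size_t)n < sizeof text);
    return strtof(text, NULL) == f;
}
```

For the float nearest to pi, survives_round_trip(3.14159274f, 7) returns 0 because the text "3.141593" parses back to the next float up, while the same call with 9 digits returns 1.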

If you really care about preserving the exact bitwise representation, then transfer the values using the hexadecimal floating-point format, %a in C format specifications. That truly is a lossless and unambiguous bit-for-bit transfer, in text, of valid IEEE 754 numbers.
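
A minimal sketch of such a %a round trip follows. The function names are invented for this example, and whether the loading side (Snowflake's COPY INTO, for instance) can parse this notation is a separate question that is not covered here.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Write f into buf as a C99 hexadecimal floating-point string. */
void float_to_hex_text(float f, char *buf, size_t size)
{
    int n = snprintf(buf, size, "%a", f);
    assert(n > 0 && (size_t)n < size);
}

/* Parse a hexadecimal (or decimal) floating-point string back to a float. */
float hex_text_to_float(const char *text)
{
    return strtof(text, NULL);
}
```

The exact spelling of the %a output can vary between C libraries, but any conforming strtof should read it back to the same float.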

This is a toy C program that takes all the float values between 1.0 and 2.0 through the round trip at various levels of rounding, from 1 through 8 significant decimal digits, and reports the number of good and bad results. Bad means that the original floating point value was corrupted in the process.

#include <stdio.h>
#include <stdlib.h>

// Takes every float in [1.0, 2.0) on a round trip through a string
// representation and counts how many values remain exact for each format.

void toy(const char *format)
{
    int good = 0, bad = 0;
    char string[32];
    double pi = 3.1415926535897932384626433832795;
    float x, y = 1.0f;
    // Just over half the float spacing (2^-23) in [1, 2), so each
    // addition rounds up to the next representable float.
    float dy = 0.6e-7f;
    for (int i = 0; i < 1 << 23; i++)
    {
        if (!(i & 0xfffff)) printf("."); // to show something is happening
        snprintf(string, sizeof string, format, y);
        x = strtof(string, NULL);        // parse the text back to a float
        if (x == y)
            good++;
        else
            bad++;
        y += dy;
    }
    printf("format %s good %7i bad %7i pi  = ", format, good, bad);
    printf(format, pi);
    printf("\n");
}

int main() {
    toy("%15.1g");
    toy("%15.2g");
    toy("%15.3g");
    toy("%15.4g");
    toy("%15.5g");
    toy("%15.6g");
    toy("%15.7g");
    toy("%15.8g");
    return 0;
}

Here is the output:

........format %15.1g good       1 bad 8388607 pi  =               3
........format %15.2g good      10 bad 8388598 pi  =             3.1
........format %15.3g good     100 bad 8388508 pi  =            3.14
........format %15.4g good    1000 bad 8387608 pi  =           3.142
........format %15.5g good   10000 bad 8378608 pi  =          3.1416
........format %15.6g good  100000 bad 8288608 pi  =         3.14159
........format %15.7g good 1000000 bad 7388608 pi  =        3.141593
........format %15.8g good 8388608 bad       0 pi  =       3.1415927

This post is arguably a disguised dupe of "Is floating point math broken?", although in this instance it is the conversion to and from a decimal string representation that is causing the problem. IEEE 754 floating point numbers can be unforgiving, particularly the 32-bit float, or float(24) as you prefer to call it. It is far too easy to introduce rounding errors or overflows with such a restrictive mantissa and exponent.

Stage copy appears to be doing everything right with enough decimal digits to accurately represent all of the floating point values that can occur (ignoring just a handful of exceptions that require a brute force search or a specialist FP URL to find).

In every case the "wrong" answer it has transferred is a valid and unambiguous float(24) value.
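
As a rough illustration of that point, the sketch below stores the values from the question's INSERT statement as C floats and prints each one at 7 and at 9 significant digits. It assumes that a C float matches SQL Server float(24); the array and function names are made up for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Values from the question's INSERT statement. The f suffix stores each
   one as a 32-bit float (assumed here to match SQL Server FLOAT(24)). */
static const float samples[] = {
    -9876.54321f, -0.0000123456f, 0.0000003456789f, 0.1f, 1.99999995f,
    3.1415926245f, 123.456789123f, 123.999999999999999f,
    123456789.1234567f, 1000000000000.123f
};

/* Print each stored value at 7 and at 9 significant digits. */
void print_samples(void)
{
    size_t count = sizeof samples / sizeof samples[0];
    assert(count == 10); /* one entry per row inserted in the question */
    for (size_t i = 0; i < count; i++)
        printf("%-16.7g %.9g\n", samples[i], samples[i]);
}
```

Several inputs cannot be held in 32 bits at all: 1.99999995 is stored as exactly 2, and 123456789.1234567 as 123456792. The 9-digit column shows what is really stored, which is why a faithful export looks different from the literals that were typed in.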
