
When I read some JSON payloads, PySpark changes the data even though I read the field as a StringType. I want it as a string because I don't want each field expanded into its own column at this step; I just want the payload as a string, exactly as it appears in the payload/source file.

Locally I'm using Spark 3.3 in a Jupyter notebook with the Glue 4 image. PySpark version: 3.3.0+amzn.1.dev0

Here is my payload/source (test.txt):

{"payload":{"points":1220000000}}
{"payload":{"count":1550554545.0}}
{"payload":{"points":125888002540.0, "count":1550554545.0}}
{"payload":{"name": "Roger", "count":55154111.0}}

Here is my code:

from pyspark.sql.types import StructType, StructField, StringType

path = "/home/glue_user/workspace/jupyter_workspace/test/test.txt"
schema = StructType([StructField('payload', StringType(), True)])
my_df = spark.read.schema(schema).option("inferSchema", "false").json(path)
my_df.show(truncate=False)

Here is the result: PySpark renders the float numbers in scientific notation, even though I read the field as a string.

+------------------------------------------------+
|payload                                         |
+------------------------------------------------+
|{"points":1220000000}                           |
|{"count":1.550554545E9}                         |
|{"points":1.2588800254E11,"count":1.550554545E9}|
|{"name":"Roger","count":5.5154111E7}            |
+------------------------------------------------+

Why can't I simply have my data as it is? Why is the value inside my string field rewritten into scientific notation? For example:

"count":1550554545.0
"count":1.550554545E9

2 Comments
  • I opened a Spark ticket: issues.apache.org/jira/browse/SPARK-49616. It seems to me to be a bug; I don't want the framework to change the raw data when I read it as a string. Commented Sep 12, 2024 at 18:25
  • Apparently Spark 3.3, the version I'm using, always renders scientific notation for a string that contains (or is) a JSON object, and I couldn't avoid it. I had to save the payload with the scientific notation, and since we have another Spark job, I needed to read this field with Decimal(20,0) to be able to parse it; IntegerType and LongType couldn't read values in scientific notation (see the sketch below). Commented Sep 18, 2024 at 18:13
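
For reference, here is a minimal sketch of the downstream read described in that comment. The field name count comes from the sample file; the output path and the exact downstream schema are hypothetical, since they are not shown in the thread:

from pyspark.sql.types import StructType, StructField, DecimalType

# Hypothetical downstream schema: DecimalType(20, 0) accepts values written in
# scientific notation (e.g. 1.550554545E9), which IntegerType/LongType reject.
downstream_schema = StructType([StructField('count', DecimalType(20, 0), True)])
fixed_df = spark.read.schema(downstream_schema).json('output_path')  # hypothetical path
fixed_df.show()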

1 Answer


I did a little research and asked my colleagues about it, and I learned that the Spark JSON reader automatically infers the schema and you can't override that behavior. What you can do is read the file as text first and then convert the values.

I ran into the same issue while reading data from a SQL database, and there I used a cast expression in the query itself.
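
For illustration, here is a sketch of that cast-in-query idea over JDBC; the connection details, table, and column names are hypothetical, not from the original question:

# Hypothetical JDBC read: casting to text inside the pushed-down query keeps
# the value in its original form, so Spark never re-renders the number.
query = "(SELECT CAST(count AS VARCHAR(32)) AS count FROM events) AS src"
jdbc_df = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")  # hypothetical connection
    .option("dbtable", query)
    .option("user", "user")
    .option("password", "password")
    .load())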

For your use case, the following works for me. I tried from_json, MapType, and get_json_object, all three, but I think whenever Spark converts the value through JSON, it switches to scientific notation.

from pyspark.sql.functions import regexp_replace

# Read each line as plain text (no JSON parsing), then strip the outer wrapper.
df = spark.read.text(path)
df = df.withColumn('value', regexp_replace('value', r'^\{"payload":', ''))
df = df.withColumn('value', regexp_replace('value', r'\}$', ''))
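
If you then need typed values, here is a sketch of the convert step mentioned above, staying with pure string operations so Spark never re-renders the number; the field name count is taken from the sample file:

from pyspark.sql.functions import regexp_extract

# Extract the numeric text with a regex (still no JSON conversion) and cast it
# explicitly, so the rendering stays under your control.
df = df.withColumn('count', regexp_extract('value', r'"count":([0-9.]+)', 1).cast('decimal(20,0)'))
df.show(truncate=False)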

3 Comments

Hi @nikhil-tale, I tried this solution after you posted it here, but it doesn't change the fact that the number was already altered by Spark. I mean, when we replace the { and } to kind of trick Spark, Spark has already changed the value to scientific notation, so we end up with the number in scientific notation anyway. Thank you for the solution, though.
No, the scientific notation was not there when I tried this. You haven't used a JSON conversion anywhere, right?
No JSON conversion was used. If you df.show() your dataframe before and after this regexp_replace, are both in scientific notation, or just the one before? Thank you @nikhil-tale
