
When I read some JSON payloads, PySpark changes the data even though I read the field as a StringType. I want it as a string because I don't want each field expanded into its own column at this step; I just want the payload as a string, exactly as it appears in the payload/source file.

Locally I'm using Spark 3.3 in a Jupyter notebook with the Glue 4 image. PySpark version: 3.3.0+amzn.1.dev0

Here is my payload/source (test.txt):

{"payload":{"points":1220000000}}
{"payload":{"count":1550554545.0}}
{"payload":{"points":125888002540.0, "count":1550554545.0}}
{"payload":{"name": "Roger", "count":55154111.0}}

Here is my code:

from pyspark.sql.types import StructType, StructField, StringType

path = "/home/glue_user/workspace/jupyter_workspace/test/test.txt"
schema = StructType([StructField('payload', StringType(), True)])
my_df = spark.read.schema(schema).option("inferSchema", "false").json(path)
my_df.show(truncate=False)

Here is the result: PySpark renders the float numbers in scientific notation, even though I read the field as a string.

+------------------------------------------------+
|payload                                         |
+------------------------------------------------+
|{"points":1220000000}                           |
|{"count":1.550554545E9}                         |
|{"points":1.2588800254E11,"count":1.550554545E9}|
|{"name":"Roger","count":5.5154111E7}            |
+------------------------------------------------+

Why can't I simply have my data as it is? Why is the value inside my string field rewritten into scientific notation? For example:

"count":1550554545.0
"count":1.550554545E9

2 Comments
  • I opened a Spark ticket: issues.apache.org/jira/browse/SPARK-49616. It seems to me to be a bug; I don't want the framework to change the raw data when I read it as a string. Commented Sep 12, 2024 at 18:25
  • Apparently Spark 3.3, the version I'm using, always renders scientific notation for a string that contains (or is) a JSON object, and I couldn't avoid it. I had to save the payload with the scientific notation, and since we have another Spark job, I needed to read this field with Decimal(20,0) to be able to parse it; IntegerType and LongType couldn't read values in scientific notation (see the sketch below). Commented Sep 18, 2024 at 18:13
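
For reference, here is a minimal sketch of the downstream read described in that comment. The field name count comes from the sample file; the output path and the exact downstream schema are hypothetical, since they are not shown in the thread:

from pyspark.sql.types import StructType, StructField, DecimalType

# Hypothetical downstream schema: DecimalType(20, 0) accepts values written in
# scientific notation (e.g. 1.550554545E9), which IntegerType/LongType reject.
downstream_schema = StructType([StructField('count', DecimalType(20, 0), True)])
fixed_df = spark.read.schema(downstream_schema).json('output_path')  # hypothetical path
fixed_df.show()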

1 Answer


I did a little research and asked my colleagues about it, and I learned that the Spark JSON reader automatically infers the schema and you can't override that behavior. What you can do is read the file as text first and then convert the values.

I ran into the same issue while reading data from a SQL database, and there I used a cast expression in the query itself.
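
For illustration, here is a sketch of that cast-in-query idea over JDBC; the connection details, table, and column names are hypothetical, not from the original question:

# Hypothetical JDBC read: casting to text inside the pushed-down query keeps
# the value in its original form, so Spark never re-renders the number.
query = "(SELECT CAST(count AS VARCHAR(32)) AS count FROM events) AS src"
jdbc_df = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")  # hypothetical connection
    .option("dbtable", query)
    .option("user", "user")
    .option("password", "password")
    .load())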

For your use case, the following works for me. I tried from_json, MapType, and get_json_object, all three, but I think whenever Spark converts the value through JSON, it switches to scientific notation.

from pyspark.sql.functions import regexp_replace

# Read each line as plain text (no JSON parsing), then strip the outer wrapper.
df = spark.read.text(path)
df = df.withColumn('value', regexp_replace('value', r'^\{"payload":', ''))
df = df.withColumn('value', regexp_replace('value', r'\}$', ''))
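
If you then need typed values, here is a sketch of the convert step mentioned above, staying with pure string operations so Spark never re-renders the number; the field name count is taken from the sample file:

from pyspark.sql.functions import regexp_extract

# Extract the numeric text with a regex (still no JSON conversion) and cast it
# explicitly, so the rendering stays under your control.
df = df.withColumn('count', regexp_extract('value', r'"count":([0-9.]+)', 1).cast('decimal(20,0)'))
df.show(truncate=False)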

3 Comments

Hi @nikhil-tale, I tried this solution after you posted it here, but it doesn't change the fact that the number was already altered by Spark. I mean, when we replace the { and } to kind of trick Spark, Spark has already changed the value to scientific notation, so we end up with the number in scientific notation anyway. Thank you for the solution, though.
No, the scientific notation was not there when I tried this. You haven't used a JSON conversion anywhere, right?
No JSON conversion was used. If you df.show() your dataframe before and after this regexp_replace, are both in scientific notation, or just the one before? Thank you @nikhil-tale
