Garbled text when using URLSubmit[] to request a CJK web page

Question

When importing content that contains Chinese (CJK) characters, URLRead returns the correct result:

URLRead["https://zhuanlan.zhihu.com/p/35359905", "Body"]

But with URLSubmit, the output is garbled:

URLSubmit[
 HTTPRequest["https://zhuanlan.zhihu.com/p/35359905", 
  CharacterEncoding -> "UTF-8"], 
 HandlerFunctions -> <|"BodyReceived" -> ((body = #Body) &)|>]

It seems that the notion of "Body" is not consistent across functions. URLRead's Body is already decoded according to the response's headers, here ContentType->text/html; charset=utf-8. For URLSubmit it is a raw string, so either use Jean-Pierre's answer or FromCharacterCode[ToCharacterCode@#Body, "UTF8"].
– Kuba, Commented Jul 3, 2022 at 22:09
@Kuba I can't get your code to run correctly ![jJCDIJ.png](imgtu.com/i/jJCDIJ)
– a15355447898a, Commented Jul 4, 2022 at 3:27
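
The re-decoding step from Kuba's comment can be checked offline, without any network request. The following is a minimal sketch, assuming the raw body holds one character per byte, as the comment describes; the sample string is illustrative and not taken from the page in question.

```wolfram
(* Illustrative sample text, not the actual page content *)
text = "中文";

(* UTF-8 byte values of the sample *)
bytes = ToCharacterCode[text, "UTF8"];

(* Simulate a raw body: one character per byte, which displays as garbled text *)
raw = FromCharacterCode[bytes];

(* Reinterpret the character codes as UTF-8 bytes, as in Kuba's comment *)
decoded = FromCharacterCode[ToCharacterCode[raw], "UTF8"];

(* Compare the round trip with the original string *)
decoded === text
```

If the comparison holds on a local system, the decoding expression itself is sound, and any remaining problem is more likely in how the body reaches the handler.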

Jean-Pierre · Accepted Answer · 2022-07-03 18:34:36Z

8

It appears that using BodyByteArray is the preferred approach.

URLSubmit[
 HTTPRequest["https://zhuanlan.zhihu.com/p/35359905", 
  CharacterEncoding -> "UTF-8"], 
 HandlerFunctions -> <|
   "BodyReceived" -> ((body = 
        ByteArrayToString@#[["BodyByteArray"]]) &)|>, 
 HandlerFunctionsKeys -> {"BodyByteArray"}]

It works because ByteArrayToString is "assuming it contains UTF-8 data". In general that may not be the case, and decoding should be based on the information in the response's headers.

– Kuba, Commented Jul 4, 2022 at 9:05
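
Header-based decoding, as suggested in the comment above, could look roughly like the sketch below. The names charsetToEncoding, encodingFromContentType and decodeBody are hypothetical helpers, not part of the Wolfram Language, and the charset table is deliberately tiny. The sketch assumes the Content-Type value is already available as a string; how to obtain it inside a URLSubmit handler is not covered here.

```wolfram
(* Hypothetical table from HTTP charset names to Wolfram encoding names; extend as needed *)
charsetToEncoding = <|"utf-8" -> "UTF-8", "iso-8859-1" -> "ISO8859-1"|>;

(* Hypothetical helper: read the charset parameter from a Content-Type value.
   Falls back to "UTF-8" when no charset is given or the name is not in the table.
   Quoted values such as charset="utf-8" are not handled. *)
encodingFromContentType[contentType_String] :=
 Module[{charset},
  charset = StringCases[contentType,
    "charset=" ~~ (cs : ((WordCharacter | "-" | "_") ..)) :> ToLowerCase[cs],
    IgnoreCase -> True];
  If[charset === {},
   "UTF-8",
   Lookup[charsetToEncoding, First[charset], "UTF-8"]]]

(* Hypothetical helper: decode a byte array using the encoding named in the header *)
decodeBody[bytes_ByteArray, contentType_String] :=
 ByteArrayToString[bytes, encodingFromContentType[contentType]]

(* Offline usage with illustrative data *)
decodeBody[StringToByteArray["中文", "UTF-8"], "text/html; charset=utf-8"]
```

Falling back to UTF-8 mirrors the default behaviour of ByteArrayToString noted in the comment; a stricter version could return a failure for unknown charsets instead.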
