6
$\begingroup$

When importing content containing Chinese (CJK) characters, URLRead gives the correct result:

URLRead["https://zhuanlan.zhihu.com/p/35359905", "Body"]

screenshot of correct output

But with URLSubmit, the output is garbled:

URLSubmit[
 HTTPRequest["https://zhuanlan.zhihu.com/p/35359905", 
  CharacterEncoding -> "UTF-8"], 
 HandlerFunctions -> <|"BodyReceived" -> ((body = #Body) &)|>]

screenshot: output contains weird characters

$\endgroup$
2
  • 2
    $\begingroup$ It seems that the notion of "Body" is not consistent across functions. URLRead's "Body" is already decoded according to the response's headers, here ContentType -> text/html; charset=utf-8. For URLSubmit it is a raw string, so either use Jean-Pierre's answer or FromCharacterCode[ToCharacterCode@#Body, "UTF8"]. $\endgroup$ Commented Jul 3, 2022 at 22:09
  • $\begingroup$ @Kuba I can't get your code to run correctly ![jJCDIJ.png](imgtu.com/i/jJCDIJ) $\endgroup$ Commented Jul 4, 2022 at 3:27
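
The effect described in the first comment can be reproduced offline, without any network request. The following is a minimal sketch, assuming that the raw "Body" string holds one character per UTF-8 byte; the variable names (original, bytes, garbled, recovered) are illustrative and not part of the original post.

```mathematica
(* A short CJK test string *)
original = "中文";

(* Its UTF-8 byte values: {228, 184, 173, 230, 150, 135} *)
bytes = ToCharacterCode[original, "UTF8"];

(* Treating every byte as a separate character gives the garbled form, 
   which is assumed here to be what the raw "Body" string looks like *)
garbled = FromCharacterCode[bytes];

(* Reading the character codes back as bytes and decoding them as UTF-8 
   restores the text, as in FromCharacterCode[ToCharacterCode@#Body, "UTF8"] *)
recovered = FromCharacterCode[ToCharacterCode[garbled], "UTF8"];

recovered === original
```

The garbled string has six characters where the original has two, which is why the round trip through ToCharacterCode is needed before decoding.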

1 Answer

8
$\begingroup$

It appears that the preferred approach is to request "BodyByteArray" and decode it with ByteArrayToString:

URLSubmit[
 HTTPRequest["https://zhuanlan.zhihu.com/p/35359905", 
  CharacterEncoding -> "UTF-8"], 
 HandlerFunctions -> <|
   "BodyReceived" -> ((body = 
        ByteArrayToString@#[["BodyByteArray"]]) &)|>, 
 HandlerFunctionsKeys -> {"BodyByteArray"}]
$\endgroup$
1
  • $\begingroup$ It works because ByteArrayToString is "assuming it contains UTF-8 data". In general that may not be the case, and decoding should be based on the information in the response's headers. $\endgroup$ Commented Jul 4, 2022 at 9:05
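
One way to follow the suggestion in the comment above is to read the charset from the Content-Type header and pass a matching encoding name to ByteArrayToString. This is only a sketch under stated assumptions: charsetFromContentType, encodingNames and decodeBody are hypothetical names, the mapping table covers just two charsets, and falling back to UTF-8 for anything unrecognized is a choice made here, not something the thread prescribes.

```mathematica
(* Hypothetical helper: extract the charset parameter from a Content-Type value,
   defaulting to "utf-8" when none is found (a quoted value such as
   charset="utf-8" is not handled and also falls back to the default) *)
charsetFromContentType[contentType_String] := 
 ToLowerCase@First[
   StringCases[contentType, 
    "charset=" ~~ (cs : ((WordCharacter | "-") ..)) :> cs, 
    IgnoreCase -> True], "utf-8"]

(* Assumed mapping from HTTP charset labels to Wolfram Language encoding names;
   extend it for the charsets you expect to meet *)
encodingNames = <|"utf-8" -> "UTF-8", "iso-8859-1" -> "ISO8859-1"|>;

(* Decode a byte array according to the header, falling back to UTF-8 *)
decodeBody[bytes_ByteArray, contentType_String] := 
 ByteArrayToString[bytes, 
  Lookup[encodingNames, charsetFromContentType[contentType], "UTF-8"]]

(* Usage with the Content-Type value quoted in the comments on the question *)
decoded = decodeBody[StringToByteArray["中文"], "text/html; charset=utf-8"]
```

Inside a "BodyReceived" handler, the same decodeBody call could be applied to #BodyByteArray together with the response's Content-Type value, provided that value is made available to the handler.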
