Re: Re: PHP Unicode support design document

From:
Date: Mon, 15 Aug 2005 22:13:01 +0000
Subject: Re: Re: PHP Unicode support design document
Groups: php.internals
If you want to optimize then I guess "remembering" the script_encoding is the only way to do it. We could do it similar to the way we "cache" script file names.
Another option is to just optimize for UTF-8 and use BOMs for UTF-8/UTF-16...

Andi
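
As a rough illustration of the BOM option, a loader could sniff the first bytes of a script before compiling it. This is only a sketch under assumptions: `sketch_detect_bom` is a hypothetical helper, not an existing engine function, and it deliberately ignores UTF-32 BOMs (a UTF-32LE BOM starts with the same two bytes as a UTF-16LE one).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the Zend Engine: guess a script's
 * encoding from a leading byte order mark (BOM). Returns an encoding
 * name, or NULL when no BOM is present, and stores the BOM length in
 * *bom_len so the caller can skip it. UTF-32 is not handled. */
static const char *sketch_detect_bom(const unsigned char *buf, size_t len,
                                     size_t *bom_len)
{
    assert(buf != NULL && bom_len != NULL);

    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0) {
        *bom_len = 3;
        return "UTF-8";
    }
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0) {
        *bom_len = 2;
        return "UTF-16LE";
    }
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0) {
        *bom_len = 2;
        return "UTF-16BE";
    }

    /* No BOM: the caller would fall back to the configured script encoding. */
    *bom_len = 0;
    return NULL;
}
```

A script without a BOM yields NULL, so this could only ever complement, not replace, a remembered or configured script encoding.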

At 03:09 PM 8/15/2005 -0700, Rasmus Lerdorf wrote:
I think the main issue here is that if your script encoding is set to UTF-8 and you do everything in UTF-8, then these large blocks of UTF-8 are going to make a UTF-8 -> UTF-16 -> UTF-8 conversion round trip on every request. It would be nice if we could somehow avoid that.

-Rasmus

Andi Gutmans wrote:
Wouldn't it be easiest to have inline HTML become IS_UNICODE and then not deal with the problem of remembering what the script encoding was? I thought that's what we already do today.

Andi

At 12:37 PM 8/10/2005 -0700, Andrei Zmievski wrote:
I did not have time to write the full reply earlier, so here goes.

Even if we modify the output layer to be aware of the various types of strings coming down the pipe, it would still need to know the encoding of IS_STRINGs in order to convert them to the output encoding. This presents a particular problem for inline HTML blocks: they are supposed to be in the script encoding, but by the time the HTML is sent to the output layer, we no longer know what the source script encoding was for these blocks. This problem exists in the current implementation as well, because the ZEND_ECHO opcode does not keep track of the script encoding. This needs to be fixed, obviously.

One approach could be to implement a separate opcode for inline HTML blocks and store in the opcode the name of the script encoding the block came from. Then, when the output layer (or whatever else) gets to it, we can compare the encoding name in the opcode with the output encoding and transcode if necessary. This does mean that we may need to dynamically open and close converters on each output (if there are different script encodings floating around), but that can be alleviated by keeping some sort of converter cache around.

I am open to other ideas.

-Andrei

On Aug 10, 2005, at 8:34 AM, Andrei Zmievski wrote:
That's not true, actually. 'echo' and 'print' resolve to the ZEND_ECHO opcode, which calls zend_print_variable(), which in turn calls zend_make_printable_zval(). This last function is supposed to take a zval and turn it into a printable string, which is then output using utility_functions->write_function, aka php_body_write(). All that write function cares about is how to output a binary string. So, if we want to bubble the conversion down to the output layer, we probably need to change the write function so that it takes a void* and a type and knows how to deal with each type appropriately.
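
To make the "void* plus a type" idea concrete, here is a minimal sketch of such a write function. Everything in it is an assumption for illustration: the `sketch_` names and the two type tags are hypothetical (they are not the engine's IS_STRING / IS_UNICODE constants), the output encoding is fixed to UTF-8, and output goes into a caller-supplied buffer instead of the real output layer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type tags, standing in for binary vs. Unicode strings. */
typedef enum { SKETCH_BINARY, SKETCH_UNICODE } sketch_str_type;

typedef unsigned short sketch_uchar; /* one UTF-16 code unit */

/* Encode UTF-16 code units as UTF-8 into out (capacity cap bytes).
 * Stops early if the next character would not fit. Returns bytes written. */
static size_t sketch_utf16_to_utf8(const sketch_uchar *src, size_t units,
                                   char *out, size_t cap)
{
    size_t i = 0, n = 0;
    while (i < units) {
        unsigned long cp = src[i++];
        if (cp >= 0xD800 && cp <= 0xDBFF && i < units &&
            src[i] >= 0xDC00 && src[i] <= 0xDFFF) {
            /* Valid surrogate pair: combine into one code point. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i++] - 0xDC00);
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; /* unpaired surrogate: substitute U+FFFD */
        }
        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (n + need > cap) break;
        if (need == 1) {
            out[n++] = (char)cp;
        } else if (need == 2) {
            out[n++] = (char)(0xC0 | (cp >> 6));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[n++] = (char)(0xE0 | (cp >> 12));
            out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[n++] = (char)(0xF0 | (cp >> 18));
            out[n++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    return n;
}

/* Typed write: len counts bytes for SKETCH_BINARY and UTF-16 code units
 * for SKETCH_UNICODE. Binary data passes through untouched; Unicode data
 * is converted to the (assumed UTF-8) output encoding at the last moment. */
static size_t sketch_write(const void *data, size_t len, sketch_str_type type,
                           char *out, size_t cap)
{
    assert(data != NULL && out != NULL);
    switch (type) {
    case SKETCH_BINARY:
        if (len > cap) len = cap;
        memcpy(out, data, len);
        return len;
    case SKETCH_UNICODE:
        return sketch_utf16_to_utf8((const sketch_uchar *)data, len, out, cap);
    }
    return 0;
}
```

The point of the design is that callers never pre-convert: the type tag travels with the data, and only the write function decides whether a conversion is needed.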

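The separate-opcode proposal from the 12:37 PM message (store the script encoding name with each inline HTML block, and keep a converter cache) might be sketched as follows. All names here are hypothetical, the converter struct is only a stand-in for a real handle such as an ICU UConverter, and encoding names are compared with a plain `strcmp`, which ignores aliases and case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_CACHE_SLOTS 4

/* Stand-in for a real converter handle; only records which encoding it serves. */
typedef struct {
    char name[32];
    int in_use;
} sketch_converter;

typedef struct {
    sketch_converter slots[SKETCH_CACHE_SLOTS];
    int opens;          /* how many times a converter had to be "opened" */
    size_t next_evict;  /* round-robin eviction cursor */
} sketch_converter_cache;

/* Hypothetical inline-HTML op: the text plus the name of the script
 * encoding it was written in. */
typedef struct {
    const char *html;
    const char *script_encoding;
} sketch_inline_html_op;

/* Return a cached converter for enc, opening one on a miss. */
static sketch_converter *sketch_cache_get(sketch_converter_cache *c,
                                          const char *enc)
{
    size_t i;
    assert(c != NULL && enc != NULL);

    for (i = 0; i < SKETCH_CACHE_SLOTS; i++) {
        if (c->slots[i].in_use && strcmp(c->slots[i].name, enc) == 0)
            return &c->slots[i]; /* hit: no open/close needed */
    }
    /* Miss: take a free slot, or evict one round-robin when full. */
    for (i = 0; i < SKETCH_CACHE_SLOTS; i++) {
        if (!c->slots[i].in_use)
            break;
    }
    if (i == SKETCH_CACHE_SLOTS) {
        i = c->next_evict;
        c->next_evict = (c->next_evict + 1) % SKETCH_CACHE_SLOTS;
    }
    strncpy(c->slots[i].name, enc, sizeof c->slots[i].name - 1);
    c->slots[i].name[sizeof c->slots[i].name - 1] = '\0';
    c->slots[i].in_use = 1;
    c->opens++;
    return &c->slots[i];
}

/* Compare the op's script encoding with the output encoding. Returns NULL
 * when the block can be passed through as-is, else a cached converter. */
static sketch_converter *sketch_converter_for(sketch_converter_cache *c,
                                              const sketch_inline_html_op *op,
                                              const char *output_encoding)
{
    assert(op != NULL && output_encoding != NULL);
    if (strcmp(op->script_encoding, output_encoding) == 0)
        return NULL;
    return sketch_cache_get(c, op->script_encoding);
}
```

With a zero-initialized cache, two consecutive blocks in the same non-output encoding share one converter, and a block already in the output encoding needs none, which also covers the all-UTF-8 case Rasmus raises.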
