On 14 May 2014, at 18:10, Andrey Hristov <php@hristov.com> wrote:
> This is purely academic. And the standard library has to support everything; it's the
> standard library. PHP is on its own, and if an addition is of little use to most
> developers/scripts, why the heck should it be in by default?
> A good solution is to typedef a php_size_t, default it to uint32_t, and let those who need more
> than 4GB in strings and elements build with size_t as the definition. Offer the choice,
> don't force.
It is not just “purely academic”. Here, let me quote Pierre (in “Re: [PHP-DEV] [VOTE] [RFC]
64 bit platform improvements for string length and integer”, just now) quoting Anthony:
> This thread has been pointed out to me by a few people. As the
> originator of this patch and concept I feel that I should clarify a
> few points.
>
> # Rationale
>
> The reason that I originally started this patch was to clean up and
> standardize the underlying types. This is to introduce predictability,
> portability and type sanity into the engine and entire cphp
> implementation.
>
> ## Rationale for Int64:
>
> Without this patch, the size of integers (longs) varies based on which
> compiler you use. This means that even for identical target
> architectures behavior can change with respect to userland code.
> Refactoring this allows for consistent sizes that can be relied upon
> by the programmer. This is an effort to make it a bit easier to rely
> on integer width as a developer.
>
> And ideally this comes at no cost to most implementations, since ints
> are already 64 bits wide, so there is no memory overhead. And
> performance stays the same as well.
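
To make the compiler dependence concrete, here is a minimal sketch of my own (not code from the patch). It reports the width of `long`, which is 32 bits under the LLP64 model used by 64-bit Windows and 64 bits under the LP64 model used by 64-bit Linux, next to a fixed-width type. The alias `example_int` is a hypothetical name used only for illustration, and `int64_t` assumes a C99 `<stdint.h>`.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Width of the platform's `long` in bits. This depends on the data model:
 * 32 under LLP64 (64-bit Windows), 64 under LP64 (64-bit Linux). */
static size_t long_bits(void)
{
    return sizeof(long) * CHAR_BIT;
}

/* Hypothetical alias for illustration only; not the name used by the patch. */
typedef int64_t example_int;

/* Width of the fixed-width alias in bits; the same on every platform
 * that provides int64_t. */
static size_t example_int_bits(void)
{
    return sizeof(example_int) * CHAR_BIT;
}
```

Userland code that depends on the first function's result behaves differently across compilers; code that depends on the second does not.
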
>
> ## Rationale for size_t (string lengths):
>
> This has significant advantages. There are some costs to doing it, but
> they are not as significant as they may appear on the surface. Let's
> dive into it:
>
> ### It's The Correct Data Type
>
> The C89 Rationale indicates in 3.3.3.4 (
> http://port70.net/~nsz/c/c89/rationale/c3.html#size-95t-3-3-3-4
> ) that
> the size_t type was created specifically for usage in this context. It
> is always guaranteed to be able to hold the size of any possible
> array, and strings in C are simply char arrays.
> Therefore, the correct data type to use for string sizes (which really
> are just an offset qualifier) is size_t.
>
> Additionally, calloc, malloc, etc. all expect parameters of type size_t
> for exactly this reason.
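
As a small illustration of that point (my sketch, assuming nothing about the engine's real string API), here is a helper whose length parameter has the same type that `strlen` returns and that `malloc` and `memcpy` accept, so no conversion happens anywhere on the path. The name `dup_bytes` is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy `len` bytes of `src` into a fresh,
 * NUL-terminated buffer. Returns NULL on failure; the caller frees. */
static char *dup_bytes(const char *src, size_t len)
{
    char *copy;

    if (len == (size_t)-1) {
        return NULL;            /* len + 1 would wrap around to 0 */
    }
    copy = malloc(len + 1);     /* malloc takes a size_t */
    if (copy == NULL) {
        return NULL;
    }
    memcpy(copy, src, len);     /* memcpy takes a size_t */
    copy[len] = '\0';
    return copy;
}
```

A call such as `dup_bytes(s, strlen(s))` passes the length through unchanged, because `strlen` already returns a size_t.
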
>
> Another good reference on it: http://www.viva64.com/en/a/0050/
>
> ### It's The Secure Data Type
>
> size_t (and ptrdiff_t) are the only C89 types that are 100% guaranteed
> to be able to hold the size of any possible object that the compiler
> will support. Other types will vary depending on the data model that
> the compiler supports, as the spec only defines minimum widths.
>
> This is so important that CERT issued a coding standard for it:
> INT01-C ( https://www.securecoding.cert.org/confluence/display/seccode/INT01-C.+Use+rsize_t+or+size_t+for+all+integer+values+representing+the+size+of+an+object
> ).
>
> One of the reasons is that it's difficult to do overflow checks in a
> portable way. See VU#162289: https://www.kb.cert.org/vuls/id/162289 .
> In there, they recommend using the C99 uintptr_t type, but suggest
> using size_t for platforms that don't have uintptr_t support (and
> since we target C89 for the engine, uintptr_t is out).
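
For what a portable check can look like, here is a sketch of my own (not taken from the CERT note or from the patch). It relies only on unsigned arithmetic, which is defined to wrap, and on `(size_t)-1` being the largest size_t value, so it works in C89. The function name `checked_add_size` is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: add two sizes without wrapping around.
 * Returns 1 and stores a + b in *out, or returns 0 and leaves *out
 * untouched if the sum does not fit in a size_t. */
static int checked_add_size(size_t a, size_t b, size_t *out)
{
    /* (size_t)-1 is the maximum size_t; the subtraction cannot wrap. */
    if (a > (size_t)-1 - b) {
        return 0;
    }
    *out = a + b;
    return 1;
}
```

The test happens before the addition, so there is never a wrapped result to detect after the fact.
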
>
> Apple's Secure Coding Guide's section on Avoiding Integer Overflows
> and Underflows says the same thing:
> https://developer.apple.com/library/mac/documentation/security/conceptual/securecodingguide/Articles/BufferOverflows.html
>
> ### About Long Strings
>
> The fact that changing to size_t allows strings (and arrays) to be >
> 4GB is a side effect. A welcome one, but a side effect nonetheless.
> The primary reason to use it is that it's the correct data type, and
> gives you the most safety and security.
>
> # Response To Concerns Mentioned
>
> I'll respond here to some of the concerns mentioned in this thread:
>
> ## size_t uses more memory and will result in more CPU cache misses, which will result in worse performance
>
> Well, size_t will use more memory. No doubt about that.
>
> But the performance side is more nuanced. And as several benchmarks in
> this thread indicate, there isn't a practical difference. Heck, the
> benchmarks on Windows show an improvement in some cases.
>
> And there is a reason for that. Since a pointer is a 64 bit data type,
> and an int is a 32 bit data type, any time you add the two, extra CPU
> cycles are needed for the cast that widens the int. This can be
> clearly seen by analyzing a simple malloc call with an int vs a size_t
> param. Here's the diff:
>
> < movl $5, -12(%rbp)
> < movl -12(%rbp), %eax
> < cltq
> ---
> > movq $5, -16(%rbp)
> > movq -16(%rbp), %rax
>
> Now, a cache miss is much more expensive than a cast, but we don't
> have proof that cache misses will actually occur.
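
Anthony did not post the C source behind that diff, so the following is only my guess at the kind of code that produces it: two functions that differ solely in the type of the variable passed to `malloc`. The function names are hypothetical. Compiling each without optimisation and asking the compiler for assembly (for example with `-S`) should show the sign-extension step in the int version on x86-64, though the exact output depends on the compiler and flags.

```c
#include <stdlib.h>

/* The int must be widened to size_t before the call; on x86-64 this is
 * the sign-extension instruction visible in the first half of the diff. */
void *alloc_with_int(void)
{
    int n = 5;
    return malloc(n);
}

/* The size_t is already the width malloc expects; no conversion needed. */
void *alloc_with_size_t(void)
{
    size_t n = 5;
    return malloc(n);
}
```
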
>
> In fact, in the benchmarks, the worst difference is 2%. Which is
> hardly significant (as indicated by several people here). But also
> notice that in both benchmarks (those done by Microsoft, and those
> done by Dmitry), some specific tests actually executed **faster** with
> the size_t transforms (namely Hello World, WordPress, etc). So to say
> even 2% is not really the full story.
>
> We'll come back to the memory thing in a bit.
>
> ## Macro Renames and ZPP changes
>
> This was my idea, and I don't think it's been properly justified.
>
> ### ZPP Changes
>
> The ZPP changes are critical. The reason is that varargs is casting an
> arbitrary block of memory to a type, and then writing to it. So
> existing code that does zpp("s", str, &int_len) would wind up with a
> buffer overflow. Because zpp would be trying to write a 64 bit value
> to a 32 bit container. The other 32 bits would fall off the end, into
> who knows what. At BEST this can result in a segfault. At worst,
> memory corruption and MASSIVE security vulnerabilities.
>
> Also note that the compiler *can't* and actively doesn't catch these
> types of errors. That means that it's largely luck and testing that
> will uncover them.
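
Here is a deliberately safe model of that failure, written by me for illustration; it is not the real zpp code. A byte array stands in for the caller's memory: bytes 0 to 3 play the 32-bit length variable and bytes 4 to 7 play whatever happens to sit next to it. The hypothetical `store_len64` mimics a callee that only receives an untyped pointer and always writes 64 bits through it.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a varargs-based parser: it sees only an
 * untyped pointer and always stores a 64-bit length through it. */
static void store_len64(void *dest, uint64_t len)
{
    memcpy(dest, &len, sizeof len);
}

/* Models 8 bytes of caller memory. Bytes 0-3 are the caller's 32-bit
 * length slot, bytes 4-7 are an unrelated neighbour.
 * Returns 1 if the neighbour was modified by the store, 0 otherwise. */
static int neighbour_clobbered(void)
{
    unsigned char frame[8];
    unsigned char before[4];

    memset(frame, 0xAA, sizeof frame);
    memcpy(before, frame + 4, sizeof before);

    store_len64(frame, 5);      /* caller only reserved frame[0..3] */

    return memcmp(before, frame + 4, sizeof before) != 0;
}
```

In this model the write stays inside an 8-byte array, so the behaviour is defined and the damage is observable. In real code the neighbouring bytes belong to some other object, and nothing in the call gives the compiler a type to check against.
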
>
> So, I chose to break BC and rename the ZPP symbols. Because that WILL
> error, and provide the developer with a meaningful indication that an
> improper data type was provided. I considered a fatal error stating
> that an invalid type was supplied to be a better way of telling the
> developer that "HEY, THIS NEEDS TO BE CHANGED ASAP" than just letting
> them hit random segfaults at runtime.
>
> If there is a way to get around this by giving the compiler more
> information, then do it. But to just leave the types there, and leave
> it to chance if a buffer overflow occurs, is dangerous. Which is why I
> made the call that the ZPP types **needed** to be changed.
>
> ### Macro Renames
>
> The reason for the rename is largely the same as with the ZPP changes.
> The severity of not changing is less (since the compiler will warn and
> do an implicit cast for you). But it's still there. Which is why I
> chose to change it. This is less critical, but was done to better
> indicate to the developer what needs to change to properly support the
> new system.
>
> ## Memory Overhead
>
> This is definitely a concern. There is a potential to double the
> amount of memory that PHP takes. Which on the surface looks enormous.
> And if we stop at the surface, we definitely shouldn't do it!
>
> But as we look deeper, we see that in actuality, the difference is not
> double. In fact, most data structures, as identified by Dmitry
> himself, only increase by between 6% (zend_op_array) and 50%
> (zend_string's size). So that "double" figure quickly drops.
>
> But that's at the structure level. Let's look at what actually happens
> in practice. Dmitry himself also provides these answers. The average
> memory increase is 8% for WordPress, and 6% for ZF1.
>
> Let's put that 8% in context. WordPress used 12MB, and now it uses
> 13MB. 1MB more. That's not overly significant. ZF used 29MB. Now it
> uses 31MB. Still not overly significant.
>
> Don't get me wrong, it's still more. And more is bad. But it's not
> nearly as bad as it's being played out to be.
>
> To put this into context, 5.4 saved up to 50% memory from 5.3
> (depending on benchmark). 8 << 50.
>
> Now, I'm not saying that memory should be thrown around willy-nilly.
> But given the rationale that I gave above, I think the benefits of
> sanity, portability and security clearly are significant enough for
> the relatively small cost in memory.
--
Andrea Faulds
http://ajf.me/