Re: [VOTE] [RFC] 64 bit platform improvements for string length and integer

From: Andrey Hristov Date: Wed, 14 May 2014 17:28:21 +0000

Subject: Re: [VOTE] [RFC] 64 bit platform improvements for string length and integer

References: 1 2 3 4 5 6 7 8 9 10 Groups: php.internals

Request: Send a blank email to internals+get-74199@lists.php.net to get a copy of this message

On 14.05.2014 20:13, Andrea Faulds wrote:

On 14 May 2014, at 18:10, Andrey Hristov <php@hristov.com> wrote:

This is purely academic. And the standard library has to support
everything, it's the standard library. PHP is on its own, and if an
addition is of little use to most developers/scripts, why the heck
should it be in by default?
A good solution is to typedef a php_size_t, leave it as uint32_t, and
those who need more than 4GB in strings and elements can just build
with size_t as the definition. Offer the choice, don't force.

It is not just “purely academic”. Here, let me quote Pierre (in 'Re:
[PHP-DEV] [VOTE] [RFC] 64 bit platform improvements for string length
and integer’, just now) quoting Anthony:

PHP is not a general-purpose library for writing applications; it's an environment of its own with its own specifics. For a general-purpose library, size_t is the only way to go. Using size_t in a C-based chat application that exchanges 140-byte messages is not needed. The MySQL C/S protocol uses length encoding to lower memory usage.
The API can take size_t, but increasing the size of a very core and very often allocated structure is rarely a good thing. Those who want to be fully compliant can build and stay with php_size_t being an alias of size_t, but those who don't need it (think big installations) have the choice not to bloat their machines.

Choice...

Andrey

This thread has been pointed out to me by a few people. As the
originator of this patch and concept I feel that I should clarify a
few points.

# Rationale

The reason that I originally started this patch was to clean up and
standardize the underlying types. This is to introduce predictability,
portability and type sanity into the engine and entire cphp
implementation.

## Rationale for Int64:

Without this patch, the size of integers (longs) varies based on which
compiler you use. This means that even for identical target
architectures behavior can change with respect to userland code.
Refactoring this allows for consistent sizes that can be relied upon
by the programmer. This is an effort to make it a bit easier to rely
on integer width as a developer.

And ideally this costs most implementations nothing: ints are already
64 bits wide there, so there is no memory overhead, and performance
stays the same as well.
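
To make the compiler dependence concrete: `long` is 64 bits under the
LP64 data model (e.g. 64-bit Linux) but 32 bits under LLP64 (64-bit
Windows). The following is only a sketch of the idea, not the patch
itself. `php_int_t` is a hypothetical name chosen for illustration,
and the snippet uses C99's <stdint.h> for brevity.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical alias, for illustration only: a fixed-width integer
 * whose size does not depend on the compiler's data model. */
typedef int64_t php_int_t;

/* Width in bits of the native long. This varies by data model:
 * 64 under LP64, 32 under LLP64. */
static unsigned native_long_bits(void)
{
    return (unsigned)(sizeof(long) * CHAR_BIT);
}

/* Width in bits of the fixed-width alias. This is 64 wherever
 * int64_t exists, regardless of the compiler. */
static unsigned fixed_int_bits(void)
{
    return (unsigned)(sizeof(php_int_t) * CHAR_BIT);
}
```

Comparing the two return values on different toolchains shows the
inconsistency that the refactoring is meant to remove.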

## Rationale for size_t (string lengths):

This has significant advantages. There are some costs to doing it, but
they are not as significant as they may appear on the surface. Let's
dive into it:

### It's The Correct Data Type

The C89 Rationale indicates in section 3.3.3.4 (
http://port70.net/~nsz/c/c89/rationale/c3.html#size-95t-3-3-3-4 ) that
the size_t type was created specifically for usage in this context. It
is always, 100% guaranteed to be able to hold the bounds of every
possible array element. Strings in C are simply char arrays.
Therefore, the correct data type to use for string sizes (which really
are just an offset qualifier) is size_t.

Additionally, calloc, malloc, etc. all expect parameters of type
size_t for exactly this reason.
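
As a small illustration of how the type flows through the standard
library, here is a sketch in which the length never leaves size_t.
`dup_string` is a hypothetical helper, not an engine function.

```c
#include <stdlib.h>
#include <string.h>

/* Duplicate a C string, keeping its length in size_t end to end.
 * Returns NULL on allocation failure; the caller frees the result.
 * Hypothetical helper, for illustration only. */
static char *dup_string(const char *src, size_t *len_out)
{
    size_t len = strlen(src);       /* strlen returns size_t */
    char *copy = malloc(len + 1);   /* malloc takes size_t   */

    if (copy == NULL) {
        return NULL;
    }
    memcpy(copy, src, len + 1);     /* memcpy takes size_t; copies the NUL too */
    if (len_out != NULL) {
        *len_out = len;
    }
    return copy;
}
```

No conversion happens anywhere between measuring the string and
allocating for it, which is the point of using the type the library
already uses.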

Another good reference on it: http://www.viva64.com/en/a/0050/

### It's The Secure Data Type

size_t (and ptrdiff_t) are the only C89 types that are 100% guaranteed
to be able to hold the size of any possible object that the compiler
will support. Other types will vary depending on the data model that
the compiler supports, as the spec only defines minimum widths.

This is so important that CERT issued a coding standard for it:
INT01-C (
https://www.securecoding.cert.org/confluence/display/seccode/INT01-C.+Use+rsize_t+or+size_t+for+all+integer+values+representing+the+size+of+an+object
).

One of the reasons is that it's difficult to do overflow checks in a
portable way. See VU#162289: https://www.kb.cert.org/vuls/id/162289 .
In there, they recommend using the C99 uintptr_t type, but suggest
using size_t on platforms that don't have uintptr_t support (and
since we target C89 for the engine, uintptr_t is out).

Apple's Secure Coding Guide's section on Avoiding Integer Overflows
and Underflows says the same thing:
https://developer.apple.com/library/mac/documentation/security/conceptual/securecodingguide/Articles/BufferOverflows.html
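
For reference, overflow checks on size_t can be written portably
because unsigned arithmetic is well defined. The helpers below are a
minimal sketch with hypothetical names, not code from the patch or
from the linked documents.

```c
#include <stddef.h>

/* SIZE_MAX is C99; in C89 the same value can be spelled (size_t)-1. */
#define SKETCH_SIZE_MAX ((size_t)-1)

/* Stores a + b in *out and returns 1 if the sum fits in size_t;
 * returns 0 and leaves *out untouched if it would wrap. */
static int safe_add_size(size_t a, size_t b, size_t *out)
{
    if (a > SKETCH_SIZE_MAX - b) {
        return 0;
    }
    *out = a + b;
    return 1;
}

/* Stores a * b in *out and returns 1 if the product fits in size_t;
 * returns 0 and leaves *out untouched if it would wrap. */
static int safe_mul_size(size_t a, size_t b, size_t *out)
{
    if (b != 0 && a > SKETCH_SIZE_MAX / b) {
        return 0;
    }
    *out = a * b;
    return 1;
}
```

Both checks test the operands before doing the arithmetic, so they do
not rely on inspecting a result that has already wrapped.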

### About Long Strings

The fact that changing to size_t allows strings (and arrays) to be >
4 GB is a side effect. A welcome one, but a side effect nonetheless.
The primary reason to use it is that it's the correct data type, and
gives you the most safety and security.

# Response To Concerns Mentioned

I'll respond here to some of the concerns mentioned in this thread:

## size_t uses more memory and will result in more CPU cache misses, which will result in worse performance

Well, size_t will use more memory. No doubt about that.

But the performance side is more nuanced. And as several benchmarks in
this thread indicate, there isn't a practical difference. Heck, the
benchmarks on Windows show an improvement in some cases.

And there is a reason for that. Since a pointer is a 64 bit data type
and an int is a 32 bit data type, any time you add the two, extra CPU
cycles are needed for the cast. This can be clearly seen by analyzing
a simple malloc call with an int vs a size_t param. Here's the diff:

< movl $5, -12(%rbp)
< movl -12(%rbp), %eax
< cltq
---
> movq $5, -16(%rbp)
> movq -16(%rbp), %rax

Now, a cache miss is much more expensive than a cast, but we don't
have proof that cache misses will actually occur.
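
The C source behind that diff is not included in this message. A pair
of functions along the following lines, compiled without optimization
for x86-64, would plausibly produce it. The function names are
hypothetical.

```c
#include <stdlib.h>

/* Length held in an int: the 32-bit value has to be sign-extended
 * to 64 bits (the cltq instruction) before it is passed to malloc,
 * whose parameter is a size_t. */
static void *alloc_with_int(void)
{
    int len = 5;
    return malloc(len);
}

/* Length held in a size_t: the value is already as wide as the
 * parameter, so it is loaded and passed without any widening. */
static void *alloc_with_size_t(void)
{
    size_t len = 5;
    return malloc(len);
}
```

With optimization enabled a compiler would fold the constant in both
versions, so the difference only matters where the length is not
known at compile time.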

In fact, in the benchmarks, the worst difference is 2%. Which is
hardly significant (as indicated by several people here). But also
notice that in both benchmarks (those done by Microsoft, and those
done by Dmitry), some specific tests actually executed **faster** with
the size_t transforms (namely Hello World, Wordpress, etc). So to say
even 2% is not really the full story.

We'll come back to the memory thing in a bit.

## Macro Renames and ZPP changes

This was my idea, and I don't think it's been properly justified.

### ZPP Changes

The ZPP changes are critical. The reason is that varargs is casting an
arbitrary block of memory to a type, and then writing to it. So
existing code that does zpp("s", str, &int_len) would wind up with a
buffer overflow. Because zpp would be trying to write a 64 bit value
to a 32 bit container. The other 32 bits would fall off the end, into
who knows what. At BEST this can result in a segfault. At worst,
memory corruption and MASSIVE security vulnerabilities.

Also note that the compiler *can't* and doesn't catch these types of
errors. That means that it's largely down to luck and testing whether
they are found.
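
The hazard can be simulated without invoking undefined behavior. The
sketch below does not use the real ZPP API. It stands in for it with
a hypothetical struct that models two adjacent variables, and a
hypothetical callee that writes sizeof(size_t) bytes where the caller
only set aside an unsigned int.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical layout standing in for two adjacent variables
 * in the caller. */
struct caller_frame {
    unsigned int len;       /* what old code would pass the address of */
    unsigned int neighbor;  /* unrelated data that sits next to it     */
    unsigned char pad[16];  /* keeps the simulated write in bounds     */
};

/* Simulates a callee that believes it was handed a size_t *: it
 * copies sizeof(size_t) bytes to the start of the frame, no matter
 * how wide the caller's variable really is. */
static void store_length_as_size_t(struct caller_frame *frame, size_t value)
{
    memcpy(frame, &value, sizeof value);
}

/* Returns 1 if the write spilled past len and changed neighbor. */
static int neighbor_was_clobbered(void)
{
    struct caller_frame frame;

    memset(&frame, 0, sizeof frame);
    frame.neighbor = 0xABCDu;
    store_length_as_size_t(&frame, (size_t)-1);  /* all bits set */
    return frame.neighbor != 0xABCDu;
}
```

Where size_t is wider than unsigned int, `neighbor_was_clobbered()`
reports that the neighboring value was overwritten. In real code the
neighbor is whatever happens to sit next to the length on the stack.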

So, I chose to break BC and rename the ZPP symbols, because that WILL
error and give the developer a meaningful indication that an improper
data type was provided. I considered a fatal error saying that an
invalid type was supplied to be a better way of telling the developer
"HEY, THIS NEEDS TO BE CHANGED ASAP" than just letting them hit
random segfaults at runtime.

If there is a way to get around this by giving the compiler more
information, then do it. But to just leave the types there, and leave
it to chance if a buffer overflow occurs, is dangerous. Which is why I
made the call that the ZPP types **needed** to be changed.

### Macro Renames

The reason for the rename is largely the same as with the ZPP changes.
The severity of not changing is less (since the compiler will warn and
do an implicit cast for you). But it's still there. Which is why I
chose to change it. This is less critical, but was done to better
indicate to the developer what needs to change to properly support the
new system.
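
To show what the implicit cast can silently do, here is a sketch of a
round-trip check. `fits_in_uint` is a hypothetical helper and is not
one of the renamed macros.

```c
#include <limits.h>
#include <stddef.h>

/* Returns 1 if len survives a round trip through unsigned int,
 * 0 if narrowing would change its value. Hypothetical helper. */
static int fits_in_uint(size_t len)
{
    /* Conversion to an unsigned type is well defined: the value is
     * reduced modulo UINT_MAX + 1, so high bits are simply dropped. */
    unsigned int narrowed = (unsigned int)len;

    return (size_t)narrowed == len;
}
```

On a platform where size_t is wider than unsigned int, a length of
UINT_MAX + 1 narrows to 0 without any runtime error, which is why a
warning alone is easy to overlook.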

## Memory Overhead

This is definitely a concern. There is a potential to double the
amount of memory that PHP takes. Which on the surface looks enormous.
And if we stop at the surface, we definitely shouldn't do it!

But as we look deeper, we see that in actuality, the difference is not
double. In fact, most data structures, as identified by Dmitry
himself, only increase by between 6% (zend_op_array) and 50%
(zend_string's size). So that "double" figure quickly drops.
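
Part of the reason is alignment padding. The structs below are
hypothetical and are not the real engine structures. They only show
that widening one field does not always widen the struct.

```c
#include <stddef.h>

/* Pointer plus length. On a typical 64-bit target the int version
 * already carries tail padding after len. */
struct ptr_len_int    { char *val; int len; };
struct ptr_len_size_t { char *val; size_t len; };

/* Counter plus length. Here widening len usually does grow the
 * struct on a 64-bit target. */
struct cnt_len_int    { unsigned int refcount; int len; };
struct cnt_len_size_t { unsigned int refcount; size_t len; };

/* Extra bytes needed when the pointer+length pair uses size_t. */
static size_t ptr_len_growth(void)
{
    return sizeof(struct ptr_len_size_t) - sizeof(struct ptr_len_int);
}

/* Extra bytes needed when the counter+length pair uses size_t. */
static size_t cnt_len_growth(void)
{
    return sizeof(struct cnt_len_size_t) - sizeof(struct cnt_len_int);
}
```

How much a given structure grows depends on its neighboring fields
and on the platform's alignment rules, which is consistent with the
spread between the 6% and 50% figures.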

But that's at the structure level. Let's look at what actually happens
in practice. Dmitry himself also provides these answers. The average
memory increase is 8% for Wordpress, and 6% for ZF1.

Let's put that 8% in context. Wordpress used 12MB, and now it uses
13MB. 1MB more. That's not overly significant. ZF used 29MB. Now it
uses 31MB. Still not overly significant.

Don't get me wrong, it's still more. And more is bad. But it's not
nearly as bad as it's being played out to be.

To put this into context, 5.4 saved up to 50% memory from 5.3
(depending on benchmark). 8 << 50.

Now, I'm not saying that memory should be thrown around willy-nilly.
But given the rationale that I gave above, I think the benefits of
sanity, portability and security clearly are significant enough for
the relatively small cost in memory.

--
Andrea Faulds
http://ajf.me/

