Sunday, June 18. 2006

How long is a piece of string?

Sunday morning I was asked by an IRC regular: "Where does the engine parse quoted strings?". Being a Sunday morning, I began to launch into a sermon on the distinction between CONSTANT_ENCAPSED_STRING and the problems which befall a single-pass compiler when you start to introduce interpolation. Not precisely what he asked, but an important component in answering his question. Unfortunately, at the time I was busy watching the Brasil-Australia game so I didn't go into the kind of detail I would have. Now, some 12 hours later, since Angela is off buying toe-socks in Santa Cruz, I'll bore anyone with little enough life to read my blog by explaining the pitfalls of using PHP's string interpolation without using an optimizer.

To start things off, let's take a page from my earlier discourse on Compiled Variables and look at the opcodes generated by a few simple PHP scripts:

    <?php echo "This is a constant string"; ?>

Yields the nice, simple opcode:

    ECHO 'This is a constant string'

No problem... Exactly what you'd expect... Now let's complicate the expressions a little:

    <?php echo "This is an interpolated $string"; ?>

Yields the surprisingly messy instruction set:

    INIT_STRING ~0
    ADD_STRING ~0 ~0 'This'
    ADD_STRING ~0 ~0 ' '
    ADD_STRING ~0 ~0 'is'
    ADD_STRING ~0 ~0 ' '
    ADD_STRING ~0 ~0 'an'
    ADD_STRING ~0 ~0 ' '
    ADD_STRING ~0 ~0 'interpolated'
    ADD_STRING ~0 ~0 ' '
    ADD_VAR ~0 ~0 !0
    ECHO ~0

Where !0 represents the compiled variable named $string. Looking at these opcodes: INIT_STRING allocates an IS_STRING variable of one byte (to hold the terminating NULL). Then it's realloc'd to five bytes by the first ADD_STRING ('This' plus the terminating NULL). Next it's realloc'd to six bytes in order to add a space, then again to eight bytes for 'is', then nine to add a space, and so on, until the contents of the interpolated variable are finally copied in. Only then is the temporary string used by the echo statement and discarded.

Now let's rewrite that line to avoid interpolation and use concatenation instead:

    <?php echo "This is a concatenated " . $string; ?>

Which yields the significantly shorter and simpler set of ops:

    CONCAT ~0 'This is a concatenated ' !0
    ECHO ~0

A vast improvement already, but this version still creates a temporary IS_STRING variable to hold the combined string contents, meaning that data is duplicated when it's being used in a const context anyway. Now let's try out this oft-overlooked use of the echo statement:

    <?php echo "This is a stacked echo " , $string; ?>

Look closely, there is a meaningful difference from the last one. This time we're using a comma rather than a dot between the operands. If you don't know what the comma is doing there, ask the manual then check back here. Here are the resulting opcodes:

    ECHO 'This is a stacked echo '
    ECHO !0

Same number of opcodes, but this time no temporary variables are being created so there's no duplication and no pointless copying (unless of course $string wasn't of type IS_STRING, in which case it does have to be converted for output, but don't get picky now).

Think this is bad? Consider the average heredoc string which spans several lines of prepared output embedding perhaps a handful of variables along the way. Here's one of several such blocks found in run-tests.php within the PHP distribution source tree:

    <?php
    echo <<<NO_PCRE_ERROR
    +-----------------------------------------------------------+
    |                       ! ERROR !                           |
    | The test-suite requires that you have pcre extension      |
    | enabled. To enable this extension either compile your PHP |
    | with --with-pcre-regex or if you've compiled pcre as a    |
    | shared module load it via php.ini.                        |
    +-----------------------------------------------------------+
    NO_PCRE_ERROR;
    ?>

Notice that we're not even embedding variables to be interpolated here, yet does this come out to a simple, single opcode? Nope, because the rules necessary to catch a heredoc's end token demand the same careful examination as double-quoted variable substitution and you wind up (in this case) with SEVENTY-EIGHT opcodes! One INIT_STRING, 76 ADD_STRINGs, and a final ECHO. That means a malloc, 76 reallocs, and a free which will be executed every time that code snippet comes along. Even the original contents take up more memory because they're stored in 76 distinct zval/IS_STRING structures.

Why does this happen? Because there are about a dozen ways that a variable can be hidden inside an interpolated string. Similarly, when looking for a heredoc end-token, the token can be an arbitrary length, containing any of the label characters, and may or may not sit on a line by itself. Put simply, it's too difficult to encompass in one regular expression.

The engine could perform a second pass during compilation, however the time saved reassembling these strings will typically be about the same amount of time spent actually processing them during runtime (if one assumes that each instance will execute exactly once). Rather than complicate the build process (potentially slowing down overall run-times in the process), the compiler leaves this optimization step to opcode caches, which gain far more from the work: they clean up this mess once, cache the results, and reuse the faster, leaner versions on all subsequent runs.

If you're using APC, you'll find just such an optimizer built in, but not enabled by default. To turn it on, you'll need to set apc.optimization=on in your php.ini.
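To make the allocation pattern behind INIT_STRING/ADD_STRING a little more concrete, here is a rough userland sketch. It does not reproduce the engine's lexer: split_fragments() and simulate_appends() are hypothetical helpers invented for this illustration, and the word/non-word split is only an approximation of how the opcode listing for the interpolated example breaks its literal apart. It assumes PHP 7 or later for the type declarations.

```php
<?php
// Hypothetical helper (not part of PHP or the Zend Engine): split a literal
// into alternating runs of word and non-word characters, approximating the
// fragments seen in the ADD_STRING listing for the interpolated example.
function split_fragments(string $literal): array
{
    preg_match_all('/\w+|\W+/', $literal, $matches);
    return $matches[0];
}

// Hypothetical helper: track how large the temporary buffer is after each
// step, starting from the one-byte buffer that INIT_STRING allocates.
function simulate_appends(array $fragments): array
{
    $size  = 1;          // INIT_STRING: room for the terminator only
    $sizes = [$size];
    foreach ($fragments as $fragment) {
        $size   += strlen($fragment); // each ADD_STRING grows the buffer
        $sizes[] = $size;
    }
    return $sizes;
}

$fragments = split_fragments('This is an interpolated ');
$sizes     = simulate_appends($fragments);

echo count($fragments), " ADD_STRING ops\n";
echo implode(' -> ', $sizes), "\n";
```

For the literal part of the interpolated example this reports 8 ADD_STRING ops and the size progression 1 -> 5 -> 6 -> 8 -> 9 -> 11 -> 12 -> 24 -> 25, which lines up with the five, six, eight and nine byte steps described for that example. The concatenation and stacked-echo forms skip this progression entirely because the literal stays in one piece.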
In addition to stitching these run-on opcodes back together, the optimizer will also add run-time speed-ups like pre-resolving persistent constants to their actual values, folding static scalar expressions (like 1 + 1) to their fixed results (e.g. 2), and simpler stuff like avoiding the use of JMP when the target is the next opcode, or boolean casts when the original expression is known to be a boolean value. (It should be noted that these speed-ups also break some of the runtime-manipulation features of runkit, but that was stuff you... probably should have been doing anyway.)

Can't use an optimizer because your webhost doesn't know how to set php.ini options? You can still avoid 90% of the INIT_STRING/ADD_STRING dilemma by simply using single quotes and concatenation (or commas when dealing with echo statements). It's a simple trick and one which shouldn't harm maintainability too much, but on a large, complicated script, you just might see an extra request or two per second.

Comments
Oooooh yeah ! We definitely need more article like this, to understand how PHP works. There are so many pointless benchmarks everywhere showing some small-speed difference between the different string syntaxes, with no explanation given.
Indeed, such articles add more value to hardcore PHP developer. Keep it going Sara! Now I'm getting really curious, I will probably buy your book to advance myself in this area
Definitely a very interesting article.
It would also be nice to know whether there is a difference between single and double quotes because there is always a lot of discussion going on about that, too. (And I don't know how to find out the opcodes and stuff.)

Short version: No. The margin between single and double quoted is on the order of nanoseconds. The only thing that gives single-quote any kind of edge over double-quote (assuming they're both constant/non-interpolated strings) is the fact that double quoted strings are subject to more substitution rules (\r\n\t\xFF\012) and so have to spend an extra clock-cycle or two on scanning.
Just like so many compile-time bottlenecks though, once you introduce an opcode cache, this distinction vanishes.

Thank you Sara for the information.
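For anyone wanting to see the substitution rules mentioned in the reply above, a minimal sketch follows. It only demonstrates which escape sequences each quoting style processes. It says nothing about timing, and the variable names are made up for the example.

```php
<?php
// Double-quoted literals are scanned for escape sequences:
// \t is a tab, \x41 is hex for 'A', \101 is octal for 'A'.
$double = "\t\x41\101";

// Single-quoted literals only treat \\ and \' specially, so the same
// characters are kept verbatim, backslashes included.
$single = '\t\x41\101';

echo strlen($double), "\n"; // 3 bytes: tab, 'A', 'A'
echo strlen($single), "\n"; // 10 bytes: every character as typed
```

The extra scanning for double quotes happens when the script is compiled, which is why the difference disappears once an opcode cache is holding the compiled result.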
Is it expensive to move in and out of PHP mode? i.e. This is a string and again:

hmm.. it ate my angle-brackets!
I meant a file like: This is a [?=$var?] and [?=$another_var?]! Where [ = left angle bracket and ] = right angle bracket

In terms of compiled opcodes:
Foo [?=$bar?] baz is identical to: echo 'Foo ', $bar, ' baz';
Put more in terms of how the compilation occurs, say you're currently "in PHP mode", or as the lexer puts it: ST_IN_SCRIPTING... ?]htmlcontent[?php is seen as: echo 'htmlcontent';
From the standpoint of compilation times, they're going to be more or less identical with the former being marginally faster due to not having to substitute \\ and \' sequences, but marginally slower due to having more complex start and end tokens ( [? [?php [?= [% [%= [script language="php"] and ?] %] [/script] versus ' and '). A wash overall and not a significant enough potential to bother worrying about. As always, these distinctions vanish in the presence of an opcode cache.

Sara, another interesting post on PHP internals. Thanks for the heads up on the optimisations without APC, I thought I was on the right track using '' compared to ""
are you saying I have no life!?
I'm not about to go compiling to get this kind of output, so it's great to get it from the people who are!
I did speed tests a few years ago on the difference between echo, print, sprint and the impact of single quotes, double quotes and curly brackets. In the end none of it probably matters when compared to the masses of database hits we make, but it's still great to know what's going on under the hood. Thanks!

INIT_STRING ~0
ADD_STRING ~0 ~0 'This'
ADD_STRING ~0 ~0 ' '
...
ECHO ~0
Too expensive a solution for such a frequent case. There is a standard fast way to do multiple concatenation: insert the string pointers into an array, and then concatenate the array's contents into one resulting string.

In a perfect world the compiler would be looking at the script as a whole and would be reassembling those lost tokens just like that. Unfortunately, that compile-time analysis comes at the cost of speed so it's seen as "less helpful" to perform this work at compile-time for a string which may (or may not) actually be used. It's all a question of tradeoffs:
Fast, Clean, Cheap: Pick two.

and what about echo implode($array);
I've been using that a bit lately...

Native PHP arrays have a big overhead for that task due to their ability to work as either associative or indexed arrays. And in fact, using $a[] = $b ... $c = implode($a) is slower and requires more memory than $c .= $b.
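Anyone who wants to check the implode() versus .= claim on their own setup could start from a sketch like the one below. The helper names (build_with_append, build_with_implode, time_it), the sample data, and the run count are all invented for this illustration. It makes no promise about which approach wins, since that depends on the PHP version and on whether an optimizer is in play.

```php
<?php
// Hypothetical sample data: 1000 short strings to join together.
$pieces = [];
for ($i = 0; $i < 1000; $i++) {
    $pieces[] = 'item ' . $i . "\n";
}

// Approach 1: grow a single string with the .= operator.
function build_with_append(array $pieces): string
{
    $out = '';
    foreach ($pieces as $piece) {
        $out .= $piece;
    }
    return $out;
}

// Approach 2: collect the pieces in an array, then join them once.
function build_with_implode(array $pieces): string
{
    $buffer = [];
    foreach ($pieces as $piece) {
        $buffer[] = $piece;
    }
    return implode('', $buffer);
}

// Average wall-clock seconds per call over $runs calls.
function time_it(callable $fn, array $pieces, int $runs = 100): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn($pieces);
    }
    return (microtime(true) - $start) / $runs;
}

printf("append:  %.8f s/call\n", time_it('build_with_append', $pieces));
printf("implode: %.8f s/call\n", time_it('build_with_implode', $pieces));
```

Both functions return the same string, so any difference in the printed numbers comes from how the string is assembled. Comparing memory_get_peak_usage() across separate runs would be one way to look at the memory side of the claim.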
I think the solution can only be at a low level, by implementing a new special type like LIST or TUPLE, some kind of the widely known dynamic arrays from the C/C++ world with pointers to strings as its values. But I guess it would require some global changes in the PHP compiler and Zend engine, so it need not be considered a high-priority task. And look at
http://phplens.com/lens/php-book/optimizing-debugging-php.php there is a nice way to do fast multiple concatenations with an output buffer (ob_start/ob_get_clean) without messing with low levels etc., and it really works.

In some cases it might make a difference to use some small optimisation, but for most cases when developing a website in PHP, it shouldn't make much of a difference to do things like use single quotes versus double quotes.
One thing that is great when working at a high level is not having to think about little details like this. While it may be fun, and some minor detail might make a [small] difference in PHP version X, it might totally change in version Y. Of course I'm not talking about optimisations like caching output, caching variables fetched from the database, etc. that give you a bang for your buck and make sense even at a high level. I may be right or wrong (and might change my point of view in the future), but I often feel that using the same syntax can be better (e.g. always use double quotes or single quotes), since it gives you the same style visually when editing. It is like curly braces: there are arguments in favour of each of the most popular styles (one on each line, or one starting on the same line as the statement; I have my preference of course), but it doesn't change much, and a constant style is pleasant.

I found a completely different result:
Executing Heredoc Interpolation Test : ......... 0.00376201 ms/iter.
Executing String Concatenation Test : ......... 0.00410941 ms/iter.
Executing String Interpolation Test : ......... 0.00474050 ms/iter.
Executing Output Buffering Test : ......... 0.00430031 ms/iter.
Executing Output Buffering Include Test : ......... 0.04734261 ms/iter.
Executing SimpleXML Element Test : ......... 0.02940309 ms/iter.
Executing Indented SimpleXML Element Test : ......... 0.07175908 ms/iter.
Executing XML Serializer Test : ......... 0.38576980 ms/iter.
Executing CachedXML Serializer Test : ......... 0.37935431 ms/iter.
Executing DOM Document Test : ......... 0.04152939 ms/iter.
Completed 10 test cases of 10000 iterations each.
real 0m9.746s
user 0m9.593s
sys 0m0.132s
izepp@izepp-desktop:~/Desktop$
Full overview & code available at: http://www.ianzepp.com/engineering/2008/12/a-comprehensive-review-of-string-xml-performance.html

Rerunning the "String Concatenation" test after removing the 7 unnecessary concatenations (see my comment on your blog) shows it to be faster than the heredoc one on my computer (no optimizer).
Executing Heredoc Interpolation Test : ......... 0.02168660 ms/iter.
Executing String Concatenation Test : ......... 0.01204681 ms/iter.
Which is another way of showing that the best way to optimize is to be lazy and do as little as possible.

Great information, but what if I did use an optimizer?
As a PHP developer, the Planet PHP blog is, to be fair, pretty useless. Many of the posts aren't related to PHP, some of them are so basic they are not worth reading, and almost all of the rest are simply opinion pieces, or ill-conceived thoughts about th
Tracked: Jun 19, 06:47
How long is a piece of string (in PHP)? Sara Golemon has the answer. Very enlightening. ...
Tracked: Jun 23, 12:05