Re-Implementing the Range Operator in PHP

We sometimes come across some amazing posts in other locations, and with the permissions of their authors, repost them on SitePoint. This is one such instance. In the post below, Thomas Punt reimplements the previously implemented range operator in PHP, this time using an improved approach. If you’ve ever been interested in PHP internals and adding features to your favorite programming language, now’s the time to learn!

This article assumes that the reader is able to build PHP from source. If this is not the case, then please see the Building PHP chapter of the PHP Internals Book first.

Flying elephpant

In the prequel to this article (hint: make sure you’ve read it first), I showed one way to implement a range operator in PHP. Initial implementations, however, are rarely the best, and so it is the intention of this article to look at how the previous implementation can be improved.

Thanks once again to Nikita Popov for proofreading this article!

Previous Implementation Drawbacks

The initial implementation put all of the logic for the range operator into the Zend VM, which forced computation to take place purely at runtime when the ZEND_RANGE opcode was executed. This not only meant that computation could not be shifted to compile time for operands that were literal, but also meant that some features would simply not work.

In this implementation, we will shift the range operator logic out of the Zend VM to enable for computation to be done at either compile time (for literal operands) or runtime (for dynamic operands). This will not only provide a small win for Opcache users, but will more importantly allow for constant expression features to be used with the range operator.

For example:

// as constant definitions
const AN_ARRAY = 1 |> 100;

// as initial property definitions
class A
{
	private $a = 1 |> 2;
}

// as default values for optional parameters:
function a($a = 1 |> 2)
{
	//
}

So without further ado, let’s reimplement the range operator.

Updating the Lexer

The lexer implementation remains exactly the same. The token is firstly registered in Zend/zend_language_scanner.l (line ~1200):

<ST_IN_SCRIPTING>"|>" {
    RETURN_TOKEN(T_RANGE);
}

And then declared in Zend/zend_language_parser.y (line ~220):

%token T_RANGE 		     "|> (T_RANGE)"

The tokenizer extension must again be regenerated by going into the ext/tokenizer directory and executing the tokenizer_data_gen.sh file.

Updating the Parser

The parser implementation is partially the same as before. We again start by stating the operator’s precedence and associativity by adding the T_RANGE token onto the end of the following line (line ~70):

%nonassoc T_IS_EQUAL T_IS_NOT_EQUAL T_IS_IDENTICAL T_IS_NOT_IDENTICAL T_SPACESHIP T_RANGE

We then update the expr_without_variable production rule again, though this time the semantic action (the code within the curly braces) will be slightly different. Update it with the following code (I placed it just below the T_SPACESHIP rule, line ~930):

	|	expr T_RANGE expr
			{ $$ = zend_ast_create_binary_op(ZEND_RANGE, $1, $3); }

This time, we’ve used the zend_ast_create_binary_op function (instead of the zend_ast_create function), which creates a ZEND_AST_BINARY_OP node for us. zend_ast_create_binary_op takes an opcode name that will be used to distinguish binary operations from one-another during the compilation stage.

Since we’re reusing the ZEND_AST_BINARY_OP node type now, there is no need to define a new ZEND_AST_RANGE node type as done before in the Zend/zend_ast.h file.

Updating the Compilation Stage

This time, there is no need to update the Zend/zend_compile.c file since it already contains the necessary logic to handle binary operations. Thus, we are simply reusing this logic by making our operator a ZEND_AST_BINARY_OP node.

The following is a trimmed version of the zend_compile_binary_op function:

void zend_compile_binary_op(znode *result, zend_ast *ast) /* {{{ */
{
	zend_ast *left_ast = ast->child[0];
	zend_ast *right_ast = ast->child[1];
	uint32_t opcode = ast->attr;

	znode left_node, right_node;
	zend_compile_expr(&left_node, left_ast);
	zend_compile_expr(&right_node, right_ast);

	if (left_node.op_type == IS_CONST && right_node.op_type == IS_CONST) {
		if (zend_try_ct_eval_binary_op(&result->u.constant, opcode,
				&left_node.u.constant, &right_node.u.constant)
		) {
			result->op_type = IS_CONST;
			zval_ptr_dtor(&left_node.u.constant);
			zval_ptr_dtor(&right_node.u.constant);
			return;
		}
	}

	do {
		// redacted code
		zend_emit_op_tmp(result, opcode, &left_node, &right_node);
	} while (0);
}
/* }}} */

As we can see, it is pretty similar to the zend_compile_range function we created last time. The two important differences are in regards to how the opcode type is acquired and what happens when both operands are literals.

The opcode type is acquired from the AST node this time (as opposed to being hardcoded, as seen last time), since the ZEND_AST_BINARY_OP node stores this value (as seen from the new production rule’s semantic action) to differentiate between binary operations. When both operands are literals, the zend_try_ct_eval_binary_op function will be invoked. This function looks as follows:

static inline zend_bool zend_try_ct_eval_binary_op(zval *result, uint32_t opcode, zval *op1, zval *op2) /* {{{ */
{
	binary_op_type fn = get_binary_op(opcode);

	/* don't evaluate division by zero at compile-time */
	if ((opcode == ZEND_DIV || opcode == ZEND_MOD) &&
	    zval_get_long(op2) == 0) {
		return 0;
	} else if ((opcode == ZEND_SL || opcode == ZEND_SR) &&
	    zval_get_long(op2) < 0) {
		return 0;
	}

	fn(result, op1, op2);
	return 1;
}
/* }}} */

The function obtains a callback from the get_binary_op function (source ) in Zend/zend_opcode.c according to the opcode type. This means we will need to update this function next to cater for the ZEND_RANGE opcode. Add the following case statement to the get_binary_op function (line ~750):

		case ZEND_RANGE:
			return (binary_op_type) range_function;

Now we must define the range_function function. This will be done in the Zend/zend_operators.c file alongside all of the other operators:

ZEND_API int ZEND_FASTCALL range_function(zval *result, zval *op1, zval *op2) /* {{{ */
{
	zval tmp;

	ZVAL_DEREF(op1);
	ZVAL_DEREF(op2);

	if (Z_TYPE_P(op1) == IS_LONG && Z_TYPE_P(op2) == IS_LONG) {
		zend_long min = Z_LVAL_P(op1), max = Z_LVAL_P(op2);
		zend_ulong size, i;

		if (min > max) {
			zend_throw_error(NULL, "Min should be less than (or equal to) max");
			return FAILURE;
		}

		// calculate size (one less than the total size for an inclusive range)
		size = max - min;

		// the size cannot be greater than or equal to HT_MAX_SIZE
		// HT_MAX_SIZE - 1 takes into account the inclusive range size
		if (size >= HT_MAX_SIZE - 1) {
			zend_throw_error(NULL, "Range size is too large");
			return FAILURE;
		}

		// increment the size to take into account the inclusive range
		++size;

		// set the zval type to be a long
		Z_TYPE_INFO(tmp) = IS_LONG;

		// initialise the array to a given size
		array_init_size(result, size);
		zend_hash_real_init(Z_ARRVAL_P(result), 1);
		ZEND_HASH_FILL_PACKED(Z_ARRVAL_P(result)) {
			for (i = 0; i < size; ++i) {
				Z_LVAL(tmp) = min + i;
				ZEND_HASH_FILL_ADD(&tmp);
			}
		} ZEND_HASH_FILL_END();
	} else if ( // if both operands are either integers or doubles
		(Z_TYPE_P(op1) == IS_LONG || Z_TYPE_P(op1) == IS_DOUBLE)
		&& (Z_TYPE_P(op2) == IS_LONG || Z_TYPE_P(op2) == IS_DOUBLE)
	) {
		long double min, max, size, i;

		if (Z_TYPE_P(op1) == IS_LONG) {
			min = (long double) Z_LVAL_P(op1);
			max = (long double) Z_DVAL_P(op2);
		} else if (Z_TYPE_P(op2) == IS_LONG) {
			min = (long double) Z_DVAL_P(op1);
			max = (long double) Z_LVAL_P(op2);
		} else {
			min = (long double) Z_DVAL_P(op1);
			max = (long double) Z_DVAL_P(op2);
		}

		if (min > max) {
			zend_throw_error(NULL, "Min should be less than (or equal to) max");
			return FAILURE;
		}

		size = max - min;

		if (size >= HT_MAX_SIZE - 1) {
			zend_throw_error(NULL, "Range size is too large");
			return FAILURE;
		}

		// we cast the size to an integer to get rid of the decimal places,
		// since we only care about whole number sizes
		size = (int) size + 1;

		Z_TYPE_INFO(tmp) = IS_DOUBLE;

		array_init_size(result, size);
		zend_hash_real_init(Z_ARRVAL_P(result), 1);
		ZEND_HASH_FILL_PACKED(Z_ARRVAL_P(result)) {
			for (i = 0; i < size; ++i) {
				Z_DVAL(tmp) = min + i;
				ZEND_HASH_FILL_ADD(&tmp);
			}
		} ZEND_HASH_FILL_END();
	} else {
		zend_throw_error(NULL, "Unsupported operand types - only ints and floats are supported");
		return FAILURE;
	}

    return SUCCESS;
}
/* }}} */

The function prototype contains two new macros: ZEND_API and ZEND_FASTCALL. ZEND_API is used to control the visibility of functions by making them available to extensions that are compiled as shared objects. ZEND_FASTCALL is used to ensure a more efficient calling convention is used, where the first two arguments will be passed using registers rather than the stack (more relevant to 32bit builds than 64bit builds on x86).

The function body is very similar to what we had in the Zend/zend_vm_def.h file in the previous article. The VM-specific stuff is no longer present, including the HANDLE_EXCEPTION macro calls (which have been replaced with return FAILURE;), and the ZEND_VM_NEXT_OPCODE_CHECK_EXCEPTION macro calls have been removed entirely (this check and operation needs to stay in the VM, and so the macro will be invoked from the VM code later).

Another note-worthy difference is that we’re applying ZVAL_DEFEF to both operands to ensure that references are handled properly. This was something that was previously done inside of the VM using the pseudo-macro GET_OPn_ZVAL_PTR_DEREF, but has now been shifted into this function. This was done not because it is needed at compile time (since for compile time handling, both operands would have to be literals, and they cannot be referenced), but because it enables for other places inside the codebase to safely invoke range_function without having to worry about reference handling. As such, referencing handling is performed by most of the operator functions instead of in their VM opcode definition (except where performance matters).

Lastly, we must add the range_function prototype to the Zend/zend_operators.h file:

ZEND_API int ZEND_FASTCALL range_function(zval *result, zval *op1, zval *op2);

Updating the Zend VM

Now we must once again update the Zend VM to handle the execution of the ZEND_RANGE opcode during runtime. Place the following code in Zend/zend_vm_def.h (at the bottom):

ZEND_VM_HANDLER(182, ZEND_RANGE, CONST|TMPVAR|CV, CONST|TMPVAR|CV)
{
	USE_OPLINE
	zend_free_op free_op1, free_op2;
	zval *op1, *op2;

	SAVE_OPLINE();
	op1 = GET_OP1_ZVAL_PTR(BP_VAR_R);
	op2 = GET_OP2_ZVAL_PTR(BP_VAR_R);
	range_function(EX_VAR(opline->result.var), op1, op2);
	FREE_OP1();
	FREE_OP2();
	ZEND_VM_NEXT_OPCODE_CHECK_EXCEPTION();
}

(Again, the opcode number must be one greater than the current highest opcode number, which can be seen at the bottom of the Zend/zend_vm_opcodes.h file.)

The definition this time is far shorter since all of the work is handled in range_function. We simply invoke this function, passing in the result operand of the current opline to hold the computed value. The exception checks and skipping onto the next opcode that were removed from range_function are still handled in the VM by the call to ZEND_VM_NEXT_OPCODE_CHECK_EXCEPTION at the end. Also, as mentioned previously, we avoid handling references in the VM by using the GET_OPn_ZVAL_PTR pseudo-macros instead (rather than GET_OPn_ZVAL_PTR_DEREF).

Now regenerate the VM by executing the Zend/zend_vm_gen.php file.

Lastly, the pretty printer needs updating in the Zend/zend_ast.c file once again. Update the precedence table comment by specifying the new operator to have a priority of 170 (line ~520):

*  170     non-associative == != === !== |>

Then, insert a case statement into the zend_ast_export_ex function to handle the ZEND_RANGE opcode in the ZEND_AST_BINARY_OP case statement (line ~1300):

case ZEND_RANGE:               BINARY_OP(" |> ",  170, 171, 171);

Conclusion

This article has shown an alternative way to implement the range operator, where the computation logic was shifted out of the VM. This had the advantage of being able to use the range operator in constant expression contexts.

The third part to this article series will build upon this implementation by covering how we can overload this operator. This will enable for objects to be used as operands (such as those from the GMP library or those that implement an __toString method). It will also show how we can add proper support for strings (not like the support seen with PHP’s current range function). But for now, I hope this has served as a nice demonstration of some of ZE’s further aspects when implementing operators into PHP.