unserialize Yahoo! search results
Via John Cox, Yahoo! have opened up a PHP Development Center for their search APIs and, more interestingly, have started exposing their search data as serialized PHP strings. That’s “serialized” as in the serialize function.
This is very cool, but I think a little caution is needed when using it, given that it wasn’t designed as a wire format but rather for local storage of PHP data within a trusted environment.
Is this format safe?
First there’s a problem of trust and potentially a security issue. I guess we can trust Yahoo!, but they need to make very sure that they’re escaping the data they publish this way correctly, so that no-one whose content ends up in the search results is able to inject anything in there. Why?
Well for primitive data types (strings, int, PHP arrays) there’s more or less no problem – you can unserialize the result you get back without issues. That said, perhaps the Hardened-PHP Project needs to look at this – are there issues like deeply nested arrays, infinite recursion or very large data structures?
The potential (but low-risk) security issue, though, is if any objects are being serialized this way. When you unserialize, PHP is going to attempt to create instances of those objects (assuming it can find the corresponding class). That does not run the constructor, but it does fire the __wakeup function, if one exists in the class definition, and later the destructor in PHP5. Given that PHP5 is growing more and more built-in classes, there’s a chance this could become a significant problem.
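To make that concrete, here is a minimal sketch of what unserialize() triggers when it meets an object. The Tracer class and its log are hypothetical, invented purely for illustration; the point is only that code in the class runs even though nothing on the receiving side ever created or serialized such an object.

```php
<?php
// Hypothetical class, used only to show which magic methods unserialize() triggers.
class Tracer {
    public static $log = array();
    public function __construct() { self::$log[] = 'construct'; }
    public function __wakeup()    { self::$log[] = 'wakeup'; }
    public function __destruct()  { self::$log[] = 'destruct'; }
}

// A hand-written payload, as someone might inject it into a response:
// no Tracer was ever constructed or serialized on this side.
$payload = 'O:6:"Tracer":0:{}';

$object = unserialize($payload); // calls __wakeup(), never __construct()
unset($object);                  // last reference gone, so __destruct() runs

print_r(Tracer::$log);
```

The log should contain the wakeup and destruct entries but no construct entry, which is why any class with side effects in those two methods is a concern.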
I’ve dealt with this before in JPSpan, which also used this format across the wire, and have mentioned it to Josh here. The solution I took was to screen the serialized string first with some regexes, and I’m 99% sure this approach works: see the implementation in JPSpan_Unserializer_PHP::getClasses and the unit tests in TestOfJPSpan_Unserializer_PHP_getClasses.
For Yahoo!, the following function would do it:
function hasObjects($string) {
// Strip any string representations (which might contain object syntax)
$string = preg_replace('/s:[0-9]+:".*"/Us','',$string);
// Pull out the class names
preg_match_all('/O:[0-9]+:"(.*)"/U',$string,$matches,PREG_PATTERN_ORDER);
return count($matches[1]) > 0;
}
$serializedString = file_get_contents('http://api.search.yahoo.com ... (etc.) ... &output=php');
if ( hasObjects($serializedString) ) {
die ("It's got objects. I object!");
}
$result = unserialize($serializedString);
Given that most people will be fetching this data from Yahoo! in cleartext, there’s a chance an attacker might be able to modify it in transit, so calling that function to check for objects is probably a smart move.
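As an alternative to the regex screen, here is a sketch that assumes PHP 7.0 or later, where unserialize() takes an options array. Setting allowed_classes to false stops PHP from instantiating any class named in the string. The wrapper name unserializeWithoutObjects is hypothetical, and this is not what JPSpan did, just another way to reach the same goal on a newer PHP.

```php
<?php
// Assumes PHP 7.0+, where unserialize() accepts a second "options" argument.
// unserializeWithoutObjects() is a hypothetical helper name.
function unserializeWithoutObjects($string) {
    // With allowed_classes => false, every serialized object is turned into an
    // inert __PHP_Incomplete_Class instead of an instance of the named class,
    // so no __wakeup() or destructor from that class gets to run.
    return unserialize($string, array('allowed_classes' => false));
}

// A small array with an object hidden among the values.
$data = unserializeWithoutObjects('a:1:{s:3:"hit";O:8:"stdClass":0:{}}');

var_dump($data['hit'] instanceof __PHP_Incomplete_Class);
var_dump($data['hit'] instanceof stdClass);
```

Unlike hasObjects(), this does not reject the response outright, so you may still want to treat any __PHP_Incomplete_Class in the result as a sign that something is wrong.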
Character Encoding?
Now Yahoo! have been smart in encoding the data as UTF-8. Here are the HTTP response headers for a request I made (side note: perhaps Yahoo! might consider using a little HTTP caching, given this is REST?);
HTTP/1.x 200 OK
Date: Thu, 23 Feb 2006 08:14:38 GMT
P3p: policyref="http://p3p.yahoo.com/w3c/p3p.xml", CP="[snip]"
Connection: close
Content-Type: text/php; charset="utf-8"
I checked with some multibyte results, and Yahoo! are correctly counting string lengths in bytes in the PHP results they return, so no problem there. You do need to be aware of which encoding the site displaying the results uses, though. John, for example, is using ISO-8859-1 (which is very common), so there you’ll want to convert UTF-8 to ISO-8859-1 by calling utf8_decode (which is purely for use with ISO-8859-1!) on the data elements, after the call to unserialize(). People with more exotic character encodings will need to turn to iconv. To find your encoding in Firefox, open a page on your site, right click and select “Page Info”.
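The byte counting and the conversion step can be sketched as follows. This assumes PHP 5.3 or later (for the closure) with the iconv extension enabled. The convertStrings helper and the sample array are hypothetical and only loosely shaped like a search result.

```php
<?php
// "Zürich" written as explicit UTF-8 bytes, so the example does not depend on
// how this source file is saved: the u-umlaut is the two bytes C3 BC.
$utf8 = "Z\xC3\xBCrich";

// serialize() records the length in bytes (7), not in characters (6).
echo serialize($utf8), "\n";

// Hypothetical helper: convert every string in an unserialized result
// from UTF-8 to the charset the displaying site uses.
function convertStrings(array $data, $toCharset) {
    array_walk_recursive($data, function (&$value) use ($toCharset) {
        if (is_string($value)) {
            $value = iconv('UTF-8', $toCharset, $value);
        }
    });
    return $data;
}

// Sample data standing in for the array returned by unserialize().
$result = array('Title' => $utf8, 'Nested' => array('Summary' => $utf8));
$latin1 = convertStrings($result, 'ISO-8859-1');

echo strlen($latin1['Title']), "\n"; // one byte per character after conversion
```

Converting after unserialize() rather than before matters: changing the bytes of the raw serialized string would invalidate the lengths stored in it.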
Example with PEAR::HTTP_Request
Yahoo! provide examples with curl and file_get_contents(). Here’s an alternative example using PEAR::HTTP_Request;
<?php
// Include PEAR::HTTP_Request
require_once 'HTTP/Request.php';
// Something to search for
$searchword = 'Zürich';
// Build Yahoo! web search URL (the search word must be URL-encoded)
$yahoo_url = 'http://api.search.yahoo.com/WebSearchService'.
'/V1/webSearch?appid=YahooDemo&results=10&output=php'.
'&query='.urlencode($searchword);
// Create the HTTP_Request object, specifying the URL
$Request = new HTTP_Request($yahoo_url);
// Set proxy server as necessary
// $Request->setProxy('proxy.myisp.com', '8080', 'harryf', 'secret');
// Send the request for the feed to the remote server
$status = $Request->sendRequest();
// Check for errors
if (PEAR::isError($status)) {
// Do something friendlier than die...
die("Connection problem: " . $status->toString()."<br />");
}
// Check we got an HTTP 200 status code (if not there's a problem)
if ($Request->getResponseCode() != '200') {
// Do something friendlier than die...
die("Request failed: " . $Request->getResponseCode()."<br />");
}
// Get the PHP serialized string from Yahoo!
$phps = $Request->getResponseBody();
// Function to test that the serialized string doesn't contain any objects
function hasObjects($string) {
// Strip any string representations (which might contain object syntax)
$string = preg_replace('/s:[0-9]+:".*"/Us','',$string);
// Pull out the class names
preg_match_all('/O:[0-9]+:"(.*)"/U',$string,$matches,PREG_PATTERN_ORDER);
return count($matches[1]) > 0;
}
// Check this before unserializing
if ( hasObjects($phps) ) {
// Do something friendlier than die...
die("Serialized string contains objects<br />");
}
// Unserialize...
$data = unserialize($phps);
// Note the charset!
header('Content-type: text/html; charset=utf-8');
echo '<pre>';
foreach ( $data['ResultSet']['Result'] as $row ) {
print_r($row);
}
echo '</pre>';
Not just PHP
Encoders and decoders for this format have actually been done in quite a few different languages. Some that I’m aware of;
Move over encoded SOAP – text/php is taking over :-)