unserialize Yahoo! search results

Via John Cox, Yahoo! have opened up a PHP Development Center for their search APIs and, more interestingly, have started exposing their search data as serialized PHP strings. That’s “serialized” as in the serialize function.

This is very cool but I think a little caution is needed when using it, given that it wasn’t designed to be a wire format but rather for local storage of PHP data, within a trusted environment.

Is this format safe?

First there’s a problem of trust and potentially a security issue. I guess we can trust Yahoo! OK but they need to make very sure that they’re escaping the data they publish this way correctly – make sure no-one they’ve gathered search results for is able to inject anything in there. Why?

Well for primitive data types (strings, int, PHP arrays) there’s more or less no problem – you can unserialize the result you get back without issues. That said, perhaps the Hardened-PHP Project needs to look at this – are there issues like deeply nested arrays, infinite recursion or very large data structures?

The potential (but low risk) security issue though is if any objects are being serialized this way. When you unserialize, PHP is going to attempt to create instances of those objects (assuming it can find the corresponding class). The constructor is not called, but the __wakeup function is fired if one exists in the class definition, and in PHP5 so is the destructor once the object goes out of scope. Given that PHP5 is growing more and more built-in classes, there’s a chance this could become a significant problem.
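To check which magic methods actually run, here is a minimal sketch. The class name Probe and its static log are hypothetical, made up purely for this demonstration, and the payload is written by hand rather than produced by serialize().

```php
<?php
// Hypothetical class used only for this demonstration.
class Probe
{
    public static $log = array();

    public function __construct() { self::$log[] = 'construct'; }
    public function __wakeup()    { self::$log[] = 'wakeup'; }
    public function __destruct()  { self::$log[] = 'destruct'; }
}

// A hand-written payload naming the class; "new Probe" is never executed.
$payload = 'O:5:"Probe":0:{}';

$obj = unserialize($payload); // runs __wakeup(), skips __construct()
unset($obj);                  // last reference gone, so __destruct() runs

// The log now holds 'wakeup' followed by 'destruct', and no 'construct'.
print_r(Probe::$log);
```

The point is that the sender of the string, not your code, chooses which class gets woken up and destroyed.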

I’ve dealt with this before in JPSpan, which also used this format across the wire, and have mentioned it to Josh here. The solution I took was to screen the serialized string first with some regexes and I’m 99% sure this approach works – implementation here (JPSpan_Unserializer_PHP::getClasses) and unit tests here – see TestOfJPSpan_Unserializer_PHP_getClasses.

For Yahoo!, the following function would do it;


function hasObjects($string) {
    
    // Strip any string representations (which might contain object syntax)
    $string = preg_replace('/s:[0-9]+:".*"/Us','',$string);
    
    // Pull out the class names
    preg_match_all('/O:[0-9]+:"(.*)"/U',$string,$matches,PREG_PATTERN_ORDER);
    
    return count($matches[1]) > 0;
}

$serializedString = file_get_contents('http://api.search.yahoo.com ... (etc.) ... &output=php');

if ( hasObjects($serializedString) ) {
    die ("It's got objects. I object!");
}

$result = unserialize($serializedString);

Given that most people will be fetching this data from Yahoo! in cleartext, there’s a chance an attacker might be able to modify it in transit, so calling that function to check for objects is probably a smart move.
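On PHP 7.0 and later, unserialize() itself accepts an options array, and passing allowed_classes as false tells it not to instantiate any named class. That did not exist when the regex screen above was written, so treat the following as a sketch of an alternative rather than a replacement for the original approach. The function name unserializeDataOnly and the class Tripwire are hypothetical.

```php
<?php
// Hypothetical class standing in for anything with side effects on wake-up.
class Tripwire
{
    public static $fired = false;

    public function __wakeup() { self::$fired = true; }
}

// Hypothetical wrapper; needs PHP 7.0+ for the second argument.
function unserializeDataOnly($string)
{
    // With allowed_classes => false, serialized objects come back as
    // __PHP_Incomplete_Class, so no __wakeup() or __destruct() of a real class runs.
    return unserialize($string, array('allowed_classes' => false));
}

$data   = unserializeDataOnly('a:1:{s:5:"Title";s:5:"Yahoo";}');
$object = unserializeDataOnly('O:8:"Tripwire":0:{}');

var_dump(is_array($data));                           // plain arrays pass through
var_dump($object instanceof __PHP_Incomplete_Class); // the object was neutralised
var_dump(Tripwire::$fired);                          // __wakeup() never ran
```

You would still want to reject the response outright if anything other than arrays and scalars turns up, since Yahoo! say they only return arrays.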

Character Encoding?

Now Yahoo! have been smart in encoding the data as UTF-8. Here are the HTTP response headers for a request I made (side note: perhaps Yahoo! might consider using a little HTTP caching, given this is REST?);


HTTP/1.x 200 OK
Date: Thu, 23 Feb 2006 08:14:38 GMT
P3p: policyref="http://p3p.yahoo.com/w3c/p3p.xml", CP="[snip]"
Connection: close
Content-Type: text/php; charset="utf-8"

I checked with some multibyte results and Yahoo! are correctly counting the lengths of multibyte strings in bytes in the PHP results they return, so no problem there. You probably do need to be aware of what encoding you are using on the site displaying the results though. John, for example, is using ISO-8859-1 (which is very common), so you’ll want to convert UTF-8 to ISO-8859-1 using utf8_decode (which is purely for use with ISO-8859-1!) on the data elements, after the call to unserialize(). People with more exotic character encodings will need to turn to iconv. In Firefox, open a page on your site, right click and select “Page Info” – this will tell you your character encoding.
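As a rough sketch of both points: the length prefix in the serialized string counts bytes, and the conversion has to happen on the individual strings after unserializing, never on the serialized string itself (that would invalidate the length prefixes). The helper name convertStrings is hypothetical, it assumes the iconv extension is available, and it only converts values, not array keys.

```php
<?php
// "Zürich" spelled out as UTF-8 bytes, so the file's own encoding doesn't matter.
$city = "Z\xC3\xBCrich";

// The length prefix is 7 (bytes), not 6 (characters).
echo serialize($city), "\n";

// Hypothetical helper: convert every string value in an unserialized result.
function convertStrings(array $data, $from = 'UTF-8', $to = 'ISO-8859-1')
{
    array_walk_recursive($data, function (&$value) use ($from, $to) {
        if (is_string($value)) {
            $value = iconv($from, $to, $value);
        }
    });
    return $data;
}

// Stand-in for part of a result set that has already been unserialized.
$result = array('Title' => $city, 'Position' => 3, 'Tags' => array("caf\xC3\xA9"));
$latin1 = convertStrings($result);
```

If your page is already served as UTF-8, none of this is needed; just send the right charset header.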

Example with PEAR::HTTP_Request

Yahoo! provide examples with curl and file_get_contents(). Here’s an alternative example using PEAR::HTTP_Request;


<?php
// Include PEAR::HTTP_Request
require_once 'HTTP/Request.php';

// Something to search for
$searchword = 'Zürich';

// Build Yahoo! web search URL
$yahoo_url = 'http://api.search.yahoo.com/WebSearchService'.
    '/V1/webSearch?appid=YahooDemo&results=10&output=php'.
    '&query='.urlencode($searchword); // escape the query (assumes this file is saved as UTF-8)

// Create the HTTP_Request object, specifying the URL
$Request = new HTTP_Request($yahoo_url);

// Set proxy server as necessary
// $Request->setProxy('proxy.myisp.com', '8080', 'harryf', 'secret');

// Send the request for the feed to the remote server
$status = $Request->sendRequest();

// Check for errors
if (PEAR::isError($status)) {
    // Do something friendlier than die...
    die("Connection problem: " . $status->toString()."<br />");
}

// Check we got an HTTP 200 status code (if not there's a problem)
if ($Request->getResponseCode() != '200') {
    // Do something friendlier than die...
    die("Request failed: " . $Request->getResponseCode()."<br />");
}

// Get the PHP serialized string from Yahoo!
$phps = $Request->getResponseBody();

// Function to test that the serialized string doesn't contain any objects
function hasObjects($string) {
    
    // Strip any string representations (which might contain object syntax)
    $string = preg_replace('/s:[0-9]+:".*"/Us','',$string);
    
    // Pull out the class names
    preg_match_all('/O:[0-9]+:"(.*)"/U',$string,$matches,PREG_PATTERN_ORDER);
    
    return count($matches[1]) > 0;
}

// Check this before unserializing
if ( hasObjects($phps) ) {
    // Do something friendlier than die...
    die("Serialized string contains objects<br />");
}

// Unserialize...
$data = unserialize($phps);

// Note the charset!
header('Content-type: text/html; charset=utf-8');

echo '<pre>';
foreach ( $data['ResultSet']['Result'] as $row ) {
    print_r($row);
}
echo '</pre>';

Not just PHP

Encoders and decoders for this format have actually been done in quite a few different languages. Some that I’m aware of;

Move over encoded SOAP – text/php is taking over :-)

  • http://www.phppatterns.com HarryF

    One further thing I should have mentioned – it’s worth looking at Keith’s uriescape function.

  • http://www.schlitt.info dotxp

    What you are basically saying is, that there is no issue with the Yahoo! API, because they won’t return objects to you, but only scalars or arrays. Or am I wrong.

  • http://www.phppatterns.com HarryF

    “What you are basically saying is, that there is no issue with the Yahoo! API, because they won’t return objects to you, but only scalars or arrays. Or am I wrong.”

    I’m saying you need to be careful to check for objects before unserializing. It looks like Yahoo! won’t be returning objects in their results but that doesn’t rule out the possibility that

    a) someone injected something into their search results and they failed to escape it

    or

    b) someone modified the results between your web server and Yahoo, given cleartext transfer

    This is low risk (and the impact is also low) but it’s still a risk. Someone could cause, say, a dir object to be created when you unserialize (which isn’t dangerous in itself but it should make you nervous).

  • http://www.schlitt.info dotxp

    Ah, sorry, I missed the point of “somebody injecting something”. :) Then you’re perfectly right.

  • Ren

    Also try unserialising an array with 100,000,000 items…

    a:100000000:{}

    Running that a few times will exhaust memory.

  • Joshua Eichorn

    In HTML_AJAX we pull out the class names and compare them against a white list. But I’m still not comfortable with using the PHP serialization format to move data around. Since it contains string sizes and doesn’t know about other encodings, it’s extremely brittle.

    As far as the empty array trick goes, I believe you can do that with urlencoded strings as well, so I’m not sure if it’s actually a big idea; a memory limit will just cause that process to die.

  • tkruthoff

    Although I agree there may be some security risks, object constructors are not called on unserialize. The primary reason being that constructors can have arguments, which the serializer would not know what to do with.

  • Anonymous

    Check out the documentation on Yahoo! (http://developer.yahoo.net/common/phpserial.html)

    “Since Yahoo! Web Services uses associative and numerically indexed PHP arrays to represent web services responses, the results of deserialization will be PHP arrays.”

    So Yahoo! won’t be returning any objects. Hence, no security issues.

  • Anonymous

    There is plenty of sanitation going on in the backend. Yahoo also sends JSON replies that people traditionally feed directly to a Javascript eval(). That is even more dangerous than the tricks someone could do in a serialized string. So yes, if you have a man-in-the-middle attack where you are talking to something that isn’t actually Yahoo, relying on the serialized php or json output can get you into a lot of trouble. But assuming you are confident that you are getting this directly from Yahoo there is nothing to worry about.

  • http://www.phppatterns.com HarryF

    “Although I agree there may be some security risks, object constructors are not called on unserialize. The primary reason being that constructors can have arguments, which the serializer would not know what to do with.”

    You’re right – I don’t know where I got that one from – one of those old assumptions I haven’t thought about for a long time.

    It’s the __wakeup function (if it exists) that gets called when unserializing. Will do an update to the entry soon. In PHP5 you also have the destructor though, when the object goes out of scope or the script ends.

    This is low risk but all the same, I personally don’t like the idea of unexpected code being executed and it costs almost nothing to protect yourself.

    Good point on JSON also. At the same time, there’s only so much damage you can cause with a browser and eval() – main danger is probably session hijacking and similar. On a server there’s much more dangerous functionality around.

  • Anonymous

    Has anyone already pointed out to those Yahoo guys, that “text/php” is a seldom dopey mime type for this type of more or less program-dependent and almost binary data?
