Efficient way to pass large objects between scripts?

Hi

I have a system set up a bit like a tree, where the trunk is the start and end point.
Ex:
Request goes to the controller (1)
That controller starts up multiple sub-controllers (N)
Each sub-controller starts some workers (n)
The workers return the data to the sub-controller (which does its magic), which in turn returns the data to the main controller, so it can do its magic before returning it to the user.

These scripts are spread across multiple servers (on the same gigabit network). Usually about 900 scripts are started for every request, and the data passed between scripts is usually under 1MB (multi-dimensional arrays making up objects).

Right now, the way I pass the data is by json_encode in the worker and json_decode in the parent.
But, this is #1 too slow (even though it’s about 5x faster than serialize) and #2 takes WAY too much RAM (sometimes 500KB of values takes 60MB of RAM, and this is per worker/child).

Of a request that takes about 20 sec, 10 to 15 sec is usually just this json_encode/json_decode part.
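Roughly, the exchange is just this (a simplified sketch of the pattern, not the real code - the variable names are made up, and in reality the two halves run on different servers and talk over HTTP):

// worker side: build the standardised result and emit it as JSON
$result  = array(array('id' => 1, 'cost' => 100.0), array('id' => 2, 'cost' => 200.0)); // stands in for the real fare arrays
$payload = json_encode($result);     // slow part #1, per worker
echo $payload;

// parent side: turn the worker's output back into an array
$data = json_decode($payload, true); // slow part #2, and where the RAM spikes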

So the question is:

  • Is there a better way to transfer this data from one script to another? (I need to use all the data in each script, so I can’t just pass an ID and select from a global cache/DB.)

Please reply.

What are you doing exactly?

It’s a system that sells plane tickets, but that’s irrelevant to the problem (could be anything)

Are you sure that the bottleneck occurs during the json_encode/decode process?

At what point does the system branch out to other servers - at the sub-controller or the worker level, or maybe both?
Could data transfer speed over the network be responsible?

Just a thought!

I could be miles off target here, but I found this article a while back - might be of interest.

Hmm. This just doesn’t seem right to me…

$_POST allows for multidimensional arrays to be transferred…

To me it sounds like your problems could be solved by a proper object oriented design.

Yes, one of the bottlenecks is the way I pass the actual object from one server/script to another.
The 20 sec test did not max out the network (it went to ~50Mb/sec), but the CPU/RAM of the servers encoding/decoding the PHP objects spikes for a good 5 to 10 sec.

I also added some logging around the part that just reads/sends the data, and I’m 100% sure that it’s one of the bottlenecks that needs to be fixed.
And since it’s literally 2 lines of code (encode/decode…), that’s where I figured I’d look for a smarter way to do it.
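(For reference, the “logging” is nothing fancy - just timestamps and memory readings around the decode call, roughly like the sketch below. Names changed, and $raw here is only a stand-in for a worker’s real response body.)

$raw = json_encode(array_fill(0, 50000, array('cost' => 123.45, 'airline' => 'XX'))); // stand-in payload

$start = microtime(true);
$data  = json_decode($raw, true);
error_log(sprintf('decode took %.3f s, peak mem %.1f MB',
    microtime(true) - $start,
    memory_get_peak_usage(true) / 1048576));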

I tried PHP’s serialize/unserialize, but that was 4x slower than json_encode/json_decode.

So basically, I need a better way to transfer an array of arrays (objects) from one server to another.

K. Wolfe: I post the data to the children and then need to get it back; they echo json_encoded variables, and I json_decode them so I can use them. (That’s the slow part…)

There aren’t many problems that would require starting up 900+ scripts per request, and airline ticket sales isn’t one of them; and I can think of none that need 60MB to process 500K of data. Something is seriously wrong. I sense a ball-of-mud project that has evolved rather than been designed - one that has been nursed along with the old “throw more hardware at it” solution and may be fast approaching the end of the line, where it must be replaced because the expense of maintaining it will eclipse the cost of replacing it. I’ve dealt with those myself - they aren’t fun - especially when upper management would rather deny the reality of the situation within the code and keep trying to patch it along.

Right… But why do you have so many “children” on remote servers? If you can get some of these operations onto the same machine, you can avoid network/JSON bottlenecks. Additionally, you can save more resources with a proper object oriented design.

+1

K. Wolfe: It’s OOP, and I need it on multiple machines since one can’t handle the load.

Michael Morris: unfortunately, I have to start that many…
My system loads data from other systems; as a simplified example, this is why I need to start 900 workers:

The user wants to fly from YYZ to NYC and back to YYZ, with flexible dates (as in, ±3 days on departure/arrival), with any airline and any seat in economy or business class.

The data source only accepts requests in the form FROM CITY/DATE - TO CITY/DATE - AIRLINE - CLASS
So, this means:
[CITY] [-3 to +3] - [CITY] [-3 to +3] - [airline] - [class]
[7 days] * [7 days] * [10 airlines that fly between those cities] * [2 classes] = 980 requests right there (assuming no cache hits), each handled by a worker that just standardises this data into something the rest of the script can use (sketched in code below).

And this assumes a single data source, no need to ask for a “next page”, no composite tickets and so on…
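In code, the fan-out is basically four nested loops, something like this (illustrative only - the real requests go out in each data source’s own format):

$requests = array();
foreach (range(-3, 3) as $depOffset) {                        // 7 possible departure days
    foreach (range(-3, 3) as $retOffset) {                    // 7 possible return days
        foreach (range(1, 10) as $airline) {                  // ~10 airlines flying the route
            foreach (array('economy', 'business') as $class) { // 2 classes
                $requests[] = array(
                    'from'    => 'YYZ', 'to' => 'NYC',
                    'dep'     => $depOffset,
                    'ret'     => $retOffset,
                    'airline' => $airline,
                    'class'   => $class,
                );
            }
        }
    }
}
echo count($requests); // 7 * 7 * 10 * 2 = 980 workers, one per request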

The bottleneck I’m trying to fix is getting the standardised data from those 980 requests back to the controller script, where I can actually start the real work.
And I have to standardise the data, since each data source has its own format/rules/aliases/etc… (systems made in the 70s that never changed… and with crap on top of crap, as Michael Morris explained).

Any suggestions? (So: I have an array in script A on server A, and I want to pass it to the script that called it, on server B.)

Hmm… The only thing I can think of is to set up a server whose only job is to hold the standardized data, feed it to the front-end requests and continually negotiate the translation of data. The PHP front end would talk only to that server, which would hold the schedule data for it. The translation side would probably be better off in another language - C++, I would think. There’d be a lag between when the old systems get updated and when the new system gets the data right, but this could be worked around as provisional, with some legal text on the front end explaining that the prices displayed are continually in flux. Once the user has made their choice you can then hunt up that exact ticket and send a confirmation.

Or, you could honor the old price and eat the losses when they occur, but also take the profits when a customer agrees to pay more than what the airline charged in that period of lag. Such issues are policy related - the sort of decisions managers should make.

Whatever you discover though, Good Luck. I don’t envy your predicament.

I tried that approach already (one server holding the data, with the workers passing back just the IDs), but then I max out the bandwidth to/from that server (since I need multiple requests at the same time), and that server (it was a cluster of memcached servers) needed a ton of RAM for nothing.

And worst of all, I still need to serialise the data somehow to place it on that one server, and that’s what’s slow…
(After I parse the standardised data, I cache the results in a similar way, but my issue is before I parse it, when I get it from the 900+ workers - the serialise/unserialise step seems redundant…)

Can you go into detail on this? I really don’t understand what you have so far, but right here I’m having a feeling this can be simplified.

Ideally you should have 1-2 application servers (the second is a backup, not a second machine to split duties) and as many data servers as needed to fulfill the requests in a timely fashion. If it’s designed correctly and it’s still falling behind, start adding new data servers to share the load. But unless you are over 120 gigs of active working set, you don’t even need to think about a second data server.

K. Wolfe: My system gets its data from other systems (outside of my control).
Each of those systems has its own format, so I need to standardize the data to my own format (I’m fetching this data in parallel calls).

I have to spread the load of the workers across multiple servers (~24 for now), since one server cannot handle making all the XML/HTML/JSON/text/SOAP calls to the various data sources and parsing those responses into my standard format.
Because of that, I need a way to pass that data over to the parent of those scripts, where I can apply my business logic and so on.
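As a rough illustration of what each worker does (hypothetical names - the real per-source parsers are obviously far bigger):

// One implementation per data source; all of them output my standard fare arrays.
interface FareSource
{
    public function standardise($rawResponse);
}

class XmlSourceExample implements FareSource // hypothetical SOAP/XML source
{
    public function standardise($rawResponse)
    {
        $xml   = simplexml_load_string($rawResponse);
        $fares = array();
        foreach ($xml->fare as $fare) {
            $fares[] = array(
                'airline' => (string) $fare->carrier, // source field -> my field
                'cost'    => (float) $fare->base,
                'tax'     => (float) $fare->tax,
            );
        }
        return $fares;
    }
}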

If 1 request takes 20 sec, this is how the time is spent:
~ 1 sec parent business logic/starting the 900 workers in parallel
**** This is where work in parallel starts, while the parent waits ****
~ 1 to 10 sec workers waiting for data (in parallel per worker, so 900 to 9,000 sec worker time)
~ 1 to 3 sec workers parsing the data and formatting it to my standard format. (in parallel per worker, so 900 to 2,700 sec worker time)
----- this is what I want optimised, since it seems redundant -----
~ 1 to 2 sec workers json_encoding/serialising the data (cpu, in parallel per worker, so 900 to 1,800 sec worker time)
~ 1 sec transfer the data to the parent (network)
**** This is where work in parallel ends ****
~ 5 to 10 sec parent decoding the data from all workers, as it receives it (cpu)
----- up to here -----
~ 5 sec applying my business rules/magic

Total user time: 15 to 32 sec , where 7 to 12 sec seems redundant & useless (~37% of total time used just to pass data around)

If I don’t use workers and do everything in the parent, I have to wait the 1 to 3 sec of parsing per worker sequentially, so 900 to 2,700 sec (15 to 45 min).

What I’m looking for is an efficient way to get the standardized data (produced by the workers) back to the parent/controller that initiated them.
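To give an idea of the shape of it, the “start workers, wait, decode” step on the parent is essentially a curl_multi loop like the sketch below (illustrative only, with hypothetical worker URLs and no error handling - not my actual dispatch code):

$jobs = array(array('from' => 'YYZ', 'to' => 'NYC')); // in reality ~900 of these

$mh = curl_multi_init();
$handles = array();
foreach ($jobs as $i => $job) {
    $ch = curl_init('http://worker' . ($i % 24) . '.internal/run.php'); // hypothetical URL
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip');   // worker output is gzipped
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $job);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// let all transfers run in parallel while the parent waits
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);
    }
} while ($running && $status == CURLM_OK);

// decode each worker's JSON (this is the 5 to 10 sec block I want to get rid of)
$results = array();
foreach ($handles as $ch) {
    $results[] = json_decode(curl_multi_getcontent($ch), true);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);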

I see. This still comes back to mine and Michael’s original point: we don’t feel it should take that long to parse out JSON unless there’s something else extremely goofy going on.

Just curious, how many different remote systems are you hitting?

EDIT: BTW, my current job has me doing much of this type of thing. All my company deals with is syncing external systems to our own through curl / SOAP / XML etc.

I currently have 7 different data sources:

  • 2 HTML websites,
  • 2 stateless soap (xml/custom format),
  • 3 TA based (basically plain text over socket communication).
    I take the data from those sources, and when the user takes an action that needs to be synced back, I send them data.

Each request returns 100-800KB of data (but I do 900 of them), and after I parse that data, I end up with arrays of objects like this one:


$fare_tpl = array(
    'id' => 0,
    'airline' => 0,
    'consolidator' => 0,
    'cost' => 0,
    'tax' => 0,
    'adult_cost' => 0,
    'adult_tax' => 0,
    'child_cost' => 0,
    'child_tax' => 0,
    'infant_cost' => 0,
    'infant_tax' => 0,
    'flights' => array(),
    'filters' => array(
        'outbound_start_date' => 0,
        'outbound_end_date' => 0,
        'outbound_duration' => 0,
        'outbound_stops' => 0,
        'inbound_start_date' => 0,
        'inbound_end_date' => 0,
        'inbound_duration' => 0,
        'inbound_stops' => 0,
        'duration' => 0,
        'stops' => 0,
        'airline' => '',
        'price' => 0,
    ),
);
// That's just a random object I got at the end of the line.

The ones the workers return have about 100 fields per object, and each worker returns about 100 of these objects, each with about 100 different flight objects, and each flight with 1 to 5 legs. (ex: $fare[0]->flights[24]->legs[outbound][1]->departure_time).

So each worker returns about 100KB to 500KB of JSON/serialized data (it’s gzipped content, so there’s no problem for the network).

How long should it take to parse that json (encode/decode)?
Maybe I’m missing something stupid here…
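If it helps to put numbers on it, this is roughly how I’d measure one worker payload (stand-in data below, not the real fare objects):

// Build a stand-in payload of ~100 objects x ~100 fields, then time the decode.
$fares = array();
for ($i = 0; $i < 100; $i++) {
    $fares[$i] = array_fill_keys(range(1, 100), 'xxxxxxxx');
}
$json = json_encode($fares);

$t = microtime(true);
$back = json_decode($json, true);
printf("payload %.0f KB, decode %.3f s, peak mem %.1f MB\n",
    strlen($json) / 1024,
    microtime(true) - $t,
    memory_get_peak_usage(true) / 1048576);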

Oh, this is a fun project! I’ll dive into this later after work :slight_smile:

This is fascinating reading, guys… I’ve never worked on a project of this kind of magnitude, so it’s fascinating to read about. Not that I think I can add much to the discussion - but just out of interest, does this all happen in real time? That is, as a customer of your website, when I go to search for tickets, these searches to remote sources all happen in real time and are actually triggered by me doing a search?

I’ve often thought about how these aggregate websites work, and figured they must cache data and have workers storing the data in the background constantly. I guess that’s not so easy for you to do due to the sheer complexity of combinations involved?

As I say, I don’t think I can really add much here, but it’s fun to read about, so I’m getting out the popcorn and I’m sitting in the background :slight_smile:

Okay, so this grabbed my attention and I’m interested, so I did a quick test.

First, I generated a 783K JSON file using the following code (granted it doesn’t have nested arrays, but I planned on building that in later):

build.php

<?php
// Generates a test JSON file of random objects (assumes a writable data/ directory).
define('VALID_CHARS', 'abcdefghijklmnopqrstuvwxyz');

// 100 random 6-character field names, shared by every object
$keys = array();
for ($i = 0; $i < 100; $i++)
{
	$keys[] = randomString(6);
}

// 400 objects, each with a random 8-character string for every field
$objs = array();
for ($i = 0; $i < 400; $i++)
{
	$obj = array();
	foreach ($keys as $key)
	{
		$obj[$key] = randomString(8);
	}

	$objs[] = $obj;
}

// Encode the whole lot and write it out (~783K for 400 objects)
file_put_contents('data/objects.json', json_encode($objs));

// Returns a random lowercase string of the given length.
function randomString($length)
{
	$str = '';

	$validChars = VALID_CHARS;
	$validCharsLength = strlen($validChars);

	for ($i = 0; $i < $length; $i++)
	{
		$str .= $validChars[mt_rand(1, $validCharsLength) - 1];
	}

	return $str;
}

Then I created a script that reads the file and performs json_decode:

read.php

<?php
// Read the generated file back in and decode it, mirroring what the parent does.
$content = file_get_contents('data/objects.json');
$objs = json_decode($content);

Both run in a split second on my development machine (granted I’ve got a quad-core 8 GB RAM machine, but I really don’t see this being your hold up).

Next I profiled the code using Xdebug: build.php took 1,166 ms to run, with 63% of that time spent in randomString (which you wouldn’t have, but you would have something that generates your objects).
read.php ran in 42 ms, with 97% of the time spent in json_decode (DUH! there were only two lines, what did you expect?).

So now I obviously need to go bigger, and see if I can start seeing seconds instead of ms.

Next, a 2M file: I changed 400 to 1000 in build.php.

Profiler shows 3,135 ms for build.php, with 57% in randomString, and read.php shows 105 ms with 98% being in json_decode.

So from these numbers I have concluded thus far that as the file size grew (i.e. the number of objects grew), the time to build the JSON file scaled roughly in proportion - there was no disproportionate blow-up (granted I didn’t make a big leap, but I did go from 700K to 2M in file size). read.php reacted the same way as build.php, so reading the larger file didn’t adversely affect performance either.

I’ve attached my profiles and the code here for others to look at as well, but I strongly think you need to set up Xdebug and figure out where your bottlenecks are, as I don’t believe it is json_encode or json_decode directly.