Memory Performance Boosts with Generators and Nikic/Iter
Arrays, and by extension iteration, are fundamental parts of any application. And as our applications grow more complex, how we use them should evolve as we gain access to new tools.
New tools, like generators, for instance. First came arrays. Then we gained the ability to define our own array-like things (called iterators). But since PHP 5.5, we can rapidly create iterator-like structures called generators.
These appear as functions, but we can use them as iterators. They give us a simple syntax for what are essentially interruptible, repeatable functions. They’re wonderful!
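If you haven't written one before, here's a minimal sketch of the idea in plain PHP. The function name `countTo` is made up for illustration and isn't part of the example repository. The function pauses at each `yield` and picks up again when the loop asks for the next value:

```php
<?php

// Hypothetical example: countTo() is an illustrative name only.
function countTo(int $limit): Generator
{
    for ($i = 1; $i <= $limit; $i++) {
        // Execution pauses here and hands $i to the caller.
        // It resumes when the consuming loop asks for the next value.
        yield $i;
    }
}

$seen = [];

// A generator can be looped over like any other iterator.
foreach (countTo(3) as $number) {
    $seen[] = $number;
}

print implode(", ", $seen) . PHP_EOL; // 1, 2, 3
```

Nothing inside `countTo` runs until the `foreach` starts pulling values, which is what makes generators a good fit for large data sets.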
And we’re going to look at a few areas in which we can use them. We’re also going to discover a few problems to be aware of when using them. Finally, we’ll study a brilliant library, created by the talented Nikita Popov.
You can find the example code at https://github.com/sitepoint-editors/generators-and-iter.
The Problems
Imagine you have lots of relational data, and you want to do some eager loading. Perhaps the data is comma-separated, and you need to load each data type, and knit them together.
You could start with something as simple as:
function readCSV($file) {
    $rows = [];
    $handle = fopen($file, "r");
    while (!feof($handle)) {
        // fgetcsv() can return false on the last read;
        // array_filter() below drops that entry
        $rows[] = fgetcsv($handle);
    }
    fclose($handle);
    return $rows;
}
$authors = array_filter(
    readCSV("authors.csv")
);
$categories = array_filter(
    readCSV("categories.csv")
);
$posts = array_filter(
    readCSV("posts.csv")
);
Then you’d probably try to connect related elements through iteration or higher-order functions:
function filterByColumn($array, $column, $value) {
    return array_filter(
        $array, function($item) use ($column, $value) {
            return $item[$column] == $value;
        }
    );
}
$authors = array_map(function($author) use ($posts) {
    $author["posts"] = filterByColumn(
        $posts, 1, $author[0]
    );
    // make other changes to $author
    return $author;
}, $authors);
$categories = array_map(function($category) use ($posts) {
    $category["posts"] = filterByColumn(
        $posts, 2, $category[0]
    );
    // make other changes to $category
    return $category;
}, $categories);
$posts = array_map(function($post) use ($authors, $categories) {
    foreach ($authors as $author) {
        if ($author[0] == $post[1]) {
            $post["author"] = $author;
            break;
        }
    }
    foreach ($categories as $category) {
        // the category ID lives in column 2 of each post
        if ($category[0] == $post[2]) {
            $post["category"] = $category;
            break;
        }
    }
    // make other changes to $post
    return $post;
}, $posts);
Seems ok, right? Well, what happens when we have huge CSV files to parse? Let’s profile the memory usage a bit…
function formatBytes($bytes, $precision = 2) {
    $kilobyte = 1024;
    $megabyte = 1024 * 1024;
    if ($bytes >= 0 && $bytes < $kilobyte) {
        return $bytes . " b";
    }
    if ($bytes >= $kilobyte && $bytes < $megabyte) {
        return round($bytes / $kilobyte, $precision) . " kb";
    }
    return round($bytes / $megabyte, $precision) . " mb";
}
print "memory:" . formatBytes(memory_get_peak_usage());
The example code includes generate.php, which you can use to make these CSV files…
If you have large CSV files, this code should show just how much memory it takes to link these arrays together. It’s at least the size of the file you have to read, because PHP has to hold it all in memory.
Generators to the Rescue!
One way you could improve this would be to use generators. If you’re unfamiliar with them, now is a good time to learn more.
Generators will allow you to load tiny amounts of the total data at once. There’s not much you need to do to use generators:
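function readCSVGenerator($file) {
    $handle = fopen($file, "r");
    // stop when fgetcsv() returns false, so callers never
    // receive a non-row value at the end of the file
    while (($row = fgetcsv($handle)) !== false) {
        yield $row;
    }
    fclose($handle);
}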
If you loop over the CSV data, you’ll notice an immediate drop in the amount of memory you need at once:
foreach (readCSVGenerator("posts.csv") as $post) {
    // do something with $post
}
print "memory:" . formatBytes(memory_get_peak_usage());
If you were seeing megabytes of memory used before, you’ll see kilobytes now. That’s a huge improvement, but it doesn’t come without its share of problems.
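Before getting to those, here's a rough way to see the difference for yourself without the example repository. This is only a sketch: it writes a throwaway CSV file to the system temp directory, and `writeSampleCsv` and `streamRows` are made-up names. Exact byte counts will vary with your PHP version and settings.

```php
<?php

// Assumption: a throwaway file in the temp directory stands in for posts.csv.
function writeSampleCsv(int $rows): string
{
    $path = tempnam(sys_get_temp_dir(), "csv");
    $handle = fopen($path, "w");
    for ($i = 1; $i <= $rows; $i++) {
        // Plain fwrite() keeps the sample data simple: no quoting needed.
        fwrite($handle, "$i," . ($i % 10) . "," . ($i % 5) . ",Post number $i\n");
    }
    fclose($handle);
    return $path;
}

function streamRows(string $path): Generator
{
    $handle = fopen($path, "r");
    // Delimiter, enclosure and escape are spelled out rather than left to defaults.
    while (($row = fgetcsv($handle, 0, ",", "\"", "\\")) !== false) {
        yield $row;
    }
    fclose($handle);
}

$path = writeSampleCsv(20000);

// Stream first: only one row is alive at a time.
$count = 0;
foreach (streamRows($path) as $row) {
    $count++;
}
$peakAfterStreaming = memory_get_peak_usage();

// Then load everything: every row stays in memory at once.
$all = iterator_to_array(streamRows($path), false);
$peakAfterLoading = memory_get_peak_usage();

unlink($path);

printf("streaming peak: %d bytes\n", $peakAfterStreaming);
printf("loading peak:   %d bytes\n", $peakAfterLoading);
```

The streaming pass runs first because `memory_get_peak_usage()` only ever goes up. Measured the other way around, the array's peak would hide the generator's.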
For a start, array_filter and array_map don’t work with generators. You’ll have to find other tools to handle that kind of data. Here’s one you can try!
composer require nikic/iter
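If you're curious what tools like this do conceptually, here's a rough sketch of lazy map and filter helpers in plain PHP. `lazyMap`, `lazyFilter` and `numbers` are made-up names for illustration. The library's own functions are more complete and may well be implemented differently.

```php
<?php

// Hypothetical helpers for illustration; these are not nikic/iter functions.
function lazyMap(callable $fn, iterable $items): Generator
{
    foreach ($items as $key => $item) {
        // Each value is transformed only when the consumer asks for it.
        yield $key => $fn($item);
    }
}

function lazyFilter(callable $predicate, iterable $items): Generator
{
    foreach ($items as $key => $item) {
        if ($predicate($item)) {
            yield $key => $item;
        }
    }
}

// A small generator standing in for a large data source.
function numbers(int $limit): Generator
{
    for ($i = 1; $i <= $limit; $i++) {
        yield $i;
    }
}

$evenSquares = lazyMap(
    function ($n) { return $n * $n; },
    lazyFilter(
        function ($n) { return $n % 2 === 0; },
        numbers(6)
    )
);

$result = iterator_to_array($evenSquares, false);
print implode(", ", $result) . PHP_EOL; // 4, 16, 36
```

Because both helpers are generators themselves, chaining them never builds an intermediate array.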
This library introduces a few functions that work with iterators and generators. So how could you still get all this related data, without keeping any of it in memory?
function getAuthors() {
    $authors = readCSVGenerator("authors.csv");
    foreach ($authors as $author) {
        yield formatAuthor($author);
    }
}
function formatAuthor($author) {
    $author["posts"] = getPostsForAuthor($author);
    // make other changes to $author
    return $author;
}
function getPostsForAuthor($author) {
    $posts = readCSVGenerator("posts.csv");
    foreach ($posts as $post) {
        if ($post[1] == $author[0]) {
            yield formatPost($post);
        }
    }
}
function formatPost($post) {
    foreach (getAuthors() as $author) {
        if ($post[1] == $author[0]) {
            $post["author"] = $author;
            break;
        }
    }
    foreach (getCategories() as $category) {
        if ($post[2] == $category[0]) {
            $post["category"] = $category;
            break;
        }
    }
    // make other changes to $post
    return $post;
}
function getCategories() {
    $categories = readCSVGenerator("categories.csv");
    foreach ($categories as $category) {
        yield formatCategory($category);
    }
}
function formatCategory($category) {
    $category["posts"] = getPostsForCategory($category);
    // make other changes to $category
    return $category;
}
function getPostsForCategory($category) {
    $posts = readCSVGenerator("posts.csv");
    foreach ($posts as $post) {
        if ($post[2] == $category[0]) {
            yield formatPost($post);
        }
    }
}
// testing this out...
foreach (getAuthors() as $author) {
    foreach ($author["posts"] as $post) {
        var_dump($post["author"]);
        break 2;
    }
}
This could be less verbose:
function filterGenerator($generator, $column, $value) {
    return iter\filter(
        function($item) use ($column, $value) {
            return $item[$column] == $value;
        },
        $generator
    );
}
function getAuthors() {
    return iter\map(
        "formatAuthor",
        readCSVGenerator("authors.csv")
    );
}
function formatAuthor($author) {
    $author["posts"] = getPostsForAuthor($author);
    // make other changes to $author
    return $author;
}
function getPostsForAuthor($author) {
    return iter\map(
        "formatPost",
        filterGenerator(
            readCSVGenerator("posts.csv"), 1, $author[0]
        )
    );
}
function formatPost($post) {
    foreach (getAuthors() as $author) {
        if ($post[1] == $author[0]) {
            $post["author"] = $author;
            break;
        }
    }
    foreach (getCategories() as $category) {
        if ($post[2] == $category[0]) {
            $post["category"] = $category;
            break;
        }
    }
    // make other changes to $post
    return $post;
}
function getCategories() {
    return iter\map(
        "formatCategory",
        readCSVGenerator("categories.csv")
    );
}
function formatCategory($category) {
    $category["posts"] = getPostsForCategory($category);
    // make other changes to $category
    return $category;
}
function getPostsForCategory($category) {
    return iter\map(
        "formatPost",
        filterGenerator(
            readCSVGenerator("posts.csv"), 2, $category[0]
        )
    );
}
It’s a bit wasteful to re-read each data source, every time. Consider keeping smaller related data (like authors and categories) in memory…
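One way that might look, sketched in plain PHP: index the small data sets by ID once, then stream the large one past them. `indexById` and `attachRelations` are made-up names, and the inline arrays stand in for the CSV readers. The column positions (ID in column 0, author ID in column 1, category ID in column 2) follow the examples above.

```php
<?php

// Hypothetical helper: builds an ID => row lookup from a small data set.
function indexById(iterable $rows): array
{
    $indexed = [];
    foreach ($rows as $row) {
        $indexed[$row[0]] = $row;
    }
    return $indexed;
}

// Hypothetical helper: streams posts and attaches the cached relations.
function attachRelations(iterable $posts, array $authors, array $categories): Generator
{
    foreach ($posts as $post) {
        // Array lookups replace a fresh pass over authors.csv and categories.csv.
        $post["author"] = $authors[$post[1]] ?? null;
        $post["category"] = $categories[$post[2]] ?? null;
        yield $post;
    }
}

// Stand-ins for readCSVGenerator("authors.csv") and friends.
$authors = indexById([[1, "Ada"], [2, "Grace"]]);
$categories = indexById([[10, "PHP"], [20, "Tools"]]);
$posts = (function () {
    yield [100, 2, 10, "Generators in practice"];
    yield [101, 1, 20, "Composer tips"];
})();

$linked = iterator_to_array(attachRelations($posts, $authors, $categories), false);
print $linked[0]["author"][1] . PHP_EOL; // Grace
```

The trade-off is that authors and categories now live in memory for the whole run, so this only pays off when those sets are small compared to the posts.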
Other Fun Things
That’s just the tip of the iceberg when it comes to Nikic’s library! Ever wanted to flatten an array (or iterator/generator)?
$array = iter\toArray(
    iter\flatten(
        [1, 2, [3, 4, 5], 6, 7]
    )
);
print join(", ", $array); // "1, 2, 3, 4, 5, 6, 7"
You can return slices of iterable variables, using functions like slice and take:
$array = iter\toArray(
    iter\slice(
        [-3, -2, -1, 0, 1, 2, 3],
        2, 4
    )
);
print join(", ", $array); // "-1, 0, 1, 2"
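Slicing lazily is useful because the source never has to be read to the end. Here's a sketch of that idea in plain PHP, using an endless generator. `naturals` and `takeFirst` are made-up names for illustration, not library functions:

```php
<?php

// Hypothetical generator: yields 1, 2, 3, ... and never finishes on its own.
function naturals(): Generator
{
    $n = 1;
    while (true) {
        yield $n++;
    }
}

// Hypothetical helper: passes through at most $count items, then stops.
function takeFirst(int $count, iterable $items): Generator
{
    if ($count <= 0) {
        return;
    }
    foreach ($items as $item) {
        yield $item;
        if (--$count === 0) {
            return; // stop before asking the source for another value
        }
    }
}

$firstFive = iterator_to_array(takeFirst(5, naturals()), false);
print implode(", ", $firstFive) . PHP_EOL; // 1, 2, 3, 4, 5
```

Converting `naturals()` to an array directly would never finish, but taking a slice of it is fine.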
As you work more with generators, you may come to find that you can’t always reuse them. Consider the following example:
$mapper = iter\map(
    function($item) {
        return $item * 2;
    },
    [1, 2, 3]
);
print join(", ", iter\toArray($mapper));
print join(", ", iter\toArray($mapper));
If you try to run that code, you’ll see an exception saying: “Cannot traverse an already closed generator”. Each iterator function in this library has a rewindable counterpart:
$mapper = iter\rewindable\map(
    function($item) {
        return $item * 2;
    },
    [1, 2, 3]
);
You can use this mapping function many times. You can even make your own generators rewindable:
$rewindable = iter\makeRewindable(function($max = 13) {
    $older = 0;
    $newer = 1;
    do {
        $number = $newer + $older;
        $older = $newer;
        $newer = $number;
        yield $number;
    } while ($number < $max);
});
print join(", ", iter\toArray($rewindable()));
What you get from this is a reusable generator!
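If you're wondering how rewinding can work at all, one possible approach is to keep hold of the generator function and call it again for every loop. The sketch below shows that idea in plain PHP. It is not the library's implementation, and `RestartableGenerator` is a made-up class name:

```php
<?php

// A sketch of the idea only; this is not how nikic/iter implements it.
class RestartableGenerator implements IteratorAggregate
{
    private $factory;
    private $args;

    public function __construct(callable $factory, ...$args)
    {
        $this->factory = $factory;
        $this->args = $args;
    }

    // Called once per traversal, so each loop gets a fresh generator.
    public function getIterator(): Traversable
    {
        return ($this->factory)(...$this->args);
    }
}

$doubled = new RestartableGenerator(function (array $items) {
    foreach ($items as $item) {
        yield $item * 2;
    }
}, [1, 2, 3]);

// Traversing twice works, because the generator function runs twice.
$first = iterator_to_array($doubled, false);
$second = iterator_to_array($doubled, false);

print implode(", ", $first) . PHP_EOL;  // 2, 4, 6
print implode(", ", $second) . PHP_EOL; // 2, 4, 6
```

The cost is that the work inside the generator is repeated on every pass, which matters if it reads files or makes network calls.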
Conclusion
Whenever you need to loop over something, generators may be an option. They can even be useful for other things, too. And where the language falls short, Nikic’s library steps in with higher-order functions aplenty.
Are you using generators yet? Would you like to see more examples on how to implement them in your own apps to gain some performance upgrades? Let us know!