I’ve had a stubborn mental block trying to come up with code that isn’t a memory-hogging, convoluted mess of temporary arrays and loops to do what I could easily do if the data came from a database. For example, if I had a table like:
lname       fname
Hemingway   Ernest
Hemingway   Margaux
Hemingway   Mariel
Ritter      John
Ritter      Jason
Yes, I have thought of that, the “temporary” array approach. The problem is that the array I’m working with has over a couple hundred sub-arrays, and the script needs to have its memory and time limits bumped up and runs slowly.
I’m thinking I may need to INSERT the array into a database, but I can’t help thinking there are some native PHP functions that could do the work without needing to do that; I just can’t think of what they might be.
The obvious answer is “don’t have PHP do what a database should do”. But with all the “merge”, “intersect” etc. array functions, I wanted to ask.
Hmmm. I’m wondering how efficient a multisort “sort by lname” followed by passing the sorted array through a callback that altered the “fname” and unset one of the sub-arrays would be.
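Just to illustrate the idea being floated here, a minimal sketch of that sort-then-merge approach (the $rows structure and the field names are assumptions, not taken from the actual data):

// Assumes $rows is a list of ['lname' => ..., 'fname' => ...] sub-arrays.
usort($rows, fn($a, $b) => strcmp($a['lname'], $b['lname']));

$prev = null;
foreach ($rows as $i => $row) {
    if ($prev !== null && $rows[$prev]['lname'] === $row['lname']) {
        // Same surname as the previously kept row: fold the first name in
        // and drop this sub-array.
        $rows[$prev]['fname'] .= ', ' . $row['fname'];
        unset($rows[$i]);
    } else {
        $prev = $i;
    }
}

Whether that beats building a keyed output array directly is an open question, but it does avoid holding a second full-size array.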
When I look at what @ahundiak suggested, I don’t think this method uses a temporary-array approach: the $groups variable is not a temporary array but the output array; it is in fact what you need. The fact that the elements of $groups are arrays and not strings is of little importance, because then it’s just a matter of formatting the output with implode(). So in my opinion there is no temporary array, because effectively you can use it as the final array of data.
In this case PHP will consume memory for both the original array ($rows) and the destination array ($groups). I don’t think you can work around that, but when you read from a CSV file you can get rid of the original array’s footprint by reading from the file and constructing the output array immediately:
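The snippet that followed isn’t reproduced in this excerpt; a minimal sketch of the idea, assuming the file name and that lname and fname are the first two columns:

$groups = [];
$fh = fopen('names.csv', 'r');              // file name is an assumption
while (($row = fgetcsv($fh)) !== false) {
    [$lname, $fname] = $row;                // assumes column order: lname, fname
    $groups[$lname][] = $fname;             // group first names under the surname key
}
fclose($fh);

// Formatting the output is then just an implode() per surname:
foreach ($groups as $lname => $fnames) {
    echo $lname . ': ' . implode(', ', $fnames) . PHP_EOL;
}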
I don’t think you can get faster than this. With a database you would need to loop over the rows and insert each one into the database, which I presume would be slower than grouping them immediately in PHP. If you had this data in the database beforehand, then of course grouping it in the database would be fastest.
But in your code you are constructing two temporary arrays, $lname and $fname, which will consume memory, and that’s what you wanted to avoid, so aren’t you back to square one?
I still do not see the advantage in using a database here. Even given that it’s highly optimized for joining/grouping, there has to be some method of copying the data around, from source to destination memory. You just move the memory footprint from one process to another and have to deal with the overhead. Also, for large data sets you have to create an index to be fast, which will consume even more memory and processor time, or you will run into a seqscan.
I think in part I have fallen into my old ways of going about it. Similar to doing a SELECT * and using PHP to do the heavy lifting, the fgetcsv code was written to be used in various ways. It could be more efficient to have multiple variations of the fgetcsv code instead.
I guess I could unset() the temp arrays when done with them, or limit their scope so they self-destruct.
I need to get some sleep soon, but I think I’ll try some current(), next() etc. cursor code tomorrow and see how that goes.
I suppose you could use an in-memory database such as SQLite and probably import your CSV file more or less directly. Then do your query. But you should probably make sure the database supports your functions first, as SQLite is quite limited.
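A rough sketch of that idea with PDO’s SQLite driver (file name, table layout and column order are all assumptions):

$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE names (lname TEXT, fname TEXT)');

$insert = $pdo->prepare('INSERT INTO names (lname, fname) VALUES (?, ?)');
$fh = fopen('names.csv', 'r');                 // file name is an assumption
$pdo->beginTransaction();                      // a single transaction keeps the inserts fast
while (($row = fgetcsv($fh)) !== false) {
    $insert->execute([$row[0], $row[1]]);      // assumes lname, fname column order
}
$pdo->commit();
fclose($fh);

// GROUP_CONCAT does the grouping that would otherwise be done in PHP.
$stmt = $pdo->query("SELECT lname, GROUP_CONCAT(fname, ', ') AS fnames FROM names GROUP BY lname");
foreach ($stmt as $row) {
    echo $row['lname'] . ': ' . $row['fnames'] . PHP_EOL;
}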
Gonna be resource-intensive no matter what. PHP is just not designed to process 51k of anything efficiently.
I’ve run into similar size problems when working with large census files for QGIS maps. I had to “prune” the CSV file in LibreOffice down to a more acceptable size so that QGIS wouldn’t hang.
I balk at removing data for fear I may end up needing it. But in this case I think I’ll have to toughen up and make some hard decisions; it’s not like the original file can’t serve as a backup if I find I need something later.
I’ll see how much of a size reduction I can get by writing a pruned CSV file.
Is there any reason why you don’t want to load all that data into a database? How often will you have a new csv file to load? How often will you have to generate grouped results? How fast do you need each of these operations to run?
It’s more or less a “hobby” project that I will offer to share with a very few others for their personal use. As for the CSV files, they get updated constantly so it is as often as a new one is wanted.
I could put the info into a database and I would have no problem with that. But I would like the script to be as portable as possible, without requiring any extra setup or ini configuration.
For example, I was exploring using EXIF, but it required enabling allow_url_fopen, cURL and Exif, so I gave up on that. I can write a script that creates and populates tables, but not one that creates a database to put the tables in.
SQLite is built in, so I don’t think there are PHP hosts without it? Maybe there are, but not many. SQLite has a CONCAT_WS function; it won’t sort the first names for you, but you can do that in PHP if you really need it.
A crazy idea would be to build the name groups with files: create a separate file for each surname (instead of an array element) and append first names to it, analogous to the loop we’ve shown earlier. I have no idea how that would perform, and you might end up with thousands of temporary files, but you could do it all while requiring virtually no memory.
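A sketch of what that might look like, assuming lname and fname are the first two CSV columns and skipping any sanitising of the surname for use as a file name:

$fh = fopen('names.csv', 'r');              // file name is an assumption
while (($row = fgetcsv($fh)) !== false) {
    [$lname, $fname] = $row;
    // Append to a per-surname file; file_put_contents opens and closes the
    // file each time, so almost nothing accumulates in memory.
    file_put_contents('groups/' . $lname . '.txt', $fname . PHP_EOL, FILE_APPEND);
}
fclose($fh);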
At 260K unique names, that would be a lot of files.
This doesn’t help to reduce the number of rows, but it has reduced the file size from 51,800 KB to 19,400 KB and results in better performance by shifting the hit to a one-time operation.
After a few more changes I think I have an acceptable solution. By passing in “included fields” I can reduce the number of columns. By passing in “required fields” I can reduce the number of rows by eliminating the rows that don’t have the values I need, e.g. down to ~400 KB. More files, but much more efficient.
Never really liked using in_array, just because most of the time it’s slower than isset. So consider doing something like:
$col_args = ['all' => true];

if (isset($col_args['all'])) {
    // whatever
}
But the big thing about your code is that it does all those checks on each line. Seems like lots of overhead. What I would do is take the header row and generate a list of desired indexes based on $col_args and $req_fields. Only do the in_array or isset stuff once.
The individual line processing then consists of simply copying the values for the indexes. Seems like it would be a bit cleaner.
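A minimal sketch of that approach; $col_args and $req_fields come from the thread, everything else here is assumed:

$fh     = fopen('data.csv', 'r');          // file name is an assumption
$header = fgetcsv($fh);

// Do the name-to-index lookups once, against the header row only.
// Assumes $col_args is keyed by field name (or has 'all' set) and
// $req_fields is a list of required field names.
$wanted = [];
foreach ($header as $idx => $name) {
    if (isset($col_args['all']) || isset($col_args[$name])) {
        $wanted[$name] = $idx;
    }
}
$required = array_intersect_key($wanted, array_flip($req_fields));

// The per-line work is then just copying values by index.
$rows = [];
while (($line = fgetcsv($fh)) !== false) {
    foreach ($required as $idx) {
        if (trim($line[$idx]) === '') {
            continue 2;                    // skip rows missing a required value
        }
    }
    $row = [];
    foreach ($wanted as $name => $idx) {
        $row[$name] = $line[$idx];
    }
    $rows[] = $row;
}
fclose($fh);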
I’m not sure what you have in mind, but maybe the way I had started?
The CSV files are not my own. They may be relatively stable, but maybe not. They currently have 35 fields, and I was originally writing code that depended on their sequence, names and number. But because there is no guarantee that the sequence, names or fields won’t be changed, or that some won’t be added or removed without prior notice, the more functions I wrote, the more obvious it became that the code was fragile.
A series of if conditionals for each field might not be so bad in one function but needing to make changes in multiple places would be a maintenance nightmare. Hence a “field array” in one place that can be changed and used elsewhere.
Currently the resource-intensive code is in the “file manipulation” and not run as often as the “data manipulation”, which is not as hungry.
I was looking at different options to reduce the six-minute time taken to download and convert two XLS spreadsheets into a common database table. I reduced the time to less than a minute, mostly by using the following instead of stepping through each CSV row and adding it to the table of over 22,000 rows.
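The actual statement isn’t shown in this excerpt, so this is only a guess at the kind of bulk import meant; for MySQL it would look something like this (connection details, table and file names are placeholders):

// Bulk-load the CSV in one statement instead of inserting row by row.
// Requires local_infile to be enabled on both the MySQL server and the client.
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', 'user', 'pass', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,
]);
$pdo->exec("
    LOAD DATA LOCAL INFILE '/path/to/data.csv'
    INTO TABLE import_staging
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
");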
It was not easy to set up because the table did not exactly match the CSV fields, so I now create a new matching table and import the CSV data. After importing I add additional columns, then finally add a primary index. The latter two operations only take fractions of a second. Before adding the primary key it was necessary to delete duplicates, which also only took a fraction of a second.
Adding and verifying a couple of hundred updates (from another CSV file with a different structure) does not take long.
Thinking outside the box, use phpspreadsheet.
I have only just started using it and find it better to create and use .xlsx files for data than .csv files.
Just being able to load an array of data into a sheet or read a sheet directly into an array instead of doing it line by line simplifies things no end.
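For anyone following along, the array round trip looks roughly like this with PhpSpreadsheet (file names are placeholders, and it assumes the library was installed via Composer):

require 'vendor/autoload.php';

use PhpOffice\PhpSpreadsheet\IOFactory;
use PhpOffice\PhpSpreadsheet\Spreadsheet;
use PhpOffice\PhpSpreadsheet\Writer\Xlsx;

// Read an entire sheet straight into an array.
$spreadsheet = IOFactory::load('source.xlsx');
$rows = $spreadsheet->getActiveSheet()->toArray();

// ...manipulate $rows as needed...

// Write an array of rows back out as a new workbook.
$out = new Spreadsheet();
$out->getActiveSheet()->fromArray($rows, null, 'A1');
(new Xlsx($out))->save('result.xlsx');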