Splitting a 75mb CSV into manageable file sizes

Hi all,

I am downloading a load of CSVs daily and they range from 1mb to 75mb. When I get the downloaded CSV, I loop through each row and insert it into a db.

When the CSVs are 75mb, PHP runs out of resources.

I have tried to change various limits:

php_value max_execution_time 600
php_value memory_limit 512M

However, this still doesn’t help. So my solution is to split the CSV file into manageable file sizes: split the 75mb file into 5mb files, leaving me with 15 files, then loop over each of these files individually.

The only problem is that I am not sure how to do this. Please could someone advise?

Cheers

You could use the rather handy LimitIterator.


<?php
// Reads the first 500 rows (offset 0, count 500) as parsed CSV arrays
$csv = new SplFileObject('data.csv');
$csv->setFlags(SplFileObject::READ_CSV);

foreach (new LimitIterator($csv, 0, 500) as $line) {
  #save $line
}
?>

You would just need to increase the starting point (0 in this case) on each run. The example saves lines 0-499; on the next run, change the 0 to 500 to save lines 500-999.
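
For example (untested), the second run would look something like this, with the offset bumped along to the start of the next batch:

<?php
// Second run: skip the first 500 lines and save the next 500 (lines 500-999)
$csv = new SplFileObject('data.csv');
$csv->setFlags(SplFileObject::READ_CSV);

foreach (new LimitIterator($csv, 500, 500) as $line) {
  #save $line
}
?>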

:wink:

This is awesome!! Quick question though: is there a way to find out how many rows a CSV has without reading the whole file and iterating a counter?

Cheers

Well, the iterator doesn’t implement ‘Countable’, nor does SplFileObject, so I’d guess not.

Is there a reason you need to know?

UNTESTED!

<?php
// Read the file in batches of 500 rows, bumping the offset until we hit EOF
$csv = new SplFileObject('data.csv');
$csv->setFlags(SplFileObject::READ_CSV);
$start = 0;
$batch = 500;
while (!$csv->eof()) {
  foreach (new LimitIterator($csv, $start, $batch) as $line) {
    #save $line
  }
  $start += $batch;
}
?>

Thank you to you both.

I also just found out I can do

count(file('mydata.csv'));

and this will tell me how many lines there are, but I am guessing siteguru’s solution is more efficient

:confused:

What does that achieve? As far as I can see, the ->eof() call is unnecessary. If the OP wanted to iterate the complete file in one go, they’d only need the foreach loop, no?

I’m probably missing something though… :stuck_out_tongue:

The ‘problem’ you have with this, though, is that file() will load the complete file into memory IIRC, so this may be an issue.

What do you need the count for? Maybe a better solution can be found. :slight_smile:
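
For what it’s worth, if you really did need a line count, an untested sketch along these lines would keep memory low by reading one line at a time instead of slurping the whole file with file():

<?php
// Untested: count lines one at a time rather than loading the whole file.
// Note: depending on PHP version and a trailing newline, the count can be
// off by one, so treat it as approximate.
$csv = new SplFileObject('data.csv');

$lines = 0;
foreach ($csv as $line) {
    $lines++;
}

echo $lines;
?>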

The only reason for counting the number of rows was just to see how many times I would need to run through the loop. But siteguru’s solution seems to work fine, unless you can spot something wrong or come up with another way which is more efficient?

Well, you don’t really need to calculate how many iterations are required. The iterator will stop on the last record, so there’s no real math needed.


<?php
$array = new ArrayObject(
  array(
    'foo',
    'bar',
    'ying',
    'yang'
  )
);

foreach(new LimitIterator($array->getIterator(), 1, 500) as $num => $item){
  #do stuff
}

echo $num; #3
?>

If $num is less than 500, you’re done.

Still, you have a couple of options now, hopefully you’re good to go. :slight_smile:
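
Untested, but a batched loop along those lines could just keep going until a batch comes back short:

<?php
// Untested sketch: read the CSV in batches of 500 rows and stop as soon as a
// batch comes back short, rather than pre-counting the lines.
$csv = new SplFileObject('data.csv');
$csv->setFlags(SplFileObject::READ_CSV);

$batch = 500;
$start = 0;

do {
    $last = null;
    foreach (new LimitIterator($csv, $start, $batch) as $num => $line) {
        #save $line
        $last = $num;
    }
    $start += $batch;
    // A full batch finishes on row ($start - 1); anything else means we hit the end
} while ($last === $start - 1);
?>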

I’d like to offer some critique of the snippets offered so far, and my own similar idea.

First, they both use the SplFileObject::READ_CSV flag. This is entirely unnecessary since the aim (if I’ve understood) is to split the big file into multiple smaller files. That said, the SplFileObject approach might be a good way to solve the whole issue because it only reads one line at any given time, so the memory usage is tiny (compared to reading the whole file, as with the file() function).

Also, both snippets limit the size of the file (or amount of processed data) by the number of lines rather than by 5MB of data.

So, my idea is to read the big CSV file line-by-line and write that data into smaller files (named data-<number>.csv) as needed.


$csv = new SplFileObject("data.csv");

$bound    = 5 * 1024 * 1024; // 5MB in bytes
$bytes    = 0;
$chunk_id = 1;

foreach ($csv as $line) {
    // If we have not opened a chunk file yet
    // or writing to the chunk file would overflow
    // the bound then start a new chunk file
    if ( ! isset($chunk) || $bytes + strlen($line) > $bound) {
        $bytes = 0;
        $chunk = new SplFileObject(sprintf("data-%03d.csv", $chunk_id++), "w");
        echo "Created {$chunk->getFilename()}\
";
    }

    // Write the line to the current chunk file and tally up the bytes written
    $bytes += $chunk->fwrite($line);
}

Hi Salathe,

As we’re reading the file line by line, the need to create separate files (and iterate them individually) is negated, is it not?

Are we not providing a solution to a problem that isn’t really there?
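
To illustrate (untested, and the PDO connection details, table and column names below are just placeholders), the rows could go straight into the database as they’re read:

<?php
// Untested sketch: insert each row as it is read, with no intermediate files.
// The DSN, credentials, table and column names are made up -- swap in your own.
$pdo  = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO my_table (col1, col2, col3) VALUES (?, ?, ?)');

$csv = new SplFileObject('data.csv');
$csv->setFlags(SplFileObject::READ_CSV);

foreach ($csv as $fields) {
    // Skip the blank "row" SplFileObject can report at the end of the file
    if ($fields === array(null) || $fields === false) {
        continue;
    }
    $stmt->execute($fields);
}
?>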

So this is what I have so far…


$csv = new SplFileObject('data.csv');
$csv->setFlags(SplFileObject::READ_CSV);
$data = array();
foreach (new LimitIterator($csv, 0, 500) as $num => $line)
{
	$data[$num] = $line;
}
$fp = fopen('file1.csv', 'w');
foreach ($data as $fields) 
{
	fputcsv($fp, $fields);
}
fclose($fp);

I then just need to loop around the CSV file, increasing the 0 and 500 limits.

@Anthony: Depends on what is happening during the processing stage (“insert it into a db”), if there is one, and on other limitations (time limit, etc.), which only the OP can tell us. “PHP runs out of resources” is pretty vague! For all we know, the real solution might not even be PHP-related. :wink:

This is what I have with the loop

$csv = new SplFileObject('data.csv');
$csv->setFlags(SplFileObject::READ_CSV);
$start = 0;
$batch = 2000;
$counter = 1;
while (!$csv->eof())
{
	$data = array();
	foreach (new LimitIterator($csv, $start, $batch) as $num => $line)
	{
		$data[$num] = $line;
	}
	$fp = fopen('file'.$counter.'.csv', 'w');
	foreach ($data as $fields) {
		fputcsv($fp, $fields);
	}
	fclose($fp);
	$start += $batch;
	$counter++;
}

Jamie, would you mind stepping back for a minute and answering a few questions?

  • If we could suggest some solutions which don’t involve writing those smaller files, but may be quicker and/or less memory-greedy than the one you’re currently using, would you be willing to try those and see if they’re too slow, use too much memory, or cause any other bottlenecks?
  • If we suggest code, could you try that code and report back (see previous), rather than taking bits and pieces and posting your changes (for review?)?
  • How much work is involved when processing the CSV data (into the database)?
  • If we keep asking questions, rather than being helpful and providing code snippets, would you mind?
  • Would you ask questions if the code isn’t clear to you, or just muddle through and if it works then awesome?

Salathe…

I completely missed your solution. My bad, dunno how I did that. Your solution is loads quicker!!

It created 15 files in like 20 seconds compared to 1 minute 10 seconds the other way.

Please could you explain why you think this is so?