PHP PDO MySQL Parse/Filter Data before MySQL

I’m working on remodeling my import code in one of my programs which looks like this:

<?php

include('dbconnect.php');

if (isset($_POST['btn_upload'])) {
	$filename = $_FILES['file']['tmp_name'];
	$file = fopen($filename, "r");

	// Read and throw away the header row
	fgetcsv($file);

	$stmt = $pdo->prepare("INSERT INTO contacts (id, last_name, first_name,
		middle_name, suffix, residential_address, mailing_address, precinct, age,
		ethnicity, gender, race, status, phone, reg_date)
		VALUES (:id, :last_name, :first_name, :middle_name, :suffix,
		:residential_address, :mailing_address, :precinct, :age, :ethnicity,
		:gender, :race, :status, :phone, :reg_date)");

	while (! feof($file)) {
		$row = fgetcsv($file);
		if ($row[0] == NULL) {
			continue;
		}
		$stmt->bindParam(':id', $row[0]);
		$stmt->bindParam(':last_name', $row[1]);
		$stmt->bindParam(':first_name', $row[2]);
		$stmt->bindParam(':middle_name', $row[3]);
		$stmt->bindParam(':suffix', $row[4]);
		$stmt->bindParam(':residential_address', $row[5]);
		$stmt->bindParam(':mailing_address', $row[6]);
		$stmt->bindParam(':precinct', $row[7]);
		$stmt->bindParam(':age', $row[8]);
		$stmt->bindParam(':ethnicity', $row[9]);
		$stmt->bindParam(':gender', $row[10]);
		$stmt->bindParam(':race', $row[11]);
		$stmt->bindParam(':status', $row[12]);
		$stmt->bindParam(':phone', $row[13]);
		$date = date_create($row[14]);
		$dfstr = date_format($date, 'Y-m-d');
		$stmt->bindParam(':reg_date', $dfstr);
		$stmt->execute();
	}

	fclose($file);
	header("Location: ../contacts_list.php");
}

if (isset($_POST['btn_back'])) {
	header("Location: ../contacts_list.php");
}

?>

I have a few items in this code that I’m wanting to change:

  1. Instead of clicking a button to read an uploaded file, is it possible for the code to automatically read a file that is saved on the server, or is this not recommended? I have a file that is downloaded to my server once a week via cron job and would like to use this file since it has official data inside of it, rather than try to parse the data myself. Also, this file is a .txt file instead of a .csv file, and it is tab-separated, not comma-separated.

  2. Is it possible to filter out certain data that I don’t need before it is imported? For example, there’s a column in this new file called party that I need to filter on, and I’m only interested in records that contain REP in this column. The file contains over 200,000 records, and I’m thinking that filtering on certain columns to get only what I need is a good way to minimize this number, as 200,000+ records is too much data to use on the site and would likely crash my server.

  3. Ideally, I would like the database to act as a mirror to the file so that if a record gets deleted or updated, then that change would show in the database after a certain amount of time. Is this possible to do or is it worth the effort?

Hi @jmyrtle,

  1. Yes, of course. While your upload script reads from $_FILES, you could instead read from a given path on the server’s file system. So when the script runs, it reads that file instead of reading from $_FILES (see the sketch at the end of this post).

  2. Well, not without reading each record first and seeing which rows have this column set to REP. This is part of the downside of using a file like this. Ideally, if you could put this content directly into a database instead of a file, you would gain the ability to quickly search for rows where the column contains REP and query only those records. But a file pretty much has to be read one line at a time.

  3. Yes, as long as each line has some kind of identifier that would identify it in the database. So if that ID disappears, for instance, you could locate that same ID in the database table and remove it. Or if a row changes somehow, you can use the ID to find it and update it in the database. The ID could be as simple as some kind of unique auto-incrementing number, or it could even be a hash of the entire line, in which case that hash gets saved as the ID in the database. To compare, you would rehash an updated line, compare it to the database hash, and if they differ you know the record changed.

There are multiple ways to do all of this but it depends on your design. :slight_smile:
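
For items 1 and 2 together, here’s a minimal sketch. The path and the party column index (27) are hypothetical and would need to match your real file; the "\t" argument tells fgetcsv() to split on tabs instead of commas:

<?php

// Hypothetical path where the weekly cron job drops the tab-separated file
$filename = '/var/data/voters.txt';

$file = fopen($filename, 'r');
if ($file === false) {
	exit('Could not open ' . $filename);
}

// Throw away the header row; the third argument makes fgetcsv() split on tabs
fgetcsv($file, 0, "\t");

while (($row = fgetcsv($file, 0, "\t")) !== false) {
	// Hypothetical column index for "party"; adjust to the real file layout
	if ($row[27] !== 'REP') {
		continue; // skip every record that is not REP
	}
	// ... bind/insert only the columns you actually need, as in your upload script ...
}

fclose($file);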

So, correct me if I’m wrong. Does this:

if(isset($_POST['btn_upload'])){
	$filename = $_FILES['file']['tmp_name'];
	$file = fopen($filename,"r");

...

change to this:

$filename = '/path/to/file';
$file = fopen($filename,"r");

...

How often would the code read the file? My idea is having some kind of <a> link that can be clicked to run this script and reload the new data, kind of like a sync feature, unless it can be tied to the cron job and run an auto sync every week.

This is what I feared. I checked the file and it has a total of 156,085 records. Loading that into the database manually would take a long time. So, what if (maybe) I keep this (while modifying the query and column names as needed):

$stmt = $pdo->prepare("INSERT INTO contacts (id, last_name, first_name,
	middle_name, suffix, residential_address, mailing_address, precinct, age,
	ethnicity, gender, race, status, phone, reg_date)
	VALUES (:id, :last_name, :first_name, :middle_name, :suffix,
	:residential_address, :mailing_address, :precinct, :age, :ethnicity,
	:gender, :race, :status, :phone, :reg_date)");

while (! feof($file)) {
	$row = fgetcsv($file);
	if ($row[0] == NULL) {
		continue;
	}
	$stmt->bindParam(':id', $row[0]);
	$stmt->bindParam(':last_name', $row[1]);
	$stmt->bindParam(':first_name', $row[2]);
	$stmt->bindParam(':middle_name', $row[3]);
	$stmt->bindParam(':suffix', $row[4]);
	$stmt->bindParam(':residential_address', $row[5]);
	$stmt->bindParam(':mailing_address', $row[6]);
	$stmt->bindParam(':precinct', $row[7]);
	$stmt->bindParam(':age', $row[8]);
	$stmt->bindParam(':ethnicity', $row[9]);
	$stmt->bindParam(':gender', $row[10]);
	$stmt->bindParam(':race', $row[11]);
	$stmt->bindParam(':status', $row[12]);
	$stmt->bindParam(':phone', $row[13]);
	$date = date_create($row[14]);
	$dfstr = date_format($date, 'Y-m-d');
	$stmt->bindParam(':reg_date', $dfstr);
	$stmt->execute();
}

then create a table view with what I actually need and make the code work off of that?

This is the second thing I feared. There is no ID column present, nor is there a column that can act as a unique key. Trying to add this into the file is going to be very difficult and it may not even work anyway.

There are two ways of handling deleted data -

  1. Delete all the data and re-insert just the wanted data. To do this in an error-tolerant way, you would create a new empty database table, insert just the wanted data, and if there are no errors, rename the original table as a backup and rename the new table as the original working table (a sketch of this follows the list).
  2. Get ALL the identifying information out of the file and UPDATE/DELETE the records in the database table that are NOT IN() the list of identifiers from the file. You would typically UPDATE a status column in existing data to indicate that it is deleted/unused, rather than actually deleting the row.
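
A minimal sketch of option 1, with hypothetical table names (contacts_new, contacts_backup); MySQL runs a multi-table RENAME TABLE as a single atomic statement:

include('dbconnect.php');

// Clear out any stale backup from a previous run, then build an empty copy
$pdo->exec("DROP TABLE IF EXISTS contacts_backup");
$pdo->exec("CREATE TABLE contacts_new LIKE contacts");

// ... insert just the wanted rows into contacts_new here ...

// If nothing above threw an error, swap the tables in one atomic statement
$pdo->exec("RENAME TABLE contacts TO contacts_backup, contacts_new TO contacts");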
  1. Yes, that is correct. The file is read every time the script is run; it is a script, after all, so the code runs each time it is executed. You can certainly have a link or anything else kick off the script on the server.

  2. You are still going to be reading every row. If you are OK with that part, then it is trivial to look at the column containing REP and skip the rows you don’t want. I just meant that most of the time is spent reading all the records in the first place.

  3. You can still create your own unique ID. Like I said, you can hash a value and create your own ID of sorts. Just hash together enough values to make the ID unique. Hash the entire row if need be. You just have to hash the same columns each time when you go to compare.
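
A minimal sketch of that idea, using md5() over several columns joined together; the column positions here are hypothetical and would need to match your file:

// Hashing name + address + registration date together, so two people at the
// same address still produce different IDs
$id = md5(implode('|', [
	$row[1],  // last name
	$row[2],  // first name
	$row[5],  // residential address
	$row[14], // registration date
]));

// The same columns must be hashed, in the same order, on every import,
// otherwise comparing against the stored hash is meaningless.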

Which file? The file with the data or the file with the code?

The data file is downloaded to my server once a week. If the cron job to download it runs once a week, does that mean the code should also run automatically once a week, or do I have to add that to the cron job? Just asking for clarification.

Well, it looks like I’m going to have to do that anyway. There are more than 150,000 records and more than 50 columns. I was just curious if there was a way to try and filter the data first so that the database doesn’t have a bunch of records that aren’t necessary.

I don’t think there’s a column or value that is unique enough to do that. For example, if two people live in the same house, the street address is the same, which means the hash would be identical if I created one from it. Of course, that’s just an illustration using street addresses, but the concept would apply across the data file.

Just a tiny thing: you can call bindParam() before your loop, rather than binding the variables every time you read them. That’s the key difference (I think, I’m sure someone will tell me if I’m wrong) between bindParam() and bindValue() - you’re binding the parameter to the variable itself, whatever it happens to contain when you call execute(). You may have to declare the variables before doing that, I’ve never been quite sure.
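
Something like this minimal sketch, assuming (for brevity) a statement with only three placeholders and reusing the $stmt and $file from your script; because bindParam() binds by reference, the loop only has to reassign the variables before calling execute():

// Bind once, by reference, before the loop
$stmt->bindParam(':id', $id);
$stmt->bindParam(':last_name', $last_name);
$stmt->bindParam(':reg_date', $reg_date);

while (($row = fgetcsv($file)) !== false) {
	// Reassign the bound variables; execute() picks up their current values
	$id        = $row[0];
	$last_name = $row[1];
	$reg_date  = date_format(date_create($row[14]), 'Y-m-d');
	$stmt->execute();
}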

Your existing code is already filtering data using the if($row[0] == NULL) { test; adding a condition to operate only on those rows containing (or not containing) a value in a specific column would be similar.

Explicit binding is not needed. Just supply an array of values to the ->execute([…]) call inside the loop.
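
A minimal sketch of that, again with a shortened placeholder list for brevity and the $stmt and $file from your script:

while (($row = fgetcsv($file)) !== false) {
	$stmt->execute([
		':id'        => $row[0],
		':last_name' => $row[1],
		':reg_date'  => date_format(date_create($row[14]), 'Y-m-d'),
		// ... and so on for the remaining placeholders
	]);
}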

Is there a way to only import certain columns based on the filtered data that I need, or do I have to read the entire file first? My data file has 71 columns and 150,000+ records, so I’m trying to see what my options are so that I don’t upload data I can’t use and can save server resources.

For the filtering you are describing, where you only want to keep rows with the party column == ‘REP’, you will need to loop through all the lines in the file at least once. Since you can do this filtering at the time you insert the data, you might as well just loop over the data that single time.

Note: for the file logic you have shown, without any error handling for the fopen() statement, you had a previous thread where the while() loop looped forever (until PHP’s execution timed out). You need to test whether fopen() was successful before trying to loop over the data. If fopen() fails, feof() is false and ! feof() will always be true.
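
In code, a minimal sketch of that check, using your $filename:

$file = fopen($filename, 'r');
if ($file === false) {
	// Stop here so the while (! feof($file)) loop can never spin forever
	exit('Could not open ' . $filename);
}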

Isn’t that what this does?

while (! feof($file)) {
	$row = fgetcsv($file);
	if ($row[27] != 'REP') {
		continue;
	}

Shouldn’t it always be set to ! feof()? I don’t quite follow you.

MySQL has a wonderful bulk CSV import routine designed for many records (see the link below). It is not straightforward, but it is well worth the effort because the import takes only seconds instead of minutes.

I would first try importing only a few CSV records into a new table without any unique indexes then parse the table and remove duplicates before applying unique indexes. There are scripts available for deleting duplicate records, adding indexes and try/catch blocks that fail gracefully.

Once importing a few CSV records works satisfactorily, gradually increase the number of CSV records until it breaks.

Edit:

https://dev.mysql.com/doc/refman/8.0/en/load-data.html

Thank you @everyone!

I’ve figured it out by using this:


    LOAD DATA INFILE '/path/to/file'
    INTO TABLE contacts
    FIELDS TERMINATED BY '\t'
    ENCLOSED BY '"'
    ESCAPED BY ''
    LINES TERMINATED BY '\r\n'
    IGNORE 1 ROWS

I am curious however if this can be scheduled to run once a week at a certain time (e.g. 6am)? Apparently I cannot schedule this as a stored procedure…

I’m pleased you took the time to learn about LOAD DATA, and I’m also curious how long the task now takes.

Regarding once-a-week scheduling, try crontab.
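
For instance, a minimal sketch, assuming the LOAD DATA statement is wrapped in a small PHP CLI script at a hypothetical path such as /var/www/scripts/import_contacts.php; the crontab line in the comment is only an example schedule (6am every Monday):

<?php
// /var/www/scripts/import_contacts.php (hypothetical path)
// Example crontab entry to run it at 6am every Monday:
//   0 6 * * 1 /usr/bin/php /var/www/scripts/import_contacts.php

include('dbconnect.php');

// LOCAL INFILE may need to be enabled on the connection
// (PDO::MYSQL_ATTR_LOCAL_INFILE => true in the dbconnect.php options)
$pdo->exec("LOAD DATA LOCAL INFILE '/path/to/file'
	INTO TABLE contacts
	FIELDS TERMINATED BY '\t'
	ENCLOSED BY '\"'
	ESCAPED BY ''
	LINES TERMINATED BY '\r\n'
	IGNORE 1 ROWS");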

Thank you @John_Betong! I think I got it!

I created a .sh script to download the files I need, then created a .sql file that contains the LOAD DATA statement. I had to use LOCAL due to some errors at first.

After adding the .sql script to the .sh script, I was able to configure cron to run the .sh script every week as I needed.

It now works perfectly!

