Load Data Infile for import CSV file not linear in MySQL

Hi,

I use MySql RDBMS

mysql> SELECT VERSION();
+------------------+
| VERSION()        |
+------------------+
| 5.1.51-community |
+------------------+
1 row in set

I need insert into one MySql table this CSV file updated each day and as pipeline delimited

||||||||||||||||||||||||||||||||||||||||||||||||||
|Table1|||||||||||||||||||||||||||||||||||||||||||||||||            
||||||||||||||||||||||||||||||||||||||||||||||||||          
N|IDI  |TEST|START DATE HOUR    |CAUSE|KIND|NUMB|NAMES|         
1|10704|    |21/07/2020 15:05:54|L    |MT  |2786|NAV  |         
2|10660|    |21/07/2020 09:27:31|L    |MT  |4088|PIS  |     
||||||||||||||||||||||||||||||||||||||||||||||||||          
|Table2|||||||||||||||||||||||||||||||||||||||||||||||||            
||||||||||||||||||||||||||||||||||||||||||||||||||          
N|IDI  |TEST|START DATE HOUR    |END DATE HOUR      |LENGHT  |RETURNS         |CAUSE|KIND|NUMB|NAMES|           
1|10710|    |21/07/2020 19:34:00|21/07/2020 20:19:09|00:45:09| -              |L    |MT  |7806|ACC  |
2|10708|    |21/07/2020 18:28:12|21/07/2020 18:28:13|00:00:01| -              |T    |MT  |2600|LIT  |       
3|10700|    |21/07/2020 14:16:37|21/07/2020 15:19:13|01:02:36|21/07/2020 17:00|L    |MT  |4435|UHI  |       
4|10698|    |21/07/2020 14:06:45|21/07/2020 14:07:22|00:00:37|-               |B    |MT  |5789|TYK  |
5|10674|    |21/07/2020 10:21:04|21/07/2020 10:44:41|00:23:37|21/07/2020 12:30|T    |MT  |6699|FGR  |
||||||||||||||||||||||||||||||||||||||||||||||||||

For insert in table this CSV using LOAD DATA INFILE

TRUNCATE TABLE `table_csv`;

LOAD DATA INFILE 'C:/MySQL/Data/import.csv' 
IGNORE INTO TABLE `table_csv` 
FIELDS TERMINATED BY '|' ESCAPED BY '\\' IGNORE 3 LINES;

Note that the number of columns between table 1 and table 2 are different on CSV file

For table 1 the number of column is 8

For table 2 the number of column is 10

This is my problem

Because the import in MySql table is wrong

+---+-----+----+-------------------+-------------------+--------+-------------------+-----+----+----+-----+
| N |IDI  |TEST|START DATE HOUR    |END DATE HOUR      |LENGHT  |RETURNS            |CAUSE|KIND|NUMB|NAMES| 
| 1 |10704|    |2020-07-21 15:05:54|                   |00:00:00|                   |2786 |NAV |    |     |
| 2 |10660|    |2020-07-21 09:27:31|                   |00:00:00|                   |4088 |PIS |    |     |
| 1 |10710|    |2020-07-21 19:34:00|2020-07-21 20:19:09|00:45:09|                   |L    |MT  |7806|ACC  |
| 2 |10708|    |2020-07-21 18:28:12|2020-07-21 18:28:13|00:00:01|                   |T    |MT  |2600|LIT  |
| 3 |10700|    |2020-07-21 14:16:37|2020-07-21 15:19:13|01:02:36|2020-07-21 17:00:00|L    |MT  |4435|UHI  |
| 4 |10698|    |2020-07-21 14:06:45|2020-07-21 14:07:22|00:00:37|                   |B    |MT  |5789|TYK  |
| 5 |10674|    |2020-07-21 10:21:04|2020-07-21 10:44:41|00:23:37|21/07/2020 12:30:00|T    |MT  |6699|FGR  |
+---+-----+----+-------------------+-------------------+--------+-------------------+-----+----+----+-----+

The first two rows of table 1 have the data moved

+---+-----+----+-------------------+-------------------+--------+-------------------+-----+----+----+-----+
| N |IDI  |TEST|START DATE HOUR    |END DATE HOUR      |LENGHT  |RETURNS            |CAUSE|KIND|NUMB|NAMES| 
| 1 |10704|    |2020-07-21 15:05:54|                   |00:00:00|                   |2786 |NAV |    |     |
| 2 |10660|    |2020-07-21 09:27:31|                   |00:00:00|                   |4088 |PIS |    |     |
+---+-----+----+-------------------+-------------------+--------+-------------------+-----+----+----+-----+

Instead of

+---+-----+----+-------------------+-------------------+--------+-------------------+-----+----+----+-----+
| N |IDI  |TEST|START DATE HOUR    |END DATE HOUR      |LENGHT  |RETURNS            |CAUSE|KIND|NUMB|NAMES| 
| 1 |10704|    |2020-07-21 15:05:54|                   |00:00:00|                   |     |    |2786| NAV |
| 2 |10660|    |2020-07-21 09:27:31|                   |00:00:00|                   |     |    |4088| PIS |
+---+-----+----+-------------------+-------------------+--------+-------------------+-----+----+----+-----+

My table below

DROP TABLE IF EXISTS `table_csv`;
CREATE TABLE `table_csv` (
  `N` varchar(255) DEFAULT NULL,
  `IDI` varchar(255) DEFAULT NULL,
  `TEST` varchar(255) DEFAULT NULL,
  `START DATE HOUR` varchar(255) DEFAULT NULL,
  `END DATE HOUR` varchar(255) DEFAULT NULL,
  `LENGHT` varchar(255) DEFAULT NULL,
  `RETURNS` varchar(255) DEFAULT NULL,
  `CAUSE` varchar(255) DEFAULT NULL,
  `KIND` varchar(255) DEFAULT NULL,
  `NUMB` varchar(255) DEFAULT NULL,
  `NAMES` varchar(255) DEFAULT NULL,
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`ID`),
  UNIQUE KEY `IDI` (`IDI`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

This the csv file

import.csv (1.2 KB)

I mean, my first suggestion would be to preprocess that file into an actual CSV.

The problem you’re seeing is because your file has different columns in Table 1 than Table 2.

Preprocess your file into a proper CSV, adding missing columns as necessary, and the data will load fine.

1 Like

thanks for reply

You suggested split the csv file using whatever tool with aspx, C#, etc ?

I have tried preprocess csv file into a proper CSV using C#, adding missing columns, but the output is

An actual preprocessing of this file into a CSV file would return the following:

N,IDI,TEST,"START DATE HOUR","END DATE HOUR",LENGHT,RETURNS,CAUSE,KIND,NUMB,NAMES           
1,10704,"","21/07/2020 15:05:54",,,,"L","MT",2786,"NAV"         
2,10660,"","21/07/2020 09:27:31",,,,"L",MT",4088,"PIS"     
1,10710,"","21/07/2020 19:34:00","21/07/2020 20:19:09","00:45:09","-","L","MT",7806,"ACC"
2,10708,"","21/07/2020 18:28:12","21/07/2020 18:28:13","00:00:01","-","T","MT",2600,"LIT"       
3,10700,"","21/07/2020 14:16:37","21/07/2020 15:19:13","01:02:36","21/07/2020 17:00","L","MT",4435,"UHI"       
4,10698,"","21/07/2020 14:06:45","21/07/2020 14:07:22","00:00:37","-","B","MT",5789","TYK"
5,10674,"","21/07/2020 10:21:04","21/07/2020 10:44:41","00:23:37","21/07/2020 12:30","T","MT",6699,"FGR"

(CSV stands for Comma Separated Values)

what are you using?
what language for preprocessing?

That? That was the language of my hand.

Could I do it in a language? Sure. I’m going to use PHP ‘pseudo-language’ in the below, but any programming language should be able to follow roughly the same logic pattern:

Foreach Row:
 If row contains more than 11 pipe characters, skip it.
 If row contains exactly 11 pipe characters:
   If the first character of the row is "N", skip it.
   Else explode the row on "|", trim all values for whitespace. 
       Foreach Token:
            If the result is a string (does not match the regex `^\d+$`), wrap it in quotes.
       Take the first 11 tokens, join them with commas, write to output.
 If row contains less than 11 pipe characters:
    If the first character of the row is "N", skip it.
    Else explode the row on "|", trim all values for whitespace.
      Foreach Token:
            If the result is a string (does not match the regex `^\d+$`), wrap it in quotes.
      Take the first 4 tokens, join them with commas.
      Add 4 additional commas to the string.
      Join the next 4 tokens with commas, and add it to the string.
      Output the string.
 

fewer

Correcting my grammar or the number? There’s a trailing pipe on the rows that got my count to 11.

OR? evaluates true

Lets start at the beginning. How exactly is your data ending up in that pipe delimited format in the first place?

Solved using C#

Thanks all for help and suggestions

int posNewColumn = 4;

string input = @"C:\MySQL\Data\import.csv";
string output = @"C:\MySQL\Data\out.csv";

string[] CSVDump = File.ReadAllLines(input);
List<List<string>> CSV = CSVDump.Select(x => x.Split('|').ToList()).ToList();
foreach (List<string> line in CSV)
{
    if (line[1] == "Table2")
    {
        break;
    }
    line.Insert(posNewColumn, line[0] == "N" ? "LENGHT" : string.Empty);
    line.Insert(posNewColumn, line[0] == "N" ? "END DATE HOUR" : string.Empty);
}

File.WriteAllLines(output, CSV.Select(x => string.Join("|", x)));