SitePoint Sponsor

User Tag List

Results 1 to 20 of 20
  1. #1
    SitePoint Enthusiast
    Join Date
    Jan 2005
    Location
    UK
    Posts
    28
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Help with cleaning up misquoted values in a tab-delimited file

    Hello,

    I'm hoping someone can help me out here because I've almost lost the will to live. I'm not sure if this is the right place but I figure that as you guys are dealing with data then you may well have come up against something similar.

    I'm trying to prepare a large amount of products (about 5000) to insert into a shopping cart system. I have a text file with all the products in with the data delimited by tabs. Amendments need to be made to the data in Excel before its inserted using the carts admin system. This is all fine until I export the data from Excel.

    Aside from Excel replacing random characters with other characters (which has taken me a stupid amount of time to sort out), it surrounds any text field that contains a comma with double quotes (eg "This is a product, description"). The problem is that it only does this to certain fields ie the ones with commas and there doesn't seem to any way to turn it off.

    The system I'm inserting the data into has an option to strip off double quotes but because the data also contains inch marks and valid uses of double quotes the system chokes and can't distinguish between the valid uses and the ones I want it to strip.

    Is there a way to get Excel to either put double quotes around all fields (I could then remove them with a search and replace in BBEdit) OR force it to not use double quotes at all?

    I've looked in the help files and they're next to useless - All they say is "Excel places quotation marks around text with commas" as if its a handy feature.

    If someone could help me out here I would really really really appreciate it.

    Please help me before its too late.

  2. #2
    SitePoint Author silver trophybronze trophy
    wwb_99's Avatar
    Join Date
    May 2003
    Location
    Washington, DC
    Posts
    10,653
    Mentioned
    4 Post(s)
    Tagged
    0 Thread(s)
    It is a handy feature--if it did not quote the fields, you would now have two fields, breaking your table import. And most CSV

    One option would be to import the data into access and use access' text export features to push it out of the db. You could then have it quote all the fields and the system might be able to handle it.

    Another option from access would be using something that can talk to it--like most any scripting package that can talk to ODBC--to generate actual INSERT statements from the data.

  3. #3
    SitePoint Enthusiast
    Join Date
    Jan 2005
    Location
    UK
    Posts
    28
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Thanks for the reply.

    Quote Originally Posted by wwb_99 View Post
    It is a handy feature--if it did not quote the fields, you would now have two fields, breaking your table import. And most CSV
    I don't understand - It's a tab separated file not CSV. Why would having a comma in it create two fields? Surely there must be an option to turn it off?

    Quote Originally Posted by wwb_99 View Post
    One option would be to import the data into access and use access' text export features to push it out of the db. You could then have it quote all the fields and the system might be able to handle it.

    Another option from access would be using something that can talk to it--like most any scripting package that can talk to ODBC--to generate actual INSERT statements from the data.
    Sorry if I seem ungrateful but this just seems like a lot of work. I know nothing about Access. I thought Excel was designed to handle spreadsheets of data. It seems like a pretty basic function to export a file without it adding stuff I haven't asked it to. Is there no other way without involving Access?

  4. #4
    reads the ********* Crier silver trophybronze trophy longneck's Avatar
    Join Date
    Feb 2004
    Location
    Tampa, FL (US)
    Posts
    9,854
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    a code snippet for pre-processing the file with PHP comes to mind. would you be interested in that?

  5. #5
    SitePoint Author silver trophybronze trophy
    wwb_99's Avatar
    Join Date
    May 2003
    Location
    Washington, DC
    Posts
    10,653
    Mentioned
    4 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Magpie View Post
    I don't understand - It's a tab separated file not CSV. Why would having a comma in it create two fields? Surely there must be an option to turn it off?

    Sorry if I seem ungrateful but this just seems like a lot of work. I know nothing about Access. I thought Excel was designed to handle spreadsheets of data. It seems like a pretty basic function to export a file without it adding stuff I haven't asked it to. Is there no other way without involving Access?
    Sorry, I misread your initial post about the tab stuff. Anyhow, the access method is very quick and easy--just import spreadsheet, export text file. The advantage is the access has a much, much better text export wizard which will let you do stuff like not export the quotes. And trust me, I have munged alot of spreadsheets into usable data in my days.

    Which leads me to your second point--spreadsheets are not data. They are NOT schema bound in any meaningful way. They lack row integrety and data type enforcement. Excel is a great business intelligence and arguably one of the best reporting tools on the planet. But it is NOT a database in any meaningful way, even if it can be abused as one.

  6. #6
    SitePoint Wizard bronze trophy bluedreamer's Avatar
    Join Date
    Jul 2005
    Location
    Middle England
    Posts
    3,402
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    It sounds like you need a nice editor to make your changes. have a look at http://csved.sjfrancke.nl/index.html - it's great for editing CSV's with whatever delimeter you have.

  7. #7
    SitePoint Enthusiast
    Join Date
    Jan 2005
    Location
    UK
    Posts
    28
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by longneck View Post
    a code snippet for pre-processing the file with PHP comes to mind. would you be interested in that?
    Yes please. Yes please. Yes please. If it can be done by a running a script on the actual exported .txt file that would be amazing (Is that what you were thinking of?). If it could be done like this it would save me having to go back and redoing the work thats been done to the file since it was exported from Excel.

    I think it might be quite complicated though (although maybe a bit of a challenge ).

    What I think would be needed is to strip off all double quotes that:

    1) Have a tab immediately in front of them but only those that are followed by another double quote immediately before the next tab character in sequence

    2) Have a tab immediately after them (or a line break) but only those that are preceded by another double quote with the previous tab character in sequence immediately before that

    I think it would have to be done like this because some of the data is in quotes and some not and some actually should start/end with a quote. Example data:

    Code:
    [tab]Some data[tab]Some data[tab]Some data[newline] --- (This is correct and should be left untouched)
    [tab]Some "quoted" data[tab]Some "quoted" data[tab]Some "quoted" data[newline] --- (Again left untouched)
    [tab]Speaker Cone 19"[tab]Speaker Cone 19"[tab]Speaker Cone 19"[newline] --- (Untouched)
    [tab]"Welcome" sign[tab]"Welcome" sign[tab]"Welcome" sign[newline] --- (Untouched)
    [tab]"Some data"[tab]"Some data"[tab]"Some data"[newline] --- (Needs stripping)
    [tab]"Some "quoted" data"[tab]"Some "quoted" data"[tab]"Some "quoted" data"[newline] --- (Needs stripping)
    [tab]"Speaker Cone 19""[tab]"Speaker Cone 19""[tab]"Speaker Cone 19""[newline] --- (Needs stripping)
    [tab]""Welcome" sign"[tab]""Welcome" sign"[tab]""Welcome" sign"[newline] --- (Needs stripping)
    It would need to accommodate all of these scenarios. If it is possible I would be eternally grateful.

    Thanks.

  8. #8
    SitePoint Enthusiast
    Join Date
    Jan 2005
    Location
    UK
    Posts
    28
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by wwb_99 View Post
    Sorry, I misread your initial post about the tab stuff. Anyhow, the access method is very quick and easy--just import spreadsheet, export text file. The advantage is the access has a much, much better text export wizard which will let you do stuff like not export the quotes. And trust me, I have munged alot of spreadsheets into usable data in my days.
    The problem with this (although admittedly I didn't mention it in my first post) is that I've made quite considerable changes to the file since it was exported from Excel. I didn't realise that the extra quotes were going to cause such a problem until I tried to insert the "finished" file. So it would mean taking the original file into Access doing the necessary changes again, then exporting, and then redoing all the other changes. Because of that, processing the existing file is a lot more attractive to me - But if that doesn't work I may have to look at using Access as you describe or another method. Thanks.

  9. #9
    SitePoint Enthusiast
    Join Date
    Jan 2005
    Location
    UK
    Posts
    28
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by bluedreamer View Post
    It sounds like you need a nice editor to make your changes. have a look at http://csved.sjfrancke.nl/index.html - it's great for editing CSV's with whatever delimeter you have.
    That looks like a good little package. Unfortunately, I'm on a Mac but I'm sure I could find a PC to use when I needed it. Thanks.

  10. #10
    SitePoint Author silver trophybronze trophy
    wwb_99's Avatar
    Join Date
    May 2003
    Location
    Washington, DC
    Posts
    10,653
    Mentioned
    4 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Magpie View Post
    The problem with this (although admittedly I didn't mention it in my first post) is that I've made quite considerable changes to the file since it was exported from Excel. I didn't realise that the extra quotes were going to cause such a problem until I tried to insert the "finished" file. So it would mean taking the original file into Access doing the necessary changes again, then exporting, and then redoing all the other changes. Because of that, processing the existing file is a lot more attractive to me - But if that doesn't work I may have to look at using Access as you describe or another method. Thanks.
    You can import from tab delimited to access (or excel), and it handles conditional quotes well. All of that is a moot point because you are on a mac and they are a bit hamstrung in the desktop db sector.

    You should be able to get someone to cook up something to process the original file.

  11. #11
    SitePoint Enthusiast
    Join Date
    Jan 2005
    Location
    UK
    Posts
    28
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by wwb_99 View Post
    You should be able to get someone to cook up something to process the original file.
    I feel quite cheeky, but is there anyone that could do that for me? Longneck? Does a similar script already exist that could be modified? Is it even possible or am I going in the wrong direction?

    Thanks

  12. #12
    reads the ********* Crier silver trophybronze trophy longneck's Avatar
    Join Date
    Feb 2004
    Location
    Tampa, FL (US)
    Posts
    9,854
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    it appears you want to eliminate double quotes if the first and the last character of a field is quotes, right? you can do this in SQL, but the exact syntax varies depending on the server type. here's some mysql compatible SQL:
    Code:
    update sometable
       set somefield = substring(somefield, 2, char_length(somefield) -2)
     where somefield like '"%'
       and somefield like '%"'

  13. #13
    SitePoint Enthusiast
    Join Date
    Jan 2005
    Location
    UK
    Posts
    28
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by longneck View Post
    it appears you want to eliminate double quotes if the first and the last character of a field is quotes, right?
    I think that is pretty much it, yes.

    Quote Originally Posted by longneck View Post
    you can do this in SQL, but the exact syntax varies depending on the server type. here's some mysql compatible SQL:
    Code:
    update sometable
       set somefield = substring(somefield, 2, char_length(somefield) -2)
     where somefield like '"%'
       and somefield like '%"'
    The problem is I can't get the correct data into the correct fields in the database due to the excess quotes. By the time it gets into the DB some fields have already been joined together and others have nothing in when they should (Due to the confusion caused by the quote marks). Thats why I thought it may be better to strip the quotes before its gets near the DB?

    Is there a PHP equivalent that could be run prior to insertion?

  14. #14
    SitePoint Enthusiast
    Join Date
    Jan 2005
    Location
    UK
    Posts
    28
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Can anyone help me out with this?

  15. #15
    SitePoint Author silver trophybronze trophy
    wwb_99's Avatar
    Join Date
    May 2003
    Location
    Washington, DC
    Posts
    10,653
    Mentioned
    4 Post(s)
    Tagged
    0 Thread(s)
    Not sure what you have handy to process the file, but you should be able to handle it using a series of string replacements.

    1) Replace all [tab]"
    2) replace all "[tab]
    3) replace all "[newline]
    4) replace all [newline]"

    You will probably lose some of the 'Speaker, 19"' stuff, but you should save the majority of the data.

    If you cannot manage to do that yourself, you should be able to find someone to do it in the looking to hire forum here.

  16. #16
    SitePoint Enthusiast
    Join Date
    Jan 2005
    Location
    UK
    Posts
    28
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hi, thanks again for your response.

    I've already done many search and replaces on the file to correct all the junk that Excel put into it and I'm comfortable doing this. The problem I have is as you describe - There are many valid instances of " at the start and end of the fields which I don't want to strip. There are just too many to not worry about them (at least 100s and maybe a lot more) and it would mean checking through all the products after the search and replace.

    I don't think there is any way to distinguish the valid from the invalid using a standard search and replace. The only way to "tell them apart" is as Longneck described above (i.e. the invalid ones are those where the first and the last character of a field is quotes).

    Is there a way to do this in PHP rather than SQL? Should I be posting this in the PHP forum?

    Thanks.

  17. #17
    SitePoint Author silver trophybronze trophy
    wwb_99's Avatar
    Join Date
    May 2003
    Location
    Washington, DC
    Posts
    10,653
    Mentioned
    4 Post(s)
    Tagged
    0 Thread(s)
    The technique longneck described above is essentially finding and replacing anything with quotes at the edge of a field. And, yes, you could do this with PHP. You are probably going to need to use some regular expressions to get there.

    In furtherance of the 'fix it with php' effort, I am moving this to the PHP forum.

  18. #18
    SitePoint Enthusiast
    Join Date
    Jan 2005
    Location
    UK
    Posts
    28
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Thanks for moving this.

    Quote Originally Posted by wwb_99 View Post
    The technique longneck described above is essentially finding and replacing anything with quotes at the edge of a field. And, yes, you could do this with PHP. You are probably going to need to use some regular expressions to get there.

    In furtherance of the 'fix it with php' effort, I am moving this to the PHP forum.

    Could anyone help me with this? Regular expressions are a bit beyond me...

    To clarify, what I'm after is a script that will strip the first and last character from all fields but only if both those characters (First and last in that field) are double quotes.

    So it would strip: "Speaker"
    But not: Speaker 19"

    If someone could help me out with this I would really appreciate it.
    Thanks.

  19. #19
    SitePoint Enthusiast
    Join Date
    Jan 2005
    Location
    UK
    Posts
    28
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Anyone?

  20. #20
    SitePoint Member
    Join Date
    Jun 2008
    Posts
    3
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I realize this post is almost a year old but... In case anyone else stumbles across it, here's a function that will remove quotes from the beginning and end of a string only if there is a quote at the beginning AND the end. Should work for all of the cases outlined by magpie in the Jun 26, 2007 15:35 post but would fail under conditions like:
    [tab]"quoted text" other text "more quoted text"[tab] --- (This would be stripped even tho the desired result would be to have it left untouched)

    I'm sure regex gurus could do a lot better but.. For a quick and dirty, this should work.


    function stripExcelQuotes($string) {
    if (substr($string, 0, 1) == '"' && substr(strrev($string), 0, 1) == '"') {
    return substr($string, 1, strlen($string)-2);
    }
    else {
    return $string;
    }
    }


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •