  1. #1
    SitePoint Guru phantom007's Avatar
    Join Date
    May 2008
    Posts
    725
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)

    Question Getting delimiter from a line

    Hi

    Can someone please tell me the best way to find the delimiter in a given line (not counting spaces)?

    For convenience, we can assume each line contains an email address and use it as an anchor if necessary: the very next character (except a space) after the email would be our delimiter. But there may be cases where the email address comes last and no delimiter follows it.


    Some examples are:

    Ex 1:
    Code:
    jon, doe, abc@gmail.com, 996655
    Ex 2:
    Code:
    abc@gmail.com; doe; ;996655
    Ex 3:
    Code:
    jon# doe# 996655# abc@gmail.com
    Ex 4:
    Code:
    jon doe 96655
    Ex 5:
    Code:
    jon doe 996655 abc@gmail.com

    In examples 4 and 5 above, it should report that no delimiter was found.

    Any help is appreciated.

    Thanks

  2. #2
    SitePoint Mentor bronze trophy
    fretburner's Avatar
    Join Date
    Apr 2013
    Location
    Brazil
    Posts
    1,256
    Mentioned
    32 Post(s)
    Tagged
    4 Thread(s)
    Something like this should do what you want:
    PHP Code:
    $input = "jon# doe# 996655# abc@gmail.com";
    $segments = explode(' ', $input);
    $last_char = substr($segments[0], -1);
    We split the string on each space and get an array of segments, then grab the last character from the first segment, which should give you your delimiter.

    Edit: Just re-reading your OP, you'd also have to loop through all the elements except the last one to check if the values for $last_char are the same, otherwise the delimiter is missing.
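The check described in the edit above could look roughly like this. It is only a sketch: `guessDelimiterBySpaces` is a made-up name, and it assumes that fields are always followed by a delimiter and a space, and that letters and digits are never delimiters.

```php
<?php
// Hypothetical helper (not from the thread): guess the delimiter of a line
// whose fields are separated by "<delimiter><space>".
function guessDelimiterBySpaces($line)
{
    // Split on runs of whitespace and drop empty pieces.
    $segments = preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY);
    if (count($segments) < 2) {
        return null; // a single segment cannot reveal a delimiter
    }

    array_pop($segments); // the last segment has no trailing delimiter

    $candidate = null;
    foreach ($segments as $segment) {
        $lastChar = substr($segment, -1);
        // Assumption: letters and digits are field content, not delimiters.
        if (ctype_alnum($lastChar)) {
            return null;
        }
        if ($candidate === null) {
            $candidate = $lastChar;
        } elseif ($candidate !== $lastChar) {
            return null; // segments disagree, so no consistent delimiter
        }
    }
    return $candidate;
}

var_dump(guessDelimiterBySpaces('jon# doe# 996655# abc@gmail.com')); // string(1) "#"
var_dump(guessDelimiterBySpaces('jon doe 96655'));                   // NULL
```

Like the original snippet, this still depends on the spaces being present in the data.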

  3. #3
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,810
    Mentioned
    141 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by fretburner View Post
    Something like this should do what you want:
    PHP Code:
    $input = "jon# doe# 996655# abc@gmail.com";
    $segments = explode(' ', $input);
    $last_char = substr($segments[0], -1);
    We split the string on each space and get an array of segments, then grab the last character from the first segment, which should give you your delimiter.
    That works well for examples 1-3, but 4 and 5 would fail that test. Also, if the spaces were just theoretical for showing you the components and not actually in the data that would fail too.

    @cancer10 ; are there any assumptions we can make? Can we assume it will be a non-alphabetical/numerical character? Meaning primarily it should be a punctuation mark or special character?
    Be sure to congratulate xMog on earning April's Member of the Month
    Go ahead and blame me, I still won't lose any sleep over it
    My Blog | My Technical Notes

  4. #4
    SitePoint Guru phantom007's Avatar
    Join Date
    May 2008
    Posts
    725
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    Hi

    Thanks for the reply.

    I forgot to mention that there might not be any space between the fields, so in that case the above will not work.


    Thanks

  5. #5
    SitePoint Guru phantom007's Avatar
    Join Date
    May 2008
    Posts
    725
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by cpradio View Post
    That works well for examples 1-3, but 4 and 5 would fail that test. Also, if the spaces were just theoretical for showing you the components and not actually in the data that would fail too.

    @cancer10 ; are there any assumptions we can make? Can we assume it will be a non-alphabetical/numerical character? Meaning primarily it should be a punctuation mark or special character?
    Hi Thanks for reply,

    No, because special characters can also appear inside the actual field values. For example:


    Code:
    jon's, doe, 996655, abc@gmail.com
    
    OR this one...
    
    
    'my name is joe and my mob # is 2525' #  abc@gmail.com
    The problem is that users from all over the world will be uploading CSVs with all sorts of delimiters, so I can't be sure what they will upload. I am just trying to think of a way to handle it.

  6. #6
    SitePoint Mentor bronze trophy
    fretburner's Avatar
    Join Date
    Apr 2013
    Location
    Brazil
    Posts
    1,256
    Mentioned
    32 Post(s)
    Tagged
    4 Thread(s)
    Could you not either specify the delimiter that must be used, or prompt the user to tell you which they are using?

  7. #7
    SitePoint Guru phantom007's Avatar
    Join Date
    May 2008
    Posts
    725
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by fretburner View Post
    Could you not either specify the delimiter that must be used, or prompt the user to tell you which they are using?
    No, that would have been very easy to implement.

    The challenge is to get the delimiter from the csv with some AI

    So, like I said in my first post, the very next special character after the email (ignoring spaces) can be used as the delimiter, or, if the email comes last, we can take the first special character. But I am not sure this is the proper way.
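For illustration, the email-anchor idea might be sketched as below. `guessDelimiterNearEmail` is a hypothetical name, the email pattern is deliberately loose, and the fallback for an email in last position (taking the character just before it) is only one possible reading of the rule above, not something settled in the thread.

```php
<?php
// Hypothetical helper (not from the thread): use the email address as an
// anchor and look at the nearest non-space character beside it.
function guessDelimiterNearEmail($line)
{
    // Loose email pattern; an assumption, not full address validation.
    $pattern = '/[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+/';
    if (!preg_match($pattern, $line, $m, PREG_OFFSET_CAPTURE)) {
        return null; // no email, nothing to anchor on
    }
    $start = $m[0][1];
    $end   = $start + strlen($m[0][0]);

    // First choice: the next non-space character after the email.
    $after = ltrim((string) substr($line, $end));
    if ($after !== '') {
        return ctype_alnum($after[0]) ? null : $after[0];
    }

    // Email is the last field: fall back to the character just before it.
    $before = rtrim(substr($line, 0, $start));
    if ($before !== '' && !ctype_alnum(substr($before, -1))) {
        return substr($before, -1);
    }
    return null;
}

var_dump(guessDelimiterNearEmail('abc@gmail.com; doe; ;996655'));  // string(1) ";"
var_dump(guessDelimiterNearEmail('jon doe 996655 abc@gmail.com')); // NULL
```

A line without any email address (example 4 in the first post) also yields NULL here.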

  8. #8
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,810
    Mentioned
    141 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by cancer10 View Post
    Hi Thanks for reply,

    No, because special characters can also appear inside the actual field values. For example:


    Code:
    jon's, doe, 996655, abc@gmail.com
    
    OR this one...
    
    
    'my name is joe and my mob # is 2525' #  abc@gmail.com
    The problem is that users from all over the world will be uploading CSVs with all sorts of delimiters, so I can't be sure what they will upload. I am just trying to think of a way to handle it.
    Both of those examples still fit my question. Neither the , or # are alphabetical or numerical. They are punctuation/special characters (ie: ,.;:'"?[]{}/\|`~!@#$%^&*()=+-_\t\s)

    If we can assume the delimiter will be one of those, the process becomes a bit easier, but if we can't safely assume that, then we have a problem. I just came across this through a search, which may be interesting:
    http://www.codeproject.com/Articles/...-CSV-separator

  9. #9
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,810
    Mentioned
    141 Post(s)
    Tagged
    0 Thread(s)
    This one also piqued my interest:
    http://www.powertheshell.com/autodet...csv-delimiter/

  10. #10
    SitePoint Guru phantom007's Avatar
    Join Date
    May 2008
    Posts
    725
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by cpradio View Post
    Both of those examples still fit my question. Neither the , or # are alphabetical or numerical. They are punctuation/special characters (ie: ,.;:'"?[]{}/\|`~!@#$%^&*()=+-_\t\s)

    If we can assume the delimiter will be one of those, the process becomes a bit easier, but if we can't safely assume that, then we have a problem. I just came across this through a search, which may be interesting:
    http://www.codeproject.com/Articles/...-CSV-separator
    Ok so if we agree to that, how do we detect which one of those is our delimiter for the following case?


    Code:
    foo.bar#example.com#"I like using ""#"" or ""."" as a CSV delimiter."

    Thanks

  11. #11
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,810
    Mentioned
    141 Post(s)
    Tagged
    0 Thread(s)
    The linked article would deduce that the # is the delimiter because it has a "quote" checker to ensure that any special characters within quotes are not counted as delimiter candidates.
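As a rough illustration of such a quote check (a sketch, not the linked article's code; `countOutsideQuotes` and the candidate list `#.,;:|` are assumptions), the following counts candidate characters only while outside double quotes. Note that on this single line the check removes the quoted `#` and `.`, but both characters still appear twice outside the quotes, so further lines or rules would be needed to choose between them.

```php
<?php
// Hypothetical helper (not from the linked article): count candidate
// delimiter characters that appear outside double quotes.
function countOutsideQuotes($line, $candidates)
{
    $counts = array();
    $quoted = false;
    $length = strlen($line);
    for ($i = 0; $i < $length; $i++) {
        $char = $line[$i];
        if ($char === '"') {
            // A doubled quote ("") toggles twice, so the state is unchanged.
            $quoted = !$quoted;
        } elseif (!$quoted && strpos($candidates, $char) !== false) {
            $counts[$char] = isset($counts[$char]) ? $counts[$char] + 1 : 1;
        }
    }
    return $counts;
}

$line = 'foo.bar#example.com#"I like using ""#"" or ""."" as a CSV delimiter."';
print_r(countOutsideQuotes($line, '#.,;:|'));
```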

  12. #12
    SitePoint Guru phantom007's Avatar
    Join Date
    May 2008
    Posts
    725
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by cpradio View Post
    The linked article would deduce that the # is the delimiter because it has a "quote" checker to ensure that any special characters within quotes are not counted as delimiter candidates.

    Those are not in PHP

  13. #13
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,810
    Mentioned
    141 Post(s)
    Tagged
    0 Thread(s)
    Yes, I realize that, but the logic wasn't too hard to follow. If I have time, I'll try converting one to PHP later on (not sure if I'll have the time though).

  14. #14
    SitePoint Guru phantom007's Avatar
    Join Date
    May 2008
    Posts
    725
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by cpradio View Post
    Yes, I realize that, but the logic wasn't too hard to follow. If I have time, I'll try converting one to PHP later on (not sure if I'll have the time though).
    I'd really appreciate that.


    Thanks

  15. #15
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,810
    Mentioned
    141 Post(s)
    Tagged
    0 Thread(s)
    Okay, here is the setup (derived from http://www.powertheshell.com/autodet...csv-delimiter/):

    Edit:

    Please see post #43 for the most up-to-date version of this code.


    Code:
    project/
    - files/
    - - colon.txt
    - - comma.txt
    - - mixture.txt
    - - pipe.txt
    - - pound.txt
    - - semicolon.txt
    - csv.php
    - test.php
    The files:
    colon.txt
    Code:
    this:is:"a test":to:123:see:how:it:works
    this: is: "a test": to: 123: see: how: it: works
    123.:can?:you&:see:what:I'm:doing?:eight*:nine
    comma.txt
    Code:
    this,is,"a test",to,123,see,how,it,works
    this, is, "a test", to, 123, see, how, it, works
    123.,can?,you&,see,what,I'm,doing?,eight*,nine
    mixture.txt
    Code:
    this|is|"a test"|to|123|see|how|it|works
    this; is; "a test"; to; 123; see; how; it; works
    123.|can?|you&|see|what|I'm|doing?|eight*|nine
    pipe.txt
    Code:
    this|is|"a test"|to|123|see|how|it|works
    this| is| "a test"| to| 123| see| how| it| works
    123.|can?|you&|see|what|I'm|doing?|eight*|nine
    pound.txt
    Code:
    this#is#"a test"#to#123#see#how#it#works
    this# is# "a test"# to# 123# see# how# it# works
    123.#can?#you&#see#what#I'm#doing?#eight*#nine
    semicolon.txt
    Code:
    this;is;"a test";to;123;see;how;it;works
    this; is; "a test"; to; 123; see; how; it; works
    123.;can?;you&;see;what;I'm;doing?;eight*;nine
    csv.php
    PHP Code:
    <?php
    class CSV
    {
        private $filePath;
        private $fileContents;
        const ACCEPTABLE_DELIMITERS = '~[#,;:|]~'; // acceptable delimiters

        public function __construct($file)
        {
            $this->filePath = $file;
            $this->fileContents = file($file);
        }

        public function getDelimiter()
        {
            $delimitersByLine = array();
            foreach ($this->fileContents as $lineNumber => $line)
            {
                $quoted = false;
                $delimiters = array();

                // file() keeps the line ending, so stop one character early
                for ($i = 0; $i < strlen($line) - 1; $i++)
                {
                    $char = substr($line, $i, 1);
                    if ($char === '"')
                    {
                        $quoted = !$quoted;
                    }
                    else if (!$quoted && preg_match(self::ACCEPTABLE_DELIMITERS, $char))
                    {
                        if (array_key_exists($char, $delimiters))
                        {
                            $delimiters[$char]++;
                        }
                        else
                        {
                            $delimiters[$char] = 1;
                        }
                    }
                }

                if (empty($delimitersByLine))
                {
                    $delimitersByLine = $delimiters;
                }
                else
                {
                    $newDelimitersByLine = $delimiters;
                    foreach ($delimitersByLine as $key => $value)
                    {
                        if ((array_key_exists($key, $delimiters) && $delimiters[$key] === $value)
                            || !array_key_exists($key, $delimiters))
                        {
                            $newDelimitersByLine[$key] = $value;
                        }
                    }
                    $delimitersByLine = $newDelimitersByLine;

                    if (sizeof($delimitersByLine) < 2)
                        break;
                }
            }

            arsort($delimitersByLine);
            $firstDelimiter = key($delimitersByLine);

            if (sizeof($delimitersByLine) > 1)
            {
                next($delimitersByLine);
                $nextDelimiter = key($delimitersByLine);
                if ($delimitersByLine[$firstDelimiter] === $delimitersByLine[$nextDelimiter])
                {
                    // multiple delimiters with the same frequency found
                    // throw an error
                    throw new UnexpectedValueException();
                }

                return $firstDelimiter;
            }
            else
                return $firstDelimiter;
        }
    }
    test.php
    PHP Code:
    <?php
        include('csv.php');

        $comma = new CSV('files/comma.txt');
        echo 'Delimiter for comma.txt is ' . $comma->getDelimiter() . '<br />';

        $colon = new CSV('files/colon.txt');
        echo 'Delimiter for colon.txt is ' . $colon->getDelimiter() . '<br />';

        $pipe = new CSV('files/pipe.txt');
        echo 'Delimiter for pipe.txt is ' . $pipe->getDelimiter() . '<br />';

        $pound = new CSV('files/pound.txt');
        echo 'Delimiter for pound.txt is ' . $pound->getDelimiter() . '<br />';

        $semicolon = new CSV('files/semicolon.txt');
        echo 'Delimiter for semicolon.txt is ' . $semicolon->getDelimiter() . '<br />';

        $mixture = new CSV('files/mixture.txt');
        echo 'Delimiter for mixture.txt is ' . $mixture->getDelimiter() . '<br />';
    The Output:
    Code:
    Delimiter for comma.txt is ,
    Delimiter for colon.txt is :
    Delimiter for pipe.txt is |
    Delimiter for pound.txt is #
    Delimiter for semicolon.txt is ;
    
    Fatal error: Uncaught exception 'UnexpectedValueException' in M:\SVN\sitepoint\trunk\Sitepoint\cancer10\csv.php:75 Stack trace: #0 M:\SVN\sitepoint\trunk\Sitepoint\cancer10\test.php(20): CSV->getDelimiter() #1 {main} thrown in M:\SVN\sitepoint\trunk\Sitepoint\cancer10\csv.php on line 75
    As an attachment:
    cancer10-updated.zip
    Last edited by cpradio; Jun 27, 2013 at 03:00. Reason: Added edit/warning

  16. #16
    SitePoint Guru phantom007's Avatar
    Join Date
    May 2008
    Posts
    725
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    Hi cpradio

    Thanks for your efforts.

    Does it also support tabs?

    Thanks

  17. #17
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,810
    Mentioned
    141 Post(s)
    Tagged
    0 Thread(s)
    Okay, I did find a small issue with my initial code (so I've updated it). It should support any type of delimiter you can think of; you simply have to alter the following line to include \t for tab:
    PHP Code:
    const ACCEPTABLE_DELIMITERS = '~[#,;:|]~'; // acceptable delimiters

    Example:
    tab.txt
    Code:
    this	is	"a test"	to	123	see	how	it	works
    this	 is	 "a test"	 to	 123	 see	 how	 it	 works
    123.	can?	you&	see	what	I'm	doing?	eight*	nine
    Updated ACCEPTABLE_DELIMITERS
    PHP Code:
    const ACCEPTABLE_DELIMITERS = '~[#,;:|\t]~'; // acceptable delimiters
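A quick way to convince yourself that the `\t` in that single-quoted pattern really matches a tab (a small standalone check, not part of csv.php):

```php
<?php
// In a single-quoted PHP string, \t stays as a backslash followed by "t";
// PCRE then interprets that escape as a tab inside the character class.
$pattern = '~[#,;:|\t]~';

var_dump(preg_match($pattern, "\t")); // int(1): a real tab matches
var_dump(preg_match($pattern, 't'));  // int(0): the letter t does not
var_dump(preg_match($pattern, '\\')); // int(0): nor does a backslash

// A tab is invisible in HTML output, so print its character code instead.
echo ord("\t"), "\n"; // 9
```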
    Then update the test.php file to include:
    PHP Code:
        $tab = new CSV('files/tab.txt');
        echo 'Delimiter for tab.txt is ' . $tab->getDelimiter() . '<br />';
    Output (note tab.txt shows empty because you can't visibly see a tab character):
    Code:
    Delimiter for comma.txt is ,
    Delimiter for colon.txt is :
    Delimiter for pipe.txt is |
    Delimiter for pound.txt is #
    Delimiter for semicolon.txt is ;
    Delimiter for tab.txt is 
    
    Fatal error: Uncaught exception 'UnexpectedValueException' in M:\SVN\sitepoint\trunk\Sitepoint\cancer10\csv.php:75 Stack trace: #0 M:\SVN\sitepoint\trunk\Sitepoint\cancer10\test.php(23): CSV->getDelimiter() #1 {main} thrown in M:\SVN\sitepoint\trunk\Sitepoint\cancer10\csv.php on line 75
    Edit:

    Added tab instructions/test
    Last edited by cpradio; Jun 20, 2013 at 10:54. Reason: Added tab instructions

  18. #18
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,810
    Mentioned
    141 Post(s)
    Tagged
    0 Thread(s)
    Here is another neat thing you could do if you don't want to define a range of acceptable delimiters: you can define a range of characters that can't be delimiters.

    Just change this line in csv.php
    PHP Code:
    const ACCEPTABLE_DELIMITERS = '~[#,;:|\t]~'; // acceptable delimiters
    to:
    PHP Code:
    const EXCLUDED_CHARS = '~[a-zA-Z0-9 ]~'; // delimiters can't be letters, numbers or spaces
    And change this line
    PHP Code:
    else if (!$quoted && preg_match(self::ACCEPTABLE_DELIMITERS, $char))
    to:
    PHP Code:
    else if (!$quoted && !preg_match(self::EXCLUDED_CHARS, $char))
    Then everything except a-z, A-Z, 0-9, and spaces can be a delimiter.
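To see what that exclusion pattern lets through, here is a small standalone sketch. `candidateChars` is a hypothetical helper, and quote handling is left out for brevity. It shows that ordinary punctuation inside fields also becomes a candidate, which is why the per-line frequency comparison in csv.php still matters.

```php
<?php
// Pattern from the post above: letters, digits and spaces can't be delimiters.
const EXCLUDED_CHARS = '~[a-zA-Z0-9 ]~';

// Hypothetical helper: list the distinct characters of a line that would
// still be treated as delimiter candidates (quote handling left out).
function candidateChars($line)
{
    $candidates = array();
    for ($i = 0, $n = strlen($line); $i < $n; $i++) {
        $char = $line[$i];
        if (!preg_match(EXCLUDED_CHARS, $char) && !in_array($char, $candidates, true)) {
            $candidates[] = $char;
        }
    }
    return $candidates;
}

// Tab-separated fields that also contain ".", "?" and "&".
print_r(candidateChars("123.\tcan?\tyou&"));
```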

    Edit:

    Updated so tabs work in the EXCLUDED_CHARS version
    Last edited by cpradio; Jun 20, 2013 at 12:30.

  19. #19
    SitePoint Guru phantom007's Avatar
    Join Date
    May 2008
    Posts
    725
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    Thanks again cpradio for your inputs.

    Is it mandatory to define the allowed chars within the square brackets? []

    Because I see you putting all chars inside ~[]~

    Secondly, why is there a "Fatal error: Uncaught exception" in the output of your post #17?

    Thanks

  20. #20
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,810
    Mentioned
    141 Post(s)
    Tagged
    0 Thread(s)
    You can define ACCEPTABLE_DELIMITERS or change it to EXCLUDED_CHARS per post #18. EXCLUDED_CHARS allows you to define which characters can't be delimiters. Think a-z and 0-9 along with spaces (you might want to add " and ' in there as well).

    The fatal error is caused by mixture.txt: it has two possible delimiters that each occur 8 times on a line, so the system can't tell which one should be used in that case (so I have it throw an exception).
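In isolation, that tie check could be sketched like this. `pickDelimiter` is a hypothetical standalone function that works on an array of counts rather than on a file; it is meant to mirror the idea, not to replace getDelimiter().

```php
<?php
// Hypothetical helper mirroring the tie check described above: pick the most
// frequent candidate, but refuse to choose when the top two counts are equal.
function pickDelimiter(array $counts)
{
    if (empty($counts)) {
        return null; // no candidates at all
    }
    arsort($counts);             // highest count first, keys preserved
    $keys = array_keys($counts);
    if (count($keys) > 1 && $counts[$keys[0]] === $counts[$keys[1]]) {
        throw new UnexpectedValueException('Ambiguous delimiter');
    }
    return $keys[0];
}

echo pickDelimiter(array(',' => 8, '.' => 1)), "\n"; // ,

try {
    pickDelimiter(array('|' => 8, ';' => 8)); // same situation as mixture.txt
} catch (UnexpectedValueException $e) {
    echo 'Ambiguous', "\n"; // Ambiguous
}
```

Catching the exception in test.php, as in the try/catch above, would also let the remaining files be reported instead of stopping at the fatal error.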

  21. #21
    SitePoint Guru phantom007's Avatar
    Join Date
    May 2008
    Posts
    725
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    While the "EXCLUDED_CHARS" concept makes more sense, it does not work perfectly for one particular case.

    Consider the CSV file has only these 2 lines:


    EMAIL
    test@gmailcom

    It shows the delimiter as @, which is wrong.

    Thanks

  22. #22
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,810
    Mentioned
    141 Post(s)
    Tagged
    0 Thread(s)
    Well, you don't have a lot of test data there. What did you expect it to come up with if you don't exclude @? There is no delimiter in your example...

  23. #23
    SitePoint Guru phantom007's Avatar
    Join Date
    May 2008
    Posts
    725
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    If the CSV does not have any delimiter then it should return empty value.

  24. #24
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,810
    Mentioned
    141 Post(s)
    Tagged
    0 Thread(s)
    And it might do that, if you exclude @

  25. #25
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,810
    Mentioned
    141 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by cpradio View Post
    And it might do that, if you exclude @
    In fact, it does. With nothing left to count, getDelimiter() returns NULL, which ord() reports as 0, the same value as chr(0). You can see this by outputting ord($email->getDelimiter());. I used the following exclusions:
    PHP Code:
    const EXCLUDED_CHARS = '~[a-zA-Z0-9 @]~'; // delimiters can't be letters, numbers, spaces or @
    Keep in mind the goal here is to find an UNKNOWN delimiter. Unless you give it some indication of the type of characters to seek, it will make a delimiter out of ANYTHING.
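To check the empty case in isolation, the last steps of getDelimiter() can be replayed on an empty candidate list. The final mapping to an empty string is an assumption about the desired behaviour (following post #23), not something csv.php does itself.

```php
<?php
// With every character excluded, no candidates are collected.
$delimitersByLine = array();

arsort($delimitersByLine);
$firstDelimiter = key($delimitersByLine); // key() yields NULL on an empty array

var_dump($firstDelimiter);          // NULL
var_dump($firstDelimiter === null); // bool(true)

// One way a caller could report the "no delimiter" case: map NULL to ''.
$delimiter = ($firstDelimiter === null) ? '' : $firstDelimiter;
var_dump($delimiter); // string(0) ""
```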

