Regex to match file paths at root

nmeri17 · February 19, 2017, 5:14am

The following regex matches all of the URLs below except the first one. I can’t seem to extend it to rope in links like that one while retaining the matches I already have on the other URLs.

regex
var regex = /\/(([\w-]+)(\/([a-z]+(\.([a-z]+)))?)?)?$/gi

links to match

/jquery.js
/login
/login/file.css
/login/
login/random/path/to/file.css

Use case here http://regexr.com/3fb3m

PS: Matching all input strings above is compulsory.

EDIT: Using this /\/(([\w-]+)(\.([a-z]+))?(\/([a-z]+\.([a-z]+))?)?)?$ works, but the results are inconsistent and thus unreliable, i.e. without this new pattern, directories all fall on index 2, file paths fall under index 4 etc. I’d really appreciate any further attempts: http://jsbin.com/lofepiq/edit?js,console

nmeri17 · February 19, 2017, 12:18pm

This pattern does it: /(\/([\w-]+))?((\/([a-z]+(\.([a-z]+)))?)?)?$/gi. It is still open to optimization, though, as it’s greedy enough to match even the end of the line and empty strings.

Demo here
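The over-matching mentioned above can be checked directly. A minimal sketch, assuming the pattern is run without the g flag (so repeated exec() calls carry no lastIndex state) and using two made-up non-path inputs that are not from the thread. Because every group is optional, the only mandatory element is the $ anchor, so a zero-length match at the end of any string succeeds.

```javascript
// The pattern from the post, with the g flag dropped so exec() is stateless.
const pattern = /(\/([\w-]+))?((\/([a-z]+(\.([a-z]+)))?)?)?$/i;

// The last two inputs are illustrative non-paths, not part of the dataset.
const samples = ['/login', '/jquery.js', '', 'not a path!'];

const report = samples.map((input) => {
  const m = pattern.exec(input);
  return { input, matched: m !== null, text: m ? m[0] : null };
});

// Every input "matches"; for the non-paths the matched text is empty.
console.log(report);
```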

Mittineague · February 19, 2017, 6:50pm

Nothing personal, but that regex looks a bit scary to me.

The goal of a good regex is to not only match what you want it to match, but to also not match what you want it to not match.

For me it often helps to indent groupings and “translate” when I’m analyzing them. eg.

/
  (\/    a forward slash
    ([\w-]+)    one or more hyphen or word characters 
  )?    zero or one of the previous
  (
    (\/   a forward slash
      ([a-z]+    one or more alpha characters
        (\.    a dot
          ([a-z]+)    one or more alpha characters
        )
      )?    zero or one of the previous
    )?    zero or one of the previous
  )?    zero or one of the previous
$/gi

All those "zero or one"s don’t make me feel comfortable.

A key to crafting good regex is to have a thorough understanding of what patterns you will be working with. Not all data sets will have enough similarities between their members to allow for an easy regex pattern. Indeed, for some, no pattern at all may be possible.

Maybe some can read regex directly, but for me, just as I often “translate” to analyze a pattern, I also often “translate” when crafting one.
My first step is to get what is hopefully all possible examples that might occur, then roughly list out the “must have”, “always has”, “will never have” etc.

For example, your dataset as posted is

/jquery.js
/login
/login/file.css
/login/
login/random/path/to/file.css

may or may not begin with a slash
always followed by alpha chars
may be followed by either a dot or a slash
always ending with either a slash or an alpha char

Then I look for patterns.
“between” slashes are always only alpha chars. That can be a character set [a-z]
How many? I’m guessing that at least one would always be there, so this would probably be OK unless you need to be more precise. [a-z]+
The strings with file extensions are the only ones that have a dot, and the part containing it is always the ending group. That part is also always only alpha chars followed by a dot followed by alpha chars. So it can be grouped like ([a-z]+\.[a-z]+) and, because it is always last, ([a-z]+\.[a-z]+)$
When something may or may not be there, the choices are “zero or one” (a ?) and “zero or more” (a *)

So a preliminary (but not yet good enough) pattern might be
/(\/)?([a-z]+|[a-z]+\.[a-z]+)$/
This would match

/jquery.js
/login

but not match

/login/file.css
/login/
login/random/path/to/file.css
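A quick way to check these claims is to run the pattern against the dataset. A minimal sketch, assuming “match” means the whole string: the posted pattern has no ^ anchor, so the anchored variant below is an addition made for illustration. Without ^, the pattern can still succeed on just the tail of a longer path.

```javascript
const dataset = [
  '/jquery.js',
  '/login',
  '/login/file.css',
  '/login/',
  'login/random/path/to/file.css',
];

// The preliminary pattern as posted (anchored at the end only).
const preliminary = /(\/)?([a-z]+|[a-z]+\.[a-z]+)$/;

// Assumption: the same pattern with ^ added, so the whole string must match.
const preliminaryWhole = /^(\/)?([a-z]+|[a-z]+\.[a-z]+)$/;

// Strings accepted in full by the anchored variant.
const wholeMatches = dataset.filter((s) => preliminaryWhole.test(s));
console.log(wholeMatches);

// Unanchored, only the last segment of a longer path is picked up.
const tail = preliminary.exec('/login/file.css');
console.log(tail && tail[0]);
```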

The possible ending slash, which is never preceded by a string containing a dot, can be matched by changing the pattern to
/(\/)?([a-z]+(\/)?|[a-z]+\.[a-z]+)$/
So the pattern will now match

/jquery.js
/login
/login/

but not match

/login/file.css
login/random/path/to/file.css

The ones left to be matched are those with one or more sub-folders. When sub-folders are there, they are always only alpha chars enclosed by slashes.
Matching the alpha chars is straightforward, but what to do about the slashes?
If we match the slashes after the alpha chars, what about the ones before, and vice versa?
Perhaps the easiest would be to modify this part of the pattern
[a-z]+(\/)?
Making that “zero or more” should do the trick. So
/(\/)?(([a-z]+(\/)?)*|[a-z]+\.[a-z]+)$/

Analyzing, this translates to

/
  (\/)?   beginning with zero or one slash
  (
    ([a-z]+    one or more alphas
      (\/)?    zero or one slash
    )*    zero or more of the previous
    |    or
    [a-z]+\.[a-z]+    one or more alphas followed by a dot followed by one or more alphas
  )
$/

NOTE
Still not “done” but should be enough to give you the idea of how you could craft a regex. If you can’t figure out what’s wrong with it after a few tries, post back.
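One way to see what is still missing is to run the pattern against the full dataset. A minimal sketch, assuming whole-string matching (the ^ anchor is an addition); the rearranged pattern is only one possible repair and is not taken from the thread.

```javascript
const dataset = [
  '/jquery.js',
  '/login',
  '/login/file.css',
  '/login/',
  'login/random/path/to/file.css',
];

// The pattern from the walkthrough, with ^ added (an assumption) so that a
// match on just the tail of the string does not count.
const walkthrough = /^(\/)?(([a-z]+(\/)?)*|[a-z]+\.[a-z]+)$/;

// One possible rearrangement (hypothetical): optional leading slash, one or
// more alpha segments separated by slashes, then either a trailing slash,
// a file extension, or nothing.
const rearranged = /^\/?[a-z]+(?:\/[a-z]+)*(?:\/|\.[a-z]+)?$/;

const missedByWalkthrough = dataset.filter((s) => !walkthrough.test(s));
const missedByRearranged = dataset.filter((s) => !rearranged.test(s));

console.log(missedByWalkthrough); // the paths with a file inside a folder
console.log(missedByRearranged);

// Whether each pattern accepts the empty string.
console.log(walkthrough.test(''), rearranged.test(''));
```

In the walkthrough pattern, the file name is an alternative to the folder segments rather than something that can follow them, and the starred group also lets the empty string through.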

nmeri17 · March 6, 2017, 12:11pm

First, I want to apologize for not responding earlier. I didn’t get any email notification until a few days back, and I’ve had to cope with no internet connection from then till now.

The second thing I should apologise about is the typo in my data set. ALL inputs definitely begin with a slash, which is why I’m not checking for its absence in my regex. I did pick up something new, though, from your arrangement of patterns. It’s a lot like serving the same food but this time with a fancy dish and napkin. I’ve never quite had the need to read anybody’s regex as the concept is usually inspired by whatever possessed its author at press time. But I believe all that will change with this new emphasis on legibility.

As for my regex, I modified it once again to suit my needs. Going by your style, it would look like this.

/
 (\/  // a leading slash
  ([\w-]+) // word characters, possibly with hyphens somewhere in between
  )? // zero or one: the string is still valid if the previous group is absent
 (
  (\/ // literal slash following the previous group
   ([\w-\s]+ // word characters, possibly with hyphens and spaces somewhere in between
    (\. // literal dot following the last group
     ([a-z]+) // one or more letters
    )
   )?
  )?
 )?
$/gi

The current pattern achieves the primary goal of returning uniform results, i.e. “directories all fall on index 2, file paths fall under index 4 etc.”

Same goes for file extensions and the like. You might be tempted to ask what I need so many question marks for. The problem is that I’m expecting diverse links, and it is essential that my pattern has what it takes to absorb them all. So I need something that still matches when certain parts are absent and returns undefined for those parts, so that proper action can be taken.
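For results that stay uniform regardless of which parts are present, named capture groups are one option. A minimal sketch, assuming an engine with ES2018 named groups (newer than this thread) and at most one directory level, as in the pattern above. The names pathPattern, parsePath, dir, file and ext are hypothetical. Parts that are absent come back as undefined.

```javascript
// Hypothetical rewrite with named groups; the non-capturing (?:...) wrappers
// keep the optional scaffolding out of the result.
const pathPattern =
  /^(?:\/(?<dir>[\w-]+))?(?:\/(?<file>[\w\s-]+\.(?<ext>[a-z]+))?)?$/i;

function parsePath(path) {
  const m = pathPattern.exec(path);
  if (m === null) return null; // the string does not fit the pattern at all
  const { dir, file, ext } = m.groups;
  return { dir, file, ext }; // absent parts are undefined
}

console.log(parsePath('/login'));
console.log(parsePath('/login/'));
console.log(parsePath('/login/file.css'));
console.log(parsePath('/jquery.js'));
```

Each part is looked up by name, so its position no longer shifts when the pattern gains or loses a group.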

Another thing is, JavaScript (or any other language I know) does not accept comments within regexes, so how do you propose I implement your brilliant idea of literate patterns?
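JavaScript regex literals have no comment or free-spacing mode, but a pattern can be assembled from strings, and ordinary JavaScript comments can then sit beside each piece. A minimal sketch using String.raw and the RegExp constructor; the variable names are hypothetical, and the fragments reproduce the pattern analyzed earlier in the thread.

```javascript
// String.raw keeps backslashes as typed, so each fragment reads like a
// piece of a regex literal.
const source = [
  String.raw`(\/`,        // a forward slash
  String.raw`([\w-]+)`,   //   one or more word characters or hyphens
  String.raw`)?`,         // zero or one of the previous
  String.raw`(`,
  String.raw`(\/`,        //   a forward slash
  String.raw`([a-z]+`,    //     one or more alpha characters
  String.raw`(\.`,        //       a dot
  String.raw`([a-z]+)`,   //         one or more alpha characters
  String.raw`)`,
  String.raw`)?`,         //     zero or one of the previous
  String.raw`)?`,         //   zero or one of the previous
  String.raw`)?`,         // zero or one of the previous
  '$',                    // end of input
].join('');

const literate = new RegExp(source, 'i');

// The same pattern written as a literal, for comparison.
const literal = /(\/([\w-]+))?((\/([a-z]+(\.([a-z]+)))?)?)?$/i;

console.log(literate.exec('/login/file.css'));
```

The trade-off is that the pattern is compiled at run time, so a syntax error in a fragment only surfaces when the RegExp constructor is called.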
