Saving and retrieving documents with invalid characters

I am trying to store PDF documents using their description as the document name. Since the description may contain spaces and other characters not permitted in a file name, I have used urlencode() to give me a valid document name.

The first one is “Bridget Barling Thesis: A History of Trinity Grammar School” which contains spaces and a colon.

I can open the document from disk, but if I try to open it from localhost, I get a 403 error. If I try from my server, I get a 404.

It is taxing my brain trying to figure out where I’ve gone wrong.

I have written a little script to test it:

<?php
$title = 'Bridget Barling Thesis: A History of Trinity Grammar School';
$name  = urlencode($title) . '.pdf';
echo '<p>Title: ', $title, '</p>', PHP_EOL;
echo '<p>Name: ', $name, '</p>', PHP_EOL;
echo '<p><a href="', $name, '">', $title, '</a></p>', PHP_EOL;

I’m not even sure I’ve explained myself very well! :blush:

403 is “forbidden”; I’m not sure why that would be related to the name.

I wonder whether presenting the url-encoded name as part of a link is url-decoding it at the far end, which is then legitimately not found? I haven’t tried that, just thinking out loud.
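
To make that guess concrete, here is a minimal sketch. It assumes the web server percent-decodes the path of the requested URL before looking for a file on disk, which is how servers such as Apache normally behave. The helper name pathSeenByServer is made up for this example.

```php
<?php
// Hypothetical helper: approximates the name the server looks for on disk
// after it percent-decodes the path part of the requested URL.
function pathSeenByServer(string $href): string
{
    // rawurldecode() turns %3A back into ":" but leaves "+" untouched,
    // which is how the path part of a URL is treated.
    return rawurldecode($href);
}

$title  = 'Bridget Barling Thesis: A History of Trinity Grammar School';
$stored = urlencode($title) . '.pdf';   // the name the file was saved under
$wanted = pathSeenByServer($stored);    // the name the server would look for

echo 'Stored on disk: ', $stored, PHP_EOL;
echo 'Looked up as:   ', $wanted, PHP_EOL;
echo 'Same name?      ', ($stored === $wanted ? 'yes' : 'no'), PHP_EOL;
```

If that assumption holds, the server would be looking for a name containing a literal colon rather than %3A, so it would not match the file as saved.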

That was my thought. When I mouse-over the link it comes up as

http://localhost/etc/Bridget+Barling+Thesis:+A+History+of+Trinity+Grammar+School.pdf

which seems partly unencoded, but why???

I think the problem is with the colon.

Try str_replace

Thanks John. I’d thought of that, but the idea was to be able to take the encoded name and decode it so I got the title back.

Thing is, if it’s URL encoded why does it cause a problem? I must be missing a trick here somewhere.

Well, I don’t get why it’s leaving the colon in place. If I try some sample code, it converts it to %3A as you’d expect. I see why the colon upsets it as it’s a drive separator in a strange place, but I don’t see why it is still there.

403 is “forbidden”. What permissions does the file have? Make sure the user that runs the webserver process has read access to the file.
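
A quick way to check this from PHP is sketched below. Run it through the web server rather than from the command line, so that it reports what the web server's user can see. The helper name describeAccess and the test.pdf path are placeholders for illustration only.

```php
<?php
// Hypothetical helper: report whether a file exists, whether the current
// process can read it, and its permission bits in octal (e.g. "0644").
function describeAccess(string $path): array
{
    clearstatcache();   // avoid stale results from PHP's stat cache
    $exists = file_exists($path);
    return [
        'exists'   => $exists,
        'readable' => $exists && is_readable($path),
        'mode'     => $exists ? substr(sprintf('%o', fileperms($path)), -4) : null,
    ];
}

// Assumed example path; point this at the PDF that returns the 403.
print_r(describeAccess(__DIR__ . '/test.pdf'));
```

On Windows the mode value is less meaningful, but the exists and readable flags should still be useful.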

Permissions are the same as for those that do work. It’s 0644 on the server. On my Windows test server, who can tell? :upside_down_face:

I think as @droopsnoot says the colon is the problem.

If you plan on converting many problem strings, I would create a function using the solution from the first user-contributed note:

https://secure.php.net/manual/en/function.urlencode.php

function myUrlEncode($string) {

    $entities = array('%21', '%2A', '%27', '%28', '%29', '%3B', '%3A', '%40', '%26', '%3D', '%2B', '%24', '%2C', '%2F', '%3F', '%25', '%23', '%5B', '%5D');

    $replacements = array('!', '*', "'", "(", ")", ";", ":", "@", "&", "=", "+", "$", ",", "/", "?", "%", "#", "[", "]");

    return str_replace($entities, $replacements, urlencode($string));
}

That’s the idea, if I can get past first base! Thanks, that could be just the ticket…

Edit: nah, same problemo.

Looking at the function, it appears to be using the two variables in the wrong order. Try swapping them around…

It seems to me the home-made function is doing just what urlencode does and converts a colon to %3A, but something (the browser?) is converting it back before it should…

I don’t think urlencode is the right tool for the job. As the manual says for that function:

This function is convenient when encoding a string to be used in a query part of a URL

(emphasis mine)

But you’re not encoding a query part, you’re encoding a filename!

Have you tried rawurlencode instead of urlencode? That gives Bridget%20Barling%20Thesis%3A%20A%20History%20of%20Trinity%20Grammar%20School for your example, which looks more like a filename than the output of urlencode.
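
A small side-by-side comparison, using the title from the first post, shows the difference. This is only an illustration of the two built-in functions, not a fix for the 403.

```php
<?php
$title = 'Bridget Barling Thesis: A History of Trinity Grammar School';

// urlencode() follows form/query-string rules: spaces become "+".
$query = urlencode($title);

// rawurlencode() follows RFC 3986: spaces become "%20".
$path = rawurlencode($title);

echo $query, PHP_EOL;
echo $path, PHP_EOL;

// Each result decodes back to the original title with its matching decoder.
var_dump(urldecode($query) === $title);
var_dump(rawurldecode($path) === $title);
```

Either way the title can be recovered from the encoded name, so switching functions does not lose the ability to decode it later.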

Thanks, yes. I had been expecting %20 rather than + which puzzled me. Unfortunately I still get Access Forbidden.

Can you try removing the colon from the filename just for testing purposes? At least you’ll then work out whether that’s the issue.

What server are you running this on? If it’s apache, could you post your .htaccess?

Well, I get a 404 rather than a 403. (Same if I replace the colon with an exclamation mark).

It is Apache, but I have no .htaccess.

That’s really strange. What if you just name it test.pdf or some other name that should work without any encoding?

Sounds like the file is there, but somehow Apache won’t allow the user to access it.
So either the directory the file is in is not accessible to Apache, or there is some directive in the Apache config somewhere that prevents access.

Have you checked Apache’s access and error logs?

It does seem to be weird. If I use the name “test” it works; “test 1” does not (404).

I checked the source code. The anchor points at test%201.pdf and I have a file with that name in the same directory, but I get a 404.

There’s nothing in the error log since yesterday.
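
One thing that might be worth testing, going by the test%201.pdf result: if the file on disk is literally called test%201.pdf, a link to test%201.pdf would presumably be decoded by the server to "test 1.pdf", which does not exist. Below is a rough sketch of encoding the stored name once more for the link. It assumes that decoding is what is going on, and hrefForStoredName is a made-up helper name.

```php
<?php
// Hypothetical helper: build a link target for a file whose on-disk name
// already contains percent signs.
function hrefForStoredName(string $storedName): string
{
    // Encode the stored name again so that one round of decoding by the
    // server yields the literal on-disk name ("%" becomes "%25").
    return rawurlencode($storedName);
}

$stored = rawurlencode('test 1') . '.pdf';   // "test%201.pdf" on disk
$href   = hrefForStoredName($stored);        // "test%25201.pdf" in the link

echo '<p><a href="', $href, '">test 1</a></p>', PHP_EOL;
```

If the link then works, the 404 was down to the server decoding the name once before looking it up.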