Questions about Control Characters

applefritters · November 29, 2014, 10:17pm

I am trying to understand how things like carriage returns work in a PHP/HTML form.

If a user goes into one of my web forms and hits < return >, does that create a hidden character?

And is there any way to replicate that carriage return in my code?

(All of this relates to a larger question, but I’m not sure where to begin, so I figured I would ask a basic question, although it may sound cryptic!!)

Michael_Morris · November 29, 2014, 11:28pm

If a user goes into one of my web forms and hits < return >, does that create a hidden character?

Depends. Most of the time, no, but on a text area field yes. On other fields hitting return causes the form to be submitted.

And is there any way to replicate that carriage return in my code?

"\n" is the newline (line feed) character. You can search for it with str_replace, and echo "A line\n" will put a newline at the end.

applefritters · November 29, 2014, 11:41pm

Hi Michael!

So when I hit the < return > my computer places a hidden character in the Text Area?

Is that called a “control character”?

If I do other things like < tab > or < spacebar > will other hidden characters also appear in the Text Area?

felgall · November 29, 2014, 11:45pm

A carriage return character can be entered as “\r” - it is NOT the same as the “\n” new line character.

On some systems sending a carriage return by itself will return to the start of the current line while sending a new line character by itself will move down a line but keep the horizontal position the same.

Most systems assume that you actually want both and so use one or the other to mean both.
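To make the distinction concrete, here is a minimal sketch showing that "\r" and "\n" are different single bytes, along with one common way to normalize mixed line endings to "\n". The helper name normalizeNewlines and the sample text are hypothetical, not something from this thread.

```php
<?php
// "\r" (carriage return) and "\n" (line feed) are distinct one-byte characters.
$cr = "\r"; // ASCII 13
$lf = "\n"; // ASCII 10

// Hypothetical helper: convert Windows (\r\n) and old Mac (\r) endings to \n.
function normalizeNewlines(string $text): string
{
    // "\r\n" is listed first so a pair is not turned into two line feeds.
    return str_replace(["\r\n", "\r"], "\n", $text);
}

$mixed = "one\r\ntwo\rthree\nfour";
$normalized = normalizeNewlines($mixed);

echo ord($cr), ' ', ord($lf), PHP_EOL;  // prints 13 10
echo json_encode($normalized), PHP_EOL; // prints "one\ntwo\nthree\nfour"
```

Normalizing first means later checks only have to look for one kind of line break.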

applefritters · November 30, 2014, 12:15am

felgall,

What about my questions in Post #3?

Michael_Morris · November 30, 2014, 12:22am

Are there any systems out there that use \r and do not use \n? Macs were that way prior to OS X, but that was a long time ago.

applefritters · November 30, 2014, 1:18am

What is the difference between hitting the < return > and typing \n in a Text Area?

felgall · November 30, 2014, 6:10pm

Pressing return is the equivalent of the \n character, while typing \n gives you the two literal characters \ and n.

applefritters · November 30, 2014, 6:16pm

So if I have PHP code that checks for a carriage return/new line and I hit < enter > then my code should detect it, but if I typed “\n” in a Text Area and hit < enter > then my code would treat the “\n” that I typed as a string?

felgall · November 30, 2014, 6:17pm

yes
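A small sketch of that difference, using hard-coded strings in place of real form input (the variable names are made up for illustration). A browser would normally send \r\n for a line break; plain \n keeps the sketch short.

```php
<?php
// Simulated textarea values; in a real form these would come from $_POST.
$pressedReturn = "line one\nline two"; // double quotes: \n is one line feed character
$typedSlashN   = 'line one\nline two'; // single quotes: a backslash followed by the letter n

// strpos() returns false when the needle is absent, so compare with !== false.
$hasNewlineA = strpos($pressedReturn, "\n") !== false;
$hasNewlineB = strpos($typedSlashN, "\n") !== false;

var_dump($hasNewlineA); // bool(true)
var_dump($hasNewlineB); // bool(false)

// The typed version is one byte longer: two characters instead of one.
var_dump(strlen($pressedReturn)); // int(17)
var_dump(strlen($typedSlashN));   // int(18)
```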

applefritters · November 30, 2014, 6:21pm

I only have access to my MacBook, so how can I test for various control characters?

I found this code off the Internet, but don’t know how to properly test it…

$whitespace = '~(<CR>|<LF>|0x0A|%0A|0x0D|%0D|\\n|\\r|\t|\s)+~i';
$new = trim(preg_replace($whitespace, '', $old));

That is, how would I test all of those control characters?

Suggestions?

P.S. There used to be a way to markup HTML, CSS, PHP, etc on SitePoint. Is that no longer available?
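One way to test without a form is to build the input string in code and then make the hidden characters visible. The sketch below assumes a hypothetical helper named showControlChars; addcslashes() and bin2hex() are standard PHP functions.

```php
<?php
// Hypothetical helper: make control characters visible for debugging.
function showControlChars(string $text): string
{
    // addcslashes() escapes every byte in the given range; \0..\37 (octal)
    // covers the ASCII control characters such as tab, LF, and CR.
    return addcslashes($text, "\0..\37");
}

// Build test input in code instead of typing it into a form.
$old = "a\tb\r\nc";

echo showControlChars($old), PHP_EOL; // prints a\tb\r\nc
echo bin2hex($old), PHP_EOL;          // prints 6109620d0a63
```

In the hex dump, 09 is a tab, 0d is a carriage return, and 0a is a line feed.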

applefritters · November 30, 2014, 9:08pm

Here is an example of where these control characters confuse me…

In my web form I typed the following…

(That is a 1 on line 1, a 2 on line 2, nothing on lines 3 and 4, and a 5 on line 5.)

Here is my test code…

$old = $_POST['comment'];
var_dump($old);

$new = preg_replace('~(\n)~', 'x', $old);
var_dump($new);
exit();

When I submit my form I get…

string '1
2


5' (length=11)

string '1x2xxx5' (length=11)

That happens whether I use \n or \\n in my regex which seems strange, because the first one should check for the control character and the second one should check for a literal.

It seems they are treated interchangeably, but I’m not certain?!
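One detail in the output above is worth a closer look: both strings report length=11, although '1x2xxx5' shows only 7 visible characters. A likely explanation, assuming the browser submitted each line break as \r\n (which browsers normally do for textarea values), is that only the \n of each pair was replaced and the invisible \r bytes are still there. A sketch that reproduces the numbers, with a hard-coded string standing in for $_POST['comment']:

```php
<?php
// Assumed stand-in for $_POST['comment']: four line breaks sent as \r\n.
$old = "1\r\n2\r\n\r\n\r\n5";

// Same replacement as in the test code: only the line feed is replaced.
$new = preg_replace('~(\n)~', 'x', $old);

echo strlen($old), PHP_EOL;      // prints 11
echo strlen($new), PHP_EOL;      // prints 11
echo json_encode($new), PHP_EOL; // prints "1\rx2\rx\rx\rx5"

// Matching the optional \r together with the \n removes the leftovers.
$clean = preg_replace('~\r?\n~', 'x', $old);
echo $clean, PHP_EOL;            // prints 1x2xxx5
echo strlen($clean), PHP_EOL;    // prints 7
```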

droopsnoot · December 1, 2014, 12:46pm

If you hit the tab key the browser will almost certainly take you to the next field (although that isn’t always where you might expect it to be). Spacebar will just generate a space character.

Michael_Morris · December 1, 2014, 1:51pm

Tab can be detected with "\t". It can end up in a text area via copy/paste.

Jeff_Mott · December 1, 2014, 3:17pm

It seems that way because there are two separate processing stages going on: one for the string, and one for the regex.

For simplicity, let’s look at just strings for a moment.

$lineFeed = "\n"; // strlen($lineFeed) === 1
$backslashN = "\\n"; // strlen($backslashN) === 2

The \n is called an escape sequence. It allows us to use ordinary printable characters on our keyboard to represent an un-printable character.

Just as strings allow us this convenience, so too do regular expressions. For simplicity, let’s now look at just regexes.

/\n/ // matches a single character, the line feed character
/\\n/ // matches two characters, the backslash and the lowercase "n"

But now the complication. We need to give this regular expression to a regex library for processing. How do we pass along this regex? As a string!

"/\n/"
"/\\n/"

There are now two processing stages: when evaluating the string and when evaluating the regex. When we have "/\n/", then the string interpretation converts the \n escape sequence into a literal line feed character, and that literal character is what gets sent to the regex engine. It’s now the regex engine’s turn to interpret, but there’s no escape sequence anymore. There’s just the literal line feed.

Whereas when we have "/\\n/", then the string interpretation converts the double backslash (an escaped backslash) into a single backslash, so the value that gets sent to the regex engine is /\n/. It’s now the regex engine’s turn to interpret, and it sees the \n as an escape sequence, so will match a line feed character.

This double layer of processing can get tricky, but you get used to it after a while.
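The two stages can be observed directly. This is a minimal sketch with made-up variable names; it checks the string length after PHP has evaluated the literal, then hands each pattern to preg_match.

```php
<?php
// Stage 1: PHP evaluates the double-quoted string literal.
$patternA = "/\n/";  // 3 characters: slash, line feed, slash
$patternB = "/\\n/"; // 4 characters: slash, backslash, n, slash

echo strlen($patternA), ' ', strlen($patternB), PHP_EOL; // prints 3 4

// Stage 2: the regex engine interprets whatever string it received.
$subject = "first\nsecond";
var_dump(preg_match($patternA, $subject)); // int(1)
var_dump(preg_match($patternB, $subject)); // int(1)

// To match a literal backslash followed by "n", the regex needs \\n,
// which takes four backslashes in a double-quoted PHP string.
$literal = "/\\\\n/";
var_dump(preg_match($literal, $subject));        // int(0)
var_dump(preg_match($literal, 'first\nsecond')); // int(1)
```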

applefritters · December 1, 2014, 5:44pm

Interesting description, Jeff. I think I followed you.

So, since this relates to my other thread on Email Header Injection, which of these would be better to use to replace a Newline character with a zero-length string?

$whitespace = '~\n~i';
$new = preg_replace($whitespace, '', $old);

or

$whitespace = '~\\n~i';
$new = preg_replace($whitespace, '', $old);

It sounds like for what I am trying to do, either would work.

What do you think?

felgall · December 1, 2014, 6:41pm

Yes - either would work in this situation.
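Both patterns in the question are in single quotes, where PHP treats \\ as an escaped backslash but leaves \n alone. So the two literals should produce the very same five-character string, and it is the regex engine that reads \n as a line feed. A quick check, with a made-up header string as input:

```php
<?php
$a = '~\n~i';  // single quotes: \n stays as backslash + n
$b = '~\\n~i'; // single quotes: \\ collapses to one backslash

var_dump($a === $b);  // bool(true)
var_dump(strlen($a)); // int(5)

// Either way the regex engine sees \n and matches a line feed.
$old = "To: someone\nBcc: injected";
$new = preg_replace($a, '', $old);
echo $new, PHP_EOL;   // prints To: someoneBcc: injected
```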

applefritters · December 1, 2014, 7:06pm

If I do something like this, should it catch all cases where there is a Carriage Return and/or Line Feed?

$whitespace = '~(<CR>|<LF>|0x0A|%0A|0x0D|%0D|\r|\n|\t)~i';

$new = preg_replace($whitespace, '', $old);

Also, I believe that preg_replace goes one character at a time from left to right and replaces any matches, correct?
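A sketch for trying this out, under a few assumptions: the sample input is made up, and the shorter character-class pattern is an alternative added for comparison, not something proposed in this thread. preg_replace scans the subject from left to right and replaces each non-overlapping match, trying the alternatives at each position.

```php
<?php
// Pattern from the post: literal text such as "<CR>" or "%0A",
// plus real carriage return, line feed, and tab characters.
$whitespace = '~(<CR>|<LF>|0x0A|%0A|0x0D|%0D|\r|\n|\t)~i';

// Made-up input mixing real control characters and literal text.
$old = "a\r\nb\tc%0Ad<cr>e";

$new = preg_replace($whitespace, '', $old);
echo $new, PHP_EOL; // prints abcde

// Assumed alternative: remove only real CR and LF characters.
$crlfOnly = preg_replace('~[\r\n]+~', '', $old);
echo json_encode($crlfOnly), PHP_EOL;
```

Whether literal text such as %0A needs removing depends on whether the value gets URL-decoded again later, which is something to verify for your own code.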
