How to overwrite a file atomically?

#8

I just had a quick glance and thought that if file_get_contents() is called many times in a short period, it might check the file times and not bother reading the contents again, especially if the size of the file seldom changes.

Trusting the precision of the results would be optimistic :slight_smile:

0 Likes

#9

It could very well be something like this happening

It will use memory mapping techniques if supported by your OS to enhance performance.

Maybe use the less performant fread instead?
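Something like this, perhaps - a rough, untested sketch using only stock functions:

    $fp = fopen($file, 'rb');
    if ($fp !== false) {
        clearstatcache(true, $file);   // filesize() is otherwise served from PHP's stat cache
        $len = filesize($file);
        $contents = ($len !== false && $len > 0) ? fread($fp, $len) : '';
        fclose($fp);
    }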

1 Like

#10

No one seems to have ideas about atomic file writes - I know from my own research that not many people dive into that territory in PHP :slight_smile:

So in order to try out some solutions I've implemented this code for now:

    function atomicFileWrite($file, $contents) {
        $fp = fopen($file, 'c');
        
        if (flock($fp, LOCK_EX | LOCK_NB)) {
            ftruncate($fp, 0);
            fwrite($fp, $contents);
            flock($fp, LOCK_UN);
        }
        
        fclose($fp);
    }

I want to see how flock() copes with the problem. I don't strictly need ftruncate(), but I included it in the script just to test whether the file is really locked - if it is, it should not cause problems.

Ironically, I've found an article claiming that flock() is not atomic in PHP, that it therefore doesn't guarantee a successful lock under concurrent usage, and that hard links should be used instead! :open_mouth: Really weird...

For now I've deployed the above script and will watch if I get any corrupt writes in the next few days.

1 Like

#11

Hm, maybe. I'm still not sure if the problem is with reading or writing (or both). But with fread, how would I make sure the operation is atomic from fopen() to fread()? The file can change in that window - would that cause problems? I think I can only test different solutions, since there doesn't seem to be anything definitive written about it in the manual.
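Maybe a shared lock would cover that window? Something like this - an untested sketch:

    $fp = fopen($file, 'rb');
    if ($fp !== false) {
        if (flock($fp, LOCK_SH)) {      // shared lock: waits while a writer holds LOCK_EX
            $stat = fstat($fp);         // the size as seen under the lock
            $contents = $stat['size'] > 0 ? fread($fp, $stat['size']) : '';
            flock($fp, LOCK_UN);
        }
        fclose($fp);
    }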

0 Likes

#12

It is an interesting problem. I usually write to the database and the only times I've used file writes are from an admin page. Hence I've never needed to lock a file to prevent concurrency problems.

This about flock isn't exactly reassuring

Warning
On some operating systems flock() is implemented at the process level. When using a multithreaded server API like ISAPI you may not be able to rely on flock() to protect files against other PHP scripts running in parallel threads of the same server instance!

fcntl looks interesting, but I have no experience with it.

I'm afraid I can't think of anything now that wouldn't be a kludgy hacky mess more likely to introduce more issues than resolve any.

0 Likes

#13

Which means we should assume flock() may not work at all?

I think a solid solution could be an overwrite via rename(), because rename() is supposed to be atomic - at least that's what people say... But I don't know about performance - a rename is an additional file operation.
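Something like this is what I have in mind (untested sketch; as far as I know the temp file must sit on the same filesystem as the target for the rename to be atomic):

    $tmpFile = $file . '.tmp.' . getmypid();   // same directory, hence same filesystem
    file_put_contents($tmpFile, $contents);
    rename($tmpFile, $file);                   // atomically replaces the old file on POSIX systems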

Interestingly, I've now found a comment on php.net that describes my problem exactly. According to it, the problem is not with file_put_contents() but with file_get_contents(), and it's necessary to lock the file when reading, too. It's odd that the author uses file_get_contents() after opening the file with fopen(), but I think that could be changed to fread(). I'll need to test that, too.

0 Likes

#14

Linux + most other OSes use Advisory Locking, which means locking only works if all readers + writers voluntarily cooperate + use the same locking strategy.

Linux Mandatory Locking can be achieved + requires a good bit of complexity...

1) Mounting your filesystem with the -o mand mount option (on the command line or via /etc/fstab).

2) Then setting the special group permission bits on the actual file: setgid on, group execute off (chmod g+s,g-x file).

http://www.thegeekstuff.com/2012/04/linux-file-locking-types/ provides a simple/thorough overview.

Linux Mandatory Locking == only for the stout of heart.

0 Likes

#15

I've read this. The real question is: do PHP readers and writers - I mean PHP's own functions - cooperate? I realize an external program running on the system may not respect the advisory lock, but shouldn't PHP respect its own locks?

0 Likes

#16

Sounds like a perfect time to use an SQLite3 database instead. One file, less I/O overhead, portable and all the locking logic already handled for you.
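For instance, with the built-in SQLite3 class the locking is transparent - an illustrative sketch (path and table name made up):

    $db = new SQLite3('/path/to/chat.db');                           // hypothetical database file
    $db->exec('CREATE TABLE IF NOT EXISTS kv (val TEXT NOT NULL)');  // safe to run every time
    $db->exec("UPDATE kv SET val = 'hello'");                        // SQLite serializes writers for you
    $val = $db->querySingle('SELECT val FROM kv');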

Work smarter, not harder :stuck_out_tongue:

0 Likes

#17

Except SQLite is very slow compared to plain file_put_contents and the like. Sure, you can tweak it not to fsync on every update, but its overhead is still large - relatively speaking. When you have to do 50 updates (or more) and 50 selects every second, every millisecond matters. Establishing an SQLite connection alone takes more time than file_put_contents.

0 Likes

#18

That might hold true if you only had one person chatting. You are not just doing file_put_contents(), though. It shouldn't be a problem anywhere near the 2 seconds you talked about earlier.

It's easy enough to actually time/test (very few lines of code needed). When dealing with multiple files for this sort of thing, SQLite3 can actually be faster in many cases with such small records.
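For example, a quick-and-dirty harness along these lines (path and payload made up) gives a rough per-call figure:

    // time a batch of locked writes and report the average (illustrative only)
    $iters = 1000;
    $t0 = microtime(true);
    for ($i = 0; $i < $iters; $i++) {
        file_put_contents('/tmp/bench.txt', "payload $i", LOCK_EX);
    }
    $elapsed = microtime(true) - $t0;
    printf("avg write: %.4f ms\n", $elapsed / $iters * 1000);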

I used to write networked locking B-tree code in C - not easy stuff to get right. I consider my programming time more costly than computer time, though...

It's certainly an interesting PHP problem - has memcached written all over it for a larger scale.

0 Likes

#19

In general, no.

This is code you have to write or inject into existing code, to manage the locks.

There's a suggestion to use SQLite3, which might resolve your situation, as all the locking is managed in SQLite3 for you.

0 Likes

#20

2 seconds is the interval for one user only; when I have 100 users at the same time, that is roughly 50 writes per second. Now, I don't do 50 writes per second to a single file, because each user has their own separate file, but if I had a single SQLite database it would be written to 50 times per second - I'm worried performance would suffer with so many SQLite connections trying to lock and write to the database almost simultaneously.

Unless I used a separate SQLite database for each user. I might actually try that out and time it, since I'm curious myself how it will perform. This is just for learning purposes, since I don't think anything will beat the idea of using touch() to store just a timestamp. But I'm willing to give SQLite a chance!

0 Likes

#21

Okay, I've done some real-life tests in recent days to see which methods work well for concurrent writes and how they perform. Let me share the results. This was done with PHP 7 on a Debian Linux server; I don't know its exact configuration because it was set up by my hosting company, but it certainly runs on traditional hard drives and has Opcache enabled.

I decided to do a heavier test - a single file for all users, to increase concurrency and the chance of collisions. I don't get 50 requests per second yet, but sometimes there may be up to 20, and in this test each request randomly did either one write or one read to the same file, so at times there could be up to 10 writes per second.

Also, I made the written content self-checking so that I would immediately detect any kind of corruption in reading or writing. An example:

2017-07-21 15:09:43|8aafed7218a3d566644c45b5094c171b

The hash is the md5 of the timestamp, and I verified it on every read.
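In code, the check was roughly this (simplified):

    // writer side: timestamp plus its md5
    $ts   = date('Y-m-d H:i:s');
    $line = $ts . '|' . md5($ts);

    // reader side: recompute the hash to catch torn or corrupt reads
    list($readTs, $hash) = explode('|', trim($line), 2);
    $ok = hash_equals(md5($readTs), $hash);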

First, the idea I posted above failed the concurrency test when reads were done with file_get_contents:

function atomicFileWrite($file, $contents) {
    $fp = fopen($file, 'c');

    if (flock($fp, LOCK_EX | LOCK_NB)) {
        ftruncate($fp, 0);
        fwrite($fp, $contents);
        flock($fp, LOCK_UN);
    }

    fclose($fp);
}

I was getting many empty-string results from file_get_contents. The flock alone didn't help, because file_get_contents takes no lock of its own, so readers could catch the file at the moment ftruncate had just emptied it. When I removed ftruncate, this code did not fail in concurrent runs, but then it would only work as long as the length of the file content never changed. So I kept looking.

Now I will post other methods and all of them passed the concurrency test for me.

Method 1:

Write code:

file_put_contents($file, $contents, LOCK_EX);

Read code:

function fileRead($file) {
    $fp = fopen($file, 'r');
    $locked = flock($fp, LOCK_SH);

    if (!$locked) {
        // this never actually executed in my tests - probably not needed:
        fclose($fp);
        return false;
    }

    $cts = stream_get_contents($fp);

    flock($fp, LOCK_UN);
    fclose($fp);

    return $cts;
}

Now this is interesting, because I didn't know about stream_get_contents() - an often better alternative to fread() that doesn't need the length to be specified. The whole fileRead() function also performed pretty well, roughly 0.04 milliseconds on average, while file_get_contents() ran at 0.03 milliseconds in previous tests. Most writes with file_put_contents() took about 0.09 ms, with occasional spikes reaching even 500 ms - but that's understandable on a live server, where sometimes the write buffer needs to be flushed to disk.

Conclusion: if read locks cooperate with file_put_contents' LOCK_EX, then they do seem to actually work. This may be treated as a modernized version of this.

Method 2:

Read code:

$contents = @file_get_contents($file);

Write code:

$tmpFile = "$dir/" . uniqid('', true);
file_put_contents($tmpFile, $contents);
rename($tmpFile, $file);

As expected, this was also a solid performer, and a rename really does seem to be atomic (note that the temporary file is created in the same directory as the target, so the rename never crosses filesystems - which, as far as I know, is required for atomicity). However, writes were slower with this method because of the extra rename call - about 0.2 ms on average. Still pretty fast, but measurably slower than plain file_put_contents. There were also occasional spikes to larger values, but they were less frequent and not as high.

Method 3:

Using SQLite (version 3.8.7.1) with a very simple 1-row table.

Write code:

$dbExists = is_file($dbFile);
$db = new PDO('sqlite:' . $dbFile, '', '', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION
]);
$db->exec("PRAGMA synchronous=OFF");

if ($dbExists) {
    $db->exec("UPDATE t SET val=" . $db->quote($contents));
}
else {
    $db->exec("CREATE TABLE t (
        val text NOT NULL
    )");
    $db->exec("INSERT INTO t VALUES(" . $db->quote($contents) . ")");
}

Read code:

$db = new PDO('sqlite:' . $dbFile);
$contents = $db->query("SELECT val FROM t")
    ->fetchColumn();

Obviously I didn't get any corrupt writes or reads - as expected, SQLite managed all the locking for me very well. (In theory, the two very first writers could race on the CREATE TABLE; CREATE TABLE IF NOT EXISTS would make the setup safer.) However, the performance was much worse than with plain files. An average write (including the db connection) took about 0.7 ms, with occasional spikes up to 500 ms. An average read was about 0.45 ms. Still very fast, but a lot slower compared with plain files.

When I tried the same without PRAGMA synchronous=OFF, the performance went downhill immediately - roughly 40 ms per write, with frequent spikes to 500 or 1000 ms. Clearly SQLite began choking under the more frequent requests.

Method 4:

The clear winner - using touch() just to store the timestamp.

Write code:

touch($file);

Read code:

$timestamp = @filemtime($file);

Simple, efficient and concurrency-safe. Each write (touch) took roughly 0.02 ms (often down to 0.017 ms) and, interestingly, there were no high spikes like with the previous methods - well, there were, but very rare and only up to 0.5 ms. Reads (filemtime) averaged 0.006 ms (6 microseconds). A very solid performer!
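For completeness, the read side including the false check looks roughly like this (simplified sketch; $lastSeen is just an illustrative variable):

    $timestamp = @filemtime($file);        // false if the file doesn't exist (yet)

    if ($timestamp === false) {
        // log it as an error - in my tests this never happened
    } elseif ($timestamp > $lastSeen) {
        // the writer has touched the file since we last looked
        $lastSeen = $timestamp;
    }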


To sum up: there do seem to be ways to overwrite a file atomically - it's just a pity this isn't documented well. I don't know how these would perform on other systems; in particular, I'm not sure whether method 1, using flock, is portable. I suppose the other methods should work fine everywhere.

3 Likes

#22

Are errors or warnings shown if the @ is removed?

I would have thought the file must exist, and that introducing the @ adds extra overhead.

1 Like

#23

No errors, because a few lines later I check whether $timestamp === false and log such cases as errors - and none were found.

As far as I remember, @ introduces overhead only when there is actually an error to suppress. Anyway, I think this overhead has been minimized in PHP 7 - I certainly wouldn't complain about the 6 microseconds I got in the benchmarks :slight_smile:

1 Like

#24

Linux file locks are advisory + only work if every process cooperates correctly.

Your read code must change substantially for this to work correctly.

https://gavv.github.io/blog/file-locks/ provides a good discussion.

That said, using a file for recording user data is... well... less than optimal.

Refer to https://codex.wordpress.org/Transients_API for a more optimal approach.

If you're using WordPress, just use the Transients API.

If you're hand rolling your own CMS code (shudder), then duplicate the WordPress Transients API functionality.

In essence, best to keep this sort of data in memory.
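If you do end up on WordPress, the usage is simple - a sketch (the key name is made up):

    // set_transient() uses the in-memory object cache when one is configured,
    // and falls back to the database otherwise
    set_transient('last_seen_user_42', time(), 60);    // key, value, TTL in seconds

    $ts = get_transient('last_seen_user_42');          // false if missing or expired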

0 Likes

#25

That's a good article about file locks.

I don't know what you call optimal, but if you read the timing results from my other posts here, they turn out to be very good. Sure, keeping data in memory would be fastest, but on this server I don't have access to any extensions that would let me store data in memory.

If I don't have any PHP extension for accessing memory, then this API will fall back to using the database. That hardly sounds optimal.

Actually, I'd shudder at using WordPress for any serious application to be used within a company. I don't even want to think how far the performance would go downhill...

BTW, a curious question: if I wanted to use the Transients API - where can I download it so I can include it in my project?

0 Likes

#26

Ah the old WordPress is slow myth...

Example, one of my physical machines has been running 250K uniques/hour for... well over a year now.

CPU load runs around 0.5 - so about half of one CPU, continuously... out of 16 CPUs...

This machine exclusively runs a handful of WordPress sites.

Two reasons WordPress runs fast (or slow) - LAMP Stack tuning + Site design.

Adding poorly crafted WordPress code is just like adding poorly crafted code to any site. Performance tanks.

Good code runs fast.

WordPress core is very well crafted (good) code. Many themes + plugins... not so good...

One reason WordPress runs so fast is the Transients API, which avoids file i/o thrash + file locking issues.

0 Likes

closed #27

This topic was automatically closed 91 days after the last reply. New replies are no longer allowed.

0 Likes