First benchmarks of the ext4 file system

Seeing that development of the ext3 file system's successor has started, and that Andrew Morton has released an -mm patch containing the ext4 file system, I decided to run some simple benchmarks, even at this early stage of development.

Because the mm patch also contains Hans Reiser's reiser4 file system, I decided to run the benchmarks against it too, for good measure. Let me remind you once again that both ext4 and reiser4 are still in development, while ext3 has been in production for many years, so take all the results below with a grain of salt.

Setup

All tests were run on a modern Intel E6600-based desktop equipped with an ATA disk. I made a small partition exclusively for this testing. The file system under test was recreated and freshly mounted for each and every test below, so that caching doesn't get in our way (I could've also used the drop_caches kernel mechanism to get rid of caches, but this is safer). This also means that all tests were done in a pristine environment, not accounting for fragmentation and other effects seen in real environments. ext4 was mounted with mount /dev/hda2 /mnt -t ext4dev -o extents so that extents are enabled. The kernel used was 2.6.19-rc2-mm1. The I/O scheduler was the kernel default, CFQ.
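For the record, the drop_caches mechanism mentioned above works by writing to /proc/sys/vm/drop_caches (available in 2.6.16 and later kernels). A minimal sketch of how it could be used between runs, assuming root privileges, might look like this:

    /* Sketch: drop the clean page cache, dentries and inodes between
     * benchmark runs.  Assumes a 2.6.16+ kernel and root privileges;
     * dirty data has to be synced first or it cannot be dropped. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        sync();     /* flush dirty buffers so the pages become droppable */

        FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
        if (!f) {
            perror("drop_caches");
            return 1;
        }
        fputs("3\n", f);    /* 1 = pagecache, 2 = dentries+inodes, 3 = both */
        return fclose(f) ? 1 : 0;
    }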

Sequential reading/writing

I used the aging bonnie application to make some quick measurements of file system performance when doing sequential I/O operations.

Not many surprises here. While there are some differences among the file systems, I don't think they're particularly interesting. All file systems are able to read sequentially from big files at speeds close to the platter speed (57MB/s in this case).

Sequential writing, on the other hand, shows that there are some real improvements built into ext4. Probably extents, together with delayed block allocation, allow ext4 to come first in this benchmark, leaving the other two file systems a good 20 percent behind.

Creating/deleting small files

Another typical operation file systems perform in some workloads is managing many small files. To measure performance in such an environment I prepared a simple application (make-many-files) whose only task is to create many small files (405,000 in this case), distributing them in a tree-like structure so that we don't measure directory operations only (it's a well-known fact that file system performance drops rapidly once you go over some number of files in the same directory). You can find the source of the application attached to this article if you would like to run it yourself.
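The attached source is the reference; the rough idea is sketched below. The fan-out and payload here are illustrative choices, not necessarily the exact values the attached program uses.

    /* Rough sketch of what make-many-files does: create many small files
     * spread over a directory tree instead of one flat directory.
     * Fan-out and payload are illustrative (90 * 90 * 50 = 405,000 files),
     * not necessarily the exact values of the attached program. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>

    #define DIRS  90    /* top-level directories */
    #define SUBS  90    /* subdirectories per directory */
    #define FILES 50    /* small files per subdirectory */

    int main(void)
    {
        char path[256];

        for (int d = 0; d < DIRS; d++) {
            snprintf(path, sizeof(path), "d%02d", d);
            mkdir(path, 0755);
            for (int s = 0; s < SUBS; s++) {
                snprintf(path, sizeof(path), "d%02d/s%02d", d, s);
                mkdir(path, 0755);
                for (int f = 0; f < FILES; f++) {
                    snprintf(path, sizeof(path), "d%02d/s%02d/f%04d", d, s, f);
                    FILE *fp = fopen(path, "w");
                    if (!fp) { perror(path); exit(1); }
                    fputs("a little bit of data\n", fp);
                    fclose(fp);
                }
            }
        }
        return 0;
    }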

While ext4 shows a slight regression compared to ext3, the real winner here is reiser4. It seems that ext4 could use some improvements in this area (rapid file creation).

Deleting the almost half a million files created in the previous step paints a completely different picture. Here ext3 and ext4 completely dominate reiser4. Once again ext4 is slightly slower than ext3, but not by much.

I also need to mention that for the above two tests reiser4 consumed many more CPU cycles. I'm not including that graph here because it would not be easy to interpret, but it seems the other testers were on the right track when they said that reiser4 is quite heavy on the CPU under some loads.

Postmark

For the final test, I decided to go with a macro benchmark. I ran postmark, a benchmark built around small file operations similar to those performed by large mail and news servers. I used the following configuration for testing:

set read 4096
set write 4096
set transactions 10000
set size 500 500000
set number 5000
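For readers unfamiliar with postmark, the workload this configuration generates looks roughly like the sketch below: an initial pool of small files, followed by a mix of random create/delete and read transactions. This is a simplified illustration only, not the real postmark code (the real tool also mixes in appends and keeps detailed statistics).

    /* Simplified illustration of a postmark-like workload using the
     * configuration above.  Not the real postmark code. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NFILES 5000      /* "set number 5000" */
    #define NTRANS 10000     /* "set transactions 10000" */
    #define MINSZ  500       /* "set size 500 500000" */
    #define MAXSZ  500000
    #define BLK    4096      /* "set read 4096" / "set write 4096" */

    static char buf[BLK];

    static void create_file(int i)
    {
        char name[32];
        snprintf(name, sizeof(name), "pm%05d", i);
        FILE *f = fopen(name, "w");
        if (!f) { perror(name); exit(1); }
        long size = MINSZ + rand() % (MAXSZ - MINSZ + 1);
        for (long off = 0; off < size; off += BLK)
            fwrite(buf, 1, (size - off) < BLK ? (size - off) : BLK, f);
        fclose(f);
    }

    int main(void)
    {
        memset(buf, 'x', sizeof(buf));

        for (int i = 0; i < NFILES; i++)     /* build the initial file pool */
            create_file(i);

        for (int t = 0; t < NTRANS; t++) {   /* transaction phase */
            int i = rand() % NFILES;
            char name[32];
            snprintf(name, sizeof(name), "pm%05d", i);
            if (rand() % 2) {                /* delete and recreate the file */
                remove(name);
                create_file(i);
            } else {                         /* read the whole file back */
                FILE *f = fopen(name, "r");
                if (f) {
                    while (fread(buf, 1, BLK, f) > 0)
                        ;
                    fclose(f);
                }
            }
        }
        return 0;
    }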

What we get, after a few minutes of crunching with postmark, is the number of transactions per second, where a bigger number indicates a better-performing file system. Ext4 has improved over ext3, but once again reiser4 leads the pack.

Conclusion

The ext4 file system promises improved data integrity and performance, together with fewer limitations, and is definitely a step in the right direction. Even if there are some regressions in our measurements compared to ext3, they're quite small and will no doubt be fixed before development is finished. On the other hand, under some workloads ext4 is already showing much better results.

Another surprise of this test is reiser4, which shows the best performance in some tests.

It should also be noted that all the file systems were completely stable during this testing: no crashes, no unexpected behaviour. So feel free to do your own tests, but still be very careful before entrusting your important data to them (except ext3, of course).

Attachment: make-many-files.c (777 bytes)

Comments

alternatives to ext3 and ext4

Ext3 and ext4 have modes for preserving user data; however, the manner in which blocks are written to the disk can be cause for concern. In a transactional file system, user data belonging to the committed state is never overwritten.

Ext3 and ext4 require that a fixed amount of disk space be set aside for the journal itself, in addition to the standard file system metadata. The journal must be large enough to contain the maximum number of events the system could ever need, and its size is determined at format time.

Transactional file systems, like Reliance Nitro by Datalight, have no such requirement; in fact, the space needed to record the dual-state information is smaller than the overhead required by most FAT implementations.

Ext3 and ext4 users can specify whether they want to log all changes to both file data and metadata or whether they want to log only metadata changes. The file system options can be changed, but it should be noted that there is no application programming interface (API) to change the journal options.

With the Datalight Reliance family of file systems, the application developer can set the default model to automatic or timed transactions, and then programmatically disable that mode, perform operations on a whole group of files, perform an explicit transaction point, and then re-enable the default transaction mode.

It's worth looking at.

How about safety?

I do not think performance is the only thing that matters in a file system; what I care about more is safety. Can anyone tell me which is the most stable and least crash-prone file system?

ext3 is a perfect choice, then...

Oh well, whenever you look for stability you never chase new versions of software, let alone re-engineered and enhanced products. And that mantra applies to file systems exceptionally well. It typically takes 5-10 years for a file system to become really stable.

So your safest bet would be ext3: it has been in production for a long time and it is reasonable to expect that it is really stable by now. It's also very well supported and will remain so for some time to come, as it's the default choice of practically all popular Linux distributions.

More benchmarks

Hi,

I made several benchmarks one year ago to compare ext2/ext3/ext3+patches (= ext4)/ReiserFS/XFS/JFS.

You can see the results at http://www.bullopensource.org/ext4/

and especially:

http://www.bullopensource.org/ext4/kernbuild
http://www.bullopensource.org/ext4/sqlbench
http://www.bullopensource.org/ext4/sysbench_oltp

REISER4 - THE BEST

REISER4 - THE BEST FILESYSTEM EVER. Much better than ext4.

You can read more here:

http://m.domaindlx.com/LinuxHelp/resources/fs-benchmarks.htm

FILESYSTEM TYPE | TIME (secs) | DISK USAGE
----------------+-------------+-----------
REISER4 lzo     |        1938 |        278
REISER4 gzip    |        2295 |        213
----------------+-------------+-----------
REISER4         |        3462 |        692
EXT2            |        4092 |        816
JFS             |        4225 |        806
EXT4            |        4408 |        816
EXT3            |        4421 |        816
XFS             |        4625 |        779
REISER3         |        6178 |        793
FAT32           |       12342 |        988
NTFS-3g         |       10414 |        772

Column one measures the time taken to complete the bonnie++ benchmarking test (run with the parameters bonnie++ -n128:128k:0). The top two results use Reiser4 with compression. Since bonnie++ writes test files which are almost all zeros, compression speeds things up dramatically. That this is not the case in real-world examples can be seen below, where compression does not speed things up; more importantly, it does not slow things down either.

Column two (disk usage) measures the amount of disk space used to store 655MB of raw data (three different copies of the Linux kernel sources).

http://m.domaindlx.com/LinuxH

http://m.domaindlx.com/LinuxHelp/resources/fs-benchmarks.htm can now be found at

http://linuxhelp.leadhoster.com

The linuxhelp site seems to be censored every year or so.

It has been available at linux.coconia.net, linuxhelp.150m.com, m.domaindlx.com/LinuxHelp/, linux.50webs.org and probably more places, none of which are now available.

Yeah, it's a KILLER

Yeah, it's a KILLER filesystem!

Pun intended? ;)

Pun intended? ;)

That's hilarious.

That's hilarious.

Bonnie++ has a -b option to cause a fsync()

If you are getting stupid results (a write speed of 63 MB/s, which is faster than the read speed of 52 MB/s and also faster than the physical disk speed), apparently due to the lack of fsync()ing, then use Bonnie++.

Bonnie++ is just Bonnie with a few (very informative) extra features.

Anyway, what is important here is that:

Bonnie++ has a -b option to cause a fsync() after every write (and a fsync() of the directory after file create or delete).
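Conceptually, what -b does is something like the following sketch (a hedged illustration of the idea, not the actual bonnie++ code; file name and sizes are made up):

    /* What bonnie++ -b does conceptually: force data out of the page cache
     * after every write so the measurement reflects the disk, not the cache.
     * Illustrative sketch only; file name and sizes are made up. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char block[8192];
        memset(block, 'b', sizeof(block));

        int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return 1;

        for (int i = 0; i < 1000; i++) {
            write(fd, block, sizeof(block));
            fsync(fd);      /* don't report success until it's really on disk */
        }
        close(fd);
        unlink("testfile");
        return 0;
    }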

If you used the stodgy ext3

If you used the stodgy ext3 default configuration, which is ridiculous as far as performance is concerned, the results don't really give a true picture of ext3's potential. Did you set dir_index and data=writeback for ext3 with tune2fs for these tests? This consistently and significantly improves ext3, and it surpasses reiser3 in most cases, and even the current reiser4 in some.

dir_index was enabled, writeback mode not

Yes, directory indexing was enabled. Good point, I forgot to mention that important configuration detail. Recent e2fsprogs enable it by default.

But I haven't used writeback mode, because in my opinion the speed difference is small (in many cases unmeasurable) and the lost robustness would be too high a price to pay for it. Maybe some other time...

Terrible Graphs

Apologies in advance for my pedantry, but...

Could you use an origin of zero for all your graphs? Your first graph appears to be showing 200% and 400% differences at first glance, when in fact it's merely showing 2% and 4% differences.

Agreed.

Agreed.

Well, you got it, didn't you?

You obviously figured it out, so why bother? If you put it on a zero-origin chart the lines will all look the same, so why bother with charts in the first place?

they are meant to look the same

If, as you point out, people are able to "figure out" the true picture eventually, then just what does the exaggeration accomplish, except a few seconds of sensationalism before the reality sinks in?

The reality is that the results in most cases show the file systems to be very close in performance - and the charts should show this too.

Charts are meant to show the TRUE picture. If the results are more or less the same, then the charts should show that they are more or less the same. Anything else is inaccurate and is a deception.

Why do you show charts? To

Why do you show charts?
To make differences and similarities of data sets visible.
When your chart shows a 40% difference where there is only 4%, it is totally useless.
Of course the reader can study the details (the axis description) and get it, but then he might as well read the textual description of the data sets.
Charts should show the intended meaning at the first glance!

I agree...

Ohh, you cannot imagine how much I agree with you!

Such badly scaled diagrams are everywhere, and they are useless and misleading.

The charts are misleading.

I totally, totally agree. When you have a badly scaled chart, it ends up misleading most of us who don't check very carefully.

+1

+1 again...

+1 again...

+1

+1

+1

+1

-4

-4 muWHAHAHA!

+5

NOOOO!

Your graphs are misleading at best

The scale on your graphs is misleading. Ext4 is not all that after all, is it? But gosh, your charts sure make it look that way. That's just wrong.

Yeah, I know, guilty as

Yeah, I know, guilty as charged. :) At the time I was making the graphs I didn't expect so many people would not notice the scale of the vertical axis. I've since learned my lesson and won't repeat the same mistake. But now that this fact is known and so heavily "documented" in the comments on the article, I believe there's no need to remake the graphs.

+1!

+1!

Not a mailserver

It's nice to see the tests done on many small files like a mail server would handle, but this article is obviously written for the common layman and his desktop machine, not for experienced system admins who might actually run a mail server. How does ext4 perform on everyday files, such as a 100k .odf or .ods, a 5MB mp3 file, or even a 700MB movie? That's what would interest the people reading this.

How many files do you have?

Don't know about you, but I have close to a million files on my desktop. And most of the time I spend waiting for the disk is not when I'm copying one big file, but many small ones.

The bonnie results above should apply well to rather big files. As for a 100k .odf, do you really think anybody cares how many milliseconds that takes? Unless you have thousands of them, of course, and then the results from the postmark test are suddenly applicable. :)

I'd be interested in

I'd be interested in large-file (over 4GB) random-read and random-write performance. That's closer to the sort of I/O performed by database systems.

Large?

I'm working on an SQL database which is inserting data, with one index, at over a million inserts per second sustained for hours. I can get sufficient bandwidth with a large RAID (550+ MB/second). My current problem is that I want to insert past the 2TB file size limit. Another problem is that dropping a 2TB file on an ext2 file system takes half an hour. Ideally I'd like to pre-extend the file to maximize insert performance. I've heard that Linux is deprecating raw devices, but are they my best bet (no fragmentation, size limited by the device rather than the file system type, no penalty for extending the file on write, etc.)?
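As an aside, on file systems that support it, pre-extending a file can be done with posix_fallocate(), which actually reserves the blocks up front (unlike ftruncate(), which only leaves a sparse hole). A hedged sketch, with a made-up file name and size:

    /* Hedged sketch: pre-extend a file so later sequential writes do not
     * pay the block-allocation cost at write time.  posix_fallocate()
     * reserves the space (ftruncate() would only leave a sparse hole).
     * File name and size below are made up for illustration. */
    #define _XOPEN_SOURCE 600
    #define _FILE_OFFSET_BITS 64
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        off_t size = (off_t)64 * 1024 * 1024 * 1024;   /* 64GB, illustrative */

        int fd = open("db.data", O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        int err = posix_fallocate(fd, 0, size);        /* reserve the blocks */
        if (err) {
            fprintf(stderr, "posix_fallocate failed: %d\n", err);
            return 1;
        }
        close(fd);
        return 0;
    }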

Random I/O

You're right.

To be completely honest, I was about to do some more testing of random I/O, but eventually I decided that it's not that interesting. If it's of any use to you, running bonnie's stock random seeker on a 1.8GB file (circa 600MB of memory usable for caching at the time) returned these results:

  • ext3 - 165.3
  • ext4 - 163.0
  • reiser4 - 174.3

Very, very close. So I decided not to investigate further, under the impression that, well, random reads are random reads; there's not much a file system can do in that case.

Random writes are, of course, a completely different beast, but I'll leave that for some other time. It's a more complex matter that needs careful explanation. ;)

Some errors in your tests/calculations?

Your sequential writing speed (63 MB/s) is faster than your sequential reading speed (53 MB/s) and faster than your platter speed (57 MB/s).

I think there are some errors in your tests/calculations.


No error

That's just how bonnie works, not fsync()ing before reporting final results. So, at the moment the application has finished throwing data at the kernel and reports results, there are still some dirty buffers in memory not yet written to disk. That's why the speed is slightly higher than the disk itself would allow.

Because all file systems were tested with the same setup, I think the results are credible.

Sorry, but you clearly do not know how bonnie works

Sorry, but you clearly do not know how bonnie works (are you confusing it with bonnie++?).

Bonnie doesn't fsync!

Sure I do. Do your homework. :)

First of all, I used the older version of bonnie (not ++). I don't like the newer version because it runs many tests that I don't know much about, so I don't trust it.

Second, use the source, Luke! Doing grep sync bonnie.c returns nothing. Knowing how the kernel manages dirty buffers, you can be SURE that not all the blocks are on disk when you finish writing to a file. Those blocks will eventually end up on disk, but only later (up to half a minute later!), when you have already collected the statistics.

If you don't trust me, run bonnie in parallel with vmstat 1 (in another window) and get ready for a surprise. At the moment your bonnie switches to the reading tests, watch carefully, and at some point you'll see some heavy writeouts to the disk. Now, explain those!
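Another quick way to see the effect is to time the write() phase and the final fsync() separately; the fsync() time covers exactly the dirty data that a non-syncing benchmark never waits for. A minimal sketch (file name and sizes are arbitrary):

    /* Demo of the effect described above: write a few hundred MB and time
     * the write() phase separately from the final fsync().  The fsync()
     * time is the dirty data a non-syncing benchmark never accounts for.
     * File name and sizes are arbitrary. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(void)
    {
        static char block[1 << 20];          /* 1MB per write */
        memset(block, 'x', sizeof(block));

        int fd = open("big.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return 1;

        double t0 = now();
        for (int i = 0; i < 256; i++)        /* 256MB total */
            write(fd, block, sizeof(block));
        double t1 = now();

        fsync(fd);                           /* wait for the page cache to drain */
        double t2 = now();

        printf("write() phase: %.2fs, final fsync(): %.2fs\n", t1 - t0, t2 - t1);
        close(fd);
        unlink("big.tmp");
        return 0;
    }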

Actually...

Actually, the reason the write speed is higher than the platter speed is explained in the article: "Probably extents, together with delayed block allocation, allow ext4 to come first in this benchmark". Therefore ext4 reports the write as finished BEFORE everything is actually written to disk; this gives the user the impression that the file system is faster, since you get control of your system back sooner when doing writes, and ext4 finishes writing to disk in the background.

Umm... yeah... that's what

Umm... yeah... that's what he's been saying the whole time... Kernel caches and writes later, benchmark thinks it's done...

Writing to disk

Yes, it's this caching effect that allows very fast writeouts at times (depending on the amount of data we're writing, of course). But this is true for all file systems out there, not only ext4!

From what I know, SGI's XFS goes furthest with that philosophy, which is why it's the fastest file system in some tests. But that feature can become your worst nightmare if your computer suddenly loses power; that's when the fastest file systems lose the most data. :(

Seeing how many people don't actually understand how writing to disk works in modern Linux (and other Unixes), I think it's time for a nice article that explains it all in one place: the kernel page cache, memory management, journalling, the miscellaneous parameters that tune it all, and finally the impact of the on-disk cache on performance and robustness. When I have more time...

...the fastest file systems lose the most data. :(

Actually, if two file systems are essentially the same speed, but one (the 'faster') only appears faster because it returns earlier with cached data still waiting to be written, then in the event of a power failure the same amount of data will be lost either way; you might just think that the faster system had lost less...?

Sustained writes 2 or more times the amount of memfree....

Indeed. For example, the default dirty_ratio on most 2.6 kernels is way too large, in my opinion. If you are going to do sustained writes of several times the amount of memory you have, there are at least two problems: 1) the precious dentry and inode caches will be dropped, leaving you with a *very* unresponsive system; 2) the amount of dirty pages which need to be flushed to disk is very large, if not taking up all of the VM, and hogs the I/O channel. What we really need is a 'cfq' for all processes, especially misbehaving ones like dd if=/dev/zero of=/location/large bs=1M count=10000. If you want to DoS any current 2.6 kernel, just start a never-ending dd write and you'll see what I mean: huge latencies, due to the fact that all the name-to-inode caches are lost and have to be fetched from disk again, only to be quickly flushed again and again.

I have already discussed this disaster scenario with Linus, Andrew and Jens; I'm hoping for an auto-tuning solution which takes the disk speed per partition into account. Anyway, for now, try to preserve the important caches with sysctl vm.vfs_cache_pressure=1 and, for a safer, more in-sync server, vm.dirty_ratio=2 combined with vm.dirty_background_ratio=1. Some benchmarks may get worse, but you'll have a more resilient server.