Thursday 29th August 2013 1.15am

Thu, Aug 29, 2013

So last few days I've been doing some prototyping and experimentation with how to enumerate directories far, far quicker than one usually can as part of implementing the #1 requested feature for proposed Boost.AFIO before GSoC ends, Directory Monitoring Support (http://boostafio.uservoice.com/forums/218980-boost-afio-feature-request/filters/top), and I have discovered many interesting things worth sharing.

We all know the common advice to never put too many files into a single directory because "performance suffers", yet that advice doesn't really gel with the B-trees used by any modern filing system. Let me quickly explain: B-trees have complexity O(log2 entries), so a 1000 entry directory will need 10 tree node visits, whereas a 100,000 entry directory will need 20 tree node visits i.e. only twice as slow. That is what things are supposed to be: the number of entries in a directory shouldn't really matter too much going into the tens of millions.

Yet on Linux calling 'time ls' on a 1m entry directory on Linux ext4 is about 5 seconds as compared to 0.25 seconds for a 100,000 entry directory. So why the 20x blowup when it should be only 1.2x? The answer is that the Linux C library calls the getdents() syscall with a small, fixed sized buffer, so with 1m+ entry directories you do a ton load of syscalls. On Windows, the problem is a bit different: fetching information about a file causes the Win32 layer to open that file for access, yet the Win32 CreateFile() function is unbelievably slow and I see only about a 20-30k HANDLE allocations per second max. Hence, listing a 1m entry directory on Windows will take you about a minute :(

The solution, in both cases, is to talk to the kernel directly. With Linux we use huge buffers for the getdents() syscall to avoid kernel transitions; with NT we use the NtQueryDirectoryFile() syscall which will return as many entries at once as you ask for along with most of a full lstat() structure which therefore lets you almost always avoid reading information about entries via a file handle. This lets you achieve some truly spectacular numbers: on my Intel Core i7-3770K CPU @ 3.50Ghz workstation, I see these:

* Windows 8 x64, NTFS: approx 1.8m entries enumerated per second max along with most of a lstat structure (all but st_nlinks, st_blocks, st_blksize).

* Linux 3.2 x64, ext4: approx 2.5m entries enumerated per second max with st_ino and st_type fields only. By the way lstat() can do 1.1m entries per sec max to fill in the remaining fields.

An interesting feature of the NT kernel's NtQueryDirectoryFile() syscall is that it lets you supply a glob pattern to have the kernel do glob filtering kernel side. That enables a very useful optimisation: a vastly improved lstat() performance on single files on Windows. Whereas before 20-30k single file lstat()'s per second was maximum due to the file handle creation rate limit, I can now do 250k single file lstat()'s per second so long as they are within the same parent directory. It's not quite Linux's 1.1m lstat()'s per second, but it's a world better than before.

Of course, just because future Boost.AFIO using code will be able to use 1m entry directories with ease doesn't mean we should! Opening a 1m entry directory in Windows Explorer or in KDE's Plasma File System Explorer makes you think it has hanged itself, and that doesn't benefit the average user experience. For our purposes though - near zero overhead directory change monitoring - very fast directory enumeration and lstat() is exactly what the doctor ordered. And Paul is working on coding up what we hope will be the best directory change monitoring implementation on any platform ever written! :)

All coming to Boost and C++ soon, assuming we pass peer review!