How to best do zillions of little files?
    Kent Borg 
    kentborg at borg.org
       
    Wed Oct  2 12:19:12 EDT 2002
    
    
  
On Wed, Oct 02, 2002 at 10:49:07AM -0400, Derek Atkins wrote:
> I had a scenario where I was trying to create 300,000 files in
> one directory.  The files were named V[number] where [number]
> was monotonically increasing from 0-299999.  I killed the process
> after waiting a couple hours.

I once did something similar on a Macintosh.  First, given a directory
with thousands of files, as long as one didn't try to look at it in
Finder, the Macintosh's HFS was very fast.  Second, when I created too
many files (was it 32K?), it trashed the disk and it had to be
reformatted.  (Also, I think HFS had a limit on the total number of
files per partition.)

Moral One: Only some file systems can efficiently handle lots of files
in a single directory.

Moral Two: Be cautious when doing things that might be perverse. (I
can hear it now: "No one will ever create that many files in a single
directory, there is no need to slow down the normal case to check for
that unlikely case.  Hell, Finder would have no hope of opening a
folder with so many files in it.")

> Breaking it up into directories of about 2000 files each REALLY
> helped.

How are all these files related to each other?  What natural
organization do they have?  Is there a natural mapping from file to
directory path that can guarantee no directory will balloon to many
thousands of files?  If so, maybe use that organization.  Will the
data change in a way that would change a file's location?  Will you be
able to efficiently ripple that change through all dependent places
that reference it?
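For the V[number] files in the quoted message, one natural mapping is
simply to bucket by number.  Here is a rough Python sketch of that
idea; the function name, the "data" root, and the zero-padded bucket
names are my own assumptions, and only the bucket size of 2000 comes
from the quoted message.

```python
from pathlib import Path

FILES_PER_DIR = 2000  # bucket size taken from the quoted message


def bucket_path(root: Path, number: int) -> Path:
    """Map file number N to root/<bucket>/V<N> (hypothetical layout).

    Every bucket directory holds at most FILES_PER_DIR files, so no
    single directory can balloon as the numbers keep increasing.
    """
    if number < 0:
        raise ValueError("file number must be non-negative")
    bucket = number // FILES_PER_DIR
    return root / f"{bucket:04d}" / f"V{number}"


if __name__ == "__main__":
    # The last of the 300,000 files lands in bucket 149.
    print(bucket_path(Path("data"), 299999))
```

With this layout the 300,000 files end up in 150 directories, and a
file's location never changes because it depends only on its number.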
You might want to use a database to organize your files...
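As a very rough illustration of the database idea, here is a sketch
that keeps a name-to-path index in SQLite via Python's standard
sqlite3 module.  The table layout and the function names are
assumptions made up for this example, not anything from the original
setup.

```python
import sqlite3


def open_index(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a small index mapping logical names to paths."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " name TEXT PRIMARY KEY,"
        " path TEXT NOT NULL)"
    )
    return conn


def record(conn: sqlite3.Connection, name: str, path: str) -> None:
    """Insert or update the stored path for a logical file name."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO files (name, path) VALUES (?, ?)",
            (name, path),
        )


def lookup(conn: sqlite3.Connection, name: str):
    """Return the stored path for name, or None if it is not indexed."""
    row = conn.execute(
        "SELECT path FROM files WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None
```

If a file has to move, only its one row changes, and everything that
refers to the file by name keeps working.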
You might have to come up with a hash that distributes files fairly
randomly, in which case I suggest you do enough levels to handle a
much greater number of files.  
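One way such a hash might look, as a sketch only: take a digest of
the file name and use its leading hex digits as nested directory
names.  SHA-1, the two-level depth, and the function name are
assumptions chosen for illustration; any hash that spreads names
evenly would do.

```python
import hashlib
from pathlib import Path


def hashed_path(root: Path, name: str, levels: int = 2, width: int = 2) -> Path:
    """Place name under `levels` nested directories taken from its digest.

    Each level uses `width` hex characters of the SHA-1 digest, so the
    defaults give 256 * 256 = 65,536 possible leaf directories.
    """
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    if levels * width > len(digest):
        raise ValueError("not enough digest characters for the requested depth")
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return root.joinpath(*parts, name)


if __name__ == "__main__":
    print(hashed_path(Path("data"), "V12345"))
```

With the defaults, 300,000 files average fewer than five per leaf
directory, and at about 2000 files per directory there is room for
roughly 131 million, which is the kind of headroom I mean.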
Another consideration is how these files will be used.  Will they just
be put there, or will people also be accessing them?  Are they fairly
static, or are there lots of changes to be tracked?

Are these by chance MIT's webification of course materials?  If so,
you might be spreading complete copies across multiple machines to
handle bandwidth needs.  Sounds like a cool problem...

-kb
    
    