Quick and dirty reorganization of a big file cache (bash)

find . -maxdepth 1 -type f | while IFS= read -r doc; do dir=${doc: -3}; mkdir -p "$dir"; mv "$doc" "$dir/${doc#./document_}"; done

Working with thousands of files in one directory can get you into trouble. Linux doesn’t seem to do anything special to make listing a directory (or accessing a file within it by name) super-fast. I think the actual structure of directories probably hasn’t changed much in the last 10 years. (Comment if you know something about this.)

The relatively random file names allowed me to break them up into roughly 1000 subdirectories organized by the last 3 characters in the names (the first 3 characters are less random in this case, as there are lots of 1s, 2s and 3s). The loop goes through the files in the current directory: if there is not already a directory named after the last three characters of the current filename, it creates one. Then it moves the file into that subdirectory (whether it already existed or was just created), taking the worthless ‘document_’ string off the beginning of the filename for good measure.
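For anyone who wants something a bit less dirty, here is the same idea spelled out as a function. This is only a sketch: the name shard_by_suffix is made up for illustration, and the null-delimited find/read pairing and the skip for names shorter than three characters are my additions, not part of the one-liner above.

```bash
#!/usr/bin/env bash
# shard_by_suffix (hypothetical name): move each regular file directly inside
# a directory into a subdirectory named after the last three characters of
# its filename, dropping the "document_" prefix along the way.
shard_by_suffix() {
    local src=${1:-.} doc name dir
    # -print0 / read -d '' keeps filenames with spaces or newlines intact.
    find "$src" -maxdepth 1 -type f -print0 |
    while IFS= read -r -d '' doc; do
        name=${doc##*/}                     # strip the leading path
        [ "${#name}" -ge 3 ] || continue    # too short to shard; leave it alone
        dir=$src/${name: -3}                # last three characters (note the space)
        mkdir -p "$dir"                     # no error if it already exists
        mv -- "$doc" "$dir/${name#document_}"
    done
}
```

Called as shard_by_suffix /path/to/cache, or with no argument to work on the current directory. Running it a second time does nothing, since the files are no longer at the top level.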

With 1000 subdirectories, I can handle a million files (about 1000 per directory) with pretty decent performance. The number will not grow beyond that any time soon, but if it did, creating a tree structure (perhaps sub-directories on the next level based on the last 2 characters, and so on) could help keep the number of files in any given directory reasonable. FYI this is not mission-critical data; it’s just a local cache of seldom-changing output to reduce some costly queries.
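I haven't built the tree version, but computing the nested path could look roughly like this. The function name cache_path is hypothetical, and I'm assuming the second level uses the two characters just before the last three (reusing the very last two would not split anything further).

```bash
# cache_path (hypothetical name): print a two-level path for a filename.
# Assumption: level one is the last three characters, level two is the
# two characters immediately before those.
cache_path() {
    local name=$1
    if (( ${#name} < 5 )); then
        printf '%s\n' "$name"       # too short to split; leave it flat
        return
    fi
    local level1=${name: -3}        # last three characters
    local level2=${name: -5:2}      # the two characters before them
    printf '%s/%s/%s\n' "$level1" "$level2" "$name"
}

cache_path 1234abcde    # prints cde/ab/1234abcde
```

With two levels the per-directory count drops by another factor of however many distinct two-character values actually show up in the names.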

Side note: ls can break when dealing with long lists of files (a glob like ls * can fail with “Argument list too long”, for instance). find seems to work better.
Side side note: when using a negative offset in a bash substring expansion, a space is required between the colon and the minus sign (-), because ${var:-word} is the use-a-default-value expansion. Wrapping the offset in parentheses, as in ${var:(-3)}, works too.
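A quick way to see the difference for yourself (the filename here is made up for illustration):

```bash
doc=document_8675309     # example filename, not from the real cache
unset missing

last3=${doc: -3}         # substring: the last three characters
whole=${doc:-3}          # default-value form: doc is set, so you get all of it
fallback=${missing:-3}   # default-value form: missing is unset, so you get 3
paren=${doc:(-3)}        # parentheses also force the substring meaning

printf '%s\n' "$last3" "$whole" "$fallback" "$paren"
# 309
# document_8675309
# 3
# 309
```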
