Home - Topics - Papers - Talks - Theses - Blog - CV - Photos - Funny |
Many application programs create and manage cache directories
of some kind:
directories in which they store temporary information
whose storage can benefit the application's performance,
but which can easily be regenerated from other "primary" data sources
if it is lost.
For example,
most web browsers have a cache directory
in which they store downloaded Internet content
(Mozilla
typically uses a directory called Cache
somewhere under $(HOME)/.mozilla
).
When backing up, synchronizing, or transferring a user's home directory
from one machine to another,
we would usually like to avoid saving or transferring cache directories at all,
because:
There are a great many applications that use cache directories,
however.
While some evolving standards attempt to standardize the locations of such directories,
most applications create their own cache directories in unpredictable locations
and manage them independently of other applications.
It therefore becomes very tedious,
even for experienced users who can easily recognize cache directories,
to figure out and list all of the appropriate --exclude
options
for backups or other directory traversal operations.
Some applications,
such as recent versions of KDE,
keep their cache directories in /tmp
or /var/tmp
,
which at least partly solves this problem
by getting the cached content out of the user's home directory entirely.
But there are good reasons an application or user
might not want these caches in /tmp
or /var/tmp
:
for example,
those directories may be on a different partition with limited space,
or files in them may be cleared out by the system too frequently
(e.g., after a day, or on every reboot)
to allow the cache to serve its intended purpose effectively.
The Filesystem Hierarchy Standard
specifies a system-wide /var/cache
directory
intended to hold such cached information,
but this directory is not generally world-writable,
and thus can only be used by applications
pre-packaged with the system or installed by the system administrator.
The XDG Base Directory Specification
instead recommends that applications create their caches
in a location specified by a user environment variable,
or in a standard subdirectory of the user's home directory if that environment variable doesn't exist.
The XDG convention allows applications to operate without special permissions,
but it does not help systemwide backup or data management utilities
locate individual users' cache directories reliably,
because each user might set the environment variable differently in his or her login profile.
Given the present non-existence of any perfect agreement on where applications should store their cached information, I propose a very simple convention that will at least allow such information to be identified effectively. Regardless of where the application decides to (or is configured to) place its cache directory, it should place within this directory a file named:
CACHEDIR.TAG
This file must be an ordinary file, not for example a symbolic link. Additionally, the first 43 octets of this file must consist of the following ASCII header string:
Signature: 8a477f597d28d172789f06886806bc55
Case is important in the header signature,
there can be no whitespace or other characters in the file
before the 'S',
and there is exactly one space character
(ASCII code 32 decimal)
after the colon.
The header string does not have to be followed by an LF or CR/LF in the file
in order for the file to be recognized as a valid cache directory tag.
The hex value in the signature
happens to be the MD5 hash of the string ".IsCacheDirectory
".
This signature header is required
to avoid the chance of an unrelated file named CACHEDIR.TAG
being mistakenly interpreted as a cache directory tag
by data management utilities,
and (for example) causing valuable data not to be backed up.
The content of the remainder of the tag file is currently unspecified,
except that it should be a plain text file in UTF-8 encoding,
and any line beginning with a hash character ('#
')
should be treated as a comment
and ignored by any software that reads the file.
We will henceforth refer to a file named as specified above, and having the required signature at the beginning of its content, as a cache directory tag.
For the benefit of anyone who happens to find and look at a cache directory tag directly, it is recommended that applications include in the file a comment referring back to this specification. For example:
Signature: 8a477f597d28d172789f06886806bc55 # This file is a cache directory tag created by (application name). # For information about cache directory tags, see: # http://www.brynosaurus.com/cachedir/
The official "home" URL of this specification may change from the above at some point if/when this proposal becomes more of a formal standard, but in that case an appropriate forwarding link will be left in the old location.
The presence of a cache directory tag within a given directory
indicates that the entire contents of that directory,
including any and all subdirectories underneath it,
consists of cached information that can be re-generated if necessary
from appropriate source material located elsewhere.
An application that creates and/or uses a cache directory
should write a single cache directory tag
into the topmost directory whose contents represent cached information.
For example, if Mozilla creates a directory named
$(HOME)/.mozilla/default/dg8t1ih0.slt/Cache
to hold cached content downloaded from the Internet,
it would then create within it a cache directory tag named:
$(HOME)/.mozilla/default/dg8t1ih0.slt/Cache/CACHEDIR.TAG
.
Only one cache directory tag is required
to tag an entire subdirectory tree of cached content.
The application should also regenerate the cache directory tag
if it disappears:
e.g., if the entire contents of the cache directory is deleted
without the directory itself being deleted.
Data backup, transfer, or synchronization software, if configured to do so, may interpret the presence of a cache directory tag in a directory as a command to exclude that directory automatically (including any subdirectories) from the backup or transfer operation. I leave it to the designers of such applications to decide whether this "auto-exclude" behavior should be the default or only adopted when specifically requested by the user. A bit of conservatism is probably warranted in the short term, but automatically excluding cache directories by default could improve efficiency and ease-of-use in the longer term.
For maximum robustness against accidental data loss,
any tool that has built-in support for cache directory tags
is strongly encouraged to verify
that any file it finds named CACHEDIR.TAG
is a regular file
and actually contains the 43-byte header specified above,
before concluding that the file is indeed a cache directory tag.
User-friendly file managers and other directory hierarchy browsing software may also want to notice the presence of a cache directory tag in a directory, and inform the user that the contents of the directory is automatically managed and is not likely to make much sense if "browsed" directly.
System administration software such as du
might provide a way to locate, total, and report the space usage
of all the cache directories on a partition,
helping the administrator to determine how much "soft" space
is in use on a file system
that could be easily freed up if necessary.
Similarly, a utility might allow the administrator
to clean out all the cache directories on the system,
or to delete files in cache directories beyond a certain "age."
System software should not, however, just periodically go through and unilaterally delete "old" files in arbitrary cache directories on a purely automatic basis. Different applications need to be able to manage their caches differently; cached data that is considered "old" to one application or to the system might not be considered "old" to another (e.g., if certain data takes a long time to regenerate). In the future this standard might be extended to allow applications to write additional meta-data in their cache directory tags, providing system utilities with the additional information they would need to manage application caches intelligently on a global basis. For now, however, the primary purpose for introducing cache directory tags is merely to make it easy to avoid backing up or copying cached data unnecessarily from one machine to another.
Why not just standardize the location where applications keep their cache directories?
First, there are already multiple conflicting standards:
e.g., the Filesystem Hierarchy Standard
specifies /var/cache
,
KDE currently uses /var/tmp
,
and the XDG Base Directory Specification
specifies the use of a user-customizable environment variable,
falling back on $(HOME)/.cache
in the user's home directory.
Even if one such standard eventually "wins",
there is inevitable tension between standardizing the placement of application caches
and allowing users to customize it.
The XDG specification clearly recognizes this tension and accounts for it,
but its very flexibility makes it difficult or impossible for systemwide utilities
to locate each user's cache directory reliably,
even if all applications were to conform to the XDG standard.
Why not just use the existing "no-backup" or "no-dump" extended file attributes that some operating systems support? Because only some operating systems (and generally only certain specific file systems in them) support those attributes. Any proposed convention for tagging cache directories that depends on non-portable OS features would inherently have limited applicability and would not likely be widely adopted among portable applications. Applications are free to support system-specific extensions when available, but for maximum usefulness we really need one extremely simple, portable convention that caching applications can adopt with almost no effort, and that is the purpose of this proposal.
Why shouldn't all web browsers just use the same cache? Sharing a single Internet content cache among multiple web browsers or other web applications, all of which might be running concurrently, is a very useful and important goal - but it presents considerable technical challenges that are still being worked out. Even when such sharing is achieved for web content caches, however, many other applications maintain other types of on-disk caches in which they store entirely different kinds of data, such as intermediate results for long-running computations or swap space for large data sets. It is unreasonable to expect applications to share caches when those caches serve entirely different purposes and inherently must be managed differently. But it would still be useful, and it is much more realistic, to expect applications at least to make their caches easily identifiable as such by adopting a simple convention such as the one proposed here.
Why does the proposal only support cache directories - what if an application only has a single file that it uses as a cache? Because the tagging scheme is much simpler and easier to take advantage of this way, and imposes no great burden on caching applications. If you're not paranoid about checking cache directory tag file headers, then most existing, unmodified backup/archive applications can immediately take advantage of this tagging convention via some pretty trivial shell-fu. For example:
find . -name CACHEDIR.TAG | sed -e 's/[/]CACHEDIR.TAG$//g' >/tmp/excludes tar cvf /tmp/backup.tar --exclude-from=/tmp/excludes .Trying to support the tagging of individual files within a directory would probably mean at least having to parse text files and scan them for file names, making the convention much more complex and potentially delicate at the outset. And when it comes to standards that affect what data gets backed up and what doesn't, we obviously would like them to be as simple and robust as possible.
For those existing caching applications that keep all their cached data in a single file, it should not be inordinately difficult for them just to move that cache file down one level into its own separate subdirectory, and write a cache directory tag into that new subdirectory alongside the cache file. Such a slight directory structure change should not cause any backward compatibility problems, since by definition any cache file in the old location is regenerable and can just be deleted or moved. And users won't care because they aren't normally expected to see directly or understand the application's cache files or directories anyway.
Version 0.6, November 4, 2019: Web page consolidation and restructuring; no substantive changes.
Version 0.5, October 27, 2004: Added Security Considerations section in response to a potential vulnerability pointed out by Gregory P. Smith.
Version 0.4, August 10, 2004:
Changed cache tag file name
from ".IsCacheDirectory
" to "CACHEDIR.TAG
",
to be compatible with old 14-character POSIX file systems
and 8.3-character MS-DOS file systems.
Version 0.3, July 29, 2004: Recommend that apps include a comment in their tags referring back to this spec. Recommend that backup/archival apps verify tag headers. Discuss tag filename length issue.
Version 0.2, July 23, 2004: Fleshed out Alternative Approaches section based on various feedback. Removed reference to thumbnail managing standard since the thumbnails directory doesn't quite have pure cache semantics.
Version 0.1, July 19, 2004: First public release.
Caching Applications | Archive/Backup Tools |
---|---|
The following applications
are known to create and maintain caches of some kind
in the user's home directory.
Applications that currently support this specification:
Positive responses received from developers of: Discussion initiated with developers of: |
The following archival tools support this tagging convention:
Other archival tools that are not (yet) directly sensitive to cache tags, but can be made so through the use of appropriate options or shell magic: |
Please let me know of other applications you know of that should be on this list!
Topics: Storage Operating Systems | Bryan Ford |