Home - Topics - Papers - Talks - Theses - Blog - CV - Photos - Funny

July 19, 2004

Cache Directory Tagging Specification

Version 0.6 (Changes)

Proposed by Bryan Ford

Abstract

Many applications create and manage directories containing cached information about content stored elsewhere, such as cached Web content or thumbnail-size versions of images or movies. For speed and storage efficiency we would often like to avoid backing up, archiving, or otherwise unnecessarily copying such directories around, but it is a pain to identify and individually exclude each such directory during data transfer operations. I propose an extremely simple convention by which applications can reliably "tag" any cache directories they create, for easy identification by backup systems and other data management utilities. Data management utilities can then heed or ignore these tags as the user sees fit.

The Problem

Many application programs create and manage cache directories of some kind: directories in which they store temporary information whose storage can benefit the application's performance, but which can easily be regenerated from other "primary" data sources if it is lost. For example, most web browsers have a cache directory in which they store downloaded Internet content (Mozilla typically uses a directory called Cache somewhere under $(HOME)/.mozilla). When backing up, synchronizing, or transferring a user's home directory from one machine to another, we would usually like to avoid saving or transferring cache directories at all, because:

There are a great many applications that use cache directories, however. While some evolving standards attempt to standardize the locations of such directories, most applications create their own cache directories in unpredictable locations and manage them independently of other applications. It therefore becomes very tedious, even for experienced users who can easily recognize cache directories, to figure out and list all of the appropriate --exclude options for backups or other directory traversal operations. Some applications, such as recent versions of KDE, keep their cache directories in /tmp or /var/tmp, which at least partly solves this problem by getting the cached content out of the user's home directory entirely. But there are good reasons an application or user might not want these caches in /tmp or /var/tmp: for example, those directories may be on a different partition with limited space, or files in them may be cleared out by the system too frequently (e.g., after a day, or on every reboot) to allow the cache to serve its intended purpose effectively.

The Filesystem Hierarchy Standard specifies a system-wide /var/cache directory intended to hold such cached information, but this directory is not generally world-writable, and thus can only be used by applications pre-packaged with the system or installed by the system administrator. The XDG Base Directory Specification instead recommends that applications create their caches in a location specified by a user environment variable, or in a standard subdirectory of the user's home directory if that environment variable doesn't exist. The XDG convention allows applications to operate without special permissions, but it does not help systemwide backup or data management utilities locate individual users' cache directories reliably, because each user might set the environment variable differently in his or her login profile.

Proposed Solution

Given the present non-existence of any perfect agreement on where applications should store their cached information, I propose a very simple convention that will at least allow such information to be identified effectively. Regardless of where the application decides to (or is configured to) place its cache directory, it should place within this directory a file named:

CACHEDIR.TAG

This file must be an ordinary file, not for example a symbolic link. Additionally, the first 43 octets of this file must consist of the following ASCII header string:

Signature: 8a477f597d28d172789f06886806bc55

Case is important in the header signature, there can be no whitespace or other characters in the file before the 'S', and there is exactly one space character (ASCII code 32 decimal) after the colon. The header string does not have to be followed by an LF or CR/LF in the file in order for the file to be recognized as a valid cache directory tag. The hex value in the signature happens to be the MD5 hash of the string ".IsCacheDirectory". This signature header is required to avoid the chance of an unrelated file named CACHEDIR.TAG being mistakenly interpreted as a cache directory tag by data management utilities, and (for example) causing valuable data not to be backed up.

The content of the remainder of the tag file is currently unspecified, except that it should be a plain text file in UTF-8 encoding, and any line beginning with a hash character ('#') should be treated as a comment and ignored by any software that reads the file.

We will henceforth refer to a file named as specified above, and having the required signature at the beginning of its content, as a cache directory tag.

For the benefit of anyone who happens to find and look at a cache directory tag directly, it is recommended that applications include in the file a comment referring back to this specification. For example:

Signature: 8a477f597d28d172789f06886806bc55
# This file is a cache directory tag created by (application name).
# For information about cache directory tags, see:
#	http://www.brynosaurus.com/cachedir/

The official "home" URL of this specification may change from the above at some point if/when this proposal becomes more of a formal standard, but in that case an appropriate forwarding link will be left in the old location.

Application Semantics

The presence of a cache directory tag within a given directory indicates that the entire contents of that directory, including any and all subdirectories underneath it, consists of cached information that can be re-generated if necessary from appropriate source material located elsewhere. An application that creates and/or uses a cache directory should write a single cache directory tag into the topmost directory whose contents represent cached information. For example, if Mozilla creates a directory named $(HOME)/.mozilla/default/dg8t1ih0.slt/Cache to hold cached content downloaded from the Internet, it would then create within it a cache directory tag named: $(HOME)/.mozilla/default/dg8t1ih0.slt/Cache/CACHEDIR.TAG. Only one cache directory tag is required to tag an entire subdirectory tree of cached content. The application should also regenerate the cache directory tag if it disappears: e.g., if the entire contents of the cache directory is deleted without the directory itself being deleted.

Data backup, transfer, or synchronization software, if configured to do so, may interpret the presence of a cache directory tag in a directory as a command to exclude that directory automatically (including any subdirectories) from the backup or transfer operation. I leave it to the designers of such applications to decide whether this "auto-exclude" behavior should be the default or only adopted when specifically requested by the user. A bit of conservatism is probably warranted in the short term, but automatically excluding cache directories by default could improve efficiency and ease-of-use in the longer term.

For maximum robustness against accidental data loss, any tool that has built-in support for cache directory tags is strongly encouraged to verify that any file it finds named CACHEDIR.TAG is a regular file and actually contains the 43-byte header specified above, before concluding that the file is indeed a cache directory tag.

User-friendly file managers and other directory hierarchy browsing software may also want to notice the presence of a cache directory tag in a directory, and inform the user that the contents of the directory is automatically managed and is not likely to make much sense if "browsed" directly.

System administration software such as du might provide a way to locate, total, and report the space usage of all the cache directories on a partition, helping the administrator to determine how much "soft" space is in use on a file system that could be easily freed up if necessary. Similarly, a utility might allow the administrator to clean out all the cache directories on the system, or to delete files in cache directories beyond a certain "age."

System software should not, however, just periodically go through and unilaterally delete "old" files in arbitrary cache directories on a purely automatic basis. Different applications need to be able to manage their caches differently; cached data that is considered "old" to one application or to the system might not be considered "old" to another (e.g., if certain data takes a long time to regenerate). In the future this standard might be extended to allow applications to write additional meta-data in their cache directory tags, providing system utilities with the additional information they would need to manage application caches intelligently on a global basis. For now, however, the primary purpose for introducing cache directory tags is merely to make it easy to avoid backing up or copying cached data unnecessarily from one machine to another.

Security Considerations

"Blind" use of cache directory tags in automatic system backups could potentially increase the damage that intruders or malware could cause to a system. A user or system administrator might be substantially less likely to notice the malicious insertion of a CACHDIR.TAG into an important directory than the outright deletion of that directory, for example, causing the contents of that directory to be omitted from regular backups. To mitigate this risk, backup software should at least inform the user which directories are being omitted due to the presence of cache directory tags. Automatic incremental backup software might maintain a list of "approved" cache directories, and whenever new cache directory tags appear, only heed them after being approved by the system administrator. In short, to maintain robustness of backups in the face of security compromises, cache directory tags should only be treated as hints, never as "the final word" on what should and should not be backed up.

Alternative Approaches

This section discusses a few of the alternatives that were considered in developing this proposal, and the rationale for designing the proposal as specified. It is formatted as a mini-FAQ addressing the most common questions.

Conclusion

By adopting the convention described above of writing cache directory tags where appropriate, applications that create and manage cache directories will make it much easier for users and system administrators to identify and manage "soft" storage areas whose contents are useful for performance but do not need to be backed up or synchronized across machines. This proposal deliberately attempts to adopt the simplest reasonable and portable solution, in the interest of maximizing robustness while minimizing cost of adoption, and hence hopefully maximizing its chance of eventual widespread adoption. The proposal nevertheless leaves room for future enhancements, in which applications might include within their cache directory tags additional useful information about the semantics and management of their cache directories.

Changes

Application Support Status

Caching Applications Archive/Backup Tools
The following applications are known to create and maintain caches of some kind in the user's home directory.

Applications that currently support this specification:

Positive responses received from developers of:

Discussion initiated with developers of:

The following archival tools support this tagging convention:

Other archival tools that are not (yet) directly sensitive to cache tags, but can be made so through the use of appropriate options or shell magic:

Please let me know of other applications you know of that should be on this list!

Related Links



Topics: Storage Operating Systems Bryan Ford