157 lines
7.0 KiB
HTML
157 lines
7.0 KiB
HTML
<HTML>
|
|
<HEAD><TITLE>APR Canonical Filenames</TITLE></HEAD>
|
|
<BODY>
|
|
<h1>APR Canonical Filename</h1>
|
|
|
|
<h2>Requirements</h2>
|
|
|
|
<p>APR porters need to address the underlying discrepancies between
|
|
file systems. To achieve a reasonable degree of security, the
|
|
program depending upon APR needs to know that two paths may be
|
|
compared, and that a mismatch is guarenteed to reflect that the
|
|
two paths do not return the same resource</p>.
|
|
|
|
<p>The first discrepancy is in volume roots. Unix and pure deriviates
|
|
have only one root path, "/". Win32 and OS2 share root paths of
|
|
the form "D:/", D: is the volume designation. However, this can
|
|
be specified as "//./D:/" as well, indicating D: volume of the
|
|
'this' machine. Win32 and OS2 also may employ a UNC root path,
|
|
of the form "//server/share/" where share is a share-point of the
|
|
specified network server. Finally, NetWare root paths are of the
|
|
form "server/volume:/", or the simpler "volume:/" syntax for 'this'
|
|
machine. All these non-Unix file systems accept volume:path,
|
|
without a slash following the colon, as a path relative to the
|
|
current working directory, which APR will treat as ambigious, that
|
|
is, neither an absolute nor a relative path per se.</p>
|
|
|
|
<p>The second discrepancy is in the meaning of the 'this' directory.
|
|
In general, 'this' must be eliminated from the path where it occurs.
|
|
The syntax "path/./" and "path/" are both aliases to path. However,
|
|
this isn't file system independent, since the double slash "//" has
|
|
a special meaning on OS2 and Win32 at the start of the path name,
|
|
and is invalid on those platforms before the "//server/share/" UNC
|
|
root path is completed. Finally, as noted above, "//./volume/" is
|
|
legal root syntax on WinNT, and perhaps others.</p>
|
|
|
|
<p>The third discrepancy is in the context of the 'parent' directory.
|
|
When "parent/path/.." occurs, the path must be unwound to "parent".
|
|
It's also critical to simply truncate leading "/../" paths to "/",
|
|
since the parent of the root is root. This gets tricky on the
|
|
Win32 and OS2 platforms, since the ".." element is invalid before
|
|
the "//server/share/" is complete, and the "//server/share/../"
|
|
seqence is the complete UNC root "//server/share/". In relative
|
|
paths, leading ".." elements are significant, until they are merged
|
|
with an absolute path. The relative form must only retain the ".."
|
|
segments as leading segments, to be resolved once merged to another
|
|
relative or an absolute path.</p>
|
|
|
|
<p>The fourth discrepancy occurs with acceptance of alternate character
|
|
codes for the same element. Path seperators are not retained within
|
|
the APR canonical forms. The OS filesystem and APR (slashed) forms
|
|
can both be returned as strings, to be used in the proper context.
|
|
Unix, Win32 and Netware all accept slashes and backslashes as the
|
|
same path seperator symbol, although unix strictly accepts slashes.
|
|
While the APR form of the name strictly uses slashes, always consider
|
|
that there could be a platform that actually accepts slashes as a
|
|
character within a segment name.</p>
|
|
|
|
<p>The fifth and worst discrepancy plauges Win32, OS2, Netware, and some
|
|
filesystems mounted in Unix. Case insensitivity can permit the same
|
|
file to slip through in both it's proper case and alternate cases.
|
|
Simply changing the case is insufficient for any character set beyond
|
|
ASCII, since various dilectic forms of characters suffer from one to
|
|
many or many to one translations. An example would be u-umlaut, which
|
|
might be accepted as a single character u-umlaut, a two character
|
|
sequence u and the zero-width umlaut, the upper case form of the same,
|
|
or perhaps even a captial U alone. This can be handled in different
|
|
ways depending on the purposes of the APR based program, but the one
|
|
requirement is that the path must be absolute in order to resolve these
|
|
ambiguities. Methods employed include comparison of device and inode
|
|
file uniqifiers, which is a fairly fast operation, or quering the OS
|
|
for the true form of the name, which can be much slower. Only the
|
|
acknowledgement of the file names by the OS can validate the equality
|
|
of two different cases of the same filename.</p>
|
|
|
|
<p>The sixth discrepancy, illegal or insignificant characters, is especially
|
|
significant in non-unix file systems. Trailing periods are accepted
|
|
but never stored, therefore trailing periods must be ignored for any
|
|
form of comparison. And all OS's have certain expectations of what
|
|
characters are illegal (or undesireable due to confusion.)</p>
|
|
|
|
<p>A final warning, canonical functions don't transform or resolve case
|
|
or character ambiguity issues until they are resolved into an absolute
|
|
path. The relative canonical path, while useful, while useful for URL
|
|
or similar identifiers, cannot be used for testing or comparison of file
|
|
system objects.</p>
|
|
|
|
<hr>
|
|
|
|
<h2>Canonical API</h2>
|
|
|
|
Functions to manipulate the apr_canon_file_t (an opaque type) include:
|
|
|
|
<ul>
|
|
<li>Create canon_file_t (from char* path and canon_file_t parent path)
|
|
<li>Merged canon_file_t (from path and parent, both canon_file_t)
|
|
<li>Get char* path of all or some segments
|
|
<li>Get path flags of IsRelative, IsVirtualRoot, and IsAbsolute
|
|
<li>Compare two canon_file_t structures for file equality
|
|
</ul>
|
|
|
|
<p>The path is corrected to the file system case only if is in absolute
|
|
form. The apr_canon_file_t should be preserved as long as possible and
|
|
used as the parent to create child entries to reduce the number of expensive
|
|
stat and case canonicalization calls to the OS.</p>
|
|
|
|
<p>The comparison operation provides that the APR can postpone correction
|
|
of case by simply relying upon the device and inode for equivalence. The
|
|
stat implementation provides that two files are the same, while their
|
|
strings are not equivalent, and eliminates the need for the operating
|
|
system to return the proper form of the name.</p>
|
|
|
|
<p>In any case, returning the char* path, with a flag to request the proper
|
|
case, forces the OS calls to resolve the true names of each segment. Where
|
|
there is a penality for this operation and the stat device and inode test
|
|
is faster, case correction is postponed until the char* result is requested.
|
|
On platforms that identify the inode, device, or proper name interchangably
|
|
with no penalities, this may occur when the name is initially processed.</p>
|
|
|
|
<hr>
|
|
|
|
<h2>Unix Example</h2>
|
|
|
|
<p>First the simplest case:</p>
|
|
|
|
<pre>
|
|
Parse Canonical Name
|
|
accepts parent path as canonical_t
|
|
this path as string
|
|
|
|
Split this path Segments on '/'
|
|
|
|
For each of this path Segments
|
|
If first Segment
|
|
If this Segment is Empty ([nothing]/)
|
|
Append this Root Segment (don't merge)
|
|
Continue to next Segment
|
|
Else is relative
|
|
Append parent Segments (to merge)
|
|
Continue with this Segment
|
|
If Segment is '.' or empty (2 slashes)
|
|
Discard this Segment
|
|
Continue with next Segment
|
|
If Segment is '..'
|
|
If no previous Segment or previous Segment is '..'
|
|
Append this Segment
|
|
Continue with next Segment
|
|
If previous Segment and previous is not Root Segment
|
|
Discard previous Segment
|
|
Discard this Segment
|
|
Continue with next Segment
|
|
Append this Relative Segment
|
|
Continue with next Segment
|
|
</pre>
|
|
|
|
</BODY>
|
|
</HTML>
|