diff --git a/share/doc/psd/02.implement/Makefile b/share/doc/psd/02.implement/Makefile new file mode 100644 index 000000000000..09d7d8bd5c6c --- /dev/null +++ b/share/doc/psd/02.implement/Makefile @@ -0,0 +1,12 @@ +# @(#)Makefile 8.1 (Berkeley) 6/8/93 +# $FreeBSD$ + +DIR= psd/02.implement +SRCS= implement +MACROS= -ms +REFER= refer -p /usr/old/dict/papers/Ind -e + +paper.ps: ${SRCS} fig1.pic fig2.pic +	${SOELIM} ${SRCS} | ${REFER} | ${PIC} | ${ROFF} > ${.TARGET} + +.include <bsd.doc.mk> diff --git a/share/doc/psd/02.implement/implement b/share/doc/psd/02.implement/implement new file mode 100644 index 000000000000..c70f8f0107d6 --- /dev/null +++ b/share/doc/psd/02.implement/implement @@ -0,0 +1,1250 @@ +.\" Copyright (c) 1986, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" This module is believed to contain source code proprietary to AT&T. +.\" Use and redistribution is subject to the Berkeley Software License +.\" Agreement and your Software Agreement with AT&T (Western Electric). +.\" +.\" @(#)implement 8.1 (Berkeley) 6/8/93 +.\" +.\" $FreeBSD$ +.EH 'PSD:2-%''UNIX Implementation' +.OH 'UNIX Implementation''PSD:2-%' +.de P1 +.DS +.. +.de P2 +.DE +.. +.de UL +.lg 0 +.if n .ul +\%\&\\$3\f3\\$1\fR\&\\$2 +.lg +.. +.de UC +\&\\$3\s-1\\$1\\s0\&\\$2 +.. +.de IT +.lg 0 +.if n .ul +\%\&\\$3\f2\\$1\fR\&\\$2 +.lg +.. +.de SP +.sp \\$1 +.. +.hw device +.TL +UNIX Implementation +.AU "MH 2C-523" 2394 +K. Thompson +.AI +AT&T Bell Laboratories +Murray Hill, NJ +.AB +This paper describes in high-level terms the +implementation of the resident +.UX +kernel. +This discussion is broken into three parts. +The first part describes +how the +.UX +system views processes, users, and programs. +The second part describes the I/O system. +The last part describes the +.UX +file system. +.AE +.NH +INTRODUCTION +.PP +The +.UX +kernel consists of about 10,000 +lines of C code and about 1,000 lines of assembly code. +The assembly code can be further broken down into +200 lines included for +the sake of efficiency +(they could have been written in C) +and 800 lines to perform hardware +functions not possible in C. +.PP +This code represents 5 to 10 percent of what has +been lumped into the broad expression +``the +.UX +operating system.'' +The kernel is the only +.UX +code that +cannot be substituted by a user to his +own liking. +For this reason, +the kernel should make as few real +decisions as possible. +This does not mean to allow the user +a million options to do the same thing. +Rather, it means to allow only one way to +do one thing, +but have that way be the least-common divisor +of all the options that might have been provided. +.PP +What is or is not implemented in the kernel +represents both a great responsibility and a great power. +It is a soap-box platform on +``the way things should be done.'' +Even so, if +``the way'' is too radical, +no one will follow it. +Every important decision was weighed +carefully. +Throughout, +simplicity has been substituted for efficiency. +Complex algorithms are used only if +their complexity can be localized. +.NH +PROCESS CONTROL +.PP +In the +.UX +system, +a user executes programs in an +environment called a user process. +When a system function is required, +the user process calls the system +as a subroutine. +At some point in this call, +there is a distinct switch of environments. +After this, +the process is said to be a system process.
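+As a concrete (and entirely user-level) illustration of this view,
+the short C program below simply copies its input to its output;
+it is a minimal sketch, not kernel code,
+and each call to
+.UL read
+or
+.UL write
+is an ordinary subroutine call that runs as the system phase
+of the calling process and then returns to the user phase.
+.P1
+/* Copy standard input to standard output. */
+#include <unistd.h>
+
+int main(void)
+{
+    char buf[512];
+    int n;
+
+    /* each read() and write() below enters the system phase */
+    while ((n = read(0, buf, sizeof buf)) > 0)
+        write(1, buf, n);
+    return 0;
+}
+.P2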
+In the normal definition of processes, +the user and system processes are different +phases of the same process +(they never execute simultaneously). +For protection, +each system process has its own stack. +.PP +The user process may execute +from a read-only text segment, +which is shared by all processes +executing the same code. +There is no +.IT functional +benefit +from shared-text segments. +An +.IT efficiency +benefit comes from the fact +that there is no need to swap read-only +segments out because the original +copy on secondary memory is still current. +This is a great benefit to interactive +programs that tend to be swapped while +waiting for terminal input. +Furthermore, +if two processes are +executing +simultaneously +from the same copy of a read-only segment, +only one copy needs to reside in +primary memory. +This is a secondary effect, +because +simultaneous execution of a program +is not common. +It is ironic that this effect, +which reduces the use of primary memory, +only comes into play when there is +an overabundance of primary memory, +that is, +when there is enough memory +to keep waiting processes loaded. +.PP +All current read-only text segments in the +system are maintained from the +.IT "text table" . +A text table entry holds the location of the +text segment on secondary memory. +If the segment is loaded, +that table also holds the primary memory location +and the count of the number of processes +sharing this entry. +When this count is reduced to zero, +the entry is freed along with any +primary and secondary memory holding the segment. +When a process first executes a shared-text segment, +a text table entry is allocated and the +segment is loaded onto secondary memory. +If a second process executes a text segment +that is already allocated, +the entry reference count is simply incremented. +.PP +A user process has some strictly private +read-write data +contained in its +data segment. +As far as possible, +the system does not use the user's +data segment to hold system data. +In particular, +there are no I/O buffers in the +user address space. +.PP +The user data segment has two growing boundaries. +One, increased automatically by the system +as a result of memory faults, +is used for a stack. +The second boundary is only grown (or shrunk) by +explicit requests. +The contents of newly allocated primary memory +is initialized to zero. +.PP +Also associated and swapped with +a process is a small fixed-size +system data segment. +This segment contains all +the data about the process +that the system needs only when the +process is active. +Examples of the kind of data contained +in the system data segment are: +saved central processor registers, +open file descriptors, +accounting information, +scratch data area, +and the stack for the system phase +of the process. +The system data segment is not +addressable from the user process +and is therefore protected. +.PP +Last, +there is a process table with +one entry per process. +This entry contains all the data +needed by the system when the process +is +.IT not +active. +Examples are +the process's name, +the location of the other segments, +and scheduling information. +The process table entry is allocated +when the process is created, and freed +when the process terminates. +This process entry is always directly +addressable by the kernel. +.PP +Figure 1 shows the relationships +between the various process control +data. 
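+In C terms, these relationships might be sketched roughly as follows;
+the declarations are hypothetical and much simplified,
+not the kernel's actual ones.
+.P1
+struct text {               /* one per active shared-text segment */
+    int  t_daddr;           /* location on secondary memory */
+    int  t_caddr;           /* location in primary memory, if loaded */
+    int  t_count;           /* number of processes sharing the entry */
+};
+
+struct proc {               /* one per process; always addressable */
+    char p_stat;            /* run state and scheduling information */
+    char p_pri;             /* scheduling priority */
+    int  p_addr;            /* location of the swappable image */
+    int  p_size;            /* size of that image */
+    struct text *p_text;    /* shared text segment, if any */
+};
+
+/* The swappable image holds the user data segment and the
+   system data segment: saved registers, open files, accounting
+   information, and the stack for the system phase. */
+.P2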
+In a sense, +the process table is the +definition of all processes, +because +all the data associated with a process +may be accessed +starting from the process table entry. +.KF +.in .375i +.so fig1.pic +.in -.375i +.sp 2v +.ce +Fig. 1\(emProcess control data structure. +.KE +.NH 2 +Process creation and program execution +.PP +Processes are created by the system primitive +.UL fork . +The newly created process (child) is a copy of the original process (parent). +There is no detectable sharing of primary memory between the two processes. +(Of course, +if the parent process was executing from a read-only +text segment, +the child will share the text segment.) +Copies of all writable data segments +are made for the child process. +Files that were open before the +.UL fork +are +truly shared after the +.UL fork . +The processes are informed as to their part in the +relationship to +allow them to select their own +(usually non-identical) +destiny. +The parent may +.UL wait +for the termination of +any of its children. +.PP +A process may +.UL exec +a file. +This consists of exchanging the current text and data +segments of the process for new text and data +segments specified in the file. +The old segments are lost. +Doing an +.UL exec +does +.IT not +change processes; +the process that did the +.UL exec +persists, +but +after the +.UL exec +it is executing a different program. +Files that were open +before the +.UL exec +remain open after the +.UL exec . +.PP +If a program, +say the first pass of a compiler, +wishes to overlay itself with another program, +say the second pass, +then it simply +.UL exec s +the second program. +This is analogous +to a ``goto.'' +If a program wishes to regain control +after +.UL exec ing +a second program, +it should +.UL fork +a child process, +have the child +.UL exec +the second program, and +have the parent +.UL wait +for the child. +This is analogous to a ``call.'' +Breaking up the call into a binding followed by +a transfer is similar to the subroutine linkage in +SL-5. +.[ +griswold hanson sl5 overview +.] +.NH 2 +Swapping +.PP +The major data associated with a process +(the user data segment, +the system data segment, and +the text segment) +are swapped to and from secondary +memory, as needed. +The user data segment and the system data segment +are kept in contiguous primary memory to reduce +swapping latency. +(When low-latency devices, such as bubbles, +.UC CCD s, +or scatter/gather devices, +are used, +this decision will have to be reconsidered.) +Allocation of both primary +and secondary memory is performed +by the same simple first-fit algorithm. +When a process grows, +a new piece of primary memory is allocated. +The contents of the old memory is copied to the new memory. +The old memory is freed +and the tables are updated. +If there is not enough primary memory, +secondary memory is allocated instead. +The process is swapped out onto the +secondary memory, +ready to be swapped in with +its new size. +.PP +One separate process in the kernel, +the swapping process, +simply swaps the other +processes in and out of primary memory. +It examines the +process table looking for a process +that is swapped out and is +ready to run. +It allocates primary memory for that +process and +reads its segments into +primary memory, where that process competes for the +central processor with other loaded processes. 
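+The first-fit allocation mentioned above is only a few lines of C;
+the sketch below is illustrative
+(the map format is an invention for the example, not the kernel's),
+with memory measured in abstract allocation units.
+.P1
+struct map {                /* one free area: base and size */
+    int m_addr;
+    int m_size;
+};
+
+/* First fit: return the base of the first free area large
+   enough for `size' units, or -1 if none is large enough. */
+int
+allocate(struct map *mp, int nmap, int size)
+{
+    int i;
+
+    for (i = 0; i < nmap; i++)
+        if (mp[i].m_size >= size) {
+            int addr = mp[i].m_addr;
+            mp[i].m_addr += size;
+            mp[i].m_size -= size;
+            return addr;
+        }
+    return -1;              /* caller must free memory by swapping */
+}
+.P2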
+If no primary memory is available, +the swapping process makes memory available +by examining the process table for processes +that can be swapped out. +It selects a process to swap out, +writes it to secondary memory, +frees the primary memory, +and then goes back to look for a process +to swap in. +.PP +Thus there are two specific algorithms +to the swapping process. +Which of the possibly many processes that +are swapped out is to be swapped in? +This is decided by secondary storage residence +time. +The one with the longest time out is swapped in first. +There is a slight penalty for larger processes. +Which of the possibly many processes that +are loaded is to be swapped out? +Processes that are waiting for slow events +(i.e., not currently running or waiting for +disk I/O) +are picked first, +by age in primary memory, +again with size penalties. +The other processes are examined +by the same age algorithm, +but are not taken out unless they are +at least of some age. +This adds +hysteresis to the swapping and +prevents total thrashing. +.PP +These swapping algorithms are the +most suspect in the system. +With limited primary memory, +these algorithms cause total swapping. +This is not bad in itself, because +the swapping does not impact the +execution of the resident processes. +However, if the swapping device must +also be used for file storage, +the swapping traffic severely +impacts the file system traffic. +It is exactly these small systems +that tend to double usage of limited disk +resources. +.NH 2 +Synchronization and scheduling +.PP +Process synchronization is accomplished by having processes +wait for events. +Events are represented by arbitrary integers. +By convention, +events are chosen to be addresses of +tables associated with those events. +For example, a process that is waiting for +any of its children to terminate will wait +for an event that is the address of +its own process table entry. +When a process terminates, +it signals the event represented by +its parent's process table entry. +Signaling an event on which no process +is waiting has no effect. +Similarly, +signaling an event on which many processes +are waiting will wake all of them up. +This differs considerably from +Dijkstra's P and V +synchronization operations, +.[ +dijkstra sequential processes 1968 +.] +in that +no memory is associated with events. +Thus there need be no allocation of events +prior to their use. +Events exist simply by being used. +.PP +On the negative side, +because there is no memory associated with events, +no notion of ``how much'' +can be signaled via the event mechanism. +For example, +processes that want memory might +wait on an event associated with +memory allocation. +When any amount of memory becomes available, +the event would be signaled. +All the competing processes would then wake +up to fight over the new memory. +(In reality, +the swapping process is the only process +that waits for primary memory to become available.) +.PP +If an event occurs +between the time a process decides +to wait for that event and the +time that process enters the wait state, +then +the process will wait on an event that has +already happened (and may never happen again). +This race condition happens because there is no memory associated with +the event to indicate that the event has occurred; +the only action of an event is to change a set of processes +from wait state to run state. 
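+The mechanism itself is small;
+the sketch below is illustrative only,
+with hypothetical declarations and with the actual process
+switch left as an external routine.
+.P1
+#define NPROC 50            /* table size chosen for the example */
+#define SRUN  1             /* ready to run */
+#define SWAIT 2             /* waiting for an event */
+
+struct proc { int p_stat; int p_event; };
+struct proc proc[NPROC];
+
+extern void swtch(void);    /* pick another process to run */
+
+/* Wait for an arbitrary integer event. */
+void
+event_wait(struct proc *p, int event)
+{
+    p->p_event = event;
+    /* a signal arriving before the next line is lost:
+       this is the race described above */
+    p->p_stat = SWAIT;
+    swtch();
+}
+
+/* Signal an event: wake every process waiting on it.  No memory
+   is kept, so a signal with no waiters has no effect at all. */
+void
+event_signal(int event)
+{
+    struct proc *p;
+
+    for (p = proc; p < &proc[NPROC]; p++)
+        if (p->p_stat == SWAIT && p->p_event == event)
+            p->p_stat = SRUN;
+}
+.P2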
+This problem is relieved largely +by the fact that process switching can +only occur in the kernel by explicit calls +to the event-wait mechanism. +If the event in question is signaled by another +process, +then there is no problem. +But if the event is signaled by a hardware +interrupt, +then special care must be taken. +These synchronization races pose the biggest +problem when +.UX +is adapted to multiple-processor configurations. +.[ +hawley meyer multiprocessing unix +.] +.PP +The event-wait code in the kernel +is like a co-routine linkage. +At any time, +all but one of the processes has called event-wait. +The remaining process is the one currently executing. +When it calls event-wait, +a process whose event has been signaled +is selected and that process +returns from its call to event-wait. +.PP +Which of the runable processes is to run next? +Associated with each process is a priority. +The priority of a system process is assigned by the code +issuing the wait on an event. +This is roughly equivalent to the response +that one would expect on such an event. +Disk events have high priority, +teletype events are low, +and time-of-day events are very low. +(From observation, +the difference in system process priorities +has little or no performance impact.) +All user-process priorities are lower than the +lowest system priority. +User-process priorities are assigned +by an algorithm based on the +recent ratio of the amount of compute time to real time consumed +by the process. +A process that has used a lot of +compute time in the last real-time +unit is assigned a low user priority. +Because interactive processes are characterized +by low ratios of compute to real time, +interactive response is maintained without any +special arrangements. +.PP +The scheduling algorithm simply picks +the process with the highest priority, +thus +picking all system processes first and +user processes second. +The compute-to-real-time ratio is updated +every second. +Thus, +all other things being equal, +looping user processes will be +scheduled round-robin with a +1-second quantum. +A high-priority process waking up will +preempt a running, low-priority process. +The scheduling algorithm has a very desirable +negative feedback character. +If a process uses its high priority +to hog the computer, +its priority will drop. +At the same time, if a low-priority +process is ignored for a long time, +its priority will rise. +.NH +I/O SYSTEM +.PP +The I/O system +is broken into two completely separate systems: +the block I/O system and the character I/O system. +In retrospect, +the names should have been ``structured I/O'' +and ``unstructured I/O,'' respectively; +while the term ``block I/O'' has some meaning, +``character I/O'' is a complete misnomer. +.PP +Devices are characterized by a major device number, +a minor device number, and +a class (block or character). +For each class, +there is an array of entry points into the device drivers. +The major device number is used to index the array +when calling the code for a particular device driver. +The minor device number is passed to the +device driver as an argument. +The minor number has no significance other +than that attributed to it by the driver. +Usually, +the driver uses the minor number to access +one of several identical physical devices. +.PP +The use of the array of entry points +(configuration table) +as the only connection between the +system code and the device drivers is +very important. 
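+In C, such a configuration table is no more than an array of
+structures of entry points, indexed by major device number;
+the sketch below is illustrative
+(the names and the particular entry points are not significant),
+and shows only the block-device half of the arrangement.
+.P1
+struct buf;                         /* a pending transfer */
+
+struct bdevsw {                     /* one entry per block driver */
+    void (*d_open)(int minor);
+    void (*d_strategy)(struct buf *bp);
+};
+
+extern struct bdevsw bdevsw[];      /* built from the parts list */
+
+void
+dev_open(int major, int minor)
+{
+    /* the minor number is simply handed on to the driver */
+    (*bdevsw[major].d_open)(minor);
+}
+.P2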
+Early versions of the system had a much +less formal connection with the drivers, +so that it was extremely hard to handcraft +differently configured systems. +Now it is possible to create new +device drivers in an average of a few hours. +The configuration table in most cases +is created automatically by a program +that reads the system's parts list. +.NH 2 +Block I/O system +.PP +The model block I/O device consists +of randomly addressed, secondary +memory blocks of 512 bytes each. +The blocks are uniformly addressed +0, 1, .\|.\|. up to the size of the device. +The block device driver has the job of +emulating this model on a +physical device. +.PP +The block I/O devices are accessed +through a layer of buffering software. +The system maintains a list of buffers +(typically between 10 and 70) +each assigned a device name and +a device address. +This buffer pool constitutes a data cache +for the block devices. +On a read request, +the cache is searched for the desired block. +If the block is found, +the data are made available to the +requester without any physical I/O. +If the block is not in the cache, +the least recently used block in the cache is renamed, +the correct device driver is called to +fill up the renamed buffer, and then the +data are made available. +Write requests are handled in an analogous manner. +The correct buffer is found +and relabeled if necessary. +The write is performed simply by marking +the buffer as ``dirty.'' +The physical I/O is then deferred until +the buffer is renamed. +.PP +The benefits in reduction of physical I/O +of this scheme are substantial, +especially considering the file system implementation. +There are, +however, +some drawbacks. +The asynchronous nature of the +algorithm makes error reporting +and meaningful user error handling +almost impossible. +The cavalier approach to I/O error +handling in the +.UX +system is partly due to the asynchronous +nature of the block I/O system. +A second problem is in the delayed writes. +If the system stops unexpectedly, +it is almost certain that there is a +lot of logically complete, +but physically incomplete, +I/O in the buffers. +There is a system primitive to +flush all outstanding I/O activity +from the buffers. +Periodic use of this primitive helps, +but does not solve, the problem. +Finally, +the associativity in the buffers +can alter the physical I/O sequence +from that of the logical I/O sequence. +This means that there are times +when data structures on disk are inconsistent, +even though the software is careful +to perform I/O in the correct order. +On non-random devices, +notably magnetic tape, +the inversions of writes can be disastrous. +The problem with magnetic tapes is ``cured'' by +allowing only one outstanding write request +per drive. +.NH 2 +Character I/O system +.PP +The character I/O system consists of all +devices that do not fall into the block I/O model. +This includes the ``classical'' character devices +such as communications lines, paper tape, and +line printers. +It also includes magnetic tape and disks when +they are not used in a stereotyped way, +for example, 80-byte physical records on tape +and track-at-a-time disk copies. +In short, +the character I/O interface +means ``everything other than block.'' +I/O requests from the user are sent to the +device driver essentially unaltered. +The implementation of these requests is, of course, +up to the device driver. +There are guidelines and conventions +to help the implementation of +certain types of device drivers. 
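+.PP
+Returning for a moment to the block I/O system,
+the heart of the buffer cache described above is the lookup;
+a minimal sketch is given below,
+with hypothetical structures, no locking,
+and the physical transfers left as external routines.
+.P1
+struct buf {
+    int  b_dev;                     /* device name */
+    int  b_blkno;                   /* block number on that device */
+    int  b_flags;                   /* B_DIRTY, etc. */
+    char b_data[512];
+    struct buf *b_next;             /* most recently used first */
+};
+
+#define B_DIRTY 01
+
+extern struct buf *bufhead;         /* buffer pool, assumed non-empty */
+extern void bread(struct buf *);    /* driver fills the buffer */
+extern void bwrite(struct buf *);   /* driver writes a dirty buffer */
+
+/* Return a buffer holding block `blkno' of device `dev',
+   doing physical I/O only when the block is not cached. */
+struct buf *
+getblk(int dev, int blkno)
+{
+    struct buf *bp, *lru = 0;
+
+    for (bp = bufhead; bp != 0; bp = bp->b_next) {
+        if (bp->b_dev == dev && bp->b_blkno == blkno)
+            return bp;              /* found: no physical I/O */
+        lru = bp;                   /* the last buffer is least recent */
+    }
+    if (lru->b_flags & B_DIRTY)
+        bwrite(lru);                /* the deferred write happens now */
+    lru->b_dev = dev;               /* rename the buffer */
+    lru->b_blkno = blkno;
+    lru->b_flags = 0;
+    bread(lru);
+    return lru;
+}
+.P2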
+.NH 3 +Disk drivers +.PP +Disk drivers are implemented +with a queue of transaction records. +Each record holds a read/write flag, +a primary memory address, +a secondary memory address, and +a transfer byte count. +Swapping is accomplished by passing +such a record to the swapping device driver. +The block I/O interface is implemented by +passing such records with requests to +fill and empty system buffers. +The character I/O interface to the disk +drivers create a transaction record that +points directly into the user area. +The routine that creates this record also insures +that the user is not swapped during this +I/O transaction. +Thus by implementing the general disk driver, +it is possible to use the disk +as a block device, +a character device, and a swap device. +The only really disk-specific code in normal +disk drivers is the pre-sort of transactions to +minimize latency for a particular device, and +the actual issuing of the I/O request. +.NH 3 +Character lists +.PP +Real character-oriented devices may +be implemented using the common +code to handle character lists. +A character list is a queue of characters. +One routine puts a character on a queue. +Another gets a character from a queue. +It is also possible to ask how many +characters are currently on a queue. +Storage for all queues in the system comes +from a single common pool. +Putting a character on a queue will allocate +space from the common pool and link the +character onto the data structure defining the queue. +Getting a character from a queue returns +the corresponding space to the pool. +.PP +A typical character-output device +(paper tape punch, for example) +is implemented by passing characters +from the user onto a character queue until +some maximum number of characters is on the queue. +The I/O is prodded to start as +soon as there is anything on the queue +and, once started, +it is sustained by hardware completion interrupts. +Each time there is a completion interrupt, +the driver gets the next character from the queue +and sends it to the hardware. +The number of characters on the queue is checked and, +as the count falls through some intermediate level, +an event (the queue address) is signaled. +The process that is passing characters from +the user to the queue can be waiting on the event, and +refill the queue to its maximum +when the event occurs. +.PP +A typical character input device +(for example, a paper tape reader) +is handled in a very similar manner. +.PP +Another class of character devices is the terminals. +A terminal is represented by three +character queues. +There are two input queues (raw and canonical) +and an output queue. +Characters going to the output of a terminal +are handled by common code exactly as described +above. +The main difference is that there is also code +to interpret the output stream as +.UC ASCII +characters and to perform some translations, +e.g., escapes for deficient terminals. +Another common aspect of terminals is code +to insert real-time delay after certain control characters. +.PP +Input on terminals is a little different. +Characters are collected from the terminal and +placed on a raw input queue. +Some device-dependent code conversion and +escape interpretation is handled here. +When a line is complete in the raw queue, +an event is signaled. +The code catching this signal then copies a +line from the raw queue to a canonical queue +performing the character erase and line kill editing. 
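+This erase and kill editing is a simple copy loop.
+The sketch below assumes the historical erase and kill characters,
+# and @,
+and uses the character-queue routines described earlier,
+plus two hypothetical helpers for removing the last character
+and for flushing a partial line.
+.P1
+#define ERASE '#'               /* assumed erase character */
+#define KILL  '@'               /* assumed line-kill character */
+
+struct clist;                   /* a character queue */
+
+extern int  getcq(struct clist *);          /* -1 when empty */
+extern void putcq(struct clist *, int);
+extern void unputcq(struct clist *);        /* drop last character */
+extern void flushq(struct clist *);         /* drop the whole queue */
+
+/* Copy one completed line from the raw queue to the canonical
+   queue, applying erase and kill editing on the way. */
+void
+canon(struct clist *raw, struct clist *can)
+{
+    int c;
+
+    while ((c = getcq(raw)) != -1) {
+        if (c == ERASE)
+            unputcq(can);
+        else if (c == KILL)
+            flushq(can);
+        else
+            putcq(can, c);
+    }
+}
+.P2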
User read requests on terminals can be +directed at either the raw or canonical queues. +.NH 3 +Other character devices +.PP +Finally, +there are devices that fit no general category. +These devices are set up as character I/O drivers. +An example is a driver that reads and writes +unmapped primary memory as an I/O device. +Some devices are too +fast to be treated a character at a time, +but do not fit the disk I/O mold. +Examples are fast communications lines and +fast line printers. +These devices either have their own buffers +or ``borrow'' block I/O buffers for a while and +then give them back. +.NH +THE FILE SYSTEM +.PP +In the +.UX +system, +a file is a (one-dimensional) array of bytes. +No other structure of files is implied by the +system. +Files are attached anywhere +(and possibly multiply) +onto a hierarchy of directories. +Directories are simply files that +users cannot write. +For a further discussion +of the external view of files and directories, +see Ref.\0 +.[ +ritchie thompson unix bstj 1978 +%Q This issue +.]. +.PP +The +.UX +file system is a disk data structure +accessed completely through +the block I/O system. +As stated before, +the canonical view of a ``disk'' is +a randomly addressable array of +512-byte blocks. +A file system breaks the disk into +four self-identifying regions. +The first block (address 0) +is unused by the file system. +It is left aside for booting procedures. +The second block (address 1) +contains the so-called ``super-block.'' +This block, +among other things, +contains the size of the disk and +the boundaries of the other regions. +Next comes the i-list, +a list of file definitions. +Each file definition is +a 64-byte structure, called an i-node. +The offset of a particular i-node +within the i-list is called its i-number. +The combination of device name +(major and minor numbers) and i-number +serves to uniquely name a particular file. +After the i-list, +and to the end of the disk, +come free storage blocks that +are available for the contents of files. +.PP +The free space on a disk is maintained +by a linked list of available disk blocks. +Every block in this chain contains a disk address +of the next block in the chain. +The remaining space contains the addresses of up to +50 disk blocks that are also free. +Thus with one I/O operation, +the system obtains 50 free blocks and a +pointer where to find more. +The disk allocation algorithms are +very straightforward. +Since all allocation is in fixed-size +blocks and there is strict accounting of +space, +there is no need to compact or garbage collect. +However, +as disk space becomes dispersed, +latency gradually increases. +Some installations choose to occasionally compact +disk space to reduce latency. +.PP +An i-node contains 13 disk addresses. +The first 10 of these addresses point directly at +the first 10 blocks of a file. +If a file is larger than 10 blocks (5,120 bytes), +then the eleventh address points at a block +that contains the addresses of the next 128 blocks of the file. +If the file is still larger than this +(70,656 bytes), +then the twelfth address points at up to 128 blocks, +each pointing to 128 blocks of the file. +Files yet larger +(8,459,264 bytes) +use the thirteenth address for a ``triple indirect'' address. +The algorithm ends here with the maximum file size +of 1,082,201,088 bytes. +.PP +A logical directory hierarchy is added +to this flat physical structure simply +by adding a new type of file, the directory. +A directory is accessed exactly as an ordinary file.
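+The mapping from a byte offset in a file to a disk block follows
+directly from the i-node layout just described.
+The sketch below is schematic:
+it uses hypothetical names,
+handles only the direct and singly indirect cases,
+and leaves the fetch of an address from an indirect block
+as an external routine.
+.P1
+#define BSIZE  512
+#define NADDR  13               /* disk addresses in an i-node */
+#define NINDIR 128              /* addresses per indirect block */
+
+struct inode { int i_addr[NADDR]; /* and other fields */ };
+
+extern int indirect(int blk, long n);   /* n-th address in block blk */
+
+/* Return the disk address of the block holding byte `offset'
+   of the file; double and triple indirection are omitted. */
+int
+bmap(struct inode *ip, long offset)
+{
+    long bn = offset / BSIZE;   /* logical block number */
+
+    if (bn < 10)
+        return ip->i_addr[bn];                  /* direct */
+    bn -= 10;
+    if (bn < NINDIR)
+        return indirect(ip->i_addr[10], bn);    /* single indirect */
+    return 0;       /* would continue with i_addr[11] and i_addr[12] */
+}
+.P2
+A directory, in particular, is mapped through its i-node
+in exactly this way.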
+It contains 16-byte entries consisting of +a 14-byte name and an i-number. +The root of the hierarchy is at a known i-number +(\fIviz.,\fR 2). +The file system structure allows an arbitrary, directed graph +of directories with regular files linked in +at arbitrary places in this graph. +In fact, +very early +.UX +systems used such a structure. +Administration of such a structure became so +chaotic that later systems were restricted +to a directory tree. +Even now, +with regular files linked multiply +into arbitrary places in the tree, +accounting for space has become a problem. +It may become necessary to restrict the entire +structure to a tree, +and allow a new form of linking that +is subservient to the tree structure. +.PP +The file system allows +easy creation, +easy removal, +easy random accessing, +and very easy space allocation. +With most physical addresses confined +to a small contiguous section of disk, +it is also easy to dump, restore, and +check the consistency of the file system. +Large files suffer from indirect addressing, +but the cache prevents most of the implied physical I/O +without adding much execution. +The space overhead properties of this scheme are quite good. +For example, +on one particular file system, +there are 25,000 files containing 130M bytes of data-file content. +The overhead (i-node, indirect blocks, and last block breakage) +is about 11.5M bytes. +The directory structure to support these files +has about 1,500 directories containing 0.6M bytes of directory content +and about 0.5M bytes of overhead in accessing the directories. +Added up any way, +this comes out to less than a 10 percent overhead for actual +stored data. +Most systems have this much overhead in +padded trailing blanks alone. +.NH 2 +File system implementation +.PP +Because the i-node defines a file, +the implementation of the file system centers +around access to the i-node. +The system maintains a table of all active +i-nodes. +As a new file is accessed, +the system locates the corresponding i-node, +allocates an i-node table entry, and reads +the i-node into primary memory. +As in the buffer cache, +the table entry is considered to be the current +version of the i-node. +Modifications to the i-node are made to +the table entry. +When the last access to the i-node goes +away, +the table entry is copied back to the +secondary store i-list and the table entry is freed. +.KF +.in .25i +.so fig2.pic +.in -.25i +.sp 2v +.ce +Fig. 2\(emFile system data structure. +.sp +.KE +.PP +All I/O operations on files are carried out +with the aid of the corresponding i-node table entry. +The accessing of a file is a straightforward +implementation of the algorithms mentioned previously. +The user is not aware of i-nodes and i-numbers. +References to the file system are made in terms of +path names of the directory tree. +Converting a path name into an i-node table entry +is also straightforward. +Starting at some known i-node +(the root or the current directory of some process), +the next component of the path name is +searched by reading the directory. +This gives an i-number and an implied device +(that of the directory). +Thus the next i-node table entry can be accessed. +If that was the last component of the path name, +then this i-node is the result. +If not, +this i-node is the directory needed to look up +the next component of the path name, and the +algorithm is repeated. +.PP +The user process accesses the file system with +certain primitives. 
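+All of them rest on the path-name conversion just described,
+which is itself only a short loop.
+The sketch below is schematic,
+with hypothetical routines and no error handling.
+.P1
+struct inode;
+
+#define ROOTINO 2               /* i-number of the root directory */
+extern int rootdev;
+
+extern struct inode *iget(int dev, int inumber);   /* i-node table */
+extern int  ilookup(struct inode *dp, char *name); /* i-number or 0 */
+extern int  idev(struct inode *ip);                /* device of i-node */
+extern char *nextname(char **path);        /* next component or 0 */
+
+/* Convert a path name into an in-core i-node. */
+struct inode *
+namei(char *path, struct inode *cwd)
+{
+    struct inode *ip;
+    char *name;
+
+    ip = (*path == '/') ? iget(rootdev, ROOTINO) : cwd;
+    while ((name = nextname(&path)) != 0) {
+        int inum = ilookup(ip, name);   /* read the directory */
+        if (inum == 0)
+            return 0;                   /* no such component */
+        ip = iget(idev(ip), inum);      /* device implied by directory */
+    }
+    return ip;
+}
+.P2
+The primitives described next are all built on this conversion.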
+The most common of these are +.UL open , +.UL create , +.UL read , +.UL write , +.UL seek , +and +.UL close . +The data structures maintained are shown in Fig. 2. +In the system data segment associated with a user, +there is room for some (usually between 10 and 50) open files. +This open file table consists of pointers that can be used to access +corresponding i-node table entries. +Associated with each of these open files is +a current I/O pointer. +This is a byte offset of +the next read/write operation on the file. +The system treats each read/write request +as random with an implied seek to the +I/O pointer. +The user usually thinks of the file as +sequential with the I/O pointer +automatically counting the number of bytes +that have been read/written from the file. +The user may, +of course, +perform random I/O by setting the I/O pointer +before reads/writes. +.PP +With file sharing, +it is necessary to allow related +processes to share a common I/O pointer +and yet have separate I/O pointers +for independent processes +that access the same file. +With these two conditions, +the I/O pointer cannot reside +in the i-node table nor can +it reside in the list of +open files for the process. +A new table +(the open file table) +was invented for the sole purpose +of holding the I/O pointer. +Processes that share the same open +file +(the result of +.UL fork s) +share a common open file table entry. +A separate open of the same file will +only share the i-node table entry, +but will have distinct open file table entries. +.PP +The main file system primitives are implemented as follows. +.UL \&open +converts a file system path name into an i-node +table entry. +A pointer to the i-node table entry is placed in a +newly created open file table entry. +A pointer to the file table entry is placed in the +system data segment for the process. +.UL \&create +first creates a new i-node entry, +writes the i-number into a directory, and +then builds the same structure as for an +.UL open . +.UL \&read +and +.UL write +just access the i-node entry as described above. +.UL \&seek +simply manipulates the I/O pointer. +No physical seeking is done. +.UL \&close +just frees the structures built by +.UL open +and +.UL create . +Reference counts are kept on the open file table entries and +the i-node table entries to free these structures after +the last reference goes away. +.UL \&unlink +simply decrements the count of the +number of directories pointing at the given i-node. +When the last reference to an i-node table entry +goes away, +if the i-node has no directories pointing to it, +then the file is removed and the i-node is freed. +This delayed removal of files prevents +problems arising from removing active files. +A file may be removed while still open. +The resulting unnamed file vanishes +when the file is closed. +This is a method of obtaining temporary files. +.PP +There is a type of unnamed +.UC FIFO +file called a +.UL pipe. +Implementation of +.UL pipe s +consists of implied +.UL seek s +before each +.UL read +or +.UL write +in order to implement +first-in-first-out. +There are also checks and synchronization +to prevent the +writer from grossly outproducing the +reader and to prevent the reader from +overtaking the writer. +.NH 2 +Mounted file systems +.PP +The file system of a +.UX +system +starts with some designated block device +formatted as described above to contain +a hierarchy. +The root of this structure is the root of +the +.UX +file system. 
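+(As an aside,
+the three levels of tables described above relate roughly as in
+the following sketch;
+the declarations are hypothetical and heavily simplified.)
+.P1
+struct inode {                  /* one per active file */
+    int i_count;                /* references from open file entries */
+    /* the on-disk i-node fields follow */
+};
+
+struct file {                   /* one per open; holds the I/O pointer */
+    int  f_count;               /* shared by the children of a fork */
+    long f_offset;              /* byte offset of the next read/write */
+    struct inode *f_inode;
+};
+
+#define NOFILE 15               /* "between 10 and 50" in the text */
+
+struct user {                   /* per-process system data segment */
+    struct file *u_ofile[NOFILE];   /* small integers index here */
+    /* saved registers, accounting, kernel stack, and so on */
+};
+.P2
+Mounting, described next, extends the i-node level of this
+picture across more than one block device.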
+A second formatted block device may be +mounted +at any leaf of +the current hierarchy. +This logically extends the current hierarchy. +The implementation of +mounting +is trivial. +A mount table is maintained containing +pairs of designated leaf i-nodes and +block devices. +When converting a path name into an i-node, +a check is made to see if the new i-node is a +designated leaf. +If it is, +the i-node of the root +of the block device replaces it. +.PP +Allocation of space for a file is taken +from the free pool on the device on which the +file lives. +Thus a file system consisting of many +mounted devices does not have a common pool of +free secondary storage space. +This separation of space on different +devices is necessary to allow easy +unmounting +of a device. +.NH 2 +Other system functions +.PP +There are some other things that the system +does for the user\-a +little accounting, +a little tracing/debugging, +and a little access protection. +Most of these things are not very +well developed +because our use of the system in computing science research +does not need them. +There are some features that are missed in some +applications, for example, better inter-process communication. +.PP +The +.UX +kernel is an I/O multiplexer more than +a complete operating system. +This is as it should be. +Because of this outlook, +many features are +found in most +other operating systems that are missing from the +.UX +kernel. +For example, +the +.UX +kernel does not support +file access methods, +file disposition, +file formats, +file maximum size, +spooling, +command language, +logical records, +physical records, +assignment of logical file names, +logical file names, +more than one character set, +an operator's console, +an operator, +log-in, +or log-out. +Many of these things are symptoms rather than features. +Many of these things are implemented +in user software +using the kernel as a tool. +A good example of this is the command language. +.[ +bourne shell 1978 bstj +%Q This issue +.] +Each user may have his own command language. +Maintenance of such code is as easy as +maintaining user code. +The idea of implementing ``system'' code with general +user primitives +comes directly from +.UC MULTICS . +.[ +organick multics 1972 +.] +.LP +.[ +$LIST$ +.]