\documentclass[english]{report}
\usepackage[T1]{fontenc}
\usepackage[latin1]{inputenc}
\usepackage{a4wide}
\setcounter{secnumdepth}{3} \setcounter{tocdepth}{3}
\usepackage{graphicx}
\makeatletter
\usepackage{babel}
\makeatother
\begin{document}
\thispagestyle{empty}
\begin{center}{\LARGE S.Ha.R.K. User Manual}\end{center}{\LARGE \par} \vfill{}
\begin{center}{\large Volume VI}\end{center}
\begin{center}S.Ha.R.K. File System\end{center} \vfill{}
\begin{center}Written by\end{center}
\begin{center}Paolo Gai (pj at sssup.it)\end{center}
\begin{center}Tullio Facchinetti (tullio.facchinetti at unipv.it)\end{center}
\begin{center}Original italian version by\end{center}
\begin{center}Massimiliano Giorgi (massy at gandalf.sssup.it)\end{center}
\vfill{}
\begin{center}\includegraphics[width=2cm]{../common/sssup.ps}\end{center}
\begin{center}Scuola Superiore di Studi e Perfezionamento S. Anna\end{center}
\begin{center}RETIS Lab\end{center}
\begin{center}Via Carducci, 40 - 56100 Pisa\end{center}
\begin{center}\pagebreak\end{center}
\tableofcontents{}
\listoffigures
\listoftables
\begin{abstract}
This is the documentation about S.Ha.R.K. File system. The document gives an
insight on many data structures used internally by the file system. This is the
first step toward a \emph{true} file system user manual. If you carefully read
it, you will understand how it works, how can be initialized and used.
In any case, please send any comments to pj@sssup.it.
Enjoy,
Paolo
\end{abstract}
%----------------------------------------------------------------------------
\chapter{File system architecture}
\label{sec:arch}
%----------------------------------------------------------------------------
%----------------------------------------------------------------------------
\section{Features}
%----------------------------------------------------------------------------
\label{sec:arch:gen}The library is composed by two main layers:
\begin{description}
\item [File-system] The file system contains all the data structures used to
handle files.
\item [Blocks~devices]This part handles directly the hardware. They provide a
common interface that unifies the usage of all the block devices (i.e., the hard
disks).
\end{description}
The architecture of the library is modular, and it is described in
Figure \ref{fig:Filesystem-architecture}.
\begin{figure}
\begin{center}
\includegraphics[width=6cm]{images/arch2.eps}
\end{center}
\caption{File system architecture}
\label{fig:Filesystem-architecture}
\end{figure}
The library is connected to the Kernel via a \emph{glue} layer, to
the application via a \emph{wrapper} layer; the block devices and
the file system operations are separated by a cache, that may be
excluded.
\label{sec:arch:req}
The library needs a set of primitives that must be supported by
the underlying kernel, as memory handling, string handling, and
functions with a variable number of parameters. Moreover, the
kernel should support functions that implement:
\begin{itemize}
\item Mutual exclusion.
\item Allocation of low level resources (as I/O spaces and interrupts).
\item Memory protection (not needed in S.Ha.R.K.).
\end{itemize}
%----------------------------------------------------------------------------
\section{Low Level: Device Drivers}
%----------------------------------------------------------------------------
\begin{figure}
\begin{center}
\includegraphics[width=6cm]{images/archlow.eps}
\end{center}
\caption{Device Drivers}
\label{fig:archlow}
\end{figure}
The device drivers are divided in two main parts:
\begin{itemize}
\item A general part that exports the basic low level services. It
also gives a set of internal routines to the internal modules.
\item A set of internal modules that handle one or more
peripherals. The application can choose which module must be
linked, initialized and used at compile time.
\end{itemize}
\begin{figure}
\begin{center}
\includegraphics[width=6cm]{images/archlow2.eps}
\end{center}
\caption{Device Drivers - details}
\label{fig:archlow2}
\end{figure}
The general part is composed by four modules:
\begin{description}
\item [Block~device]The main part; it exports the interface to the upper layers,
inits the device driver subsystem and register all the drivers present in the
system.
\item [Logical~disk]This module provides functions that allow to inspect the
logical structure of an hard disk, like start, end, size and type of each
partition on the Hard Disk.
\item [Physical~disk]This module provide a small data base that contains
informations about all the hard disks in the system; these informations can be
useful for the module that must handle the low level pending request queues.
\item [Queue~policy]This module handles the low level policies for the pending
requests (SCAN, LOOK, EDF, ...).
\end{description}
%----------------------------------------------------------------------------
\section{High Level: File systems}
%----------------------------------------------------------------------------
\begin{figure}
\begin{center}
\includegraphics[width=10cm]{images/archhigh.eps}
\end{center}
\caption{High Level - File system details}
\label{fig:archhigh}
\end{figure}
The high level is divided in three parts:
\begin{itemize}
\item The first part handle the file systems, its initialization and its
interface to the internal modules.
\item A second part composed by a set of modules that implement a particular
file system (i.e., FAT16).
\item A cache module (whose policy can be easily changed).
\end{itemize}
The first part is composed by sub-modules (internal implementation
of a sub-module can be changed without affecting the rest of the
library):
\begin{description}
\item [Filesystems]It is responsible for the initialization and termination of
all the high level layer. It registers all the file systems, and it contains the
routines for mounting/unmounting the partitions.
\item [File]It handles the descriptor tables and the files; it defines the
\texttt{file} structure.
\item [Super]It handles the super blocks (that contains the global informations
about partitions on a disk).
\item [Inodes]It handles the inodes (informations about a file in the file
system).
\item [Dentry]It handles file names in a tree structure.
\item [Open,~read,~umask,~...]They implement the correspondent system
primitives.
\end{description}
%----------------------------------------------------------------------------
\chapter{Device drivers}
\label{sec:dd}
%----------------------------------------------------------------------------
%----------------------------------------------------------------------------
\section{Interface}
\label{sec:dd:interfaccia}
%----------------------------------------------------------------------------
%----------------------------------------------------------------------------
\subsection{Initialization}
\label{sec:dd:interfaccia:init}
%----------------------------------------------------------------------------
\begin{table}
\begin{verbatim}
typedef struct bdev_parms {
__uint32_t showinfo;
char *config;
void *bmutexattr;
__uint16_t dummy;
#ifdef IDE_BLOCK_DEVICE
IDE_PARMS ideparms;
#endif
} BDEV_PARMS;
void bdev_def_showinfo (BDEV_PARMS s, int v);
void bdev_def_configstring(BDEV_PARMS s, char *v);
void bdev_def_mutexattrptr(BDEV_PARMS s,void *v);
\end{verbatim}
\caption{The BDEV\_PARMS structure}
\label{tab:BDEV_PARMS}
\end{table}
The bdev\_init() function (see Table \ref{tab:BDEV_PARMS}) must be
called to initialize a device driver, passing a pointer to a
BDEV\_PARMS structure. Use the provided macros to init that
structure.
The structure is composed by two parts: a generic part and a
specific part. The generic part is composed by the following
fields:
\begin{description}
\item [showinfo]If this flag is set, the initialization of the device will be
``verbose''. To set this flag, use the macro bdev\_def\_showinfo.
\item [config]It is a string that can have a particular meaning for the device
driver. The string is parsed by the device driver, that usually searches for
some initialization parameter (something like ``foo:[number]''). To set this
string, use the macro bdev\_def\_configstring.
\item [bmutexattr]This parameter can be used to tell the system which is the
kind of mutex to use to implement the mutual exclusion. To set this parameter,
use the macro \texttt{bdev\_def\_mutexattrptr}.
\item [dummy]Used to align the data structure (see the macro BASE\_DEV).
\end{description}
The specific part is directly specified by the device driver; the
bdev
\_init function can be called can be called passing a pointer
to a BDEV
\_PARMS structure initialized with a BASE
\_BDEV value. If
NULL is specified, the default values are used.
%----------------------------------------------------------------------------
\subsubsection{How to create a new device driver}
%----------------------------------------------------------------------------
\begin{enumerate}
\item Add a new \emph{major} number in t the file include/fs/major.h (see
\ref{sec:dd:interfaccia:nomi});
\item Add a ``\#define'' into the file include/fs/bdevconf.h (to enable
selective compilation);
\item If needed, add fields to bdev\_parms, and modify the BASE\_BDEV macro with
the new default parameters;
\item Modify the function bdev\_init to call the init function of the new device
driver (into drivers/block/bdev.c).
\end{enumerate}
\begin{figure}
\begin{center}
\includegraphics[width=7cm]{images/ddinit.eps}
\end{center}
\caption{Device Drivers Initialization.}
\label{fig:ddinit}
\end{figure}
The device driver initialization function, after the
initialization of its internal part, does the following actions
(see Figure \ref{fig:ddinit}):
\begin{enumerate}
\item It registers itself calling bdev\_register(); That function must be called
for each device handled by the driver.
\item It registers every block device that it find into its interfaces to the
Physical Disk Module.
\item It registers every logical device found calling bdev\_register\_minor().
\end{enumerate}
Data structures and registration functions used in this phased are
described in Section \ref{sec:dd:interfaccia:det}.
%----------------------------------------------------------------------------
\subsection{Device serial number}
\label{sec:dd:interfaccia:nomi}
%----------------------------------------------------------------------------
\begin{figure}
\begin{center}
\includegraphics[width=7cm]{images/devtype.eps}
\end{center}
\caption{Device Serial Number}
\label{tab:devtype}
\end{figure}
Every logical device (i.e., a logical partition of a hard disk)
has a device serial number of type \_\_dev\_t, that must have a
unique value (see the POSIX standard).
The number has an internal structure (see Figure
\ref{tab:devtype}) composed by a \emph{major} \emph{number}, and
another number (whose number depends on the device implementation)
called \emph{minor} \emph{number}.
Three macro exists (see include/fs/sysmacro.h) to handle the
device numbers:
\begin{description}
\item [makedev]This macro creates a \_\_dev\_t number from the major and the
minor number.
\item [major]Returns the major number of a device.
\item [minor]Returns the minor number of a device.
\end{description}
The major number is used to decide which device driver will handle
a request (see Section \ref{sec:dd:interfaccia:det}). A device
driver can register more than on major number. Minor numbers are
device driver dependent, and the validity of a minor number is
checked only by the driver.
To simplify the search of a device, every device has a symbolic
name (see Section \ref{sec:dd:intercaccia:ext}). Symbolic names
are composed by two parts: a unique name for each major, and a
specific name assigned to each minor. For example, the major
MAJOR\_B\_IDE has a symbolic name ``ide''; a minor name for that
major can be ``hda2'', so the full name is ``ide/hda2'', that
identifies the second partition on the master hard disk on the
first IDE controller interface (see Section \ref{sec:dd:ide}).
%----------------------------------------------------------------------------
\subsection{Implementation details}
\label{sec:dd:interfaccia:det}
%----------------------------------------------------------------------------
\begin{table}
\begin{verbatim}
int bdev_register(__dev_t major, char *name, struct block_device *);
int bdev_register_minor(__dev_t device, char *name, __uint8_t fsind);
struct phdskinfo *phdsk_register( struct phdskinfo *disk);
\end{verbatim}
\caption{Block device registration functions}
\label{tab:Block_register}
\end{table}
These are the registration functions for the device drivers (see
Table \ref{tab:Block_register}):
\begin{description}
\item [bdev\_register]This function registers a device type (see Section
\ref{sec:dd:interfaccia:nomi}); the function receives as parameter the major
number, its symbolic name, and a pointer to a structure block\_device. The
function returns 0 in case of success, != 0 otherwise.
\item [bdev\_register\_minor]It registers a minor number, its symbolic name and
the file system type (see include/fs/fsind.h). This information is used by
bdev\_finddevice\_byname (see Section \ref{sec:dd:intercaccia:ext}). The
function returns 0 in case of success, != 0 otherwise.
\item [phdsk\_register]This function is used to register every physical device
found with parameters that can be used by the Disk Scheduling Algorithms (see
Section \ref{sec:dd:sched}). This function receives a pointer to a structure
phdiskinfo that contains the device parameters, and returns NULL in case of
error, or a pointer to the registered structure (that can be != from the
parameter) in case of success.
\end{description}
\begin{table}
\begin{verbatim}
struct dskgeometry {
__uint16_t cyls;
__uint16_t heads;
__uint16_t sectors;
};
struct phdskinfo \{
char pd_name[MAXPHDSKNAME];
__dev_t pd_device;
__blkcnt_t pd_size;
int pd_sectsize;
struct dskgeometry pd_phgeom;
struct dskgeometry pd_logeom;
};
\end{verbatim}
\caption{The phdsk structure.}
\label{tab:structphdsk}
\end{table}
The phdskinfo structure contains the physical parameters of a
device. all the field must be filled before calling
phdsk\_register. The fields are:
\begin{description}
\item [pd\_name]device name, used only for debug purposes;
\item [pd\_device]device identificator;
\item [pd\_size]number of logical sectors of the device;
\item [pd\_sectsize]Dimension (bytes) of a logical sector;
\item [pd\_phgeom~e~pd\_logeom]geometry of the device (logical and phisical);
geometries are described using the struct dskgeometry (see Table
\ref{tab:structphdsk}):
\begin{description}
\item [cyls]number of cylinders (traces) of the device;
\item [heads]number of heads;
\item [sectors]number of sectors for each cylinder.
\end{description}
\end{description}
\begin{table}
\begin{verbatim}
struct block_device {
char *bd_name;
int bd_sectorsize;
__uint32_t bd_flag_used:1;
struct block_device_operations *bd_op;
};
\end{verbatim}
\caption{block\_device Structure}
\label{tab:structbdev}
\end{table}
The struct block\_device described in Table \ref{tab:structbdev}
is used to register the block devices:
\begin{description}
\item [bd\_name]Physical device name; for example, for the IDE device driver,
``ide''.
\item [bd\_sectorsize]logical sector dimension (bytes; redundant info but
useful).
\item [bd\_flag\_used]It should not modified by the device drivers. If set, the
corresponding device driver is present (a table of all the possible devices is
used to access the system in a fast way).
\item [bd\_op]Virtual device operations.
\end{description}
%----------------------------------------------------------------------------
\subsubsection{block\_device\_operation structure.}
\label{sec:dd:interfaccia:det:bdevop}
%----------------------------------------------------------------------------
\begin{table}
\begin{verbatim}
struct block_device_operations {
int (*_trylock) (__dev_t device);
int (*_tryunlock)(__dev_t device);
int (*read) (__dev_t device, __blkcnt_t blocknum, __uint8_t *buffer);
int (*seek) (__dev_t device, __blkcnt_t blocknum);
int (*write) (__dev_t device, __blkcnt_t blocknum, __uint8_t *buffer);
};
\end{verbatim}
\caption{block\_device\_operations structure.}
\label{tab:structbdevop}
\end{table}
This structure contains the virtual operation that have to be
supplied by every device driver:
\begin{description}
\item [read]This function should read a logical pblock passed as parameter,
writing the result on the buffer. It returns 0 in case of success, != 0
otherwise.
\item [write]Similar to read, but it writes.
\item [seek]moves the head on the required block.
\item [\_trylock~e~\_tryunlock]Used to block/unblock a device. They return 0 in
case of success, != 0 otherwise; after initialization, and usually, every device
is in the unblocked state.
\end{description}
Why blocking/unblocking a device? The problem, described in
Section \ref{sec:dcache}, is that the data cache have to refer to
a logical block in an unique way. Since there are logical devices
that shares blocks (think at /dev/hda and /dev/hda2 in Linux), the
mount operation must be done in mutual exclusion, blocking the
device.
%----------------------------------------------------------------------------
\subsubsection{The ``lodsk'' module}
%----------------------------------------------------------------------------
\begin{table}
\begin{verbatim}
struct lodskinfo {
__uint8_t fs_ind;
__blkcnt_t start;
__blkcnt_t size;
};
typedef int (*lodsk_callback_func)(int, struct lodskinfo*, void *);
int lodsk_scan(__dev_t device, lodsk_callback_func func, void *data, int
showinfo, char *lname);
\end{verbatim}
\caption{The lodsk module}
\label{tab:lodsk}
\end{table}
This module (see Table \ref{tab:lodsk}) offer a service to the
device drivers: It recognizes the logical structure (partitions)
of an HD as described into \cite{manual:DOS330} (part of the code
is derived from Linux).
before calling lodsk\_scan, the device driver must be already
registered; the function has five parameters:
\begin{enumerate}
\item The device ID where the search have to be done.
\item A callback function called for each partition found.
\item A generic pointer passed to the callback function.
\item A verbose flag.
\item The name of the logical device (i.e., ``hdb'', used if the verbose option
has been set.
\end{enumerate}
Each callback function is called passing as arguments:
\begin{enumerate}
\item The logical number of the partition (the Linux numbering scheme is used).
\item A structure with informations on the partition found. The structure has
the following fields:
\begin{description}
\item [fs\_ind]The file system number (see include/fs/fsind.h).
\item [start]The partition starting block.
\item [size]The partition size (number of blocks).
\end{description}
\item A the generic pointer passed to lodsk\_scan.
\end{enumerate}
The low
\_level routines always check these values to avoid
read/write operations outside the partition.
%----------------------------------------------------------------------------
\subsubsection{Semaphores}
%----------------------------------------------------------------------------
The files include/fs/mutex.h and include/fs/semaph.h contain the
mutex/semaphore interface used by the file system.
\begin{table}
\begin{verbatim}
typedef internal_sem_t __sem_t;
#define __sem_init (ptr,val)
#define __sem_wait (ptr)
#define __sem_trywait (ptr)
#define __sem_signal (ptr)
\end{verbatim}
\caption{The semaphore interface}
\label{tab:semaph}
\end{table}
In particular, the semaphore interface (see Table
\ref{tab:semaph}) is simply a wrapper to the S.Ha.R.K. Internal
Semaphores.
\begin{table}
\begin{verbatim}
typedef internal_sem_t __mutex_t;
#define __mutex_init (ptr,attr)
#define __mutex_lock (ptr)
#define __mutex_trylock (ptr)
#define __mutex_unlock (ptr)
typedef SYS_FLAGS __fastmutex_t;
#define __fastmutex_init (ptr)
#define __fastmutex_lock (ptr)
#define __fastmutex_unlock (ptr)
\end{verbatim}
\caption{The mutex interface}
\label{tab:mutex}
\end{table}
The mutex interface requires two types of mutexes:
\begin{enumerate}
\item Fast mutexes: when only a few lines have to be protected (remapped to
kern\_fsave/kern\_frestore).
\item Normal mutexes: without any restriction (remapped again to the internal
semaphores).
\end{enumerate}
Please note that these functions are not directly used in the
code; instead, functions like
\_\_b
\_mutex
\_lock and
\_\_fs
\_mutex
\_lock are used. This is done to make possible the
disabling of the mutual exclusion requirement when not needed
(i.e., if the file system runs as a process with its own address
space).
%----------------------------------------------------------------------------
\subsection{Exported interface}
\label{sec:dd:intercaccia:ext}
%----------------------------------------------------------------------------
\begin{table}
\begin{verbatim}
int bdev_read (__dev_t dev, __blkcnt_t blocknum, __uint8_t *buffer);
int bdev_write (__dev_t dev, __blkcnt_t blocknum, __uint8_t *buffer);
int bdev_seek (__dev_t dev, __blkcnt_t blocknum);
int bdev_trylock(__dev_t device); int bdev_tryunlock(__dev_t device);
__uint8_t bdev_getdeffs(__dev_t device);
__dev_t bdev_find_byname(char *name);
__dev_t bdev_find_byfs(__uint8_t fsind);
int bdev_findname(__dev_t device, char **majorname, char **minorname);
int bdev_scan_devices(int(*callback) (__dev_t,__uint8_t));
\end{verbatim}
\caption{Low level functions}
\label{tab:bdev}
\end{table}
The low level exports two kinds of functions: one to use the
devices, one to search/get the logical devices present on a
system.
The three main functions are:
\begin{description}
\item [bdev\_find\_byname]This function accepts a symbolic device name, as
described in Section \ref{sec:dd:interfaccia:nomi}, and returns 0 if the device
is not registered, or its number otherwise.
\item [bdev\_find\_byfs]This function accepts a file system ID (see
include/fs/fsind.h), and returns the first device that contains that file
system, or 0 if no one provides it.
\item [bdev\_scan\_device]This function can be used to get a list of all the
devices in the system; the user have to pass a callback that is called for each
device in the system. The function returns 0 in case of success, -1 if an error
occurred, or a value != 0 if the callback returns a value != 0. The callback is
called with a device serial number and a file system type that should be present
in the device.
\end{description}
These function are useful when at system startup the root file
system have to be mounted (see Section \ref{sec:fs:mount}).
The functions bdev\_getdeffs and bdev\textbackslash{}\_findname
are used for these purposes:
\begin{description}
\item [bdev\_getdeffs]to know a probable file system present on a logical
device.
\item [bdev\_findname]to know the logical name of a device. The function returns
!= 0 in case of error, and 0 in case of success. In the latter case, the two
buffers passed as parameter are filled with the major and minor names.
\end{description}
These functions simply check the parameter values and then they
calls the device drivers functions (see Section
\ref{sec:dd:interfaccia:det:bdevop}):
\begin{description}
\item [bdev\_read]Reads a block on the device.
\item [bdev\_write]Write a block on the device.
\item [bdev\_seek]Move the head in the position passed as parameter.
\item [bdev\_trylock~e~bdev\_tryunlock]Tries to block/unblock a device.
\end{description}
These functions can block the calling thread for an unspecified
amount of time. All the device drivers were thought to be used in
a multithreaded systems.
If a device driver is not multithread, proper mutual exclusion
must be ensured, because the file system high level functions will
call a low level service before the previous request has been
served. Note that if such a mutual exclusion is used, the access
to the device driver will be forced by the mutex policy, and so
all the disk scheduling algorithms will not work properly.
%----------------------------------------------------------------------------
\section{I/O requests scheduling}
\label{sec:dd:sched}
%----------------------------------------------------------------------------
Massy's thesis here inserts a dissertation on
the various disk scheduling algorithms that has been removed.
%----------------------------------------------------------------------------
\section{Disk scheduling algorithms implemented in S.Ha.R.K.}
\label{sec:dd:schedpres}
%----------------------------------------------------------------------------
The system currently support 5 different scheduling algorithms
that can be changed/increased simply modifying system
configuration.
\begin{table}
\begin{verbatim}
int bqueue_init (bqueue_t *);
int bqueue_numelements (bqueue_t *);
int bqueue_removerequest (bqueue_t *);
int bqueue_insertrequest (bqueue_t *, struct request_prologue *);
struct request_prologue *bqueue_getrequest(bqueue_t *);
\end{verbatim}
\caption{Queue handling functions}
\label{tab:bqueuefunc}
\end{table}
Only one algorithm can be active at a time. All the device drivers
should use the queuing interface listed in the bqueue.h file (see
Table \ref{tab:bqueuefunc}). The structure bqueue\_t is specific
for each scheduling algorithm. The semantic of each function in
the interface is the following:
\begin{description}
\item [bqueue\_init]Initialize the pending request queue; it returns 0 in case
of success, != 0 otherwise.
\item [bqueue\_numelements]Returns the number of elements in the queue.
\item [bqueue\_insertrequest]This function inserts a request in the queue,
following the specific algorithm behavior. The function returns 0 in case of
success, != 0 otherwise. Please note that requests are passed through a pointer
to the structrequest\_prologue\_t (see tab:structreqprolog). That is, every
algorithm should use a structure with a struct request\_prologue as first field,
passing a casted pointer to that structure.
\item [bqueue\_getrequest]Returns the first request in queue (without removing
it), and returns a pointer to that request, or NULL if there are not any pending
requests.
\item [bqueue\_removerequest]Removes the first request in queue, that is the
request that would have been returned by bqueue\_getrequest; it returns 0 in
case of success, != 0 in case of error.
\end{description}
\begin{table}
\begin{verbatim}
#define REQ_DUMMY 0
#define REQ_SEEK 1
#define REQ_READ 2
#define REQ_WRITE 3
struct request_prologue {
unsigned sector;
unsigned head;
unsigned cylinder;
unsigned nsectors;
unsigned operation;
struct request_specific x;
};
#define request_prologue_t struct request_prologue
\end{verbatim}
\caption{struct request\_prologue\_t}
\label{tab:structreqprolog}
\end{table}
before calling bdev\_insertrequest, all the fields of the struct
request\_prologue\_t must be filled (see Table
\ref{tab:structreqprolog}):
\begin{description}
\item [sector,~head,~cylinder]This is the starting position in the hard disk.
\item [nsectors]This is the number of sectors to transfer.
\item [operation]The operation that must be performed:
\begin{description}
\item [REQ\_DUMMY]request not specified.
\item [REQ\_SEEK]head seek to the specified position.
\item [REQ\_READ]read request.
\item [REQ\_WRITE]write request.
\end{description}
\end{description}
To add algorithm specific fields, you can use the method used for
the deadlines into Section \ref{sec:dd:sched:edf}.
struct request\_specific (see Table \ref{tab:structreqprolog})
must be defined by each algorithm, and contains algorithm-specific
fields.
struct bqueue\_t contains a scheduling queue. Its first field must
be a pointer to a struct phdiskinfo (see Section
\ref{sec:dd:interfaccia:det}, and Table \ref{tab:structphdsk}).
Here is a description about the implementation of various disk
scheduling algorithms.
%----------------------------------------------------------------------------
\subsection{FCFS}
\label{sec:dd:sched:fcfs}
%----------------------------------------------------------------------------
\begin{table}
\begin{verbatim}
typedef struct TAGfcfs_queue_t {
struct phdskinfo *disk;
__b_fastmutex_t mutex;
struct request_prologue *head;
struct request_prologue *tail;
int counter;
} fcfs_queue_t;
struct request_specific {
void *next;
};
\end{verbatim}
\caption{struct fcfs\_queue\_t}
\label{tab:structfcfs}
\end{table}
The FCFS policy is implemented using the structure in table
\ref{tab:structfcfs}:
\begin{description}
\item [mutex]A mutual exclusion semaphore.
\item [head~e~tail]a pointer to head and tail of a request queue.
\item [counter]number of pending requests.
\end{description}
Then, the struct request
\_specific contains a pointer that is used
to link the list.
%----------------------------------------------------------------------------
\subsection{SSTF}
\label{sec:dd:sched:sstf}
%----------------------------------------------------------------------------
\begin{table}
\begin{verbatim}
typedef struct TAGsstf_queue_t {
struct phdskinfo *disk;
__b_fastmutex_t mutex;
struct request_prologue *actual;
struct request_prologue *lqueue;
struct request_prologue *hqueue;
int counter;
} sstf_queue_t;
struct request_specific {
void *next;
};
\end{verbatim}
\caption{struct sstf\_queue\_t}
\label{tab:structsstf}
\end{table}
This scheduling policy is implemented using the data structures in
Figure \ref{tab:structsstf}:
\begin{description}
\item [mutex]for mutual exclusion.
\item [actual]the current request (that is returned by bqueue\_getrequest()).
\item [lqueue~e~hqueue]two lists ordered in decreasing and increasing order:
when a new request arrives, the request is inserted on one of the two queues
depending on the value of the current request cylinder. The current request will
become the nearest to the current cylinder. Note: This algorithm can cause
starvation.
\item [counter]The request counter.
\end{description}
Then, the struct request
\_specific contains a pointer that is used
to link the list.
%----------------------------------------------------------------------------
\subsection{LOOK}
\label{sec:dd:sched:look}
%----------------------------------------------------------------------------
\begin{table}
\begin{verbatim}
typedef struct TAGlook_queue_t {
struct phdskinfo *disk;
__b_fastmutex_t mutex;
int dir;
struct request_prologue *queue[2];
int counter;
} look_queue_t;
struct request_specific {
void *next;
};
\end{verbatim}
\caption{struct look\_queue\_t}
\label{tab:structlook}
\end{table}
This scheduling policy is implemented using the data structures in
Figure \ref{tab:structlook}:
\begin{description}
\item [mutex]for mutual exclusion.
\item [dir]Is the looking direction: 0 ascending, 1 descending.
\item [queue]two ordered lists (one ascending, one descending). queue[dir] is
the queue used. When the list becomes empty, dir=!dir. a new request is inserted
in the current queue if its position is after/before the actual position
(otherwise, it is inserted in the other list).
\item [counter]the request counter.
\end{description}
This algorithm can be implemented in two variants:
\begin{enumerate}
\item Requests with position = current position are inserted in the current
queue (starvation can happen).
\item Requests with position = current position are inserted in the queue not
currently used. This is the default choice; to modify it, you have to modify the
look.c file.
\end{enumerate}
%----------------------------------------------------------------------------
\subsection{CLOOK}
\label{sec:dd:sched:clook}
%----------------------------------------------------------------------------
\begin{table}
\begin{verbatim}
typedef struct TAGclook_queue_t {
struct phdskinfo *disk;
__b_fastmutex_t mutex;
int act;
struct request_prologue *queue[2];
int counter;
} clook_queue_t;
struct request_specific {
void *next;
};
\end{verbatim}
\caption{struct clook\_queue\_t}
\label{tab:structclook}
\end{table}
This scheduling policy is implemented using the data structures in
Figure \ref{tab:structclook}:
\begin{description}
\item [mutex]for mutual exclusion.
\item [dir]This is the current queue.
\item [queue]Two ascending lists. queue[dir] is used until it is emptied. Then,
dir is changed. A new request is inserted in the current queue if the current
position is greater than the current one.
\item [counter]the request counter.
\end{description}
This implementation can introduce starvation because insertions at
the current position are done on the same queue.
%----------------------------------------------------------------------------
\subsection{EDF}
\label{sec:dd:sched:edf}
%----------------------------------------------------------------------------
\begin{table}
\begin{verbatim}
typedef struct TAGbd_edf_queue_t {
struct phdskinfo *disk;
__b_fastmutex_t mutex;
struct request_prologue *queue;
int inservice;
int counter;
} bd_edf_queue_t;
struct request_specific {
void *next;
long dl;
};
\end{verbatim}
\caption{struct look\_queue\_t}
\label{tab:structedf}
\end{table}
This is the only real-time disk scheduling algorithm implemented
in S.Ha.R.K. This scheduling policy is implemented using the data
structures in Figure \ref{tab:structedf}:
\begin{description}
\item [mutex]For mutual exclusion.
\item [queue]An ordered list, increasing with the deadline.
\item [inservice]This flag is set when a request is in service.
\item [counter]The request counter.
\end{description}
\begin{table}
\begin{verbatim}
typedef struct {
RES_MODEL r;
TIME dl;
} BDEDF_RES_MODEL;
#define BDEDF_res_default_model(res)
#define BDEDF_res_def_level(res,l)
#define BDEDF_res_def_dl(res,dl)
\end{verbatim}
\caption{Resource Module for deadlines}
\label{tab:edfresmodel}
\end{table}
The request's deadline is got during task creation using the
S.Ha.R.K. resource module shown in Figure \ref{tab:edfresmodel}.
If the resource module is not used, the deadline is set to
infinity. That is, in that case the scheduling algorithm become a
FCFS.
%----------------------------------------------------------------------------
\section{Device driver: the IDE interface}
\label{sec:dd:ide}
%----------------------------------------------------------------------------
%----------------------------------------------------------------------------
\subsection{Description}
\label{sec:dd:ide:desc}
%----------------------------------------------------------------------------
A description of the IDE standard can be found in
\cite{manual:ideintel} and
\cite{manual:idevia}. This interface allows the connection of up to 2 IDE
peripherals using the ATA or ATAPI protocol (as described into
\cite{manual:ata4}). This driver implements the ATA-4 protocol \footnote{The
original Massy's thesis have a small description of the standard, that I have
removed.}.
%----------------------------------------------------------------------------
\subsection{Implementation}
\label{sec:dd:ide:code}
%----------------------------------------------------------------------------
\begin{figure}
\begin{center}
\includegraphics[width=7cm]{images/ide.eps}
\end{center}
\caption{The IDE subsystem}
\label{fig:ide}
\end{figure}
The architecture of the IDE subsystem is shown in Figure
\ref{fig:ide}: there is a part of the interface responsible for
the system initialization (for example, for finding the IDE
interfaces in the system, for finding the devices and their
logical structure, for transforming the low level requests in
ATA-4 commands), for command handling and another part for
interfacing with the disk scheduling queue handling. Finally,
there is a small part OS-dependent that should be rewritten to
port it to other OSes.
If possible, the driver uses the DMA; in any case, polling can be
forced. DMA is only supported in bus-master mode
\cite{manual:ideintel}.
The system actually use only some commands of the classes ``PIO
data in'', ``PIO data out'', ``Non-data'' e ``DMA''; it uses
the ``read'', ``write'', ``seek'', ``dma read'', ``dma
write'', ``identify'', ``pidentify'' and ``set feature''
commands; packet commands are not currently supported, so ATAPI
devices (like cd-roms) are currently not supported.
%----------------------------------------------------------------------------
\subsubsection{Initialization}
%----------------------------------------------------------------------------
The system is initialized calling the ide
\_init() function into
bdev
\_init(). This is automatically done if the IDE support is
compiled. The user should only initialize the struct bdev
\_parms,
as specified in Section
\ref{sec:dd:interfaccia}.
\begin{table}
\begin{verbatim}
typedef struct ide_parms {
void *parm_initserver;
} IDE_PARMS;
struct ide_server_model {
TIME dl;
TIME wcet;
TIME mit;
};
\end{verbatim}
\caption{Data structures used to initialize the IDE driver.}
\label{tab:ideinit}
\end{table}
The IDE\_PARMS structure has only one parameter, and is included
into the ideparms field in the global init data structure:
\begin{description}
\item [parm\_initserver]This parameter is OS-specific and is
related to the glue code.
\end{description}
\begin{table}
\begin{verbatim}
void ide_glue_send_request (int ideif);
int ide_glue_activate_interface (int ideif);
void ide_glue_unactivate_interface (int ideif);
\end{verbatim}
\caption{The glue code for the IDE driver.}
\label{tab:ideglue}
\end{table}
The system dependent part of the IDE driver is contained into the
file ideglue.c (see Figure \ref{tab:ideglue}):
\begin{description}
\item [ide\_glue\_activate\_interface]It enables the interface passed as
parameter: It creates an aperiodic task, and assigns to it the interface
interrupt. The task is activated when the interrupt arrives.
\item [ide\_glue\_activate\_interface]It disables the interface. The aperiodic
task is killed.
\item [ide\_glue\_send\_request]It executes the first available request: it
explicitly activates the task interface; when the task is activated because of
an interrupt, the pending request is terminated, and another request (if
present) is activated.
\end{description}
the parm\_initserver field contains a pointer to the struct
ide\_server\_model, used to create the aperiodic tasks (see Figure
\ref{tab:ideinit}):
\begin{description}
\item [dl]task deadline in $\mu$sec.
\item [wcet]task worst case execution time in $\mu$sec.
\item [mit]the mean inter arrival time.
\end{description}
If these parameters are not specified, standard parameters are
used.
These defines are used to conditionally compile the driver (see
the source code):
\begin{description}
\item [FORCE\_SCANNING]this define force the search of the IDE peripherals also
if the bus reset fails. This is useful if some peripherals does not follow the
standard timings.
\item [IDE\_OLDATAPIDELAY]this define forces a little delay that can be useful
when using old ATAPI interfaces.
\item [IDE\_FORCERESET]if defined a soft reset operation cannot be aborted
because of a timeout. It must be specified if the peripheral does not follows
the soft reset timings or if it cannot go in stand-by mode.
\item [IDE\_SKIPSTATUSTEST]with this symbol we do not check if the command is
being executed correctly (that is done reading on a status register); all the
TX-based motherboard we found needs this symbol.
\item [IDE\_SKIPALLTEST]when set, all the command are written into the registers
without looking on the status register. This symbol implies the previous. This
mode was needed on a TX motherboard where the polled mode for commands was
unreliable.
\end{description}
%----------------------------------------------------------------------------
\subsubsection{Policy queues Interface}
%----------------------------------------------------------------------------
\begin{table}
\begin{verbatim}
typedef struct TAGidereq_t {
request_prologue_t info;
int next;
__uint8_t resetonerror;
__uint8_t cmd;
__b_sem_t wait;
int result;
__uint8_t features;
__uint8_t cyllow;
__uint8_t cylhig;
__uint8_t seccou;
__uint8_t secnum;
__uint8_t devhead;
__uint8_t *buffer;
} idereq_t;
\end{verbatim}
\caption{struct idereq\_t}
\label{tab:idereq}
\end{table}
Table \ref{tab:idereq} shows struct idereq\_t that is used by the
driver to store the pending requests. It can be noted that the
info field is of type request\_prologue\_t, as required by the
modules that handle the scheduling policies (see
\ref{sec:dd:schedpres}).
Note that (since the IDE interface allows only a command at a
time) there are two queues for each IDE interface (one for the
master device, one for the slave). These request queues are
prioritized. That is, the master queue always have priority on the
slave queue.
Here a small description of the fields showed in Table
\ref{tab:idereq}:
\begin{description}
\item [info]To handle the requests using a module for disk scheduling.
\item [next]used internally to handle a request pool.
\item [resetonerror]If set, an error during a command implies a soft reset.
\item [cmd]The command for the peripheral (see \cite{manual:ata4}).
\item [wait]Synchronization semaphore: when a task sends a command, it waits the
end of the execution of the command.
\item [result]This is the result: 0 in case of success, != 0 in case of error
(see idelow.c for the error codes).
\item [features,~cyllow,~cylhigh,~seccou,~secnum~e~devhead]Values to be inserted
into the I/O registers of the IDE interface. The behavior is different depending
on the peripheral mode (LBA or C/H/S) (see Section \ref{sec:dd:ide:desc}).
\item [buffer]read/write buffer (if needed).
\end{description}
%----------------------------------------------------------------------------
\subsubsection{Minor numbers}
%----------------------------------------------------------------------------
\begin{figure}
\begin{center}
\includegraphics[width=10cm]{images/ideminor.eps}
\end{center}
\caption{IDE minor numbers}
\label{fig:ideminor}
\end{figure}
This section describes how minor numbers are handled.
As described in Section \ref{sec:dd:intercaccia:ext}, the generic
low level part uses the major number to pass a request to a
device. The specific device is then selected using the minor
number.
Figure \ref{fig:ideminor} shows the minor number structure of this
device:
\begin{itemize}
\item The first 3 bits identifies the interface (8 interfaces maximum).
\item A bit for the peripheral (master/slave). This is used to enqueue request
in the right queue.
\item The last 4 bits says which partition should be used. This info is used to
convert relative sector numbers into absolute sector numbers.
\end{itemize}
Minor numbers are accessed only via macros. To modify the minor
number mapping you should only modify these macros.
%----------------------------------------------------------------------------
\chapter{File system}
\label{sec:fs}
%----------------------------------------------------------------------------
With the word ``file system'' we mean the internal organization
of a block device; the purpose of such an organization is to store
a set of files, usually divided in directories.
A block device is seen by the file system as an ordered sequence
of fixed sized blocks. The basic operations of a file system are
read/write operations on the n-th block.
%----------------------------------------------------------------------------
\section{Interface}
\label{sec:fs:int}
%----------------------------------------------------------------------------
This section describes the interface between the system and the
user tasks. It also describes the interface between a specific
instance of file system and the system (just to make easy the
expansion of the system with new real-time file systems).
%----------------------------------------------------------------------------
\subsection{Initialization}
\label{sec:fs:int:init}
%----------------------------------------------------------------------------
To initialize the higher level of the system we use an interface
similar to the device drivers interface (see Section
\ref{sec:dd:interfaccia:init}.
\begin{table}
\begin{verbatim}
typedef struct filesystem_parms {
__dev_t device;
__uint8_t fs_ind;
__uint32_t fs_showinfo:1;
void *fs_mutexattr;
struct mount_opts *fs_opts;
} FILESYSTEM_PARMS;
#define filesystem_def_rootdevice(par,rootdev)
#define filesystem_def_fs(par,fs)
#define filesystem_def_showinfo(par,show)
#define filesystem_def_options(par,opts)
int filesystem_init(FILESYSTEM_PARMS *);
\end{verbatim}
\caption{struct filesystem\_parms}
\label{tab:fsparms}
\end{table}
The function filesystem\_init() is called to initialize the higher
layer passing the parameters showed in Figure \ref{tab:fsparms}.
These parameters must be initialized using the following macros:
\begin{description}
\item [filesystem\_def\_rootdevice]Set the partition that must be used as
\emph{root device}, that is, the directory associated to ``/''. This field is
mandatory, and that implies that the higher layer must be initialized
\emph{after} the device drivers.
\item [filesystem\_def\_fs]Selects the filesystem that must be used with the
device. It must be one of the symbols defined into include/fs/fsind.h; if the
special symbol FS\_DEFAULT is used, the system tries to autodetect the
filesystem type.
\item [filesystem\_def\_showinfo]If this flag is set, then initialization
messages are printed on the screen.
\item [filesystem\_def\_options]There are the mount options.
\end{description}
The field fs\_mutexattr has the same meaning of the correspondent
field into bdev\_parms (see Section
\ref{sec:dd:interfaccia:init}).
You can use the macro BASE\_FILESYSTEM to initialize a struct
filesystem\_parms with some default values. NULL is not allowed as
a parameter of filesystem\_init.
\begin{table}
\begin{verbatim}
struct mount_opts {
__uint32_t flags;
union {
struct msdosfs_parms msdos;
} x;
};
/* Flags */
#define MOUNT_FLAG_RW 0x01
\end{verbatim}
\caption{struct mount\_opts}
\label{tab:mountopt}
\end{table}
\label{sec:mountopt}The mount options for every device must be
inserted into a struct mount\_opts showed in Table
\ref{tab:mountopt}. The structure is composed by two parts: a
general part and a filesystem-specific part. The general part
actually contains the following fields:
\begin{description}
\item [flags]This field contains the mount fields. The only flag
currently defined is MOUNT\_FLAG\_RW, that allows write operations
on the file system (default is read-only).
\end{description}
\begin{figure}
\begin{center}
\includegraphics[width=7cm]{images/fsinit.eps}
\end{center}
\caption{Filesystem init}
\label{fig:fsinit}
\end{figure}
Filesystem initialization proceeds as shown in Figure \ref{fig:fsinit}:
\begin{enumerate}
\item The user calls filesystem\_init();
\item filesystem\_init() initializes itself and its modules, then all the file
systems are initialized. filesystem\_init() have to be changed if a new
filesystem is introduced.
\item The file systems initialization functions have to register themselves
calling the function filesystem\_register() with a struct file\_system\_type.
\end{enumerate}
%----------------------------------------------------------------------------
\subsubsection{Struct file\_system\_type}
\label{sec:filesystemtype}
%----------------------------------------------------------------------------
\begin{table}
\begin{verbatim}
struct file_system_type {
const char *name;
int fs_flags;
__uint8_t fs_ind;
struct super_block *(*read_super) (
__dev_t device,
__uint8_t fs_ind,
struct mount_opts *options
);
};
\end{verbatim}
\caption{struct file\_system\_type}
\label{src:structfs}
\end{table}
This structure (see Table \ref{src:structfs})stores the
informations needed for the filesystem registration. Here a short
description of its fields:
\begin{description}
\item [name]Pointer to the filesystem name (e.g., ``FAT16'').
\item [fs\_flags]Filesystem flags (currently not used).
\item [fs\_ind]Filesystem ID (every filesystem has a different ID).
\item [read\_super]This the filesystem initialization function. It is called
when a partition is mounted to initialize the correspondent struct super\_block.
\end{description}
%----------------------------------------------------------------------------
\subsection{Internal interface}
%----------------------------------------------------------------------------
This section describes the interface between the high layer (that
handles the file systems), and the modules that implements a
specific filesystem.
When a user mount a device into a directory using a filesystem,
the system uses the file system's specific struct
file
\_system
\_type, calling the function read
\_super() to read the
superblock of the specified device.
The struct super
\_block contains a pointer to a structure that
contains a set of ``virtual operations'' (see Section
\ref{sec:fs:intstruct}), that implements the behavior of a
specific filesystem.
%----------------------------------------------------------------------------
\subsubsection{struct super\_operations}
\label{sec:superop}
%----------------------------------------------------------------------------
\begin{table}
\begin{verbatim}
struct super_operations {
int (*init_inode) (struct inode *);
int (*read_inode) (struct inode *);
int (*write_inode) (struct inode *);
int (*put_super) (struct super_block *);
int (*delete_inode) (struct inode *);
};
\end{verbatim}
\caption{struct super\_operations}
\label{src:structsuperop}
\end{table}
The structure in Table \ref{src:structsuperop} is used for super
block manipulation. All the functions should return 0 in case of
success, or a value != 0 otherwise.
\begin{description}
\item [init\_inode]This function initializes a struct inode passed as parameter.
That structure must have the fields i\_sp and i\_st.st\_ino already initialized
(see Section \ref{sec:inode}).
\item [read\_inode]This function reads the inode passed as parameter from the
filesystem. The inode structure must have the fields i\_sp and i\_st.st\_ino
already initialized.
\item [write\_inode]This function writes a (valid, full initialized) inode into
the filesystem.
\item [delete\_inode]This function removes the inode from the filesystem (the
inode must already exist on the filesystem).
\item [put\_super]This function writes the super block passed as parameter into
the filesystem. This function is called when a filesystem is unmounted. A dual
function (that is, a function that reads a super\_block from a filesystem)
(which is only used when mounting the filesystem) can be found into the struct
file\_system\_type (see Section
\ref{sec:filesystemtype}).
\end{description}
%----------------------------------------------------------------------------
\subsubsection{Struct dentry\_operations}
%----------------------------------------------------------------------------
\begin{table}
\begin{verbatim}
struct dentry_operations {
int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
};
\end{verbatim}
\caption{struct dentry\_operations}
\label{src:structdentryop}
\end{table}
\label{sec:dentryop} This structure contains all the functions
used to manipulate a struct dentry.
\begin{description}
\item [d\_compare]This function compares the two strings passed as
parameter. it returns 0 if the two strings are equal, or != 0
otherwise. The filesystem do not use directly the strcmp()
function, because filename comparison depends on the filesystem
(e.g., case sensitive, case insensitive).
\end{description}
%----------------------------------------------------------------------------
\subsubsection{struct inode\_operations}
\label{sec:inodeop}
%----------------------------------------------------------------------------
\begin{table}
\begin{verbatim}
struct inode_operations {
struct file_operations *default_file_ops;
struct inode* (*create) (struct inode *, struct dentry *);
struct inode* (*lookup) (struct inode *, struct dentry *);
int (*unlink) (struct dentry *);
void (*destroy) (struct inode *);
int (*truncate) (struct inode *, __off_t len);
};
\end{verbatim}
\caption{struct inode\_operations}
\label{src:structinodeop}
\end{table}
This struct contains the virtual operations on the struct inode.
These functions returns 0 in case of success, !=0 in case of
error, that will be interpreted as a negate errno() value (e.g.,
-EPERM instead of EPERM).
\begin{description}
\item [default\_file\_ops]these are the file\_operations described into the
paragraph \ref{sec:fileop}, that must be used with the inode. By default, the
field is initialized with the file\_operation of its filesystem. These pointers
can be modified to implement a kind of virtual filesystem that works on a
device. For example, we can define an inode ``/dev/ide/hda1'' in a way that all
read/write operations on it are not mapped in the operations of the filesystem
where it is stored.
\item [create]This function creates a new inode (a new file) with the name
passed as parameter into the struct inode passed as parameter (that must be an
inode of type ``directory''). It returns the new inode or NULL in case of
error.
\item [lookup]This function searches an inode passed as parameter into another
inode of type directory passed as parameter. It returns the inode just found or
NULL in case of error.
\item [unlink]This function unlinks the filename passed as parameter from the
parameter of the corresponding inode. The inode will be removed (by generic
filesystem routines) if it remains without links associated to it.
\item [destroy]This function destroys the file linked to the inode. It frees the
space occupied by the file. Often, it is sufficient to call the truncate
function with length 0.
\item [truncate]This function truncates the file length of the inode passed as
parameter at the length specified in the second parameter. If the given length
is greater than the actual file length, the function does nothing.
\end{description}
%----------------------------------------------------------------------------
\subsubsection{struct file\_operations}
\label{sec:fileop}
%----------------------------------------------------------------------------
\begin{table}
\begin{verbatim}
struct file_operations {
__off_t (*lseek) (struct file *, __off_t, int);
__ssize_t (*read) (struct file *, char *, __ssize_t);
__ssize_t (*write) (struct file *, char *, __ssize_t);
int (*readdir) (struct file *, void *);
int (*open) (struct inode *, struct file *);
int (*close) (struct inode *, struct file *);
};
\end{verbatim}
\caption{struct file\_operations}
\label{src:structfileop}
\end{table}
This structure contains all the function that must work with
files. If not otherwise stated, the functions can operate both on
files and on directory.
\begin{description}
\item [lseek]Moves the input/output position of the file specified as parameter.
The new position is specified as in the lseek primitive (described in
\cite{book:posix1003.1} or in all the Unix programming books). That is, the
position is specified with a delta with respect to a given position (start of
file, end of file, current position); As described in section \ref{sec:file}
during the description of the field f\_pos, the new position can be greater than
the end-of-file. This function must return a position inside the file or an
error code.
\item [read]Reads some data and puts it into the buffer. The function is called
only for regular files. It returns the number of bytes read, that can be less
than requested, or an error code.
\item [write]Writes some bytes from the specified buffer. It is called only for
regular files. It returns the number of bytes written, that can be less than
requested, or an error code.
\item [readdir]Reads the next filename of the directory parameter passed through
a struct file. It inserts it into the struct dirent passed as parameter. Returns
0 in case of success, a negative number in case of error.
\item [open]Opens the file linked to the inode. Fills some internal fields in
the filesystem-specific struct file. Returns 0 in case of success, an error
otherwise.
\item [close]Closes the file passed as parameter. returns 0 in case of success,
an error in case of failure.
\end{description}
Most of the function described in this section works on buffers
passed by the user. the general routines checks that these buffer
are corrects. In case the operating system implement some memory
protection mechanisms, the functions copyfromuser() and
copytouser() should be used (their syntax is similar to memcpy()).
Note that Shark currently do not provide memory protection, so
memcpy() can be used.
%----------------------------------------------------------------------------
\subsection{External interface}
%----------------------------------------------------------------------------
%----------------------------------------------------------------------------
\subsubsection{Non standard functions}
\label{sec:fs:mount}
%----------------------------------------------------------------------------
\begin{table}
\begin{verbatim}
int mount (dev_t device, u_int8_t fsind, char *where, struct mount_opts
*options);
int umount (dev_t device);
dev_t fdevice(char *device_name);
\end{verbatim}
\caption{mount/umount functions}
\label{tab:mount}
\end{table}
The mount/umount functions are the non-standard functions provided
by the filesystem (their prototype can be found in
include/sys/mount.h). Here is a small description of these
functions (see Table \ref{tab:mount}):
\begin{description}
\item [fdevice]This function returns the device serial number of a
given filename. See Section \ref{sec:dd:interfaccia:nomi}.
\item [mount]Is used to mount a partition. These are the informations that must
be passed to it:
\begin{itemize}
\item The device serial number that must be mounted
\item The filesystem identifier (fsind) that should be used (see
include/fs/fsind.h).
\item The directory where the device will be mounted.
\item Other parameters (see Section \ref{sec:mountopt}).
\end{itemize}
The function returns 0 in case of success, != 0 in case of failure
(errno is modified).
\item [umount]This function unmount a device that was previously
mounted. This function cannot unmount the ``root'' device.
During shutdown, all the filesystem will be unmounted in the
proper order.
\end{description}
%----------------------------------------------------------------------------
\subsection{Initialization: an example}
%----------------------------------------------------------------------------
This is an example of the filesystem initialization in S.Ha.R.K.
To initialize the system, we must register the kernel modules
first:
\begin{verbatim}
TIME __kernel_register_levels__(void *arg)} {
struct multiboot_info *mb=(struct multiboot_info*)arg;
EDF_register_level(EDF_ENABLE_ALL);
RR_register_level(RRTICK, RR_MAIN_YES, mb);
CBS_register_level(CBS_ENABLE_ALL, 0);
dummy_register_level();
SEM_register_module();
CABS_register_module();
NOPM_register_module();
return TICK;
}
\end{verbatim}
The filesystem always require the registration of a module that
can accept soft aperiodic tasks (such as the CBS, for example).
Moreover, a mutex protocol must be present (NOPM, in this case).
\begin{verbatim}
TASK __init__(void *arg) {
extern int __bdev_sub_init(void);
extern int __fs_sub_init(void);
struct multiboot_info *mb = (struct multiboot_info*)arg;
/* block devices initialization */
__bdev_sub_init();
/* filesystem initialization */
__fs_sub_init();
__call_main__(mb);
return (void *)0;
}
\end{verbatim}
This is a typical S.Ha.R.K. \_\_init\_\_ function. In this
function, block devices and file systems should be initialized.
\begin{verbatim}
int __bdev_subinit(void) {
extern __dev_t root_device;
BDEV_PARMS bdev=BASE_BDEV;
/* low level initialization */
bdev_def_showinfo(bdev,TRUE);
bdev_init(&bdev);
/* root device specification */
root_device = bdev_scan_devices(choose_root_callback);
if (root_device < 0) {
/* in case of error... */
exit(1);
}
return 0;
}
\end{verbatim}
This function initializes the block device layer. This function
sets the parameters of the bdev variable, calls the low level
initialization functions and searches for the \emph{root device};
basically, the function must specify in some way which is the root
device. In the function, the low level layer lists all the
available devices, calling the function choose\_root\_callback()
for each of them. That function will choose the root device.
\begin{verbatim}
int choose_root_callback(dev_t dev,u_int8_t fs) {
if (fs==FS_MSDOS)
return dev;
return -1;
}
\end{verbatim}
In this case, the choice of the root device is quite simple: the
first FAT16 DOS partition becomes the root device. Usually,
DOS calls that partition ``C:''.
\begin{verbatim}
int __fs_sub_init(void) {
extern __dev_t root_device;
FILESYSTEM_PARMS fs = BASE_FILESYSTEM;
struct mount_opts opts;
/* mount parameters */
memset(&opts, 0, sizeof(struct mount_opts));
opts.flags = MOUNT_FLAG_RW;
/* filesystem initialization */
filesystem_def_rootdevice(fs,root_device);
filesystem_def_fs(fs,FS_MSDOS);
filesystem_def_showinfo(fs,TRUE);
filesystem_init(&fs);
/* C library initialization */
libc_initialize();
return 0;
}
\end{verbatim}
This functions initializes the filesystem giving R/W permission on
the root partition previously found. Then, the C library is
initialized.
%----------------------------------------------------------------------------
\section{Internal structure}
\label{sec:fs:intstruct}
%----------------------------------------------------------------------------
Many data structures are composed by two parts: a generic part,
that is independent from the particular filesystem, and a
filesystem specific part, that are usually stored into an union.
Almost all the data structures have a set of functions that
modifies them. Pointers to these functions are included in a
structure that has the name of the structure that is modified by
the functions with a suffix ``operations''. For example, the
structure super
\_block (see Section
\ref{sec:superblock}) has a
structure super\_operations (see Section \ref{sec:superop}) that
contains a function create\_inode.
Since the high layer functions influences the global system
throughput, some informations are duplicated in the data
structures to speed up the execution.
The system is basically structured with a Unix-like approach, with
4 fundamental structures:
\begin{description}
\item [super]contains informations on each device that is mounted (a table is
used);
\item [inode]contains informations on the files in a filesystem (double lists
with hash keys are used);
\item [dentry]contains the file names (a tree is used);
\item [file]contains informations on each open file (a descriptor table is
used).
\end{description}
%----------------------------------------------------------------------------
\subsection{The data structures}
%----------------------------------------------------------------------------
%----------------------------------------------------------------------------
\subsubsection{struct super\_block}
\label{sec:superblock}
%----------------------------------------------------------------------------
From a user point of view, the filesystem exports a unix-like view
of the system. That is, partitions can be mounted into the
directory tree. When the system starts, the root directory is
mounted as ``/''.
\begin{table}
\begin{verbatim}
struct super_block {
struct super_block *sb_parent;
struct super_block *sb_childs;
struct super_block *sb_next;
__dev_t sb_dev;
struct inode *sb_root;
struct dentry *sb_droot;
struct file_system_type *sb_fs;
struct super_operations *sb_op;
struct dentry_operations *sb_dop;
struct mount_opts sb_mopts;
int sb_busycount;
__uint32_t sb_used:1;
__uint32_t sb_blocked:1;
union {
struct msdos_super_block msdos_sb;
} u;
};
\end{verbatim}
\caption{struct super\_block}
\label{src:structsuper}
\end{table}
The struct super\_block contains the fundamental information of
the filesystem (a unix-like filesystem usually duplicates this
informations a few times into the Hard disk). Every time a
partition is mounted a new structure is allocated and initialized.
These structures are maintained in a tree, that is, if a partition
is mounted into a directory the new structure becomes a leaf of
the structure that contains the mount directory.
\begin{figure}
\begin{center}
\includegraphics[width=7cm]{images/supertree.eps}
\end{center}
\caption{Super block tree}
\label{fig:supertree}
\end{figure}
The root super block is stored into the global pointer
root\_superblock; Figure \ref{fig:supertree} shows a possible
super block tree.
Here a short description of the struct super\_block fields (see
Table \ref{src:structsuper}):
\begin{description}
\item [sb\_parent,~sb\_childs,~sb\_next]are used to store the tree structure:
sb\_parent points to the father of the structure, whereas sb\_childs points to
the first structure of the childs; sb\_next is used to create the brothers list
(that is the children list of the father).
\item [sb\_dev]Contains the device, that is the partition that contains the
filesystem (the low level part is responsible of the devices, that are typically
associated to an hard disk partition. It is possible for example to extend the
low level to associate to a device other entities, such a file on another
filesystem (this is usually possible on Unix system through loopback devices).
\item [sb\_root]This is the root directory inode of the filesystem.
\item [sb\_droot]This is the directory entry (the filename) of the root
directory.
\item [sb\_fs]This is the pointer to the filesystem present in the filesystem.
Every supported file system is registered at init time (see Section
\ref{sec:fs:int:init}).
\item [sb\_op]Pointer to a structure that contains the specific filesystem
functions used to use the struct super\_block.
\item [sb\_dop]These are the specific function used for the directory entries.
\item [sb\_mopts]Mount options used during initialization.
\item [sb\_busycount]This is a counter that counts the number of entities that
are using the filesystem (e.g., a partition can be unmounted if someone is
reading it).
\item [sb\_used]A flag that signals if the structure is used.
\item [sb\_blocked]Some operations on this structure must be executed in mutual
exclusion; if this flag is set there is some operation on the structure. This
flag can not be used/tested/modified directly by the user functions.
\end{description}
%----------------------------------------------------------------------------
\subsubsection{struct dentry}
%----------------------------------------------------------------------------
\begin{table}
\begin{verbatim}
struct dentry {
struct dentry *d_next;
struct dentry *d_prev;
struct dentry *d_parent;
struct dentry *d_child;
__time_t d_acc;
struct qstr d_name;
short d_lock;
struct dentry_operations *d_op;
struct inode *d_inode;
struct super_block *d_sb;
};
\end{verbatim}
\caption{struct dentry}
\label{src:structdentry}
\end{table}
\label{sec:dentry} The struct dentry contains the file names independently from
the file type (regular file, directory, ...) \footnote{``dentry'' means
\emph{D}irectory \emph{ENTRY}.}. Each file in the filesystem is always stored in
two structures:
\begin{itemize}
\item struct dentry contains the filename;
\item struct inode contains the other informations on the file.
\end{itemize}
This separation is made because in most file systems (not in the DOS FAT16)
more than one name can be linked to the same file (inode).
\begin{figure}
\begin{center}
\includegraphics[width=10cm]{images/dentrytree.eps}
\end{center}
\caption{Directory entry tree.}
\label{fig:dentrytree}
\end{figure}
These data structures are stored internally in a tree structure. Please note
that the inodes are not structured as a tree; every dentry has a pointer to the
correspondent inode.
Figure \ref{fig:dentrytree} shows a directory entry tree.
Directories are showed with a continuous line, whereas files are
showed with a dashed line (note that the internal structures for
files and directories are the same).
The system do not load all the file names in the filesystem. Only
the recent filenames are stored in memory in a partial tree. These
informations are continuously updated removing and adding dentry
struct.
\begin{table}
\begin{verbatim}
struct qstr {
char name[MAXFILENAMELEN+1];
char *nameptr;
};
\end{verbatim}
\caption{struct qstr}
\label{src:structqstr}
\end{table}
Directory names are stored into a struct qstr (see Table
\ref{src:structqstr}). A string is not used to avoid too much
sting copies. The structure is composed by two fields:
\begin{description}
\item [nameptr]If != NULL, it is the name to use.
\item [name]If nameptr==NULL, it is the name to use.
\end{description}
The QSTRNAME can be used on a structure QSTR to get a (char *).
When filling the structure, one of the two fields can be used. If
nameptr is used, the pointed string must be statically allocated.
If name is used, nameptr must be set to NULL.
The fields of the struct dentry contains the following
informations:
\begin{description}
\item [d\_next,~d\_prev,~d\_parent~e~d\_child]can be used to store a dentry tree
structure: d\_parent is a pointer to the father's dentry, d\_child is a pointer
to a list of children dentry, d\_next and d\_prev are used to navigate into the
children dentry list.
\item [d\_acc]It is the time of the last access to the dentry structure.
\item [d\_name]Is the name associated to the dentry.
\item [d\_lock]is a blocking counter: it is incremented every time a routine
needs to use the dentry, and it is decremented when it is no more needed. In
that way, it is possible to know which are the structures currently in use. The
filesystem routines should not directly modify this field.
\item [d\_op]Pointer to the virtual operations used to handle this structure.
\item [d\_inode]is the inode associated to the structure.
\item [d\_sb]It is the super\_block that contains the directory. This a
redundant information used to speed-up the code.
\end{description}
%----------------------------------------------------------------------------
\subsubsection{struct inode}
\label{sec:inode}
%----------------------------------------------------------------------------
\begin{figure}
\begin{center}
\includegraphics[width=10cm]{images/inodebuckets.eps}
\end{center}
\caption{Inode data structures}
\label{fig:inodebuckets}
\end{figure}
\begin{table}
\begin{verbatim}
struct inode {
struct stat i_st;
struct inode *i_next;
struct inode *i_prev;
int i_hash;
__rwlock_t i_lock;
__uint16_t i_dlink;
__uint32_t i_dirty:1;
struct inode_operations *i_op;
struct super_block *i_sb;
union {
struct msdos_inode_info msdos_i;
} u;
};
\end{verbatim}
\caption{struct inode}
\label{src:structinode}
\end{table}
An inode is a file into a filesystem. It contains all the file
informations except the name that is maintained into a struct
dentry (see \ref{sec:dentry}; more details can be found into
\cite{book:unix}).
All the inodes temporarily present in memory are stored into an
hash bucket structure as showed in Figure \ref{fig:inodebuckets}:
for each device a number is computed using an hash function, the
inode number (file serial number) and the device number. All the
inodes with the same hash entry are stored in a list. The first
inode in the list is stored in a table at the index given by the
hash key.
\begin{table}
\begin{verbatim}
struct stat {
__dev_t st_dev;
__ino_t st_ino;
__mode_t st_mode;
__nlink_t st_nlink;
__uid_t st_uid;
__gid_t st_gid;
__off_t st_size;
unsigned long st_blksize;
__time_t st_atime;
__time_t st_mtime;
__time_t st_ctime;
};
\end{verbatim}
\caption{struct stat}
\label{src:structstat}
\end{table}
An inode contains the following fields:
\begin{description}
\item [i\_st]Contains all the standard informations for a file (e.g., the
length, the type); a task can inspect these informations using the stat()
primitive. These informations are included into a stat structure that is shared
between the user libraries and the internal implementation, in a way that a
memcopy should be enough to pass these informations.
\item [i\_next,~i\_prev~e~i\_hash]These fields are used to implement the double
linked list and the hash key: i\_next and i\_prev implement a double list,
whereas i\_hash is the hash key (redundant information).
\item [i\_lock]It is a lock for the structure (see Section \ref{sec:rwlock}).
\item [i\_dlink]Number of directory entries that points to the inode (see
Section \ref{sec:dentry}).
\item [i\_dirty]A flag that says if this inode has been modified.
\item [i\_op]Virtual operations used for working with this inode. Usually they
are filesystem specific.
\item [i\_sb]Super block where the inode is stored (see Section
\ref{sec:superblock}).
\end{description}
The struct stat showed in Table \ref{src:structstat}, contained
into the struct inode and described in the paragraph 5.6 of
\cite{book:posix1003.1}, contains the following fields:
\begin{description}
\item [st\_mode]It is the file mode (informations about who can
read/modify/execute a file).
\item [st\_ino]An unique number that identifies a file (``file serial number''
in the Posix terminology).
\item [st\_dev]It is the device serial number; st\_dev and st\_ino are necessary
and sufficient to uniquely identify a file in the system.
\item [st\_nlink]Number of hard links to a file (Always 1 for FAT16 file
systems).
\item [st\_uid~e~st\_gid]File group and user identifiers. 0 is used for the
system administrator. The FAT16 filesystem always set these fields to 0.
\item [st\_size]File length (bytes).
\item [st\_atime,~st\_mtime~e~st\_ctime]File times: st\_atime is the time of the
last access, st\_mtime is the time of the last modification and st\_ctime is the
time of the last state change (e.g., when the file has been created or modified
using chmod).
\end{description}
%----------------------------------------------------------------------------
\subsubsection{struct file}
\label{sec:file}
%----------------------------------------------------------------------------
\begin{table}
\begin{verbatim}
struct file {
struct file *f_next;
struct dentry *f_dentry;
struct file_operations *f_op;
__loff_t f_pos;
unsigned int f_count;
__uint32_t f_flag_isdir:1;
union {
struct msdos_file_info msdos_f;
} u;
};
\end{verbatim}
\caption{struct file}
\label{src:structfile}
\end{table}
Every open file has a struct file associated to it (see Table
\ref{src:structfile}).
\begin{description}
\item [f\_next]Pointer used internally to the filesystem modules.
\item [f\_dentry]contains the filename (see Section \ref{sec:dentry}). From here
we can get the corresponding inode number.
\item [f\_op]Contains a pointer to the virtual operation that works on struct
file, described in Section \ref{sec:fileop}.
\item [f\_pos]Contains the current position into the file. read/write operations
are done starting from this point; moreover, Posix specification allow a seek to
a position beyond file termination: in that case read operation should fail,
whereas write operation should fill the gap with zeroes.
\item [f\_count]Contains the number of file descriptors that are using this
structure for input/output operations on the file.
\item [f\_flag\_isdir]Flag that says if this is a directory (redundant
information).
\item [msdos\_f]FAT16 specific informations.
\end{description}
The applications use a file descriptor to index the file to use. A
file descriptor is a number meaningful only for the processor that
obtained it (e.g. through an open operation) that is assigned to a
struct file through a descriptor table. Every process has a
different table, contained into the struct file\_descriptors
described in section \ref{sec:filedes}. The whole S.Ha.R.K. Kernel
can be considered a monoprocess multithread application, and for
that reason it has only one descriptor table.
%----------------------------------------------------------------------------
\subsubsection{struct file\_descriptors}
\label{sec:filedes}
%----------------------------------------------------------------------------
\begin{table}
\begin{verbatim}
struct file_descriptors {
struct file *fd_file[MAXOPENFILES];
__uint16_t fd_flags[MAXOPENFILES];
__uint32_t fd_fflags[MAXOPENFILES];
__mode_t fd_umask;
char fd_cwd[MAXPATHNAMELEN+1];
struct dentry *fd_cwden;
__mutex_t fd_mutex;
};
\end{verbatim}
\caption{struct file\_descriptors}
\label{src:structfiledes}
\end{table}
This struct is used to store all the general parameters assigned
to a process (e.g., the current directory). Again, S.Ha.R.K. has
only one structure because it can be considered like a monoprocess
multithread system. A mutex is used to guarantee mutual exclusion
when the struct is used by more than one thread. The structure
contains the following fields:
\begin{description}
\item [fd\_file]This is the descriptor table; every descriptor
(with value between 0 and MAXOPENFILES) is associated an element
that can have the following values:
\begin{itemize}
\item NULL if it is a free descriptor;
\item RESERVED if it is reserved for special uses (e.g., descriptors assigned to
the keyboard or to the console)
\item pointer to the file
\end{itemize}
\item [fd\_flags]Flags associated to the file (e.g., FD\_CLOSEXEC), described in
detail in \cite{book:posix1003.1}, when the fcntl() primitive is described.
\item [fd\_fflags]Flag used when the file was opened (e.g., O\_APPEND,
O\_RDONLY, etc...).
\item [fd\_umask]Mask used during file creation;can be modified with the
function umask().
\item [fd\_cwd]Current working directory used by all the primitives that
receives a relative pathname (that does not start with ``/'' (redundant
information).
\item [fd\_cwden]The current working directory; it can be modified using the
function chdir().
\item [fd\_mutex]Mutual exclusion semaphore used for concurrent multithread
access.
\end{description}
%----------------------------------------------------------------------------
\subsection{Disk Cache}
\label{sec:diskcache}
%----------------------------------------------------------------------------
%----------------------------------------------------------------------------
\subsubsection{How to use the cache}
%----------------------------------------------------------------------------
The filesystem modules use the disk cache to get all the
informations they need.
\begin{table}
\begin{verbatim}
dcache_t *dcache_lock (__dev_t dev, __blkcnt_t lsector);
void dcache_unlock (dcache_t *);
dcache_t *dcache_acquire (__dev_t dev, __blkcnt_t lsector);
void dcache_release (dcache_t *);
void dcache_dirty (dcache_t *d);
void dcache_skipwt(dcache_t *d);
\end{verbatim}
\caption{Disk cache interface}
\label{tab:cacheint}
\end{table}
The disk cache interface is showed in Table \ref{tab:cacheint}:
\begin{description}
\item [dcache\_lock]This function can be used to require a read only access to
the cache. More than one thread can lock a block. the parameters are the device
number and the sector number; it returns a pointer to the struct dcache\_t in
case of success, or NULL in case of error. This function do not return until the
access is granted or an error occurred. To avoid critical blocking times in case
more than one block is needed, you must use a proper access protocol: the
blocking must be done with increasing block numbers; moreover, if possible, only
one block should be blocked at a time.
\item [dcache\_unlock]This function can be used to unblock a cache entry
previously blocked.
\item [dcache\_acquire]This function behaves as dcache\_lock(), but it requires
a read/write access. Only one task at a time can acquire the lock. This access
do not imply that the cache entry is considered dirty (see the dcache\_dirty()
function).
\item [dcache\_release]This function is similar to dcache\_unlock(), and works
with the locks acquired by dcache\_acquire().
\item [dcache\_dirty]Signal that the cache entry has been modified. This
function must be called only if we own a read/write lock and if the cache has
been modified.
\item [dcache\_skipwt]This function signals to the system that the cache entry
contains system informations that are often modified. That means that the
write\_through flag should not be considered. See section \ref{sec:modicache}
for a detailed description.
\end{description}
%----------------------------------------------------------------------------
\subsubsection{struct dcache}
\label{sec:dcache}
%----------------------------------------------------------------------------
\begin{figure}
\begin{center}
\includegraphics[width=10cm]{images/dcachebuckets.eps}
\end{center}
\caption{Data structure for disk cache}
\label{fig:dcachebuckets}
\end{figure}
\begin{table}
\begin{verbatim}
typedef struct TAGdcache {
int prev;
int next;
int hash;
__dev_t device;
__blkcnt_t lsector;
__uint8_t buffer[MAXSECTORSIZE];
int used;
__time_t time;
int numblocked;
__fs_sem_t sync;
__fs_sem_t esync;
__rwlock_t rwlock;
__uint32_t dirty:1;
__uint32_t ready:1;
__uint32_t esyncreq:1;
__uint32_t skipwt:1;
} dcache_t;
\end{verbatim}
\caption{struct dcache}
\label{str:structdcache}
\end{table}
This data structure, used by the module that implements the data
cache, represents a cache entry. All the high level functions uses
the cache module as an interface to the lower layer.
These structures are maintained in the system wit a structure
similar to the inodes, that is a set of double linked lists that I
can access using an hash key, as showed in Figure
\ref{fig:dcachebuckets}; the hash key is computed using the device
(that is the physical peripheral) and the lsector (the logical
sector).
Every struct dcache is an elementary block that can be transferred
to a device. It is not possible to have 2 data structures that
refers to the same block. That is, the pair (device,lsector) must
uniquely identify the data sector on the physical device (e.g.,
the block 64 in ``ide/hda'' could be the block 0 of
``ide/hda1'', as explained in section
\ref{sec:dd:interfaccia:nomi}).
The field of the structure (see Table \ref{str:structdcache}) are:
\begin{description}
\item [prev,~next~e~hash]Are used to store the structures into the lists that
are accessed using the hash key: next and prev are used to implement a double
linked list, whereas hash is the hash key.
\item [device,~lsector~e~buffer]These fields contain the physical device, the
logical sector and the data contained in the corresponding physical block. These
are the unique fields that a high level routine should use (and that shall be
provided by a data cache).
\item [used]The number of tasks that are using the block; -1 means that no one
is using the block, and that the block is free.
\item [time]The last block access time.
\item [numblocked]It is the number of tasks that are waiting the availability of
the block, (that is, they are waiting the flag ready to be set). That is, if a
task asks for a block already asked by another task, but not yet available (not
made available by the lower layer), then the task blocks on a synchronization
semaphore, and it increments that field.
\item [sync]This is the synchronization semaphore used by the blocked tasks that
waits the setting of the ready flag.
\item [esync]It is a synchronization semaphore on the errors needed during the
cache ``purging'' phase (see later).
\item [rwlock]This is a locker used to require the R/W access to the block (see
Section \ref{sec:rwlock}).
\item [dirty]This flag is set if the block has been modified.
\item [ready]This flag is set if the block is available.
\item [esyncreq]This flag is set if the synchronization on errors is required
(that is, the field esync is used).
\item [skipwt]This flag says if the write through handling should be done.
\end{description}
The fields esync and esyncreq are used during the ``purging
phase'': a task requires a read or write operation on a block, the
block is not into the data cache, and the cache is full. That is,
another block must be discarded from the cache. To discard a block
it can happen that a disk write operation must be done. While this
write operation is in progress other tasks can require the same
(discarded, in writing) block. These tasks can block waiting for
synchronization on the block. In that case the cache behavior
could be the following: the requests on the block are aborted (the
failure is only needed if we are able to free the block; the
blocked tasks are awakened with an error); or, the requests on the
block are not aborted (the block can not be freed, and the tasks
blocked on it are woken up). The problem is that after the task
has wake up all the blocked tasks, it must wait that all the
awakened tasks check for an error. To do that, the task set the
esyncreq flag, and it blocks on the esync semaphore; the last task
that checked for errors, checks the esyncreq flag and signal the
esync semaphore waking up the waiting task.
The disk cache can work in three different modes: \label{sec:modicache}
``copy-back'', ``write-through'' and ``modified write-thought''. Usually most
time-sharing OSes handle a copy-back cache. that is, write operations are done
only in cache; a low priority task is used to synchronize cache and disk data.
In such a system, if a cache request arrives and the cache is full, a block is
freed eventually writing it on the disk. On a real-time system such a behavior
can give some problems, because the time spent freeing the cache can be
accounted to the running task. For example, a task that only reads can be
delayed by another task that fills the cache memory writing blocks on the disk.
To solve this problem there are 2 solutions: the cache is not used (that is, no
caching on write), or the cache behavior is set to write-through, that is the
write operations are immediately synchronized on the disk, in a way that who
dirties the cache is also responsible for its synchronization.
If the cache is used in write-through mode, another problem
arises: sometimes a few bytes are modified frequently (e.g. the
filesystem information for the allocation of a file on the disk),
that leads to continuous disk writes. For that reason, the disk
cache provides the flag skipwt, that forces the copy-back mode for
a particular block. By default, this flag is not set. When the
write-through policy must be disabled, the file system should call
the function dcache
\_skipwt() on the cache-entry before releasing
it.
%----------------------------------------------------------------------------
\subsubsection{struct \_\_rwlock\_t}
\label{sec:rwlock}
%----------------------------------------------------------------------------
\begin{table}
\begin{verbatim}
typedef struct {
__mutex_t mutex;
int active_readers;
int active_writers;
int blocked_readers;
int blocked_writers;
__sem_t readers_semaph;
__sem_t writers_semaph;
} __rwlock_t;
\end{verbatim}
\caption{struct \_\_rwlock\_t}
\label{src:structrwlock}
\end{table}
\begin{table}
\begin{verbatim}
void __rwlock_init (__rwlock_t *ptr);
void __rwlock_rdlock (__rwlock_t *ptr);
void __rwlock_wrlock (__rwlock_t *ptr);
int __rwlock_tryrdlock (__rwlock_t *ptr);
int __rwlock_trywrlock (__rwlock_t *ptr);
void __rwlock_rdunlock (__rwlock_t *ptr);
void __rwlock_wrunlock (__rwlock_t *ptr);
\end{verbatim}
\caption{Lockers functions}
\label{src:funcrwlock}
\end{table}
The cache module and other modules need a locker protocol, that is
a protocol that solves the reader/writer problem: basically, there
is a shared resource where some processes must read or write.
These rules must be imposed by the protocol:
\begin{enumerate}
\item If someone is writing, no one is authorizes to read or write.
\item If someone is reading, other tasks are authorized to read, but not to
write.
\end{enumerate}
In literature, there exist different methods to solve the problem
with different behaviors. This filesystem implements the protocol
proposed into paragraph 4.2.3.1 of
\cite{book:programmazioneconcorrente}. To replace that policy, the
rwlock.c file should be modified with another algorithm,
maintaining the same interface.
Table \ref{src:structrwlock} contains the internal data structure
of the implemented protocol. To implement a ``locker'', all the
needed data must be inserted into a structure \_\_wrlock\_t. The
following functions have also to be provided (see Table
\ref{src:funcrwlock}):
\begin{description}
\item [\_\_rwlock\_init]Initializes the structure passed as parameter. That
structure is passed to all the functions and it is used to protect a shared
resource.
\item [\_\_rwlock\_rdlock]Read request (read lock); when this function returns,
the calling task is authorized to read the shared resource.
\item [\_\_rwlock\_wrlock]Write request (write lock).
\item [\_\_rwlock\_tryrdlock]Conditional read request (try read lock); The
function return 0 if the calling task can read, another number otherwise. At the
end of the read operation (if the lock has been granted), \_\_rwlock\_rdunlock
must be called.
\item [\_\_rwlock\_trywrlock]Conditional write request (try write lock); The
function return 0 if the calling task can write, another number otherwise. At
the end of the write operation (if the lock has been granted),
\_\_rwlock\_wrunlock must be called.
\item [\_\_rwlock\_rdunlock]Read unlock; the task has finished the use of the
shared resource.
\item [\_\_rwlock\_wrunlock]Write unlock; the task has finished the use of the
shared resource, that is again in a consistent state.
\end{description}
The functions
\_\_rwlock
\_rdlock and
\_\_rwlock
\_rdunlock must always be used
together. Nested calls are not allowed. The same rules apply to
\_\_rwlock
\_wrlock and
\_\_rwlock
\_wrunlock.
Please note that
\_\_rwlock
\_rdlock and
\_\_rwlock
\_wrlock can
cause undefined task blocking. The other functions do not block
the calling task.
%----------------------------------------------------------------------------
\section{Filesystem: DOS (FAT16)}
%----------------------------------------------------------------------------
%----------------------------------------------------------------------------
\subsection{Description}
%----------------------------------------------------------------------------
A technical description of the FAT16 filesystem can be found in
\cite{manual:DOS330} or \cite{manual:dosfat}.
%----------------------------------------------------------------------------
\subsubsection{Internal structure}
%----------------------------------------------------------------------------
\begin{figure}
\begin{center}
\includegraphics[width=7cm]{images/fatarch.eps}
\end{center}
\caption{Internal structure of the DOS FAT16.}
\label{src:fatarch}
\end{figure}
As showed in Figure \ref{src:fatarch}, this filesystem is divided
in areas:
\begin{description}
\item [Boot~sector]Contains a small program that loads the DOS operating
system, as specified into \cite{manual:DOS330}. It also contains other
informations like the number of FATS, the kind of support and the dimension of
the Root Directory; It can be thought as a super-block. The informations
contained in this sector are showed in Table \ref{tab:fatboot}.
\item [FAT~tables]This data area contains some copies of the File Allocation
Table (FAT) These informations are used to know where a file can be found on the
hard disk.
\item [Root~directory]This block is structured as all the blocks that contains a
directory, with the exceptions that it is allocated in a contiguous way and that
it has a fixed size.
\item [Data~area]This is the area where the files and directories are stored.
The minimal allocation unit is a cluster, that is a set of contiguous sectors on
the hard disk. The boot sector stores the number of sectors that compose a
cluster.
\end{description}
%----------------------------------------------------------------------------
\subsubsection{The boot sector}
%----------------------------------------------------------------------------
\begin{table}
\begin{verbatim}
struct bootrecord {
__uint8_t reserved[3];
char oemname[8];
__uint16_t bytespersector;
__uint8_t sectorspercluster;
__uint16_t hiddensectors;
__uint8_t fatsnumber;
__uint16_t rootentry;
__uint16_t sectors;
__uint8_t media;
__uint16_t sectorsperfat;
__uint16_t sectorpertrak;
__uint16_t headsperdisk;
__uint32_t hiddensectors32;
__uint32_t sectors32;
__uint16_t physicaldrive;
__uint8_t signature;
__uint8_t serialnumber[4];
char volumelabel[11];
char fattype[8];
};
\end{verbatim}
\caption{The DOS boot sector.}
\label{tab:fatboot}
\end{table}
Table \ref{tab:fatboot} shows the boot sector structure. The boot
sector is always 512 bytes large. The struct is only the first
part of the sector, the second part is the boot loader. Here is a
small description of some fields:
\begin{description}
\item [sectorpercluster]stores the number of sectors included into a cluster.
\item [sectorsperfat]stores the number of sectors in the FAT
\item [fatsnumber]stores the number of FATs on the disk.
\end{description}
A note: when DOS formats a disk, creating a FAT16 filesystem,
the informations that comes from a pre-existing filesystem have
precedence over the informations given by the hardware. That is,
if an hard disk declares to have 8 heads, and the pre-existing
filesystem has headsperdisk=13, the new file system will have 13
heads.
%----------------------------------------------------------------------------
\subsubsection{The File Allocation Table}
%----------------------------------------------------------------------------
\begin{figure}
\begin{center}
\includegraphics[width=7cm]{images/fatfat.eps}
\end{center}
\caption{An example of File Allocation Table (FAT)}
\label{fig:fatfat}
\end{figure}
Every FAT is organized as a table of 16 bit elements. Every
element is a cluster; the element value gives informations about
the cluster:
\begin{itemize}
\item 0 means the cluster is available.
\item values between 0xfff0-0xfff7 are reserved; for example, 0xfff7 means bad
(damaged) cluster.
\item 0xfff8-0xffff says that the cluster is the last cluster in a file.
\item Every other value index which is the next cluster of the file.
\end{itemize}
The FAT is used to create a set of chains that make a file. For
example, Figure \ref{fig:fatfat} shows the information for a file
stored in 4 clusters (23, 27, 25, 31). The first cluster of the
chain is contained into the directories. Please note that the
first cluster of the data area is the number two, because cluster
0 and 1 (the first 4 bytes) are used to store the type of FAT (12
or 16 bit), and the disk format.
Directories (except the root directory) are common files with a
particular structure. The system usually give access to
directories through other primitives (mkdir, creat, and so on).
%----------------------------------------------------------------------------
\subsubsection{The ``root directory''}
%----------------------------------------------------------------------------
This area has the same structure of a directory file, except that
it has fixed dimension, it is not allocated into the data area and
its allocation is contiguous. Common directories have the same
structure except that they are allocated into the data area, and
that they can be fragmented.
\begin{table}
\begin{verbatim}
struct directoryentry {
__uint8_t name[8];
__uint8_t ext[3];
__uint8_t attribute;
__uint8_t reserved[10];
__uint16_t time;
__uint16_t date;
__uint16_t cluster;
__uint32_t size;
};
\end{verbatim}
\caption{DOS Directory Entry.}
\label{tab:fatdentry}
\end{table}
The directory structure is showed in Figure \ref{tab:fatdentry}:
\begin{description}
\item [name~and~ext]Contains the filename in the classic format 8+3. Non used
chars are filled with spaces.
\item [attribute]File flags: they specify if it is regular, a directory, a
volume label, it has been archived or if it is read-only.
\item [time]Creation time (2 seconds resolution) (all the 3 times specified by
the POSIX standard are mapped on that date). The time format is DOS specific.
\item [date]Creation date (until 2099). The date format is DOS specific.
\item [cluster]The first cluster of the file. A value 0 means empty file (the
first available cluster is the cluster 2).
\item [size]File size (32 bit).
\end{description}
Note the difference with the typical Unix file systems, where file
names and file informations are stored in different areas.
Directory entries are not contiguous. If the first character name
is 0xe5, the file has been deleted, if it is 0x00, it has been
never used. Any other character means that the entry is valid.
0x00 is used for efficiency: if it is found, directory scan is
ended also if there are other sectors in the cluster that contains
a directory.
%----------------------------------------------------------------------------
\subsubsection{The data area}
%----------------------------------------------------------------------------
The data area is divided in clusters, that are a set of sectors.
It starts from cluster number 2, and it stores regular files and
directories.
linear addresses are used to identify a sector on a hard disk, and
its cluster. Hard disk sectors are numbered starting from sector
0. The filesystem specifies an algorithm that associates to every
linear address an address C/H/S (Cylinder/Head/Sector).
Some DOS implementation supposes a particular structure
depending on the media used: a 1.44 Mb floppy disk has sectors
composed by 1 sector, and a root directory of 14 sectors. The best
way to implement a filesystem is to rely only on the super block
informations.
%----------------------------------------------------------------------------
\subsection{Implementation notes}
%----------------------------------------------------------------------------
The filesystem maintains 2 structures to store file informations:
an inode structure and a dentry structure. In the DOS FAT16
filesystem unfortunately there is an unique correspondence between
file names and inodes, and that information is contained into the
filesystem directory entry.
\begin{figure}
\begin{center}
\includegraphics[width=7cm]{images/fsn.eps}
\end{center}
\caption{The FAT16 file serial number.}
\label{fig:fsn}
\end{figure}
When the system refers to a file, it use the inode number. That
is, every file has a number that is unique inside the file system.
The FAT16 serial number has been specified as shown in Figure
\ref{fig:fsn}:
\begin{itemize}
\item The most significant part is the cluster number (the cluster where the
directory is stored).
\item The less significant part is the number into the directory entry.
\end{itemize}
For example, the inode number 0x01230008, refers to the 8th
directory entry into the cluster 0x123.
The cluster 1 is used to index the directory entries contained
into the root directory, whereas the cluster 0 is used to index
the root directory.
\bibliographystyle{plain}
\bibliography{../common/retis}
\end{document}