This specification describes the design for a protected,
persistent
RAM-based special filesystem in Linux (PRAMFS).
PRAMFS is designed to be a lightweight
and space-efficient RAM-based non-volatile filesystem. PRAMFS is also
designed in such a way as to minimize the risk of filesystem corruption
due to errant writes caused by kernel bugs.
The RAM used for the PRAMFS filesystem must meet the following
requirements:
- The RAM must be directly addressable.
- The RAM must have access times comparable to normal
system memory.
- In order for the RAM to be write protected, it must
be addressable by the CPU only via page tables.
To help meet the "protected" requirement, PRAMFS attempts to minimize
the time windows in which filesystem data is present in unprotected
kernel buffers. Therefore PRAMFS does not keep file data in the
page cache for normal file I/O. Since a central assumption of
PRAMFS is that the RAM
used for the filesystem is comparable in speed to system memory,
bypassing the page cache is acceptable and in fact desirable.
PRAMFS accomplishes this by making use of the direct
I/O feature in Linux,
which guarantees that
file data is transferred directly between the user-level buffers and
the
filesystem, with no intermediate buffers. In
PRAMFS, direct I/O is enabled for all files in the filesystem; in
other words, the O_DIRECT flag is forced on every open of a PRAMFS file.
Also, file I/O in the PRAMFS always occurs synchronously. There is no
need to block the current process while the transfer to/from the PRAMFS
is in progress, since one of the requirements of the PRAMFS is that the
filesystem exist in fast RAM. So file I/O in PRAMFS is always
direct, synchronous, and never blocks.
One approach for a non-volatile RAM filesystem is to
write a non-volatile RAM block device driver, and then mount a
disk-based filesystem over it. The advantages of PRAMFS over this
approach include:
- Disk-based filesystems such as ext2/ext3/ext4 use the page
cache for file I/O. PRAMFS never uses the page cache for file I/O, so
it puts no pressure on the cache, which remains available
to other parts of the system that need it
(disk-based filesystem data, pages from swapped-out processes, IPC pages, etc.). Bypassing the cache also protects
the filesystem against possible page cache corruption caused by kernel
bugs.
- Disk-based filesystems such as ext2/ext3/ext4 were
designed for optimum performance on disk-based media, and so implement features such as
block groups, which attempt to group inode data into a contiguous set
of data blocks to minimize disk seeks when accessing files. Since there
is no such performance penalty for random access on RAM-based media,
such features as block groups are not used in PRAMFS, which reduces
filesystem complexity and in turn increases the efficient use of space
on the media (i.e. more space is dedicated to actual file data storage
and less to meta-data needed to maintain that file data).
PRAMFS Special Filesystem
In this section we discuss the details of the PRAMFS
special filesystem design. First, the general layout of data objects is
described, followed by a description of the information contained in
the data objects themselves. Next comes a discussion of how blocks are
allocated for inodes. Then the directory tree structure is described,
followed by the details of write protecting the filesystem data, and
finally a walk-through of some important filesystem methods.
Refer to the following figure for the data layout.
- Super Block (SB):
the super block is 128 bytes long, and exists at the very beginning of
the filesystem. There is a redundant super block after the primary.
- Inode Table: the inode table consists of Ni inodes,
and each inode
is 128 bytes long. Therefore the inode table is 128*Ni bytes in size.
The number of inodes is calculated such that the end of the table
occurs on a block boundary. The inode table size is fixed, that is the
maximum number of inodes that can be allocated is Ni, and Ni cannot be
changed after the filesystem is created.
- Data Blocks: The remaining space in the filesystem
after the inode
table consists of data blocks for file data and the in-use bitmap
(discussed next). In the figure, b is the block size in bytes, and N is
the total number of data blocks. Like Ni, N is fixed, that is, once the
filesystem is created, the maximum number of data blocks that can be
allocated is fixed.
- Block In-Use Bitmap: For every block, there is a bit
in the bitmap
signifying whether that block is in-use by a file (set) or not in use
(cleared). Therefore the bitmap requires N bits, or N/8 bytes, or
N/(8b) blocks. N includes the bitmap blocks, so when the filesystem is
created, the first N/(8b) bits are set in the bitmap, which marks the
blocks that make up the bitmap as in use, and these blocks can never be
freed.
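The sizing rules in the list above can be checked with a little arithmetic. The following is only a sketch: the helper names are hypothetical and do not come from the PRAMFS sources, and rounding the bitmap up to a whole number of blocks is an assumption (the text simply states N/(8b)).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of blocks needed for an in-use bitmap that
 * covers n_blocks data blocks of block_size bytes each. One bit per
 * block gives n_blocks/8 bytes; the result is rounded up to whole
 * blocks (the rounding is an assumption of this sketch).
 */
static uint32_t bitmap_blocks(uint32_t n_blocks, uint32_t block_size)
{
    assert(block_size > 0);
    uint64_t bits_per_block = (uint64_t)block_size * 8;
    return (uint32_t)((n_blocks + bits_per_block - 1) / bits_per_block);
}

/* Hypothetical helper: inode table size in bytes, at 128 bytes per inode. */
static uint64_t inode_table_bytes(uint32_t n_inodes)
{
    return (uint64_t)n_inodes * 128;
}
```

For example, with N = 8192 blocks of b = 1024 bytes, the bitmap needs 8192 bits, which is 1024 bytes, or exactly one block.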
The super block
object contains information that pertains to the whole filesystem. Such
information includes the block size of the filesystem in bytes, the
total number of inodes and blocks (Ni and N), a count of the current
free inodes and data blocks, etc. The PRAMFS super block structure is
shown here:
#define PRAM_SB_SIZE 128 /* must be power of two */
#define PRAM_SB_BITS 7

/*
 * Structure of the super block in PRAMFS
 */
struct pram_super_block {
    __be16 s_sum;               /* checksum of this sb, including padding */
    __be64 s_size;              /* total size of fs in bytes */
    __be32 s_blocksize;         /* blocksize in bytes */
    __be32 s_inodes_count;      /* total inodes count (used or free) */
    __be32 s_free_inodes_count; /* free inodes count */
    __be32 s_free_inode_hint;   /* start hint for locating free inodes */
    __be32 s_blocks_count;      /* total data blocks count (used or free) */
    __be32 s_free_blocks_count; /* free data blocks count */
    __be32 s_free_blocknr_hint; /* start hint for locating free data blocks */
    __be64 s_bitmap_start;      /* data block in-use bitmap location */
    __be32 s_bitmap_blocks;     /* size of bitmap in number of blocks */
    __be32 s_mtime;             /* Mount time */
    __be32 s_wtime;             /* Write time */
    __be16 s_magic;             /* Magic signature */
    char s_volume_name[16];     /* volume name */
};
In PRAMFS, directory entry information, such as the file
name and owning inode, is contained within the inodes themselves.
This presents a problem only for hard links, so PRAMFS does not support
hard links.
The PRAMFS inode structure is
shown here:
#define PRAM_INODE_SIZE 128 /* must be power of two */
#define PRAM_INODE_BITS 7

/*
 * Structure of a directory entry in PRAMFS.
 * Offsets are to the inode that holds the referenced dentry.
 */
struct pram_dentry {
    __be64 d_next;   /* next dentry in this directory */
    __be64 d_prev;   /* previous dentry in this directory */
    __be64 d_parent; /* parent directory */
    char d_name[0];
};

/*
 * Structure of an inode in PRAMFS
 */
struct pram_inode {
    __be16 i_sum;         /* checksum of this inode */
    __be32 i_uid;         /* Owner Uid */
    __be32 i_gid;         /* Group Id */
    __be16 i_mode;        /* File mode */
    __be16 i_links_count; /* Links count */
    __be32 i_blocks;      /* Blocks count */
    __be32 i_size;        /* Size of data in bytes */
    __be32 i_atime;       /* Access time */
    __be32 i_ctime;       /* Creation time */
    __be32 i_mtime;       /* Modification time */
    __be32 i_dtime;       /* Deletion Time */
    __be64 i_xattr;       /* Extended attribute */
    __be32 i_generation;  /* File version (for NFS) */
    __be32 i_flags;       /* Inode flags */

    union {
        struct {
            /*
             * ptr to row block of 2D block pointer array,
             * file block #'s 0 to (blocksize/8)^2 - 1.
             */
            __be64 row_block;
        } reg; /* regular file or symlink inode */
        struct {
            __be64 head; /* first entry in this directory */
            __be64 tail; /* last entry in this directory */
        } dir;
        struct {
            __be32 rdev; /* major/minor # */
        } dev; /* device inode */
    } i_type;

    struct pram_dentry i_d;
};
Notice the i_type union member. The valid
elements of the
union depend on the file's type as contained in i_mode.
For instance, a directory file has valid information in i_type.dir,
and the other elements of the union are invalid.
In PRAMFS, as in most other filesystems, the inode
number of an inode is simply the absolute offset of that inode from the beginning of the filesystem.
In PRAMFS, only regular files own file data (directories
don't own data
blocks, with the exception of extended attribute blocks, which are discussed later). The inode field i_type.reg.row_block
points to the start of a 2-dimensional table of data block pointers. A
single block is allocated for the row block, and therefore contains b/8
64-bit pointers that point to up to b/8 column blocks. Each column
block holds up to b/8 pointers to data blocks. In this way a regular
file can contain up to (b/8)^2 data blocks, or b^3/64 bytes of data.
For those familiar with the EXT2 filesystem, i_type.reg.row_block
is equivalent to the i_block[13] entry in the EXT2 inode structure. The EXT2 inode's i_block[0-11]
entries point directly to data blocks, the reason being that, for small
files, the first 12 data blocks can be located in a single disk seek.
In PRAMFS,
however, there is no speed penalty for random access, so direct
pointers to data blocks are not necessary; omitting them simplifies the
methods for locating data blocks. Also, higher order tables (such as
EXT2's 3-dimensional i_block[14]) are not deemed
necessary in PRAMFS because it is not envisioned that enough
persistent RAM would be available to hold such
large files.
A note about block numbers. An offset pointer to a block
is sometimes referred to as a logical block number.
Given a block index from 0 to N-1, it's a simple matter to convert the
index into a logical block number: it's just the start offset of data
blocks plus the index times the blocksize, or s_bitmap_start +
(index * s_blocksize).
However when accessing data blocks for a
file, we usually use a file block number,
which is the relative position of the block inside the file. To find
the absolute logical block number corresponding to a file block index
from 0 to (b/8)^2 - 1, we use the inode's 2-dimensional block pointer
table. For instance, say we are looking for the block at file block
index 359, and the blocksize is b=1024. This means that a single block
can hold 128 logical block numbers, and the logical block number for
file block index 359 is therefore located at i_type.reg.row_block[2][103],
that is, entry 103 within the third column block. This lookup is
accomplished by the function pram_find_data_block()
in fs/pramfs/inode.c, which takes as arguments the inode and the file
block index and returns the corresponding logical block number.
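The index arithmetic in the example above can be sketched as follows. This is not the body of pram_find_data_block(); it only shows how a file block index splits into a row entry and a column entry, plus the resulting maximum file size of b^3/64 bytes. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: position inside the 2D block pointer table. */
struct pram_blk_pos {
    uint32_t row; /* index into the row block (selects a column block) */
    uint32_t col; /* index into that column block (selects a data block) */
};

/* Split a file block index into row and column indices. */
static struct pram_blk_pos split_file_block(uint32_t file_blk,
                                            uint32_t block_size)
{
    uint32_t ptrs = block_size / 8; /* 64-bit pointers per block */
    assert(ptrs > 0 && (uint64_t)file_blk < (uint64_t)ptrs * ptrs);
    struct pram_blk_pos pos = { file_blk / ptrs, file_blk % ptrs };
    return pos;
}

/* Maximum file size in bytes: (b/8)^2 data blocks of b bytes each. */
static uint64_t max_file_bytes(uint32_t block_size)
{
    uint64_t ptrs = block_size / 8;
    return ptrs * ptrs * block_size;
}
```

With b = 1024, file block index 359 splits into row 2 and column 103, matching the text, and the maximum file size is 16 MiB.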
The organization of the inode logical block pointer
table is illustrated in the figure below. Arrows in the figure
represent pointers, and entries in the column blocks are marked with their file
block index, and are pointing to data blocks assigned to them.
To allocate a new block, a search is made for the first
cleared bit
in the in-use bitmap. The located bit number is also the logical block
index of the located free block. The bit is then set in the in-use
bitmap to mark the corresponding block as in use. This algorithm is
implemented in the function pram_new_block() in
fs/pramfs/balloc.c, which returns the logical block index of the block
that was just allocated.
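A minimal sketch of the first-cleared-bit search described above is shown here. It is a plain userspace illustration, not the code in fs/pramfs/balloc.c: the function name is hypothetical, the least-significant-bit-first bit order is an assumption, and the sketch ignores any start hint such as s_free_blocknr_hint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Find the first cleared bit in the in-use bitmap, set it, and return
 * its index (the logical block index). Returns -1 if every block is in
 * use. Bit i lives in byte i/8 at position i%8 (assumed bit order).
 */
static long bitmap_alloc(uint8_t *bitmap, size_t n_blocks)
{
    assert(bitmap != NULL);
    for (size_t i = 0; i < n_blocks; i++) {
        uint8_t mask = (uint8_t)(1u << (i % 8));
        if (!(bitmap[i / 8] & mask)) {
            bitmap[i / 8] |= mask; /* mark the block as in use */
            return (long)i;
        }
    }
    return -1; /* no free block */
}
```

Because the bitmap's own blocks are marked in use when the filesystem is created, a search like this never hands them out.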
The function pram_new_block() is used by
the higher-level function pram_alloc_blocks().
The job of this function is to allocate data blocks for an inode. It
will allocate a set of data blocks starting at a given file block
index. Note that this function must take care of allocating the row and
column blocks that make up the 2D block pointer table. Any unallocated
file blocks before the starting file block index are allocated. All
allocated blocks except the last are zeroed out. pram_alloc_blocks()
in turn is used by the struct
file_operations write() method (discussed below).
All inodes (of any type) within a directory are linked
together in a
doubly-linked list, where the i_d.d_next and i_d.d_prev
fields of the inodes point to the next and previous inodes within the
directory. The i_d.d_prev pointer of the first inode and the i_d.d_next
pointer of the last inode are zero (null).
The parent directory inode holds pointers to the head
and tail of the doubly-linked list contained in that directory (i_type.dir.head
and i_type.dir.tail, respectively). If the directory is
empty, i_type.dir.head and i_type.dir.tail
are both zero.
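The list structure can be illustrated with a simplified in-memory model. In this sketch the inode table is an ordinary array and the "offsets" are array indices, with 0 standing for null; the type and function names are hypothetical. Appending at the tail is an arbitrary choice for illustration, since the text does not say at which end new entries are linked.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for a PRAMFS inode: only the linkage fields. */
struct sk_inode {
    uint64_t d_next;   /* next inode in the same directory, 0 if none */
    uint64_t d_prev;   /* previous inode in the same directory, 0 if none */
    uint64_t d_parent; /* parent directory inode */
    uint64_t head;     /* directories only: first entry, 0 if empty */
    uint64_t tail;     /* directories only: last entry, 0 if empty */
};

/* Link inode `ino` into directory `dir` at the tail of its list. */
static void sk_add_link(struct sk_inode *tbl, uint64_t dir, uint64_t ino)
{
    assert(dir != 0 && ino != 0);
    tbl[ino].d_parent = dir;
    tbl[ino].d_next = 0;
    tbl[ino].d_prev = tbl[dir].tail;
    if (tbl[dir].tail != 0)
        tbl[tbl[dir].tail].d_next = ino; /* old tail now points at us */
    else
        tbl[dir].head = ino;             /* directory was empty */
    tbl[dir].tail = ino;
}
```

Since each inode carries exactly one set of next/previous/parent links, it can belong to only one directory list, which is why hard links cannot be represented.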
Other filesystem implementations, such as EXT2,
use directory entry objects ("dentries") to associate file
names with inodes. In EXT2,
for instance, a dentry holds the file name, the inode number to
associate the file with, and the file type, and these dentries are
stored in data blocks owned by the parent directory. In PRAMFS,
directory inodes do not need to own any data blocks, because all
dentry information is contained within the inodes themselves.
Extended attributes are stored in blocks allocated outside of any inode.
The i_xattr field is then made to point to this allocated block. If the
extended attributes of several inodes are identical, these inodes may share the same
extended attribute block. Such situations are automatically detected by keeping
a cache of recent attribute block numbers and hashes over the blocks' contents
in memory. Each extended attribute block is described by a descriptor, which holds
flags, the absolute block number, and a lock for the block. The design is based
on ext2/3/4, but instead of using buffer head structs, the page cache, and the block layer,
PRAMFS uses a red-black tree to track blocks and their states.
The block header is followed by multiple entry descriptors. These entry
descriptors are variable in size, and aligned to PRAM_XATTR_PAD
byte boundaries. The entry descriptors are sorted by attribute name,
so that two extended attribute blocks can be compared efficiently.
Attribute values are aligned to the end of the block, stored in
no specific order. They are also padded to PRAM_XATTR_PAD byte
boundaries. No additional gaps are left between them.
Because PRAMFS attempts to
avoid filesystem corruption caused by kernel bugs, dirty pages in the
page cache are not allowed to be written back to the backing-store RAM.
This means that only private file mappings are supported. This way, an
errant write into the page cache will not get
written back to the filesystem.
This is accomplished by implementing the readpage() method
in the PRAMFS address_space object, but not the writepage()
method.
In addition to the software protection features already
discussed
(i.e. avoiding the page cache for file I/O, and allowing only private
mappings), the hardware protection
feature utilizes the
system's paging unit by mapping the I/O memory pages initially as
read-only. Any writes to objects in the PRAMFS
first mark the corresponding page table entries as writeable, perform
the write, and then mark the pages as read-only again. Also, when the write operation completes, any
stale entries in the system TLB that are still marking the pages as
writeable are flushed.
The function that sets the writeable flag for the
filesystem's pages is pram_writeable() in
fs/pramfs/wprotect.c. This function is used by a set of macro functions
defined in include/linux/pram_fs.h:
- pram_lock_super(ps) and pram_unlock_super(ps).
The pram_lock_super() macro acquires a spinlock, and then marks the pages that
contain the given PRAMFS super-block (ps) as writeable. In turn, pram_unlock_super()
recalculates the checksum for the super-block (pram_sync_super),
marks the pages read-only, and then releases the spinlock and restores
the system's interrupt flags. All writes to the PRAMFS super-block are
bracketed by these two macro functions. Because a spinlock is held in between, the write operations must
always be done quickly.
- pram_lock_inode(pi) and pram_unlock_inode(pi).
These macros perform the same operations as the super-block macros
above; the only difference is that the PRAMFS inode size is passed to pram_writeable()
instead of the super-block size. All writes to PRAMFS inodes are
bracketed by these macros.
- pram_lock_block(sb,bp) and pram_unlock_block(sb,bp).
Again, much the same as the above macros, except that we pass the
blocksize to pram_writeable(). Also, blocks are not
checksummed. All writes to PRAMFS blocks are bracketed by these
macros.
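The "open a write window, write, close the window" pattern behind these macros can be imitated in userspace with mmap() and mprotect(). This is only an analogy under stated assumptions: the kernel code changes page table entries directly through pram_writeable(), takes a spinlock, and flushes the TLB, none of which appears here. The function names are hypothetical.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Map len bytes of anonymous memory read-only; returns NULL on failure. */
static void *sk_map_readonly(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/*
 * Make the region writeable, copy n bytes to offset off, then make the
 * region read-only again. Returns 0 on success, -1 on failure.
 */
static int sk_protected_write(void *region, size_t region_len, size_t off,
                              const void *src, size_t n)
{
    assert(off <= region_len && n <= region_len - off);
    if (mprotect(region, region_len, PROT_READ | PROT_WRITE) != 0)
        return -1;
    memcpy((char *)region + off, src, n);
    return mprotect(region, region_len, PROT_READ);
}
```

Outside of sk_protected_write(), a stray store into the region faults instead of silently modifying it, which is the userspace counterpart of the kernel page protection fault that PRAMFS relies on.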
In this section we do a code walk-through on a sample
filesystem
method. We will choose the write() method, which is a
method in struct file_operations.
This method is chosen because it involves more filesystem operations
than any other method. In this walk-through, we assume a new regular
file is being created, and then written to. A simple example of a shell
command that would cause this to happen is echo hello >
hello.txt. This case will walk us through not only the write()
method, but also inode creation and linking into the parent directory.
The first entry into PRAMFS from the command echo
hello > hello.txt is to the method pram_create()
in struct inode_operations
for directory inodes. The task of this method is to create a new inode
for a regular file in a given directory. The first thing the method
does is to call pram_new_inode() to allocate a new inode.
pram_new_inode() calls the kernel service new_inode()
to allocate a new struct inode for the virtual filesystem
layer. Next, the free inodes count is checked in the PRAMFS super-block
(s_free_inodes_count)
to verify there are free inodes available in the inode table. If there
are, the index of the first free inode in the table is located. A free
inode is characterized by a zero hard link count (i_links_count
= 0), and either a file type of zero (i_mode = 0), or a
marked deletion time (i_dtime != 0). Once the index of a
free inode is located, the struct inode object is filled
in with initial values. This inode is then converted to a PRAMFS inode
and copied into the located index in the PRAMFS inode table.
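The free-inode test described above can be written as a small predicate. The sketch below uses a reduced stand-in struct with only the three fields involved and a linear scan from index 0; both the names and the scan strategy are assumptions (the real code may start from s_free_inode_hint).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-in for a PRAMFS inode: only the fields used by the test. */
struct sk_inode_state {
    uint16_t i_links_count;
    uint16_t i_mode;
    uint32_t i_dtime;
};

/* Free means: no links, and either no file type or a deletion time set. */
static bool sk_inode_is_free(const struct sk_inode_state *pi)
{
    return pi->i_links_count == 0 &&
           (pi->i_mode == 0 || pi->i_dtime != 0);
}

/* Return the index of the first free inode in the table, or -1 if none. */
static long sk_find_free_inode(const struct sk_inode_state *tbl, size_t n)
{
    assert(tbl != NULL);
    for (size_t i = 0; i < n; i++)
        if (sk_inode_is_free(&tbl[i]))
            return (long)i;
    return -1;
}
```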
If pram_new_inode() is successful, pram_create()
then sets the inode's inode and file method pointers to those for a
PRAMFS regular file, and then links the new PRAMFS inode into the given
parent directory with a call to pram_add_nondir(). This
routine calls pram_add_link(),
which does the actual linking of the new inode into the parent
directory's doubly-linked inode list. Then a new directory entry is
instantiated in the VFS layer's dentry cache, and pram_add_nondir()
returns. This completes the creation of the new inode for the regular
file named hello.txt.
The next entry into PRAMFS is to pram_open_file(). This
method simply forces the flag O_DIRECT
on, and then calls the generic open file method. Therefore, all
subsequent I/O on the file will use direct I/O.
Then, generic_file_write() is called. All
the standard checks are done, such as
verifying that the user buffer is accessible and that the file position
that we are writing to is valid. Then, since the O_DIRECT flag is set in the
file descriptor, the PRAMFS direct_IO()
method is called.
pram_direct_IO() is the workhorse regular
file access
method for the PRAMFS. First,
the beginning file block number and the byte offset within that first
block are calculated, based on the given file offset. Then the number of
blocks that will be accessed is
calculated based on the access length. If a write is being performed, pram_alloc_blocks()
(described above) is called to allocate the blocks we'll need for the
write.
With the data blocks now available, pram_direct_IO()
executes a while loop to transfer all requested bytes to/from the user
buffer from/to the inode's data blocks. At every while loop iteration,
either the remainder of the data is transferred, or an entire block
size chunk is transferred. At the start of each iteration, a call to
pram_find_data_block() is made to translate
the file block number to a logical block number.
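The way such a loop carves a request into per-block chunks can be sketched as follows. This is not pram_direct_IO(); it only models the offset arithmetic, and the callback stands in for the "look up the block, then copy" step. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Called once per chunk: file block number, offset in block, byte count. */
typedef void (*sk_chunk_fn)(uint64_t file_blk, uint32_t blk_off,
                            size_t len, void *ctx);

/*
 * Walk a transfer of `length` bytes starting at file offset `offset`,
 * one block (or partial block) at a time. Returns the number of chunks.
 */
static size_t sk_for_each_chunk(uint64_t offset, size_t length,
                                uint32_t block_size,
                                sk_chunk_fn fn, void *ctx)
{
    assert(block_size > 0);
    uint64_t blk = offset / block_size;               /* first file block */
    uint32_t blk_off = (uint32_t)(offset % block_size);
    size_t chunks = 0;

    while (length > 0) {
        size_t n = block_size - blk_off;  /* room left in this block */
        if (n > length)
            n = length;                   /* remainder of the data */
        if (fn)
            fn(blk, blk_off, n, ctx);     /* translate block, then copy */
        length -= n;
        blk++;
        blk_off = 0;                      /* later blocks start at byte 0 */
        chunks++;
    }
    return chunks;
}
```

For instance, a 3000-byte transfer starting at offset 1000 with 1024-byte blocks is split into chunks of 24, 1024, 1024, and 928 bytes.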
Note how these methods are written such that all
accesses to objects in the PRAMFS completely bypass the page and buffer
caches. Data moves directly from the user buffer to PRAMFS data blocks,
with the file data never existing in any intermediate kernel buffers or
caches.
PRAMFS currently requires one mount option and
accepts several
optional ones:
- The mount option "physaddr=" is a required option.
This tells PRAMFS the physical address of the start of the RAM that makes up the filesystem.
- The mount option "init=" is optional, and is used to
initialize an
empty filesystem. Any data in an existing filesystem will be lost if
this option is given. The parameter to "init=" is the RAM size in bytes.
- The mount option "bs=" is optional, and is used to
specify a block
size. It is ignored if the "init=" option is not specified, since
otherwise the block size is read from the PRAMFS super-block. The
default blocksize is 2048 bytes, and the allowed block sizes are 512,
1024, 2048, and 4096.
- The mount option "bpi=" is optional, and is used to
specify the
bytes per inode ratio, i.e. for every N bytes in the filesystem, an
inode will be created. This behaves the same as the "-i" option to
mke2fs. It is ignored if the "init=" option is not specified.
- The mount option "N=" is optional, and is used to
specify the
number of inodes to allocate in the inode table. If the option is not
specified, the bytes-per-inode ratio is used to calculate the number
of inodes. If neither the "N=" nor the "bpi=" option is specified, the
default behavior is to reserve 5% of the total space in the filesystem
for the inode table. This option behaves the same as the "-N" option to
mke2fs. It is ignored if the "init=" option is not specified.
- The mount option "errors=" is optional, and is used to
specify the filesystem behaviour in case of error. It can be "cont", "remount-ro" or "panic". With the
first value, no action is taken in case of error. With the second,
the filesystem is remounted read-only. With the third, a kernel
panic is triggered. The default action is to continue on error.
- The mount options "acl/noacl" are optional. They are used to enable/disable support for access control lists (disabled by default).
- The mount options "user_xattr/nouser_xattr" are optional. They are used to enable/disable support for user extended attributes (disabled by default).
- The mount option "noprotect" is optional. It is used to disable the memory protection (enabled by default).
- The mount option "xip" is optional. It is used to enable execute-in-place (disabled by default).
Example:
mount -t pramfs -o physaddr=0x20000000,init=1M,bs=1k none /mnt/pram
This example locates the filesystem at physical address
0x20000000, and also requests an empty filesystem be initialized, of
total size 1048576 bytes and blocksize 1024. The mount point is
/mnt/pram.
mount -t pramfs -o physaddr=0x20000000 none /mnt/pram
This example locates the filesystem at physical address
0x20000000 as in the first example, but uses the intact filesystem that
already exists.
The following operations should be verified on a mounted
PRAMFS:
- A mounted PRAMFS should pass the Bonnie++ benchmark
tests. In this example bonnie++ command, it is assumed that the total
PRAMFS filesystem is at least 1MB in size, and that there are at least
2048 inodes available:
bonnie++ -u root -s 1 -r 0 -n 2 -d /mnt/pram
- Errant writes by the kernel into any area within the
PRAMFS
memory should cause a kernel page fault exception, and should not
corrupt the filesystem. This can be tested with a test module. The test module
can be compiled into the kernel by enabling the option "PRAMFS Test" in the kconfig
menu. The module will attempt a write within the pramfs memory and should
cause a kernel page protection fault ("Unable to handle kernel paging
request at virtual address ..."). Then reboot the system, remount the
pramfs filesystem, and verify that no filesystem data has been
corrupted.
Architecture-dependent Requirements for PRAMFS
Like the other filesystems, the source code under
fs/pramfs is
architecture independent, and simply needs to be recompiled for a
specific architecture. However, in kernel version 2.4, PRAMFS does
require a few kernel
services that are architecture dependent in order to support the write
protection feature:
This is a new function that is identical to the existing
__ioremap()
method in every respect, except that the page table entries that map
the IO memory must be marked read-only. The
method is used by PRAMFS to initially map the PRAMFS memory as
read-only at mount time.
- set_memory_ro(), set_memory_rw()
For PRAMFS, these existing routines are needed to
turn the memory protection on and off. Some architectures, such as x86, already provide
them.
Here we show some experimental results with PRAMFS. The following graphs show the effect
of using the XIP feature. This test was performed with the bonnie++ benchmark in a real
embedded environment.
The following graphs compare a ramdisk with ext2 against PRAMFS. This time we worked in an
emulated environment. The absolute values are not very important; what matters is the relative difference between the two filesystems.
In this case we can see the overhead of applying disk-oriented policies to RAM, which shows how and why PRAMFS matters in a RAM environment, where the rules and constraints are completely different from those of disks.
1. For small mounted PRAMFS filesystems, bonnie++ fills
up the
filesystem before the tests complete. This is not a bug in PRAMFS, but
rather a limitation of the bonnie++ program.