This specification describes the design for a protected,
persistent
RAM-based special filesystem in Linux (PRAMFS).
PRAMFS is designed to be a full-featured, light-weight,
and space-efficient RAM-based non-volatile filesystem. PRAMFS is also
designed in such a way as to minimize the risk of filesystem corruption
due to errant writes caused by kernel bugs.
The RAM used for the PRAMFS filesystem must meet the following
requirements:
- The RAM must be directly addressable.
- The RAM must have access times comparable to normal
system memory.
- In order for the RAM to be write protected, it must
be addressable by the CPU only via page tables.
To help meet the "protected" requirement, PRAMFS attempts to minimize
the time windows in which filesystem data is present in unprotected
kernel buffers. Therefore PRAMFS does not maintain file data in the
page caches for normal file I/O. Since a central assumption of the
PRAMFS is that the RAM
used for the filesystem is comparable in speed to the system memory, it
is OK and in fact desirable to do this. There is no point in caching
file
data when the backing store is as fast or faster than the cache
memory itself!
PRAMFS accomplishes this by making use of the direct
I/O feature in Linux,
thus guaranteeing that
file data will be transferred directly from the user-level buffers to
the
filesystem and vice-versa, with no intermediate buffers. However in
PRAMFS direct I/O is enabled across all files in the filesystem, in
other words the O_DIRECT flag is forced on every open of a PRAMFS file.
Also, file I/O in the PRAMFS always occurs synchronously. There is no
need to block the current process while the transfer to/from the PRAMFS
is in progress, since one of the requirements of the PRAMFS is that the
filesystem exist in fast RAM. So file I/O in PRAMFS is always
direct, synchronous, and never blocks.
One approach for a non-volatile RAM filesystem is to
write a non-volatile RAM block device driver, and then mount a
disk-based filesystem over it. The advantages of PRAMFS over this
approach include:
- Disk-based filesystems such as ext2/ext3 use the page
cache for file I/O. PRAMFS never uses the page cache for file I/O. This
frees up the
page cache for use by other parts of the system that really need it
(disk-based filesystem data, pages from swapped-out processes, IPC pages, etc.). Also, this protects
the filesystem against possible page cache corruption caused by kernel
bugs.
- Disk-based filesystems such as ext2/ext3 were designed for optimum
performance on disk-based media, and so implement features such as
block groups, which attempt to group inode data into a contiguous set
of data blocks to minimize disk seeks when accessing files. Since there
is no such performance penalty for random access on RAM-based media,
such features as block groups are not used in PRAMFS, which reduces
filesystem complexity and in turn increases the efficient use of space
on the media (i.e. more space is dedicated to actual file data storage
and less to the meta-data needed to maintain that file data).
PRAMFS Special Filesystem
In this section we discuss the details of the PRAMFS
special filesystem design. First, the general layout of data objects is
described, followed by a description of the information contained in
the data objects themselves. Next comes a discussion of how blocks are
allocated for inodes. Then the directory tree structure is described,
followed by the details of write protecting the filesystem data, and
finally a walk-through of some important filesystem methods.
Refer to the following figure for the data layout.
- Super Block (SB):
the super block is 128 bytes long, and exists at the very beginning of
the filesystem. There are no repeats of the super block.
- Inode Table: the inode table consists of Ni inodes,
and each inode
is 128 bytes long. Therefore the inode table is 128*Ni bytes in size.
The number of inodes is calculated such that the end of the table
occurs on a block boundary. The inode table size is fixed, that is the
maximum number of inodes that can be allocated is Ni, and Ni cannot be
changed after the filesystem is created.
- Data Blocks: The remaining space in the filesystem
after the inode
table consists of data blocks for file data and the in-use bitmap
(discussed next). In the figure, b is the block size in bytes, and N is
the total number of data blocks. Like Ni, N is fixed, that is, once the
filesystem is created, the maximum number of data blocks that can be
allocated is fixed.
- Block In-Use Bitmap: For every block, there is a bit
in the bitmap
signifying whether that block is in-use by a file (set) or not in use
(cleared). Therefore the bitmap requires N bits, or N/8 bytes, or
N/(8b) blocks. N includes the bitmap blocks, so when the filesystem is
created, the first N/(8b) bits are set in the bitmap, which marks the
blocks that make up the bitmap as in use, and these blocks can never be
freed.
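The layout arithmetic above can be sketched in user-space C. This is a minimal sketch under stated assumptions, not the PRAMFS implementation: the names pram_layout and pram_compute_layout are hypothetical, rounding the inode table up to the next block boundary is an assumption (the text only says the table ends on a block boundary), and the block size is assumed to be a multiple of 128.

```c
#include <assert.h>
#include <stdint.h>

#define PRAM_SB_SIZE    128u  /* super block size, from the text */
#define PRAM_INODE_SIZE 128u  /* inode size, from the text */

/* Hypothetical summary of the on-media layout. */
struct pram_layout {
    uint32_t inodes;         /* Ni after rounding */
    uint32_t data_start;     /* offset of the first data block */
    uint32_t blocks;         /* N, bitmap blocks included */
    uint32_t bitmap_blocks;  /* ceil(N / (8 * b)) */
};

struct pram_layout pram_compute_layout(uint32_t fs_size, uint32_t b,
                                       uint32_t want_inodes)
{
    struct pram_layout l;
    uint32_t end = PRAM_SB_SIZE + want_inodes * PRAM_INODE_SIZE;

    assert(b >= PRAM_INODE_SIZE && b % PRAM_INODE_SIZE == 0);

    /* Assumption: grow the inode table until it ends on a block boundary. */
    end = (end + b - 1) / b * b;
    assert(fs_size > end);

    l.data_start = end;
    l.inodes = (end - PRAM_SB_SIZE) / PRAM_INODE_SIZE;
    l.blocks = (fs_size - end) / b;
    /* One bit per block, so N bits = N/8 bytes = N/(8b) blocks, rounded up. */
    l.bitmap_blocks = (l.blocks + 8 * b - 1) / (8 * b);
    return l;
}
```

For a 0x2F000-byte filesystem with 1024-byte blocks and a request for 100 inodes, this sketch yields 103 inodes, 175 data blocks, and a one-block bitmap.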
The super block
object contains information that pertains to the whole filesystem. Such
information includes the block size of the filesystem in bytes, the
total number of inodes and blocks (Ni and N), a count of the current
free inodes and data blocks, etc. The PRAMFS super block structure is
shown here:
#define PRAM_SB_SIZE 128 // must be power of two
#define PRAM_SB_BITS 7

typedef unsigned long pram_off_t;

/*
 * Structure of the super block in PRAMFS
 */
struct pram_super_block {
    __u32 s_size;              /* total size of fs in bytes */
    __u32 s_blocksize;         /* blocksize in bytes */
    __u32 s_features;          /* feature flags */
    __u32 s_inodes_count;      /* total inodes count (used or free) */
    __u32 s_free_inodes_count; /* free inodes count */
    __u32 s_free_inode_hint;   /* start hint for locating free inodes */
    __u32 s_blocks_count;      /* total data blocks count (used or free) */
    __u32 s_free_blocks_count; /* free data blocks count */
    __u32 s_free_blocknr_hint; /* start hint for locating free blocks */
    pram_off_t s_bitmap_start; /* data block in-use bitmap location */
    __u32 s_bitmap_blocks;     /* size of bitmap in number of blocks */
    __u32 s_mtime;             /* Mount time */
    __u32 s_wtime;             /* Write time */
    __u32 s_rev_level;         /* Revision level */
    __u16 s_magic;             /* Magic signature */
    __u16 s_state;             /* File system state */
    __u16 s_errors;            /* Behaviour when detecting errors */
    char s_volume_name[16];    /* volume name */
    __u32 s_sum;               /* checksum of this sb, including padding */
};
The data type pram_off_t is an offset
pointer type for
PRAMFS. These are simply 32-bit offsets from the beginning of the
filesystem, and are used to locate data objects in the filesystem
(inodes, data blocks, in-use bitmap, etc.).
In PRAMFS, directory entry information, such as the file
name and owning inode, is contained within the inodes themselves.
This presents a problem only for hard links, so PRAMFS does not support
hard links. If at some time hard link support is desired, PRAMFS will
instead use the more traditional model of maintaining directory entry
info separate from inodes.
The PRAMFS inode structure is
reprinted here:
#define PRAM_INODE_SIZE 128 // must be power of two
#define PRAM_INODE_BITS 7

/*
 * Structure of a directory entry in PRAMFS.
 * Offsets are to the inode that holds the referenced dentry.
 */
struct pram_dentry {
    pram_off_t d_next;   /* next dentry in this directory */
    pram_off_t d_prev;   /* previous dentry in this directory */
    pram_off_t d_parent; /* parent directory */
    char d_name[0];
};

/*
 * Structure of an inode in PRAMFS
 */
struct pram_inode {
    __u32 i_sum;         /* checksum of this inode */
    __u32 i_uid;         /* Owner Uid */
    __u32 i_gid;         /* Group Id */
    __u16 i_mode;        /* File mode */
    __u16 i_links_count; /* Links count */
    __u32 i_blocks;      /* Blocks count */
    __u32 i_size;        /* Size of data in bytes */
    __u32 i_atime;       /* Access time */
    __u32 i_ctime;       /* Creation time */
    __u32 i_mtime;       /* Modification time */
    __u32 i_dtime;       /* Deletion Time */

    union {
        struct {
            /*
             * ptr to row block of 2D block pointer array,
             * file block #'s 0 to (blocksize/4)^2 - 1.
             */
            pram_off_t row_block;
        } reg; // regular file or symlink inode
        struct {
            pram_off_t head; /* first entry in this directory */
            pram_off_t tail; /* last entry in this directory */
        } dir;
        struct {
            __u32 rdev; /* major/minor # */
        } dev; // device inode
    } i_type;
    struct pram_dentry i_d;
};
Notice the i_type union member. The valid
elements of the
union depend on the file's type as contained in i_mode.
For instance, a directory file has valid information in i_type.dir,
and the other elements of the union are invalid.
In PRAMFS, as in most other filesystems, the inode
number of an inode is simply the absolute offset (pram_off_t)
of that inode from the beginning of the filesystem.
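A short sketch of what "inode number equals offset" implies in practice. The helper names below are hypothetical, and two things are assumed rather than stated by the text: that the inode table starts immediately after the 128-byte super block, and that offset 0 can double as a null value (offset 0 is the super block, so no inode or data block lives there).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PRAM_SB_SIZE    128u
#define PRAM_INODE_BITS 7

typedef uint32_t pram_off_t;  /* 32-bit offset from the start of the fs */

/* Inode number (absolute offset) of the inode in table slot `slot`. */
pram_off_t pram_slot_to_ino(uint32_t slot)
{
    return PRAM_SB_SIZE + ((pram_off_t)slot << PRAM_INODE_BITS);
}

/* Inverse mapping: table slot for a given inode number. */
uint32_t pram_ino_to_slot(pram_off_t ino)
{
    assert(ino >= PRAM_SB_SIZE && (ino & ((1u << PRAM_INODE_BITS) - 1)) == 0);
    return (ino - PRAM_SB_SIZE) >> PRAM_INODE_BITS;
}

/* Turn an offset into a pointer inside the mapped filesystem image. */
void *pram_off_to_ptr(void *fs_base, pram_off_t off)
{
    return off ? (void *)((char *)fs_base + off) : NULL;
}
```

Because the inode number is the offset itself, locating an inode needs no table lookup beyond adding the offset to the base address of the mapping.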
In PRAMFS, only regular files own file data (directories
never own data
blocks, this will be discussed later). The inode field i_type.reg.row_block
points to the start of a 2-dimensional table of data block pointers. A
single block is allocated for the row block, and therefore contains b/4
32-bit pointers that point to up to b/4 column blocks. Each column
block holds up to b/4 pointers to data blocks. In this way a regular
file can contain up to (b/4)^2 data blocks, or b^3/16 bytes of data.
For those familiar with the EXT2 filesystem, i_type.reg.row_block
is equivalent to the i_block[13] entry in the EXT2 inode structure. The EXT2 inode's i_block[0-11]
entries point directly to data blocks, the reason being that, for small
files, the first 12 data blocks can be located in a single disk seek.
For PRAMFS, however, there is no speed penalty for random access, so
direct pointers to data blocks are not necessary, and omitting them
simplifies the methods for locating data blocks. Also, higher order
tables (such as EXT2's 3-dimensional i_block[14]) are not deemed
necessary in PRAMFS because it is not envisioned that so much
persistent RAM would be available to hold such
large files.
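The capacity of the 2-dimensional table is easy to check numerically. The following sketch simply evaluates (b/4)^2 blocks and b^3/16 bytes for a given block size; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Maximum number of data blocks addressable through the 2D table. */
uint64_t pram_max_file_blocks(uint32_t b)
{
    uint64_t per_block;

    assert(b % 4 == 0);
    per_block = b / 4;             /* 32-bit pointers per block */
    return per_block * per_block;  /* row entries times column entries */
}

/* Maximum file data in bytes: (b/4)^2 * b = b^3/16. */
uint64_t pram_max_file_bytes(uint32_t b)
{
    return pram_max_file_blocks(b) * b;
}
```

With b = 1024 this gives 65536 data blocks, or 64 MiB of file data; with b = 512 the limit is 8 MiB.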
A note about block numbers. An offset pointer to a block
is sometimes referred to as a logical block number.
Given a block index from 0 to N-1, it's a simple matter to convert the
index into a logical block number: it's just the start offset of data
blocks plus the index times the blocksize, or s_bitmap_start +
(index * s_blocksize).
However when accessing data blocks for a
file, we usually use a file block number,
which is the relative position of the block inside the file. To find
the absolute logical block number corresponding to a file block index
from 0 to (b/4)^2 - 1, we use the inode's 2-dimensional block pointer
table. For instance, say we are looking for the block at file block
index 359, and the blocksize is b=1024. This means that a single block
can hold 256 logical block numbers, and the logical block number for
file block index 359 is therefore located at i_type.reg.row_block[1][103],
that is, entry 103 within the second column block. This algorithm is
implemented by the function pram_find_data_block()
in fs/pramfs/inode.c, which takes as arguments the inode and the file
block index and returns the corresponding logical block number.
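The lookup can be modelled in user space as follows. This is only a sketch of the table walk described above, not the code in fs/pramfs/inode.c: the function name find_data_block and its signature are hypothetical, the filesystem is represented as a plain byte array, and a zero pointer is assumed to mean "not allocated".

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t pram_off_t;

/*
 * Translate a file block index into a logical block number (an offset
 * from the start of the filesystem image `fs`). Returns 0 if the block
 * is not allocated or the index is out of range.
 */
pram_off_t find_data_block(const uint8_t *fs, pram_off_t row_block,
                           uint32_t blocksize, uint32_t file_blocknr)
{
    uint32_t per_block = blocksize / sizeof(pram_off_t);
    uint32_t row = file_blocknr / per_block;  /* which column block */
    uint32_t col = file_blocknr % per_block;  /* entry within it */
    pram_off_t col_block, data_block;

    if (row_block == 0 || row >= per_block)
        return 0;

    /* Row block entry `row` points at a column block. */
    memcpy(&col_block, fs + row_block + row * sizeof(pram_off_t),
           sizeof(col_block));
    if (col_block == 0)
        return 0;

    /* Column block entry `col` points at the data block. */
    memcpy(&data_block, fs + col_block + col * sizeof(pram_off_t),
           sizeof(data_block));
    return data_block;
}
```

For file block index 359 and b = 1024, the walk reads entry 1 of the row block and entry 103 of the column block it points to, matching the worked example.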
The organization of the inode logical block pointer
table is illustrated in the figure below. Arrows in the figure
represent pram_off_t
pointers, and entries in the column blocks are marked with their file
block index, and are pointing to data blocks assigned to them.
To allocate a new block, a search is made for the first
cleared bit
in the in-use bitmap. The located bit number is also the logical block
index of the located free block. The bit is then set in the in-use
bitmap to mark the corresponding block as in use. This algorithm is
implemented in the function pram_new_block() in
fs/pramfs/balloc.c, which returns the logical block index of the block
that was just allocated.
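A first-fit bitmap search of this kind can be sketched as below. The names are hypothetical and this is not the code in fs/pramfs/balloc.c; in particular, the bit order within each byte (least significant bit first) is an assumption, and the real function may start its search from the hint stored in the super block. The second helper applies the index-to-offset formula given earlier.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Find the first cleared bit among the first `nbits` bits, set it, and
 * return its index. Returns -1 if every block is already in use.
 */
long bitmap_alloc_first_free(uint8_t *bitmap, uint32_t nbits)
{
    uint32_t i;

    assert(bitmap != 0);
    for (i = 0; i < nbits; i++) {
        uint8_t mask = (uint8_t)(1u << (i % 8));
        if (!(bitmap[i / 8] & mask)) {
            bitmap[i / 8] |= mask;  /* mark the block as in use */
            return (long)i;
        }
    }
    return -1;
}

/* Logical block number for a block index: s_bitmap_start + index * b. */
uint32_t block_index_to_offset(uint32_t bitmap_start, uint32_t blocksize,
                               uint32_t index)
{
    return bitmap_start + index * blocksize;
}
```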
The function pram_new_block() is used by
the higher-level function pram_alloc_blocks().
The job of this function is to allocate data blocks for an inode. It
will allocate a set of data blocks starting at a given file block
index. Note that this function must take care of allocating the row and
column blocks that make up the 2D block pointer table. Any unallocated
file blocks before the starting file block index are allocated. All
allocated blocks except the last are zeroed out. pram_alloc_blocks()
in turn is used by the struct
file_operations write() method (discussed below).
All inodes (of any type) within a directory are linked
together in a doubly-linked list, where the i_d.d_next and i_d.d_prev
fields of the inodes point to the next and previous inodes within the
directory. The d_prev pointer of the first inode and the d_next
pointer of the last inode are null (zero), which terminates the list.
The parent directory inode holds pointers to the head
and tail of the doubly-linked list contained in that directory (i_type.dir.head
and i_type.dir.tail, respectively). If the directory is
empty, i_type.dir.head and i_type.dir.tail
are both zero.
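The linking scheme can be illustrated with a small user-space model. Everything here is hypothetical: offsets are modelled as indices into an array with 0 meaning null, the struct keeps only the link fields, and appending at the tail is an assumption (the text does not say where a new inode is inserted in the list).

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t pram_off_t;

/* Only the fields that take part in directory linking. */
struct model_inode {
    pram_off_t d_next;    /* next inode in the same directory */
    pram_off_t d_prev;    /* previous inode in the same directory */
    pram_off_t d_parent;  /* parent directory inode */
    pram_off_t dir_head;  /* directories only: first entry */
    pram_off_t dir_tail;  /* directories only: last entry */
};

/* Append inode `child` to the list owned by directory inode `dir`. */
void model_add_link(struct model_inode *tbl, pram_off_t dir, pram_off_t child)
{
    struct model_inode *d = &tbl[dir];
    struct model_inode *c = &tbl[child];

    assert(dir != 0 && child != 0);
    c->d_parent = dir;
    c->d_next = 0;                        /* new last entry */
    c->d_prev = d->dir_tail;
    if (d->dir_tail)
        tbl[d->dir_tail].d_next = child;  /* chain after the old tail */
    else
        d->dir_head = child;              /* directory was empty */
    d->dir_tail = child;
}
```

Listing a directory then amounts to starting at dir_head and following d_next until it reaches zero.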
Other filesystem implementations, such as EXT2,
use directory entry objects ("dentries") to associate file
names with inodes. In EXT2
for instance, a dentry holds the file name, the inode number to
associate the file with, and the file type, and these dentries are
stored in data blocks owned by the parent directory. In PRAMFS,
directory inodes do not need to own any data blocks, because all
dentry information is contained within the inodes themselves.
Because PRAMFS attempts to
avoid filesystem corruption caused by kernel bugs, dirty pages in the
page cache are not allowed to be written back to the backing-store RAM.
This means that only private file mappings are supported. This way, an
errant write into the page cache will not get
written back to the filesystem.
This is accomplished by implementing the readpage() method
in the PRAMFS address_space object, but not the writepage()
method.
In addition to the software protection features already
discussed
(i.e. avoiding the page cache for file I/O, and allowing only private
mappings), the hardware protection
feature utilizes the
system's paging unit by mapping the I/O memory pages initially as
read-only. Any writes to objects in the PRAMFS
first mark the corresponding page table entries as writeable, perform
the write, and then mark the pages as read-only again. This operation
is done atomically and non-reentrantly by holding the page-table
spin-lock with
interrupts disabled. Also, when the write operation completes, any
stale entries in the system TLB that are still marking the pages as
writeable are flushed.
PRAMFS can disable the hardware write protection feature
with the kernel config option CONFIG_PRAMFS_NOWP. This is useful for
memory that is mapped without page tables, for instance memory that
lives in the first 512M of physical address space in MIPS.
The function that sets the writeable flag for the
filesystem's pages is pram_writeable() in
fs/pramfs/wprotect.c. This function is used by a set of macro functions
defined in include/linux/pram_fs.h:
- pram_lock_super(ps) and pram_unlock_super(ps).
The pram_lock_super() macro acquires the init_mm.page_table_lock
spin-lock and disables interrupts, and then marks the pages that
contain the given PRAMFS super-block (ps) as writeable. In turn, pram_unlock_super()
recalculates the check-sum for the super-block (pram_sync_super),
marks the pages read-only, and then releases the spin-lock and restores
the system's interrupt flags. All writes to the PRAMFS super-block are
bracketed by these two macro functions. Thus the write operations must
always be done quickly.
- pram_lock_inode(pi) and pram_unlock_inode(pi).
These macros perform the same operations as the super-block macros
above. The only difference is that the PRAMFS inode size is passed to pram_writeable()
instead of the super-block size. All writes to PRAMFS inodes are
bracketed by these macros.
- pram_lock_block(sb,bp) and pram_unlock_block(sb,bp).
Again, much the same as the above macros, except that we pass the
blocksize to pram_writeable(). Also, blocks are not
check-summed. All writes to PRAMFS blocks are bracketed by these
macros.
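The unprotect, write, re-protect bracket can be demonstrated with a user-space analogy based on mmap() and mprotect(). This is not how the kernel macros are implemented (they manipulate kernel page table entries under a spin-lock); it is only a sketch of the same pattern, and the function names are hypothetical.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a region that is read-only by default, like the PRAMFS pages. */
void *protected_region_create(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/*
 * Bracketed write: make the region writeable, copy the data in, and
 * make it read-only again. Returns 0 on success, -1 on failure.
 */
int protected_store(void *region, size_t region_len, size_t off,
                    const void *src, size_t len)
{
    assert(region != NULL && src != NULL);
    if (off > region_len || len > region_len - off)
        return -1;
    if (mprotect(region, region_len, PROT_READ | PROT_WRITE) != 0)
        return -1;
    memcpy((char *)region + off, src, len);
    return mprotect(region, region_len, PROT_READ);
}
```

Any store into the region outside protected_store() faults, which is the user-space counterpart of the page protection fault an errant kernel write into PRAMFS memory is expected to raise.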
In this section we do a code walk-through on a sample
filesystem
method. We will choose the write() method, which is a
method in struct file_operations.
This method is chosen because it involves more filesystem operations
than any other method. In this walk-through, we assume a new regular
file is being created, and then written to. A simple example of a shell
command that would cause this to happen is echo hello >
hello.txt. This case will walk us through not only the write()
method, but also inode creation and linking into the parent directory.
The first entry into PRAMFS from the command echo
hello > hello.txt is to the method pram_create()
in struct inode_operations
for directory inodes. The task of this method is to create a new inode
for a regular file in a given directory. The first thing the method
does is to call pram_new_inode() to allocate a new inode.
pram_new_inode() calls the kernel service new_inode()
to allocate a new struct inode for the virtual filesystem
layer. Next, the free inodes count is checked in the PRAMFS super-block
(s_free_inodes_count)
to verify there are free inodes available in the inode table. If there
are, the index of the first free inode in the table is located. A free
inode is characterized by a zero hard link count (i_links_count
= 0) , and either a file type of zero (i_mode = 0), or a
marked deletion time (i_dtime != 0). Once the index of a
free inode is located, the struct inode object is filled
in with initial values. This inode is then converted to a PRAMFS inode
and copied into the located index in the PRAMFS inode table.
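The free-inode test described above translates directly into a predicate. The sketch below uses a cut-down, hypothetical struct holding only the three fields involved; starting the scan at a hint and wrapping around is an assumption based on the s_free_inode_hint field, not something the text spells out.

```c
#include <assert.h>
#include <stdint.h>

/* Only the fields needed to decide whether an inode slot is free. */
struct model_pram_inode {
    uint16_t i_mode;         /* file mode, 0 if never used */
    uint16_t i_links_count;  /* hard link count */
    uint32_t i_dtime;        /* deletion time, non-zero if deleted */
};

/* Free means: no links, and either never used or marked deleted. */
int inode_is_free(const struct model_pram_inode *pi)
{
    return pi->i_links_count == 0 && (pi->i_mode == 0 || pi->i_dtime != 0);
}

/* Scan the table from `hint`, wrapping around; -1 if nothing is free. */
long find_free_inode(const struct model_pram_inode *tbl, uint32_t n,
                     uint32_t hint)
{
    uint32_t i;

    assert(tbl != 0);
    if (n == 0)
        return -1;
    for (i = 0; i < n; i++) {
        uint32_t slot = (hint + i) % n;
        if (inode_is_free(&tbl[slot]))
            return (long)slot;
    }
    return -1;
}
```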
If pram_new_inode() is successful, pram_create()
then sets the inode's inode and file method pointers to those for a
PRAMFS regular file, and then links the new PRAMFS inode into the given
parent directory with a call to pram_add_nondir(). This
routine calls pram_add_link()
which does the actual linking of the new inode into the parent
directory's doubly-linked inode list. Then a new directory entry is
instantiated into the VFS layer's dentry cache, and pram_add_nondir()
returns. This completes the creation of the new inode for the regular
file named hello.txt.
The next entry into PRAMFS is to pram_open_file(). This
method simply forces the flag O_DIRECT
on, and then calls the generic open file method. Therefore, all
subsequent I/O on the file will use direct I/O.
Then, generic_file_write() is called. All
the standard checks are done, such as
verifying that the user buffer is accessible and that the file position
that we are writing to is valid. Then, since the O_DIRECT flag is set in the
file descriptor, the PRAMFS direct_IO()
method is called.
pram_direct_IO() is the workhorse regular
file access
method for the PRAMFS. First,
the beginning file block number and the byte offset within that first
block are calculated, based on the given file offset. Then the number of
blocks that will be accessed is
calculated based on the access length. If a write is being performed, pram_alloc_blocks()
(described above) is called to allocate the blocks we'll need for the
write.
With the data blocks now available, pram_direct_IO()
executes a while loop to transfer all requested bytes to/from the user
buffer from/to the inode's data blocks. At every while loop iteration,
either the remainder of the data is transferred, or an entire block
size chunk is transferred. At the start of each while loop, the call to
pram_find_data_block() is made to translate
the file block number to a logical block number.
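The block arithmetic in this loop can be sketched independently of the kernel. The following user-space model only computes which file block, offset, and byte count each loop iteration would handle; the names io_chunk and split_io are hypothetical, and the actual copy and the pram_find_data_block() call are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One loop iteration: a contiguous piece within a single block. */
struct io_chunk {
    uint32_t file_blocknr;  /* file block number being accessed */
    uint32_t offset;        /* byte offset within that block */
    uint32_t bytes;         /* bytes transferred in this iteration */
};

/*
 * Split an access of `len` bytes at file position `pos` into per-block
 * chunks. Writes at most `max_chunks` entries and returns the count.
 */
size_t split_io(uint64_t pos, size_t len, uint32_t blocksize,
                struct io_chunk *out, size_t max_chunks)
{
    uint32_t blocknr, offset;
    size_t n = 0;

    assert(blocksize != 0);
    blocknr = (uint32_t)(pos / blocksize);  /* beginning file block */
    offset = (uint32_t)(pos % blocksize);   /* offset in that block */

    while (len > 0 && n < max_chunks) {
        size_t room = blocksize - offset;
        size_t bytes = len < room ? len : room;  /* remainder or full block */

        out[n].file_blocknr = blocknr;
        out[n].offset = offset;
        out[n].bytes = (uint32_t)bytes;
        n++;

        len -= bytes;
        blocknr++;
        offset = 0;  /* later chunks start at a block boundary */
    }
    return n;
}
```

A 2100-byte access at file offset 1000 with 1024-byte blocks touches four blocks: 24 bytes in block 0, two full blocks, and 28 bytes in block 3.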
Note how these methods are written such that all
accesses to objects in the PRAMFS completely bypass the page and buffer
caches. Data moves directly from the user buffer to PRAMFS data blocks,
with the file data never existing in any intermediate kernel buffers or
caches.
The PRAMFS currently requires one mount option, and
there are several
optional mount options:
- The mount option "physaddr=" is a required option.
This tells PRAMFS the physical address of the start of the RAM that makes up the filesystem.
- The mount option "init=" is optional, and is used to
initialize an
empty filesystem. Any data in an existing filesystem will be lost if
this option is given. The parameter to "init=" is the RAM size in bytes.
- The mount option "bs=" is optional, and is used to
specify a block
size. It is ignored if the "init=" option is not specified, since
otherwise the block size is read from the PRAMFS super-block. The
default blocksize is 2048 bytes, and the allowed block sizes are 512,
1024, 2048, and 4096.
- The mount option "bpi=" is optional, and is used to
specify the
bytes per inode ratio, i.e. for every N bytes in the filesystem, an
inode will be created. This behaves the same as the "-i" option to
mke2fs. It is ignored if the "init=" option is not specified.
- The mount option "N=" is optional, and is used to
specify the
number of inodes to allocate in the inode table. If the option is not
specified, the bytes-per-inode ratio is used to calculate the number
of inodes. If neither the "N=" nor the "bpi=" option is specified, the
default behavior is to reserve 5% of the total space in the filesystem
for the inode table. This option behaves the same as the "-N" option to
mke2fs. It is ignored if the "init=" option is not specified.
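The precedence among "N=", "bpi=", and the 5% default can be sketched as a small function. The name pram_choose_inode_count is hypothetical, a value of zero is assumed to mean "option not given", and the result is the requested count before any rounding to a block boundary.

```c
#include <assert.h>
#include <stdint.h>

#define PRAM_INODE_SIZE 128u

/*
 * Pick the number of inodes for a new filesystem of `fs_size` bytes.
 * n_opt and bpi_opt are the "N=" and "bpi=" values, 0 if not given.
 */
uint32_t pram_choose_inode_count(uint32_t fs_size, uint32_t n_opt,
                                 uint32_t bpi_opt)
{
    assert(fs_size != 0);
    if (n_opt)
        return n_opt;             /* explicit inode count wins */
    if (bpi_opt)
        return fs_size / bpi_opt; /* one inode per bpi bytes */
    /* Default: reserve 5% of the total space for the inode table. */
    return (uint32_t)(((uint64_t)fs_size * 5 / 100) / PRAM_INODE_SIZE);
}
```

For the 0x2F000-byte filesystem used in the mount example, the default rule gives 75 inodes, while "bpi=4096" would give 47.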
Example:
mount -t pramfs -o physaddr=0x20000000,init=0x2F000,bs=1024 none /mnt/pram
This example locates the filesystem at physical address
0x20000000, and also requests an empty filesystem be initialized, of
total size 0x2f000 bytes and blocksize 1024. The mount point is
/mnt/pram.
mount -t pramfs -o physaddr=0x20000000 none /mnt/pram
This example locates the filesystem at physical address
0x20000000 as in the first example, but uses the intact filesystem that
already exists.
The following operations should be verified on a mounted
PRAMFS:
- A mounted PRAMFS should pass the Bonnie++ benchmark
tests. In this example bonnie++ command, it is assumed that the total
PRAMFS filesystem is at least 1MB in size, and that there are at least
2048 inodes available:
bonnie++ -u root -s 1 -r 0 -n 2 -d /mnt/pram
- Errant writes by the kernel into any area within the
PRAMFS
memory should cause a kernel page fault exception, and should not
corrupt the filesystem. This can be tested with the following simple
kernel module. Copy the text to a file named "testwrite.c", compile it
natively on the system being tested with the command
gcc -c -D__KERNEL__ -DMODULE -O -Wall testwrite.c
and then install the module with the command insmod -f ./testwrite.o.
The module will attempt a write within the pramfs memory and should
cause a kernel page protection fault ("Unable to handle kernel paging
request at virtual address ..."). Then reboot the system, remount the
pramfs filesystem, and verify that no filesystem data has been
corrupted.
/* compile with: -c -D__KERNEL__ -DMODULE -O -Wall */

#include <linux/module.h>
#include <linux/version.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/pram_fs.h>

static unsigned long off = 0;
MODULE_PARM(off, "i");
MODULE_PARM_DESC(off, "offset within pramfs to attempt a write");

int __init test_pramfs_write(void)
{
    struct super_block * sb;
    struct pram_super_block *psb;
    char * ptr;

    sb = find_pramfs_super();
    if (!sb) {
        printk(KERN_ERR
               "%s: PRAMFS super block not found (not mounted?)\n",
               __func__);
        return 1;
    }
    psb = pram_get_super(sb);
    off = (off < psb->s_size) ? off : psb->s_size-1;
    ptr = (char*)psb + off;

    /*
     * attempt an unprotected write into the pramfs area, this
     * should cause a kernel page protection fault
     */
    printk("%s: writing to kernel VA %p\n", __func__, ptr);
    *ptr = 0;

    return 0;
}

void test_pramfs_write_cleanup(void) {}

/* Module information */
MODULE_LICENSE("GPL");
module_init(test_pramfs_write);
module_exit(test_pramfs_write_cleanup);
Architecture-dependent Requirements for PRAMFS
Like the other filesystems, the source code under
fs/pramfs is
architecture independent, and simply needs to be recompiled for a
specific architecture. However, in kernel version 2.4, PRAMFS does
require a few kernel
services that are architecture dependent in order to support the write
protection feature:
- __ioremap_readonly()
This is a new function that is identical to the existing
__ioremap()
method in every respect, except that the page table entries that map
the IO memory must be marked read-only. The
method is used by PRAMFS to initially map the PRAMFS memory as
read-only at mount time.
- flush_tlb_page(), flush_tlb_mm(), flush_tlb_range()
For PRAMFS, these already existing routines need to
allow flushing the
system's hardware TLB
for memory regions owned by init_mm. Some architectures already allow
this, such as PPC. However most architectures still require that the
caller of these methods have a process context besides the init
process.
The above requirements are not needed if write
protection is disabled with the CONFIG_PRAMFS_NOWP config option.
Note that in kernel version 2.6, new methods exist to flush the TLB for
page table entries specifically owned by kernel mappings. Also, the
need for __ioremap_readonly() has been removed.
Therefore PRAMFS in 2.6 has no special arch-dependent requirements.
1. For small PRAMFS mounted filesystems, bonnie++ fills
up the
filesystem before the tests complete. This is not a bug in PRAMFS, but
rather a limitation of the bonnie++ program.
2. There is some optimization to be done, by
consolidating writes
to the super-block and inodes. This will remove calls to the write
protection routine, and speed up some PRAMFS write operations.