Technical Specification for A Protected Non-Volatile RAM Filesystem:

Introduction

This specification describes the design for a protected, persistent RAM-based special filesystem in Linux (PRAMFS).

PRAMFS is designed to be a full-featured, light-weight, and space-efficient RAM-based non-volatile filesystem. PRAMFS is also designed in such a way as to minimize the risk of filesystem corruption due to errant writes caused by kernel bugs.

System Requirements for PRAMFS

The RAM used for the PRAMFS filesystem must meet the following requirements:

  • The RAM must be directly addressable.
  • The RAM must have access times comparable to normal system memory.
  • In order for the RAM to be write protected, it must be addressable by the CPU only via page tables.

High Level Design

To help meet the "protected" requirement, PRAMFS attempts to minimize the time windows in which filesystem data is present in unprotected kernel buffers. Therefore PRAMFS does not keep file data in the page cache for normal file I/O. Since a central assumption of PRAMFS is that the RAM used for the filesystem is comparable in speed to system memory, bypassing the page cache is acceptable and in fact desirable. There is no point in caching file data when the backing store is as fast as or faster than the cache memory itself!

PRAMFS accomplishes this by making use of the direct I/O feature in Linux, which guarantees that file data is transferred directly between the user-level buffers and the filesystem, with no intermediate buffers. In PRAMFS, however, direct I/O is enabled for every file in the filesystem; in other words, the O_DIRECT flag is forced on at every open of a PRAMFS file. File I/O in PRAMFS also always occurs synchronously. Because one of the requirements of PRAMFS is that the filesystem exist in fast RAM, there is no need to block the current process while a transfer to or from PRAMFS is in progress. File I/O in PRAMFS is therefore always direct, synchronous, and non-blocking.

One approach for a non-volatile RAM filesystem is to write a non-volatile RAM block device driver, and then mount a disk-based filesystem over it. The advantages of PRAMFS over this approach include:

  • Disk-based filesystems such as ext2/ext3 use the page cache for file I/O. PRAMFS never uses the page cache for file I/O. This frees up the page cache for use by other parts of the system that really need it (disk-based filesystem data, pages from swapped-out processes, IPC pages, etc.). Also, this protects the filesystem against possible page cache corruption caused by kernel bugs.

  • Disk-based filesystems such as ext2/ext3 were designed for optimum performance on disk-based media, and so implement features such as block groups, which attempt to group inode data into a contiguous set of data blocks to minimize disk seeks when accessing files. Since there is no such performance penalty for random access on RAM-based media, features such as block groups are not used in PRAMFS. This reduces filesystem complexity and in turn increases the efficient use of space on the media (i.e. more space is dedicated to actual file data storage and less to the meta-data needed to maintain that file data).

PRAMFS Special Filesystem

In this section we discuss the details of the PRAMFS special filesystem design. First, the general layout of data objects is described, followed by a description of the information contained in the data objects themselves. Next comes a discussion of how blocks are allocated for inodes. Then the directory tree structure is described, followed by the details of write protecting the filesystem data, and finally a walk-through of some important filesystem methods.

Data Objects Layout

Refer to the following figure for the data layout.

pramfs_layout.jpg

  • Super Block (SB): the super block is 128 bytes long, and exists at the very beginning of the filesystem. There are no repeats of the super block.

  • Inode Table: the inode table consists of Ni inodes, and each inode is 128 bytes long. Therefore the inode table is 128*Ni bytes in size. The number of inodes is calculated such that the end of the table occurs on a block boundary. The inode table size is fixed; that is, the maximum number of inodes that can be allocated is Ni, and Ni cannot be changed after the filesystem is created.

  • Data Blocks: The remaining space in the filesystem after the inode table consists of data blocks for file data and the in-use bitmap (discussed next). In the figure, b is the block size in bytes, and N is the total number of data blocks. Like Ni, N is fixed, that is, once the filesystem is created, the maximum number of data blocks that can be allocated is fixed.

  • Block In-Use Bitmap: For every block, there is a bit in the bitmap signifying whether that block is in-use by a file (set) or not in use (cleared). Therefore the bitmap requires N bits, or N/8 bytes, or N/(8b) blocks. N includes the bitmap blocks, so when the filesystem is created, the first N/(8b) bits are set in the bitmap, which marks the blocks that make up the bitmap as in use, and these blocks can never be freed.
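
The bitmap sizing described in the last item can be sketched in C as follows. This is only an illustration: the function names are hypothetical, rounding up to whole bytes and whole blocks is an assumption (the text gives N/8 and N/(8b) without stating a rounding rule), and the bit order within a byte is also assumed.

```c
#include <stdint.h>

/* Bytes needed to hold one bit per block (N bits, rounded up; assumed). */
static uint32_t bitmap_bytes(uint32_t n_blocks)
{
    return (n_blocks + 7) / 8;
}

/* Whole blocks needed for the bitmap: N / (8 * b), rounded up (assumed). */
static uint32_t bitmap_blocks(uint32_t n_blocks, uint32_t blocksize)
{
    uint32_t bits_per_block = 8 * blocksize;

    return (n_blocks + bits_per_block - 1) / bits_per_block;
}

/*
 * Mark the first `count` blocks as in use. At creation time `count`
 * would be the number of bitmap blocks, so the bitmap reserves itself.
 * Bit i lives in byte i / 8, at bit position i % 8 (assumed bit order).
 */
static void bitmap_reserve_self(uint8_t *bitmap, uint32_t count)
{
    for (uint32_t i = 0; i < count; i++)
        bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
}
```

With a 1024-byte block size, a single bitmap block covers 8192 data blocks, so small filesystems need only one bitmap block.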

Super Block

The super block object contains information that pertains to the whole filesystem. Such information includes the block size of the filesystem in bytes, the total number of inodes and blocks (Ni and N), a count of the current free inodes and data blocks, etc. The PRAMFS super block structure is shown here:

#define PRAM_SB_SIZE 128 // must be power of two
#define PRAM_SB_BITS 7

typedef unsigned long pram_off_t;

/*
 * Structure of the super block in PRAMFS
 */
struct pram_super_block {
    __u32 s_size;              /* total size of fs in bytes */
    __u32 s_blocksize;         /* blocksize in bytes */
    __u32 s_features;          /* feature flags */
    __u32 s_inodes_count;      /* total inodes count (used or free) */
    __u32 s_free_inodes_count; /* free inodes count */
    __u32 s_free_inode_hint;   /* start hint for locating free inodes */
    __u32 s_blocks_count;      /* total data blocks count (used or free) */
    __u32 s_free_blocks_count; /* free data blocks count */
    __u32 s_free_blocknr_hint; /* start hint for locating free data blocks */
    pram_off_t s_bitmap_start; /* data block in-use bitmap location */
    __u32 s_bitmap_blocks;     /* size of bitmap in number of blocks */
    __u32 s_mtime;             /* Mount time */
    __u32 s_wtime;             /* Write time */
    __u32 s_rev_level;         /* Revision level */
    __u16 s_magic;             /* Magic signature */
    __u16 s_state;             /* File system state */
    __u16 s_errors;            /* Behaviour when detecting errors */
    char s_volume_name[16];    /* volume name */
    __u32 s_sum;               /* checksum of this sb, including padding */
};

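The data type pram_off_t is the offset pointer type for PRAMFS. A pram_off_t value is simply a 32-bit offset from the beginning of the filesystem, and is used to locate data objects in the filesystem (inodes, data blocks, in-use bitmap, etc.). Note that the typedef is unsigned long, so the 32-bit width holds only on targets where unsigned long is 32 bits wide.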

Inodes

In PRAMFS, directory entry information, such as the file name and owning inode, is contained within the inodes themselves. This presents a problem only for hard links, so PRAMFS does not support hard links. If at some time hard link support is desired, PRAMFS will instead use the more traditional model of maintaining directory entry information separate from inodes.

The PRAMFS inode structure is shown here:

#define PRAM_INODE_SIZE 128 // must be power of two
#define PRAM_INODE_BITS 7

/*
 * Structure of a directory entry in PRAMFS.
 * Offsets are to the inode that holds the referenced dentry.
 */
struct pram_dentry {
    pram_off_t d_next;   /* next dentry in this directory */
    pram_off_t d_prev;   /* previous dentry in this directory */
    pram_off_t d_parent; /* parent directory */
    char d_name[0];
};

/*
 * Structure of an inode in PRAMFS
 */
struct pram_inode {
    __u32 i_sum;         /* checksum of this inode */
    __u32 i_uid;         /* Owner Uid */
    __u32 i_gid;         /* Group Id */
    __u16 i_mode;        /* File mode */
    __u16 i_links_count; /* Links count */
    __u32 i_blocks;      /* Blocks count */
    __u32 i_size;        /* Size of data in bytes */
    __u32 i_atime;       /* Access time */
    __u32 i_ctime;       /* Creation time */
    __u32 i_mtime;       /* Modification time */
    __u32 i_dtime;       /* Deletion Time */

    union {
        struct {
            /*
             * ptr to row block of 2D block pointer array,
             * file block #'s 0 to (blocksize/4)^2 - 1.
             */
            pram_off_t row_block;
        } reg; // regular file or symlink inode
        struct {
            pram_off_t head; /* first entry in this directory */
            pram_off_t tail; /* last entry in this directory */
        } dir;
        struct {
            __u32 rdev; /* major/minor # */
        } dev; // device inode
    } i_type;

    struct pram_dentry i_d;
};

Notice the i_type union member. The valid elements of the union depend on the file's type as contained in i_mode. For instance, a directory file has valid information in i_type.dir, and the other elements of the union are invalid.

In PRAMFS, as in most other filesystems, the inode number of an inode is simply the absolute offset (pram_off_t) of that inode from the beginning of the filesystem.
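
As a rough illustration of the offset-based inode numbering, the sketch below converts between an inode table index and an inode number. It assumes that the inode table begins immediately after the 128-byte super block, which this specification implies but does not state outright. The helper names are hypothetical, and pram_off_t is narrowed to a fixed 32-bit type for the example.

```c
#include <stdint.h>

#define PRAM_SB_SIZE    128
#define PRAM_INODE_SIZE 128
#define PRAM_INODE_BITS 7

typedef uint32_t pram_off_t; /* fixed 32-bit offset for this sketch */

/* Assumption: the inode table starts right after the super block. */
#define INODE_TABLE_START ((pram_off_t)PRAM_SB_SIZE)

/* Inode number (absolute offset) of the inode at a given table index. */
static pram_off_t inode_index_to_ino(uint32_t index)
{
    return INODE_TABLE_START + ((pram_off_t)index << PRAM_INODE_BITS);
}

/* Table index of the inode with a given inode number. */
static uint32_t ino_to_inode_index(pram_off_t ino)
{
    return (ino - INODE_TABLE_START) >> PRAM_INODE_BITS;
}
```

Under this assumption, the first inode in the table has inode number 128, and offset 0 (the super block) can never be a valid inode number.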

Data Blocks

In PRAMFS, only regular files own file data (directories never own data blocks; this is discussed later). The inode field i_type.reg.row_block points to the start of a 2-dimensional table of data block pointers. A single block is allocated for the row block, which therefore contains b/4 32-bit pointers that point to up to b/4 column blocks. Each column block holds up to b/4 pointers to data blocks. In this way a regular file can contain up to (b/4)^2 data blocks, or b^3/16 bytes of data. For those familiar with the EXT2 filesystem, i_type.reg.row_block is equivalent to the i_block[13] entry in the EXT2 inode structure. The EXT2 inode's i_block[0-11] entries point directly to data blocks, the reason being that, for small files, the first 12 data blocks can be located in a single disk seek. For PRAMFS, however, there is no speed penalty for random access, so direct pointers to data blocks are not necessary, and omitting them simplifies the methods for locating data blocks. Also, higher order tables (such as EXT2's 3-dimensional i_block[14]) are not deemed necessary in PRAMFS because it is not envisioned that enough persistent RAM would be available to hold such large files.
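
The file size limit implied by the 2-dimensional table can be checked with a few lines of C. This is a small sketch with hypothetical function names; it assumes 4-byte block pointers, as stated above.

```c
#include <stdint.h>

/* Maximum number of data blocks addressable by one file: (b/4)^2. */
static uint64_t pram_max_file_blocks(uint32_t blocksize)
{
    uint64_t ptrs_per_block = blocksize / 4; /* 4-byte block pointers */

    return ptrs_per_block * ptrs_per_block;
}

/* Maximum file size in bytes: (b/4)^2 * b = b^3 / 16. */
static uint64_t pram_max_file_bytes(uint32_t blocksize)
{
    return pram_max_file_blocks(blocksize) * blocksize;
}
```

For b = 1024 this gives 65536 data blocks, or 64 MB per file. For b = 4096 the table can address 2^32 bytes, which is one byte more than the 32-bit i_size field can represent.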

A note about block numbers. An offset pointer to a block is sometimes referred to as a logical block number. Given a block index from 0 to N-1, it's a simple matter to convert the index into a logical block number: it's just the start offset of data blocks plus the index times the blocksize, or s_bitmap_start + (index * s_blocksize).
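
The conversion between a block index and a logical block number is sketched below. The function names are hypothetical and pram_off_t is narrowed to a fixed 32-bit type for the example. The parameters bitmap_start and blocksize stand for the s_bitmap_start and s_blocksize fields of the super block.

```c
#include <stdint.h>

typedef uint32_t pram_off_t; /* fixed 32-bit offset for this sketch */

/* Logical block number (offset from fs start) of block `index`. */
static pram_off_t block_index_to_off(pram_off_t bitmap_start,
                                     uint32_t blocksize, uint32_t index)
{
    return bitmap_start + index * blocksize;
}

/* Inverse mapping: block index of the block at offset `off`. */
static uint32_t block_off_to_index(pram_off_t bitmap_start,
                                   uint32_t blocksize, pram_off_t off)
{
    return (off - bitmap_start) / blocksize;
}

/* Turn any filesystem offset into a pointer, given the mapped base. */
static void *pram_off_to_ptr(void *fs_base, pram_off_t off)
{
    return (uint8_t *)fs_base + off;
}
```

Because the bitmap occupies the first data blocks, block index 0 maps to s_bitmap_start itself.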

However, when accessing data blocks for a file, we usually use a file block number, which is the relative position of the block inside the file. To find the absolute logical block number corresponding to a file block index from 0 to (b/4)^2 - 1, we use the inode's 2-dimensional block pointer table. For instance, say we are looking for the block at file block index 359, and the blocksize is b=1024. This means that a single block can hold 256 logical block numbers, and the logical block number for file block index 359 is therefore located at i_type.reg.row_block[1][103], that is, entry 103 within the second column block. This algorithm is implemented by the function pram_find_data_block() in fs/pramfs/inode.c, which takes as arguments the inode and the file block index and returns the corresponding logical block number.
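
The lookup can be sketched as below. This is not the code of pram_find_data_block(), only an illustration of the row/column arithmetic under stated assumptions: the filesystem is modelled as a plain byte array, pointers are 32-bit offsets, a zero pointer means "not allocated", and all names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t pram_off_t; /* fixed 32-bit offset for this sketch */

/* Read entry `slot` of the pointer block located at offset `blk`. */
static pram_off_t read_ptr(const uint8_t *fs_base, pram_off_t blk,
                           uint32_t slot)
{
    pram_off_t v;

    memcpy(&v, fs_base + blk + slot * sizeof(v), sizeof(v));
    return v;
}

/* Store `v` into entry `slot` of the pointer block at offset `blk`. */
static void write_ptr(uint8_t *fs_base, pram_off_t blk, uint32_t slot,
                      pram_off_t v)
{
    memcpy(fs_base + blk + slot * sizeof(v), &v, sizeof(v));
}

/*
 * Translate a file block index into a logical block number.
 * Returns 0 if the index is out of range or the block is unallocated.
 */
static pram_off_t find_data_block(const uint8_t *fs_base,
                                  pram_off_t row_block,
                                  uint32_t blocksize,
                                  uint32_t file_blocknr)
{
    uint32_t n = blocksize / sizeof(pram_off_t); /* pointers per block */
    uint32_t row = file_blocknr / n;  /* which column block */
    uint32_t col = file_blocknr % n;  /* entry within that column block */
    pram_off_t col_block;

    if (row_block == 0 || row >= n)
        return 0;
    col_block = read_ptr(fs_base, row_block, row);
    if (col_block == 0)
        return 0;
    return read_ptr(fs_base, col_block, col);
}
```

With b=1024, file block index 359 yields row 1 and column 103, matching the worked example above.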

The organization of the inode logical block pointer table is illustrated in the figure below. Arrows in the figure represent pram_off_t pointers, and entries in the column blocks are marked with their file block index, and are pointing to data blocks assigned to them.

pramfs_blockptr.jpg

Data Block Allocation

To allocate a new block, a search is made for the first cleared bit in the in-use bitmap. The located bit number is also the logical block index of the located free block. The bit is then set in the in-use bitmap to mark the corresponding block as in use. This algorithm is implemented in the function pram_new_block() in fs/pramfs/balloc.c, which returns the logical block index of the block that was just allocated.
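
A first-fit search of this kind might look like the following sketch. It is not the code of pram_new_block(); the function names, the bit order within a byte, and the -1 return value for a full filesystem are assumptions made for the example.

```c
#include <stdint.h>

/*
 * Find the first cleared bit among the first n_blocks bits, set it,
 * and return its index (the logical block index). Returns -1 if every
 * block is already in use.
 */
static long bitmap_alloc_first_free(uint8_t *bitmap, uint32_t n_blocks)
{
    for (uint32_t i = 0; i < n_blocks; i++) {
        uint8_t mask = (uint8_t)(1u << (i % 8));

        if (!(bitmap[i / 8] & mask)) {
            bitmap[i / 8] |= mask; /* mark the block as in use */
            return (long)i;
        }
    }
    return -1;
}

/* Clear the in-use bit of a block so that it can be allocated again. */
static void bitmap_free(uint8_t *bitmap, uint32_t index)
{
    bitmap[index / 8] &= (uint8_t)~(1u << (index % 8));
}
```

A real implementation would presumably start the search at the s_free_blocknr_hint field of the super block rather than at bit 0; the sketch omits this for brevity.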

The function pram_new_block() is used by the higher-level function pram_alloc_blocks(). The job of this function is to allocate data blocks for an inode. It will allocate a set of data blocks starting at a given file block index. Note that this function must take care of allocating the row and column blocks that make up the 2D block pointer table. Any unallocated file blocks before the starting file block index are allocated. All allocated blocks except the last are zeroed out. pram_alloc_blocks() in turn is used by the struct file_operations write() method (discussed below).

Directory Structure

All inodes (of any type) within a directory are linked together in a doubly-linked list, where the d_next and d_prev fields of each inode's embedded directory entry (i_d) point to the next and previous inodes within the directory. The d_prev pointer of the first inode and the d_next pointer of the last inode are null (zero).

The parent directory inode holds pointers to the head and tail of the doubly-linked list contained in that directory (i_type.dir.head and i_type.dir.tail, respectively). If the directory is empty, i_type.dir.head and i_type.dir.tail are both zero.
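
The list manipulation can be sketched with a simplified inode type. This is an illustration only: struct demo_inode keeps just the link fields, offsets index into an in-memory array whose slot 0 stands in for the super block (so that offset 0 can mean "no entry"), and appending new entries at the tail is an assumption, since this specification does not say where new entries are inserted.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t pram_off_t; /* fixed 32-bit offset for this sketch */

struct demo_inode {
    pram_off_t head, tail;               /* valid for directories only */
    pram_off_t d_next, d_prev, d_parent; /* directory entry links */
};

/* Offset of table slot i; slot 0 is reserved, so offset 0 means null. */
#define SLOT_OFF(i) ((pram_off_t)((i) * sizeof(struct demo_inode)))

static struct demo_inode *get_inode(struct demo_inode *table, pram_off_t off)
{
    return off ? &table[off / sizeof(struct demo_inode)] : NULL;
}

/* Append the inode at child_off to the directory at dir_off. */
static void add_link(struct demo_inode *table, pram_off_t dir_off,
                     pram_off_t child_off)
{
    struct demo_inode *dir = get_inode(table, dir_off);
    struct demo_inode *child = get_inode(table, child_off);

    child->d_parent = dir_off;
    child->d_next = 0;
    child->d_prev = dir->tail;
    if (dir->tail)
        get_inode(table, dir->tail)->d_next = child_off;
    else
        dir->head = child_off; /* directory was empty */
    dir->tail = child_off;
}

/* Unlink the inode at child_off from its parent directory's list. */
static void remove_link(struct demo_inode *table, pram_off_t child_off)
{
    struct demo_inode *child = get_inode(table, child_off);
    struct demo_inode *dir = get_inode(table, child->d_parent);

    if (child->d_prev)
        get_inode(table, child->d_prev)->d_next = child->d_next;
    else
        dir->head = child->d_next;

    if (child->d_next)
        get_inode(table, child->d_next)->d_prev = child->d_prev;
    else
        dir->tail = child->d_prev;

    child->d_next = child->d_prev = child->d_parent = 0;
}
```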

Other filesystem implementations, such as EXT2, use directory entry objects ("dentries") to associate file names with inodes. In EXT2 for instance, a dentry holds the file name, the inode number to associate the file with, and the file type, and these dentries are stored in data blocks owned by the parent directory. In PRAMFS, directory inodes do not need to own any data blocks, because all dentry information is contained within the inodes themselves.

Memory Mapping

Because PRAMFS attempts to avoid filesystem corruption caused by kernel bugs, dirty pages in the page cache are not allowed to be written back to the backing-store RAM. This means that only private file mappings are supported. This way, an errant write into the page cache will not get written back to the filesystem.

This is accomplished by implementing the readpage() method in the PRAMFS address_space object, but not the writepage() method.

Hardware Write Protection

In addition to the software protection features already discussed (i.e. avoiding the page cache for file I/O, and allowing only private mappings), the hardware protection feature utilizes the system's paging unit by mapping the I/O memory pages initially as read-only. Any writes to objects in the PRAMFS first mark the corresponding page table entries as writeable, perform the write, and then mark the pages as read-only again. This operation is done atomically and non-reentrantly by holding the page-table spin-lock with interrupts disabled. Also, when the write operation completes, any stale entries in the system TLB that are still marking the pages as writeable are flushed.

PRAMFS can disable the hardware write protection feature with the kernel config option CONFIG_PRAMFS_NOWP. This is useful for memory that is mapped without page tables, for instance memory that lives in the first 512M of physical address space in MIPS.

The function that sets the writeable flag for the filesystem's pages is pram_writeable() in fs/pramfs/wprotect.c. This function is used by a set of macro functions defined in include/linux/pram_fs.h:

  • pram_lock_super(ps) and pram_unlock_super(ps).
    The pram_lock_super() macro acquires the init_mm.page_table_lock spin-lock and disables interrupts, and then marks the pages that contain the given PRAMFS super-block (ps) as writeable. In turn, pram_unlock_super() recalculates the check-sum for the super-block (pram_sync_super), marks the pages read-only, and then releases the spin-lock and restores the system's interrupt flags. All writes to the PRAMFS super-block are bracketed by these two macro functions. Thus the write operations must always be done quickly.

  • pram_lock_inode(pi) and pram_unlock_inode(pi).
    These macros perform the same operations as the super-block macros above; the only difference is that the PRAMFS inode size is passed to pram_writeable() instead of the super-block size. All writes to PRAMFS inodes are bracketed by these macros.

  • pram_lock_block(sb,bp) and pram_unlock_block(sb,bp).
    Again, much the same as the above macros, except that we pass the blocksize to pram_writeable(). Also, blocks are not check-summed. All writes to PRAMFS blocks are bracketed by these macros.
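
The unprotect, write, re-protect bracket can be illustrated with a user-space analogy built on the POSIX mmap() and mprotect() calls. This is not how pram_writeable() is implemented: the kernel code edits page table entries directly while holding the page-table spin-lock with interrupts disabled, and none of that locking is reproduced here. The function names are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* for MAP_ANONYMOUS under strict ISO C modes */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `len` bytes of zeroed anonymous memory read-only, standing in
 * for the PRAMFS region. Returns NULL on failure.
 */
static void *demo_map_readonly(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/*
 * Bracketed write: make only the affected pages writeable, copy the
 * data, then mark those pages read-only again. Returns 0 on success.
 */
static int demo_protected_write(void *region, size_t region_len,
                                size_t off, const void *src, size_t n)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t start, span;
    char *base;

    if (off > region_len || n > region_len - off)
        return -1;

    start = off & ~(page - 1); /* round down to a page boundary */
    span = off + n - start;
    base = (char *)region + start;

    if (mprotect(base, span, PROT_READ | PROT_WRITE) != 0)
        return -1;
    memcpy((char *)region + off, src, n);
    if (mprotect(base, span, PROT_READ) != 0)
        return -1;
    return 0;
}
```

Any store to the region outside demo_protected_write() faults, which is the user-space counterpart of the kernel page protection fault that an errant write into PRAMFS memory is expected to cause.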

Filesystem Methods Walk-Through

In this section we do a code walk-through on a sample filesystem method. We will choose the write() method, which is a method in struct file_operations. This method is chosen because it involves more filesystem operations than any other method. In this walk-through, we assume a new regular file is being created, and then written to. A simple example of a shell command that would cause this to happen is echo hello > hello.txt. This case will walk us through not only the write() method, but also inode creation and linking into the parent directory.

The first entry into PRAMFS from the command echo hello > hello.txt is to the method pram_create() in struct inode_operations for directory inodes. The task of this method is to create a new inode for a regular file in a given directory. The first thing the method does is to call pram_new_inode() to allocate a new inode.

pram_new_inode() calls the kernel service new_inode() to allocate a new struct inode for the virtual filesystem layer. Next, the free inodes count is checked in the PRAMFS super-block (s_free_inodes_count) to verify there are free inodes available in the inode table. If there are, the index of the first free inode in the table is located. A free inode is characterized by a zero hard link count (i_links_count == 0), and either a file type of zero (i_mode == 0) or a marked deletion time (i_dtime != 0). Once the index of a free inode is located, the struct inode object is filled in with initial values. This inode is then converted to a PRAMFS inode and copied into the located index in the PRAMFS inode table.
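
The free-inode test described above can be written as a small predicate. The sketch below uses a cut-down inode type and hypothetical function names. Starting the scan at a hint and wrapping around the table is an assumption suggested by the s_free_inode_hint field, not something this specification spells out.

```c
#include <stdint.h>

/* Only the fields that matter for the free-inode test. */
struct demo_inode {
    uint16_t i_mode;
    uint16_t i_links_count;
    uint32_t i_dtime;
};

/* An inode is free if it has no links and was never used or was deleted. */
static int inode_is_free(const struct demo_inode *pi)
{
    return pi->i_links_count == 0 &&
           (pi->i_mode == 0 || pi->i_dtime != 0);
}

/*
 * Scan the table for a free inode, starting at `hint` and wrapping
 * around. Returns the table index, or -1 if no inode is free.
 */
static long find_free_inode(const struct demo_inode *table, uint32_t count,
                            uint32_t hint)
{
    for (uint32_t n = 0; n < count; n++) {
        uint32_t i = (hint + n) % count;

        if (inode_is_free(&table[i]))
            return (long)i;
    }
    return -1;
}
```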

If pram_new_inode() is successful, pram_create() then sets the inode's inode and file method pointers to those for a PRAMFS regular file, and then links the new PRAMFS inode into the given parent directory with a call to pram_add_nondir(). This routine calls pram_add_link(), which does the actual linking of the new inode into the parent directory's doubly-linked inode list. Then a new directory entry is instantiated into the VFS layer's dentry cache, and pram_add_nondir() returns. This completes the creation of the new inode for the regular file named hello.txt.

The next entry into PRAMFS is to pram_open_file(). This method simply forces the flag O_DIRECT on, and then calls the generic open file method. Therefore, all subsequent I/O on the file will use direct I/O.

Then, generic_file_write() is called. All the standard checks are done, such as verifying that the user buffer is accessible and that the file position that we are writing to is valid. Then, since the O_DIRECT flag is set in the file descriptor, the PRAMFS direct_IO() method is called.

pram_direct_IO() is the workhorse regular file access method for PRAMFS. First, the beginning file block number and the byte offset within that first block are calculated, based on the given file offset. Then the number of blocks that will be accessed is calculated based on the access length. If a write is being performed, pram_alloc_blocks() (described above) is called to allocate the blocks needed for the write.

With the data blocks now available, pram_direct_IO() executes a while loop to transfer all requested bytes to/from the user buffer from/to the inode's data blocks. At every loop iteration, either the remainder of the data is transferred, or an entire block size chunk is transferred. At the start of each iteration, a call to pram_find_data_block() is made to translate the file block number to a logical block number.
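
The block arithmetic and the transfer loop can be sketched as follows for the write direction. This is not the code of pram_direct_IO(). The lookup callback stands in for pram_find_data_block(), struct demo_flat is a deliberately trivial stand-in for the 2-dimensional pointer table, and every name here is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Lookup callback: returns the data block for a file block number. */
typedef uint8_t *(*find_block_fn)(void *ctx, uint32_t file_blocknr);

/* Trivial demo file: blocks stored contiguously in one array. */
struct demo_flat {
    uint8_t *base;
    uint32_t blocksize;
    uint32_t nblocks;
};

static uint8_t *demo_flat_lookup(void *ctx, uint32_t file_blocknr)
{
    struct demo_flat *f = ctx;

    if (file_blocknr >= f->nblocks)
        return NULL;
    return f->base + (size_t)file_blocknr * f->blocksize;
}

/* Number of blocks touched by an access of `len` bytes at `pos`. */
static uint32_t blocks_spanned(uint32_t blocksize, uint64_t pos, size_t len)
{
    if (len == 0)
        return 0;
    return (uint32_t)((pos + len - 1) / blocksize - pos / blocksize + 1);
}

/* Copy `len` bytes from `buf` into the file at `pos`; returns bytes written. */
static size_t demo_direct_write(void *ctx, find_block_fn find_block,
                                uint32_t blocksize, uint64_t pos,
                                const uint8_t *buf, size_t len)
{
    uint32_t blocknr = (uint32_t)(pos / blocksize); /* first file block */
    uint32_t offset = (uint32_t)(pos % blocksize);  /* offset within it */
    size_t done = 0;

    while (done < len) {
        uint8_t *blk = find_block(ctx, blocknr);
        size_t chunk = blocksize - offset; /* room left in this block */

        if (blk == NULL)
            break;
        if (chunk > len - done)
            chunk = len - done; /* final, partial chunk */
        memcpy(blk + offset, buf + done, chunk);
        done += chunk;
        blocknr++;
        offset = 0; /* later blocks are written from their start */
    }
    return done;
}
```

Only the first block can start at a non-zero offset. Every later iteration transfers either a whole block or the remainder of the data.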

Note how these methods are written such that all accesses to objects in the PRAMFS completely bypass the page and buffer caches. Data moves directly from the user buffer to PRAMFS data blocks, with the file data never existing in any intermediate kernel buffers or caches.

User Interface

PRAMFS currently has one required mount option and several optional ones:

  • The mount option "physaddr=" is a required option. This tells PRAMFS the physical address of the start of the RAM that makes up the filesystem.

  • The mount option "init=" is optional, and is used to initialize an empty filesystem. Any data in an existing filesystem will be lost if this option is given. The parameter to "init=" is the RAM size in bytes.

  • The mount option "bs=" is optional, and is used to specify a block size. It is ignored if the "init=" option is not specified, since otherwise the block size is read from the PRAMFS super-block. The default blocksize is 2048 bytes, and the allowed block sizes are 512, 1024, 2048, and 4096.

  • The mount option "bpi=" is optional, and is used to specify the bytes per inode ratio, i.e. for every N bytes in the filesystem, an inode will be created. This behaves the same as the "-i" option to mke2fs. It is ignored if the "init=" option is not specified.

  • The mount option "N=" is optional, and is used to specify the number of inodes to allocate in the inode table. If the option is not specified, the bytes-per-inode ratio is used to calculate the number of inodes. If neither the "N=" nor the "bpi=" option is specified, the default behavior is to reserve 5% of the total space in the filesystem for the inode table. This option behaves the same as the "-N" option to mke2fs. It is ignored if the "init=" option is not specified.

Example:

mount -t pramfs -o physaddr=0x20000000,init=0x2F000,bs=1024 none /mnt/pram

This example locates the filesystem at physical address 0x20000000, and also requests that an empty filesystem of total size 0x2F000 bytes and block size 1024 be initialized. The mount point is /mnt/pram.

mount -t pramfs -o physaddr=0x20000000 none /mnt/pram

This example locates the filesystem at physical address 0x20000000 as in the first example, but uses the intact filesystem that already exists.
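
To make the option rules concrete, the sketch below parses an option string such as the ones in the examples. It is a user-space illustration with hypothetical names, not the PRAMFS mount code. Rejecting unknown options, and checking the block size only when "init=" is given, are assumptions drawn from the descriptions above.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct demo_opts {
    unsigned long physaddr, init_size, bs, bpi, n_inodes;
    int have_physaddr, have_init;
};

static int bs_is_valid(unsigned long bs)
{
    return bs == 512 || bs == 1024 || bs == 2048 || bs == 4096;
}

/* If tok starts with key, parse the number after it into *out. */
static int match(const char *tok, const char *key, unsigned long *out)
{
    size_t n = strlen(key);

    if (strncmp(tok, key, n) != 0)
        return 0;
    *out = strtoul(tok + n, NULL, 0); /* base 0 accepts 0x prefixes */
    return 1;
}

/* Parse a comma-separated option string. Returns 0 on success, -1 on error. */
static int demo_parse_opts(const char *str, struct demo_opts *o)
{
    char buf[256];
    char *tok;

    if (strlen(str) >= sizeof(buf))
        return -1;
    strcpy(buf, str);
    memset(o, 0, sizeof(*o));
    o->bs = 2048; /* default block size */

    for (tok = strtok(buf, ","); tok != NULL; tok = strtok(NULL, ",")) {
        if (match(tok, "physaddr=", &o->physaddr))
            o->have_physaddr = 1;
        else if (match(tok, "init=", &o->init_size))
            o->have_init = 1;
        else if (match(tok, "bs=", &o->bs) ||
                 match(tok, "bpi=", &o->bpi) ||
                 match(tok, "N=", &o->n_inodes))
            continue;
        else
            return -1; /* unknown option (assumed to be an error) */
    }

    if (!o->have_physaddr)
        return -1; /* physaddr= is required */
    if (o->have_init && !bs_is_valid(o->bs))
        return -1; /* bs= only matters when initializing */
    return 0;
}
```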

Acceptance Criteria

The following operations should be verified on a mounted PRAMFS:

  • A mounted PRAMFS should pass the Bonnie++ benchmark tests. In this example bonnie++ command, it is assumed that the total PRAMFS filesystem is at least 1MB in size, and that there are at least 2048 inodes available:
        bonnie++ -u root -s 1 -r 0 -n 2 -d /mnt/pram
  • Errant writes by the kernel into any area within the PRAMFS memory should cause a kernel page fault exception, and should not corrupt the filesystem. This can be tested with the following simple kernel module. Copy the text to a file named "testwrite.c", compile it natively on the system being tested with the command gcc -c -D__KERNEL__ -DMODULE -O -Wall testwrite.c, and then install the module with the command insmod -f ./testwrite.o. The module will attempt a write within the pramfs memory and should cause a kernel page protection fault ("Unable to handle kernel paging request at virtual address ..."). Then reboot the system, remount the pramfs filesystem, and verify that no filesystem data has been corrupted.
/* compile with: -c -D__KERNEL__ -DMODULE -O -Wall */

#include <linux/module.h>
#include <linux/version.h>
#include <linux/init.h>
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/pram_fs.h>

/* off is an unsigned long, so it uses the "l" (long) parameter type */
static unsigned long off = 0;
MODULE_PARM(off, "l");
MODULE_PARM_DESC(off, "offset within pramfs to attempt a write");

static int __init test_pramfs_write(void)
{
    struct super_block *sb;
    struct pram_super_block *psb;
    char *ptr;

    sb = find_pramfs_super();
    if (!sb) {
        printk(KERN_ERR
               "%s: PRAMFS super block not found (not mounted?)\n",
               __func__);
        return -ENODEV;
    }

    /* clamp the offset so that the write lands inside the filesystem */
    psb = pram_get_super(sb);
    off = (off < psb->s_size) ? off : psb->s_size - 1;
    ptr = (char *)psb + off;

    /*
     * attempt an unprotected write into the pramfs area, this
     * should cause a kernel page protection fault
     */
    printk(KERN_INFO "%s: writing to kernel VA %p\n", __func__, ptr);
    *ptr = 0;

    return 0;
}

static void __exit test_pramfs_write_cleanup(void) {}

/* Module information */
MODULE_LICENSE("GPL");
module_init(test_pramfs_write);
module_exit(test_pramfs_write_cleanup);

Additional Information

Architecture-dependent Requirements for PRAMFS

Like other filesystems, the source code under fs/pramfs is architecture independent, and simply needs to be recompiled for a specific architecture. However, in kernel version 2.4, PRAMFS does require a few kernel services that are architecture dependent in order to support the write protection feature:

  • __ioremap_readonly()

This is a new function that is identical to the existing __ioremap() method in every respect, except that the page table entries that map the IO memory must be marked read-only. The method is used by PRAMFS to initially map the PRAMFS memory as read-only at mount time.

  • flush_tlb_page(), flush_tlb_mm(), flush_tlb_range()

For PRAMFS, these already existing routines need to allow flushing the system's hardware TLB for memory regions owned by init_mm. Some architectures already allow this, such as PPC. However most architectures still require that the caller of these methods have a process context besides the init process.

The above requirements are not needed if write protection is disabled with the CONFIG_PRAMFS_NOWP config option.

Note that in kernel version 2.6, new methods exist to flush the TLB for page table entries specifically owned by kernel mappings. Also, the need for __ioremap_readonly() has been removed. Therefore PRAMFS in 2.6 has no special arch-dependent requirements.

Known Problems

1. For small PRAMFS mounted filesystems, bonnie++ fills up the filesystem before the tests complete. This is not a bug in PRAMFS, but rather a limitation of the bonnie++ program.

2. There is some optimization to be done, by consolidating writes to the super-block and inodes. This will remove calls to the write protection routine, and speed up some PRAMFS write operations.