Technical Specification for a Protected Non-Volatile RAM Filesystem

Introduction

This specification describes the design for a protected, persistent RAM-based special filesystem in Linux (PRAMFS).

PRAMFS is designed to be a lightweight and space-efficient RAM-based non-volatile filesystem. It is also designed to minimize the risk of filesystem corruption due to errant writes caused by kernel bugs.

System Requirements for PRAMFS

The RAM used for the PRAMFS filesystem must meet the following requirements:

  • The RAM must be directly addressable.
  • The RAM must have access times comparable to normal system memory.
  • In order for the RAM to be write protected, it must be addressable by the CPU only via page tables.

High Level Design

To help meet the "protected" requirement, PRAMFS attempts to minimize the time windows in which filesystem data is present in unprotected kernel buffers. PRAMFS therefore does not keep file data in the page cache for normal file I/O. Because a central assumption of PRAMFS is that the RAM used for the filesystem is comparable in speed to system memory, bypassing the page cache is acceptable and in fact desirable.

PRAMFS accomplishes this by using the direct I/O feature in Linux, which guarantees that file data is transferred directly between the user-level buffers and the filesystem, with no intermediate buffers. In PRAMFS, however, direct I/O is enabled for every file in the filesystem; in other words, the O_DIRECT flag is forced on every open of a PRAMFS file. File I/O in PRAMFS also always occurs synchronously. There is no need to block the current process while a transfer to or from PRAMFS is in progress, since one of the requirements of PRAMFS is that the filesystem reside in fast RAM. File I/O in PRAMFS is therefore always direct, synchronous, and non-blocking.

One approach for a non-volatile RAM filesystem is to write a non-volatile RAM block device driver, and then mount a disk-based filesystem over it. The advantages of PRAMFS over this approach include:

  • Disk-based filesystems such as ext2/ext3/ext4 use the page cache for file I/O. PRAMFS never uses the page cache for file I/O, so it places no pressure on the cache, which remains free for use by other parts of the system that need it (disk-based filesystem data, pages from swapped-out processes, IPC pages, etc.). Bypassing the page cache also protects the filesystem against possible page cache corruption caused by kernel bugs.

  • Disk-based filesystems such as ext2/ext3/ext4 were designed for optimum performance on disk-based media, and so implement features such as block groups, which attempt to group inode data into a contiguous set of data blocks to minimize disk seeks when accessing files. Since there is no such performance penalty for random access on RAM-based media, features such as block groups are not used in PRAMFS. This reduces filesystem complexity and in turn increases the efficient use of space on the media (i.e. more space is dedicated to actual file data storage and less to the meta-data needed to maintain that file data).

PRAMFS Special Filesystem

In this section we discuss the details of the PRAMFS special filesystem design. First, the general layout of data objects is described, followed by a description of the information contained in the data objects themselves. Next comes a discussion of how blocks are allocated for inodes. Then the directory tree structure is described, followed by the details of write protecting the filesystem data, and finally a walk-through of some important filesystem methods.

Data Objects Layout

Refer to the following figure for the data layout.

pramfs_layout.jpg

  • Super Block (SB): the super block is 128 bytes long, and exists at the very beginning of the filesystem. There is a redundant super block after the primary.

  • Inode Table: the inode table consists of Ni inodes, and each inode is 128 bytes long. Therefore the inode table is 128*Ni bytes in size. The number of inodes is calculated such that the end of the table occurs on a block boundary. The inode table size is fixed, that is the maximum number of inodes that can be allocated is Ni, and Ni cannot be changed after the filesystem is created.

  • Data Blocks: The remaining space in the filesystem after the inode table consists of data blocks for file data and the in-use bitmap (discussed next). In the figure, b is the block size in bytes, and N is the total number of data blocks. Like Ni, N is fixed, that is, once the filesystem is created, the maximum number of data blocks that can be allocated is fixed.

  • Block In-Use Bitmap: For every block, there is a bit in the bitmap signifying whether that block is in-use by a file (set) or not in use (cleared). Therefore the bitmap requires N bits, or N/8 bytes, or N/(8b) blocks. N includes the bitmap blocks, so when the filesystem is created, the first N/(8b) bits are set in the bitmap, which marks the blocks that make up the bitmap as in use, and these blocks can never be freed.
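The bitmap sizing described above can be sketched as follows. This is a minimal illustration, not code from the PRAMFS sources: the helper name toy_bitmap_blocks() is hypothetical, and rounding up to a whole block is an assumption for the case where N is not an exact multiple of 8b.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the PRAMFS sources): number of blocks
 * needed for an in-use bitmap covering nblocks blocks of blocksize bytes.
 * One block holds 8 * blocksize bits; rounding up is an assumption for
 * the case where nblocks is not a multiple of that.
 */
uint32_t toy_bitmap_blocks(uint32_t nblocks, uint32_t blocksize)
{
    uint64_t bits_per_block = 8ull * blocksize;

    assert(blocksize > 0);
    return (uint32_t)((nblocks + bits_per_block - 1) / bits_per_block);
}
```

For example, N = 8192 blocks of b = 1024 bytes fit in a single bitmap block, so under this sketch only bit 0 would be set when the filesystem is created.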

Super Block

The super block object contains information that pertains to the whole filesystem. Such information includes the block size of the filesystem in bytes, the total number of inodes and blocks (Ni and N), a count of the current free inodes and data blocks, etc. The PRAMFS super block structure is shown here:
#define PRAM_SB_SIZE 128 /* must be power of two */
#define PRAM_SB_BITS 7

/*
 * Structure of the super block in PRAMFS
 */
struct pram_super_block {
    __be16 s_sum;               /* checksum of this sb, including padding */
    __be64 s_size;              /* total size of fs in bytes */
    __be32 s_blocksize;         /* blocksize in bytes */
    __be32 s_inodes_count;      /* total inodes count (used or free) */
    __be32 s_free_inodes_count; /* free inodes count */
    __be32 s_free_inode_hint;   /* start hint for locating free inodes */
    __be32 s_blocks_count;      /* total data blocks count (used or free) */
    __be32 s_free_blocks_count; /* free data blocks count */
    __be32 s_free_blocknr_hint; /* start hint for locating free data blocks */
    __be64 s_bitmap_start;      /* data block in-use bitmap location */
    __be32 s_bitmap_blocks;     /* size of bitmap in number of blocks */
    __be32 s_mtime;             /* Mount time */
    __be32 s_wtime;             /* Write time */
    __be16 s_magic;             /* Magic signature */
    char s_volume_name[16];     /* volume name */
};

Inodes

In PRAMFS, directory entry information, such as the file name and the owning inode, is contained within the inodes themselves. Because each inode therefore carries exactly one name and one parent directory, a file cannot be linked under a second name, so PRAMFS does not support hard links.

The PRAMFS inode structure is shown here:

#define PRAM_INODE_SIZE 128 /* must be power of two */
#define PRAM_INODE_BITS 7

/*
 * Structure of a directory entry in PRAMFS.
 * Offsets are to the inode that holds the referenced dentry.
 */
struct pram_dentry {
    __be64 d_next;   /* next dentry in this directory */
    __be64 d_prev;   /* previous dentry in this directory */
    __be64 d_parent; /* parent directory */
    char d_name[0];
};

/*
 * Structure of an inode in PRAMFS
 */
struct pram_inode {
    __be16 i_sum;         /* checksum of this inode */
    __be32 i_uid;         /* Owner Uid */
    __be32 i_gid;         /* Group Id */
    __be16 i_mode;        /* File mode */
    __be16 i_links_count; /* Links count */
    __be32 i_blocks;      /* Blocks count */
    __be32 i_size;        /* Size of data in bytes */
    __be32 i_atime;       /* Access time */
    __be32 i_ctime;       /* Creation time */
    __be32 i_mtime;       /* Modification time */
    __be32 i_dtime;       /* Deletion Time */
    __be64 i_xattr;       /* Extended attribute */
    __be32 i_generation;  /* File version (for NFS) */
    __be32 i_flags;       /* Inode flags */

    union {
        struct {
            /*
             * ptr to row block of 2D block pointer array,
             * file block #'s 0 to (blocksize/8)^2 - 1.
             */
            __be64 row_block;
        } reg; /* regular file or symlink inode */
        struct {
            __be64 head; /* first entry in this directory */
            __be64 tail; /* last entry in this directory */
        } dir;
        struct {
            __be32 rdev; /* major/minor # */
        } dev; /* device inode */
    } i_type;

    struct pram_dentry i_d;
};

Notice the i_type union member. The valid elements of the union depend on the file's type as contained in i_mode. For instance, a directory file has valid information in i_type.dir, and the other elements of the union are invalid.

In PRAMFS, as in most other filesystems, the inode number of an inode is simply the absolute offset of that inode from the beginning of the filesystem.
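The relationship between an inode's position in the inode table and its inode number can be sketched as below. The helper names and the table_start parameter (the byte offset at which the inode table begins) are assumptions made for illustration; only PRAM_INODE_SIZE and PRAM_INODE_BITS come from the listing above.

```c
#include <assert.h>
#include <stdint.h>

#define PRAM_INODE_SIZE 128
#define PRAM_INODE_BITS 7

/* Hypothetical helper: inode number (absolute byte offset) of table slot index. */
uint64_t toy_index_to_ino(uint64_t table_start, uint32_t index)
{
    return table_start + ((uint64_t)index << PRAM_INODE_BITS);
}

/* Hypothetical helper: table slot of the inode at absolute offset ino. */
uint32_t toy_ino_to_index(uint64_t table_start, uint64_t ino)
{
    assert(ino >= table_start);
    /* An inode number must fall on a PRAM_INODE_SIZE boundary in the table. */
    assert(((ino - table_start) & (PRAM_INODE_SIZE - 1)) == 0);
    return (uint32_t)((ino - table_start) >> PRAM_INODE_BITS);
}
```

Because the inode size is a power of two, both conversions reduce to a shift by PRAM_INODE_BITS.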

Data Blocks

In PRAMFS, only regular files own file data. Directories do not own data blocks, with the exception of extended attribute blocks, which are discussed later. The inode field i_type.reg.row_block points to the start of a 2-dimensional table of data block pointers. A single block is allocated for the row block, which therefore contains b/8 64-bit pointers to up to b/8 column blocks. Each column block holds up to b/8 pointers to data blocks. In this way a regular file can contain up to (b/8)^2 data blocks, or b^3/64 bytes of data. For those familiar with the EXT2 filesystem, i_type.reg.row_block is equivalent to the i_block[13] entry in the EXT2 inode structure. The EXT2 inode's i_block[0-11] entries point directly to data blocks, the reason being that, for small files, the first 12 data blocks can be located in a single disk seek. In PRAMFS, however, there is no speed penalty for random access, so direct pointers to data blocks are unnecessary, and omitting them simplifies the methods for locating data blocks. Higher-order tables (such as EXT2's 3-dimensional i_block[14]) are also not deemed necessary in PRAMFS, because it is not envisioned that enough persistent RAM would be available to hold such large files.
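The file size limits stated above follow directly from the table geometry. The sketch below works them out for a given block size; the function names are hypothetical and are not part of the PRAMFS sources.

```c
#include <assert.h>
#include <stdint.h>

/* Maximum data blocks per regular file: (b/8) column blocks of (b/8) pointers. */
uint64_t toy_max_file_blocks(uint32_t blocksize)
{
    uint64_t ptrs_per_block = blocksize / 8; /* 64-bit pointers per block */

    assert(blocksize >= 8);
    return ptrs_per_block * ptrs_per_block;
}

/* Maximum file size in bytes: (b/8)^2 * b = b^3/64. */
uint64_t toy_max_file_bytes(uint32_t blocksize)
{
    return toy_max_file_blocks(blocksize) * blocksize;
}
```

With b = 1024 this gives 16384 data blocks, or 16 MiB per file; with b = 4096 the limit is 1 GiB.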

A note about block numbers. An offset pointer to a block is sometimes referred to as a logical block number. Given a block index from 0 to N-1, it's a simple matter to convert the index into a logical block number: it's just the start offset of data blocks plus the index times the blocksize, or s_bitmap_start + (index * s_blocksize).
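As a small sketch of this conversion, assuming the super block fields have already been converted from big-endian to host byte order (the helper names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Block index (0 to N-1) to logical block number (absolute byte offset). */
uint64_t toy_index_to_block_off(uint64_t s_bitmap_start, uint32_t s_blocksize,
                                uint32_t index)
{
    return s_bitmap_start + (uint64_t)index * s_blocksize;
}

/* Inverse: logical block number back to block index. */
uint32_t toy_block_off_to_index(uint64_t s_bitmap_start, uint32_t s_blocksize,
                                uint64_t off)
{
    assert(s_blocksize > 0 && off >= s_bitmap_start);
    return (uint32_t)((off - s_bitmap_start) / s_blocksize);
}
```

The data area starts at s_bitmap_start because the bitmap itself occupies the first data blocks.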

However, when accessing data blocks for a file, we usually use a file block number, which is the relative position of the block inside the file. To find the absolute logical block number corresponding to a file block index from 0 to (b/8)^2 - 1, we use the inode's 2-dimensional block pointer table. For instance, say we are looking for the block at file block index 359, and the blocksize is b=1024. This means that a single block can hold 128 logical block numbers, and the logical block number for file block index 359 is therefore located at i_type.reg.row_block[2][103], that is, entry 103 within the third column block. This lookup is performed by the function pram_find_data_block() in fs/pramfs/inode.c, which takes as arguments the inode and the file block index and returns the corresponding logical block number.
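The index arithmetic in this example can be sketched as follows. This is only the row/column split, not the actual pram_find_data_block() implementation; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Position of a file block within the 2D block pointer table. */
struct toy_blk_pos {
    uint32_t row; /* entry in the row block, selecting a column block */
    uint32_t col; /* entry within that column block */
};

struct toy_blk_pos toy_split_file_block(uint32_t blocksize, uint32_t file_blk)
{
    uint32_t ptrs = blocksize / 8; /* 64-bit pointers per block */
    struct toy_blk_pos pos;

    /* The file block index must lie in 0 .. (b/8)^2 - 1. */
    assert(ptrs > 0 && file_blk / ptrs < ptrs);
    pos.row = file_blk / ptrs;
    pos.col = file_blk % ptrs;
    return pos;
}
```

For b = 1024 and file block index 359, this yields row 2 and column 103, matching the example above (2 * 128 + 103 = 359).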

The organization of the inode logical block pointer table is illustrated in the figure below. Arrows in the figure represent pointers, and entries in the column blocks are marked with their file block index, and are pointing to data blocks assigned to them.

pramfs_blockptr.jpg

Data Block Allocation

To allocate a new block, a search is made for the first cleared bit in the in-use bitmap. The located bit number is also the logical block index of the located free block. The bit is then set in the in-use bitmap to mark the corresponding block as in use. This algorithm is implemented in the function pram_new_block() in fs/pramfs/balloc.c, which returns the logical block index of the block that was just allocated.
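A first-fit search of this kind can be sketched as below. It is an illustration rather than the pram_new_block() code: the function name is hypothetical, the least-significant-bit-first ordering within each byte is an assumption, and any use of the super block's free-block hint is left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Find the first cleared bit among the first nbits bits of bitmap, set it,
 * and return its bit number (the block index). Returns -1 if every bit is
 * already set. Bit i lives in byte i / 8 at position i % 8 (an assumption).
 */
long toy_bitmap_alloc(uint8_t *bitmap, uint32_t nbits)
{
    assert(bitmap != 0);
    for (uint32_t i = 0; i < nbits; i++) {
        uint8_t mask = (uint8_t)(1u << (i % 8));

        if (!(bitmap[i / 8] & mask)) {
            bitmap[i / 8] |= mask; /* mark the block as in use */
            return (long)i;
        }
    }
    return -1; /* no free block */
}
```

A production version would scan a word at a time; the bit-by-bit loop is kept here only for clarity.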

The function pram_new_block() is used by the higher-level function pram_alloc_blocks(). The job of this function is to allocate data blocks for an inode. It will allocate a set of data blocks starting at a given file block index. Note that this function must take care of allocating the row and column blocks that make up the 2D block pointer table. Any unallocated file blocks before the starting file block index are allocated. All allocated blocks except the last are zeroed out. pram_alloc_blocks() in turn is used by the struct file_operations write() method (discussed below).

Directory Structure

All inodes (of any type) within a directory are linked together in a doubly-linked list, where the i_d.d_next and i_d.d_prev fields of each inode point to the next and previous inodes within the directory. The i_d.d_prev pointer of the first inode and the i_d.d_next pointer of the last inode are zero (null), which terminates the list.

The parent directory inode holds pointers to the head and tail of the doubly-linked list contained in that directory (i_type.dir.head and i_type.dir.tail, respectively). If the directory is empty, i_type.dir.head and i_type.dir.tail are both zero.
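The list manipulation can be sketched with a toy in-memory table in which array indices stand in for inode offsets and index 0 plays the role of the null pointer. The structure and function names are hypothetical, and appending at the tail is an assumption; the text does not say where pram_add_link() inserts new entries.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for the link fields of a PRAMFS inode (host byte order). */
struct toy_dir_inode {
    uint64_t d_next, d_prev, d_parent; /* mirror the pram_dentry links */
    uint64_t head, tail;               /* mirror i_type.dir (directories only) */
};

/* Append inode child to the end of directory dir's list. 0 means null. */
void toy_add_link(struct toy_dir_inode *tbl, uint64_t dir, uint64_t child)
{
    assert(dir != 0 && child != 0);
    tbl[child].d_parent = dir;
    tbl[child].d_next = 0;
    tbl[child].d_prev = tbl[dir].tail;
    if (tbl[dir].tail)
        tbl[tbl[dir].tail].d_next = child; /* link after the old tail */
    else
        tbl[dir].head = child;             /* directory was empty */
    tbl[dir].tail = child;
}
```

Listing a directory then amounts to starting at head and following d_next until it reads zero.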

Other filesystem implementations, such as EXT2, use directory entry objects ("dentries") to associate file names with inodes. In EXT2, for instance, a dentry holds the file name, the inode number to associate the file with, and the file type, and these dentries are stored in data blocks owned by the parent directory. In PRAMFS, directory inodes do not need to own any data blocks, because all dentry information is contained within the inodes themselves.

Extended Attributes

Extended attributes are stored in blocks allocated outside of any inode. The i_xattr field is then made to point to this allocated block. If all extended attributes of several inodes are identical, these inodes may share the same extended attribute block. Such situations are automatically detected by keeping an in-memory cache of recent attribute block numbers and hashes over the blocks' contents. Each extended attribute block is described by a descriptor, which holds flags, the absolute block number, and a lock for the block. The design is based on that of ext2/3/4, but instead of using buffer head structs, the page cache, and the block layer, PRAMFS uses a red-black tree to track blocks and their states.

xattr.png

The block header is followed by multiple entry descriptors. These entry descriptors are variable in size, and aligned to PRAM_XATTR_PAD-byte boundaries. The entry descriptors are sorted by attribute name, so that two extended attribute blocks can be compared efficiently. Attribute values are aligned to the end of the block and stored in no specific order. They are also padded to PRAM_XATTR_PAD-byte boundaries. No additional gaps are left between them.
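The padding rule can be illustrated with a generic round-up helper. The value of PRAM_XATTR_PAD is not given in this document, so the sketch takes the boundary as a parameter and assumes it is a power of two; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Round len up to the next multiple of pad, where pad is a power of two. */
size_t toy_pad_up(size_t len, size_t pad)
{
    assert(pad != 0 && (pad & (pad - 1)) == 0);
    return (len + pad - 1) & ~(pad - 1);
}
```

With a 4-byte boundary, for instance, a 5-byte attribute value would occupy 8 bytes.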

Memory Mapping

Because PRAMFS attempts to avoid filesystem corruption caused by kernel bugs, dirty pages in the page cache are not allowed to be written back to the backing-store RAM. This means that only private file mappings are supported. This way, an errant write into the page cache will not get written back to the filesystem.

This is accomplished by implementing the readpage() method in the PRAMFS address_space object, but not the writepage() method.

Hardware Write Protection

In addition to the software protection features already discussed (i.e. avoiding the page cache for file I/O, and allowing only private mappings), the hardware protection feature utilizes the system's paging unit by mapping the I/O memory pages initially as read-only. Any writes to objects in the PRAMFS first mark the corresponding page table entries as writeable, perform the write, and then mark the pages as read-only again. Also, when the write operation completes, any stale entries in the system TLB that are still marking the pages as writeable are flushed.

The function that sets the writeable flag for the filesystem's pages is pram_writeable() in fs/pramfs/wprotect.c. This function is used by a set of macro functions defined in include/linux/pram_fs.h:

  • pram_lock_super(ps) and pram_unlock_super(ps).
    The pram_lock_super() macro acquires a spinlock, and then marks the pages that contain the given PRAMFS super-block (ps) as writeable. In turn, pram_unlock_super() recalculates the checksum for the super-block (pram_sync_super), marks the pages read-only, and then releases the spinlock and restores the system's interrupt flags. All writes to the PRAMFS super-block are bracketed by these two macro functions. Because a spinlock is held for the duration, these write operations must always be done quickly.

  • pram_lock_inode(pi) and pram_unlock_inode(pi).
    These macros perform the same operations as the super-block macros above; the only difference is that the PRAMFS inode size is passed to pram_writeable() instead of the super-block size. All writes to PRAMFS inodes are bracketed by these macros.

  • pram_lock_block(sb,bp) and pram_unlock_block(sb,bp).
    Again, much the same as the above macros, except that we pass the blocksize to pram_writeable(). Also, blocks are not check-summed. All writes to PRAMFS blocks are bracketed by these macros.
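The bracketing pattern used by these macros (make writeable, write, make read-only again) has a rough user-space analogue in the POSIX mmap() and mprotect() calls. The sketch below is only that analogue: it is not the kernel code, it omits the spinlock and checksum steps, and the toy_ function names are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one anonymous page read-only; it stands in for the protected RAM. */
void *toy_map_protected(size_t *len_out)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p;

    assert(len_out != NULL);
    if (page <= 0)
        return NULL;
    p = mmap(NULL, (size_t)page, PROT_READ,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    *len_out = (size_t)page;
    return p;
}

/* Bracketed write: open the write window, copy, close the window again. */
int toy_protected_write(void *region, size_t region_len, size_t off,
                        const void *src, size_t n)
{
    if (off > region_len || n > region_len - off)
        return -1;
    if (mprotect(region, region_len, PROT_READ | PROT_WRITE) != 0)
        return -1;
    memcpy((char *)region + off, src, n);
    return mprotect(region, region_len, PROT_READ);
}
```

Any store to the region outside toy_protected_write() faults, which is the behaviour the hardware protection aims for with errant kernel writes.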

Filesystem Methods Walk-Through

In this section we do a code walk-through on a sample filesystem method. We will choose the write() method, which is a method in struct file_operations. This method is chosen because it involves more filesystem operations than any other method. In this walk-through, we assume a new regular file is being created, and then written to. A simple example of a shell command that would cause this to happen is echo hello > hello.txt. This case will walk us through not only the write() method, but also inode creation and linking into the parent directory.

The first entry into PRAMFS from the command echo hello > hello.txt is to the method pram_create() in struct inode_operations for directory inodes. The task of this method is to create a new inode for a regular file in a given directory. The first thing the method does is to call pram_new_inode() to allocate a new inode.

pram_new_inode() calls the kernel service new_inode() to allocate a new struct inode for the virtual filesystem layer. Next, the free inodes count in the PRAMFS super-block (s_free_inodes_count) is checked to verify that there are free inodes available in the inode table. If there are, the index of the first free inode in the table is located. A free inode is characterized by a zero hard link count (i_links_count = 0), and either a file type of zero (i_mode = 0) or a marked deletion time (i_dtime != 0). Once the index of a free inode is located, the struct inode object is filled in with initial values. This inode is then converted to a PRAMFS inode and copied into the located index in the PRAMFS inode table.
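The free-inode test described above can be written as a small predicate plus a table scan. This is a sketch under stated assumptions: the structure holds only the three relevant fields in host byte order, the names are hypothetical, and starting the scan at a hint and wrapping around is an assumption suggested by the s_free_inode_hint field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Only the fields that decide whether an inode slot is free. */
struct toy_inode_state {
    uint16_t i_mode;
    uint16_t i_links_count;
    uint32_t i_dtime;
};

/* Free: no hard links, and either never used or marked deleted. */
bool toy_inode_is_free(const struct toy_inode_state *pi)
{
    return pi->i_links_count == 0 && (pi->i_mode == 0 || pi->i_dtime != 0);
}

/* Scan count slots starting at hint, wrapping around; -1 if none is free. */
long toy_find_free_inode(const struct toy_inode_state *tbl, uint32_t count,
                         uint32_t hint)
{
    if (count == 0)
        return -1;
    assert(hint < count);
    for (uint32_t i = 0; i < count; i++) {
        uint32_t idx = (hint + i) % count;

        if (toy_inode_is_free(&tbl[idx]))
            return (long)idx;
    }
    return -1;
}
```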

If pram_new_inode() is successful, pram_create() then sets the inode's inode and file method pointers to those for a PRAMFS regular file, and then links the new PRAMFS inode into the given parent directory with a call to pram_add_nondir(). This routine calls pram_add_link(), which does the actual linking of the new inode into the parent directory's doubly-linked inode list. Then a new directory entry is instantiated in the VFS layer's dentry cache, and pram_add_nondir() returns. This completes the creation of the new inode for the regular file named hello.txt.

The next entry into PRAMFS is to pram_open_file(). This method simply forces the flag O_DIRECT on, and then calls the generic open file method. Therefore, all subsequent I/O on the file will use direct I/O.

Then, generic_file_write() is called. All the standard checks are done, such as verifying that the user buffer is accessible and that the file position that we are writing to is valid. Then, since the O_DIRECT flag is set in the file descriptor,  the PRAMFS direct_IO() method is called.

pram_direct_IO() is the workhorse regular file access method for PRAMFS. First, the beginning file block number and the byte offset within that first block are calculated, based on the given file offset. Then the number of blocks that will be accessed is calculated based on the access length. If a write is being performed, pram_alloc_blocks() (described above) is called to allocate the blocks needed for the write.

With the data blocks now available, pram_direct_IO() executes a while loop to transfer all requested bytes to/from the user buffer from/to the inode's data blocks. At every loop iteration, either the remainder of the data is transferred, or an entire block-size chunk is transferred. At the start of each iteration, pram_find_data_block() is called to translate the file block number to a logical block number.
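The shape of this transfer loop can be sketched in user space as below. It is an illustration, not the pram_direct_IO() code: the file is a tiny fixed array of blocks, toy_find_data_block() is a hypothetical identity-mapped stand-in for pram_find_data_block(), and plain memcpy() replaces the copies to and from user space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BS   8u /* tiny block size, for illustration only */
#define TOY_NBLK 4u /* blocks in the toy file */

struct toy_file {
    uint8_t blocks[TOY_NBLK][TOY_BS];
};

/* Stand-in for pram_find_data_block(): file block number to block address. */
static uint8_t *toy_find_data_block(struct toy_file *f, size_t file_blk)
{
    return file_blk < TOY_NBLK ? f->blocks[file_blk] : NULL;
}

/*
 * Transfer len bytes between buf and the file, starting at byte offset pos.
 * A nonzero write copies buf into the file; zero copies the file into buf.
 * Returns the number of bytes actually transferred.
 */
size_t toy_direct_io(struct toy_file *f, int write, uint8_t *buf,
                     size_t pos, size_t len)
{
    size_t blk = pos / TOY_BS; /* first file block touched */
    size_t off = pos % TOY_BS; /* byte offset within that block */
    size_t done = 0;

    assert(f != NULL && buf != NULL);
    while (done < len) {
        uint8_t *bp = toy_find_data_block(f, blk);
        size_t chunk = TOY_BS - off; /* at most up to the end of the block */

        if (bp == NULL)
            break; /* past the end of the toy file */
        if (chunk > len - done)
            chunk = len - done; /* remainder of the request */
        if (write)
            memcpy(bp + off, buf + done, chunk);
        else
            memcpy(buf + done, bp + off, chunk);
        done += chunk;
        blk++;
        off = 0; /* later blocks are accessed from their start */
    }
    return done;
}
```

Only the first chunk can start in the middle of a block; every later iteration moves either a full block or the remainder of the request.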

Note how these methods are written such that all accesses to objects in the PRAMFS completely bypass the page and buffer caches. Data moves directly from the user buffer to PRAMFS data blocks, with the file data never existing in any intermediate kernel buffers or caches.

User Interface

PRAMFS currently has one required mount option and several optional ones:

  • The mount option "physaddr=" is a required option. This tells PRAMFS the physical address of the start of the RAM that makes up the filesystem.

  • The mount option "init=" is optional, and is used to initialize an empty filesystem. Any data in an existing filesystem will be lost if this option is given. The parameter to "init=" is the RAM size in bytes.

  • The mount option "bs=" is optional, and is used to specify a block size. It is ignored if the "init=" option is not specified, since otherwise the block size is read from the PRAMFS super-block. The default blocksize is 2048 bytes, and the allowed block sizes are 512, 1024, 2048, and 4096.

  • The mount option "bpi=" is optional, and is used to specify the bytes per inode ratio, i.e. for every N bytes in the filesystem, an inode will be created. This behaves the same as the "-i" option to mke2fs. It is ignored if the "init=" option is not specified.

  • The mount option "N=" is optional, and is used to specify the number of inodes to allocate in the inode table. If the option is not specified, the bytes-per-inode ratio is used to calculate the number of inodes. If neither the "N=" nor the "bpi=" option is specified, the default behavior is to reserve 5% of the total space in the filesystem for the inode table. This option behaves the same as the "-N" option to mke2fs. It is ignored if the "init=" option is not specified.

  • The mount option "errors=" is optional, and is used to specify the filesystem's behaviour when an error occurs. It can be "cont", "remount-ro", or "panic". With "cont", no action is taken on error. With "remount-ro", the filesystem is remounted read-only. With "panic", a kernel panic is triggered. The default action is to continue on error.

  • The mount options "acl/noacl" are optional. They are used to enable/disable support for access control lists (disabled by default).

  • The mount options "user_xattr/nouser_xattr" are optional. They are used to enable/disable support for user extended attributes (disabled by default).

  • The mount option "noprotect" is optional. It is used to disable the memory protection (enabled by default).

  • The mount option "xip" is optional. It is used to enable execute-in-place (disabled by default).

Example:

mount -t pramfs -o physaddr=0x20000000,init=1M,bs=1k none /mnt/pram

This example locates the filesystem at physical address 0x20000000, and also requests an empty filesystem be initialized, of total size 1048576 bytes and blocksize 1024. The mount point is /mnt/pram.

mount -t pramfs -o physaddr=0x20000000 none /mnt/pram

This example locates the filesystem at physical address 0x20000000 as in the first example, but uses the intact filesystem that already exists.

Acceptance Criteria

The following operations should be verified on a mounted PRAMFS:

  • A mounted PRAMFS should pass the Bonnie++ benchmark tests. In this example bonnie++ command, it is assumed that the total PRAMFS filesystem is at least 1MB in size, and that there are at least 2048 inodes available:
    bonnie++ -u root -s 1 -r 0 -n 2 -d /mnt/pram
  • Errant writes by the kernel into any area within the PRAMFS memory should cause a kernel page fault exception, and should not corrupt the filesystem. This can be tested with a test module, which can be compiled into the kernel by enabling the option "PRAMFS Test" in the kconfig menu. The module attempts a write within the PRAMFS memory, which should cause a kernel page protection fault ("Unable to handle kernel paging request at virtual address ..."). Then reboot the system, remount the PRAMFS filesystem, and verify that no filesystem data has been corrupted.

Additional Information

Architecture-dependent Requirements for PRAMFS

Like the other filesystems, the source code under fs/pramfs is architecture independent, and simply needs to be recompiled for a specific architecture. However, in kernel version 2.4, PRAMFS does require a few kernel services that are architecture dependent in order to support the write protection feature:

  • __ioremap_readonly()

This is a new function that is identical to the existing __ioremap() method in every respect, except that the page table entries that map the IO memory must be marked read-only. The method is used by PRAMFS to initially map the PRAMFS memory as read-only at mount time.

  • set_memory_ro(), set_memory_rw()

For PRAMFS, these already existing routines are needed to turn on/off the memory protection. Some architectures already allow this, such as x86.

Benchmark Results

Here we show some experimental results with PRAMFS.
The following graphics show the effect of using the XIP feature. This test was performed with the bonnie++ benchmark in a real embedded environment.

You can download the complete data here and here.

The following graphics show the differences between a ramdisk with ext2 and PRAMFS. This time we worked in an emulated environment. The absolute values are not very important; what matters is the relative difference between the two filesystems.

In this case we can see the overhead of applying unnecessary disk policies to RAM. This shows how and why PRAMFS matters in an environment such as RAM, which has completely different rules and constraints compared with disks.

You can download the complete data here and here.

Known Problems

1. For small PRAMFS mounted filesystems, bonnie++ fills up the filesystem before the tests complete. This is not a bug in PRAMFS, but rather a limitation of the bonnie++ program.