
THE NETWORK RAMDISK

1. Introduction

Reliable and efficient data storage is a major concern in the modern computer industry. This type of storage is mainly provided by magnetic and optical devices, the most common being traditional magnetic disks. However, since processor performance improves at a higher rate than disk performance, the cost of a disk access (measured in processor cycles) continues to increase with time. As a result, applications whose performance depends on low-latency access to their files suffer an increasingly high penalty if they store their data on traditional magnetic disks. Such applications include transaction-based systems, visualization systems, web servers, web proxies, compilers, etc.

Several file systems attempt to speed up file accesses by using a portion of the file server's main memory as a disk cache. Unfortunately, such caches cannot be larger than a single workstation's main memory. To overcome this size limitation, research file systems use all client and server caches in a NOW (Network of Workstations) as a single cooperative cache. Unfortunately, such file systems are not widespread yet, and their principles have not been incorporated into commercial operating systems.

Here a new device, the Network RamDisk, is introduced, which attacks the high-latency problem in a very convenient manner. A Network RamDisk is a block device that consists of all the idle main memories in a Network of Workstations. It behaves like any other disk, allowing the creation of files and file systems on top of it. However, since it is implemented in main memory, it provides lower latency and higher bandwidth than most traditional magnetic disks.

In this paper, organizing the client and server memories of a NOW into a Network RamDisk is proposed. Idle workstations donate their memory, which is used to store a portion of the Network RamDisk. When an application reads a file, the file system requests a block from the device driver of the Network RamDisk. The driver knows in which workstation's main memory the block resides, asks that workstation for the block, and returns it. Write operations proceed in the same way. Since block accesses involve only memory and interconnection network transfers, they proceed with low latency and high bandwidth. For example, typical disk latency is around 10 ms, while typical network latency is around 1 ms, and modern interconnection networks provide latency as low as a few microseconds. Thus, a Network RamDisk may result in significant performance improvements over magnetic disks, especially when application performance depends on latency.
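To make this flow concrete, here is a minimal C sketch of the read path just described. All names (block_entry, block_table, request_from_server, nrd_read_block) are hypothetical illustrations, not taken from the actual driver source:

/* Sketch of the NRD client's read path (hypothetical names).
 * The driver keeps a table mapping each logical block to the
 * server that currently stores it. */
typedef struct {
    unsigned char server;      /* which NRD server holds the block */
    unsigned int  remote_blk;  /* block number on that server      */
} block_entry;

extern block_entry block_table[];                  /* one entry per block */
extern int request_from_server(int server, unsigned int remote_blk,
                               void *buf, int len); /* network transfer   */

/* Read one logical block on behalf of the filesystem. */
int nrd_read_block(unsigned int blkno, void *buf, int blk_size)
{
    block_entry *e = &block_table[blkno];
    /* Only a memory lookup plus a network round-trip: no disk seek. */
    return request_from_server(e->server, e->remote_blk, buf, blk_size);
}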

The Network RamDisk, much like network memory file systems, exploits network memory to avoid magnetic disk I/O; but unlike those file systems, the Network RamDisk, being a device driver, can be easily incorporated into any environment without modifying existing operating system sources or changing file systems. Thus, by using a Network RamDisk, users can exploit all network memory to store their files without changing their operating system or their file system.

2. Concept of Network RAM

The aggregate main memory of the workstations in a cluster is called Network RAM. A significant amount of these resources is unused at any given time: 80-90% of workstations are idle in the late afternoon and evening, about one third of workstations are completely unused even at the busiest times of the day, and 70-85% of the Network RAM in a cluster is idle.

The main goal of Network RAM is to improve the performance of memory-intensive workloads by paging to idle memory over the network rather than to disk.


Major assumptions for evaluating Network RAM :-

There are four major assumptions that people make when evaluating Network RAM:

1. Disks are slow.

2. Networks are fast.

3. Most machines have free memory.

4. Software overhead is low.

Possible Network RAM Implementations :-

1. Explicit Program Management :-

This implementation requires user code to explicitly move data back and forth between the disk and the network. The idea requires substantial code modification.

2. User-Level :-

This solution requires that users modify their code to use a new malloc which allocates from Network RAM-backed memory. This is an easier modification than explicit program management, but it can still require source code changes. It has the overhead of handling interrupts on page faults and managing the page tables through system calls.

3. User-Level Pager :-

This solution is possible in operating systems that allow a user-level program to handle moving pages to and from secondary storage.

4. Device Driver :-

The Network RamDisk is implemented as a device driver. It is a block device that consists of all the idle main memories in a Network of Workstations. It behaves like any other disk, allowing the creation of files and file systems on top of it, and it provides lower latency and higher bandwidth than most traditional magnetic disks. This solution has the advantage that it requires no user code or kernel source modification. Its disadvantage is that OS pages can be sent over the network, making reliability more important.

5. Modified Kernel :-

This solution should give the highest performance without modifying user programs, because it requires no extra interrupts or context switches. However, changes to kernels are not portable between architectures.

6. Network Interface :-

This solution requires replacing the memory controller with a custom memory/network chip, making it the least portable solution. It can have lower overhead than a modified kernel because it can operate on cache lines rather than entire pages, making the data transferred smaller.

In this paper, the device driver implementation of Network RAM is discussed.

3. The Design of a Network RamDisk

The Network RamDisk (NRD) is a scalable, distributed, low-latency storage system. It consists of one machine that functions as the NRD client and a number of workstations that function as NRD servers.

Server and Client Workstations :-

The workstations that participate in the Network RamDisk by letting a certain amount of their memory be used are known as NRD servers, while the workstation that can access the Network RamDisk and the filesystems on it is called the NRD client. Any workstation in a NOW may act as a server, making a selectable amount of its memory available for the Network RamDisk, depending on its workload and memory capacity. The client machine must have the NRD device driver linked into its operating system and must be connected over an interconnection network to the server workstations. The Network RamDisk then functions as a normal disk mounted on the client host, and any computer on the network can access the NRD by mounting it from the client. A machine may also function as both a client and a server.

Device Operation :-

All server workstations run an NRD server process, which stores a number of disk blocks and handles all block read and write requests. The NRD client accepts read and write requests like a normal disk; when it needs to read or write a number of blocks, it searches its block table to find out on which server the corresponding blocks reside and sends requests to that server.

If memory-demanding local processes start on a server workstation, part of the server's virtual memory is swapped to disk, and future requests will be serviced from the disk, degrading performance. The server handles this problem by sending a message to the client notifying it of its memory-loaded state. The client then tries to find another server with enough free memory and migrates the disk blocks that were stored on the loaded server to the new one. If this is not possible, delays in requests to the loaded server are inevitable.

Reliability Issues :-

In a distributed system, a workstation may crash at any time (e.g. due to software or human error). If the crashed workstation acts as an NRD server, the client's disk blocks stored on it are lost. Clearly, it is not acceptable for applications running on the client workstation to lose their data files due to a remote server crash; instead, we would like to be able to recover their disk blocks. Otherwise, a remote server crash is certain to cause not only an application crash and important data loss, but a client crash as well, especially when the Network RamDisk is used to store the swap space of applications.

If the crashed workstation acts only as a client, then after recovering from the crash it can reconnect to the server hosts and find the Network RamDisk in the same state its magnetic disks would be in after a crash. Running fsck will correct any inconsistency problems.

There are many types of crashes. First of all, there may be machine crashes due to loss of power. This situation is not addressed in this paper, since most computer buildings are equipped with UPSs. The most frequent cause of a crash is a software crash. To avoid loss of data due to a server crash, some systems write all network memory data to the disk as well. Instead, we design a reliable Network RamDisk that is able to reconstruct the lost disk blocks when one server fails.

To provide this level of reliability, some form of redundancy must be used. The main issues that must be taken into account regarding the form of redundancy used are:

1. The runtime overhead introduced must be minimal since it is a cost paid even when no server crashes.

2. The memory overhead introduced must be as low as possible because the memory reserved for reliability could be used locally at each workstation in order to store its memory pages.

3. The runtime overhead of recovery from failure should be low.

Different Reliability Policies :-

1. Mirroring :-

The simplest form of redundancy is mirroring. In mirroring, there exist two copies of each disk block (block replication). When the client writes a disk block, the block is sent to two different servers, and the block table is updated. Thus, even when one of the servers crashes, the filesystem is still stable and consistent, because all disk blocks of the crashed server exist on the mirror server. Obviously the crash recovery overhead, in case of mirroring, is minimal. However, the runtime overhead is rather high, since each block write requires two block transfers. To make matters worse, mirroring wastes half of the remote memory used.
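As a rough sketch of the mirrored write path (the helper send_block and the function name are hypothetical):

/* Sketch of a mirrored block write (hypothetical helper names).
 * Each block is sent to two different servers, so one server
 * crash never loses data, at the cost of two transfers per write. */
extern int send_block(int server, unsigned int blkno,
                      const void *buf, int len);

int nrd_mirror_write(unsigned int blkno, const void *buf, int len,
                     int primary, int mirror)
{
    if (send_block(primary, blkno, buf, len) < 0)
        return -1;
    return send_block(mirror, blkno, buf, len);  /* second copy */
}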

2. Basic Parity Method :-

To reduce the main-memory waste caused by mirroring, we use parity-based redundancy schemes much like the ones used in RAIDs. Suppose, for example, that we have S servers, each holding P pages, where page(i,j) is page j on server i. We also keep P parity pages, where parity page j is formed by taking the XOR of page j on all servers; these pages belong to the same parity group. If a server crashes, all its pages can be restored by XORing the surviving pages within each parity group with the corresponding parity page.
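A minimal C sketch of the recovery step, assuming page-sized byte arrays (the function and parameter names are illustrative, not from the actual implementation):

/* Sketch of parity-group recovery in the basic parity method.
 * parity[j] = page(1,j) XOR page(2,j) XOR ... XOR page(S,j),
 * so a crashed server's page j is the XOR of parity page j
 * with page j of every surviving server. */
#include <stddef.h>

void recover_page(unsigned char *lost,         /* output: restored page */
                  unsigned char **survivors,   /* page j of each live server */
                  int nsurvivors,
                  const unsigned char *parity, /* parity page j */
                  size_t page_size)
{
    for (size_t b = 0; b < page_size; b++) {
        unsigned char x = parity[b];
        for (int i = 0; i < nsurvivors; i++)
            x ^= survivors[i][b];
        lost[b] = x;                           /* XOR cancels all known pages */
    }
}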

When the client swaps out a page it has to update the parity to reflect the change. This update is done in two steps:

1. The client sends the swapped out page to the server, which computes the XOR of the old and the new page.

2. The server sends the just computed XOR to the parity server, which XORs it with the old parity, forming the new parity.

Unfortunately, this method involves two page transfers: one from the client to the server, and one from the server to the parity server. Moreover, the client must not discard the page just swapped out, because the server may crash before the new parity is computed, making it impossible to restore the page. This parity method increases the amount of remote main memory needed only by a factor of ( 1 + 1 / S ), minimizing the memory overhead, but it still imposes a significant runtime overhead.
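Since XOR is its own inverse, the update amounts to new_parity = old_parity XOR old_page XOR new_page. A minimal sketch of the two XOR steps (illustrative names only):

/* Sketch of the basic parity update. The storage server computes
 * delta = old_page XOR new_page and forwards it; the parity server
 * XORs the delta into the stored parity, forming the new parity. */
#include <stddef.h>

void make_delta(unsigned char *delta, const unsigned char *old_page,
                const unsigned char *new_page, size_t n)
{
    for (size_t i = 0; i < n; i++)
        delta[i] = old_page[i] ^ new_page[i];
}

void apply_delta(unsigned char *parity, const unsigned char *delta, size_t n)
{
    for (size_t i = 0; i < n; i++)
        parity[i] ^= delta[i];    /* forms the new parity in place */
}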

3. Parity Logging or Parity Group :-

To avoid the additional page transfers induced by the basic parity method, a parity logging scheme is developed. The key idea is that a given page need not be bound to a particular server or parity group; instead, every time a page is paged out, a new server and a new parity group may be used to host it.

Suppose the client uses S servers. Each paged-out page is XORed into a page-sized buffer maintained by the client (initially filled with zeros) and then transferred to a server following a round-robin policy. Whenever S pages have been transferred, the buffer is also transferred to a parity server. Using this technique, the runtime overhead is minimal, since each paged-out page requires only ( 1 + 1 / S ) page transfers. When a server crashes, all of its pages can be restored by XORing the pages in their parity group with the corresponding parity page.
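A minimal C sketch of this client-side logging loop, assuming fixed-size pages and hypothetical send_page/send_parity helpers:

/* Sketch of the parity-logging client (hypothetical helpers).
 * Pages go to servers round-robin; the client XORs every page into
 * a running parity buffer and flushes that buffer to the parity
 * server after every S pages, i.e. (1 + 1/S) transfers per page. */
#include <string.h>
#include <stddef.h>

#define S          4              /* number of storage servers (example) */
#define PAGE_SIZE  4096

extern int send_page(int server, const unsigned char *page, size_t n);
extern int send_parity(const unsigned char *parity, size_t n);

static unsigned char parity_buf[PAGE_SIZE]; /* starts zero-filled */
static int pages_out;                       /* pages since last flush */

void page_out(const unsigned char *page)
{
    for (size_t i = 0; i < PAGE_SIZE; i++)
        parity_buf[i] ^= page[i];           /* fold page into the parity */

    send_page(pages_out % S, page, PAGE_SIZE);

    if (++pages_out % S == 0) {             /* S pages form one parity group */
        send_parity(parity_buf, PAGE_SIZE);
        memset(parity_buf, 0, PAGE_SIZE);   /* start a new group */
    }
}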

Every time a page is paged out again, it is marked as inactive in the old parity group containing it. When all the pages of a parity group are marked inactive, the memory server pages and the corresponding parity page can be reused. Each memory server must have some extra overflow memory to support parity logging, since many versions of a given page may be present simultaneously in the servers' memory. Because of this, it is also possible that some server runs out of memory; in this case, one has to perform garbage collection, freeing parity groups by combining their active pages into new ones. A parity group is depicted in figure 3.1.

4. Parity Caching :-

A new reliability policy, called parity caching, is developed to reduce both the memory overhead and the latency of each I/O request. This policy uses a small amount of the NRD client's memory as a parity cache. The parity cache collects a fixed number of disk blocks, enabling the parity caching policy to use more than one block as the parity striping unit during normal operation, while still supporting recovery of each block unit in case of a server failure. Parity caching thus has much lower latency than the parity logging policy.

Suppose there exist one (reliable) client and S (unreliable) server machines, and each server can store B disk blocks. We need one server to keep the parity blocks (the parity server), while the other S-1 servers (called storage servers) store normal data blocks. Accordingly, the cache has S-1 buffers for the storage servers (storage buffers) and one for the parity server (the parity buffer), each buffer holding X blocks, so the cache is a total of ( X * S ) blocks (1 block = 512 bytes). Each buffer in the cache corresponds to exactly one server, meaning that the blocks placed in a particular buffer will be sent to and stored on its corresponding server. Such a block cache and the matching of buffers and servers are illustrated in figure 3.2 for S = 4 (3 storage servers and 1 parity server). When a write request of X blocks arrives from the filesystem, the client places the X blocks into one storage buffer, updating a map that records where each block is stored. When all S-1 storage buffers in the cache are full, the client computes the X parity blocks into the parity buffer, sends all the buffers away, each to its corresponding server, and then empties the cache.
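A minimal C sketch of this write path (the block size, X, S, and the send_buffer helper are illustrative assumptions, not the driver's actual code):

/* Sketch of the parity-caching write path (hypothetical names).
 * The client buffers X-block writes into S-1 storage buffers; when
 * all are full it computes the X parity blocks and flushes every
 * buffer to its fixed server in one batch. */
#include <string.h>
#include <stddef.h>

#define BLK  512                 /* block size in bytes */
#define X    4                   /* blocks per buffer (striping unit) */
#define S    4                   /* servers: S-1 storage + 1 parity */

extern int send_buffer(int server, const unsigned char *buf, size_t n);

static unsigned char cache[S][X * BLK];   /* buffer S-1 is the parity buffer */
static int next_buf;                      /* next free storage buffer */

void write_blocks(const unsigned char *blocks /* X*BLK bytes */)
{
    memcpy(cache[next_buf++], blocks, X * BLK);

    if (next_buf == S - 1) {              /* all storage buffers full */
        memset(cache[S - 1], 0, X * BLK);
        for (int b = 0; b < S - 1; b++)   /* compute the X parity blocks */
            for (size_t i = 0; i < X * BLK; i++)
                cache[S - 1][i] ^= cache[b][i];
        for (int srv = 0; srv < S; srv++) /* batch flush, one per server */
            send_buffer(srv, cache[srv], X * BLK);
        next_buf = 0;                     /* empty the cache */
    }
}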

5. Adaptive Parity Caching :-

Another problem is the issue of redundancy cleaning. Because a block does not have a fixed position but can be written anywhere on any server, each time a block is overwritten the old copy must be invalidated. The invalid block cannot be reclaimed until all the data blocks in its parity group are invalid, because it is needed in case one of the blocks in its parity group is lost. Similar behavior is encountered in the Log-structured File System (LFS), and it raises the same questions about the cost of redundancy cleaning.

Based on our knowledge of disk devices and filesystems, we believe that blocks whose block numbers are contiguous belong either to the same file or to files that are part of a data stream, and are likely to be written, read, or rewritten together. If rewritten blocks are placed in the same parity group, the redundancy cleaning of the device becomes automatic. Using this observation, we extend the parity caching policy to an adaptive parity caching policy (figure 3.3).

The adaptive caching policy uses P parity caches like the one described above, and calls each cache a stream cache, because it tries to capture a single data stream. Figure 3.3 shows an adaptive parity cache with 3 stream caches (A, B, C) and 1 victim stream cache (V).

So when the client has to choose a stream cache for placing the X blocks, it performs the following placement (a code sketch follows the list):

1. If all stream caches are empty, place the blocks in the first buffer of the first stream cache.

2. If there is at least one non-empty stream cache :

1. Search the non-empty stream caches to see if the X blocks are contiguous to a buffer already in a stream cache; if so, place the blocks into the first free buffer of that cache.

2. Otherwise, if there is an empty stream cache, place the blocks into its first buffer.

3. If none of the above match, place the blocks into the first free buffer of the victim (V) cache.

To avoid waiting indefinitely for non-contiguous blocks, there is a stream cache timeout: if the victim cache fills twice while some stream caches have not yet filled, the client fills the remaining buffers of each such stream cache with zeros and processes the cache as if it were full.
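As a rough illustration, the placement decision can be sketched in C as follows (the names and fields are hypothetical, and the timeout handling described above is omitted):

/* Sketch of the stream-cache placement decision (hypothetical names).
 * A write of X contiguous blocks goes to the stream cache whose last
 * buffer it extends; otherwise to an empty stream cache; otherwise
 * to the victim cache. */
#define P 3              /* number of stream caches */

typedef struct {
    int last_block;      /* last block number buffered, -1 if empty */
    int nbufs;           /* buffers currently filled */
} stream_cache;

stream_cache streams[P], victim;

/* Returns the cache chosen for a write of X blocks starting at first_blk. */
stream_cache *choose_cache(int first_blk)
{
    for (int i = 0; i < P; i++)             /* contiguous to a stream? */
        if (streams[i].nbufs > 0 &&
            streams[i].last_block + 1 == first_blk)
            return &streams[i];
    for (int i = 0; i < P; i++)             /* any empty stream cache? */
        if (streams[i].nbufs == 0)
            return &streams[i];
    return &victim;                         /* no match: victim cache */
}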

An example of the above policy is illustrated in figures 3.4 and 3.5. Assume that there are S = 4 servers and that each buffer is X = 4 blocks in size. The filesystem makes the following write requests (in block numbers): (10,11,12,13), (30,31,32,33), (50,51,52,53), (70,71,72,73), (90,91,92,93), (34,35,36,37), (14,15,16,17), (54,55,56,57), and (74,75,76,77). The cache would then have the blocks placed as shown in figure 3.4.

Phase 1 : Stream cache A accumulates blocks 10-17, stream cache B blocks 30-37, stream cache C blocks 50-57, and stream cache V (the victim stream cache) blocks 70-73, 90-93 and 74-77.

The next write requests are: (18,19,20,21), (94,95,96,97), (38,39,40,41), and (58,59,60,61), leaving the cache in the state of figure 3.5.

Phase 2 : Stream cache A accumulates blocks 10-21, stream cache B blocks 30-41, stream cache C blocks 50-61, and stream cache V blocks 94-97. Each of the first three caches has captured a single block stream.

It is clear that this placement policy creates many parity groups of contiguous blocks. Even when the filesystem writes many different block streams, each stream is placed in its own stream cache. Moreover, the policy degrades gracefully when more than 4 different streams are written together. In the above example, 5 streams were written; the cache managed to capture 3 of them and mixed the last two in the best possible manner.

4. Implementation

The Network RamDisk consists of a client issuing byte/block read and write requests and remote server processes satisfying these requests. Fully operational prototypes of the proposed system have been built, with clients on Linux 2.0.33 and Digital Unix 4.0, and servers running on Digital Unix 4.0. The reliability schemes mentioned previously were fully implemented, tested, and measured on the Linux NRD client.

The Network RamDisk Client :-

The NRD client is a disk device driver that handles all read and write requests. In order to service these requests, it may forward them to user-level NRD servers running on remote machines. This design minimizes the modifications needed to port the system to another operating system and avoids modifications to the operating system kernel.

When the NRD client starts, it passes through a configuration phase where it finds out which servers it will use, how many blocks each server will store, and which reliability or caching policy will be used, and then initializes its data structures. The NRD client keeps two maps in memory: a block table with the location of each block, and a bitmap marking the full or empty blocks on each server. The size of these maps is quite small. For example, with 1024-byte blocks and a maximum of 256 servers, we need 5 bytes for the location of each block (1 byte for the server and 4 bytes for the block number on that server), so for a 256 MB NRD the block table is only 1.25 MB.
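The 5-byte entry and the table size can be pictured as follows (an illustrative sketch, not the driver's actual data layout):

/* Sketch of a 5-byte block-table entry as described above
 * (illustrative only, not the driver's actual layout). */
#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    uint8_t  server;      /* 1 byte: up to 256 servers */
    uint32_t remote_blk;  /* 4 bytes: block number on that server */
} nrd_entry;              /* 5 bytes per block */
#pragma pack(pop)

/* A 256 MB NRD with 1024-byte blocks has 256 * 1024 = 262144 blocks,
 * so the block table is 262144 * 5 bytes = 1.25 MB. */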

After the configuration phase, the NRD client connects to the NRD servers and functions as a normal disk, accepting block I/O requests from any filesystem created on it. Depending on the reliability policy used, the client issues block read/write requests to the servers using TCP/IP sockets over the network interface(s) installed on the NRD client machine. Security is ensured by allowing access to the device only to the superuser and by using privileged ports for the communication between the NRD client and the servers.

The operating system is not aware that remote main memory is used instead of a magnetic disk as the block I/O device. It simply performs ordinary block I/O through the Virtual File System (VFS) using what it considers a disk, mainly because the device driver responds to all the normal ioctl() commands of an ordinary magnetic disk. The disk's geometry (capacity, cylinders, tracks, sectors, etc.) is customizable through the driver's ioctl interface. A disk type entry is also added to the /etc/disktab (or /etc/fstab) file of NRD client machines, so that the disklabel (or fdisk) command can identify the NRD as a disk. The number of blocks transferred with each request to the NRD server is also a matter of disk geometry and can be adapted to the current interconnection network, so that performance is optimal.

The filesystem used on the Network RamDisk is the classic UFS (Unix File System) on Digital Unix (which exists on most Unix-based operating systems), and the Ext2 filesystem on Linux. These do not take full advantage of the disk being a Network RamDisk; however, the main aim is to measure the performance resulting purely from the device, not from the filesystem on it. Moreover, using a common filesystem allowed the ordinary system administration tools for disks (disklabel, newfs, fdisk, mkfs, mount/umount, fsck, df, etc.) to be used to install, create, and test the filesystem on the device.

The current implementation of the Digital Unix Network RamDisk client contains only an unreliable version, and runs on top of a low-bandwidth 10 Mbps Ethernet.

The current Linux Network RamDisk client implementation contains three different reliability/caching policies:

1. A simple unreliable version similar to the Digital Unix implementation.

2. An unreliable version, which uses caching of blocks and large batched network transfers to reduce network latency. This version has the optimum performance.

3. An adaptive parity caching version, which also performs quite well.

All these versions of the Linux Network RamDisk client have been tested on top of a 155 Mbps ATM network. Lower latency and higher bandwidth networks like Gigabit ATM should offer even more promising performance, especially when faster communication protocols are used.

The Network RamDisk Servers :-

The Network RamDisk server is a user-level program listening on a socket and accepting connections from NRD clients. Each client is served by a separate instance of the server, which uses a portion of the local workstation's main memory to store the client's disk blocks. When the client makes a read request, the server transfers the requested block(s) over the socket. When the client makes a write request, the server reads the incoming block(s) from the socket and stores them in its main memory. The server is also responsible for periodically (or on demand) providing the client with information about its memory load. An NRD parity server process is in no way different from a storage NRD server process: it simply services block read and write requests without knowing whether it stores data blocks or parity blocks.
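A minimal sketch of such a server's request loop is shown below. The wire format and names are hypothetical assumptions; a real server would also handle partial TCP reads, request framing, byte order, and the periodic load reports mentioned above:

/* Minimal sketch of an NRD server's request loop (hypothetical wire
 * format: a 1-byte opcode and a 4-byte block number, followed by the
 * block data for writes; partial reads are ignored for brevity). */
#include <stdint.h>
#include <unistd.h>

#define BLK      1024
#define NBLOCKS  (64 * 1024)               /* 64 MB of donated memory */

enum { OP_READ = 0, OP_WRITE = 1 };

static unsigned char store[NBLOCKS][BLK];  /* the donated main memory */

void serve(int sock)
{
    uint8_t  op;
    uint32_t blkno;

    while (read(sock, &op, 1) == 1 &&
           read(sock, &blkno, 4) == 4 && blkno < NBLOCKS) {
        if (op == OP_READ)                 /* send the stored block */
            write(sock, store[blkno], BLK);
        else if (op == OP_WRITE)           /* overwrite with new data */
            read(sock, store[blkno], BLK);
    }
}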

5. Experiments and Performance

To test the operability of the Network RamDisk prototype described above, evaluate its performance, and compare it to traditional disk I/O, a series of performance measurements was run using representative benchmarks and applications that depend in various ways, and to various degrees, on disk I/O performance. The applications include searching the files on the disk for a particular string using find and grep, and using tar to unpack the archive file of the GNU C compiler version 2.7.2.

The experiments were run on the Digital Unix client, which was built first and implements only an unreliable NRD.

The Digital Unix NRD Client :-

Experimental Environment :-

All applications and benchmarks for the Digital Unix NRD client were executed from user level on a DEC Alpha 3000 model 300 with 32 MB of main memory running Digital Unix 4.0, and were compiled with the standard system C compiler (cc). The workstations that contributed their main memory for storing the Network RamDisk blocks were also DEC Alpha 3000 model 300 machines with 32 MB of main memory, connected via standard 10 Mbit/s Ethernet. In all experiments the amount of idle memory was larger than the amount needed for storing the Network RamDisk blocks, and it was equally distributed among all workstations. During the tests the Ethernet was kept as idle as possible, in order to obtain as much as possible of the 10 Mbit/s of available bandwidth. The local disk used for testing and comparison was a DEC RZ55, providing 10 Mbit/s bandwidth and an average seek time of 16 msec. Although both the disk and the interconnection network are slow, they have comparable bandwidth.

Performance of “find” and “grep” :-

The file read performance of the Network RamDisk is measured using the find Unix command. First, a directory tree, which du reported to be 28 MBytes in size and to contain 1075 files, was copied onto the two devices: the Network RamDisk and the magnetic disk. Then the filesystem cache was cleared of the recently written blocks by reading the whole directory tree of /usr, which is 280 MBytes large. A simple shell script then searched all the files in the directory tree for a non-existent string using find and grep. The results, shown in Table 1, indicate that the search takes 129.8 seconds on the magnetic disk, while on the Network RamDisk it takes only 103.6 seconds; that is, the application runs about 25% faster ((129.8 - 103.6) / 103.6 = 25%) on top of the NRD.

Using “tar” to unpack files :-

To measure the write performance of real applications, the Unix tar command is used to unpack many files from a single archive. Note that tar does not compress the files; it simply packs a whole directory tree into a single file. The directory tree contained in the tarfile was 28 MBytes in size and had 1075 files. The command was timed while unpacking the archive onto the Network RamDisk and onto the magnetic disk; Table 2 presents the results. The application runs 79% faster on the NRD than on the magnetic disk. This performance advantage can be exploited by applications that perform many sequential file writes, such as visualization tools, proxies, and databases.

6. Applications

1. Applications of Network RAM in Memory Paging:-

Remote memory paging over Network RAM:

1. Is faster than the disk.

2. Can benefit all memory-intensive applications.

3. Is useful even for mobile or portable computers, where storage space is limited.

2. Applications of Network RAM in File System:-

1. Use Network RAM as a file system cache, or directly as a faster-than-disk storage device for file I/O.

2. Using Network Memory as a File cache:-

· All file systems use a portion of the workstation’s memory as a file system cache.

· Two problems:

1. multiple cached copies of the same file waste local memory

2. each workstation’s cache has no knowledge of files cached in other workstations’ memories

· Improve the file system’s caching performance by creating a global network memory file system cache: “cooperative caching”.

3. Applications of Network RAM in Database:-

Transaction-Based Systems:-

· Substituting disk accesses with Network RAM accesses reduces the latency of a transaction.

· A transaction makes a number of disk accesses to read data, performs some designated calculations on that data, writes its results to the disk, and finally commits.

· Network RAM can be used to increase a transaction-based system’s performance in two ways:

Approaches:-

1. At the initial phase of a transaction, read requests from the disk are served through the global file system cache.

2. Speed up synchronous write operations to reliable storage at transaction commit time.

7. Advantages

1. The disk I/O performance problem can be reduced.

2. Reliable network memory can be used directly for temporary files.

3. Can be easily implemented without modifying the OS (as a device driver).

4. Can be accessed by any file system.

5. Requires no application code or kernel source modification.

Disadvantages:-

1. Disk blocks, rather than memory pages, are sent to remote memory.

2. Disk blocks are much smaller than memory pages (512 or 1024 bytes) and on many OSs variable in size (from 512 bytes to 8 KBytes).

3. OS pages can be sent over the network, making reliability more important.

8. Conclusion

In this paper a new device, the Network RamDisk, is introduced. This is a block storage device that consists of the (idle) main memories of workstations within a (heterogeneous) workstation cluster.

The contributions of this report are:

· How to build a remote memory disk storage system. The major advantage of this design is the portability it offers: the device driver can be installed on any Digital Unix or Linux system and can exploit the main memory of any workstation supporting TCP/IP.

· A new adaptive parity reliability strategy for distributing disk blocks to many servers, which effectively reduces the time spent on redundancy cleaning, thus providing reliability at a very low performance cost.

· Storing disk blocks to remote memory results in substantial performance improvements over the local magnetic disk, mostly due to latency reduction.

· Making remote memory available to every application through the common filesystem structures.

Based on our implementation and our performance results we conclude:

· Storing data on the Network RamDisk results in significant performance improvements over storing them on a traditional magnetic disk. Applications that use the Network RamDisk show performance improvements of up to 412%, even when running on top of traditional Ethernet technology.

· Organizing remote memory as a Network RamDisk under an ordinary file system is an inexpensive way to let applications use the memory of any workstation in a heterogeneous cluster. Everyday applications can use remote memory for their data files, without modifying their code, under any Unix filesystem.

· Remote memory data storage provides good performance and reliability with almost no extra hardware support.

· The benefits of storing data blocks to remote memory will only increase with time. Current architecture trends suggest that the gap between processor and disk speed continues to widen. Disks are not expected to provide the latency needed by several applications unless a breakthrough in disk technology occurs. On the other hand, interconnection network bandwidth keeps increasing (and network latency decreasing) at a much higher rate than (single) disk bandwidth, thereby increasing the performance benefits of the remote memory disk storage.

So, using remote memory as a Network RamDisk is a cost-effective and performance-effective way to execute I/O intensive applications on a network of heterogeneous workstations.

