
High Performance Computing

1. Introduction to High Performance Computing

Since its inception, the computer industry has been driven by an endless quest for more and more computing power. Computer users have an insatiable appetite for resources: more memory, faster processors, and faster peripherals. However much computing power there is, it is never enough. This seminar discusses the various technologies employed to achieve High Performance Computing.

The traditional approach to achieve higher speeds was to run the system clock faster.

Unfortunately, we are beginning to hit some fundamental limits on clock speed. As we make our CPUs smaller, we hit another fundamental problem: heat dissipation. The faster a computer runs, the more heat it generates, and the smaller the computer, the harder it is to get rid of this heat. Already on high-end Pentium systems the CPU cooler is bigger than the CPU itself.

One approach to greater speed is through massively parallel computers. These machines consist of many CPUs, each of which runs at "normal" speed, but which collectively have far more power than a single CPU. Systems with around 1,000 CPUs are available commercially. Systems with a million CPUs are likely to be built in the next decade.

High-performance computers are increasingly in demand in the areas of structural analysis, weather forecasting, petroleum exploration, fusion energy research, medical diagnosis, industrial automation, remote sensing, military defense, genetic engineering, and socioeconomics, among many other scientific and engineering applications. Without such powerful computers, many of these advances cannot be made within a reasonable time period. Achieving high performance depends not only on using faster and more reliable hardware devices but also on major improvements in computer architecture and processing techniques.

1.1 Why more than one CPU?

Answering this question is important. Using 8 CPUs to run your word processor sounds a little like "overkill" -- and it is. What about a web server, a database, a rendering program, or a project scheduler? Maybe extra CPUs would help. What about a complex simulation, a fluid dynamics code, or a data mining application? Extra CPUs definitely help in these situations. Indeed, multiple CPUs are being used to solve more and more problems.

The question usually asked is: why do we need two or four CPUs? Could we not just wait until a faster chip arrives? There are several reasons:

1. Due to the use of multi-tasking Operating Systems, it is possible to do several things at once. This is a natural "parallelism" that is easily exploited by more than one low cost CPU.

2. Processor speeds have been doubling every 18 months, but what about RAM speeds or hard disk speeds? Unfortunately, these speeds are not increasing as fast as the CPU speeds. Keep in mind most applications require "out of cache memory access" and hard disk access. Doing things in parallel is one way to get around some of these limitations.

3. Predictions indicate that processor speeds will not continue to double every 18 months after the year 2005. There are some very serious obstacles to overcome in order to maintain this trend.

4. Depending on the application, parallel computing can speed things up by anywhere from 2 to 500 times (in some cases even more). Such performance is not available using a single processor. Even supercomputers that at one time used very fast custom processors are now built from multiple "commodity off-the-shelf" CPUs.

If you need speed -- whether due to a compute-bound problem, an I/O-bound problem, or both -- parallel computing is worth considering. Because parallel computing is implemented in a variety of ways, solving your problem in parallel will require some very important decisions to be made. These decisions may dramatically affect the portability, performance, and cost of your application.

Before we get technical, consider a familiar analogy for parallel computing: waiting in long lines at a store. Opening more checkout counters (adding more CPUs) allows the same crowd of customers to be served in far less time, provided the work can be divided among them.

1.2 Taxonomy of Parallel Computing Systems

One of the earliest attempts to classify computer systems on the basis of their architecture was made by Michael Flynn. Flynn classified all computer systems into the following types:

1. SISD

2. SIMD

3. MISD

4. MIMD

1. SISD: Single Instruction Stream, Single Data Stream. These are our traditional uniprocessor systems.

2. SIMD: Single Instruction Stream, Multiple Data Stream. These systems are used in specific scientific applications; examples are array processors and vector processors. All the processors are connected to a master unit. The master unit issues an instruction and all the processors execute this same instruction, each on its own data stream.

3. MISD: Multiple Instruction Stream, Single Data Stream. Such systems have not been implemented commercially, so there is little to discuss.

4. MIMD: Multiple Instruction Stream, Multiple Data Stream. Such systems are the norm in the world of parallel processing today. They can be further classified into multiprocessor systems, multicomputer systems, and distributed systems. Throughout this paper we are concerned only with such systems.

MIMD Architecture Systems:

Such systems are the true workhorses of the parallel computing world. They are classified as:

1. Multiprocessors.

2. Multicomputer Systems.

3. Distributed Systems.

2. Multiprocessor Systems:

A shared-memory multiprocessor is a computer system in which two or more CPUs share full access to a common RAM. A program running on any of the CPUs sees a normal virtual address space. The unusual property of this system is that a CPU can write some value into a memory word and then read the word back and get a different value (because another CPU has changed it). This property forms the basis of interprocessor communication: one CPU writes some data into memory and another one reads the data out.
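As a rough illustration of this style of communication, the sketch below uses two POSIX threads to stand in for two CPUs sharing RAM: one writes a value into a shared word and the other reads it out. The variable names and the use of a mutex and condition variable are illustrative choices, not something prescribed by the architectures discussed here.

/* Minimal sketch: two threads stand in for two CPUs sharing RAM.
 * One writes a value into a shared word, the other reads it out. */
#include <pthread.h>
#include <stdio.h>

static int shared_word = 0;          /* the "memory word" both CPUs see  */
static int ready = 0;                /* set when the writer has finished */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void *writer(void *arg)       /* "CPU 1" writes data into memory  */
{
    pthread_mutex_lock(&lock);
    shared_word = 42;
    ready = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *reader(void *arg)       /* "CPU 2" reads the data back out  */
{
    pthread_mutex_lock(&lock);
    while (!ready)
        pthread_cond_wait(&cond, &lock);
    printf("read %d from shared memory\n", shared_word);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t w, r;
    pthread_create(&r, NULL, reader, NULL);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}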

Multiprocessor Hardware Implementations:

1. Uniform Memory Access (UMA) multiprocessors.

2. Non Uniform Memory Access (NUMA) multiprocessors.

2.1 Uniform Memory Access Multiprocessors:

2.1.1 UMA Bus-Based Architecture:

The simplest multiprocessors are based on a single bus. Two or more CPUs and one or more memory modules all use the same bus for communication. When a CPU wants to read a word from memory, it first checks whether the bus is free; if it is, the CPU puts the address of the word on the bus, asserts a few control signals, and waits until the memory puts the desired word on the bus.

However, there is one problem with this model. With one or two processors the system works fine, but with 32 or 64 processors the performance of the whole system is limited by the bandwidth of the bus. The solution to this problem is to add a cache to each CPU. The cache can be inside the CPU chip, next to the CPU chip, on the processor board, or some combination of all three. Since many reads can now be satisfied out of the local cache, there is much less bus traffic and the system can support more CPUs. In general, caching is done on the basis of 32- or 64-byte blocks: when a word is referenced, the entire block containing it is brought into the cache of the CPU touching it. This design, however, introduces the additional problem of cache coherence. Yet another possibility is for each CPU to have not only a cache but also a local private memory which it accesses over a private bus.



2.1.2 UMA Multiprocessor Using Crossbar Switches:

Even with the best caching, bus-based UMA systems are limited to a maximum of about 16 or 32 CPUs. To go beyond this point we need a different kind of interconnection network. The simplest circuit for connecting n CPUs to k memories is the crossbar switch. One advantage of this design is that no CPU is ever denied a connection because some crosspoint or line is already occupied (assuming the memory module itself is available). The disadvantage is that the number of crosspoints grows as n^2. With 1000 CPUs and 1000 memory modules we need 10^6 (a million) crosspoints, which is clearly not feasible. Nevertheless, for medium-sized systems a crossbar design is workable.

2.1.3 UMA Multiprocessor Using Multistage Switching Networks:

An example of a multistage switching network is the economical omega network. For n CPUs and n memories, the omega network needs log2(n) stages of 2x2 switches, with n/2 switches per stage, for a total of (n/2)log2(n) switches. Unlike the crossbar, the omega network is a blocking network, i.e. conflicts can occur over the use of a wire or a switch, as well as between requests to memory and replies from memory.
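The following small sketch shows how self-routing works in an omega network of 2x2 switches, assuming n = 8 memories (3 stages): at each stage a switch examines one bit of the destination address, from the most significant bit down, and sends the message out of its upper or lower output accordingly. The code is purely illustrative and not tied to any specific machine.

/* Illustrative omega-network routing: with n = 8 memories there are
 * log2(8) = 3 stages. At stage i a switch looks at bit i of the
 * destination address (most significant bit first): 0 means "take the
 * upper output", 1 means "take the lower output". */
#include <stdio.h>

#define STAGES 3                     /* log2(n) for n = 8 */

static void route(unsigned dest)
{
    for (int stage = 0; stage < STAGES; stage++) {
        int bit = (dest >> (STAGES - 1 - stage)) & 1;
        printf("stage %d: use %s output\n", stage, bit ? "lower" : "upper");
    }
}

int main(void)
{
    route(6);                        /* destination memory 110 in binary */
    return 0;
}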

2.2 Non-Uniform Memory Access Multiprocessors:

Single-bus UMA multiprocessors are generally limited to no more than a few dozen CPUs, and crossbar or switched multiprocessors need a lot of expensive hardware and are not that much bigger. To get to more than about 100 CPUs, something has to give, and usually that something is the idea that all memory modules have the same access time. This concession leads to NUMA multiprocessors. Like their UMA cousins they provide a single address space across all the CPUs, but unlike the UMA machines, access to local memory modules is faster than access to remote ones. Thus all UMA programs will run without change on NUMA machines, but the performance will be worse than on a UMA machine at the same clock speed.

NUMA machines have three key characteristics which together distinguish them from other multiprocessors:

1. There is a single address space visible to all CPUs.

2. Access to remote memory is via LOAD and STORE instructions.

3. Access to remote memory is slower than access to local memory.

When the access time to remote memory is not hidden (because there is no caching ) the system is called NC-NUMA. When coherent caches are present, the system is called CC-NUMA (Cache-Coherent NUMA).

The most popular approach for building large CC-NUMA multiprocessors currently is the directory-based multiprocessor. The idea is to maintain a database telling where each cache line is and what its status is. When a cache line is referenced, the database is queried to find out where the line is and whether it is clean or modified. Since the database must be queried on every instruction that references memory, it has to be kept in extremely fast special-purpose hardware that can respond in a fraction of a bus cycle.
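To make the idea of a directory concrete, here is a rough sketch, in C, of what one directory entry per cache line might record, assuming a 256-node machine and 64-byte lines (both numbers are illustrative). In a real machine this table lives in special-purpose hardware, not in software.

/* Rough sketch of a per-cache-line directory entry, assuming a
 * 256-node machine and 64-byte cache lines (numbers are illustrative).
 * Real directories are held in fast special-purpose hardware, not C. */
#include <stdint.h>

enum line_state { UNCACHED, SHARED, MODIFIED };

struct dir_entry {
    enum line_state state;       /* clean/shared or modified somewhere  */
    uint8_t sharers[256 / 8];    /* one presence bit per node           */
    uint16_t owner;              /* owning node when state == MODIFIED  */
};

/* One entry per 64-byte line of this node's local memory; a read or
 * write request first consults this table to find where the line is. */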



2.3 Multiprocessor Operating System Types

2.3.1 Each CPU has its own Operating System:

The simplest possible way to organize a multiprocessor operating system is to statically divide memory into as many partitions as there are CPUs and give each CPU its own private memory and its own private copy of the operating system. In effect, the n CPUs now operate as n independent computers. One obvious optimization is to allow all the CPUs to share the operating system code and make private copies of only the data.

The major disadvantage of this system is that each CPU has its own set of processes that it schedules by itself. There is no sharing of processes. If a user logs into CPU 1, all of his processes run on CPU 1. As a consequence, it can happen that CPU 1 is idle while CPU 2 is loaded with work.

Second, there is no sharing of pages: it can happen that CPU 1 has pages to spare while CPU 2 is paging continuously. Third, each copy of the operating system keeps its own buffer cache of recently used disk blocks; a block may be modified in one cache and not in the others, leading to inconsistent results.

2.3.2 Master-Slave Multiprocessors:

The above model is rarely used anymore. A second approach is the master-slave model. Here, one copy of the operating system and its tables is present on CPU 1 and not on any of the others. All system calls are redirected to CPU 1 for processing there. CPU 1 may also run user processes if there is CPU time left over. This model is called master-slave since CPU 1 is the master and all the others are slaves.


The master-slave model solves most of the problems of the first model. There is a single data structure that keeps track of ready processes. When a CPU goes idle, it asks the operating system for a process to run and is assigned one.

The disadvantage with this model is that the master becomes a bottleneck. After all, it must handle all system calls from all the CPUs.

2.3.3 Symmetric Multiprocessors:

Our third model, the SMP, eliminates this asymmetry. There is one copy of the operating system in memory, but any CPU can run it. When a system call is made, the CPU on which the system call was made traps to the kernel and executes the system call there.


This model balances processes and memory dynamically, since there is only one set of operating system tables. It also eliminates the master CPU bottleneck, since there is no master, but it introduces its own problems. In particular, if two or more CPUs are running operating system code at the same time, disaster can result. Imagine two CPUs simultaneously grabbing the same free memory page. The simplest solution is to associate a mutex (lock) with the operating system, making the whole operating system one big critical region. This has the disadvantage that if one CPU is executing operating system code, all the others must wait, which can be almost as bad as the master bottleneck of the previous model.

The solution is to split the operating system into multiple independent critical regions, each protected by its own mutex. The downside of this approach is that complexity increases very substantially.
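A minimal sketch of this fine-grained locking idea is shown below, using POSIX mutexes; the table names and functions are made up for illustration and do not correspond to any particular kernel. Because each shared table has its own lock, two CPUs can be inside different parts of the operating system at the same time.

/* Hedged sketch of the "many critical regions" idea: instead of one
 * big kernel lock, each operating-system table gets its own mutex.
 * The table and function names are invented for illustration. */
#include <pthread.h>

static pthread_mutex_t process_table_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t free_page_lock     = PTHREAD_MUTEX_INITIALIZER;

void add_ready_process(int pid)
{
    pthread_mutex_lock(&process_table_lock);
    /* ... insert pid into the shared ready list ... */
    pthread_mutex_unlock(&process_table_lock);
}

int allocate_free_page(void)
{
    int page;
    pthread_mutex_lock(&free_page_lock);
    /* ... take one page off the shared free-page list ... */
    page = 0;                        /* placeholder result */
    pthread_mutex_unlock(&free_page_lock);
    return page;
}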


3. Multiple Computers:

3.1 Definition:

Multiple computers, or clusters, are one of the hottest topics in the computing industry today. The downside of multiprocessors is that they are expensive to build, and this is where clusters come into play.

A cluster may be defined as:

A cluster is a group of independent computers that work together to run a common set of applications and provide the image of a single system to the client and application. The computers are physically connected by cables and programmatically connected by cluster software. These connections allow the computers to distribute operations and to use problem-solving features such as failover and load balancing, which is not possible with a stand-alone computer.

When you first heard people speak of Piles of PCs, the first thing that came to mind may have been a cluttered computer room with processors, monitors, and snarls of cables all around. Collections of computers have undoubtedly become more sophisticated than in the early days of shared drives and modem connections. No matter what you call them—Clusters of Workstations (COW), Networks of Workstations (NOW), Workstation Clusters (WCs), Clusters of PCs (CoPs)—clusters of computers are now filling the processing niche once occupied by more powerful stand-alone machines.

A typical large-scale cluster consists of anywhere from around a hundred to a thousand nodes. All the nodes are connected together through an appropriate interconnection technology.

In its simplest form, the computers in your office that are connected to your local area network constitute a workstation cluster. In addition to the hardware, a workstation cluster also includes the middleware that allows the computers to act as a distributed or parallel system and the applications designed to run on it.

While a system based on low-end workstations and network technologies may not at first seem particularly useful, such systems have been the testbeds for a new computing paradigm: high-performance and high-availability cluster computing. This class of system is becoming increasingly commonplace; in fact, most academic institutions and industries that use high-performance computing either already use or are thinking of using workstation clusters to run their most demanding applications. Even companies that can afford traditional supercomputers are becoming interested in commodity clusters.

Why the switch? For some, cluster-based systems provide a way to stretch their computing dollars, allowing the reuse of seemingly obsolete office or classroom systems. Others have found that a cluster of high-performance workstations can easily compete with the best supercomputers IBM or SGI have to offer. A company can download a few tools from a public Web site and order a collection of machines and network equipment to put together an 8-Gflops system for around $50,000. Assembling a powerful supercomputer would cost around $200,000.

Technologies, Components, and Applications

A cluster consists of all the components found on any LAN with PCs or workstations: individual computers with their processors, memory, and disks; network cards; cabling; libraries; operating systems; middleware; tools; and various other utilities. The architecture of clusters, however, can vary rather dramatically. At one end of the spectrum are clusters based on commercial off-the-shelf components or put together from older systems, maybe originally used in offices or classrooms. (For information on one of the first COTS clusters, see section 3.3.) At the other end are proprietary clusters built around high-end SMP processors and custom network technologies. The physical configurations of clusters also vary widely, including anything from a bunch of PCs located in a classroom to motherboards stored in custom racks in a computing services room.

Of course, all these components are there to support applications of one sort or another. Applications appropriate for clusters are not restricted to those traditionally run on supercomputers or other high-performance systems. The number and types of applications now using clusters are increasing all the time. Cluster-based systems support both high-performance parallel applications such as computational chemistry, astrophysics, and computational fluid dynamics, and commercial applications such as a load-balanced high-performance Web server like HotBot, which uses a parallel Oracle database.

In the history of supercomputing, there has always been the notion that super speed requires exotic technologies and monumental cost. The KLAT2 cluster described below shows that if you choose the right commodity technologies, you don't need much money -- not even as much as a "normal" Beowulf would cost.

A popular measure of supercomputer speed is how many billions of floating-point arithmetic operations it can perform per second (GFLOPS) while executing a useful program. Less than a decade ago, supercomputers cost more than $1,000,000 per GFLOPS. Traditional supercomputers still cost around $10,000 for each GFLOPS delivered. Linux PC clusters, "Beowulfs" like LANL's Avalon, bring that cost down to around $3,000 per GFLOPS. The Kentucky Linux Athlon Testbed 2 (KLAT2) system is a Linux PC cluster that brings the cost down to about $650 per GFLOPS.

The way to get super speed is parallel processing: executing multiple portions of a program simultaneously allows the program to complete in less time. The more parallelism you can use, the greater the potential speedup. A modern supercomputer doesn't really execute operations much faster than a high-end PC; it just executes more operations simultaneously. Thus, the key to making a cluster of PCs perform like a supercomputer is to use as much parallelism as possible.
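The sketch below illustrates the basic idea in C with POSIX threads: one array sum is split into portions that execute simultaneously, and the partial results are combined at the end. The array size and thread count are arbitrary; on a cluster the same decomposition would be spread across nodes rather than threads.

/* Minimal illustration: split one loop across several threads so
 * portions of the work execute simultaneously. POSIX system assumed. */
#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4

static double data[N];
static double partial[NTHREADS];

static void *sum_part(void *arg)
{
    long t = (long)arg;
    long lo = t * (N / NTHREADS), hi = lo + N / NTHREADS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += data[i];
    partial[t] = s;                  /* each thread owns one result slot */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    double total = 0.0;

    for (long i = 0; i < N; i++)
        data[i] = 1.0;
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, sum_part, (void *)t);
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += partial[t];
    }
    printf("sum = %f\n", total);
    return 0;
}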

A clustering configuration provides high availability, scalability, and manageability.

High availability. The cluster is designed to avoid a single point of failure. Applications can be distributed over more than one computer, achieving a degree of parallelism and failure recovery, and therefore providing more availability.

High scalability. You can increase the cluster's computing power by adding more processors or computers.

High manageability. The cluster appears as a single system to end users, applications, and the network, while providing a single point of control to administrators. This single point of control can be remote.


3.2 An Architectural View of Clusters:

In a nutshell, network clustering connects otherwise independent computers to work together in some coordinated fashion. Because clustering is a term used broadly, the hardware configuration of clusters varies substantially depending on the networking technologies chosen and the purpose (the so-called "computational mission") of the system. Clustering hardware comes in three basic flavors: so-called "shared disk," "mirrored disk," and "shared nothing" configurations.

Shared Disk

One approach to clustering utilizes central I/O devices accessible to all computers ("nodes") within the cluster. We call these systems shared-disk clusters as the I/O involved is typically disk storage for normal files and/or databases. Shared-disk cluster technologies include Oracle Parallel Server (OPS) and IBM's HACMP.

Shared-disk clusters rely on a common I/O bus for disk access but do not require shared memory. Because all nodes may concurrently write to or cache data from the central disks, a synchronization mechanism must be used to preserve coherence of the system. An independent piece of cluster software called the "distributed lock manager" assumes this role.

Shared-disk clusters support higher levels of system availability: if one node fails, other nodes need not be affected. However, higher availability comes at a cost of somewhat reduced performance in these systems because of overhead in using a lock manager and the potential bottlenecks of shared hardware generally. Shared-disk clusters make up for this shortcoming with relatively good scaling properties: OPS and HACMP support eight-node systems, for example.

Shared Nothing

A second approach to clustering is dubbed shared-nothing because it does not involve concurrent disk accesses from multiple nodes. (In other words, these clusters do not require a distributed lock manager.) Shared-nothing cluster solutions include Microsoft Cluster Server (MSCS).

MSCS is an atypical example of a shared-nothing cluster in several ways. MSCS clusters use a shared SCSI connection between the nodes, which naturally leads some people to believe this is a shared-disk solution. But only one server (the one that owns the quorum resource) needs the disks at any given time, so no concurrent data access occurs. MSCS clusters also typically include only two nodes, whereas shared-nothing clusters in general can scale to hundreds of nodes.

Mirrored Disk

Mirrored-disk cluster solutions include Legato's Vinca. Mirroring involves replicating all application data from primary storage to a secondary backup (perhaps at a remote location) for availability purposes. Replication occurs while the primary system is active, although the mirrored backup system -- as in the case of Vinca -- typically does not perform any work outside of its role as a passive standby. If a failure occurs in the primary system, a failover process transfers control to the secondary system. Failover can take some time, and applications can lose state information when they are reset, but mirroring enables a fairly fast recovery scheme requiring little operator intervention. Mirrored-disk clusters typically include just two nodes.

3.3 Beowulf Supercomputing Architecture – An Example :

There are probably as many Beowulf definitions as there are people who build or use Beowulf supercomputing facilities. Some claim that one can call a system Beowulf only if it is built in the same way as NASA's original machine. Others go to the other extreme and call Beowulf any system of workstations running parallel code. The accepted definition of Beowulf lies somewhere between these two extremes.

Beowulf is a multicomputer architecture which can be used for parallel computations. It is a system which usually consists of one server node and one or more client nodes connected together via Ethernet or some other network. It is a system built using commodity hardware components, like any PC capable of running Linux, standard Ethernet adapters, and switches. It does not contain any custom hardware components and is trivially reproducible. Beowulf also uses commodity software like the Linux operating system, Parallel Virtual Machine (PVM) and Message Passing Interface (MPI).

The server node controls the whole cluster and serves files to the client nodes. It is also the cluster's console and gateway to the outside world. Large Beowulf machines might have more than one server node, and possibly other nodes dedicated to particular tasks, for example consoles or monitoring stations. In most cases client nodes in a Beowulf system are dumb, the dumber the better. Nodes are configured and controlled by the server node, and do only what they are told to do. In a disk-less client configuration, client nodes don't even know their IP address or name until the server tells them what it is.

One of the main differences between Beowulf and a Cluster of Workstations (COW) is the fact that Beowulf behaves more like a single machine rather than many workstations. In most cases client nodes do not have keyboards or monitors, and are accessed only via remote login or possibly serial terminal. Beowulf nodes can be thought of as a CPU + memory package which can be plugged in to the cluster, just like a CPU or memory module can be plugged into a motherboard.

Beowulf is not a special software package, new network topology or the latest kernel hack. Beowulf is a technology of clustering Linux computers to form a parallel, virtual supercomputer. Although there are many software packages such as kernel modifications, PVM and MPI libraries, and configuration tools which make the Beowulf architecture faster, easier to configure, and much more usable, one can build a Beowulf class machine using a standard Linux distribution without any additional software. If you have two networked Linux computers which share at least the /home file system via NFS, and trust each other to execute remote shells (rsh), then it could be argued that you have a simple, two-node Beowulf machine.
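Since MPI is mentioned above as the usual Beowulf programming interface, here is a minimal MPI program of the kind typically launched from the server node onto the client nodes (for example with something like mpirun -np 4 ./hello, though the exact launch command depends on the MPI implementation). It only demonstrates the programming model and is not tied to any particular cluster described here.

/* Minimal MPI program: each node reports its rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which node am I?      */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many nodes total? */
    printf("node %d of %d reporting\n", rank, size);
    MPI_Finalize();
    return 0;
}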

3.3.1 Classification

Beowulf systems have been constructed from a variety of parts. For the sake of performance some non-commodity components (i.e. produced by a single manufacturer) have been employed. In order to account for the different types of systems and to make discussions about machines a bit easier, we propose the following simple classification scheme:

Class I Beowulf:

This class of machine is built entirely from commodity "off-the-shelf" parts. We shall use the "Computer Shopper" certification test to define commodity "off-the-shelf" parts. (Computer Shopper is a 1-inch-thick monthly magazine/catalog of PC systems and components.) The test is as follows:

A Class I Beowulf is a machine that can be assembled from parts found in at least 3 nationally/globally circulated advertising catalogs.

The advantages of a Class I system are:

• hardware is available from multiple sources (low prices, easy maintenance)

• no reliance on a single hardware vendor

• driver support from the Linux community

• usually based on standards (SCSI, Ethernet, etc.)

The disadvantages of a Class I system are:

• best performance may require Class II hardware

Class II Beowulf

A Class II Beowulf is simply any machine that does not pass the Computer Shopper certification test. This is not a bad thing. Indeed, it is merely a Classification of the machine.

The advantages of a Class II system are:

• performance can be quite good!

The disadvantages of a Class II system are:

• driver support may vary

• reliance on a single hardware vendor

• may be more expensive than Class I systems

One Class is not necessarily better than the other. It all depends on your needs and budget. This Classification system is only intended to make discussions about Beowulf systems a bit more succinct.

3.3.2 How does Beowulf system differ from a Cluster Of Workstations ?

A computer laboratory of networked workstations is a perfect example of a Cluster of Workstations (COW). So what is so special about Beowulf, and how is it different from a COW? The truth is that there is not much difference, but Beowulf does have a few unique characteristics.

First of all, in most cases client nodes in a Beowulf cluster do not have keyboards, mice, video cards or monitors. All access to the client nodes is done via remote connections from the server node, a dedicated console node, or a serial console. Because there is no need for client nodes to access machines outside the cluster, nor for machines outside the cluster to access client nodes directly, it is common practice for the client nodes to use private IP addresses like the 10.0.0.0/8 or 192.168.0.0/16 address ranges. Usually the only machine that is also connected to the outside world, using a second network card, is the server node. The most common way of using the system is to access the server's console directly, or to telnet or remotely log in to the server node from a personal workstation. Once on the server node, users can edit and compile their code, and also spawn jobs on all nodes in the cluster.

In most cases COWs are used for parallel computations at night and over weekends, when people do not actually use the workstations for everyday work, thus utilising idle CPU cycles. Beowulf, on the other hand, is a machine usually dedicated to parallel computing and optimised for this purpose. Beowulf also gives a better price/performance ratio as it is built from off-the-shelf components and runs mainly free software. Beowulf also has more single-system-image features, which help users to see the Beowulf cluster as a single computing workstation.

4. Distributed and Grid Computing

4.1 Introduction to Distributed and Grid Computing

When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.

-- Gilder Technology Report, June 2000.

Internet computing and Grid technologies promise to change the way we tackle complex problems. They will enable large-scale aggregation and sharing of computational, data and other resources across institutional boundaries. And harnessing these new technologies effectively will transform scientific disciplines ranging from high-energy physics to the life sciences.


The data challenge

Fifty years of innovation have increased the raw speed of individual computers by a factor of around one million, yet they are still far too slow for many challenging scientific problems. For example, by 2005 the detectors at the Large Hadron Collider at CERN, the European Laboratory for Particle Physics, will be producing several petabytes of data per year -- a million times the storage capacity of the average desktop computer. Performing the most rudimentary analysis of these data will probably require the sustained application of some 20 teraflops (trillions of floating-point operations per second) of computing power. Compare this with the roughly 3 teraflops produced by the fastest contemporary supercomputer, and it is clear that more sophisticated analyses will need orders of magnitude more power.

Cluster and conquer

One solution to the problem of inadequate computer power is to 'cluster' multiple individual computers. This technique, first explored in the early 1980s, is now standard practice in supercomputer centres, research labs and industry. The fastest supercomputers in the world are collections of microprocessors, such as the 8,000-processor ASCI White system at Lawrence Livermore National Laboratory in California. Many research laboratories operate low-cost PC clusters or 'farms' for computing or data analysis. This year's winner of the Gordon Bell Award for price-performance in supercomputing achieved US$0.92 per megaflops on an Intel Pentium III cluster. And good progress has been made on the algorithms needed to exploit thousands of processors effectively.

Although clustering can provide significant improvements in total computing power, a cluster remains a dedicated facility, built at a single location. Financial, political and technical constraints place limits on how large such systems can become. For example, ASCI White cost $110 million and needed an expensive new building. Few individual institutions or research groups can afford this level of investment.

Internet Computing

Rapid improvements in communications technologies are leading many to consider more decentralized approaches to the problem of computing power. There are over 400 million PCs around the world, many as powerful as an early 1990s supercomputer. And most are idle much of the time. Every large institution has hundreds or thousands of such systems. Internet computing seeks to exploit otherwise idle workstations and PCs to create powerful distributed computing systems with global reach and supercomputer capabilities.

The opportunity represented by idle computers has been recognized for some time. In 1985, Miron Livny showed that most workstations are often idle, and proposed a system to harness those idle cycles for useful work. Exploiting the multitasking possible on the popular Unix operating system and the connectivity provided by the Internet, Livny's Condor system is now used extensively within academia to harness idle processors in workgroups or departments. It is used regularly for routine data analysis as well as for solving open problems in mathematics. At the University of Wisconsin, for example, Condor regularly delivers 400 CPU days per day of essentially "free" computing to academics at the university and elsewhere: more than many supercomputer centres.

But although Condor is effective on a small scale, true mass production of Internet computing cycles had to wait a little longer for the arrival of more powerful PCs, the spread of the Internet, and problems (and marketing campaigns) compelling enough to enlist the help of the masses. In 1997, Scott Kurowski established the Entropia network to apply idle computers worldwide to problems of scientific interest. In just two years, this network grew to encompass 30,000 computers with an aggregate speed of over one teraflop per second. Among its several scientific achievements is the identification of the largest known prime number.

The next big breakthrough in Internet computing was David Anderson's SETI@home (http://setiathome.ssl.berkeley.edu/). This enlisted personal computers to work on the problem of analysing data from the Arecibo radio telescope for signatures that might indicate extraterrestrial intelligence. Demonstrating the curious mix of popular appeal and good technology required for effective Internet computing, SETI@home is now running on half a million PCs and delivering 1,000 CPU years per day -- making it the fastest (admittedly special-purpose) computer in the world.


Entropia, Anderson's United Devices, and other entrants have now gone commercial, hoping to turn a profit by selling access to the world's idle computers -- or the software required to exploit idle computers within an enterprise. Significant venture funding has been raised, and new "philanthropic" computing activities have started that are designed to appeal to PC owners. Examples include Parabon's Compute-against-Cancer, which analyses patient responses to chemotherapy, and Entropia's FightAidsAtHome project, which uses the AutoDock package to evaluate prospective targets for drug discovery. Although the business models that underlie this new industry are still being debugged, the potential benefits are enormous: for instance, Intel estimates that its own internal "Internet computing" project has saved it $500 million over the past ten years.

What does this all mean for science and the scientist? A simplistic view is that scientists with problems amenable to Internet computing now have access to a tremendous new computing resource. All they have to do is cast their problem in a form suitable for execution on home computers -- and then persuade the public (or an Internet computing company) that their problem is important enough to justify the expenditure of "free" cycles.

One can define distributed computing in many different ways. Various vendors have created and marketed distributed computing systems for years, and have developed numerous initiatives and architectures to permit distributed processing of data and objects across a network of connected systems.

Distributed computing provides an environment where you can harness idle CPU cycles and storage space of tens, hundreds, or thousands of networked systems to work together on a particularly processing-intensive problem. The growth of such processing models has been limited, however, by a lack of compelling applications and by bandwidth bottlenecks, combined with significant security, management, and standardization challenges. But the last year has seen new interest in the idea as the technology has ridden the coattails of the peer-to-peer craze started by Napster. A number of new vendors have appeared to take advantage of the nascent market, and market giants like Intel, Microsoft, Sun, and Compaq have validated the importance of the concept.

Increasing desktop CPU power and communications bandwidth have also helped to make distributed computing a more practical idea. The number of real applications is still somewhat limited, and the challenges -- particularly standardization -- are still significant. But there's a new energy in the market, as well as some actual paying customers, so it's about time to take a look at where distributed processing fits and how it works.

4.2 Distributed vs Grid Computing

There are actually two similar trends moving in tandem--distributed computing and grid computing. Depending on how you look at the market, the two either overlap, or distributed computing is a subset of grid computing. Grid Computing got its name because it strives for an ideal scenario in which the CPU cycles and storage of millions of systems across a worldwide network function as a flexible, readily accessible pool that could be harnessed by anyone who needs it, similar to the way power companies and their users share the electrical grid.

Sun defines a computational grid as "a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to computational capabilities." Grid computing can encompass desktop PCs, but more often than not its focus is on more powerful workstations, servers, and even mainframes and supercomputers working on problems involving huge datasets that can run for days. Grid computing also leans more toward dedicated systems than toward systems primarily used for other tasks.

Large-scale distributed computing usually refers to a similar concept, but is more geared to pooling the resources of hundreds or thousands of networked end-user PCs, which individually are more limited in their memory and processing power, and whose primary purpose is not distributed computing, but rather serving their user. As we mentioned above, there are various levels and types of distributed computing architectures, and both Grid and distributed computing don't have to be implemented on a massive scale. They can be limited to CPUs among a group of users, a department, several departments inside a corporate firewall, or a few trusted partners across the firewall.

4.3 How Distributed Computing Works

In most cases today, a distributed computing architecture consists of very lightweight software agents installed on a number of client systems, and one or more dedicated distributed computing management servers. There may also be requesting clients with software that allows them to submit jobs along with lists of their required resources.

An agent running on a processing client detects when the system is idle, notifies the management server that the system is available for processing, and usually requests an application package. The client then receives an application package from the server and runs the software when it has spare CPU cycles, and sends the results back to the server. The application may run as a screen saver, or simply in the background, without impacting normal use of the computer. If the user of the client system needs to run his own applications at any time, control is immediately returned, and processing of the distributed application package ends. This must be essentially instantaneous, as any delay in returning control will probably be unacceptable to the user.
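The loop below is a hedged sketch, in C, of the processing-client behaviour just described: detect idleness, ask the management server for an application package, run it, and return the result. The helper functions are hypothetical stubs standing in for a vendor's agent API, which is not specified here.

/* Hedged sketch of a processing-client ("agent") loop. The helpers
 * are hypothetical stubs, not a real vendor API; a real agent would
 * talk to the management server over the network and run real work
 * packages at low priority. */
#include <stdio.h>
#include <unistd.h>

static int system_is_idle(void)          { return 1; }  /* stub: e.g. load-average check */
static int fetch_package(char *b, int n) { (void)b; (void)n; return 0; }  /* stub */
static int run_package(const char *p)    { (void)p; return 42; }          /* stub */
static void send_results(int r)          { printf("result %d sent\n", r); }

int main(void)
{
    char package[256] = "";

    for (int rounds = 0; rounds < 3; rounds++) {   /* a real agent loops forever */
        if (!system_is_idle()) {     /* user active: return control immediately */
            sleep(60);
            continue;
        }
        if (fetch_package(package, sizeof package) < 0) {
            sleep(300);              /* no work available; back off and retry */
            continue;
        }
        send_results(run_package(package));
    }
    return 0;
}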

4.4 Distributed Computing Management Server


The servers have several roles. They take distributed computing requests and divide their large processing tasks into smaller tasks that can run on individual desktop systems (though sometimes this is done by a requesting system). They send application packages and some client management software to the idle client machines that request them. They monitor the status of the jobs being run by the clients. After the client machines run those packages, they assemble the results sent back by the client and structure them for presentation, usually with the help of a database.

If the server doesn't hear from a processing client for a certain period of time, possibly because the user has disconnected his system and gone on a business trip, or simply because he's using his system heavily for long periods, it may send the same application package to another idle system. Alternatively, it may have already sent out the package to several systems at once, assuming that one or more sets of results will be returned quickly. The server is also likely to manage any security, policy, or other management functions as necessary, including handling dialup users whose connections and IP addresses are inconsistent.
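To illustrate the server-side bookkeeping described above, the sketch below keeps a small table of outstanding tasks and puts any task back into the unassigned pool if no result has arrived within a deadline. The structure fields and the two-hour deadline are assumptions made for the example, not details taken from any particular product.

/* Illustrative server-side bookkeeping: each outstanding task records
 * which client has it and when it was handed out; stale tasks are
 * reassigned. The deadline and field names are assumptions. */
#include <time.h>

#define MAX_TASKS 1024
#define DEADLINE  (2 * 60 * 60)      /* seconds before reassignment */

struct task {
    int    id;
    int    done;
    int    client;                   /* -1 when unassigned */
    time_t handed_out;
};

static struct task tasks[MAX_TASKS];

/* Called periodically: put stale tasks back in the unassigned pool. */
void reassign_stale_tasks(void)
{
    time_t now = time(NULL);

    for (int i = 0; i < MAX_TASKS; i++) {
        if (!tasks[i].done && tasks[i].client != -1 &&
            now - tasks[i].handed_out > DEADLINE) {
            tasks[i].client = -1;    /* will go to the next idle client */
        }
    }
}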

Obviously the complexity of a distributed computing architecture increases with the size and type of environment. A larger environment that includes multiple departments, partners, or participants across the Web requires complex resource identification, policy management, authentication, encryption, and secure sandboxing functionality. Resource identification is necessary to define the level of processing power, memory, and storage each system can contribute.

Policy management is used to varying degrees in different types of distributed computing environments. Administrators or others with rights can define which jobs and users get access to which systems, and who gets priority in various situations based on rank, deadlines, and the perceived importance of each project. Obviously, robust authentication, encryption, and sandboxing are necessary to prevent unauthorized access to systems and data within distributed systems that are meant to be inaccessible.

If you take the ideal of a distributed worldwide grid to the extreme, it requires standards and protocols for dynamic discovery and interaction of resources in diverse network environments and among different distributed computing architectures. Most distributed computing solutions also include toolkits, libraries, and API's for porting third party applications to work with their platform, or for creating distributed computing applications from scratch.

4.5 Peer-to-Peer Features


Though distributed computing has recently been subsumed by the peer-to-peer craze, the structure described above is not really one of peer-to-peer communication, as the clients don't necessarily talk to each other. Current vendors of distributed computing solutions include Entropia, Data Synapse, Sun, Parabon, Avaki, and United Devices. Sun's open source GridEngine platform is more geared to larger systems, while the others are focusing on PCs, with Data Synapse somewhere in the middle. In the case of the SETI@home project, Entropia and most other vendors, the structure is a typical hub and spoke with the server at the hub. Data is delivered back to the server by each client as a batch job. In the case of DataSynapse's LiveCluster, however, client PCs can work in parallel with other client PCs and share results with each other in 20ms long bursts. The advantage of LiveCluster's architecture is that applications can be divided into tasks that have mutual dependencies and require interprocess communications, while those running on Entropia cannot. But while Entropia and other platforms can work very well across an Internet of modem connected PCs, DataSynapse's LiveCluster makes more sense on a corporate network or among broadband users across the Net.


4.6 The Poor Man's Supercomputer

The advantages of this type of architecture for the right kinds of applications are impressive. The most obvious is the ability to provide access to supercomputer-level processing power, or better, for a fraction of the cost of a typical supercomputer. SETI@home's Web site FAQ points out that the most powerful computer, IBM's ASCI White, is rated at 12 teraflops and costs $110 million, while SETI@home currently gets about 15 teraflops and has cost about $500K so far. Further savings come from the fact that distributed computing doesn't require all the pricey electrical power, environmental controls, and extra infrastructure that a supercomputer requires. And while supercomputing applications are often written in specialized languages like mpC, distributed applications can be written in C, C++, etc.

The performance improvement over typical enterprise servers for appropriate applications can be phenomenal. In a case study that Intel did of a commercial and retail banking organization running Data Synapse's LiveCluster platform, computation time for a series of complex interest rate swap modeling tasks was reduced from 15 hours on a dedicated cluster of four workstations to 30 minutes on a grid of around 100 desktop computers. Processing 200 trades on a dedicated system took 44 minutes, but only 33 seconds on a grid of 100 PCs. According to the company using the technology, the performance improvement running various simulations allowed them to react much more swiftly to market fluctuations (but we have to wonder if that's a good thing…).

Scalability is also a great advantage of distributed computing. Though they provide massive processing power, super computers are typically not very scalable once they're installed. A distributed computing installation is infinitely scalable--simply add more systems to the environment. In a corporate distributed computing setting, systems might be added within or beyond the corporate firewall.

A byproduct of distributed computing is more efficient use of existing system resources. Estimates by various analysts have indicated that up to 90 percent of the CPU cycles on a company's client systems go unused. Even servers and other systems spread across multiple departments are typically used inefficiently, with some applications starved for server power while elsewhere in the organization server power is grossly underutilized. And server and workstation obsolescence can be staved off considerably longer by allocating certain applications to a grid of client machines or servers. This leads to the inevitable Total Cost of Ownership, Total Benefit of Ownership, and ROI discussions. As another byproduct, instead of throwing away obsolete desktop PCs and servers, an organization can dedicate them to distributed computing tasks.


4.7 Distributed Computing Application Characteristics

Obviously not all applications are suitable for distributed computing. The closer an application gets to running in real time, the less appropriate it is. Even processing tasks that normally take an hour or two may not derive much benefit if the communications among distributed systems and the constantly changing availability of processing clients become a bottleneck. Instead you should think in terms of tasks that take hours, days, weeks, and months. Generally the most appropriate applications, according to Entropia, consist of "loosely coupled, non-sequential tasks in batch processes with a high compute-to-data ratio." The high compute-to-data ratio goes hand in hand with a high compute-to-communications ratio, as you don't want to bog down the network by sending large amounts of data to each client, though in some cases you can do so during off hours. Programs with large databases that can be easily parsed for distribution are very appropriate.

Clearly, any application with individual tasks that need access to huge data sets will be more appropriate for larger systems than individual PCs. If terabytes of data are involved, a supercomputer makes sense as communications can take place across the system's very high speed backplane without bogging down the network. Server and other dedicated system clusters will be more appropriate for other slightly less data intensive applications. For a distributed application using numerous PCs, the required data should fit very comfortably in the PC's memory, with lots of room to spare.

Taking this further, United Devices recommends that the application should have the capability to fully exploit "coarse-grained parallelism," meaning it should be possible to partition the application into independent tasks or processes that can be computed concurrently. For most solutions there should not be any need for communication between the tasks except at task boundaries, though Data Synapse allows some interprocess communication. The tasks and small blocks of data should be such that they can be processed effectively on a modern PC and report results that, when combined with other PCs' results, produce coherent output. And the individual tasks should be small enough to produce a result on these systems within a few hours to a few days.
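As a toy illustration of such coarse-grained, high compute-to-data tasks, the sketch below estimates pi by Monte Carlo trials: each task runs a large number of independent random samples and returns only a single count. Here the tasks run serially in one program for simplicity; on a grid, each call would be shipped to a different idle PC. The trial counts and task count are arbitrary.

/* Coarse-grained parallelism in miniature: independent Monte Carlo
 * tasks whose tiny results are combined at the end. */
#include <stdio.h>
#include <stdlib.h>

/* One independent task: run `trials` random samples, return the hits.
 * The seed makes each task reproducible and independent of the rest. */
static long run_task(unsigned seed, long trials)
{
    long hits = 0;
    srand(seed);
    for (long i = 0; i < trials; i++) {
        double x = rand() / (double)RAND_MAX;
        double y = rand() / (double)RAND_MAX;
        if (x * x + y * y <= 1.0)
            hits++;
    }
    return hits;
}

int main(void)
{
    const long trials_per_task = 1000000;
    const int  ntasks = 10;          /* in reality, one task per idle PC */
    long total_hits = 0;

    for (int t = 0; t < ntasks; t++) /* serial here; a grid would farm these out */
        total_hits += run_task(t + 1, trials_per_task);

    printf("pi ~= %f\n", 4.0 * total_hits / (ntasks * trials_per_task));
    return 0;
}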

4.8 Types of Distributed Computing Applications

Beyond the very popular poster child SETI@Home application, the following scenarios are examples of other types of application tasks that can be set up to take advantage of distributed computing.

  • A query search against a huge database that can be split across lots of desktops, with the submitted query running concurrently against each fragment on each desktop.
  • Complex modeling and simulation techniques that increase the accuracy of results by increasing the number of random trials would also be appropriate, as trials could be run concurrently on many desktops, and combined to achieve greater statistical significance (this is a common method used in various types of financial risk analysis).
  • Exhaustive search techniques that require searching through a huge number of results to find solutions to a problem also make sense. Drug screening is a prime example.
  • Many of today's vendors, particularly Entropia and United Devices, are aiming squarely at the life sciences market, which has a sudden need for massive computing power. As a result of sequencing the human genome, the number of identifiable biological targets for today's drugs is expected to increase from about 500 to about 10,000. Pharmaceutical firms have repositories of millions of different molecules and compounds, some of which may have characteristics that make them appropriate for inhibiting newly found proteins. The process of matching all these "ligands" to their appropriate targets is an ideal task for distributed computing, and the quicker it's done, the quicker and greater the benefits will be. Another related application is the recent trend of generating new types of drugs solely on computers.
  • Complex financial modeling, weather forecasting, and geophysical exploration are on the radar screens of these vendors, as well as car crash and other complex simulations.

To enhance their public relations efforts and demonstrate the effectiveness of their platforms, most of the distributed computing vendors have set up philanthropic computing projects that recruit CPU cycles across the Internet. Parabon's Compute-Against-Cancer harnesses an army of systems to track patient responses to chemotherapy, while Entropia's FightAidsAtHome project evaluates prospective targets for drug discovery. And of course, the SETI@home project has attracted millions of PCs to work on analyzing data from the Arecibo radio telescope for signatures that indicate extraterrestrial intelligence. There are also higher-end grid projects, including those run by the US National Science Foundation and NASA, as well as the European Data Grid, the Particle Physics Data Grid, the Network for Earthquake Simulation Grid, and the Grid Physics Network, all of which plan to aid their research communities. And IBM has announced that it will help to create a life sciences grid in North Carolina to be used for genomic research.

4.9 Security and Standards Challenges

The major challenges come with increasing scale. As soon as you move outside of a corporate firewall, security and standardization challenges become quite significant. Most of today's vendors currently specialize in applications that stop at the corporate firewall, though Avaki, in particular, is staking out the global grid territory. Beyond spanning firewalls with a single platform, lies the challenge of spanning multiple firewalls and platforms, which means standards.

Most of the current platforms offer strong encryption such as Triple DES. The application packages that are sent to PCs are digitally signed, to make sure a rogue application does not infiltrate a system. Avaki, for example, comes with its own PKI (public key infrastructure). Identical application packages are typically sent to multiple PCs and the results of each are compared; any result that differs from the rest is flagged as suspect. Even with encryption, data can still be snooped while the process is running in the client's memory, so most platforms create application data chunks that are so small that snooping them is unlikely to provide useful information. Avaki claims that it integrates easily with different existing security infrastructures and can facilitate communications among them, but this is obviously a challenge for global distributed computing.
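A small sketch of the redundancy check mentioned above: the same work unit goes to several clients, and any returned value that is not in the majority is flagged. Result values are simplified to single integers here; real platforms compare whole result files or digests.

/* Redundancy check: flag any result copy that disagrees with the
 * majority of copies returned for the same work unit. */
#include <stdio.h>

/* Returns the index of a suspect result, or -1 if every copy agrees
 * with the majority value. */
static int find_suspect(const long results[], int n)
{
    for (int i = 0; i < n; i++) {
        int agree = 0;
        for (int j = 0; j < n; j++)
            if (results[j] == results[i])
                agree++;
        if (agree <= n / 2)          /* not in the majority: suspicious */
            return i;
    }
    return -1;
}

int main(void)
{
    long copies[3] = { 12345, 12345, 99999 };   /* third client disagrees */
    printf("suspect index: %d\n", find_suspect(copies, 3));
    return 0;
}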

Working out standards for communications among platforms is part of the typical chaos that occurs early in any relatively new technology. In the generalized peer-to-peer realm lies the Peer-to-Peer Working Group, started by Intel, which is looking to devise standards for communications among many different types of peer-to-peer platforms, including those that are used for edge services and collaboration.

The Global Grid Forum is a collection of about 200 companies looking to devise grid computing standards. Then you have vendor-specific efforts such as Sun's open source JXTA platform, which provides a collection of protocols and services that allows peers to advertise themselves to, and communicate with, each other securely. JXTA has a lot in common with JINI, but is not Java specific (though the first version is Java based).

Intel recently released its own peer-to-peer middleware, the Intel Peer-to-Peer Accelerator Kit for Microsoft .Net, also designed for discovery, and based on the Microsoft.Net platform.

For grid projects there is the Globus Toolkit (www.globus.org), developed by a group at the Argonne National Laboratory and another team at the University of Southern California's Information Sciences Institute. It bills itself as an open source infrastructure that provides most of the security, resource discovery, data access, and management services needed to construct grid applications. A large number of today's grid applications are based on the Globus Toolkit. IBM is offering a version of the Globus Toolkit for its servers running Linux and AIX. And Entropia has announced that it intends to integrate its platform with Globus, as an early step towards communications among platforms and applications.

For today, however, the specific promise of distributed computing lies mostly in harnessing the system resources that lie within the firewall. It will take years before the systems on the Net will be sharing compute resources as effortlessly as they can share information.

Finally, we realize that distributed computing technologies and research cover a vast amount of territory, and we have only scratched the surface here.

Conclusion:

High Performance Computing is one of the hottest fields today and is likely to remain so for many years to come, until quantum and nano computers become a reality.

Both clusters and SMPs provide a configuration with multiple processors to support high-demand applications, but SMPs have been around far longer. The main strength of the SMP approach is that an SMP is easier to manage and configure than a cluster. The SMP is closer to the single-processor model for which nearly all applications are written.

On the other hand, clusters are far superior to SMPs in terms of incremental and absolute scalability. A cluster can be configured from off-the-shelf components and hence has found much favour among research projects in academia. For the same reason clusters also prove to be cost effective and are already beginning to replace SMPs for supercomputing applications, except in niche research areas. Clusters will enable even medium-sized organizations to have their own home-made supercomputers.


Although Distributed and Grid computing are both new technologies, they have already proven themselves useful and their future looks promising. As technologies, networks and business models mature, people expect that it will become commonplace for small and large communities of scientists to create "Science Grids" linking their various resources to support human communication, data access and computation. One also expects to see a variety of contracting arrangements between scientists and Internet computing companies providing low-cost, high-capacity cycles. The result will be integrated Grids in which problems of different types can be routed to the most appropriate resource: dedicated supercomputers for specialized problems that require tightly coupled processors and idle workstations for more latency tolerant, data analysis problems.



