Radiation Tolerant Computer Design

Designing for the radiation environment in space doesn't always mean using rad-hard parts.

By Juergen Fedrich, Director of System Engineering, SBS, Augsburg, Germany

Space is no longer the sole domain of well-funded governments. More and more commercial applications, such as telecom satellites, are finding homes above the earth. But with the increasing commercialization of space, there is a migration away from the use of traditional rad-hard components. Design teams are asking, "How rad-hard do components need to be for the system to work?"

In general, this has to be defined project by project, depending on the mission, environment, budget and criticality. In mission-critical systems, such as satellite guidance computers, true rad-hard components need to be used. Other systems may be able to use radiation-enhanced, radiation-tolerant, or even commercial parts if the system and board-level design take radiation into account. In all cases, designing with radiation in mind is essential.

Designing for radiation need not be limited to space-bound systems, either. Even on Earth, where most cosmic radiation is absorbed by the atmosphere, radiation tolerance can be very important. For example, in certain military applications, in nuclear power plants, or on aircraft, equipment has to function even under the highest radiation levels. Further, the semiconductor industry's move to decrease device structure sizes, reduce power requirements, and increase speed has led to increased radiation sensitivity for all applications. Thus, a radiation-induced device failure could become a major problem, even at sea level.

Radiation comes in many forms. The material forms include high-energy neutrons, protons, and heavy ions, along with alpha (helium nucleus) and beta (electron or positron) particles, all of which can cause ionization currents in the materials they strike or damage them directly. (See sidebar, "How Radiation Affects Silicon.") There is also electromagnetic radiation (EMR) along a wide spectrum (see Figure 1), which causes damage by releasing energy during absorption. Low-frequency EMR effects are well known to most designers as heat (infrared) and EMI (RF). High-frequency radiation (x-rays and gamma rays) produces ionization effects similar to those of some material forms of radiation.

The need to account for radiation effects, combined with budget restrictions, can tax the ingenuity of design teams. System-level tolerance can be built in using redundancy and fault-tolerant system design along with shielding. At the board level, however, designers have only the following options:

  1. Mitigation at chip level, by using components that meet the radiation requirement
  2. Mitigation at board level, by using design techniques to achieve a radiation-tolerant board
  3. A mixture of options 1 and 2

Option 3 is the most common choice.

Mitigation at the Chip Level
Designers can add radiation tolerance to their boards by choosing to use rad-hard components. Rad-hard qualification does not guarantee that a device is immune to radiation, however. It means that the device has endpoint electrical parameters, tested and certified by the manufacturer, that take radiation into account. Thus, the device specifications that the board designer works with already account for the effects of radiation, building radiation tolerance into the board design automatically.

Rad-hard components are built to withstand radiation up to a specified level. For the die itself, radiation hardness is achieved by using special materials and processes. Silicon on insulator (SOI), for instance, is a process with increased radiation tolerance. At the component level, it is possible to increase hardness by using special shielding techniques for the case. For example, using several layers of different materials significantly improves the shielding provided by the component case.

The disadvantage of rad-hard devices is their high price. Furthermore, not all devices are available in rad-hard versions, especially newer commercial devices, which makes a fully rad-hard design difficult to achieve.

When rad-hard chips are not available, designers may be able to use radiation-enhanced components, which have radiation tolerance but are not as fully qualified for radiation as rad-hard components. These devices are significantly cheaper than rad-hard ones. But as with rad-hard devices, a radiation-enhanced version is not available for each and every commercial component.

Another alternative is to choose components that have inherent radiation tolerance, such as the G3 PowerPC. These components are the most cost-effective solution. The disadvantage of this approach is that, up to now, only a few chip manufacturers have investigated and identified the radiation tolerance of their parts. Instead, the responsibility to identify and test these parts lies solely with the system designer. Thus, finding components that have inherent radiation tolerance can be a major design activity for the system manufacturer.

Mitigation at the Board Level
When no better devices are available, or when the subsystem is considered non-critical, designers can use commercial components in their design. The process technologies best suited to radiation environments are epitaxial CMOS and SOI (silicon on insulator). Beyond that, designs need to take into account the specific effects radiation has on components. The major effects to be considered when using commercial components are Total Ionizing Dose (TID), Single Event Upset (SEU) and Single Event Latchup (SEL). (See sidebar, "How Radiation Affects Silicon," for more details.)

Total Ionizing Dose causes degradation of a device's electrical parameters, such as propagation delays and threshold voltages. Commercial parts typically tolerate a TID of between 3 and 30 krad(Si). Using commercial devices under high TID for long periods is impractical, as the devices degrade progressively as they absorb more radiation. In this case, only shielding (at the component level or at the system/satellite level) seems feasible.

Single Event Upset is a radiation-induced charge pulse that can occur anywhere within the component's circuitry. An SEU in digital logic may result in a signal glitch, but it can have a greater impact on memory cells (e.g. DRAMs, CCDs) or registers. In such devices the SEU can cause the content of one cell to be flipped. To overcome these random changes, designs can apply error detection and correction (EDAC) to memory blocks. Because it corrects such single-bit errors, ECC or EDAC makes memory insensitive to single-bit SEUs. Register-based devices such as FPGAs can be programmed so that all flip-flops are triple redundant and equipped with a majority voting system. With this approach, if a single register flips, the vote still produces the correct output. The same approach applies to discrete register logic elements.

Single Event Latchup (SEL) is a more serious effect. In this case, the ionization caused by the radiation triggers parasitic transistors, causing an internal short circuit that is self-perpetuating. On commercial devices a latchup draws high current, resulting in high internal temperatures. In the worst-case scenario, the device may melt down. For devices that are not critical to overall system functionality, a latchup can be cured by power-cycling the device's supply. Switching off the device's power clears the latchup. When the device is switched back on, it continues with nominal performance. Board designers can take advantage of this effect by monitoring the supply current (ICC) of a device to detect a latchup, then shutting power down for a set period of time. Generating an interrupt during this event informs the supervising system that the affected hardware has seen a latch-up condition, so that corresponding steps (e.g. a system boot) can take place.

Design Example
An example of all these techniques can be found on the International Space Station (ISS), where an open architecture based on the VMEbus is being used. Construction of the ISS began in 1998, when a Russian Proton rocket lifted the Zarya module into orbit. Since then, two other modules, the service module Zvezda and the Destiny Laboratory Module, have been assembled, and initial crews have started to work on board the ISS. It should be in its final operational configuration by 2006.

With such an extended duration in space while the station is being completed, reliability in board designs is critical. All hardware has to meet a 10-year in-orbit life expectancy and require minimal repair and maintenance. In addition, selected hardware has to withstand natural cosmic radiation. These factors, combined with budget constraints, made it necessary to find commercially based solutions for designing and producing space hardware.

SBS' Government Group, which specializes in designing and producing hardware for harsh environments, has supplied a number of boards and systems for use on the ISS. One of the radiation-tolerant boards it produced is an Ethernet/High Speed Serial I/F (HLCU), shown in Figure 2. The major functional components are dual Ethernet controllers with 10BaseT interfaces, a 1 Mbyte, 4-port SRAM, 10 Mbit/sec serial links, and control FPGAs. The board was designed with dual control interfaces for redundancy. The board needed to meet the ISS radiation requirements:
Total Ionizing Dose: below 1.4 krad(Si)
SEU Threshold Linear Energy Transfer (LET) for Heavy Ions: above 36 MeV·cm^2/mg
SEU Threshold for Heavy Ions: above 10 MeV·cm^2/mg
SEL Threshold: above 110 MeV·cm^2/mg
The first step in meeting the radiation requirements was to analyze all of the board's parts for availability of radiation data. For parts without known data, SBS identified functional replacement parts that had known data. The second step was to analyze all parts for radiation sensitivity. Simpler parts, such as buffers and drivers, met the radiation requirements. For all parts showing sensitivity at the required levels, SBS sought replacement parts with better data. Because of market availability, it was not possible to eliminate or replace all sensitive devices. The remaining critical parts were handled on a case-by-case basis.

Radiation data showed that the Ethernet controller (the SONIC device) was sensitive to latch-up effects. As no better parts were available on the market, SBS designed a delatcher into the board. This delatcher monitors the current consumption of the SONIC device. Whenever the SONIC device current exceeds a programmed threshold, the delatcher turns off the SONIC device and generates an interrupt to inform the software application of the latch-up condition. After a defined time the delatcher reapplies supply power to the SONIC and the board can continue with nominal operation.

The radiation data also showed SEU sensitivity in the memory and FPGA devices. To protect the memory, SBS chose and designed in an EDAC (error detection and correction) device that detects two-bit errors and corrects single-bit errors. For the FPGA, which was programmed to be a state machine, it was possible to overcome SEUs by using a "majority-vote" design. Instead of one register cell, this design uses three register cells in parallel, coupled by voting circuitry. Whenever the content of one cell is flipped (e.g. by an SEU), the two remaining cells out-vote the changed one, and thus the system becomes SEU resistant.

While this example is for a space application, the same approach can be used for any design that might be subject to radiation effects.
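The three board-level mitigations in this example (the delatcher, the EDAC device, and the majority-vote registers) can be sketched in software. This is only an illustration in Python, not the SBS design, which implements these functions in hardware. All names (`Delatcher`, `majority_vote`, `secded_encode`, `secded_decode`), the 4-bit data width, the choice of an extended Hamming (8,4) code, and the threshold and timing values are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Delatcher:
    """Hypothetical model: cut power on over-current, restore it after a fixed delay."""
    threshold_ma: float          # assumed programmable trip threshold
    off_time_ms: int             # assumed power-off duration
    powered: bool = True
    off_remaining_ms: int = 0
    interrupts: int = 0          # count of latch-up interrupts raised to software

    def tick(self, current_ma: float, dt_ms: int = 1) -> None:
        if self.powered:
            if current_ma > self.threshold_ma:
                self.powered = False             # remove supply to clear the latch-up
                self.off_remaining_ms = self.off_time_ms
                self.interrupts += 1             # inform the supervising software
        else:
            self.off_remaining_ms -= dt_ms
            if self.off_remaining_ms <= 0:
                self.powered = True              # reapply supply, resume operation


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 vote over three redundant copies of a register."""
    return (a & b) | (a & c) | (b & c)


def secded_encode(nibble: int) -> int:
    """Encode 4 data bits as an 8-bit extended Hamming code word."""
    data = [(nibble >> i) & 1 for i in range(4)]
    bits = [0] * 8               # bits[1..7]: Hamming(7,4); bits[0]: overall parity
    bits[3], bits[5], bits[6], bits[7] = data
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    bits[0] = sum(bits[1:]) & 1
    return sum(bit << pos for pos, bit in enumerate(bits))


def secded_decode(word: int):
    """Return (nibble, status); status is 'ok', 'corrected' or 'double_error'."""
    bits = [(word >> pos) & 1 for pos in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos      # XOR of set positions is 0 for a valid code word
    parity_ok = (sum(bits) & 1) == 0
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok:
        bits[syndrome] ^= 1      # single flip; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return None, "double_error"   # two flips: detectable but not correctable
    nibble = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return nibble, status
```

For instance, `secded_decode(secded_encode(9) ^ 0b100)` recovers the value 9 with status `'corrected'`, while flipping any two bits of a code word yields `'double_error'`. Three or more flips in one word can be miscorrected, which is one reason such schemes are usually sized against the expected upset rate.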
Designs subject to radiation effects may be more common than people think. With the industry's move to reduce the size of device structures, radiation is becoming an important element to consider for sea-level applications. By using the right technology, qualified parts and proper design, it is possible to create the right solution for radiation-sensitive applications.

How Radiation Affects Silicon

The radiation environment consists of five elements: neutron radiation, total ionizing dose (TID), single event phenomena (SEP), transient radiation effects (TRE), and electromagnetic pulse (EMP). Table 1 provides some metrics associated with these and other radiation phenomena.

Neutron-radiation-induced effects significantly change the electrical characteristics of bipolar devices. Because CMOS technology is based on majority carriers, these devices are immune to neutron radiation up to very high levels (10^14 neutrons/cm^2). Therefore, while bipolar technology is in general more sensitive to neutron irradiation, CMOS devices are more resistant (see sidebar, "Neutron Damage").

The total ionizing dose environment is a composite of gamma rays, x-rays and other ionizing radiation. The total amount of energy absorbed depends on the absorbing material and is expressed for silicon or silicon dioxide as rad (Si) or rad (SiO2), respectively. Historically, bipolar circuits have been very tolerant of total ionizing dose, and it was believed that bipolar devices tolerate TID better than CMOS. Newer bipolar technology, using modern techniques like "recessed oxides", smaller feature sizes, or increased packaging density, has higher radiation sensitivity than previously assumed. Recently, it has been found that linear bipolar devices show total dose sensitivity at low dose rates. This phenomenon is called enhanced low dose rate (ELDR) and is observed at dose rates as low as 1 rad (Si). The affected bipolar devices show degradation 2 to 10 times worse than at high dose rates. CMOS products are affected by threshold voltage shifts and radiation-induced leakage current. Total dose can degrade circuit parameters to the point where the circuit's operation is seriously impaired. For example, the following parameters can be affected: propagation time, ICC, IOZ, VIL, VIH, and VOL.

Transient radiation effects (or high-dose-rate effects) are associated with nuclear explosions. This is a major concern, especially in military applications involving tactical equipment. The transient radiation is characterized by a narrow pulse width (about 3 ns to 10 µs) and contains a total dose of about 100 rad (Si) or more. Such a dose rate quickly induces excess charge in the chip, which leads to latch-up, upset (soft error), junction burnout, short transient pulses on the outputs, and saturated outputs. The worst condition for transient latch-up is a short radiation pulse. The worst case for transient upset is a wide radiation pulse at low temperature and low voltage. Thin EPI (thin epitaxial) CMOS devices are inherently immune to latch-up from transient radiation.

Finally, single event phenomena are caused by cosmic radiation and trapped protons. This effect also depends on technology feature size (smaller sizes increase sensitivity) and device speed, as well as on the Earth's geomagnetic field, which acts as an energy filter. For example, it is known that SEP is minimal in low-altitude orbits with inclination angles lower than 45°. SEP is generated by the following particles: heavy ions, low-energy alpha particles and high-energy protons (>10 MeV). SEP includes effects such as transients, soft errors (single event upset (SEU) and multiple-bit SEU), single event latch-up (SEL), power MOSFET burnout (BO) and single event gate rupture (SEGR).

Table 1: How various radiation types affect silicon semiconductor devices.
Effect / Induced Failure | Caused by | Comment
Change of electrical characteristics | Neutron irradiation | Affects mainly bipolar devices; CMOS devices are resistant up to 10^14 neutrons/cm^2
TID (Total Ionizing Dose): shift of threshold voltage; increased leakage current; increased propagation time | Composite of gamma rays, x-rays and other ionizing radiation | Devices using NAND gates are more tolerant than those with NOR gates
SEU (Single Event Upset): flip of memory cell content | Heavy ions, high-energy protons, low-energy alpha particles | Impacts SRAMs, DRAMs, CCDs, registers
SEL (Single Event Latch-up): increase of supply current | Heavy ions, high-energy protons, low-energy alpha particles | May melt down the device
SEBO (Single Event Burn-out): junction burn-out of power MOSFETs | Nuclear explosions, heavy ions, high-energy protons, low-energy alpha particles | Catastrophic failure on power MOSFET
SEGR (Single Event Gate Rupture): gate rupture of power MOSFETs | Nuclear explosions, heavy ions, high-energy protons, low-energy alpha particles | Catastrophic failure on power MOSFET
TRE (Transient Radiation Effects): latch-up; single event upset; junction burn-out; short transients on outputs; saturated outputs | Nuclear explosion | Of major concern for tactical equipment
Electro-Magnetic Pulse | Nuclear explosions | Affects all electronic systems; to be handled on system level only

Neutron Damage

Neutrons are relatively heavy (1840 times heavier than electrons), uncharged particles. Instead of merely ionizing atoms or molecules, they collide with the lattice atoms of the semiconductor, displacing them from their lattice sites so that they take up interstitial positions within the crystal. This disrupts or distorts the local lattice structure. A single fast neutron striking a silicon atom can cause a large number of displacements in a relatively localized region. The site left by the displaced atom in the lattice is called a vacancy. The displaced atom is called an interstitial; the interstitial-vacancy pair is called a Frenkel defect. Highly energetic incident neutrons can impart enough energy to the displaced atom for it in turn to displace other atoms in the lattice, and this can proceed as a cascade to form additional defects within the lattice structure.

While some of the displaced atoms slip back into isolated vacancies to reconstitute the local lattice structure, others combine with dopant or impurity atoms to produce stable defects. Stable defects are electrically inactive and do not act as recombination or trapping centers. Mobile vacancies combining with impurity atoms, donor atoms, or other vacancies produce stable defects. These defects are effective recombination or trapping centers, and produce resistivity changes.

Through displacement interactions, neutron bombardment changes the electrical properties of bulk silicon. All of these changes affect the operation of the semiconductor device:

  - Minority-carrier recombination lifetime decreases
  - Majority-carrier concentration decreases
  - Carrier mobility decreases

Neutrons are also capable of causing ionization in the lattice, even though their major interaction is damage by displacement. Since neutrons are uncharged, they cannot interact electrically with charged particles to ionize them. Neutrons produce ionization through secondary processes, such as neutron collisions that produce recoil atoms or ions. There are two types: neutron collisions that excite atomic nuclei, which de-excite by emitting ionizing gamma rays (Bremsstrahlung Radiation); and neutron collisions in which the neutron is absorbed by the atomic nucleus, which in turn emits an ionized charged particle.

Ionization and displacement damage alter the electrical properties of minority-carrier bipolar transistors at neutron fluences between 10^10 and 10^12 neutrons/cm^2. Altered performance characteristics include transistor gain degradation, leakage current increase, and the generation of photocurrents. The first two effects are due to displacement damage; the last is produced by ionization. Majority-carrier CMOS transistors are not altered until higher fluences (that is, greater than 10^14 neutrons/cm^2), primarily through changes in resistivity, carrier removal, mobility, and diffusion parameters.