AMD Unveils 12-Core Opterons with HP, Dell, Acer. Chip maker AMD officially launches its new 12-core Opteron processor with help from Hewlett-Packard, Dell and even Acer, which is now looking to expand into the server market. AMD's new Opteron chip arrives as Intel is expected to release its "Nehalem EX" Xeon chip later this week. Both processors target the data center. These parts are typically multiple chips packaged with special signals that allow them to act as a single multicore chip. Possibly better memory bandwidth; how is cache coherence maintained? 3/30/2010 web page
single chip: hardwaresecrets.com
facebook.txt
Supercomputer Sets Record, by John Markoff, published June 9, 2008. (In 2009 Jaguar ran at 1.7 petaflops, 1.7*10^15 floating-point operations per second.)

SAN FRANCISCO: An American military supercomputer, assembled from components originally designed for video game machines, has reached a long-sought-after computing milestone by processing more than 1.026 quadrillion calculations per second. The new machine is more than twice as fast as the previous fastest supercomputer, the IBM BlueGene/L, which is based at Lawrence Livermore National Laboratory in California. The new $133 million supercomputer, called Roadrunner in a reference to the state bird of New Mexico, was devised and built by engineers and scientists at IBM and Los Alamos National Laboratory, based in Los Alamos, New Mexico. It will be used principally to solve classified military problems to ensure that the nation's stockpile of nuclear weapons will continue to work correctly as they age. The Roadrunner will simulate the behavior of the weapons in the first fraction of a second during an explosion. Before it is placed in a classified environment, it will also be used to explore scientific problems like climate change. The greater speed of the Roadrunner will make it possible for scientists to test global climate models with higher accuracy.

To put the performance of the machine in perspective, Thomas D'Agostino, the administrator of the National Nuclear Security Administration, said that if all six billion people on earth used hand calculators and performed calculations 24 hours a day, seven days a week, it would take them 46 years to do what the Roadrunner can do in one day. The machine is an unusual blend of chips used in consumer products and advanced parallel computing technologies. The lessons that computer scientists learn by making it calculate even faster are seen as essential to the future of both personal and mobile consumer computing. The high-performance computing goal, known as a petaflop, one thousand trillion calculations per second, has long been viewed as a crucial milestone by military, technical and scientific organizations in the United States, as well as a growing group including Japan, China and the European Union. All view supercomputing technology as a symbol of national economic competitiveness.

The Roadrunner is based on a radical design that includes 12,960 chips that are an improved version of an IBM Cell microprocessor, a parallel processing chip originally created for Sony's PlayStation 3 video game machine. The Sony chips are used as accelerators, or turbochargers, for portions of calculations. The Roadrunner also includes a smaller number of more conventional Opteron processors, made by Advanced Micro Devices, which are already widely used in corporate servers. In addition, the Roadrunner will operate exclusively on the Fedora Linux operating system from Red Hat.
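As a quick sanity check on the hand-calculator comparison above, the short sketch below works out one day of Roadrunner output and the per-person calculation rate that the 46-year figure would imply (the per-person rate is derived here, not stated in the article).

```c
/* Back-of-the-envelope check of the hand-calculator comparison above.
   Uses the article's figures (1.026 petaflops, six billion people, 46 years);
   the per-person rate printed at the end is implied by those numbers,
   not a figure from the article. */
#include <stdio.h>

int main(void)
{
    double flops           = 1.026e15;   /* Roadrunner, calculations per second */
    double seconds_per_day = 86400.0;
    double ops_per_day     = flops * seconds_per_day;

    double people          = 6.0e9;
    double years           = 46.0;
    double person_seconds  = people * years * 365.25 * seconds_per_day;

    printf("Roadrunner, one day:      %.3e calculations\n", ops_per_day);
    printf("Implied rate per person:  %.1f calculations/second\n",
           ops_per_day / person_seconds);
    return 0;
}
```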
From technews@hq.acm.org Mon Mar 31 16:06:01 2008 NASA Builds World's Largest Display Government Computer News (03/27/08) Jackson, Joab NASA's Ames Research Center is expanding the first Hyperwall, the world's largest high-resolution display, to a display made of 128 LCD monitors arranged in an 8-by-16 matrix, which will be capable of generating 245 million pixels. Hyperwall-II will be the largest display for unclassified material. Ames will use Hyperwall-II to visualize enormous amounts of data generated from satellites and simulations from Columbia, its 10,240-processor supercomputer. "It can look at it while you are doing your calculations," says Rupak Biswas, chief of advanced supercomputing at Ames, speaking at the High Performance Computer and Communications Conference. One gigantic image can be displayed on Hyperwall-II, or more than one on multiple screens. The display will be powered by a 128-node computational cluster that is capable of 74 trillion floating-point operations per second. Hyperwall-II will also make use of 1,024 Opteron processors from Advanced Micro Devices, and have 128 graphical display units and 450 terabytes of storage.
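A few of the Hyperwall-II numbers above fit together with simple division; the sketch below derives pixels per monitor, Opterons per node, and floating-point rate per node (derived figures only, not NASA specifications).

```c
/* Quick arithmetic on the Hyperwall-II figures quoted above. The inputs
   come from the article; the per-monitor and per-node values are just
   divisions, not published specifications. */
#include <stdio.h>

int main(void)
{
    double total_pixels = 245e6;   /* pixels across the whole wall */
    int    monitors     = 128;     /* 8 x 16 LCD matrix            */
    double flops        = 74e12;   /* 74 teraflops for the cluster */
    int    nodes        = 128;
    int    opterons     = 1024;

    printf("Pixels per monitor: %.2f million\n", total_pixels / monitors / 1e6);
    printf("Opterons per node:  %d\n", opterons / nodes);
    printf("Flops per node:     %.0f gigaflops\n", flops / nodes / 1e9);
    return 0;
}
```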
Yet to be seen: Will Intel shed its other dinosaur, the North Bridge and South Bridge concept, in order to achieve integrated IO?
How long will it be before your computer is really and truly outdated? Who will be the first to have their own supercomputer? "IBM researchers reached a significant milestone in the quest to send information between the "brains" on a chip using pulses of light through silicon instead of electrical signals on copper wires. The breakthrough -- a significant advancement in the field of "Silicon Nanophotonics" -- uses pulses of light rather than electrical wires to transmit information between different processors on a single chip, significantly reducing cost, energy and heat while increasing communications bandwidth between the cores more than a hundred times over wired chips. The new technology aims to enable a power-efficient method to connect hundreds or thousands of cores together on a tiny chip by eliminating the wires required to connect them. Using light instead of wires to send information between the cores can be as much as 100 times faster and use 10 times less power than wires, potentially allowing hundreds of cores to be connected together on a single chip, transforming today's large supercomputers into tomorrow's tiny chips while consuming significantly less power.

IBM's optical modulator performs the function of converting a digital electrical signal carried on a wire into a series of light pulses carried on a silicon nanophotonic waveguide. First, an input laser beam (marked by red color) is delivered to the optical modulator. The optical modulator (black box with IBM logo) is basically a very fast "shutter" which controls whether the input laser is blocked or transmitted to the output waveguide. When a digital electrical pulse (a "1" bit, marked by yellow) arrives from the left at the modulator, a short pulse of light is allowed to pass through at the optical output on the right. When there is no electrical pulse at the modulator (a "0" bit), the modulator blocks light from passing through at the optical output. In this way, the device "modulates" the intensity of the input laser beam, and the modulator converts a stream of digital bits ("1"s and "0"s) from electrical signals into light pulses. December 05, 2007" http://www.flixxy.com/optical-computing.htm
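The modulator described above is essentially on-off keying: a "1" bit lets the input laser reach the output waveguide, a "0" bit blocks it. The toy sketch below only illustrates that mapping; it has nothing to do with IBM's actual device or software.

```c
/* Toy illustration of the on-off modulation described above: each '1' bit
   passes the input laser to the output waveguide, each '0' bit blocks it.
   Purely conceptual; nothing here models real silicon photonics. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *bits = "101100";   /* electrical bit stream arriving at the modulator */

    for (size_t i = 0; i < strlen(bits); i++) {
        if (bits[i] == '1')
            printf("bit %zu = 1 -> light pulse transmitted\n", i);
        else
            printf("bit %zu = 0 -> laser blocked, no pulse\n", i);
    }
    return 0;
}
```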
Seagate crams 329 gigabits of data per square inch http://ct.zdnet.com/clicks?t=73361625-e808f46de0195a86f73d2cce955257f9-bf&brand=ZDNET&s=5 Seagate has announced that it is shipping the densest 3.5-inch desktop hard drive available, cramming an incredible 329 gigabits per square inch. The new drive, the Barracuda 7200.12, offers 1TB of storage on two platters; the high density is achieved by using Perpendicular Magnetic Recording technology. Seagate hopes to add more platters later this year in order to boost capacity even further. The Barracuda 7200.12 is a 7,200RPM drive with a 3Gbps serial ATA (SATA) interface that offers a sustained transfer rate of up to 160MB/s and a burst speed of 3Gbps. Prior to the Barracuda 7200.12, the Seagate drive with the greatest density was the Barracuda 7200.11, which offered 1.5TB of storage across four platters.
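Two figures follow directly from the Barracuda 7200.12 numbers above: capacity per platter and how long a full sequential read would take at the quoted sustained rate. The sketch below is just that arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) as drive vendors use.

```c
/* Rough numbers implied by the Barracuda 7200.12 figures above: capacity
   per platter and the time for one full sequential read at the quoted
   sustained rate. Derived values only; decimal units assumed. */
#include <stdio.h>

int main(void)
{
    double capacity_bytes = 1e12;    /* 1 TB               */
    int    platters       = 2;
    double sustained_bps  = 160e6;   /* 160 MB/s sustained */

    double seconds = capacity_bytes / sustained_bps;

    printf("Capacity per platter: %.0f GB\n", capacity_bytes / platters / 1e9);
    printf("Full sequential read: %.0f s (about %.1f hours)\n",
           seconds, seconds / 3600.0);
    return 0;
}
```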
More Chip Cores Can Mean Slower Supercomputing, Sandia Simulation Shows. Sandia National Laboratories (01/13/09) Singer, Neal. Simulations at Sandia National Laboratories have shown that increasing the number of processor cores on individual chips may actually worsen the performance of many complex applications. The Sandia researchers simulated key algorithms for deriving knowledge from large data sets, which revealed a significant increase in speed when switching from two to four cores, an insignificant increase from four to eight cores, and a decrease in speed when using more than eight cores. The researchers found that 16 cores were barely able to perform as well as two cores, and using more than 16 cores caused a sharp decline as additional cores were added. The drop in performance is caused by a lack of memory bandwidth and contention between processors over the memory bus available to each processor. The lack of immediate access to individualized memory caches slows the process down once the number of cores exceeds eight, according to the simulation of high-performance computing by Sandia researchers Richard Murphy, Arun Rodrigues, and Megan Vance. "The bottleneck now is getting the data off the chip to or from memory or the network," Rodrigues says. The challenge of boosting chip performance while limiting power consumption and excessive heat continues to vex researchers. Sandia and Oak Ridge National Laboratory researchers are attempting to solve the problem using message-passing programs. Their joint effort, the Institute for Advanced Architectures, is working toward exaflop computing and may help solve the multichip problem.
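The rise-then-fall behavior Sandia reports can be illustrated with a toy contention model: each task has a compute part that splits perfectly across cores and a memory part whose cost grows as more cores fight over the shared bus. The sketch below is not Sandia's simulator; the constants are invented purely to show the shape of the curve.

```c
/* Toy contention model, not Sandia's simulator: each task has a compute
   part that parallelizes perfectly and a memory part whose cost grows as
   more cores contend for the shared memory bus. Constants are invented
   purely to reproduce the rise-then-fall shape reported above. */
#include <stdio.h>

int main(void)
{
    double compute = 1.0;    /* seconds of compute per task (splits across cores)   */
    double memory  = 0.10;   /* seconds of memory traffic per task on an idle bus   */
    double k       = 0.15;   /* extra bus cost added per additional contending core */

    double t1 = compute + memory;   /* single-core time */

    for (int n = 1; n <= 32; n *= 2) {
        double tn = compute / n + memory * (1.0 + k * (n - 1));
        printf("%2d cores: speedup %.2f\n", n, t1 / tn);
    }
    return 0;
}
```

With these made-up constants the speedup peaks around eight cores and then declines, the same qualitative pattern the Sandia simulations showed.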
Microscope Has 100 Million Times Finer Resolution Than Current MRI. [Figure caption: An artistic view of the magnetic tip (blue) interacting with the virus particles at the end of the cantilever.] Scientists at IBM Research, in collaboration with the Center for Probing the Nanoscale at Stanford University, have demonstrated magnetic resonance imaging (MRI) with volume resolution 100 million times finer than conventional MRI. This signals a significant step forward in tools for molecular biology and nanotechnology by offering the ability to study complex 3D structures at the nanoscale. By extending MRI to such fine resolution, the scientists have created a microscope that may ultimately be powerful enough to unravel the structure and interactions of proteins, paving the way for new advances in personalized healthcare and targeted medicine. This advancement was enabled by a technique called magnetic resonance force microscopy (MRFM), which relies on detecting ultrasmall magnetic forces. In addition to its high resolution, the imaging technique is chemically specific, can "see" below surfaces and, unlike electron microscopy, is non-destructive to sensitive biological materials. The researchers use MRFM to detect tiny magnetic forces as the sample sits on a microscopic cantilever - essentially a tiny sliver of silicon shaped like a diving board. Laser interferometry tracks the motion of the cantilever, which vibrates slightly as magnetic spins in the hydrogen atoms of the sample interact with a nearby nanoscopic magnetic tip. The tip is scanned in three dimensions and the cantilever vibrations are analyzed to create a 3D image.
With us again today is James Reinders, a senior engineer at Intel.

DDJ: James, how is programming for a few cores different from programming for a few hundred cores?

JR: As we go from a few cores to hundreds, two things happen: 1. Scaling is everything, and single-core performance is truly uninteresting in comparison; 2. Shared memory becomes tougher and tougher to count on, or disappears altogether. For programmers, the shift to "Think Parallel" is not complete until we truly focus on scaling in our designs instead of performance on a single core. A program which scales poorly, perhaps because it divides work up crudely, can hobble along for a few cores. However, running a program on hundreds of cores will reveal the difference between hobbling and running. Henry Ford learned a lot about automobile design while doing race cars before he settled on making cars for the masses. Automobiles which ran under optimal conditions at slower speeds did not truly shake out a design the way less optimal, high-speed racing conditions did. Likewise, a programmer will find designing programs for hundreds of cores to be a challenge. I think we already know more than we think. It is obvious to think of supercomputer programmers, usually scientific in nature, as having figured out how their programs can run in parallel. But let me suggest that Web 2.0 is highly parallel -- and is a model which helps with the second issue in moving to hundreds of cores.

Going from a few cores to many cores means several changes in the hardware which impact software a great deal. The biggest change is in memory, because with a few cores you can assume every core has equal access to memory. It turns out having equal access to memory simplifies many things for programmers; many ugly things do not need to be worried about. The first step away from complete bliss is when, instead of equal access (UMA), you move to unequal but still globally available access (NUMA). In really large computers, memory is usually broken up (distributed) and is simply not globally available to all processors. This is why in distributed-memory machines programming is usually done with messages instead of shared memory. Programs can easily move from UMA to NUMA; the only real issue is performance -- and there will be countless tricks in very complex hardware to help mask the need for tuning. There will, nevertheless, be plenty of opportunity for programmers to tune for NUMA the same way we tune for caches today. The gigantic leap, it would seem, is to distributed memory. I have many thoughts on how that will happen, but that is a long way off -- sort of. We see it already in web computing -- Web 2.0, if you will, is a distributed programming model without shared memory -- all using messages (HTML, XML, etc.). So maybe the message passing of the supercomputer world has already met its replacement for the masses: Web 2.0 protocols.

DDJ: Are compilers ready to take advantage of these multi-core CPUs?

JR: Compilers are great at exploiting parallelism and terrible at discovering it. When people ask the question you did of me, I find they are usually wondering about automatic compilers which take my program of today and magically find parallelism and produce great multi-core binaries. That is simply not going to happen. Every decent compiler will have some capability to discover parallelism automatically, but it will simply not be enough. The best explanation I can give is this: it is an issue of algorithm redesign.
We don't expect a compiler to read in a bubble sort function and compile it into a quick sort function. That would be roughly the same as reading most serial programs and compiling them into parallel programs. The key is to find the right balance of how to have the programmer express the right amount of the algorithm and the parallelism, so the compiler can take it the rest of the way. Compilers have done a great job exploiting SIMD parallelism for programs written using vectors or other syntaxes designed to make the parallelism accessible enough that the compiler does not have too much difficulty discovering it. In such cases, compilers do a great job exploiting MMX, SSE, SSE2, etc. The race is on to find the right balance of programming practices and compiler technology. While the current languages are not quite enough, we've seen small additions like OpenMP yield big results for a class of applications [see the short OpenMP sketch after this interview]. I think most programming will evolve to use small changes which open up the compiler to seeing the parallelism. Some people advocate whole new programming languages, which allow much more parallelism to be expressed explicitly. This is swinging the pendulum too far for most programmers, and I have my doubts that any one solution is general-purpose enough for widespread usage.

DDJ: Earlier in our conversation, I gave you an I.O.U. for the beer you asked for. Will you share with readers what Prof. Norman R. Scott told you and why you blame it for making you so confident in the future of computing?

JR: Okay, but since I've been living in the Pacific Northwest for some time, you need to know that I'm not likely to drink just any beer. In 1987, my favorite college professor was retiring from teaching at the University of Michigan. He told us that when he started in electronics he would build circuits with vacuum tubes. He would carefully tune the circuit for that vacuum tube. He thought it was wonderful. But if the vacuum tube blew he could get a new one, and then he would have to retune the circuit to the new vacuum tube because they were never quite the same. Now this amazed us, because most of us had helped our dads buy "standard" replacement vacuum tubes at the corner drug store for our televisions when we were kids. So the idea of vacuum tubes not being standard and interchangeable seemed super old to us, because even standard vacuum tubes were becoming rare specialty items at Radio Shack (but also perfected to have lifetime guarantees). Next, Prof. Scott noted that Intel had recently announced a million-transistor processor. He liked to call that VLSII (Very Large Scale Integration Indeed!). Now for the punchline: he said his career spanned inconsistent vacuum tubes to a million transistors integrated on a die the size of a fingertip. He asked if we thought technology (or our industry) was moving FASTER or SLOWER than during his career. We all said "faster!" So he asked: "Where will the industry be when your careers end, since you will start with a million transistors on a chip?" I've always thought that was the scariest thing I ever heard. It reminds me still to work to keep up -- lest I be left behind (or run over). So, when people tell me that the challenge before us is huge and never before seen -- and therefore insurmountable -- I'm not likely to be convinced. You can blame Prof. Scott for my confidence that we'll figure it out. I don't think a million vacuum-tube equivalents on a fingertip seemed like anything other than fantasy to him when he started his career -- and now we have a thousand times that.
So I'm not impressed when people say we cannot figure out how to use a few hundred cores. I don't think this way because I work at Intel; I think this way in no small part because of Prof. Scott. Now, I might work at Intel because I think this way. And I'm okay with that. But let's give credit to Prof. Scott, not Intel, for why I think what I do.

DDJ: James, thanks for taking time over the past few weeks for this most interesting conversation.

JR: You're welcome.
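Reinders' remark that small additions like OpenMP can yield big results is easy to see in a minimal example. The sketch below is a generic dot product, not code from the interview: one pragma exposes the loop and its reduction to the compiler and runtime (build with an OpenMP flag, e.g. gcc -O2 -fopenmp).

```c
/* Minimal OpenMP sketch of the "small addition, big result" point from the
   interview above: a single pragma lets the compiler and runtime spread the
   loop and its reduction across however many cores are available.
   Build (GCC): gcc -O2 -fopenmp dot.c -o dot */
#include <stdio.h>
#include <omp.h>

#define N 10000000

static double a[N], b[N];

int main(void)
{
    for (long i = 0; i < N; i++) {   /* fill the vectors */
        a[i] = 0.5 * i;
        b[i] = 2.0;
    }

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)   /* the "small addition" */
    for (long i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("dot product = %.1f (using up to %d threads)\n",
           sum, omp_get_max_threads());
    return 0;
}
```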
Articles are edited to fit the purpose of this page. All copyrights belong to the original source.
Last updated 2/14/10