vme and critical systems,multicore          Other topics:   OpenVPX, RTOS, multicore, VPX, AdvancedTCA, microcontroller, FPGAs, analog-to-digital
VME and Critical Systems
home
articles & topics
product search
White Papers
newswire
E-letter
E-cast Schedule
articles >
Technology Feature
vme and critical systems,multicore
RSS Link
Industry News:
vme and critical s...
  • Live Webcast Announced: Multicore Migration: The Challenges of migrating to a multicore design
    2 years ago
  • Technical Overview of TI Multicore Solutions for High-Density Applications Live Event: November 5, 11:00 AM EDT
    1 year ago
  • Webcast: Multicore Migration: The Challenges of Migrating to a Multicore Design. May 20th
    2 years ago
  • More Industry News headlines...
Technology Partnerships:
vme and critical s...
  • Texas Instruments and PolyCore Software collaborate to ease application deployment for multicore systems
    3 months ago
  • MontaVista Continues Strong Support for Freescale Multicore Processors and New VortiQa Software Products
    1 year ago
  • US Technologies Offers Testing and Repair of VME, MVME, VMEbus Products
    1 year ago
  • More Technology Partnerships headlines...
Contracts:
vme and critical s...
  • Enea Signs Multicore and Hypervisor Deal Worth 2 MEUR
    2 weeks ago
  • Enea Signs 110 MYEN Multicore deal in Japan
    4 months ago
  • Tundra Semiconductor's Serial RapidIO Switch Selected by VMETRO
    2 years ago
  • More Contracts headlines...
New Products:
vme and critical s...
  • Release of The NAG Library for SMP and Multicore
    3 months ago
  • Elma Bustronic has Over 30 Standard Slot Sizes for 6U and 7U VME64x Backplanes
    10 months ago
  • Wind River Extends Virtualization Support With New Release of Wind River Hypervisor
    5 months ago
  • More New Products headlines...
People:
vme and critical s...
  • BittWare Expands Technical Management Team
    2 years ago
  • USMC 234th Birthday Tribute Video
    8 months ago
  • VMETRO Bolsters Leadership Team
    5 years ago
  • More People headlines...
Mergers and Acquisitions:
vme and critical s...
  • CriticalBlue raises $4M, adds Investors Toshiba Corporation and Scottish Venture Fund
    1 year ago
  • Freescale signs definitive agreement to purchase Intoto, Inc.
    1 year ago
  • From the Blog: Former Motorola Manager Sounds off on Emerson's Acquisition
    2 years ago
  • More Mergers and Acquisitions headlines...
Conferences and Awards:
vme and critical s...
  • Multicore Expo Unveils All-Star Presentation Line-Up
    3 years ago
  • ESC Silicon Valley Joins Forces With the Fifth Multicore Expo in San Jose
    8 months ago
  • Texas Instruments to Share the latest Multicore Design Challenges and Technology Trends at Multicore Expo
    3 months ago
  • More Conferences and Awards headlines...
Media and Education:
vme and critical s...
  • OpenSystems Publishing Renames VMEbus Systems Magazine to 'VME and Critical Systems' Magazine
    3 years ago
  • OpenSystems Publishing Launches New VME E-site
    3 years ago
  • Why hardware designers should switch to Eclipse
    4 months ago
  • More Media and Education headlines...
Standard Certifications and References:
vme and critical s...
  • Certification to IRIS
    1 year ago
  • PICMG adds Ethernet Fabric to Advanced Mezzanine Card
    3 years ago
  • BittWare Commits to Long Term VITA 41 VXS Roadmap
    4 years ago
  • More Standard Certifications and References headlines...
Browse topics
Search Articles
Browse Articles
See Also:
Military Articles
Embedded Computing Articles
CompactPCI Articles
Magazine >

About the Magazine
Editorial Topics
Free Subscription
Reader Service Card
Search Articles
Search Products
Contact Information
Columns

Editor's Foreword
VITA News
VITA Standards
Technology in Europe
Military Technology Insider
Guest Editorial
Defining Standards
Departments

Editor's Choice Products
by Chris A. Ciufo
VMEnow Blog
What is VME?
VME: Then & NOW
Webcasts

Upcoming E-casts
Archived E-casts
Submissions

Submit a Press Release
Submit a New Product
Submit an Abstract for Review
Vendors/Sponsors

Do an E-cast
Preferred Vendors
Upcoming Issue
Advertise
Editorial Calendar
Media Kits










The secret to improving multicore performance: Data flow and cache analysis

By
Doug Sandy
Emerson Network Power Embedded Computing

For programmers, few experiences are more frustrating than multithreading an algorithm for a multicore application only to find the performance is barely superior to single core. To avoid this frustration, achieve significantly faster runtimes, and predict performance scalability when they add one, two, or more cores, programmers benefit from greater visibility into multicore performance bottlenecks. Doug explores how analyses of data flow and cache usage can help programmers zero in on key factors that slow runtime performance, and a cache analysis software library is key.

Itís a familiar question for programmers: How can faster runtimes be achieved on multicore architectures? The simple answer is multithreading. In practice, the time-consuming task of multithreading algorithms for multicore applications often fails to produce an appreciable performance boost. Frustrated programmers look for solutions in instruction cycles, but many times coding issues are not the problem. Instead, the answer often lies in data flow and cache usage: two areas that can be easily analyzed to improve algorithm – and application – performance.

Data flow and cache analyses are effective ways to identify where and how multicore processors lose performance. Helpful tools and techniques include:

1. Processor performance counters – Counters are highly accurate and provide a snapshot of everything that takes place when code is running, including cache operations, executed instructions, and frequency of core stalls. Limitations include the difficulty of capturing the performance of code by itself and the need to have a specific hardware target in place.

2. Simulators and emulators – Processor suppliers sometimes provide simulator or emulator technology, which enable code to run and designers to possibly view activity within caches. These tools might become available as soon as a processorís architecture is defined, but generally they are only applicable for specific processor architectures.

3. Instrumentation of existing code – Developers can gain greater visibility into how data flows through cores by using a cache analysis software library that lets them run code through simulated cache structures of their own definition. Users receive detailed information about cache problems, which can lead to immediate code changes and follow-up analyses.

The third method of data flow and cache analyses – instrumentation of existing code – holds particular promise. This technique gives the freedom to conduct ìwhat-ifî analyses, where users create and test multiple cache architectures – all without need for a specific hardware target or specialized emulation.

Cache analyses and data flows: A closer look

Dataflow analysis is most useful when the application of interest consists of a small set of ìwell-behavedî algorithms, each having deterministic data flow and few control branches. Applications that fall into this class tend toward complex mathematical computations and machine control algorithms. Performance of database applications, on the other hand, is much more difficult to characterize because the dataflow varies wildly based upon the particular records being accessed and the database's size and structure. Fortunately, many applications of interest, such as image reconstruction and machine control, fall into the ìwell-behavedî class.

Example: Data flow and cache analyses

Figure 1 uses the Fast Fourier Transform (FFT) algorithm to demonstrate the kind of visibility into multicore performance bottlenecks provided by data flow and cache analyses. Performance is critical in the FFT, which is commonly used in the military and medical fields, and it is well-behaved with respect to its dataflow.

Figure1
Figure 1: The Fast Fourier Transform (FFT) algorithm is used to demonstrate the kind of visibility into multicore performance bottlenecks provided by data flow and cache analyses.

 

The left side of Figure 1 illustrates a simple decomposition of a 16-point, 4-core FFT in which the array is shown vertically and indicated by circles. Core 1 begins by working on the first four points in the array, computing values, and then, at each stage, waiting for the other cores to complete their processing. Core 2 operates in much the same way, always waiting for other cores to complete their processing before recalculating its results, because each stage depends on data computed in earlier stages. The threads must be synchronized at each stage where data comes from a different core. In this instance, synchronization stops at stage three.

The left side of Figure 1 shows data flowing from one coreís cache to another – information that is not normally visible when programming. When data needlessly flows from the cache of one core to the cache of another core, application runtime slows down.

The right side of Figure 1 shows runtime improvements that can result from clues provided by data flow and cache analyses. A decrease in the number of arrows crossing cache domains represents 1) an improvement in the performance of the FFT algorithm; 2) the elimination of data sharing between cores; and 3) elimination of the double buffering that occurs when two processors are no longer writing back to the same data that is being read.

Note that in the right side of Figure 1, Core 2 and Core 4 sit idle until the third stage. This would be a concern if computation was the limiting factor in performance, but in this case, it is not – data flow is.

Figure 2 shows the time the processor spends in computation and data movement for three different FFT cases. The left side shows a single-core implementation. The middle and right stacks show time breakdowns for the first and second algorithms from Figure 1.

Figure2
Figure 2: The time the processor spends in computation and data movement for three varying FFT cases differs. The left side shows a single-core implementation, while the middle and right side stacks show time breakdowns for the first and second algorithms from Figure 1.

 

The middle stack of Figure 2 shows the increased time that a processor spends accessing memory – a key factor in runtimes and one that data flow and cache analyses can help resolve. In the single-core example, memory operations do not take as long because data fits well in the caches. In the multicore example, memory access is slower because data does not fit well in the cache after it has been double buffered.

The right side of Figure 2 shows the better performance that can result when the aforementioned bottlenecks are addressed. The time spent accessing memory and the L2 cache are greatly reduced. Now, the number of instructions and memory accesses to the L1 cache are the largest bottlenecks. Both of these can be addressed using vector processing instructions. 

These charts, generated from instrumentation of the code, enable the programmer to easily visualize problem areas and focus on changes that maximize overall performance.  Although not shown in this example, code instrumentation also allows the programmer to change the size or structure of the caches in order to understand implications on code performance – something that is impossible to do with processor simulators or performance counters.   

Greater visibility for better performance

Multithreading an algorithm is time consuming, and too often programmers find their efforts have resulted in meager performance gains.

Data flow and cache analyses can prevent that frustration by providing visibility into multicore performance limitations. Emersonís complimentary Cachelib cache analysis software library supports instrumentation of existing code and makes analysis of data flow and cache easier and more detailed than ever before. By emulating the caching structure, the software enables what-if code analyses of multiple algorithms and cache architectures.

When designers obtain better visibility into performance limitations and use it to build and experiment with new cache architectures, frustration lessens and better multicore performance emerges.

Doug Sandy is the senior staff technologist of the Technology and Strategy department for the Embedded Computing business of Emerson Network Power. He is responsible for evaluating the performance of computer systems, constructing models for the systems, and predicting system behavior. He also focuses on the strategic development of computing, networking, memory, and storage trends. He holds bachelorís and masterís degrees in Electrical Engineering from California Polytechnic State University, San Luis Obispo, California. He can be contacted at Doug.Sandy@Emerson.com.

Emerson Network Power

Embedded Computing

602-438-3392

www.emersonnetworkpower/embeddedcomputing

 

To obtain a free copy of the Cachelib library, please e-mail the author at Doug.Sandy@Emerson.com.




©MMIX VME and Critical Systems. An OpenSystems Media, LLC publication.
About this Magazine and Website | Contact Us | VME and Critical Systems Media Kit