6. Programming Topics

6.1 Rationale

The material is organized into three subtopics: paradigms and notations, correctness, and performance/energy. We discuss these in separate sections below. A prerequisite for coverage of much of this material is some background in conventional programming. Even though we advocate earlier introduction of parallelism in a student’s programming experience, basic algorithmic problem-solving skills must still be developed, and we recognize that it may be easier to begin with sequential models. Covering parallel algorithms before this material would allow the focus here to be exclusively on the practical aspects of parallel and distributed programming, but the algorithms can also be covered concurrently, as necessary and appropriate. Parallel software development can be taught using many different languages and tools, including Java, C, C++, Python, OpenMP, CUDA, MPI, and many others.

Paradigms and Notations: There are different approaches to parallel programming, and these can be classified in many different ways. Here we use two ways of classifying the models. First, we classify the paradigms by the target machine model: SIMD (single instruction, multiple data) is the paradigm in which parallelism is confined to operations on (corresponding) elements of arrays. This linguistic paradigm is at the basis of the Intel Advanced Vector Extensions (AVX) and IBM Vector Scalar Extension (VSX) macros, some database operations, some operations in data structure libraries, and the language constructs used for vector machines. Shared memory is the paradigm of OpenMP and Intel’s Threading Building Blocks, among other examples. Distributed memory is the paradigm underlying message passing and the MPI standard. A hybrid model arises when any of the previous three paradigms co-exist in a single program. The logical target machine does not have to be identical to the physical machine. For example, a program written according to the distributed memory paradigm can be executed on a shared-memory machine, and programs written in the shared-memory paradigm can be executed on a distributed memory machine with appropriate software support (e.g., Intel’s Cluster OpenMP). More loosely coupled models are also addressed, including client/server, peer-to-peer, and big data models.
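The shared-memory and distributed-memory paradigms above can be contrasted in a few lines of Python (one of the languages listed in the rationale). This is an illustrative sketch, not a prescribed exercise; all function names are ours, and the process-based variant assumes a platform (such as Linux) where `multiprocessing` forks workers. Threads share the input array and a lock-protected accumulator, while processes receive their slice and return their partial sum purely through messages:

```python
import threading
import multiprocessing as mp

DATA = list(range(1000))  # sum is 499500


def shared_memory_sum(num_threads=4):
    """Shared-memory paradigm: threads read DATA directly and update a shared total."""
    total = [0]
    lock = threading.Lock()

    def worker(lo, hi):
        partial = sum(DATA[lo:hi])
        with lock:                      # critical region protects the shared accumulator
            total[0] += partial

    step = len(DATA) // num_threads
    threads = [threading.Thread(target=worker, args=(i * step, (i + 1) * step))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


def _sum_worker(conn):
    chunk = conn.recv()                 # data arrives as an explicit message
    conn.send(sum(chunk))               # partial result is sent back as a message
    conn.close()


def distributed_memory_sum(num_procs=4):
    """Distributed-memory paradigm: separate address spaces, communication by messages."""
    step = len(DATA) // num_procs
    channels, procs = [], []
    for i in range(num_procs):
        parent_end, child_end = mp.Pipe()
        p = mp.Process(target=_sum_worker, args=(child_end,))
        p.start()
        parent_end.send(DATA[i * step:(i + 1) * step])
        channels.append(parent_end)
        procs.append(p)
    total = sum(c.recv() for c in channels)
    for p in procs:
        p.join()
    return total
```

Both versions compute the same reduction; only the communication model differs, which is exactly the distinction the paragraph draws.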

A second way to classify programming approaches is according to the mechanisms that control parallelism. These are (mostly) orthogonal to the first classification. For example, programs in the SPMD (single program multiple data) paradigm can follow a distributed-memory, shared-memory, or even the SIMD model from the first classification. The same is true of programs following the data parallel model. The task spawning model can work within a distributed- or shared-memory paradigm. The parallel loop form seems to be mainly used with the shared-memory paradigm, but some models/languages have merged the loop model with the distributed memory paradigm (e.g., MapReduce). The recent trend toward accelerators (e.g., GPUs) is also included here. Students are expected to be familiar with several notations (not languages, since in many cases support comes from libraries such as MPI and BSPlib). Not all notations need to be covered, but at least one per main paradigm should be. An example collection that provides this coverage would be Java threads, AVX macros, OpenMP, and MPI.
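The MapReduce control model mentioned above can be sketched with Python's built-in map/reduce machinery. The three phase names (map, shuffle, reduce) are the standard ones, but the code below is our toy word count, not any real framework's API:

```python
from functools import reduce
from collections import defaultdict


def map_phase(documents):
    """Map: emit a (word, 1) pair for every word occurrence."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)


def shuffle(pairs):
    """Shuffle: group all values by key, as a MapReduce runtime would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine each key's values; here, sum the counts."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


docs = ["the quick fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts maps each word to its number of occurrences across docs
```

In a real framework the map and reduce phases run in parallel across machines; the control structure, however, is exactly this three-phase pipeline.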

Correctness and Semantics: This set of topics outlines the material needed to understand the behavior of parallel programs beyond the fact that there are activities that take place (or could take place) simultaneously. Material in this section covers three main topics, as well as the methods and tools necessary to detect and troubleshoot defects. The main topics are tasking, synchronization, and memory models. In this context, tasking refers to the means to create threads on multiple cores and assign work to them, either through implicit or explicit assignment (e.g., OpenMP vs. POSIX threads). Synchronization introduces the concept of critical sections: regions of code whose correct execution depends on the order in which threads enter them. Material here includes critical regions in code as well as producer-consumer models. The third topic covers memory models, explaining the relationship and tradeoffs between strict and relaxed memory models. As many programming languages have their own model, only the basic ideas are expected to be covered. Finally, the section deals with concurrency problems and the tools to detect them. Common defects such as deadlock, starvation, and race conditions are explained, while tools like Eraser or various tools from Intel’s Parallel Toolkit can be covered as a means of detecting the errors.
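The producer-consumer pattern named above has a compact shared-memory realization in Python. This is a minimal sketch: `queue.Queue` supplies the necessary locking, and the `None` sentinel is our convention for signaling end of work:

```python
import threading
import queue


def producer(q, items):
    for item in items:
        q.put(item)            # blocks if the queue is full
    q.put(None)                # sentinel: no more work


def consumer(q, results):
    while True:
        item = q.get()         # blocks until an item is available
        if item is None:
            break
        results.append(item * item)


q = queue.Queue(maxsize=2)     # small buffer forces a real hand-off
results = []
t_prod = threading.Thread(target=producer, args=(q, [1, 2, 3, 4]))
t_cons = threading.Thread(target=consumer, args=(q, results))
t_prod.start()
t_cons.start()
t_prod.join()
t_cons.join()
# with a single consumer, results preserve order: [1, 4, 9, 16]
```

Because the queue handles mutual exclusion and blocking internally, students can study the communication pattern before being shown the underlying condition variables.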

Performance and Energy: The final group of topics concerns performance and energy issues: how to organize the computation and the data for the different classes of machines, and how to minimize energy usage. Topics in this section are divided into four categories: Computation, Data, Tools and Metrics, and Power/Energy Efficiency. The first two categories (Computation and Data) cover parallel programming aspects that influence performance, either in the form of task decomposition or through access to remote data. The third category introduces the notion of performance metrics and the ways in which they can be collected. In more detail, Computation includes task decomposition strategies and how tasks can be assigned to different threads/processes; this includes the performance effects of load balancing and its contributing factors, such as scheduling, failures, and distribution delays. Data includes a variety of topics, from data representation and its effect on performance and energy, to data locality, and the performance tradeoffs of laying out data and accessing data remotely. This category introduces different storage and organization paradigms, as well as issues arising from distribution and replication, such as consistency and atomicity of operations. Tools and Metrics covers the ways in which performance is measured and evaluated, the laws describing performance in a parallel setting, and finally performance in relation to energy/power consumption.
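The core metrics covered under Tools and Metrics (speedup, efficiency, Amdahl's law, Gustafson's law) reduce to one-line formulas. The function names below are ours, given here as a compact reference; `f` is the sequential fraction of the program and `p` the number of processors:

```python
def amdahl_speedup(f, p):
    """Amdahl's law: speedup bound with sequential fraction f on p processors
    (problem size held fixed)."""
    return 1.0 / (f + (1.0 - f) / p)


def gustafson_speedup(f, p):
    """Gustafson's law: scaled (weak-scaling) speedup when the parallel
    part of the work grows with p."""
    return p - f * (p - 1)


def efficiency(speedup, p):
    """Efficiency: speedup achieved per processor."""
    return speedup / p


# With a 10% sequential fraction, 1000 processors yield under 10x speedup
# by Amdahl's law, but about 900x scaled speedup by Gustafson's law --
# the contrast between strong and weak scaling in the table below.
```

Having students evaluate these formulas for a few (f, p) pairs makes concrete why the sequential fraction, not the processor count, dominates fixed-size speedup.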

6.2 Updates from version 1.0

We highlight the changes to the curriculum guidelines relative to version 1.0 of this document.  The main changes are related to adding topics from new aspects, including big data, distributed computing, and energy.  Many of those topics are suitable for upper level classes in parallel programming, distributed systems, databases and others, but some topics have influenced the learning outcomes for core classes.  The energy topics can be addressed in the core Systems class.  One additional change is the addition of programming for accelerators, such as GPUs, which can be introduced in a CS2 or Systems class and addressed more deeply in an upper level Parallel Programming class.

Table 3: Programming Topics

Topics | Core Bloom Level | Learning Outcomes and Teaching Suggestions (core) | Advanced Bloom Level | Learning Outcome (advanced) | Where Covered
Parallel Programming Paradigms and Notations
By the target machine model     
  Concurrency and Parallelism | C | Understand that concurrency is an algorithmic property; it exposes potential for parallelization. If concurrency is present in an algorithm, it can be parallelized; without concurrency there is no scope for parallelization. Concurrency can be present in a sequential program; parallelization takes advantage of concurrency to increase performance. | | | CS1; CS2; DS/A
  SIMD | K | Understand common vector operations, including element-by-element operations and reductions. | | | CS2; Systems
  Processor vector extensions | K | Know examples (e.g., Intel AVX or Power VSX macros) | C | Understand examples from Intel/Power vector instructions | Systems; Arch2
  Array language extensions | N | | A | Know how to write parallel array code in some language (e.g., Fortran 95, Intel’s C/C++ Array Extension [CEAN]) | ParProg
  Shared memory | A | Be able to write correct thread-based programs (protecting shared data) and understand how to obtain speedup. | | | CS2; DS/A
  Language parts or extensions | K | Know about language extensions for parallel programming. Illustrate with examples from Cilk (spawn/join), Java (Java threads), or other languages. | | |
  Compiler directives/pragmas | C | Understand what simple directives, such as those of OpenMP, mean (parallel for, concurrent section); show examples | | |
  Libraries | C | Know one in detail, and know of the existence of some other example libraries, such as Pthreads, Pfunc, Intel’s TBB (Threading Building Blocks), Microsoft’s TPL (Task Parallel Library), C++ threads, etc. | | |
  Distributed memory | K | Know basic notions of messaging among processes, different ways of message passing, and collective operations | | | Systems; DS/A
  Message passing | N | | C | Know about the overall organization of a message-passing program, as well as point-to-point and collective communication primitives (e.g., MPI) | ParProg
  PGAS languages | N | | C | Know about partitioned address spaces and other parallel constructs (e.g., UPC, Co-Array Fortran, Chapel) | ParProg
  Client Server and Peer-to-Peer models | C | Know notions of invoking and providing services (e.g., RPC, RMI, web services) and understand these as concurrent processes; know about the network model of distributed computing (e.g., sockets); know that in distributed systems such handshaking interaction is crucial for efficient communication between asynchronous processes | A | Be able to program a basic client/server and/or P2P interface | Systems; Networking
  Big Data Technology Stack | N | | A | Understand the big data technology stack and its layered architecture. Be able to write code (e.g., in Python, R) using some of the tools that facilitate data storage, organization, management, and/or analysis within a big data stack | ParProg; DistSystems
  Hybrid | K | Know the notion of programming over multiple classes of machines simultaneously (CPU, GPU, TPU, etc.) | A | Be able to write correct programs using two programming paradigms, e.g., shared memory and GPU (OpenMP+CUDA), distributed and shared memory (MPI+OpenMP), or distributed memory and GPU (MPI+CUDA) | Systems; ParProg
By the control statement     
  Task/thread spawning | A | Be able to write correct programs with threads; synchronize (fork-join, producer/consumer, master/worker, etc.); use dynamic thread creation (in number and possibly recursively) (e.g., Pthreads, Java threads, etc.). Builds on the shared memory topic above. | | | CS2; DS/A
  Event-Driven Execution | K | Know about the need for event-driven execution and possible approaches to implementing it. Know about the notion of causality among events (e.g., remote file access, GUI). These effects may be easier to discuss in the context of distributed systems. | A | | CS2; DistSystems
  SPMD | C | Understand how an SPMD program is written and how it executes | | Be able to write an SPMD program and understand how it executes |
  SPMD notations | C | Know the existence of highly threaded data parallel notations (e.g., CUDA, OpenCL), message passing (e.g., MPI), and some others (e.g., Global Arrays, BSP library) | A | | CS2; DS/A; ParProg
  Data parallel | A | Be able to write a correct data-parallel program for shared-memory machines and get speedup; students should do an exercise. Understand the relation between different notations for data parallelism: array notations, SPMD, and parallel loops. Builds on the shared memory topic above. | | |
  Parallel loops for shared memory | A | Know, through an example, one way to implement parallel loops; understand collisions/dependencies across iterations (e.g., OpenMP, Intel’s TBB) | A | | CS2; DS/A; Lang
  Data parallel for distributed memory | N | | K | Know data parallel notations for distributed memory (e.g., UPC, Chapel, Co-Array Fortran) | ParProg
  MapReduce | K | Understand how problems can be solved by MapReduce, and how algorithms can be written using map and reduce | A | Solve problems using MapReduce, and write algorithms using map and reduce | CS2; Lang; ParProg
  Offloading to accelerators | K | Know about running parts of applications on accelerators (e.g., GPU, TPU, FPGA) | | | CS2; Systems
  Accelerator notations | N | | A | Be able to write an accelerator program that takes advantage of the hardware (e.g., CUDA, OpenACC, OpenMP 4.5 or above, OpenCL, TensorFlow) | ParProg
  Functional/logic languages | N | | K | Understand advantages and disadvantages of very different programming styles (e.g., Parallel Haskell, Parlog, Erlang) | ParProg
Semantics and correctness issues     
Tasks and threads | A | Be able to write parallel programs that create and assign work to threads/processes, in at least one parallel environment (e.g., OpenMP, Intel TBB, Pthreads, etc.) | A | | CS2; DS/A; Systems; Lang
Synchronization | A | Be able to write shared memory programs with critical regions and producer-consumer communication, and get speedup; know the notions of mechanisms for concurrency (monitors, semaphores, etc.) | | | CS2; DS/A; Systems
  Critical regions | A | Be able to write shared memory programs that use critical regions for synchronization | | |
  Producer-consumer | A | Be able to write shared memory programs that use the producer-consumer pattern to share data and synchronize threads | | |
  Handshaking | N | | A | Analyze handshaking protocols and derive performance bounds (e.g., response time in sliding window, TCP connection management) | Networking; DistSystems
Concurrency issues | C | Understand the notions of deadlock (detection, prevention), race conditions (definition), and determinacy/non-determinacy in parallel programs (e.g., if there is a race condition, the correctness of the output may depend on the order of execution) | | | DS/A; Systems
  Deadlock/Livelock | C | Understand what deadlock and livelock are, and methods for detecting and preventing them; also cast in terms of distributed systems | | |
  Starvation | C | Understand how starvation (of a thread or process) can occur, in the context of an example (e.g., dining philosophers) | | |
  Race condition | C | Know what a race condition is, and how to use synchronization to prevent it | | |
  Distributed Data Structures and Applications | N | | C | Understand synchronization in the context of data structures; correctness in a concurrent context | ParProg; DistSystems
  Tools to detect concurrency defects | N | | K | Know of the existence of tools to detect race conditions (e.g., Eraser) and debugging tools (e.g., Intel Parallel Toolkit) | ParProg
Memory models | N | | C | Know what a memory model is, and the implications of the difference between strict and relaxed models (performance vs. ease of use) | ParProg
  Sequential consistency | N | | | Understand the semantics of sequential consistency for shared memory programs |
  Relaxed consistency | N | | | Understand the semantics of one relaxed consistency model (e.g., release consistency) for shared memory programs |
  Consistency in distributed transactions | N | | C | Recognize consistency problems. Know that consistency is an issue in transactions issued concurrently by multiple agents. Implement transaction commit protocols in databases. | DB; DistSystems
Performance and Energy issues     
Computation | C | Understand the basic notions of static and dynamic scheduling, mapping, and the impact of load balancing on performance | | |
  Computation decomposition strategies | C | Understand different ways to assign computations to threads or processes | | | CS2; DS/A
  Owner computes rule | C | Understand how to assign loop iterations to threads based on which thread/process owns the data element(s) written in an iteration | | |
  Decomposition into atomic tasks | C | Understand how to decompose computations into tasks with communication only at the beginning and end of each task, and assign them to threads/processes | | |
  Decomposition into Map and Reduce tasks | K | Know that divide and conquer can be expressed and programmed as a Map and Reduce decomposition | A | Understand how to decompose computation using the MapReduce paradigm. Be able to reason about computation speedups from this decomposition and the communication tradeoffs resulting from reduction(s). Be able to produce code expressing this decomposition | CS2; ParProg
  Work stealing | N | | C | Understand one way to do dynamic assignment of computations | ParProg
  Offloading onto an accelerator | N | | C | Understand when it is worthwhile to decompose onto an accelerator (e.g., GPU, TPU, FPGA) | ParProg
  Program transformations | N | | C | Be able to perform simple loop transformations by hand, and understand how that impacts performance of the resulting code (e.g., loop fusion, fission, skewing, blocking) | Compilers; ParProg
  Load balancing | C | Understand the effects of load imbalances on performance, and ways to balance load across threads or processes | | | DS/A; Systems
  Scheduling and mapping | C | Understand the importance of a programmer, compiler, and/or runtime system mapping and scheduling computations to threads/processes, both statically and dynamically | A | Can apply a static and a dynamic strategy for mapping processes of large-scale parallel programs onto processors that optimizes performance through reduction of communication delays | DS/A; Systems; ParProg
  Effect of timing failures/delay in distributed systems | K | Understand that a failure in one node can cause a global failure in a distributed system. For example, one could use waiting on a non-terminating program to illustrate a failure scenario in distributed systems (e.g., in the context of consensus). | | | CS2
  Protocol timeout protections | N | | A | Understand the use of timeouts in situations with a high probability of error | Networking
Data | C | Understand the impact of data distribution, layout, and locality on performance; the notion that transfer of data has a fixed cost plus bit rate (irrespective of transfer from memory or inter-processor); know false sharing and its impact on performance (e.g., in a cyclic mapping in a parallel loop; in ParProg) | | | DS/A; Lang; ParProg
  Data distribution | N | | C | Understand what block, cyclic, and block-cyclic data distributions are, and what it means to distribute data across multiple threads/processes | ParProg
  Data layout | C | Know how to lay out data in memory to get improved performance and energy (memory hierarchy in a shared memory parallel system) | | | DS/A; Systems
  False sharing | K | Know that for cache-coherent shared memory systems, data is kept coherent in blocks, not individual words, and how to avoid false sharing across threads of the data in a block | C | Be aware of false sharing, able to give examples where it occurs, and understand why it happens | Systems; ParProg
  Energy impact | K | Know the energy cost of loading data into memory from secondary storage (and writing out modified data to secondary storage) | | | Systems
  Data Representation     
  Floating point and integer precision (64-bit, 32-bit, and 16-bit or less) | K | Know the power savings from using smaller data representations (64-bit vs. 32-bit vs. 16-bit floating-point precision, 32-bit vs. 16-bit integers). For example, machine learning on GPUs is driving lower (16-bit) floating-point precision. | | | Systems
  Data locality | K | Know what spatial and temporal locality are, and how to organize data to take advantage of them | | | DS/A; Systems
  Performance impact of data movement | K | Know the performance cost of moving data to secondary storage for big data analysis, and of distributed vs. centralized analysis. Be aware of specific mechanisms that take advantage of locality (e.g., in-situ processing) | C | Understand performance costs of data locality with respect to various metrics, and be able to contrast mechanisms like in-situ vs. in-transit vs. offline processing | Systems; ParProg; DistSystems
  Structured vs. unstructured data | K | Know the differences and tradeoffs between these data representations | A | Be able to build solutions for the different types of data | DS/A; Databases
  Graph representations and databases | K | Know of graph representations of data to facilitate graph analysis | C | Understand performance gains due to graph-specific mechanisms versus other more general data representations | DS/A; Databases
  Data handling and manipulation | N | | C | Understand the differences and performance impact of data formatting and storage mechanisms. Be able to implement adequate solutions for the different types of storage | DistSystems; Databases
  Distributed databases | N | | C | Comprehend the principles behind distributed databases and the motivation for tradeoffs to support scalability | DistSystems; Databases
  NoSQL databases | N | | A | Comprehend how NoSQL databases enable scalable data manipulation; include exemplars to become familiar with some of them (e.g., MongoDB, Hive) | Databases
  Eventual consistency vs. ACID | N | | C | Comprehend how replication enables scalability in databases, but transactions lose their ACID properties | Databases
  Distributed file systems | K | Be aware of the existence of distributed file systems and common examples of where they are used and why | K | Comprehend the basic principles of how distributed file systems work, their scalability benefits, and performance and reliability problems | Systems; Operating Systems; DistSystems; ParProg
  Replicated file systems | N | | C | Know of distributed file systems such as HDFS and their replication and fault tolerance mechanisms. In ParProg, be able to use a DFS. | DistSystems; ParProg
  Key-value storage systems | N | | K | Understand the concept of key-value storage. In ParProg, be able to use an appropriate API. | DistSystems; ParProg
Tools and metrics     
  Performance monitoring tools | K | Know of tools for runtime monitoring (e.g., gprof, perf, Intel Performance Toolkit, TAU) | | | DS/A; Systems
  Performance metrics | C | Know the basic definitions of performance metrics (speedup, efficiency, work, cost) and Amdahl’s law; know the notion of scalability | | | CS2; DS/A
  Speedup | C | Understand how to compute speedup, and what it means | | |
  Efficiency | C | Understand how to compute efficiency, and why it matters | | |
  Parallel Scalability | C | Understand that speedup and efficiency are single points of measure for a particular problem size and number of processes/threads, and that these metrics change as problem size and/or the number of processes/threads vary. Understand that scalability is a metric that measures how speedup varies as problem size and/or the number of processes/threads vary. | | |
  Amdahl’s law | C | Understand that speedup is limited by the sequential portion of a parallel program, if problem size is kept fixed | | |
  Gustafson’s Law | K | Know the idea of weak scaling, where problem size increases as the number of processes/threads increases | | |
Power/Energy efficiency     
  Power-latency tradeoff | K | Be familiar with the notion that problem decompositions (including their granularity) and active/idle states (e.g., including modulation of CPU frequency) may be exploited to adjust the balance among throughput, latency, and energy consumption. | | | Systems
  Energy efficiency vs. load balancing | K | Be aware that unbalanced work decomposition and communication congestion can prolong computation and reduce energy efficiency. | | | Systems
  Active power management methods | N | | A | Be aware that systems expose various execution parameters (e.g., P-states on Intel) and have examined the effect of modulating at least one to optimize performance or energy consumption. | ParProg; Arch2
  Idle power management methods | N | | A | Be aware that architectures and OSs provide various interfaces that enable computational units to be passivated (e.g., C-states), and have examined tradeoffs (e.g., reduced subsystem power consumption vs. increased latency) involved in exploiting at least one of them. | ParProg; Arch2
  Power consumption of parallel programs | N | | K | Be aware that optimal energy efficiency may not be achieved through aggressive reduction of CPU clock frequencies and exploitation of sleep modes, due to increased execution time and static components of system power consumption. | ParProg
Security     
  Security protocols | N | | A | Understand IPsec basics | Networking
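The Critical regions and Race condition rows in the table above can be illustrated with a minimal Python sketch; the function names are ours, and the lock-free failure mode is described in comments rather than demonstrated, since its outcome is nondeterministic:

```python
import threading


def add_with_lock(state, lock, n):
    """Increment a shared counter n times inside a critical region."""
    for _ in range(n):
        with lock:                 # critical region: one thread at a time
            state["count"] += 1
    # Without the lock, "+= 1" is a read-modify-write: two threads can read
    # the same value and one update is lost -- a classic race condition.


def run_counter(num_threads=4, per_thread=10000):
    state = {"count": 0}
    lock = threading.Lock()
    threads = [threading.Thread(target=add_with_lock,
                                args=(state, lock, per_thread))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["count"]          # always num_threads * per_thread
```

Removing the `with lock:` line turns this into a hands-on race-condition demonstration, and is also a natural input for the race-detection tools listed in the table (e.g., Eraser).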