6. Programming Topics

6.1 Rationale

The material is organized into three subtopics: paradigms and notations, correctness, and performance/energy. We discuss these in separate sections below. A prerequisite for coverage of much of this material is some background in conventional programming. Even though we advocate earlier introduction of parallelism in a student’s programming experience, basic algorithmic problem-solving skills must still be developed, and we recognize that it may be easier to begin with sequential models. Covering parallel algorithms before this material would allow the focus here to be exclusively on the practical aspects of parallel and distributed programming, but the algorithms can also be covered concurrently, as necessary and appropriate. Parallel software development can be taught using many different languages and tools, including Java, C, C++, Python, OpenMP, CUDA, MPI, and many others.

Paradigms and Notations: There are different approaches to parallel programming, and these can be classified in many different ways. Here we use two ways of classifying the models. First, we classify the paradigms by the target machine model: SIMD (single instruction, multiple data) is the paradigm in which parallelism is confined to operations on (corresponding) elements of arrays. This linguistic paradigm is at the basis of the Intel Advanced Vector Extensions (AVX) and IBM Vector Scalar Extension (VSX) macros, some database operations, some operations in data structure libraries, and the language constructs used for vector machines. Shared memory is the paradigm of OpenMP and Intel’s Threading Building Blocks, among other examples. Distributed memory is the paradigm underlying message passing and the MPI standard. A hybrid model arises when any of the previous three paradigms co-exist in a single program. The logical target machine does not have to be identical to the physical machine. For example, a program written according to the distributed memory paradigm can be executed on a shared-memory machine, and programs written in the shared-memory paradigm can be executed on a distributed memory machine with appropriate software support (e.g., Intel’s Cluster OpenMP). More loosely coupled models are also addressed, including client/server, peer-to-peer, and big data models.
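The shared-memory and distributed-memory paradigms above can be contrasted in a few lines of Python (one of the languages listed in the rationale). This is an illustrative sketch, not a prescribed exercise; all function names are ours, and the process-based variant assumes a platform (such as Linux) where `multiprocessing` forks workers. Threads share the input array and a lock-protected accumulator, while processes receive their slice and return their partial sum purely through messages:

```python
import threading
import multiprocessing as mp

DATA = list(range(1000))  # sum is 499500


def shared_memory_sum(num_threads=4):
    """Shared-memory paradigm: threads read DATA directly and update a shared total."""
    total = [0]
    lock = threading.Lock()

    def worker(lo, hi):
        partial = sum(DATA[lo:hi])
        with lock:                      # critical region protects the shared accumulator
            total[0] += partial

    step = len(DATA) // num_threads
    threads = [threading.Thread(target=worker, args=(i * step, (i + 1) * step))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


def _sum_worker(conn):
    chunk = conn.recv()                 # data arrives as an explicit message
    conn.send(sum(chunk))               # partial result is sent back as a message
    conn.close()


def distributed_memory_sum(num_procs=4):
    """Distributed-memory paradigm: separate address spaces, communication by messages."""
    step = len(DATA) // num_procs
    channels, procs = [], []
    for i in range(num_procs):
        parent_end, child_end = mp.Pipe()
        p = mp.Process(target=_sum_worker, args=(child_end,))
        p.start()
        parent_end.send(DATA[i * step:(i + 1) * step])
        channels.append(parent_end)
        procs.append(p)
    total = sum(c.recv() for c in channels)
    for p in procs:
        p.join()
    return total
```

Both versions compute the same reduction; only the communication model differs, which is exactly the distinction the paragraph draws.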

A second way to classify programming approaches is according to the mechanisms that control parallelism. These are (mostly) orthogonal to the first classification. For example, programs in the SPMD (single program multiple data) paradigm can follow a distributed-memory, shared-memory, or even the SIMD model from the first classification. The same is true of programs following the data parallel model. The task spawning model can work within a distributed- or shared-memory paradigm. The parallel loop form seems to be mainly used with the shared-memory paradigm, but some models/languages have merged the loop model with the distributed memory paradigm (e.g., MapReduce). The recent trend toward accelerators (e.g., GPUs) is also included here. Students are expected to be familiar with several notations (not languages, since in many cases support comes from libraries such as MPI and BSPlib). Not all notations need to be covered, but at least one per main paradigm should be. An example collection that provides this coverage would be Java threads, AVX macros, OpenMP, and MPI.
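The MapReduce control model mentioned above can be sketched with Python's built-in map/reduce machinery. The three phase names (map, shuffle, reduce) are the standard ones, but the code below is our toy word count, not any real framework's API:

```python
from functools import reduce
from collections import defaultdict


def map_phase(documents):
    """Map: emit a (word, 1) pair for every word occurrence."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)


def shuffle(pairs):
    """Shuffle: group all values by key, as a MapReduce runtime would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine each key's values; here, sum the counts."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


docs = ["the quick fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts maps each word to its number of occurrences across docs
```

In a real framework the map and reduce phases run in parallel across machines; the control structure, however, is exactly this three-phase pipeline.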

Correctness and Semantics: This set of topics outlines the material needed to understand the behavior of parallel programs beyond the fact that there are activities that take place (or could take place) simultaneously. Material in this section covers three main topics, as well as the methods and tools necessary to detect and troubleshoot defects. The main topics are tasking, synchronization, and memory models. In this context, tasking refers to the means to create threads on multiple cores and assign work to them, either through implicit or explicit assignment (e.g., OpenMP vs. POSIX threads). Synchronization introduces the concept of critical sections: regions of code whose correct execution depends on the order in which threads enter them. Material here includes critical regions in code as well as producer-consumer models. The third topic covers memory models, explaining the relationship and tradeoffs between strict and relaxed memory models. As many programming languages have their own model, only the basic ideas are expected to be covered. Finally, the section deals with concurrency problems and the tools to detect them. Common defects such as deadlock, starvation, and race conditions are explained, while tools like Eraser or various tools from Intel’s Parallel Toolkit can be covered as a means of detecting the errors.
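The producer-consumer pattern named above has a compact shared-memory realization in Python. This is a minimal sketch: `queue.Queue` supplies the necessary locking, and the `None` sentinel is our convention for signaling end of work:

```python
import threading
import queue


def producer(q, items):
    for item in items:
        q.put(item)            # blocks if the queue is full
    q.put(None)                # sentinel: no more work


def consumer(q, results):
    while True:
        item = q.get()         # blocks until an item is available
        if item is None:
            break
        results.append(item * item)


q = queue.Queue(maxsize=2)     # small buffer forces a real hand-off
results = []
t_prod = threading.Thread(target=producer, args=(q, [1, 2, 3, 4]))
t_cons = threading.Thread(target=consumer, args=(q, results))
t_prod.start()
t_cons.start()
t_prod.join()
t_cons.join()
# with a single consumer, results preserve order: [1, 4, 9, 16]
```

Because the queue handles mutual exclusion and blocking internally, students can study the communication pattern before being shown the underlying condition variables.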

Performance and Energy: The final group of topics concerns performance and energy issues: how to organize the computation and the data for the different classes of machines, and how to minimize energy usage. Topics in this section are divided into four categories: Computation, Data, Tools and Metrics, and Power/Energy Efficiency. The first two categories (Computation and Data) cover parallel programming aspects that influence performance, either in the form of task decomposition or through access to remote data. The third category introduces the notion of performance metrics and the ways in which they can be collected. In more detail, Computation includes task decomposition strategies and how tasks can be assigned to different threads/processes; this includes the performance effects of load balancing and its contributing factors, such as scheduling, failures, and distribution delays. Data includes a variety of topics, from data representation and its effect on performance and energy, to data locality, and the performance tradeoffs of laying out data and accessing data remotely. This category introduces different storage and organization paradigms, as well as issues arising from distribution and replication, such as consistency and atomicity of operations. Tools and Metrics covers the ways in which performance is measured and evaluated, the laws describing performance in a parallel setting, and finally performance in relation to energy/power consumption.
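The core metrics covered under Tools and Metrics (speedup, efficiency, Amdahl's law, Gustafson's law) reduce to one-line formulas. The function names below are ours, given here as a compact reference; `f` is the sequential fraction of the program and `p` the number of processors:

```python
def amdahl_speedup(f, p):
    """Amdahl's law: speedup bound with sequential fraction f on p processors
    (problem size held fixed)."""
    return 1.0 / (f + (1.0 - f) / p)


def gustafson_speedup(f, p):
    """Gustafson's law: scaled (weak-scaling) speedup when the parallel
    part of the work grows with p."""
    return p - f * (p - 1)


def efficiency(speedup, p):
    """Efficiency: speedup achieved per processor."""
    return speedup / p


# With a 10% sequential fraction, 1000 processors yield under 10x speedup
# by Amdahl's law, but about 900x scaled speedup by Gustafson's law --
# the contrast between strong and weak scaling in the table below.
```

Having students evaluate these formulas for a few (f, p) pairs makes concrete why the sequential fraction, not the processor count, dominates fixed-size speedup.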

6.2 Updates from version 1.0

We highlight the changes to the curriculum guidelines relative to version 1.0 of this document.  The main changes are related to adding topics from new aspects, including big data, distributed computing, and energy.  Many of those topics are suitable for upper level classes in parallel programming, distributed systems, databases and others, but some topics have influenced the learning outcomes for core classes.  The energy topics can be addressed in the core Systems class.  One additional change is the addition of programming for accelerators, such as GPUs, which can be introduced in a CS2 or Systems class and addressed more deeply in an upper level Parallel Programming class.

Table 3: Programming Topics

Topics | Core Bloom Level | Learning Outcomes and Teaching Suggestions (core) | Advanced Bloom Level | Learning Outcome (advanced) | Where Covered
Parallel Programming Paradigms and Notations
By the target machine model     
  Concurrency and Parallelism | C | Understand that concurrency is an algorithmic property; it exposes potential for parallelization. If concurrency is present in an algorithm, it can be parallelized; without concurrency there is no scope for parallelization. Concurrency can be present in a sequential program; parallelization takes advantage of concurrency to increase performance. | | | CS1; CS2; DS/A
  SIMD | K | Understand common vector operations, including element-by-element operations and reductions. | | | CS2; Systems
  Processor vector extensions | K | Know examples (e.g., Intel AVX or Power VSX macros) | C | Understand examples from Intel/Power vector instructions | Systems; Arch2
  Array language extensions | N | | A | Know how to write parallel array code in some language (e.g., Fortran 95, Intel’s C/C++ Array Extension [CEAN]) | ParProg
  Shared memory | A | Be able to write correct thread-based programs (protecting shared data) and understand how to obtain speedup. | | | CS2; DS/A
  Language parts or extensions | K | Know about language extensions for parallel programming. Illustrate with examples from Cilk (spawn/join), Java (Java threads), or other languages. | | |
  Compiler directives/pragmas | C | Understand what simple directives, such as those of OpenMP, mean (parallel for, concurrent section); show examples | | |
  Libraries | C | Know one in detail, and know of the existence of some other example libraries, such as Pthreads, Pfunc, Intel’s TBB (Threading Building Blocks), Microsoft’s TPL (Task Parallel Library), C++ threads, etc. | | |
  Distributed memory | K | Know basic notions of messaging among processes, different ways of message passing, and collective operations | | | Systems; DS/A
  Message passing | N | | C | Know about the overall organization of a message-passing program, as well as point-to-point and collective communication primitives (e.g., MPI) | ParProg
  PGAS languages | N | | C | Know about partitioned address spaces and other parallel constructs (e.g., UPC, Co-Array Fortran, Chapel) | ParProg
  Client Server and Peer-to-Peer models | C | Know notions of invoking and providing services (e.g., RPC, RMI, web services) and understand these as concurrent processes; know about the network model of distributed computing (e.g., sockets); know that in distributed systems such handshaking interaction is crucial for efficient communication between asynchronous processes | A | Be able to program a basic client/server and/or P2P interface | Systems; Networking
  Big Data Technology Stack | N | | A | Understand the big data technology stack and its layered architecture. Be able to write code (e.g., in Python, R) using some of the tools that facilitate data storage, organization, management, and/or analysis within a big data stack | ParProg; DistSystems
  Hybrid | K | Know the notion of programming over multiple classes of machines simultaneously (CPU, GPU, TPU, etc.) | A | Be able to write correct programs using two programming paradigms, e.g., shared memory and GPU (OpenMP+CUDA), distributed and shared memory (MPI+OpenMP), or distributed memory and GPU (MPI+CUDA) | Systems; ParProg
By the control statement     
  Task/thread spawning | A | Be able to write correct programs with threads; synchronize (fork-join, producer/consumer, master/worker, etc.); use dynamic thread creation (in number and possibly recursively) (e.g., Pthreads, Java threads, etc.). Builds on the shared memory topic above. | | | CS2; DS/A
  Event-Driven Execution | K | Know about the need for event-driven execution and possible approaches to implementing it. Know about the notion of causality among events (e.g., remote file access, GUI). These effects may be easier to discuss in the context of distributed systems. | A | | CS2; DistSystems
  SPMD | C | Understand how an SPMD program is written and how it executes | | Be able to write an SPMD program and understand how it executes |
  SPMD notations | C | Know the existence of highly threaded data parallel notations (e.g., CUDA, OpenCL), message passing (e.g., MPI), and some others (e.g., Global Arrays, BSP library) | A | | CS2; DS/A; ParProg
  Data parallel | A | Be able to write a correct data-parallel program for shared-memory machines and get speedup; students should do an exercise. Understand the relation between different notations for data parallelism: array notations, SPMD, and parallel loops. Builds on the shared memory topic above. | | |
  Parallel loops for shared memory | A | Know, through an example, one way to implement parallel loops; understand collisions/dependencies across iterations (e.g., OpenMP, Intel’s TBB) | A | | CS2; DS/A; Lang
  Data parallel for distributed memory | N | | K | Know data parallel notations for distributed memory (e.g., UPC, Chapel, Co-Array Fortran) | ParProg
  MapReduce | K | Understand how problems can be solved by MapReduce, and how algorithms can be written using map and reduce | A | Solve problems using MapReduce, and write algorithms using map and reduce | CS2; Lang; ParProg
  Offloading to accelerators | K | Know about running parts of applications on accelerators (e.g., GPU, TPU, FPGA) | | | CS2; Systems
  Accelerator notations | N | | A | Be able to write an accelerator program that takes advantage of the hardware (e.g., CUDA, OpenACC, OpenMP 4.5 or above, OpenCL, TensorFlow) | ParProg
  Functional/logic languages | N | | K | Understand advantages and disadvantages of very different programming styles (e.g., Parallel Haskell, Parlog, Erlang) | ParProg
Semantics and correctness issues     
Tasks and threads | A | Be able to write parallel programs that create and assign work to threads/processes, in at least one parallel environment (e.g., OpenMP, Intel TBB, Pthreads, etc.) | A | | CS2; DS/A; Systems; Lang
Synchronization | A | Be able to write shared memory programs with critical regions and producer-consumer communication, and get speedup; know the notions of mechanisms for concurrency (monitors, semaphores, etc.) | | | CS2; DS/A; Systems
  Critical regions | A | Be able to write shared memory programs that use critical regions for synchronization | | |
  Producer-consumer | A | Be able to write shared memory programs that use the producer-consumer pattern to share data and synchronize threads | | |
  Handshaking | N | | A | Analyze handshaking protocols and derive performance bounds (e.g., response time in sliding window, TCP connection management) | Networking; DistSystems
Concurrency issues | C | Understand the notions of deadlock (detection, prevention), race conditions (definition), and determinacy/non-determinacy in parallel programs (e.g., if there is a race condition, the correctness of the output may depend on the order of execution) | | | DS/A; Systems
  Deadlock/Livelock | C | Understand what deadlock and livelock are, and methods for detecting and preventing them; also cast in terms of distributed systems | | |
  Starvation | C | Understand how starvation (of a thread or process) can occur, in the context of an example (e.g., dining philosophers) | | |
  Race condition | C | Know what a race condition is, and how to use synchronization to prevent it | | |
  Distributed Data Structures and Applications | N | | C | Understand synchronization in the context of data structures; correctness in a concurrent context | ParProg; DistSystems
  Tools to detect concurrency defects | N | | K | Know of the existence of tools to detect race conditions (e.g., Eraser) and debugging tools (e.g., Intel Parallel Toolkit) | ParProg
Memory models | N | | C | Know what a memory model is, and the implications of the difference between strict and relaxed models (performance vs. ease of use) | ParProg
  Sequential consistency | N | | | Understand the semantics of sequential consistency for shared memory programs |
  Relaxed consistency | N | | | Understand the semantics of one relaxed consistency model (e.g., release consistency) for shared memory programs |
  Consistency in distributed transactions | N | | C | Recognize consistency problems. Know that consistency is an issue in transactions issued concurrently by multiple agents. Implement transaction commit protocols in databases. | DB; DistSystems
Performance and Energy issues     
Computation | C | Understand the basic notions of static and dynamic scheduling, mapping, and the impact of load balancing on performance | | |
  Computation decomposition strategies | C | Understand different ways to assign computations to threads or processes | | | CS2; DS/A
  Owner computes rule | C | Understand how to assign loop iterations to threads based on which thread/process owns the data element(s) written in an iteration | | |
  Decomposition into atomic tasks | C | Understand how to decompose computations into tasks with communication only at the beginning and end of each task, and assign them to threads/processes | | |
  Decomposition into Map and Reduce tasks | K | Know that divide and conquer can be expressed and programmed as a Map and Reduce decomposition | A | Understand how to decompose computation using the MapReduce paradigm. Be able to reason about computation speedups from this decomposition and the communication tradeoffs resulting from reduction(s). Be able to produce code expressing this decomposition | CS2; ParProg
  Work stealing | N | | C | Understand one way to do dynamic assignment of computations | ParProg
  Offloading onto an accelerator | N | | C | Understand when it is worthwhile to decompose onto an accelerator (e.g., GPU, TPU, FPGA) | ParProg
  Program transformations | N | | C | Be able to perform simple loop transformations by hand, and understand how that impacts performance of the resulting code (e.g., loop fusion, fission, skewing, blocking) | Compilers; ParProg
  Load balancing | C | Understand the effects of load imbalances on performance, and ways to balance load across threads or processes | | | DS/A; Systems
  Scheduling and mapping | C | Understand the importance of a programmer, compiler, and/or runtime system mapping and scheduling computations to threads/processes, both statically and dynamically | A | Can apply a static and a dynamic strategy for mapping processes of large-scale parallel programs onto processors that optimizes performance through reduction of communication delays | DS/A; Systems; ParProg
  Effect of timing failures/delay in distributed systems | K | Understand that a failure in one node can cause a global failure in a distributed system. For example, one could use waiting on a non-terminating program to illustrate a failure scenario in distributed systems (e.g., in the context of consensus). | | | CS2
  Protocol timeout protections | N | | A | Understand the use of timeouts in situations with a high probability of error | Networking
Data | C | Understand the impact of data distribution, layout, and locality on performance; the notion that transfer of data has a fixed cost plus bit rate (irrespective of transfer from memory or inter-processor); know false sharing and its impact on performance (e.g., in a cyclic mapping in a parallel loop; in ParProg) | | | DS/A; Lang; ParProg
  Data distribution | N | | C | Understand what block, cyclic, and block-cyclic data distributions are, and what it means to distribute data across multiple threads/processes | ParProg
  Data layout | C | Know how to lay out data in memory to get improved performance and energy (memory hierarchy in a shared memory parallel system) | | | DS/A; Systems
  False sharing | K | Know that for cache-coherent shared memory systems, data is kept coherent in blocks, not individual words, and how to avoid false sharing across threads of the data in a block | C | Be aware of false sharing, able to give examples where it occurs, and understand why it happens | Systems; ParProg
  Energy impact | K | Know the energy cost of loading data into memory from secondary storage (and writing out modified data to secondary storage) | | | Systems
  Data Representation     
  Floating point and integer precision (64-bit, 32-bit, and 16-bit or less) | K | Know the power savings from using smaller data representations (64-bit vs. 32-bit vs. 16-bit floating-point precision, 32-bit vs. 16-bit integers). For example, machine learning on GPUs is driving lower (16-bit) floating-point precision. | | | Systems
  Data locality | K | Know what spatial and temporal locality are, and how to organize data to take advantage of them | | | DS/A; Systems
  Performance impact of data movement | K | Know the performance cost of moving data to secondary storage for big data analysis, and of distributed vs. centralized analysis. Be aware of specific mechanisms that take advantage of locality (e.g., in-situ processing) | C | Understand performance costs of data locality with respect to various metrics, and be able to contrast mechanisms like in-situ vs. in-transit vs. offline processing | Systems; ParProg; DistSystems
  Structured vs. unstructured data | K | Know the differences and tradeoffs between these data representations | A | Be able to build solutions for the different types of data | DS/A; Databases
  Graph representations and databases | K | Know of graph representations of data to facilitate graph analysis | C | Understand performance gains due to graph-specific mechanisms versus other more general data representations | DS/A; Databases
  Data handling and manipulation | N | | C | Understand the differences and performance impact of data formatting and storage mechanisms. Be able to implement adequate solutions for the different types of storage | DistSystems; Databases
  Distributed databases | N | | C | Comprehend the principles behind distributed databases and the motivation for tradeoffs to support scalability | DistSystems; Databases
  NoSQL databases | N | | A | Comprehend how NoSQL databases enable scalable data manipulation; include exemplars to become familiar with some of them (e.g., MongoDB, Hive) | Databases
  Eventual consistency vs. ACID | N | | C | Comprehend how replication enables scalability in databases, but transactions lose their ACID properties | Databases
  Distributed file systems | K | Be aware of the existence of distributed file systems and common examples of where they are used and why | K | Comprehend the basic principles of how distributed file systems work, their scalability benefits, and performance and reliability problems | Systems; Operating Systems; DistSystems; ParProg
  Replicated file systems | N | | C | Know of distributed file systems such as HDFS and their replication and fault tolerance mechanisms. In ParProg, be able to use a DFS. | DistSystems; ParProg
  Key-value storage systems | N | | K | Understand the concept of key-value storage. In ParProg, be able to use an appropriate API. | DistSystems; ParProg
Tools and metrics     
  Performance monitoring tools | K | Know of tools for runtime monitoring (e.g., gprof, perf, Intel Performance Toolkit, TAU) | | | DS/A; Systems
  Performance metrics | C | Know the basic definitions of performance metrics (speedup, efficiency, work, cost) and Amdahl’s law; know the notion of scalability | | | CS2; DS/A
  Speedup | C | Understand how to compute speedup, and what it means | | |
  Efficiency | C | Understand how to compute efficiency, and why it matters | | |
  Parallel Scalability | C | Understand that speedup and efficiency are single points of measure for a particular problem size and number of processes/threads, and that these metrics change as problem size and/or the number of processes/threads vary. Understand that scalability is a metric that measures how speedup varies as problem size and/or the number of processes/threads vary. | | |
  Amdahl’s law | C | Understand that speedup is limited by the sequential portion of a parallel program, if problem size is kept fixed | | |
  Gustafson’s Law | K | Know the idea of weak scaling, where problem size increases as the number of processes/threads increases | | |
Power/Energy efficiency     
  Power-latency tradeoff | K | Be familiar with the notion that problem decompositions (including their granularity) and active/idle states (e.g., including modulation of CPU frequency) may be exploited to adjust the balance among throughput, latency, and energy consumption. | | | Systems
  Energy efficiency vs. load balancing | K | Be aware that unbalanced work decomposition and communication congestion can prolong computation and reduce energy efficiency. | | | Systems
  Active power management methods | N | | A | Be aware that systems expose various execution parameters (e.g., P-states on Intel) and have examined the effect of modulating at least one to optimize performance or energy consumption. | ParProg; Arch2
  Idle power management methods | N | | A | Be aware that architectures and OSs provide various interfaces that enable computational units to be passivated (e.g., C-states), and have examined tradeoffs (e.g., reduced subsystem power consumption vs. increased latency) involved in exploiting at least one of them. | ParProg; Arch2
  Power consumption of parallel programs | N | | K | Be aware that optimal energy efficiency may not be achieved through aggressive reduction of CPU clock frequencies and exploitation of sleep modes, due to increased execution time and static components of system power consumption. | ParProg
Security     
  Security protocols | N | | A | Understand IPsec basics | Networking
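The Critical regions and Race condition rows in the table above can be illustrated with a minimal Python sketch; the function names are ours, and the lock-free failure mode is described in comments rather than demonstrated, since its outcome is nondeterministic:

```python
import threading


def add_with_lock(state, lock, n):
    """Increment a shared counter n times inside a critical region."""
    for _ in range(n):
        with lock:                 # critical region: one thread at a time
            state["count"] += 1
    # Without the lock, "+= 1" is a read-modify-write: two threads can read
    # the same value and one update is lost -- a classic race condition.


def run_counter(num_threads=4, per_thread=10000):
    state = {"count": 0}
    lock = threading.Lock()
    threads = [threading.Thread(target=add_with_lock,
                                args=(state, lock, per_thread))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["count"]          # always num_threads * per_thread
```

Removing the `with lock:` line turns this into a hands-on race-condition demonstration, and is also a natural input for the race-detection tools listed in the table (e.g., Eraser).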