Ch. 3.3: Memory Hierarchy: Capacity and Speed
Data is stored in a hierarchy of locations. Each level of storage tends to be larger but slower the further it sits from the CPU. While some of the concerns laid out in this section are not fundamental to writing parallel programs, being aware of the performance impacts described here can significantly improve code efficiency.
Disk storage has the largest capacity, followed by Random Access Memory (RAM), followed in turn by the CPU caches. Modern processors have multiple levels of cache, typically L1, L2 and L3. L1 is the smallest and fastest cache and sits closest to the CPU; L2 is larger and slower; L3 is larger and slower still. In the workshop example above, the workers (threads) at one workbench (process) can communicate quickly with each other. Similarly in computing, the threads of a process communicate through RAM and shared caches. This is why it is referred to as ‘shared-memory’ programming: the threads communicate via memory that they all have access to.
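As a minimal sketch of shared-memory threading (written here in C with POSIX threads; the workshop itself may use a different language, and the names are illustrative), the program below has two threads fill different halves of a single array. The array lives in the process's RAM, so nothing is copied between the threads:

    /* Sketch: threads of one process sharing memory. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 8
    static int shared[N];            /* lives in RAM, visible to all threads */

    static void *worker(void *arg) {
        int half = *(int *)arg;      /* 0 -> first half, 1 -> second half */
        for (int i = half * N / 2; i < (half + 1) * N / 2; i++)
            shared[i] = i * i;       /* each thread fills its own slice */
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        int ids[2] = {0, 1};
        for (int k = 0; k < 2; k++)
            pthread_create(&t[k], NULL, worker, &ids[k]);
        for (int k = 0; k < 2; k++)
            pthread_join(t[k], NULL);
        for (int i = 0; i < N; i++)
            printf("%d ", shared[i]); /* main thread sees both halves */
        printf("\n");
        return 0;
    }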
To move data from disk to RAM, code explicitly opens a file on disk and reads its contents into a variable, which is stored in RAM. On a personal computer, when RAM is used up, a swap file on disk may automatically be used to provide extra storage. When the computer must frequently access data in swap, it slows down significantly; this is called thrashing. In a cluster job (like those run on Digital Alliance servers), however, a fixed amount of RAM is requested up front, and if the job uses up that RAM it is cancelled rather than allowed to swap.
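A small C sketch of this explicit disk-to-RAM step (the file name data.bin is a placeholder, and error handling is kept minimal):

    /* Sketch: explicitly moving a file's contents from disk into RAM. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        FILE *f = fopen("data.bin", "rb");   /* data is still on disk */
        if (!f) { perror("fopen"); return 1; }

        fseek(f, 0, SEEK_END);
        long size = ftell(f);                /* how many bytes to read */
        rewind(f);

        char *buf = malloc(size);            /* RAM is allocated here */
        if (buf && fread(buf, 1, size, f) == (size_t)size) {
            /* buf now holds a copy of the file's contents in RAM */
        }
        free(buf);
        fclose(f);
        return 0;
    }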
In hardware, the CPU itself moves data from RAM into the CPU caches; this is not controlled explicitly by code or software. A cache miss occurs when a piece of data is needed but is not already in the CPU cache, which causes the CPU to load the data from RAM into the cache. Although caching is transparent to the programmer, how the cache is used has a significant performance impact, so being mindful of memory access patterns can affect code performance. For example, multi-dimensional arrays are stored in either column-major or row-major order. If code iterates through each element of an array, following the programming language’s array ordering can significantly reduce cache misses and thus improve performance.
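To make the access-pattern point concrete, here is a C sketch (C stores arrays in row-major order, so a[i][j] and a[i][j+1] are adjacent in memory). The first loop nest walks memory contiguously; the second strides a full row between accesses and misses the cache far more often:

    /* Sketch: loop order and cache misses in a row-major language. */
    #include <stdio.h>

    #define ROWS 2048
    #define COLS 2048
    static double a[ROWS][COLS];

    int main(void) {
        double sum = 0.0;

        /* Cache-friendly: the inner loop walks consecutive memory,
           so each cache line fetched from RAM is fully used. */
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                sum += a[i][j];

        /* Cache-hostile: swapping the loops jumps COLS * 8 bytes
           between accesses, so nearly every access misses the cache. */
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }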
Another, more advanced issue in parallel programming arises when the same data is loaded into cache on multiple CPUs. If one CPU updates the data in that cache line, the hardware must notify all other CPUs that their copy of the cache line is no longer valid, so on next use it must be fetched from RAM again; when the CPUs are in fact using different variables that merely happen to share a line, this is known as false sharing. Performance can be improved by arranging memory access patterns for data-parallel operations so that each CPU accesses data spaced far enough apart to fall on a different cache line.
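One common remedy is to pad each thread's data out to a full cache line so that no two CPUs ever share one. The C sketch below assumes a 64-byte line, which is typical on current hardware but not guaranteed:

    /* Sketch: avoiding false sharing by padding per-thread counters. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define CACHE_LINE 64       /* assumed line size, typical but not universal */

    /* Without the pad, all four counters would sit on one cache line and
       every increment would invalidate the other CPUs' copies of it. */
    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];
    };

    static struct padded_counter counters[NTHREADS];

    static void *work(void *arg) {
        int id = *(int *)arg;
        for (long i = 0; i < 10000000; i++)
            counters[id].value++;    /* each thread owns its own cache line */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        int ids[NTHREADS];
        long total = 0;
        for (int k = 0; k < NTHREADS; k++) {
            ids[k] = k;
            pthread_create(&t[k], NULL, work, &ids[k]);
        }
        for (int k = 0; k < NTHREADS; k++) {
            pthread_join(t[k], NULL);
            total += counters[k].value;
        }
        printf("total = %ld\n", total);
        return 0;
    }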