April 26, 2015
Multicore Pt. 2
Talk about unbalanced… Part 1 was about 300 lines, this part 2 is 700 lines… I am exhausted OTL
Some have been translated from Korean as I struggle through my lecture notes so remember to cross-reference.
Multicore Computing Pt 2
Chapter 8 - Cache and Virtual Memory
Principle of Locality
- Temporal Locality
- Reuse of data/instructions in the near future
- Data: Same variable access in each iteration in for loop
- Instructions: Cycling through for loop
- Spatial Locality
- Data/instructions in locations near to each other likely to be used together
- Data: Array elements accessed in succession
- Instructions: Referenced in sequence
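A tiny C illustration of both kinds of locality (my own example):

    #include <stdio.h>

    int main(void) {
        int a[1024];
        int sum = 0;                      /* 'sum' reused every iteration: temporal locality */
        for (int i = 0; i < 1024; i++)    /* loop body instructions re-executed: temporal locality */
            a[i] = i;
        for (int i = 0; i < 1024; i++)
            sum += a[i];                  /* successive elements of 'a': spatial locality */
        printf("%d\n", sum);
        return 0;
    }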
Memory Hierarchies
- Hierarchical arrangement of storage to exploit locality of reference
- Trade-off between speed, capacity and cost
Caching
- Exploit temporal and spatial locality
- Cache block = cache line
- Basic unit of cache storage
- Multiple bytes/words (Intel Haswell: 64 bytes/line)
- Cache hit: block requested and found in cache
- Cache miss: block requested but not in cache, requires fetching from memory
Cache Organization
- Set: collection of cache locations in which a fetched block can be placed
- Addressing a word: the address is split into a Tag (identifies the block of data in main memory), a Set index (selects the set) and an Offset (selects the word within the block)
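A worked example of the split (my own numbers: a 32 KB, 8-way set-associative cache with 64-byte lines and 32-bit addresses; the 64-byte line matches the Haswell figure above). 64-byte lines give 6 offset bits; 32 KB / (64 B × 8 ways) = 64 sets gives 6 index bits; the remaining 20 bits form the tag:

    #include <stdio.h>

    int main(void) {
        unsigned addr   = 0x12345678u;          /* example address */
        unsigned offset = addr & 0x3F;          /* bits [5:0]: byte within the 64-byte line */
        unsigned set    = (addr >> 6) & 0x3F;   /* bits [11:6]: one of 64 sets */
        unsigned tag    = addr >> 12;           /* bits [31:12]: compared against stored tags */
        printf("tag=0x%x set=%u offset=%u\n", tag, set, offset);
        return 0;
    }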
Direct-Mapped Caches
- One cache line per set
- Block from main memory can only be placed in one location in cache
- Simple replacement
- Collisions between blocks for the same cache line can occur
Set-Associative Caches
- Data block can be placed in a few locations in the cache
- Requires complex replacement policy
- Requires complex tag comparison hardware on lines in a set
- Fewer collisions between blocks mapping to the same set than with a direct-mapped cache
Fully Associative Caches
- Only one set
- Data block can be in any place in the cache
- Requires complex tag comparison hardware on lines in a set
- Fewer collisions between blocks mapping to the same set than with a set-associative cache
Types of Cache Misses
- Cold (compulsory) miss: when cache is empty
- Conflict miss: when cache is large enough but multiple data items map to the same cache line
- Capacity miss: when working set is larger than cache size
Replacement Policies
- Least Recently Used (LRU)
- Replace block that has not been used for the longest time
- Need to maintain LRU statistics for each cache line in a set
- Costly, time consuming read/modify/write cycle to maintain set state on a cache access
- Pseudo LRU
- Binary decision tree to keep track of LRU statistics
- One write cycle to update stats on a hit
- One read cycle during line replacement
- First in, First out (FIFO)
- Replace block that has been in set for longest time
- Random
Write Policies
- A cache combines one of the following policies for hits with one of the following policies for misses
On cache hit
- Write through
- Both block in cache and lower level memory are modified
- Simple to implement
- Lower level memory always consistent with cache
- Write back
- Block in cache is modified, only written back to lower level memory when replaced, uses dirty bit
- Harder to implement
- Lower level memory not always consistent with cache
On cache miss
- Write allocate
- Block is loaded into the cache
- No write allocate
- Block is directly modified in lower level memory and not loaded into cache
Non-blocking/Lockup-free caches
- Continues to supply cache hits during a miss instead of being blocked waiting for lower level memory
- Reduces effective miss penalty
- Optional: support multiple outstanding misses by maintaining state for each outstanding miss
Cache Performance Metrics
- Miss rate
- Fraction of memory references not found in cache
- Hit time
- Time required to deliver a line in cache to the processor
- Includes time to determine if line is in cache
- Miss penalty
- Additional time required due to miss
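These metrics are commonly combined into the average memory access time, AMAT = hit time + miss rate × miss penalty. For example (numbers are my own illustration), a 1 ns hit time, a 5% miss rate and a 100 ns miss penalty give AMAT = 1 + 0.05 × 100 = 6 ns.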
Virtual Memory
- Abstraction of main memory by operating system
- Provides each process with large and uniform address space
- Size of address space larger than main memory
- Protect address space of each process from corruption by other processes
- Main memory treated as cache of permanent secondary storage eg. hard disk
Pages
- Each byte in main memory has unique physical address
- CPU generates virtual address to access main memory
- Memory management unit (MMU) translates virtual address to physical address using look-up table (page table) stored in main memory
Page Tables
- Array of page table entries
- Includes valid bit and n-bit address field (physical page frame number)
- OS (page fault handler) maintains the contents of the page table and transfers pages between main memory and secondary storage
- Swapping (paging): activity of transferring pages between secondary storage and main memory
- Demand paging: wait until the last moment to swap in a page when a miss (page fault) occurs
Address Translation
- 3 types of virtual pages
- Unallocated: pages not yet allocated by the VM system, taking up no space on secondary storage
- Cached: allocated pages currently cached in main memory
- Uncached: allocated pages not cached in main memory (residing in secondary storage)
Page Replacement Policies
- LRU
- FIFO
- Second chance
- Clock
- Maintains a circular list of pages in memory, tracking when each page was loaded and whether it has been referenced
- Exploits temporal locality (approximates LRU)
Translation Look-aside Buffer (TLB)
- High overhead when MMU refers to page table for address translation since page table is stored in main memory
- Small, virtually addressed cache with each line holding a block of one page table entry
- Micro-TLB
- Small TLB placed over main TLB to boost speed of address translation for cache accesses
- Main TLB handles micro-TLB misses
Caches and Virtual Memory
- Virtually addressed cache: faster (no address translation) but security issues (OS needs to flush cache on context switch)
- Physically addressed cache: slower but no security issues (no OS intervention)
- Physically indexed, physically tagged
- No OS intervention
- TLB in critical path
- Physically indexed, virtually tagged
- Never used
- No OS intervention
- TLB in critical path
- Virtually indexed, physically tagged
- Common in real world
- Address translation at same time as cache indexing
- TLB not in critical path
- Much faster than physically indexed caches
- Virtually indexed, virtually tagged
- Address translation occurs on cache miss
- TLB (address translation) not in critical path
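A sizing note on virtually indexed, physically tagged caches (my own addition): for the index to be taken entirely from the page offset, set index bits + block offset bits must fit within the page offset. With 4 KB pages (12 bits) and 64-byte lines (6 bits), at most 6 index bits, ie. 64 sets, remain, which is why a 32 KB L1 cache is commonly 8-way set-associative (32 KB / 64 B / 64 sets = 8 ways).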
=======
Chapter 9 - Parallelism
Types of Parallelism
Task Parallelism
- Performing distinct tasks at the same time
- Dividing an application into these separate tasks
Data Parallelism
- Loop-level parallelism
- Performing the same operation on different items of data at the same time, as in a for loop operating on data in an array
SIMD
- Single Instruction Multiple Data
- Performing same operation on multiple data simultaneously with a single instruction
- Single program counter
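A tiny C illustration of loop-level data parallelism (my own example; with OpenMP the loop could be annotated with #pragma omp parallel for, and a vectorizing compiler can map it onto SIMD instructions):

    #include <stdio.h>
    #define N 8

    int main(void) {
        float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { b[i] = (float)i; c[i] = 2.0f * i; }

        /* Each iteration is independent: the same operation is applied to
           different data elements, so iterations can run in parallel or be
           packed into SIMD lanes. */
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c[i];

        printf("%f\n", a[N - 1]);
        return 0;
    }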
=====
Chapter 10 - Multithreaded Architecture, Cache Coherence and GPU Architecture
Multithreaded Processors
- Issues instructions from multiple threads of control for the pipeline
- To guarantee no dependences between instructions in a pipeline
- Exploit ILP from multiple threads executing at the same time
Thread-level Parallelism (TLP)
- TLP represented by use of multiple threads of execution
- To improve throughput
- More cost-effective to exploit than ILP
Superscalar Processors
- Aim to issue an instruction on every clock cycle
- Issue-width: max number of instructions that can be issued by a processor simultaneously
- When hardware can issue n instructions on every cycle, the processor has n issue slots and is called an n-issue processor
- Inefficiency
- Vertical waste: no instruction is issued in any slot for some cycles
- Hidden by scheduling other threads to cover long-latency operations
- Fine-grained multithreading: switch thread contexts every cycle
- Coarse-grained multithreading: switch thread contexts every few cycles, on data hazards, cache misses etc.
- Horizontal waste: in a particular cycle, not all issue slots are used
- Simultaneous Multi-threading (SMT)
- Selects instructions for execution from all threads on each cycle to remove both horizontal and vertical waste
Cache Coherence
- Private caches in a multiprocessor system results in copies of a variable present in multiple places
- Write by one processor may not become visible to others
- Coherence problems between I/O devices and processor caches
- Solutions
- Software-based
- Compiler/runtime
- Perfect memory access information needed
- Hardware-based
- More common
- Requests for data always return most recent value
- Coherence misses due to invalidations
- Shared caches do not have coherence problem
- Snooping: each cache controller snoops every address placed on the bus; on a memory read request, if it holds a dirty copy of the requested block it supplies the block to the requester and the memory access is aborted
- Write-through caches: every processor write updates the local cache and also generates a write on the shared bus, so other caches can observe it
- False Sharing: different cores repeatedly writing to different locations in the same cache block cause the block to be repeatedly invalidated and bounce between caches; can be mitigated in software, eg. by padding data structures (sketched below)
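A minimal Pthreads sketch of false sharing (my own example): the two counters very likely sit in the same 64-byte cache block, so each thread's writes keep invalidating the other's copy; padding the struct would place them in separate blocks.

    #include <pthread.h>
    #include <stdio.h>

    struct { long a; long b; } shared;            /* a and b likely share a cache line */
    /* struct { long a; char pad[56]; long b; } shared;   <- padded version avoids it */

    void *inc_a(void *arg) { (void)arg; for (long i = 0; i < 10000000; i++) shared.a++; return NULL; }
    void *inc_b(void *arg) { (void)arg; for (long i = 0; i < 10000000; i++) shared.b++; return NULL; }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, inc_a, NULL);   /* each thread writes a different variable */
        pthread_create(&t2, NULL, inc_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld %ld\n", shared.a, shared.b);
        return 0;
    }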
GPU Architecture
- Many simple compute units
- Many threads, no synchronization between threads for graphics applications
Shader
- Program that runs on GPU to execute one of the programmable stages in rendering pipeline
- Shader cores are simple, programmable
- Optimized for a single instruction stream
- Graphics pipeline stages implemented by using the shader cores several times
Characteristics of GPU Applications
- Data independence between triangles/pixels
- Millions of triangles/pixels
- Massively parallel processing possible
Exploiting Massive Parallelism
- SIMD processing across many ALUs
- Single program counter across these multiple ALUs
- SIMT: Single Instruction Multiple Thread
- Warp: group of instruction execution instances, all threads in warp execute same instruction at a time, unit of context switch
- Uses hardware context switching to avoid stalls caused by high latency operations
- Branches handled automatically by hardware via conditional (predicated) execution within a warp
Multiple Streaming Multiprocessors
- Different streaming multiprocessors execute different shaders
- Simultaneous instruction streams
==========
Chapter 11 - Memory Consistency
Memory Consistency
- Memory should respect order between accesses to different locations in a thread
- Reordering instructions may mess this up
- Coherence does not help since it is for a single location
Memory Consistency Models
- To specify constraints on the order in which memory operations from any processor can appear to execute w.r.t. one another
- Balance between programming complexity and performance
Sequential Consistency
- Observable outcome of a multi-threaded program should be the same as if a single processor executed all operations one at a time in some interleaving, with each thread's operations in program order
- Defines a total order between operations, assuming each operation is atomic
- Inefficient with hardware implementation
- Program Order
- Order in which operations appear in source code
- At most one memory operation per instruction
- Not the same as order presented to hardware by compiler
Relaxed Memory Consistency Model
- Relaxation in program order, only need to preserve dependences
- Difficult for the programmer
- Memory fence/barrier instructions prevent reordering of instructions
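A C11 sketch of why fences matter under relaxed models (my own example, the classic store-buffering pattern): without the fences, both r1 == 0 and r2 == 0 can be observed because each store may be reordered after the following load; the seq_cst fences rule that outcome out.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int x = 0, y = 0;
    int r1, r2;

    void *t1(void *arg) {
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);           /* keep the store before the load */
        r1 = atomic_load_explicit(&y, memory_order_relaxed);
        return NULL;
    }

    void *t2(void *arg) {
        (void)arg;
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL); pthread_join(b, NULL);
        printf("r1=%d r2=%d\n", r1, r2);                     /* with fences, never both 0 */
        return 0;
    }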
Processor Consistency
- Before a read is allowed to perform w.r.t. any other processor, all previous reads must be performed
- Before a write is allowed to perform w.r.t. any other processor, all previous accesses must be performed
- Relaxing write->read order
- Allow read to complete before an earlier write in program order
- Write buffer with read bypassing
Weak Ordering
- Relax all program orders
- No program orders are guaranteed by underlying hardware or compiler optimizations
- Only synchronization operations
- Need to distinguish between ordinary read/writes and synchronizations
- Before r/w allowed to perform w.r.t. any processor, all previous sync operations must be performed w.r.t. every processor and vice-versa
- Sync operations obey sequential consistency
- Multiple r/w requests outstanding at the same time to hide r/w latency
- Ordering w.r.t. sync must be guaranteed
Release Consistency
- Relax all program orders but not w.r.t. sync operations
- Two separate sync operations
- Acquire: a read operation such as lock
- Release: a write operation such as unlock
- Need to distinguish between ordinary r/w and sync and between acquire and release operations
- Before an ordinary r/w is allowed to perform w.r.t. any processor, all previous acquire operations must be performed w.r.t. every processor; conversely, before a release is allowed to perform, all previous ordinary r/w must be performed
- Acquire and release operations are sequentially consistent
- Specific memory access ordering w.r.t. acquire/release must be guaranteed
- Memory access inside and outside of the critical section do not delay each other
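A short C11 release/acquire sketch (my own example): once the consumer observes flag == 1 through the acquire load, it is guaranteed to see data == 42 written before the release store, while accesses outside this pairing remain free to be reordered.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    int data = 0;
    atomic_int flag = 0;

    void *producer(void *arg) {
        (void)arg;
        data = 42;                                              /* ordinary write */
        atomic_store_explicit(&flag, 1, memory_order_release);  /* release: prior writes made visible */
        return NULL;
    }

    void *consumer(void *arg) {
        (void)arg;
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;                                                   /* acquire: later reads see those writes */
        printf("%d\n", data);                                   /* prints 42 */
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL); pthread_join(c, NULL);
        return 0;
    }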
===========
Chapter 12 - Process and Thread
Process
- Instance of a computer program being executed
- Stream of instructions being executed
- Abstraction used by OS
- Consists of
- Registers
- Memory (code, data, stack, heap etc.)
- I/O status (open file tables etc.)
- Signal management info
Supervisor Mode vs. User Mode
- Modern processors have two different modes of execution
- Supervisor (kernel) mode
- Any instruction can be executed
- For the OS kernel
- eg. I/O instructions
- User mode
- Only a subset of instructions allowed to execute
- For any other software other than kernel
- Supervisor (kernel) mode
System Calls
- Program running in user mode requesting a service from the OS
- Implemented with a trap/exception/fault
- Triggers switch to kernel mode to perform requested action
Uniprogramming vs. Multiprogramming
- Uniprogramming
- Only one process at a time
- Poor resource utilization
- eg. DOS
- Multiprogramming
- Multiple processes at a time
- Increased resource utilization
- e.g modern OSes like Windows, Linux etc.
Virtual Memory
- OS abstraction of physical memory
- Provides each process with illusion of exclusive use of memory and a larger memory space than is actually available
- Logical (virtual) address
- Address generated by CPU
- Logical address space - set of all logical addresses generated by a program
- Physical address
- Address used by physical memory
- Physical address space - set of all physical addresses corresponding to the logical addresses
- Virtual memory in OS in charge of run-time mapping from virtual to physical addresses using MMU for fast translation
Address Space of a Process
- Each process has its own private (virtual) address space
- Process cannot affect state of another process directly
- Memory protection
- Kernel address space vs. user address space
Communication between Processes
- Coordination between processes through r/w to location in shared address space
- Also through Inter-Process Communication (IPC) mechanism
- Exchange data without sharing virtual address space
- Expensive
Concurrency
- Computer system is typically multiprogramming
- Concurrently, kernel processes execute system code while user processes execute user code
- Instructions of one process interleaved with those of another process
- Using context switches
Context Switch
- CPU scheduler in OS switches between processes
- Context is info allowing the continuation of a process later
Process State Transitions
- Running: using CPU
- Ready: no CPU available
- Blocked: waiting for event eg. I/O to occur
Preemptive vs. Cooperative
- Preemptive multitasking
- Permits preemption of tasks
- All processes will get some CPU time
- More reliably guarantees each process a slice of operating time
- Supported by nearly all modern OSes
- Cooperative multitasking
- Tasks explicitly programmed to yield when they do not need system resources (eg. CPU)
- Rarely used
Threads
- Independent Fetch/Decode/Execute loop
- Smallest unit of processing that can be scheduled by OS
- Consists of
- Code
- Registers
- Stack
- Thread-local data
- User-level thread vs. kernel-level thread
- Threads contained in a process
- Multiple threads can exist in the same process
- Shares resources with other threads
- Code
- Data
- OS resources: open files, signals etc.
Communication between Threads
- Multiple threads within a process share portions of virtual address space of process by default eg. code and data sections
- Coordination between threads in same process through r/w variables allocated in shared space
- Writes to shared address visible to every thread
Thread Library
- Provide API for creating and managing threads
- User-level library entirely in user space with no kernel support
- Threads occur in user space
- Threads managed by runtime library
- Kernel-level library supported directly by OS
- Each thread has its own execution context
- Threads managed by OS
- POSIX Pthread
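A minimal Pthreads sketch (my own example) of two threads in one process coordinating through shared global data:

    #include <pthread.h>
    #include <stdio.h>

    int shared = 0;                                  /* data section, shared by all threads */
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    void *worker(void *arg) {
        pthread_mutex_lock(&m);                      /* coordinate through shared memory */
        shared += *(int *)arg;
        pthread_mutex_unlock(&m);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        int a = 1, b = 2;
        pthread_create(&t1, NULL, worker, &a);       /* each thread gets its own stack and registers */
        pthread_create(&t2, NULL, worker, &b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("shared = %d\n", shared);             /* 3 */
        return 0;
    }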
Linux Schedulers
- Completely Fair Scheduler (CFS)
- No distinction between processes and threads when scheduling
- To maintain fairness
- Run-queue for each processor containing processes whose state is ‘ready’
- Nice values used to weight processes
Time Slice
- Time interval a process can run for without being preempted
- Proportional to processes’ weight
Virtual Runtime
- Measure for amount of time provided to a process
- The smaller a process’ virtual runtime, the higher its need for the processor
- Cumulative execution time inversely scaled by its weight
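As a rough formula (my own summary, worth double-checking against the kernel source), CFS updates this as vruntime += delta_exec × (weight of nice 0 / weight of the task), so heavier-weighted tasks accumulate virtual runtime more slowly and are therefore picked more often.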
Red-black Tree
- Self-balanced tree
- No path in the tree will ever be more than twice as long as any other
- Operations on the tree occur in O(log n) where n is number of nodes in the tree
- CFS maintains red-black tree for run-queue for each processor
Scheduling Domains
- Set of processors whose workloads should be kept balanced by kernel
- Partitioned in one or more groups
- Hierarchically organized: top scheduling domain is the set of all processors in the system
Run-queue Balancing
- Perform load balancing on each re-balancing tick using push migration
- Check hierarchically if a scheduling domain is significantly unbalanced
- Find busiest run-queue in the domain and migrate processes to another
Push vs. Pull
- Push migration
- Specific process periodically checks load on each processor
- Re-distributes by moving/pushing processes to less-busy processors
- Pull migration
- Idle processors pull waiting processes from a busy processor
- Linux scheduler implements both techniques
Negative Aspect of Process Migration
- New processor’s cache is cold for the migrated process
- Needs to pull its data into the cache
===========
Chapter 13 - High Performance Computing System Architecture and Parallelization Patterns
Symmetric Multiprocessing
- Two or more identical processors connect to a single, shared main memory, have full access to all I/O devices, and are controlled by a single operating system instance that treats all processors equally
- Architecture of most modern multiprocessor systems
Cluster
- Nodes comprising processors and main memory
- Many nodes connected together using an Interconnection Network
- Architecture of many High Performance Computing systems
Interconnection Network
- Connects nodes in a cluster together
- Commonly using Ethernet or InfiniBand
- HPC usually using 10-gigabit Ethernet
- InfiniBand used by HPC and data centers
Advantages of Clusters
- High performance utilizing existing systems
- Fault tolerance since system continues working even with a malfunctioning node
- Scalability
- Centralized management
Heterogeneous Clusters
- Each node comprised of CPU and GPU/co-processor
Massively Parallel Processing
- Large number of processors being used to execute one program
- MPP aims to effectively exploit parallelism of thousands of nodes in clusters
Parallelism Patterns
- Control dependence: eg. an if statement
- Data dependence: flow, anti, output dependences
- Loop-carried dependence: dependence exists across iterations ie. if loop is removed dependence no longer exists
- Loop-independent dependence: dependence exists within an iteration ie. if loop is removed, dependence still exists
Conditions for Parallelism
- Reordering of instructions produces the same result as the original program
Matrix and Vector Multiplication
- Good problem for parallelization
- Few dependences
- Nested for loops
Map
- Simple operation applied to all elements in a sequence
- Split a task into many independent sub-tasks which require no synchronization/communication between each sub-task
Reduction
- A summary operation, the opposite of map
- Combines many sub-results into one final result
- eg. add, multiply, min, max, count etc.
MapReduce
- A Map operation followed by a Reduction
- Exploits parallelism in sub-tasks
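A small C sketch of a map followed by a reduction (my own example, using OpenMP pragmas; the code also runs sequentially if they are ignored):

    #include <stdio.h>
    #define N 1000000

    int main(void) {
        static double x[N];
        double sum = 0.0;

        /* Map: independent per-element operation */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            x[i] = (double)i * 0.5;

        /* Reduce: combine per-element results into one value */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += x[i];

        printf("%f\n", sum);
        return 0;
    }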
==========
Chapter 14 - Synchronization and Parallelization Considerations
Need for Synchronization
- Maintain sequence between processes and threads
- Achieve mutual exclusion
Barrier
- Barrier halts execution of a thread until all specified threads have reached the barrier
- Implemented in code
Race Condition
- When two or more threads access the same resource concurrently and at least one access is a write, eg. r/w to the same variable
- Result is non-deterministic (different runs, different results) and depends on relative timing between interfering threads
- Solutions
- Busy-wait: One thread keeps checking if the other thread has left the critical section before proceeding
- Lock-free algorithms
Thread Safety
- Multiple threads can execute the same piece of code without conflicts
- Thread-safe library
Atomicity
- Operation is atomic if either the entire operation succeeds or it entirely fails
Mutual Exclusion
- No two threads are in their critical section at the same time
- Concurrency control to prevent race conditions
Lock
- Atomic operation
- Lock variable with two states, locked and unlocked
- Thread wanting to enter critical section has to set lock variable to lock, when leaving unlock
- Thread not able to set lock variable to lock has to wait
Spinlocks
- Thread requesting lock spins constantly checking the lock until it is released
- Thread proceeds as soon as lock is released
- Saves time for locks held for short time since no context switching
- Wastes CPU cycles spinning
- Not useful on uniprocessor systems (while a thread spins, the lock holder cannot run)
- Uses test_and_set()
- Tests a memory location L for false
- If L == false, set to true, return false
- If L == true, return true
- Uses two nested while loops (test-and-test-and-set): the inner loop only reads the lock variable, so spinning stays in the local cache and avoids the performance degradation of frequent memory/bus accesses (see the sketch below)
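A C11 sketch of such a test-and-test-and-set spinlock (my own example, using atomic_exchange in place of a hardware test_and_set instruction):

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int lock_var = 0;                            /* 0 = unlocked, 1 = locked */
    long counter = 0;

    void spin_lock(void) {
        while (atomic_exchange_explicit(&lock_var, 1, memory_order_acquire)) {
            /* Inner loop: only read, so waiting cores spin in their own cache */
            while (atomic_load_explicit(&lock_var, memory_order_relaxed))
                ;
        }
    }

    void spin_unlock(void) {
        atomic_store_explicit(&lock_var, 0, memory_order_release);
    }

    void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            spin_lock();
            counter++;                                  /* critical section */
            spin_unlock();
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL); pthread_join(t2, NULL);
        printf("%ld\n", counter);                       /* 200000 */
        return 0;
    }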
Sleeplocks
- Thread requesting lock is blocked and put back on ready queue once lock is released
- Can be used on uniprocessor
- Saves CPU time on locks held for long time
Pros and Cons of Locks
- Easy to understand and use
- Is expensive
- Can result in
- Deadlock
- Priority Inversion
Priority Inversion
- Only one processor
- A lower-priority thread holding a lock is preempted by a higher-priority thread; if the higher-priority thread then busy-waits for that lock, the suspended lower-priority thread never runs to release it, so the higher-priority thread waits forever
- Solutions
- Disable preemption while lock is held
- Priority inheritance
Semaphores
- Semaphore S is a variable that takes only non-negative integer values
- Manipulated by two atomic operations P (down) and V (up)
- P(S): If S>0, P decrements/downs S and returns; else suspends until S becomes nonzero
- V(S): V increments/ups S and returns
- To schedule shared resources
- Thread suspended on a semaphore does not need to execute instructions checking variables in a busy-wait loop
Binary Semaphore
- Value initially 1, always either 0 or 1
Counting (general) Semaphores
- Value any integer greater than or equal to 0
- Represents resource with many units available
Producers and Consumers
- Producer threads create data element
- Consumer threads receive data element and perform computation on the element
Single Producer, Single Consumer
- Use circular queue to store data elements
- For asynchronous communication between the producer and consumer
- Producer appends to tail of queue if not full
- Consumer removes from head of queue if not empty
- No. of elements currently in queue and total number of available slots in queue are implemented as counting semaphores
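A single-producer, single-consumer sketch with POSIX counting semaphores (my own example): 'slots' counts free slots, 'items' counts filled slots, and the circular queue decouples the two threads.

    #include <semaphore.h>
    #include <pthread.h>
    #include <stdio.h>

    #define QSIZE 8
    int queue[QSIZE];
    int head = 0, tail = 0;
    sem_t slots, items;                    /* counting semaphores */

    void *producer(void *arg) {
        (void)arg;
        for (int i = 0; i < 32; i++) {
            sem_wait(&slots);              /* P: wait for a free slot */
            queue[tail] = i;               /* append at tail */
            tail = (tail + 1) % QSIZE;
            sem_post(&items);              /* V: one more item available */
        }
        return NULL;
    }

    void *consumer(void *arg) {
        (void)arg;
        for (int i = 0; i < 32; i++) {
            sem_wait(&items);              /* P: wait for an item */
            int v = queue[head];           /* remove from head */
            head = (head + 1) % QSIZE;
            sem_post(&slots);              /* V: one more free slot */
            printf("%d\n", v);
        }
        return NULL;
    }

    int main(void) {
        sem_init(&slots, 0, QSIZE);        /* all slots free initially */
        sem_init(&items, 0, 0);            /* no items initially */
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL); pthread_join(c, NULL);
        return 0;
    }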
Non-blocking Synchronization
- Algorithm is non-blocking if suspension of one or more threads will not stop progress of other threads
- Blocking is undesirable
- Nothing is accomplished while blocked
- Coarse-grained locking reduces opportunities for parallelism
- Fine-grained locking increases overhead
- Non-blocking synchronization algorithms exist but are hard to understand
- Advantages
- No deadlocks
- No priority inversion
- No performance degradation from
- Context switching
- Page faults
- Cache misses
Considerations for Parallelization
- Potential for parallelization in a program
- Amdahl’s Law
- Speedup determined by sections which have to be sequentially executed
- Locality
- Best to use data while it is in the cache
- Exploit spatial and temporal locality where possible
- Load Balance
- Load needs to be distributed evenly across processors
- Total time taken for execution depends on the slowest/busiest processor
- Parallelization Overheads
- Costs for starting multiple threads
- Costs for communication
- Costs for synchronization
- Extra computation for parallelization
Scalability
- Suitably efficient and practical when applied to cases with large number of processors
- Overhead increases as number of processors increases
- Strong scaling: how the solution time varies with the number of processors for a fixed total problem size
- Weak scaling: how the solution time varies with the number of processors for a fixed problem size per processor
=========
Chapter 15 - Hardware Support for Synchronization
Deadlock and Livelock
- Deadlock occurs when each thread in a set is waiting for an event only another thread in the set can cause
- No possible execution sequence will ever succeed
- Livelock occurs when states of threads constantly change with regard to one another but none progress
- Possible execution sequences exist in which no thread ever succeeds
Deadlock
- Four necessary conditions for deadlock
- Mutual Exclusion: at most one thread holds a resource at any one time
- Hold and Wait: threads already holding resources attempt to hold new resources
- No Preemption: thread already holding resource has to voluntarily release it
- Circular Wait: circular chain of threads requesting resources held by next thread in the chain
Mutual Exclusion Problem
- Each thread executing instructions in an infinite loop
- Satisfies following requirements
- Mutual Exclusion: No two threads in critical section at the same time
- Deadlock-freedom: If some threads are trying to enter critical section, one thread eventually enters
- Starvation-freedom: If a thread is trying to enter its critical section, it must eventually enter
- In absence of contention for critical section, thread trying to enter critical section will succeed
Dekker’s Algorithm
- Between two threads
- Right to enter critical section explicitly passed between threads
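A C rendering of Dekker's algorithm (my own sketch; on real hardware the shared variables would additionally need fences or sequentially consistent atomics, since the algorithm assumes sequential consistency):

    #include <pthread.h>
    #include <stdio.h>

    volatile int flag[2] = {0, 0};        /* flag[i]: thread i wants to enter */
    volatile int turn = 0;                /* which thread may insist on entering */
    long counter = 0;

    void lock(int i) {
        flag[i] = 1;
        while (flag[1 - i]) {             /* other thread also wants in */
            if (turn != i) {
                flag[i] = 0;              /* back off, let the other thread go */
                while (turn != i)
                    ;
                flag[i] = 1;
            }
        }
    }

    void unlock(int i) {
        turn = 1 - i;                     /* explicitly pass the right to enter */
        flag[i] = 0;
    }

    void *worker(void *arg) {
        int i = *(int *)arg;
        for (int k = 0; k < 100000; k++) { lock(i); counter++; unlock(i); }
        return NULL;
    }

    int main(void) {
        pthread_t t0, t1;
        int id0 = 0, id1 = 1;
        pthread_create(&t0, NULL, worker, &id0);
        pthread_create(&t1, NULL, worker, &id1);
        pthread_join(t0, NULL); pthread_join(t1, NULL);
        printf("%ld\n", counter);
        return 0;
    }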
Bakery Algorithm for Two Threads
- Does not have a variable that is both read and written by more than one thread
- Thread wanting to enter critical section required to take numbered ticket whose value is greater than values of all outstanding tickets
- Thread waits until its ticket has the lowest value
- Not practical for more than two threads
Hardware Support for Mutual Exclusion
- All architectures support atomic read and write operations
- Difficult to solve mutual exclusion problem with interleaved individual read and write instructions
- Read and write can be combined into a single atomic instruction
Atomic Swap
- Takes a register and memory location
- Exchanges the value in register for value in memory location
Test and Set
- Takes a memory location
- If value in memory location is false, set memory location to true, return false
- Else return true
Fetch and Add
- Takes a memory location and a value
- Increments value in memory location by given value, return old value of memory location
Read-Modify-Write
- Takes a memory location and a function f
- Reads the value in the memory location, writes the new value f(value) into the memory location, returns void
Compare and Swap
- Takes a memory location and two values, expected and new
- Compares the value in the memory location with expected
- If equal, modifies the memory location to new and returns true
- Else returns false
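A common use of compare-and-swap is a lock-free update loop; a C11 sketch (my own example) of an atomic increment built from CAS:

    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int counter = 0;

    /* Retry until the CAS succeeds: read the old value, compute the new one,
       and install it only if no other thread changed the location meanwhile. */
    void atomic_increment(atomic_int *p) {
        int expected = atomic_load(p);
        while (!atomic_compare_exchange_weak(p, &expected, expected + 1))
            ;                               /* on failure, expected is refreshed with the current value */
    }

    int main(void) {
        for (int i = 0; i < 5; i++)
            atomic_increment(&counter);
        printf("%d\n", atomic_load(&counter));   /* 5 */
        return 0;
    }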
===========
Chapter 16 - Optimization in the Memory Hierarchy
Parallel Merge Sort
- Each child processor provides sorted result to parent processor
Matrix Multiplication
- Need to consider
- Total cache size to exploit temporal locality and keep working set small
- Cache block size to exploit spatial locality
- Between N * N matrices: O(N^3)
Layout of C Arrays in Memory
- C arrays allocated in row-major order
- Each row in an array in contiguous memory locations
- Stepping through columns in one row accesses successive elements
- If block size B > 8 bytes, exploits spatial locality
- Compulsory miss rate = 8/B (assuming 8-byte elements, eg. double)
- Stepping through rows in one column accesses distant elements
- No spatial locality
- Compulsory miss rate = 1 (100%)
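A small C illustration of the two traversal orders (my own example):

    #include <stdio.h>
    #define N 1024

    double a[N][N];                         /* row-major: a[i][0..N-1] are contiguous */

    int main(void) {
        double sum = 0.0;

        /* Row-wise: consecutive addresses, good spatial locality (~8/B misses per element) */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        /* Column-wise: each access jumps N*8 bytes, so essentially every access misses */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }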
Reuse and Locality
- Reuse occurs when accessing a location that has been accessed in the past
- Locality occurs when accessing a location currently in the cache
- Locality only occurs when there is reuse
Strip Mining
- Breaks loops into pieces eg. instead of one loop from 0-11, 4 loops each taking 3 elements
Tiling (Blocking) for Matrix Multiplication
- Algorithm accesses all elements of second matrix b column by column
- Elements of a column stored among N different cache lines
- If cache is big enough to hold N cache lines and no lines are replaced, no cache miss when reading b
- Since all of a, b, c cannot fit in cache, choose block size B such that it becomes possible to fit one block from each of the matrices in the cache
- To improve spatial and temporal locality
- Sub-blocks can be treated like scalars
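A C sketch of tiled matrix multiplication (my own example; block size B is chosen so that one B×B tile from each of a, b and c fits in the cache, and N is assumed to be a multiple of B):

    #include <stdio.h>
    #define N 512
    #define B 32                                       /* tile (block) size */

    static double a[N][N], b[N][N], c[N][N];

    int main(void) {
        /* Iterate over B x B tiles so the working set stays cache-resident */
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int kk = 0; kk < N; kk += B)
                    /* Multiply the current tiles as if they were scalars */
                    for (int i = ii; i < ii + B; i++)
                        for (int j = jj; j < jj + B; j++) {
                            double sum = c[i][j];
                            for (int k = kk; k < kk + B; k++)
                                sum += a[i][k] * b[k][j];
                            c[i][j] = sum;
                        }
        printf("%f\n", c[N - 1][N - 1]);
        return 0;
    }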