April 24, 2015
Multicore
Notes for multicore computing. Midterm week so there will be more of these posts coming up!
Some have been translated from Korean as I struggle through my lecture notes so remember to cross-reference.
Multicore Computing
Chapter 1 - Overview
Capabilities
- Supercomputers moving on to operate in petaflops (10^15 floating-point operations per second)
- Moore’s Law: no. of transistors doubles approximately every two years
Hardware techniques
- Instruction pipeline
- Out-of-order execution
- Superscalar execution
- On-chip cache
ILP Wall
- Limit to Instruction Level Parallelism (ILP)
Power Wall
- Instructions per second directly proportional to CPU clock frequency
- Requires more power to run at a higher clock frequency
Multicore
- Two or more processors on one chip
- Manycore: 8/16 or more cores
- Addresses power and ILP walls
Homogeneous multicore
- All cores are the same
- Intel Xeon, AMD Opteron, ARM Cortex A15 MPCore
Heterogeneous multicore
- Different processors on the same chip
- Intel i7, AMD APU, ARM big.LITTLE, Nvidia Tegra
GPGPU
- General Purpose computing on Graphics Processing Units
Accelerator
- Attached to systems to increase speed
- GPU, FPGA, Intel Xeon Phi coprocessor
Amdahl’s Law
- Used to find the maximum expected improvement to an overall system when only part of the system is improved
Programming Wall
- Programmers have yet to efficiently exploit the large number of cores available
- Hardware has advanced much more rapidly than software
Programming Model
- Programmer -> Programming Model -> Parallel Processing System
- A balance between delivering high performance and ease of programming
Shared memory parallel programming model
- OpenMP
- Pthreads
Message passing parallel programming model
- MPI
Accelerator programming model
- OpenCL
- SnuCL
- CUDA
- OpenMP, OpenACC
===========
Chapter 2 - Computers and Notation
Computer
- Input -> Computer -> Output
Data
- Bit: binary digit (1 or 0) + encoded in electronic circuits as a change in voltage
- Byte: 8 bits
Hardware and Software
- Hardware: physical component
- Software: set of machine-readable instructions
- Programs: sequence of instructions written to perform a specific task
Radix Notation
- Positional number system
- Binary (base 2), Octal (base 8), Decimal (base 10), Hexadecimal (base 16)
- Conversions between the different number systems
- Signed and unsigned numbers + 2’s complement representation: flip all bits and add 1
Modulo
- Divide and take remainder
- Written as mod
- eg. 7+3 = 2 (mod 8)
===========
Chapter 3 - Boolean Algebra and Logic
Boolean Algebra
- OR, AND, NOT
- XOR, XNOR, NOR, NAND
- Minimise a boolean expression using Karnaugh map
Logic Gates
- Made of transistors
- Combinations of them implement all Boolean operations
- Gate or propagation delay exists + Time taken for signal/changes to travel/propagate down to later stages of the circuit
- Tristate buffer + Acts as a switch, output is input only if Enable is 1
- Multiplexer (MUX) + Selects/forwards one of many inputs to the output
- Decoder + Splits input into many outputs
Combinational Logic circuit
- Output of the circuit depends solely on its current inputs (no memory)
==========
Chapter 4 - Sequential Logic
Sequential Logic circuit
- Circuit depends on a clock
- Memory exists in the circuit
- Output depends also on previous results and not just inputs
Clock
- Square wave signal
- Rising edge, falling edge
Latch and Flip-flop
- Circuit component which has memory
- Component in registers
Finite State Machines
- Mealy FSM: Output values are determined by both its current state and current inputs
- Moore FSM: Output values are determined solely by its current state
RAM
- Random Access Memory
- kilo (2^10), mega (2^20), giga (2^30) etc.
- Composed of cells which hold one bit each
- Multiplexers control read/write operations
- Memory access time exists, a result of propagation delay
==========
Chapter 5 - Binary Integer Arithmetic
Operations
- Uses NOT, AND, OR, NOR, NAND, XOR, XNOR
- In C language, ~, &, |, ^ correspond to NOT, AND, OR, XOR
Addition/Subtraction
- Uses XOR, AND, OR
- Sign extension for signed arithmetic
Multiplication/Division
- Shift right/left by the appropriate number of bits
- Sign extension done on 2’s complement representation for signed number arithmetic
===========
Chapter 6 - Floating Point Arithmetic
Scientific Notation for Numbers
- m x b^e + m: mantissa, coefficient + b: base, radix + e: exponent
- eg. -0.12345 = -1.2345 * 10^-1
IEEE Floating Point Representation
- Exponent E has to be biased + E has to be signed + If E is in 2’s complement it would be difficult to compare + When encoding, add the bias (the magnitude of the minimum E) so encoded E is always positive
- Exponent E = 111…, fraction F = 000… + Represents infinity + Overflow
- Exponent E = 111…, fraction F not 000… + Represents NaN (Not a Number) + No numeric value can be determined
- As if computed with infinite precision then rounded
Precision
- Single (32-bit) precision: sign(1), exponent(8), fraction(23), bias=127
- Double (64-bit) precision: sign(1), exponent(11), fraction(52), bias=1023
- Quadruple (128-bit) precision: sign(1), exponent(15), fraction(112), bias=16383
Rounding
- Rounding errors since binary can only approximate the true value of a decimal number
Normalized numbers
- Ones place of 1 (implicit leading bit) and exponent in the normal range, eg. magnitude >= 2^-126 in single precision
Denormalized numbers
- Ones place of 0 and the smallest exponent, eg. magnitude < 2^-126 in single precision
- For representing numbers smaller than normal (underflow)
Floating Point Arithmetic
- Add is NOT associative ie. (a+b)+c != a+(b+c) + Due to overflow and inexactness of rounding
Fused Multiply-Add
- Multiply and add operation together in one hardware unit
- Matrix dot product uses many multiply and add operations
- Better precision than doing two separate operations + One rounding error compared to two
=============
Chapter 7 - Von Neumann Architecture
Von Neumann Architecture
- Four basic hardware components + Input devices + Output devices + Main memory + Central Processing Unit (CPU)
CPU
- Carries out the instructions of a computer program
- Arithmetic Logic Unit (ALU)
- Control Unit (CU) fetches instructions from main memory, decodes and executes them using ALU where necessary
Stored Program Concept
- Programs treated as data, both stored in main memory
- No need to reprogram computers for each specific task all the time
- Solve different problems by just switching between different programs
Dependences
- Any ordering of execution that obeys all dependences will produce the same result as the original program
- Flow dependence
- True dependence: cannot be removed
- Read after Write
- Anti dependence
- False dependence: can be removed
- Write after Read
- Output dependence
- False dependence: can be removed
- Write after Write
- Input not-really-a-dependence
- For caches
- Read after Read
Pipelined Processors
- Increases instruction throughput ie. no. of instructions executed per CPU clock cycle
- Typical five-stage pipeline:
- IF: instruction fetch
- ID: instruction decode and register fetch
- EX: execute
- MEM: memory access
- WB: register write back
Pipeline hazards
- Occur when the next instruction does not execute in the next clock cycle
- Data hazards
- Result needed before it is available
- To resolve:
- Stall the pipeline/insert a bubble of no operation
- Data forwarding
- Structural hazards
- Hardware component required by more than one instruction at the same time
- Resource conflict eg. single memory unit accessed in both IF and MEM
- To resolve:
- Stall the pipeline
- Control hazards
- Branches in instruction flow
- Branch target unknown until instruction reaches MEM stage
- Instructions fetched after branch should not execute since control has been diverted
- To resolve:
- Delayed branch: schedule useful instructions into branch delay slots
- Not having branches at all
- Stalling
- Branch predictors
In-order Execution
- Fetch and decode next instruction
- If operands available, instruction dispatched to appropriate functional unit
- Else stall until available
- Result written back to register file
Out of Order Execution
- Dynamic instruction scheduling by hardware
- Fetch and decode next instruction
- Issue to reservation station
- Instruction waits in reservation station until operands available
- Instruction dispatched to functional unit and executes
- Execution completed, result queued in reorder buffer in commit unit
- Commit unit writes results to registers and memory in program fetch order
Issuing and Dispatching an Instruction
- Instruction that has passed ID stage is issued
- When operands are available, instruction is dispatched to functional/execution unit
Precise Exception (Interrupt)
- Instructions before faulting instruction are committed
- Instructions after faulting instruction are restarted from scratch
Retirement (Graduation)
- Instruction committed or removed
Superscalar Processors
- Dynamically issue multiple instructions in each clock cycle
- Fetches and decodes several instructions at a time
- In-order or out-of-order issue
- Exploit Instruction Level Parallelism (ILP)
- Instructions that can be executed simultaneously
- Limited amount of ILP in an application
VLIW Processors
- Static instruction scheduling by compiler
- Very Long Instruction Word
- Programs able to explicitly specify instructions to be executed simultaneously
- ksami