CS 533 Parallel Computer Architecture


2. (16 points) Prefetching 2

(a) What problem does prefetching attempt to tackle? Can prefetching hurt performance?

(b) Is hardware prefetching more effective for single-issue, statically scheduled processors or for multiple-issue, dynamically scheduled processors? Why?

(c) How early can a binding prefetch for variables in a critical section be issued in a cache coherent system?

(d) Prefetch tries to bring a cache line before it is requested. Can this lead to any violation of memory consistency? Explain.

(e) Briefly explain the intuition behind Runahead execution [1].

(f) Contemporary processors have HW prefetchers that automatically prefetch the cache lines. Identify one memory access pattern that can be automatically prefetched. How can you detect the pattern via hardware?

(g) In the paper detailing prefetching in the SUIF compiler [2], the authors discuss the tradeoffs associated with trying to perform a prefetch when the prefetch buffer is full. What are the possible options, and what are the advantages of each?

(h) Some processors replace coherent caches with scratchpad memories [3], in which data is cached only by explicit software instructions. Explain one advantage and one disadvantage of this approach.
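As background for part (f), the sketch below shows one commonly described way that a constant-stride access pattern could be recognized. It is a minimal illustration under stated assumptions, not a description of any particular processor: the class name `StrideDetector`, its single-entry state, and the "same stride seen twice" rule are all hypothetical choices made for this example. Hardware designs typically keep a table of such entries indexed by the load instruction's address.

```cpp
#include <cstdint>

// Hypothetical single-entry stride detector (illustrative only).
class StrideDetector {
public:
    // Record one demand address. Returns true and writes a predicted next
    // address to *next once the same nonzero stride has been seen on two
    // consecutive accesses; otherwise returns false and leaves *next alone.
    bool observe(std::uint64_t addr, std::uint64_t* next) {
        bool predicted = false;
        if (has_last_) {
            // Unsigned subtraction wraps, so the cast also yields negative strides.
            std::int64_t stride = static_cast<std::int64_t>(addr - last_addr_);
            if (stride != 0 && stride == last_stride_) {
                *next = addr + static_cast<std::uint64_t>(stride);
                predicted = true;
            }
            last_stride_ = stride;
        }
        last_addr_ = addr;
        has_last_ = true;
        return predicted;
    }

private:
    std::uint64_t last_addr_ = 0;
    std::int64_t last_stride_ = 0;
    bool has_last_ = false;
};
```

Feeding the addresses 0x1000, 0x1040, 0x1080 produces a prediction of 0x10C0 on the third access; an unrelated address afterwards produces no prediction until a stride repeats again.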


4. (10 points) Synchronization 2: Spin locks

Suppose all 8 processors in a bus-based machine try to acquire a test & test & set lock simultaneously. Assume all processors are spinning on the lock in their caches and that their cached copies are invalidated by a release at the beginning.

(a) How many bus transactions will it take until all processors have acquired the lock if all the critical sections are empty (each processor only executes a LOCK operation, immediately followed by an UNLOCK operation with no other operations in between)?

(b) Assuming that the bus is fair (services pending requests before new ones) and that every bus transaction takes 50 cycles, how long would it take before the first processor acquires and releases the lock? How long before the last processor to acquire the lock is able to acquire and release it?

(c) If the variables used for implementing locks are not cached, will a test & test & set lock still generate less traffic than a test & set lock? Explain.

(d) Why would one use Array-Based Queue Locks (ABQLs) instead of test & test & set? Are there any downsides to ABQLs? Does the MCS lock solve these problems?

(e) Can you implement the MCS lock using LL/SC instead of compare & swap and fetch & store? Please explain how you can implement it, or why it is not implementable.
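For reference, the test & test & set lock that this question analyzes can be sketched in software as follows. This is a minimal illustration assuming C++11 `std::atomic`; the names `TTASLock` and `run_counter_demo` are hypothetical and chosen only for this example, and the memory orderings shown are one reasonable choice rather than the only one.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Test & test & set spin lock (illustrative sketch).
class TTASLock {
public:
    void lock() {
        for (;;) {
            // "Test": spin on an ordinary load, which can be served from the
            // local cache while the lock is held by someone else.
            while (locked_.load(std::memory_order_relaxed)) {
                std::this_thread::yield();
            }
            // "Test & set": a single atomic read-modify-write attempt.
            // exchange returns the previous value; false means we won.
            if (!locked_.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }

    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};

// Hypothetical driver: every thread performs `iters` lock-protected
// increments of a shared counter and the final count is returned.
long run_counter_demo(int num_threads, int iters) {
    TTASLock lock;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;  // critical section
                lock.unlock();
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter;
}
```

With 8 threads and 1000 iterations each, `run_counter_demo(8, 1000)` returns 8000 because every increment happens inside the critical section. The sketch shows only the locking protocol; it does not model bus transactions, which depend on the coherence protocol assumed in the question.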


5. (11 points) SMT

(a) How does SMT differ from superscalars and traditional multithreading?

(b) Between an in-order and out-of-order superscalar processor, which would benefit more from adding SMT capabilities? Why?

(c) For each of the following structures, indicate whether it should be shared or duplicated in an SMT machine, or whether either choice is viable. Explain.

i. Branch predictor

ii. Return address stack

iii. Register file

iv. TLB

v. Register Aliasing Table (RAT)

vi. Load-Store Queue (LSQ)

(d) What are the factors against building wide-issue superscalars? You can look at this paper [4] for valuable insights.

(e) In multiscalar processors, list conditions under which a task may get squashed. Do all the successors of a squashed task also need to be squashed for correctness?


6. (12 points) SMT and CMP

(a) For which types of applications is it better to have an SMT machine instead of a CMP machine? Which applications perform better on a CMP machine than on an SMT machine? Assume both machines use the same die size.

(b) Compare SMTs and CMPs in terms of hardware complexity. Carefully explain all the aspects that contribute to it.

(c) Several papers explain how to transform a machine's structure based on the codes being run [5], [6] (CMP → SMT, OoO processor → SMT in-order processor).

i. Explain the high-level process of transforming a CMP machine into an SMT machine. Give details about the hardware structures that need to be modified.

ii. Explain the limitations of Core Fusion.

iii. Describe the approach MorphCore takes to handle workload diversity. Give details about the hardware structures that have to be modified.

(d) Many modern processors incorporate a heterogeneous big.LITTLE [7] architecture. Explain one benefit of big.LITTLE over SMT. Also, identify one technical challenge of a heterogeneous architecture.

 
