RISC-V Pipelining Project

Category:

University Project

Delivered For:

EECS 2021

Duration:

3 Weeks

This was my final project for EECS 2021: a RISC-V pipelining assignment whose objective was to learn how to optimize code for a pipelined processor. It helped me excel in the course, giving me a very strong understanding in preparation for the exam, and I earned a grade of 100%.

Assignment Details

Background:

A two-issue, statically scheduled processor relies on the compiler to group two instructions into an "issue packet" that executes in one clock cycle. This is in contrast to a single-issue processor, which can execute only one instruction per cycle. The issue packet is sometimes called a Very Long Instruction Word (VLIW). The compiler aims to remove all hazards by reordering instructions into issue packets, padding a packet with a no-operation instruction (NOP) when necessary. The compiler must bundle one ALU/branch instruction with one load/store instruction; if it cannot, it should pad the unused slot with a NOP.
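To make the packing rule concrete, here is a minimal Python sketch (with a made-up instruction representation, not part of the assignment) of how a compiler might bundle one ALU/branch instruction with one load/store instruction per packet, padding with NOP when no partner is available:

```python
# Minimal sketch of VLIW issue-packet formation.
# Each instruction is a (text, kind) pair, where kind is
# "alu" (ALU/branch) or "mem" (load/store). This toy model
# ignores data hazards and reordering constraints, which a
# real compiler would also have to respect.

def form_packets(instructions):
    """Pair one ALU/branch slot with one load/store slot per packet."""
    alu = [i for i in instructions if i[1] == "alu"]
    mem = [i for i in instructions if i[1] == "mem"]
    packets = []
    while alu or mem:
        a = alu.pop(0) if alu else ("nop", "alu")  # ALU/branch slot
        m = mem.pop(0) if mem else ("nop", "mem")  # load/store slot
        packets.append((a, m))
    return packets

prog = [("add x6,x10,x5", "alu"), ("lw x7,0(x6)", "mem"),
        ("sub x30,x7,x29", "alu"), ("addi x12,x12,2", "alu")]
for a, m in form_packets(prog):
    print(a[0], "|", m[0])
```

With only one memory operation available, two of the three packets get a NOP in the load/store slot, which is exactly the padding the background paragraph describes.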

Specifications:

Given the above background and the following code:

for(int i=0; i!=j; i=i+2) b[i]=a[i]-a[i+1];

We want to compare the performance of single-issue and multiple-issue processors, taking into account program transformations that can be made to optimize for 2-issue execution.

A compiler doing little or no optimization might produce the following RISC-V assembly code:

        addi x12, x0, 0
        jal  ENT
TOP:    slli x5, x12, 3
        add  x6, x10, x5
        lw   x7, 0(x6)
        lw   x29, 8(x6)
        sub  x30, x7, x29
        add  x31, x11, x5
        sw   x30, 0(x31)
        addi x12, x12, 2
ENT:    bne  x12, x13, TOP

As seen above, i is held in x12, j in x13, the base address of a in x10, the base address of b in x11, and x5, x6, x7, x29, x30, and x31 are temporary registers.
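As a sanity check on what the assembly computes, here is a small Python model of the loop's semantics (array indexing stands in for the byte-offset addressing in the assembly):

```python
def kernel(a, b, j):
    """Semantics of: for (i = 0; i != j; i += 2) b[i] = a[i] - a[i+1];"""
    i = 0
    while i != j:          # bne x12, x13, TOP
        b[i] = a[i] - a[i + 1]  # lw / lw / sub / sw
        i += 2             # addi x12, x12, 2
    return b

# Only even indices below j are written.
print(kernel([5, 1, 9, 4, 7, 2], [0] * 6, 4))  # [4, 0, 5, 0, 0, 0]
```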

Our two-issue, statically scheduled processor has the following properties:

  1. One instruction must be a memory operation; the other must be an arithmetic/logic instruction or a branch.

  2. The processor has all possible forwarding paths between stages (including paths to the ID stage for branch resolution).

  3. The processor has perfect branch prediction.

  4. Two instructions may not issue together in a packet if one depends on the other.

  5. If a stall is necessary, both instructions in the issue packet must stall.

Questions

Part A:

Draw a pipeline diagram showing how the RISC-V code given above executes on the two-issue processor. Assume that the loop exits after two iterations.

Part B:

What is the speedup of going from a one-issue to a two-issue processor? (Assume the loop runs 100,000 iterations).
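The arithmetic behind this question can be sketched as follows; the per-iteration cycle counts must come from your pipeline diagrams, and the values used below are illustrative placeholders, not the answer:

```python
def speedup(cycles_per_iter_single, cycles_per_iter_dual, iterations=100_000):
    """Speedup = single-issue cycles / two-issue cycles for the same work.

    With a large iteration count (100,000 here), the one-time setup
    instructions outside the loop are negligible, so the ratio reduces
    to the per-iteration ratio.
    """
    return (cycles_per_iter_single * iterations) / (cycles_per_iter_dual * iterations)

print(speedup(8, 5))  # placeholder counts only
```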

Part C:

Rearrange/rewrite the RISC-V code given above to achieve better performance on the one-issue processor. Hint: use the instruction "beqz x13, DONE" to skip the loop entirely if j = 0.

Part D:

Rearrange/rewrite the RISC-V code given above to achieve better performance on the two-issue processor. (Do not unroll the loop, however.)

Part E:

Repeat Part A, but this time use your optimized code from Part D.

Part F:

What is the speedup of going from a one-issue processor to a two-issue processor when running the optimized code from Part C and D?

Part G:

Unroll the RISC-V code from Part D so that each iteration of the unrolled loop handles two iterations of the original loop. Then, rearrange/rewrite your unrolled code to achieve better performance on the two-issue processor. You may assume that j is a multiple of 4. Hint: you may want to re-organize the loop so that some calculations appear outside and at the end of the loop. You may also assume that the values in temporary registers are not needed after the loop.
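For reference, unrolling by a factor of two at the source level looks like the sketch below (the assignment, of course, asks for the unrolled assembly). Each pass does the work of two original iterations, so the induction variable advances by 4, and j being a multiple of 4 guarantees the i != j exit test is still reached:

```python
def kernel_unrolled(a, b, j):
    """Two original iterations per pass; assumes j is a multiple of 4."""
    i = 0
    while i != j:
        b[i] = a[i] - a[i + 1]          # original iteration i
        b[i + 2] = a[i + 2] - a[i + 3]  # original iteration i + 2
        i += 4
    return b

print(kernel_unrolled([5, 1, 9, 4], [0] * 4, 4))  # [4, 0, 5, 0]
```

In the assembly version, the payoff is that the two independent load/subtract/store chains give the scheduler more instructions to pair into full issue packets.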

Part H:

What is the speedup of going from a one-issue processor to a two-issue processor when running the unrolled, optimized code you created in Part G, assuming the two-issue processor can run two arithmetic/logic instructions together? (That is, the first instruction in a packet can be of any type, but the second must be an arithmetic or logic instruction; two memory operations still cannot be scheduled in the same packet.)
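The relaxed packing rule in this part can be stated as a simple predicate, sketched here in Python (the "alu"/"mem" tags are my own shorthand, as before):

```python
def packet_legal(first_kind, second_kind):
    """Part H packing rule: slot 1 is unrestricted, slot 2 must be
    arithmetic/logic. Because slot 2 can never be "mem", two memory
    operations can never share a packet."""
    return second_kind == "alu"

print(packet_legal("mem", "alu"))  # True: load/store paired with ALU op
print(packet_legal("alu", "alu"))  # True: two ALU ops are now allowed
print(packet_legal("alu", "mem"))  # False: memory op can't fill slot 2
```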

Part I:

Would a superscalar processor achieve faster processing speeds than the scenarios above while keeping the same CPU clock configuration? If yes, describe your design and the components it would use, analogous to our approach with the RISC-V pipelines. If no, explain your reasoning.

For a video walkthrough of my solution, please see the following video: