Suppose an instruction takes 4 cycles to execute on an unpipelined CPU: one cycle to fetch the instruction, one to decode it and fetch any operands, one to perform the ALU operation, and one to store the result. On a CPU with a 4-stage pipeline, that instruction still takes 4 cycles to execute, so how can we say that the pipeline speeds up execution of the program?
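
One way to approach the question is to count cycles for a whole sequence of instructions rather than for a single one. The sketch below assumes an ideal pipeline: every stage takes exactly one cycle, and there are no stalls, hazards, or branch penalties. The function names are illustrative only and are not part of the original question.

```python
def unpipelined_cycles(n_instructions: int, stages: int = 4) -> int:
    """Each instruction finishes all of its stages before the next one starts."""
    return n_instructions * stages


def pipelined_cycles(n_instructions: int, stages: int = 4) -> int:
    """Ideal pipeline (assumed: no stalls or hazards).

    The first instruction needs `stages` cycles to pass through the pipeline;
    after that, one instruction completes on every cycle.
    """
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1)


for n in (1, 4, 100):
    # Columns: instruction count, unpipelined cycles, pipelined cycles
    print(n, unpipelined_cycles(n), pipelined_cycles(n))
# prints:
# 1 4 4
# 4 16 7
# 100 400 103
```

Under these assumptions, a single instruction still takes 4 cycles either way, but 100 instructions take 400 cycles unpipelined and 103 cycles pipelined. The latency of each instruction is unchanged; what improves is throughput, which approaches one completed instruction per cycle once the pipeline is full.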