Why are elementwise additions much faster in separate loops than in a combined loop?

Suppose a1, b1, c1, and d1 point to heap memory and my numerical code has the following core loop.

This loop is executed 10,000 times via another outer for loop. To speed it up, I changed the code to: […]
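For illustration only, here is a minimal sketch of the two loop shapes the title contrasts. It is not the asker's actual listing: it assumes the four pointers refer to non-overlapping arrays of `double`, and the length parameter `n` and the function names `add_combined` and `add_separate` are hypothetical.

```c
#include <stddef.h>

/* Combined form (hypothetical name): both additions happen in one pass,
   so each iteration touches all four arrays. */
void add_combined(double *a1, const double *b1,
                  double *c1, const double *d1, size_t n)
{
    for (size_t j = 0; j < n; j++) {
        a1[j] += b1[j];
        c1[j] += d1[j];
    }
}

/* Split form (hypothetical name): same arithmetic, but each loop
   touches only one pair of arrays. */
void add_separate(double *a1, const double *b1,
                  double *c1, const double *d1, size_t n)
{
    for (size_t j = 0; j < n; j++)
        a1[j] += b1[j];
    for (size_t j = 0; j < n; j++)
        c1[j] += d1[j];
}
```

Both functions compute the same result; they differ only in the order in which memory is accessed, which is what the question is about.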

Do I cast the result of malloc?

In this question, someone suggested in a comment that I should not cast the result of malloc, i.e.

rather than:

Why would this be the case?

No; you don’t cast the result, since: It is unnecessary, as void […]
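As a rough sketch of the two forms being compared, the following C fragment allocates an `int` array with and without the cast. The variable name `sieve`, the `length` parameter, and both function names are hypothetical, chosen only for illustration. In C, a `void *` converts implicitly to any object pointer type, so both versions compile.

```c
#include <stdlib.h>

/* Without a cast: the void * returned by malloc converts implicitly
   to int * in C. Using sizeof *sieve keeps the element size tied to
   the pointer's own type. */
int *alloc_no_cast(size_t length)
{
    int *sieve = malloc(length * sizeof *sieve);
    return sieve; /* may be NULL on failure; the caller must free() it */
}

/* With a cast: behaves the same in C, but the cast is redundant and
   the type name int is repeated. */
int *alloc_with_cast(size_t length)
{
    int *sieve = (int *)malloc(length * sizeof(int));
    return sieve; /* may be NULL on failure; the caller must free() it */
}
```

Note that this applies to C only; C++ does not perform the implicit conversion from `void *`, so the uncast form would not compile there.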