[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
evan.cheng at apple.com
Tue Nov 10 13:27:31 CST 2009
On Nov 9, 2009, at 11:25 PM, Chris Lattner wrote:
> On Nov 9, 2009, at 11:13 PM, Evan Cheng wrote:
>>> On the A8, an ARM store after NEON stores to the same 16-byte block
>>> incurs a ~20 cycle penalty since the NEON unit executes behind ARM.
>>> It's worse if the NEON store was split across a 16-byte boundary, then
>>> there could be a 50 cycle stall.
>>> See http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/ for
>>> some more details and benchmarks.
>> If that's the case, then for A8 we should only do this when there
>> won't be trailing scalar load / stores.
> It should be safe if the start pointer is known 16-byte aligned. The trailing stores won't be in the same 16-byte chunk.
There are secondary effects if the loads / stores are within the same 64-byte block.
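
To make the block arithmetic in this thread concrete, here is a rough sketch in plain C (no NEON intrinsics). It assumes the copy is done with 16-byte vector stores followed by a scalar tail; the helper name tail_shares_block and its parameters are made up for illustration and are not anything in LLVM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * Models a copy of n bytes to address dst as a run of 16-byte vector
 * stores followed by a scalar tail of (n % 16) bytes. Returns true when
 * the first tail byte lands in the same `block`-byte aligned block as the
 * last byte written by the vector stores. */
static bool tail_shares_block(uintptr_t dst, size_t n, unsigned block)
{
    assert(block != 0);

    size_t vec_bytes = n & ~(size_t)15;   /* bytes covered by 16-byte stores */
    size_t tail_bytes = n - vec_bytes;    /* bytes left for scalar stores */

    /* With no vector stores or no scalar tail there is nothing to overlap. */
    if (vec_bytes == 0 || tail_bytes == 0)
        return false;

    uintptr_t last_vec_byte = dst + vec_bytes - 1;
    uintptr_t first_tail_byte = dst + vec_bytes;

    return last_vec_byte / block == first_tail_byte / block;
}
```

With a 16-byte aligned dst, tail_shares_block(dst, n, 16) is always false, which matches the point about aligned start pointers. Passing 64 for block shows that the tail can still sit in the same 64-byte block as the last vector store.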