@araker: I guess that the BLAS performance should be more or less the same as math-neon... at least regarding matrix multiplication.
So cocos2d will probably make this decision at compile time, not at runtime:
if compiled for ARMv6, then use VFP
if compiled for ARMv7, then use NEON
If you are shipping fat binaries (ARMv6 + ARMv7), each slice is compiled separately, so it will use VFP on the 1st and 2nd gen iPhone, and NEON on the iPad and the 3rd and 4th gen iPhone.
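To make the compile-time idea concrete, here is a rough sketch of how the selection could look. This is not cocos2d code: `MAT_BACKEND` and `mat4_mul` are names I made up for illustration, and I'm assuming the compiler predefines `__ARM_NEON__` (or `__ARM_NEON`) when NEON is enabled and `__VFP_FP__` for VFP builds, which is what GCC and clang do as far as I know. Matrices are assumed to be column-major, as in OpenGL.

```c
#include <assert.h>

/* Hypothetical backend tag, chosen once per compiled slice.
   The macro names tested here are compiler predefines (an assumption
   about the toolchain), not cocos2d identifiers. */
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
  #include <arm_neon.h>
  #define MAT_BACKEND "neon"
#elif defined(__VFP_FP__)
  #define MAT_BACKEND "vfp"
#else
  #define MAT_BACKEND "generic"
#endif

/* out = a * b for column-major 4x4 matrices.
   out must not alias a or b. */
void mat4_mul(const float a[16], const float b[16], float out[16])
{
    assert(out != a && out != b);
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
    /* Load the four columns of a once. */
    float32x4_t a0 = vld1q_f32(a + 0);
    float32x4_t a1 = vld1q_f32(a + 4);
    float32x4_t a2 = vld1q_f32(a + 8);
    float32x4_t a3 = vld1q_f32(a + 12);
    for (int c = 0; c < 4; ++c) {
        /* Column c of out is a linear combination of a's columns,
           weighted by the entries of column c of b. */
        float32x4_t r = vmulq_n_f32(a0, b[4 * c + 0]);
        r = vmlaq_n_f32(r, a1, b[4 * c + 1]);
        r = vmlaq_n_f32(r, a2, b[4 * c + 2]);
        r = vmlaq_n_f32(r, a3, b[4 * c + 3]);
        vst1q_f32(out + 4 * c, r);
    }
#else
    /* Plain C path. On an ARMv6 build a hand-tuned VFP routine
       could be slotted in here instead. */
    for (int c = 0; c < 4; ++c) {
        for (int r = 0; r < 4; ++r) {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k)
                s += a[4 * k + r] * b[4 * c + k];
            out[4 * c + r] = s;
        }
    }
#endif
}
```

Since the `#if` is resolved separately for each architecture slice, a fat binary would end up with one version of `mat4_mul` per slice and no runtime check at all.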
The thing is that math-neon is licensed under LGPL, and cocos2d is no longer licensed under "cocos2d license" so I'm not sure I'll be able to use that library. I'll ask the authors.
Backup plan: In case we can't use math-neon, Oolong-engine includes a 4x4 matrix multiplication function that is optimized for NEON. Perhaps I'll adapt that one for 3x3 matrices.
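One simple way to reuse a 4x4 routine for 3x3 matrices would be to pad each 3x3 matrix into a 4x4 one, multiply, and copy the upper-left 3x3 block back out. The sketch below only illustrates that idea: `mat4_mul_ref` is a scalar stand-in I wrote for this post, not Oolong's function, and `mat3_to_mat4` and `mat3_mul_via_mat4` are hypothetical names. Column-major storage is assumed.

```c
#include <assert.h>
#include <string.h>

/* Scalar stand-in for an optimized 4x4 multiply (hypothetical;
   an optimized NEON routine would replace this). Column-major. */
void mat4_mul_ref(const float a[16], const float b[16], float out[16])
{
    for (int c = 0; c < 4; ++c) {
        for (int r = 0; r < 4; ++r) {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k)
                s += a[4 * k + r] * b[4 * c + k];
            out[4 * c + r] = s;
        }
    }
}

/* Embed a column-major 3x3 matrix in the upper-left corner of a
   4x4 matrix, with 1 in the bottom-right corner and 0 elsewhere. */
void mat3_to_mat4(const float m3[9], float m4[16])
{
    memset(m4, 0, 16 * sizeof(float));
    for (int c = 0; c < 3; ++c)
        memcpy(&m4[4 * c], &m3[3 * c], 3 * sizeof(float));
    m4[15] = 1.0f;
}

/* out = a * b for 3x3 matrices, computed through the 4x4 routine.
   Temporaries are used, so out may alias a or b. */
void mat3_mul_via_mat4(const float a[9], const float b[9], float out[9])
{
    float a4[16], b4[16], o4[16];

    assert(a && b && out);
    mat3_to_mat4(a, a4);
    mat3_to_mat4(b, b4);
    mat4_mul_ref(a4, b4, o4);

    /* Copy the first three entries of the first three columns. */
    for (int c = 0; c < 3; ++c)
        memcpy(&out[3 * c], &o4[4 * c], 3 * sizeof(float));
}
```

The padding and copying cost something, so this might eat part of the NEON gain for such small matrices. A dedicated 3x3 routine would avoid that, but this version is the least work to try first.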
Update: Backup plan 2: http://blogs.arm.com/software-enablement/coding-for-neon-part-3-matrix-multiplication/