@manucorphorat
I've been doing some research and came across the Accelerate framework, which was introduced in iOS 4. It provides optimized matrix and vector operations, with convolution built in. If you are still looking for ways to improve the efficiency of your filter, I'd check it out. Just an idea.
EDIT:
Out of curiosity about how fast it would be, I implemented the filter using the vDSP part of the Accelerate framework:
...bunch of stuff...
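if(format == kTexture2DPixelFormat_RGBA8888){
    const unsigned char *originalData = (const unsigned char*)input;
    unsigned char *data = (unsigned char*)output;
    /* vDSP_conv reads N + P - 1 input elements, so pad the input
       with kernelSize-1 zeros to avoid reading past the buffer. */
    float *originalFloatData = calloc(wh*4 + kernelSize - 1, sizeof(float));
    float *floatData = malloc(wh*4*sizeof(float));
    /* Convert the 8-bit samples to floats. */
    vDSP_vfltu8(originalData, 1, originalFloatData, 1, wh*4);
    /* Carry out a convolution. Starting at the last kernel element with a
       stride of -1 makes vDSP_conv convolve instead of correlate. */
    int filterStride = -1;
    int signalStride = 1;
    int resultStride = 1;
    vDSP_conv(originalFloatData, signalStride, kernel + kernelSize-1,
              filterStride, floatData, resultStride, wh*4, kernelSize);
    /* Convert the filtered floats back to 8-bit samples. */
    vDSP_vfixu8(floatData, 1, data, 1, wh*4);
    /* Release the temporary float buffers. */
    free(originalFloatData);
    free(floatData);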
}else if(format == kTexture2DPixelFormat_A8){
...More stuff...
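A side note on the RGBA branch, separate from the code that produced the result shown further down: with a signal stride of 1, neighbouring R, G, B and A samples get mixed together, because the data is interleaved. One possible way to keep the channels apart is to run the filter four times with a stride of 4, once per channel. The sketch below is plain C so it runs without Accelerate. `conv_strided` and `filter_rgba_row` are hypothetical names, and `conv_strided` is only a stand-in that assumes the documented vDSP_conv formula (result[n] = sum of signal[n+p] * filter[p], with strides). It also assumes each row is filtered on its own and padded on the right with kernelSize-1 pixels.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for vDSP_conv (assumption: same strided formula).
   result[n*resultStride] = sum over p of
       signal[(n+p)*signalStride] * filter[p*filterStride]
   The signal must hold at least n_out + filterLen - 1 strided elements. */
void conv_strided(const float *signal, ptrdiff_t signalStride,
                  const float *filter, ptrdiff_t filterStride,
                  float *result, ptrdiff_t resultStride,
                  size_t n_out, size_t filterLen)
{
    for (size_t n = 0; n < n_out; n++) {
        float sum = 0.0f;
        for (size_t p = 0; p < filterLen; p++) {
            sum += signal[(ptrdiff_t)(n + p) * signalStride]
                 * filter[(ptrdiff_t)p * filterStride];
        }
        result[(ptrdiff_t)n * resultStride] = sum;
    }
}

/* Filter one row of interleaved RGBA floats, one channel at a time.
   A stride of 4 means each pass only ever touches R, or G, or B, or A.
   rowIn holds (width + kernelSize - 1) pixels (right-padded);
   rowOut holds width pixels. */
void filter_rgba_row(const float *rowIn, float *rowOut, size_t width,
                     const float *kernel, size_t kernelSize)
{
    assert(kernelSize > 0);
    for (int c = 0; c < 4; c++) {
        conv_strided(rowIn + c, 4,
                     kernel + kernelSize - 1, -1, /* walk the kernel backwards */
                     rowOut + c, 4,
                     width, kernelSize);
    }
}
```

With Accelerate, the same idea would presumably be four vDSP_conv calls per row using signal and result strides of 4. That is untested here.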
This is just what I've been able to figure out so far. It almost works; I still have to figure out how to convolve each color channel only with itself. Here is what it did: