BM3D-r4
Added SSE2 optimizations
Basic, Final: profile="fast" now uses group_size=8 instead of 16 for better speed
OPP2RGB, RGB2OPP: now override the frame property "_Matrix"
Compiled with VS2015
~35% faster with the same settings
~175% faster with default settings for bm3d.Basic+bm3d.Final
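
To make the two figures above easier to compare, here is a minimal sketch that converts an "N% faster" figure into a speedup factor and a relative runtime. It assumes "N% faster" refers to throughput (frames per second), which these notes do not state explicitly. The function names are hypothetical and are not part of the plugin.

```python
def speedup_factor(percent_faster: float) -> float:
    """Throughput ratio new/old, assuming 'N% faster' means N% more fps."""
    return 1.0 + percent_faster / 100.0


def runtime_fraction(percent_faster: float) -> float:
    """Fraction of the old processing time needed for the same job."""
    return 1.0 / speedup_factor(percent_faster)


if __name__ == "__main__":
    # The two figures quoted in the notes above.
    for pct in (35, 175):
        print(f"~{pct}% faster: {speedup_factor(pct):.2f}x throughput, "
              f"{runtime_fraction(pct):.0%} of the old runtime")
```

Under that assumption, ~175% faster corresponds to 2.75x the throughput, which is a little over a third of the previous runtime.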