55 Commits

Author SHA1 Message Date
Mark Reid
716b396740 avfilter/vf_lut3d: add x86-optimized tetrahedral interpolation
I spotted an interesting pattern that I hadn't seen before, which makes the implementation faster.
The bit-shifting table I was using before is no longer needed, and I was able to remove quite a few lines.
I also added use of FMA in the AVX2 version.

f32 1920x1080 1 thread with prelut
c impl
1434012700 UNITS in lut3d->interp,       1 runs,      0 skips
1434035335 UNITS in lut3d->interp,       2 runs,      0 skips
1423615347 UNITS in lut3d->interp,       4 runs,      0 skips
1426268863 UNITS in lut3d->interp,       8 runs,      0 skips

sse2
905484420 UNITS in lut3d->interp,       1 runs,      0 skips
905659010 UNITS in lut3d->interp,       2 runs,      0 skips
915167140 UNITS in lut3d->interp,       4 runs,      0 skips
915834222 UNITS in lut3d->interp,       8 runs,      0 skips

avx
574794860 UNITS in lut3d->interp,       1 runs,      0 skips
581035090 UNITS in lut3d->interp,       2 runs,      0 skips
584116720 UNITS in lut3d->interp,       4 runs,      0 skips
581460290 UNITS in lut3d->interp,       8 runs,      0 skips

avx2
301698880 UNITS in lut3d->interp,       1 runs,      0 skips
301982880 UNITS in lut3d->interp,       2 runs,      0 skips
306962430 UNITS in lut3d->interp,       4 runs,      0 skips
305472025 UNITS in lut3d->interp,       8 runs,      0 skips

gbrap16 1920x1080 1 thread with prelut
c impl
1480894840 UNITS in lut3d->interp,       1 runs,      0 skips
1502922990 UNITS in lut3d->interp,       2 runs,      0 skips
1496114307 UNITS in lut3d->interp,       4 runs,      0 skips
1492554551 UNITS in lut3d->interp,       8 runs,      0 skips

sse2
980777180 UNITS in lut3d->interp,       1 runs,      0 skips
986121520 UNITS in lut3d->interp,       2 runs,      0 skips
986489840 UNITS in lut3d->interp,       4 runs,      0 skips
998832248 UNITS in lut3d->interp,       8 runs,      0 skips

avx
622212360 UNITS in lut3d->interp,       1 runs,      0 skips
622981160 UNITS in lut3d->interp,       2 runs,      0 skips
645396315 UNITS in lut3d->interp,       4 runs,      0 skips
641057075 UNITS in lut3d->interp,       8 runs,      0 skips

avx2
321336400 UNITS in lut3d->interp,       1 runs,      0 skips
321268920 UNITS in lut3d->interp,       2 runs,      0 skips
323459895 UNITS in lut3d->interp,       4 runs,      0 skips
324949967 UNITS in lut3d->interp,       8 runs,      0 skips
2021-10-10 22:23:48 +02:00
Paul B Mahol
ac0f5f4c17 avfilter/vf_maskedclamp: add x86 SIMD 2019-10-23 16:20:21 +02:00
Paul B Mahol
ccd9bca15a avfilter/vf_transpose: add x86 SIMD 2019-10-21 20:37:51 +02:00
Paul B Mahol
295d99b439 avfilter/vf_adadenoise: add x86 SIMD 2019-10-17 19:44:11 +02:00
James Almer
1dbd3c6116 avfilter/vf_eq: fix compilation with x86 asm disabled
Signed-off-by: James Almer <jamrial@gmail.com>
2019-09-26 12:19:43 -03:00
Ting Fu
6aff2042d6 avfilter/x86/vf_eq: Change inline assembly into nasm code
Signed-off-by: Ting Fu <ting.fu@intel.com>
2019-09-26 08:11:13 +08:00
Paul B Mahol
058bbf48c6 avfilter/vf_v360: x86 SIMD for interpolations 2019-09-06 14:10:37 +02:00
Ruiling Song
98e419cbf5 avfilter/vf_convolution: add x86 SIMD for filter_3x3()
Tested using a simple command (apply edge enhance):
./ffmpeg_g -i ~/Downloads/bbb_sunflower_1080p_30fps_normal.mp4 \
 -vf convolution="0 0 0 -1 1 0 0 0 0:0 0 0 -1 1 0 0 0 0:0 0 0 -1 1 0 0 0 0:0 0 0 -1 1 0 0 0 0:5:1:1:1:0:128:128:128" \
 -an -vframes 1000 -f null /dev/null

The fps increases from 151 to 270 on my local machine.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
2019-08-07 14:31:28 +08:00
Ruiling Song
83f9da7768 avfilter/vf_gblur: add x86 SIMD optimizations
The horizontal pass gets ~2x performance with the patch
under a single thread.

Tested overall performance using the commands below (avx2 enabled):
./ffmpeg -i 1080p.mp4 -vf gblur -f null /dev/null
./ffmpeg -i 1080p.mp4 -vf gblur=threads=1 -f null /dev/null
For single thread, the fps improves from 43 to 60, about 40%.
For multi-thread, the fps improves from 110 to 130, about 20%.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
2019-06-12 08:53:11 +08:00
Paul B Mahol
dcae5ba322 avfilter: add anlmdn filter x86 SIMD optimizations 2019-01-10 21:49:47 +01:00
Marton Balint
6c2a7a8e9a avfilter/vf_framerate: factorize SAD functions which compute SAD for a whole frame
Also add SIMD which works on lines because it is faster than calculating it on
8x8 blocks using pixelutils.

Signed-off-by: Marton Balint <cus@passwd.hu>
2018-11-11 20:30:50 +01:00
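
The line-based approach described in the commit above can be sketched in scalar C as follows. This is only an illustration of summing absolute differences one line at a time over a whole frame; the function names sad_line8() and sad_frame8() and their signatures are hypothetical and are not taken from vf_framerate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* SAD of one line of 8-bit samples (hypothetical helper). */
static uint64_t sad_line8(const uint8_t *a, const uint8_t *b, int width)
{
    uint64_t sum = 0;

    for (int x = 0; x < width; x++)
        sum += (uint64_t)abs(a[x] - b[x]);
    return sum;
}

/* Whole-frame SAD: accumulate the per-line results. Strides are in bytes
 * and may be larger than width, so padding bytes are never read. */
static uint64_t sad_frame8(const uint8_t *a, ptrdiff_t a_stride,
                           const uint8_t *b, ptrdiff_t b_stride,
                           int width, int height)
{
    uint64_t sum = 0;

    assert(width >= 0 && height >= 0);
    for (int y = 0; y < height; y++)
        sum += sad_line8(a + y * a_stride, b + y * b_stride, width);
    return sum;
}
```

Working on whole lines means the inner loop runs over one long contiguous span, which is the shape a SIMD routine can replace directly, instead of many short 8-pixel rows.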
Paul B Mahol
6d7c63588c avfilter/vf_overlay: add x86 SIMD
Specifically for the yuv444, yuv422, and yuv420 formats when the main stream has no alpha
and the alpha is straight.

Signed-off-by: Paul B Mahol <onemda@gmail.com>
2018-05-02 23:58:21 +02:00
Vasile Toncu
9c01cdb94e avfilter/vf_interlace: remove duplicate code with same functionality 2018-04-23 23:48:30 +02:00
Marton Balint
4d95c6d5d7 avfilter/vf_framerate: add SIMD functions for frame blending
Blend function speedups on x86_64 Core i5 4460:

ffmpeg -f lavfi -i allyuv -vf framerate=60:threads=1 -f null none

C:     447548411 decicycles in Blend,    2048 runs,      0 skips
SSSE3: 130020087 decicycles in Blend,    2048 runs,      0 skips
AVX2:  128508221 decicycles in Blend,    2048 runs,      0 skips

ffmpeg -f lavfi -i allyuv -vf format=yuv420p12,framerate=60:threads=1 -f null none

C:     228932745 decicycles in Blend,    2048 runs,      0 skips
SSE4:  123357781 decicycles in Blend,    2048 runs,      0 skips
AVX2:  121215353 decicycles in Blend,    2048 runs,      0 skips

Signed-off-by: Marton Balint <cus@passwd.hu>
2018-01-28 18:50:52 +01:00
Paul B Mahol
86fda8be3f avfilter: add hflip x86 SIMD
Signed-off-by: Paul B Mahol <onemda@gmail.com>
2017-12-04 09:58:25 +01:00
Paul B Mahol
bbfcb1b7c8 avfilter/vf_threshold: add x86 SIMD
Signed-off-by: Paul B Mahol <onemda@gmail.com>
2017-12-02 14:58:56 +01:00
Paul B Mahol
01e545d046 avfilter: add limiter filter
Signed-off-by: Paul B Mahol <onemda@gmail.com>
2017-07-08 11:49:54 +02:00
Diego Biurrun
fd502f4f5f build: Generalize yasm/nasm-related variable names
None of them are specific to the YASM assembler.

(Cherry-picked from libav commit 39e208f4d4)

Signed-off-by: James Almer <jamrial@gmail.com>
2017-06-21 17:00:29 -03:00
Paul B Mahol
49bbfb9d13 avfilter: add arbitrary audio FIR filter
Signed-off-by: Paul B Mahol <onemda@gmail.com>
2017-05-09 20:47:52 +02:00
Muhammad Faiz
1e69ac9246 avfilter/avf_showcqt: cqt_calc optimization on x86
on x86_64:
        time    PSNR
plain   3.303   inf
SSE     1.649   107.087535
SSE3    1.632   107.087535
AVX     1.409   106.986771
FMA3    1.265   107.108437

on x86_32 (PSNR compared to x86_64 plain):
        time    PSNR
plain   7.225   103.951979
SSE     1.827   105.859282
SSE3    1.819   105.859282
AVX     1.533   105.997661
FMA3    1.384   105.885377

FMA4 test is not available

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Muhammad Faiz <mfcc64@gmail.com>
2016-06-08 16:09:43 +07:00
Ronald S. Bultje
5ce703a6bf vf_colorspace: x86-64 SIMD (SSE2) optimizations. 2016-04-12 16:42:48 -04:00
Thomas Mundt
5024a82e95 avfilter/vf_bwdif: add x86 SIMD
Signed-off-by: Thomas Mundt <loudmax@yahoo.de>
2016-03-13 10:06:21 +01:00
Paul B Mahol
5740dc27e1 avfilter/vf_w3fdif: add x86 SIMD
Signed-off-by: Paul B Mahol <onemda@gmail.com>
2015-10-10 17:33:43 +02:00
Paul B Mahol
ac74e857a2 avfilter/vf_stereo3d: add x86 SIMD for anaglyph outputs
Signed-off-by: Paul B Mahol <onemda@gmail.com>
2015-10-06 21:01:24 +02:00
Paul B Mahol
9762554dd0 avfilter/vf_blend: add x86 SIMD for some modes
Signed-off-by: Paul B Mahol <onemda@gmail.com>
2015-10-03 21:26:17 +02:00
Paul B Mahol
160556c9ad avfilter/vf_maskedmerge: add SIMD for maskedmerge with 8 bit depth input
Signed-off-by: Paul B Mahol <onemda@gmail.com>
2015-10-02 17:40:57 +02:00
James Darnley
bff7242608 avfilter/vf_removegrain: add x86 and x86_64 SSE2 functions
Speed of all modes increased by a factor between 7.4 and 19.8 largely depending
on whether bytes are unpacked into words.  Modes 2, 3, and 4 have been sped-up
by a factor of 43 (thanks quick sort!)

All modes are available on x86_64 but only modes 1, 10, 11, 12, 13, 14, 19, 20,
21, and 22 are available on x86 due to the number of SIMD registers used.

With a contribution from James Almer <jamrial@gmail.com>
2015-07-14 23:50:50 +00:00
Ronald S. Bultje
ae4c9ddebc vf_psnr: sse2 optimizations for sum-squared-error.
The internal line accumulator for 16bit can overflow, so I changed that
from int to uint64_t in the C code. The matching assembly looks a little
weird but output looks correct.

(avx2 should be trivial to add later.)

Reviewed-by: Paul B Mahol <onemda@gmail.com>
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2015-07-14 17:57:14 +02:00
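
The overflow mentioned in the commit above is easy to see with arithmetic: a single worst-case 16-bit difference squares to 65535 * 65535 = 4294836225, which already exceeds the largest 32-bit signed int (2147483647). The sketch below shows a line accumulator widened to uint64_t; the function name sse_line16() is hypothetical and this is not the actual vf_psnr code.

```c
#include <assert.h>
#include <stdint.h>

/* Sum of squared errors over one line of 16-bit samples.
 * The accumulator is 64-bit so that neither a single squared
 * difference nor the running sum can overflow. */
static uint64_t sse_line16(const uint16_t *a, const uint16_t *b, int width)
{
    uint64_t sum = 0;

    assert(width >= 0);
    for (int x = 0; x < width; x++) {
        int64_t d = (int64_t)a[x] - (int64_t)b[x];
        sum += (uint64_t)(d * d);
    }
    return sum;
}
```

The difference is widened before squaring as well, since squaring it in int would overflow before the result ever reached the accumulator.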
Ronald S. Bultje
dfc58584b4 vf_ssim: x86 simd for ssim_4x4xN and ssim_endN.
Both are 2-2.5x faster than their C counterpart.

Reviewed-by: Paul B Mahol <onemda@gmail.com>
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2015-07-14 05:07:07 +02:00
Arwa Arif
4c38e960d0 avfilter: Port mp=eq/eq2 to lavfi
Code adapted from James Darnley's port
Some fixes from Paul B Mahol <onemda@gmail.com>

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2015-01-26 00:14:04 +01:00
James Almer
da02ee127a x86/vf_pp7: port dctB_mmx to yasm
Reviewed-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: James Almer <jamrial@gmail.com>
2015-01-09 20:02:27 -03:00
Arwa Arif
a299cd5ab3 lavfi: port mp=pp7 to libavfilter
The only difference with mp=pp7 is that default mode is "medium", as stated
in the MPlayer docs, rather than "hard".

Signed-off-by: Stefano Sabatini <stefasab@gmail.com>
2015-01-09 17:26:31 +01:00
James Almer
466e32bf25 x86/vf_fspp: port inline asm to yasm
Reviewed-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: James Almer <jamrial@gmail.com>
2014-12-26 15:39:51 -03:00
Arwa Arif
bdc4db0ee3 lavfi: port mp=fspp to a native libavfilter filter
Signed-off-by: Stefano Sabatini <stefasab@gmail.com>
2014-12-24 16:29:18 +01:00
Michael Niedermayer
fb3eb57369 avfilter/tinterlace: add Support for ff_lowpass_line_avx() & ff_lowpass_line_sse2()
Based-on: 2e1704059a by Kieran Kunhya

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-11-15 04:02:33 +01:00
Michael Niedermayer
6f373d75e8 Merge commit '2e1704059ae8625beda2ffde847ad22c5ba416dc'
* commit '2e1704059ae8625beda2ffde847ad22c5ba416dc':
  vf_interlace: Add SIMD for lowpass filter

Conflicts:
	libavfilter/vf_interlace.c
	libavfilter/x86/Makefile

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-11-15 02:39:49 +01:00
Kieran Kunhya
2e1704059a vf_interlace: Add SIMD for lowpass filter
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
2014-11-15 00:35:31 +01:00
James Almer
864f9326fb x86/vf_noise: move asm code to a separate file
Reviewed-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: James Almer <jamrial@gmail.com>
2014-10-17 00:44:35 -03:00
skal
406a9ccffe avfilter/vf_idet: MMX/MMXEXT/SSE2 implementation of idet's filter_line()
integration by Neil Birkbeck, with help from Vitor Sessak.
core SSE2 loop by Skal (pascal.massimino@gmail.com)

Reviewed-by: Clément Bœsch <u@pkh.me>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-09-04 22:19:00 +02:00
Robert Krüger
4a38eeec38 Revert "Revert "vf_yadif: move x86 init code to x86/yadif.c""
This reverts commit 975110a85e.

Signed-off-by: Robert Krüger <krueger@lesspain.de>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-01-14 14:19:14 +01:00
Michael Niedermayer
975110a85e Revert "vf_yadif: move x86 init code to x86/yadif.c"
This reverts commit a87b17f328.
This reduces the amount of non-LGPL code, making relicensing to LGPL
easier.

Conflicts:

	libavfilter/vf_yadif.c
	libavfilter/x86/yadif.c
	libavfilter/x86/yadif_template.c
	libavfilter/yadif.h

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2013-12-01 20:26:26 +01:00
Michael Niedermayer
1ea28ffc4d Merge commit '0e730494160d973400aed8d2addd1f58a0ec883e'
* commit '0e730494160d973400aed8d2addd1f58a0ec883e':
  avfilter: x86: Port gradfun filter optimizations to yasm

Conflicts:
	libavfilter/x86/vf_gradfun_init.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-10-24 10:35:39 +02:00
Daniel Kang
0e73049416 avfilter: x86: Port gradfun filter optimizations to yasm
Signed-off-by: Diego Biurrun <diego@biurrun.de>
2013-10-23 14:50:27 +02:00
Paul B Mahol
9c774459a9 avfilter: port pullup filter from libmpcodecs
Signed-off-by: Paul B Mahol <onemda@gmail.com>
2013-09-17 17:03:36 +00:00
Clément Bœsch
a2c547ffec lavfi: add spp filter. 2013-06-14 01:27:22 +02:00
James Darnley
0a5814c9ba yadif: x86 assembly for 9 to 14-bit samples
These smaller samples do not need to be unpacked to double words
allowing the code to process more pixels every iteration (still 2 in MMX
but 6 in SSE2).  It also avoids emulating the missing double word
instructions on older instruction sets.

Like with the previous code for 16-bit samples this has been tested on
an Athlon64 and a Core2Quad.

Athlon64:
1809275 decicycles in C,    32718 runs, 50 skips
 911675 decicycles in mmx,  32727 runs, 41 skips, 2.0x faster
 495284 decicycles in sse2, 32747 runs, 21 skips, 3.7x faster

Core2Quad:
 921363 decicycles in C,     32756 runs, 12 skips
 486537 decicycles in mmx,   32764 runs,  4 skips, 1.9x faster
 293296 decicycles in sse2,  32759 runs,  9 skips, 3.1x faster
 284910 decicycles in ssse3, 32759 runs,  9 skips, 3.2x faster

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2013-03-16 22:32:54 +01:00
James Darnley
17e7b49501 yadif: x86 assembly for 16-bit samples
This is a fairly dumb copy of the assembly for 8-bit samples but it
works and produces identical output to the C version.  The options have
been tested on an Athlon64 and a Core2Quad.

Athlon64:
1810385 decicycles in C,    32726 runs, 42 skips
1080744 decicycles in mmx,  32744 runs, 24 skips, 1.7x faster
 818315 decicycles in sse2, 32735 runs, 33 skips, 2.2x faster

Core2Quad:
 924025 decicycles in C,     32750 runs, 18 skips
 623995 decicycles in mmx,   32767 runs,  1 skips, 1.5x faster
 406223 decicycles in sse2,  32764 runs,  4 skips, 2.3x faster
 387842 decicycles in ssse3, 32767 runs,  1 skips, 2.4x faster
 307726 decicycles in sse4,  32763 runs,  5 skips, 3.0x faster

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2013-03-16 22:32:34 +01:00
Diego Biurrun
e66240f22e avfilter: x86: consistent filenames for filter optimizations 2013-02-04 15:00:47 +01:00
Diego Biurrun
76d90125cd vf_hqdn3d: x86: Add proper arch optimization initialization 2013-02-01 13:11:45 +01:00
Daniel Kang
899157b308 yadif: Port inline assembly to yasm
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
2013-01-09 18:41:02 +01:00