
Alpha blending videos with OpenCV

alphablending lag opencv python

sd70 · 6 years ago

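I want to blend one video on top of another, using a third video as the alpha channel. This is my code. It works, but it is not efficient at all, and that is because of the /255 parts. It is slow and it lags.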

Is there a standard and efficient way of doing this? I want the result to be real-time. Thanks.

import cv2
import numpy as np

def main():
    foreground = cv2.VideoCapture('circle.mp4')
    background = cv2.VideoCapture('video.MP4')
    alpha = cv2.VideoCapture('circle_alpha.mp4')

    while foreground.isOpened():
        fr_foreground = foreground.read()[1]/255
        fr_background = background.read()[1]/255     
        fr_alpha = alpha.read()[1]/255

        cv2.imshow('My Image',cmb(fr_foreground,fr_background,fr_alpha))

        if cv2.waitKey(1) == ord('q'): break

    cv2.destroyAllWindows

def cmb(fg,bg,a):
    return fg * a + bg * (1-a)

if __name__ == '__main__':
    main()

3 answers | latest 6 years ago

Dan Mašek 6 years ago

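Let's first address a few obvious problems. foreground.isOpened() will return true even after you have reached the end of the video, so your program will crash at that point. The solution is twofold. First, test all 3 VideoCapture instances right after you create them, using something like: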

if not foreground.isOpened() or not background.isOpened() or not alpha.isOpened():
    print("Unable to open input videos.")  # parenthesized form runs on Python 2 and 3
    return

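That makes sure all of them were opened properly. The next part is to handle reaching the end of the video correctly. That means either checking the first of the two values returned by read(), which is a boolean flag indicating success, or testing whether the frame is None.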

while True:
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video
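As a sketch of the same end-of-video handling in reusable form, the generator below reads several captures in lockstep and stops as soon as any of them fails. read_in_lockstep and FakeCapture are hypothetical names introduced only for illustration; the sole assumption is that each capture has a read() method returning (ok, frame), as cv2.VideoCapture.read() does.

```python
def read_in_lockstep(*captures):
    """Yield one tuple of frames per iteration until any capture runs out."""
    while True:
        results = [cap.read() for cap in captures]
        # read() returns (ok, frame); stop when any stream has ended.
        if not all(ok and frame is not None for ok, frame in results):
            return
        yield tuple(frame for _, frame in results)


class FakeCapture(object):
    """Hypothetical stand-in for cv2.VideoCapture, used only for this demo."""

    def __init__(self, frames):
        self._frames = list(frames)

    def read(self):
        if self._frames:
            return True, self._frames.pop(0)
        return False, None


fg = FakeCapture(["f0", "f1", "f2"])
bg = FakeCapture(["b0", "b1"])          # the shorter stream ends the loop
frames = list(read_in_lockstep(fg, bg))
```

With real captures the loop body would become: for fr_foreground, fr_background, fr_alpha in read_in_lockstep(foreground, background, alpha).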

Additionally, it seems you never actually call cv2.destroyAllWindows(): the () are missing. Not that it really matters.

To help investigate and optimize this, I have added some detailed timing, using the timeit module and a couple of convenience functions:

from timeit import default_timer as timer

def update_times(times, total_times):
    for i in range(len(times) - 1):
        total_times[i] += (times[i+1]-times[i]) * 1000

def print_times(total_times, n):
    # Parenthesized single-argument print works on both Python 2 and 3.
    print("Iterations: %d" % n)
    for i in range(len(total_times)):
        print("Step %d: %0.4f ms" % (i, total_times[i] / n))
    print("Total: %0.4f ms" % (np.sum(total_times) / n))

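I also modified the main() function to measure the time taken by each logical step: read, scale, blend, show, waitKey. To do this I split the division into separate statements. I also made a slight modification so that it works in Python 2.x as well (there, /255 is interpreted as integer division and yields wrong results).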

times = [0.0] * 6
total_times = [0.0] * (len(times) - 1)
n = 0
while True:
    times[0] = timer()
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video
    times[1] = timer()
    fr_foreground = fr_foreground / 255.0
    fr_background = fr_background / 255.0
    fr_alpha = fr_alpha / 255.0
    times[2] = timer()
    result = cmb(fr_foreground,fr_background,fr_alpha)
    times[3] = timer()
    cv2.imshow('My Image', result)
    times[4] = timer()
    if cv2.waitKey(1) == ord('q'): break
    times[5] = timer()
    update_times(times, total_times)
    n += 1

print_times(total_times, n)
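The index-based bookkeeping above can also be written with named steps. The following is only a sketch of that alternative, not part of the measured code; StepTimer and its method names are hypothetical, and the two timed statements are placeholders for real work such as reading and blending.

```python
from timeit import default_timer as timer


class StepTimer(object):
    """Accumulates wall-clock time per named step, in milliseconds."""

    def __init__(self):
        self.totals = {}      # step name -> accumulated milliseconds
        self.order = []       # step names in first-seen order
        self._last = None

    def start(self):
        """Call at the top of each loop iteration."""
        self._last = timer()

    def mark(self, name):
        """Charge the time since the previous start()/mark() to `name`."""
        now = timer()
        if name not in self.totals:
            self.totals[name] = 0.0
            self.order.append(name)
        self.totals[name] += (now - self._last) * 1000.0
        self._last = now

    def report(self, iterations):
        """Return one line per step with the average time per iteration."""
        return ["%s: %0.4f ms" % (name, self.totals[name] / iterations)
                for name in self.order]


t = StepTimer()
for _ in range(3):
    t.start()
    sum(range(1000))          # placeholder for the "read" step
    t.mark("read")
    sorted(range(1000))       # placeholder for the "blend" step
    t.mark("blend")
lines = t.report(3)
```

Named steps avoid having to renumber the timings whenever a step is split or merged, which happens several times in the measurements that follow.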

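When I run this with 1280x800 mp4 videos as input, I notice that it really is rather sluggish, and that it only uses 15% CPU on my 6-core machine. The timing of the sections is as follows: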

Iterations: 1190
Step 0: 11.4385 ms
Step 1: 37.1320 ms
Step 2: 39.4083 ms
Step 3: 2.5488 ms
Step 4: 10.7083 ms
Total: 101.2358 ms

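This suggests that the biggest bottlenecks are the scaling step and the blending step. The low CPU usage is also suboptimal, but let's focus on the low-hanging fruit first.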

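Let's look at the data types of the numpy arrays we use. read() gives us arrays with a dtype of np.uint8, that is, 8-bit unsigned integers. However, the floating point division (as written) yields an array with a dtype of np.float64, 64-bit floating point values. We don't really need this level of precision for our algorithm, so we would be better off using 32-bit floats only. If any of the operations are vectorized, that means we can potentially do twice as many calculations in the same time.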
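The dtype behaviour described here can be checked directly. This is a small sketch on a synthetic array standing in for a decoded frame (the array contents are made up); it only inspects the result types and sizes.

```python
import numpy as np

frame = np.full((2, 2, 3), 128, dtype=np.uint8)   # stand-in for a decoded frame

as_float64 = frame / 255.0                 # Python float divisor
as_float32 = frame / np.float32(255.0)     # float32 divisor

print(frame.dtype, as_float64.dtype, as_float32.dtype)
print(as_float64.nbytes, as_float32.nbytes)
```

The float64 result occupies twice as many bytes as the float32 one, which is the extra memory traffic the scaling and blending steps pay for on every frame.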

There are two options here. We could simply cast the divisor to np.float32, which will cause numpy to give us a result with the same dtype:

fr_foreground = fr_foreground / np.float32(255.0)
fr_background = fr_background / np.float32(255.0)
fr_alpha = fr_alpha / np.float32(255.0)

This gives us the following timings:

Iterations: 1786
Step 0: 9.2550 ms
Step 1: 19.0144 ms
Step 2: 21.2120 ms
Step 3: 1.4662 ms
Step 4: 10.8889 ms
Total: 61.8365 ms

Or we could cast the arrays to np.float32 first, and then do the scaling in place.

fr_foreground = np.float32(fr_foreground)
fr_background = np.float32(fr_background)
fr_alpha = np.float32(fr_alpha)

fr_foreground /= 255.0
fr_background /= 255.0
fr_alpha /= 255.0

That gives the following timings (step 1 is now split into conversion (1) and scaling (2); the remaining steps shift by 1):

Iterations: 1786
Step 0: 9.0589 ms
Step 1: 13.9614 ms
Step 2: 4.5960 ms
Step 3: 20.9279 ms
Step 4: 1.4631 ms
Step 5: 10.4396 ms
Total: 60.4469 ms

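Both are roughly equivalent, running at about 60% of the original time. I will stick with the second option, since it becomes useful in the later steps. Let's see what else we can improve.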

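From the previous timings we can see that the scaling is no longer the bottleneck, but an idea still comes to mind: division is generally slower than multiplication, so what if we multiplied by the reciprocal?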

fr_foreground *= 1/255.0
fr_background *= 1/255.0
fr_alpha *= 1/255.0

Indeed, this gains us a millisecond. Nothing spectacular, but it was easy, so we might as well go with it:

Iterations: 1786
Step 0: 9.1843 ms
Step 1: 14.2349 ms
Step 2: 3.5752 ms
Step 3: 21.0545 ms
Step 4: 1.4692 ms
Step 5: 10.6917 ms
Total: 60.2097 ms
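To see that swapping the division for a multiplication does not change the values in any way that matters here, the two in-place variants can be compared on random data. This is a sketch with made-up input; the seed and array size are arbitrary.

```python
import numpy as np

rng = np.random.RandomState(0)
raw = rng.randint(0, 256, size=(4, 4, 3)).astype(np.uint8)

divided = np.float32(raw)
divided /= 255.0                 # in-place division, array stays float32

multiplied = np.float32(raw)
multiplied *= 1 / 255.0          # in-place multiplication by the reciprocal

# Largest absolute difference between the two scaled arrays.
max_diff = float(np.abs(divided - multiplied).max())
```

Any difference is on the order of float32 rounding error, far below the 1/255 resolution of the 8-bit input.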

Now the blending function is the biggest bottleneck, followed by the typecast of all 3 arrays. If we look at what the blending operation does:

foreground * alpha + background * (1.0 - alpha)

we can observe that, for the math to work, the only value that needs to be in the range (0.0, 1.0) is alpha.

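What if we scaled only the alpha image? Also, since multiplying by a floating point array promotes the result to floating point, what if we skipped the type conversion as well? That would mean cmb() has to return an np.uint8 array: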

def cmb(fg,bg,a):
    return np.uint8(fg * a + bg * (1-a))

and we would have

    #fr_foreground = np.float32(fr_foreground)
    #fr_background = np.float32(fr_background)
    fr_alpha = np.float32(fr_alpha)

    #fr_foreground *= 1/255.0
    #fr_background *= 1/255.0
    fr_alpha *= 1/255.0

The timings for this are

Step 0: 7.7023 ms
Step 1: 4.6758 ms
Step 2: 1.1061 ms
Step 3: 27.3188 ms
Step 4: 0.4783 ms
Step 5: 9.0027 ms
Total: 50.2840 ms

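Obviously, steps 1 and 2 are much faster, since we only do a third of the work. imshow also speeds up, since it no longer has to convert from floating point. Inexplicably, the reads also got faster (my guess is that we avoid some under-the-hood reallocations, since fr_foreground and fr_background always contain the pristine frame). We do pay the price of an additional cast in cmb(), but overall this looks like a win: we are at half of the original time.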
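The claim that only alpha needs scaling can be sanity-checked on random data. The sketch below compares the original approach (scale everything, blend in the 0..1 range, then go back to 8 bits) with the alpha-only approach; the input arrays are synthetic and the final multiplication by 255 is an assumption about how the float image would be brought back to 8 bits for comparison.

```python
import numpy as np

rng = np.random.RandomState(1)
shape = (8, 8, 3)
fg = rng.randint(0, 256, size=shape).astype(np.uint8)
bg = rng.randint(0, 256, size=shape).astype(np.uint8)
alpha_raw = rng.randint(0, 256, size=shape).astype(np.uint8)

a = np.float32(alpha_raw) * np.float32(1 / 255.0)

# Original approach: scale all three inputs, blend in the 0..1 range.
blend_unit = (fg / np.float32(255)) * a + (bg / np.float32(255)) * (1 - a)
full = np.uint8(blend_unit * 255)

# Alpha-only approach: blend directly in the 0..255 range.
alpha_only = np.uint8(fg * a + bg * (1 - a))

# Compare in a signed type so the subtraction cannot wrap around.
max_diff = int(np.abs(full.astype(np.int16) - alpha_only.astype(np.int16)).max())
```

The two results can differ by at most one grey level, which comes from floating point rounding just before the truncating cast.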

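To continue, let's get rid of the cmb() function, move its functionality into main() and split it up to measure the cost of each operation. Let's also keep the result of alpha.read() in its own variable instead of overwriting it (since we just saw that improvement in read() performance):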

times = [0.0] * 11
total_times = [0.0] * (len(times) - 1)
n = 0
while True:
    times[0] = timer()
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha_raw = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video

    times[1] = timer()
    fr_alpha = np.float32(fr_alpha_raw)
    times[2] = timer()
    fr_alpha *= 1/255.0
    times[3] = timer()
    fr_alpha_inv = 1.0 - fr_alpha
    times[4] = timer()
    fr_fg_weighed = fr_foreground * fr_alpha
    times[5] = timer()
    fr_bg_weighed = fr_background * fr_alpha_inv
    times[6] = timer()
    sum = fr_fg_weighed + fr_bg_weighed
    times[7] = timer()
    result = np.uint8(sum)
    times[8] = timer()
    cv2.imshow('My Image', result)
    times[9] = timer()
    if cv2.waitKey(1) == ord('q'): break
    times[10] = timer()
    update_times(times, total_times)
    n += 1

New timings:

Iterations: 1786
Step 0: 6.8733 ms
Step 1: 5.2742 ms
Step 2: 1.1430 ms
Step 3: 4.5800 ms
Step 4: 7.0372 ms
Step 5: 7.0675 ms
Step 6: 5.3082 ms
Step 7: 2.6912 ms
Step 8: 0.4658 ms
Step 9: 9.6966 ms
Total: 50.1372 ms

We didn't really gain anything overall, but the reads got noticeably faster.

This leads to another idea: what if we try to minimize allocations and reuse the arrays in subsequent iterations?

We can pre-allocate the necessary arrays in the first iteration (using numpy.zeros_like), right after we read the first set of frames:

if n == 0: # Pre-allocate
    fr_alpha = np.zeros_like(fr_alpha_raw, np.float32)
    fr_alpha_inv = np.zeros_like(fr_alpha_raw, np.float32)
    fr_fg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
    fr_bg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
    sum = np.zeros_like(fr_alpha_raw, np.float32)
    result = np.zeros_like(fr_alpha_raw, np.uint8)

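Now we can use

numpy.add for addition
numpy.subtract for subtraction
numpy.multiply for multiplication
numpy.copyto for type conversion

We can also merge the conversion and scaling steps (1 and 2) into a single numpy.multiply call.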

times = [0.0] * 10
total_times = [0.0] * (len(times) - 1)
n = 0
while True:
    times[0] = timer()
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha_raw = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video

    if n == 0: # Pre-allocate
        fr_alpha = np.zeros_like(fr_alpha_raw, np.float32)
        fr_alpha_inv = np.zeros_like(fr_alpha_raw, np.float32)
        fr_fg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
        fr_bg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
        sum = np.zeros_like(fr_alpha_raw, np.float32)
        result = np.zeros_like(fr_alpha_raw, np.uint8)

    times[1] = timer()
    np.multiply(fr_alpha_raw, np.float32(1/255.0), fr_alpha)
    times[2] = timer()
    np.subtract(1.0, fr_alpha, fr_alpha_inv)
    times[3] = timer()
    np.multiply(fr_foreground, fr_alpha, fr_fg_weighed)
    times[4] = timer()
    np.multiply(fr_background, fr_alpha_inv, fr_bg_weighed)
    times[5] = timer()
    np.add(fr_fg_weighed, fr_bg_weighed, sum)
    times[6] = timer()
    np.copyto(result, sum, 'unsafe')
    times[7] = timer()
    cv2.imshow('My Image', result)
    times[8] = timer()
    if cv2.waitKey(1) == ord('q'): break
    times[9] = timer()
    update_times(times, total_times)
    n += 1

This gives us the following timings:

Iterations: 1786
Step 0: 7.0515 ms
Step 1: 3.8839 ms
Step 2: 1.9080 ms
Step 3: 4.5198 ms
Step 4: 4.3871 ms
Step 5: 2.7576 ms
Step 6: 1.9273 ms
Step 7: 0.4382 ms
Step 8: 7.2340 ms
Total: 34.1074 ms

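There are significant improvements in all the steps we modified. We are down to about 35% of the time needed by the original implementation.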
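The pre-allocation pattern can be packaged into two small functions. This is a sketch on synthetic data rather than the measured code: make_buffers and blend_into are hypothetical helper names, and the out= keyword is used in place of the positional output argument.

```python
import numpy as np


def make_buffers(frame_shape):
    """Allocate every intermediate array once (hypothetical helper)."""
    names = ("alpha", "alpha_inv", "fg_w", "bg_w", "total")
    buf = {name: np.zeros(frame_shape, np.float32) for name in names}
    buf["result"] = np.zeros(frame_shape, np.uint8)
    return buf


def blend_into(buf, fg, bg, alpha_raw):
    """Blend uint8 frames, writing every intermediate into `buf`."""
    np.multiply(alpha_raw, np.float32(1 / 255.0), out=buf["alpha"])
    np.subtract(1.0, buf["alpha"], out=buf["alpha_inv"])
    np.multiply(fg, buf["alpha"], out=buf["fg_w"])
    np.multiply(bg, buf["alpha_inv"], out=buf["bg_w"])
    np.add(buf["fg_w"], buf["bg_w"], out=buf["total"])
    np.copyto(buf["result"], buf["total"], casting="unsafe")
    return buf["result"]


rng = np.random.RandomState(2)
shape = (4, 4, 3)
fg = rng.randint(0, 256, size=shape).astype(np.uint8)
bg = rng.randint(0, 256, size=shape).astype(np.uint8)
alpha_raw = rng.randint(0, 256, size=shape).astype(np.uint8)

buffers = make_buffers(shape)
first = blend_into(buffers, fg, bg, alpha_raw)
second = blend_into(buffers, bg, fg, alpha_raw)   # reuses the same memory
```

Note that first and second are the same array object, so a frame that must outlive the next call has to be copied by the caller.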

Minor update:

Based on Silencer's answer I measured cv2.convertScaleAbs as well. It actually runs a bit faster:

Step 6: 1.2318 ms

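This gave me another idea: we could take advantage of cv2.add, which lets us specify the destination data type and does a saturating cast as well. That would allow us to combine steps 5 and 6.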

cv2.add(fr_fg_weighed, fr_bg_weighed, result, dtype=cv2.CV_8UC3)

at

Step 5: 3.3621 ms

Again a small win (previously we were at around 3.9 ms).

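Following on from this, cv2.subtract and cv2.multiply are further candidates. We need to use a 4-element tuple to define a scalar (an intricacy of the Python bindings), and we need to explicitly define the output data type for the multiplication.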

    cv2.subtract((1.0, 1.0, 1.0, 0.0), fr_alpha, fr_alpha_inv)
    cv2.multiply(fr_foreground, fr_alpha, fr_fg_weighed, dtype=cv2.CV_32FC3)
    cv2.multiply(fr_background, fr_alpha_inv, fr_bg_weighed, dtype=cv2.CV_32FC3)

Timings:

Step 2: 2.1897 ms
Step 3: 2.8981 ms
Step 4: 2.9066 ms

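This seems to be about as far as we can get without some parallelization. We are already taking advantage of whatever OpenCV may provide in terms of optimizing the individual operations, so we should focus on pipelining our implementation.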

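To help me figure out how to partition the code between the different pipeline stages (threads), I made a chart that shows all the operations, our best times for them, and the inter-dependencies of the calculations.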

WIP: see the comments for additional information while I write this up.
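Since the answer stops before showing a pipelined version, here is one possible shape for it. This is only a sketch under stated assumptions, not the author's design: each stage runs in its own thread and hands frames on through a bounded queue, the frames are synthetic, and stage, scale_alpha, blend and feed are hypothetical names. Whether the threads really overlap depends on how much of the work releases the GIL, which has not been measured here.

```python
import threading
try:
    import queue              # Python 3
except ImportError:
    import Queue as queue     # Python 2

import numpy as np

_DONE = object()              # sentinel marking the end of the stream


def stage(func, inbox, outbox):
    """Run `func` on every item from `inbox`, pushing results to `outbox`."""
    def worker():
        while True:
            item = inbox.get()
            if item is _DONE:
                outbox.put(_DONE)     # pass the sentinel downstream
                return
            outbox.put(func(item))
    thread = threading.Thread(target=worker)
    thread.daemon = True
    thread.start()
    return thread


def scale_alpha(frames):
    fg, bg, alpha_raw = frames
    return fg, bg, alpha_raw * np.float32(1 / 255.0)


def blend(frames):
    fg, bg, a = frames
    return np.uint8(fg * a + bg * (1 - a))


def feed(n):
    """Stand-in for the read step: pushes n synthetic frame triples."""
    for i in range(n):
        fg = np.full((2, 2, 3), 200, np.uint8)
        bg = np.full((2, 2, 3), 100, np.uint8)
        alpha_raw = np.full((2, 2, 3), 255 if i % 2 == 0 else 0, np.uint8)
        q_in.put((fg, bg, alpha_raw))
    q_in.put(_DONE)


# Small bounded queues keep the stages from running far ahead of each other.
q_in, q_mid, q_out = queue.Queue(4), queue.Queue(4), queue.Queue(4)
stage(scale_alpha, q_in, q_mid)
stage(blend, q_mid, q_out)
threading.Thread(target=feed, args=(6,)).start()

results = []                  # the main thread plays the role of the display step
while True:
    item = q_out.get()
    if item is _DONE:
        break
    results.append(item)
```

Because every stage is a single thread fed by a FIFO queue, frames leave the pipeline in the order they entered it.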

mainactual 6 years ago

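If it is simply blend, render and forget, it makes sense to do it on the GPU. Among many other things, VTK (Visualization Toolkit) ( https://www.vtk.org ) can do this for you instead of imshow. VTK is already known from the OpenCV 3D Visualizer module ( https://docs.opencv.org/3.2.0/d1/d19/group__viz.html ), so it should not add much in the way of dependencies.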

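After this, the whole calculation part (excluding reading the video frames) boils down to cv2.mixChannels and the transfer of pixel data to the two renderers, which on my computer takes about 5 ms per iteration for a 1280x720 video.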

import sys
import cv2
import numpy as np
import vtk
from vtk.util import numpy_support
import time

class Renderer:
    # VTK renderer with two layers
    def __init__( self ):
        self.layer1 = vtk.vtkRenderer()
        self.layer1.SetLayer(0)
        self.layer2 = vtk.vtkRenderer()
        self.layer2.SetLayer(1)
        self.renWin = vtk.vtkRenderWindow()
        self.renWin.SetNumberOfLayers( 2 )
        self.renWin.AddRenderer(self.layer1)
        self.renWin.AddRenderer(self.layer2)
        self.iren = vtk.vtkRenderWindowInteractor()
        self.iren.SetRenderWindow(self.renWin)
        self.iren.Initialize()      
    def Render( self ):
        self.iren.Render()

# set background image to a given renderer (resets the camera)
# from https://www.vtk.org/Wiki/VTK/Examples/Cxx/Images/BackgroundImage
def SetBackground( ren, image ):    
    bits = numpy_support.numpy_to_vtk( image.ravel() )
    bits.SetNumberOfComponents( image.shape[2] )
    # integer division, so the tuple count stays an int on Python 3 too
    bits.SetNumberOfTuples( bits.GetNumberOfTuples()//bits.GetNumberOfComponents() )

    img = vtk.vtkImageData()
    img.GetPointData().SetScalars( bits );
    img.SetExtent( 0, image.shape[1]-1, 0, image.shape[0]-1, 0,0 );
    origin = img.GetOrigin()
    spacing = img.GetSpacing()
    extent = img.GetExtent()

    actor = vtk.vtkImageActor()
    actor.SetInputData( img )

    ren.RemoveAllViewProps()
    ren.AddActor( actor )
    camera = vtk.vtkCamera()
    camera.ParallelProjectionOn()
    xc = origin[0] + 0.5*(extent[0] + extent[1])*spacing[0]
    yc = origin[1] + 0.5*(extent[2] + extent[3])*spacing[1]
    yd = (extent[3] - extent[2] + 1)*spacing[1]
    d = camera.GetDistance()
    camera.SetParallelScale(0.5*yd)
    camera.SetFocalPoint(xc,yc,0.0)
    camera.SetPosition(xc,yc,-d)
    camera.SetViewUp(0,-1,0)
    ren.SetActiveCamera( camera )
    return img

# update the scalar data without bounds check
def UpdateImageData( vtkimage, image ):
    bits = numpy_support.numpy_to_vtk( image.ravel() )
    bits.SetNumberOfComponents( image.shape[2] )
    # integer division, so the tuple count stays an int on Python 3 too
    bits.SetNumberOfTuples( bits.GetNumberOfTuples()//bits.GetNumberOfComponents() )
    vtkimage.GetPointData().SetScalars( bits );

r = Renderer()
r.renWin.SetSize(1280,720)
cap = cv2.VideoCapture('video.mp4')
image = cv2.imread('hello.png',1)
alpha = cv2.cvtColor(image,cv2.COLOR_RGB2GRAY )
ret, alpha = cv2.threshold( alpha, 127, 127, cv2.THRESH_BINARY )
alpha = np.reshape( alpha, (alpha.shape[0],alpha.shape[1], 1 ) )

# None marks "not created yet"; comparing a numpy array with [] is unreliable
src1 = None
src2 = None
overlay = None
c = 0
while True:
    # read the data
    ret, mat = cap.read()
    if not ret:
        break
    #TODO ret, image = cap2.read() #(rgb)
    #TODO ret, alpha = cap3.read() #(mono)

    # alpha blend
    t = time.time()
    if overlay is None:
        # allocate the 4-channel overlay once and reuse it on every frame
        overlay = np.zeros( [image.shape[0],image.shape[1],4], np.uint8 )
    cv2.mixChannels( [image, alpha], [overlay], [0,0,1,1,2,2,3,3] )
    if src1 is None:
        src1 = SetBackground( r.layer1, mat )
    else:
        UpdateImageData( src1, mat )
    if src2 is None:
        src2 = SetBackground( r.layer2, overlay )
    else:
        UpdateImageData( src2, overlay )
    r.Render()
    # blending done
    t = time.time() - t

    if c % 10 == 0:
        print(1000*t)   # elapsed milliseconds, printed every 10th frame
    c = c + 1
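For readers without OpenCV at hand, the channel packing done by the cv2.mixChannels call above can be imitated in NumPy. This is a stand-in for illustration only (it is not how OpenCV implements it), and the image contents are made up.

```python
import numpy as np

h, w = 4, 6
image = np.zeros((h, w, 3), np.uint8)
image[..., 0], image[..., 1], image[..., 2] = 10, 20, 30   # B, G, R planes
alpha = np.full((h, w, 1), 127, np.uint8)                  # single-channel mask

overlay = np.zeros((h, w, 4), np.uint8)    # allocated once, reused per frame
overlay[..., :3] = image                   # input channels 0-2 -> output 0-2
overlay[..., 3] = alpha[..., 0]            # input channel 3 (alpha) -> output 3
```

The from-to list [0,0,1,1,2,2,3,3] in the original call expresses the same mapping: the three colour channels of image and the single channel of alpha are numbered 0 to 3 and copied to the same indices in overlay.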

Kinght 金 6 years ago

I'm using OpenCV 4.00-pre and Python 3.6.

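1. There is no need to do three xxx/255 operations. Doing it only for alpha is enough.

2. Mind the type conversion: prefer cv2.convertScaleAbs(xxx) over np.uint8(xxx) or np.copyto(xxx, yyy, "unsafe").

3. Pre-allocating memory should be better.

I use #2, that is cv2.convertScaleAbs, to avoid underflow/overflow; the result stays in the range [0, 255]. For example: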

>>> x = np.array([[-1,256]])
>>> y = np.uint8(x)
>>> z = cv2.convertScaleAbs(x)
>>> x
array([[ -1, 256]])
>>> y
array([[255,   0]], dtype=uint8)
>>> z
array([[  1, 255]], dtype=uint8)
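The difference between the two conversions can be reproduced without OpenCV. The sketch below uses a NumPy stand-in for the saturating behaviour (absolute value, then clamping to 0..255); it is not the OpenCV implementation and it ignores the rounding that cv2.convertScaleAbs applies to fractional values.

```python
import numpy as np

x = np.array([[-1, 256]])

# Plain cast: values wrap around modulo 256.
wrapped = x.astype(np.uint8)

# Stand-in for a saturating conversion: |x| clamped to the 0..255 range.
saturated = np.clip(np.abs(x), 0, 255).astype(np.uint8)
```

For a blended image the wraparound case is the dangerous one: a value that overshoots 255 by a small rounding error would turn into a near-black pixel instead of a white one.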

##! 2018/05/09 13:54:34

import cv2
import numpy as np
import time

def cmb(fg,bg,a):
    return fg * a + bg * (1-a)

def test2():
    cap = cv2.VideoCapture(0)
    ret, prev_frame = cap.read()
    """
    foreground = cv2.VideoCapture('circle.mp4')
    background = cv2.VideoCapture('video.MP4')
    alphavideo = cv2.VideoCapture('circle_alpha.mp4')
    """
    while cap.isOpened():
        ts = time.time()
        ret, fg = cap.read()
        alpha = fg.copy()
        bg = prev_frame
        """
        ret, fg = foreground.read()
        ret, bg = background.read()
        ret, alpha = alphavideo.read()
        """

        alpha = np.multiply(alpha, 1.0/255)
        blended = cv2.convertScaleAbs(cmb(fg, bg, alpha))
        te = time.time()
        dt = te-ts
        fps = 1/dt
        print("{:.3}ms, {:.3} fps".format(1000*dt, fps))
        cv2.imshow('Blended', blended)

        if cv2.waitKey(1) == ord('q'):
            break

    cv2.destroyAllWindows()

if __name__ == "__main__":
    test2()

Some of the output looks like this:

39.0ms, 25.6 fps
37.0ms, 27.0 fps
38.0ms, 26.3 fps
37.0ms, 27.0 fps
38.0ms, 26.3 fps
37.0ms, 27.0 fps
38.0ms, 26.3 fps
37.0ms, 27.0 fps
37.0ms, 27.0 fps
37.0ms, 27.0 fps
37.0ms, 27.0 fps
38.0ms, 26.3 fps
37.0ms, 27.0 fps
37.0ms, 27.0 fps
37.0ms, 27.0 fps
37.0ms, 27.0 fps
...