I'm trying to move some computer vision tasks to tensorflow. The most intensive ops are convolutions, such as GaussianBlur. The timings I get with timeit suggest that the tensorflow (GPU) equivalent is more than 10x slower than OpenCV.
- stdout reports "WARNING:tensorflow:AutoGraph could not transform ... and will run it as-is..." for both functions. Does this mean that no graph is created, and that this is why performance suffers?
- Can I use timeit to check performance of a tf script?
- I assumed that by creating the tf variables outside the function, the function starts from variables that are already on the GPU. Is this true? (A quick .device check is shown after the code below.)
- Is it possible to do the 2D convolution layer by layer without splitting the image into layers like I do in tf_gauss_blur_stacked()? (See the sketch right after this list.)
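For the last question, tf.nn.depthwise_conv2d might do this in a single call by applying the same 2D kernel to every channel. A minimal, unbenchmarked sketch (the name tf_gauss_blur_depthwise is mine, and it reuses kernel2D and tfA from the code further down):
@tf.function
def tf_gauss_blur_depthwise(im, kernel_dw):
    # depthwise filter shape is (H, W, in_channels, channel_multiplier)
    return tf.nn.depthwise_conv2d(im, kernel_dw, strides=[1, 1, 1, 1], padding="SAME")

# kernel2D has shape (5, 5, 1, 1); tile it over the 3 input channels -> (5, 5, 3, 1)
kernel_dw = tf.tile(kernel2D, [1, 1, 3, 1])
B_tf_dw = tf_gauss_blur_depthwise(tfA, kernel_dw)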
The code below is my current version of Gaussian blur in OpenCV and tensorflow.
import tensorflow as tf  # v2.2.0; the example requires TF 2.x
import numpy as np # v1.21.6
import cv2 # v4.6.0
from timeit import timeit
def opencv_gauss_blur(im):
    return cv2.GaussianBlur(im.copy(), (5, 5), 1.1)
@tf.function
def tf_gauss_blur_stacked(im, im_out, kernel2D):
    # convolve each channel separately with the 2D kernel and write it into the
    # preallocated output variable
    for ii in range(im.shape[-1]):
        im_slice = im[0, :, :, ii][tf.newaxis, :, :, tf.newaxis]
        im_out[0, :, :, ii].assign(
            tf.nn.conv2d(im_slice, kernel2D, strides=[1, 1, 1, 1], padding="SAME")[0, :, :, 0])
    return im_out
@tf.function
def tf_gauss_blur(im, kernel3D):
    # one conv2d call with a block-diagonal kernel that maps every channel onto itself
    return tf.nn.conv2d(im, kernel3D, strides=[1, 1, 1, 1], padding="SAME")
# input image A with dimensions (x, y, channel)
A = np.random.randint(0, 4095, (40, 50, 3)).astype(dtype=np.float32)
B = opencv_gauss_blur(A)
blur_kernel = cv2.getGaussianKernel(5, 1.1) * cv2.getGaussianKernel(5, 1.1).T
kernel2D = tf.constant(blur_kernel, dtype=tf.float32)[:, :, tf.newaxis, tf.newaxis] # shape dims: X, Y, num_input_channels, num_output_channels
kernel3D = np.zeros(shape=kernel2D.shape[:2] + (A.shape[-1], A.shape[-1]), dtype=np.float32)
for ii in range(A.shape[-1]):
    kernel3D[:, :, ii, ii] = kernel2D[:, :, 0, 0]
kernel3D = tf.constant(kernel3D, dtype=tf.float32)
tfA = tf.constant(A, dtype=tf.float32)[tf.newaxis, ] # shape dims: batch, X, Y, channel
im_out = tf.Variable(tfA)
B_tf_stack = tf_gauss_blur_stacked(tfA, im_out, kernel2D)
B_tf = tf_gauss_blur(tfA, kernel3D)
# compare away from the 2-pixel border, where OpenCV's border handling and tf's
# zero padding ("SAME") differ
print(np.abs((B_tf[0, 2:-2, 2:-2, ] - B[2:-2, 2:-2, ])).max())
print(np.abs((B_tf_stack[0, 2:-2, 2:-2, ] - B[2:-2, 2:-2, ])).max())
The max difference between OpenCV and tensorflow is < 0.001 (on a scale of 0-4095), which is sufficient agreement.
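For the third question, checking the .device attribute of the eager tensors/variables (and listing the visible GPUs) shows where they were actually placed; this is purely a diagnostic:
print(tfA.device)       # e.g. '/job:localhost/replica:0/task:0/device:GPU:0' when placed on the GPU
print(im_out.device)
print(kernel3D.device)
print(tf.config.list_physical_devices('GPU'))  # GPUs visible to tensorflow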
Timing with %timeit from the IPython console:
%timeit B = opencv_gauss_blur(A)
%timeit B_tf = tf_gauss_blur_stacked(tfA, im_out, kernel2D)
%timeit B_tf = tf_gauss_blur(tfA, kernel3D)
This gives 11, 386 and 257 µs per loop respectively (mean ± std. dev. of 7 runs, 1000 loops each). OpenCV convolves a 1D Gaussian in the X direction and then a 1D Gaussian in the Y direction, which should yield the same result as the 2D/3D tensorflow functions but needs roughly 2.5x/7.5x fewer multiply-adds.
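For reference, that separable (two 1D passes) variant can also be written in tensorflow as two depthwise convolutions. A sketch under the same setup as above (the names g1d, ky, kx and tf_gauss_blur_separable are mine; away from the border it should match the 2D version, but I haven't timed it):
g1d = cv2.getGaussianKernel(5, 1.1).astype(np.float32)            # shape (5, 1)
ky = tf.constant(np.tile(g1d[:, :, None, None], (1, 1, 3, 1)))    # (5, 1, 3, 1): vertical pass
kx = tf.constant(np.tile(g1d.T[:, :, None, None], (1, 1, 3, 1)))  # (1, 5, 3, 1): horizontal pass

@tf.function
def tf_gauss_blur_separable(im, ky, kx):
    # two 1D depthwise convolutions instead of one 5x5 convolution per channel
    tmp = tf.nn.depthwise_conv2d(im, ky, strides=[1, 1, 1, 1], padding="SAME")
    return tf.nn.depthwise_conv2d(tmp, kx, strides=[1, 1, 1, 1], padding="SAME")

B_tf_sep = tf_gauss_blur_separable(tfA, ky, kx)
One more thing I'm aware of: GPU ops can return asynchronously, so a fairer %timeit comparison probably needs the result pulled back to the host inside the timed statement, e.g. tf_gauss_blur(tfA, kernel3D).numpy().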