CUDA，多线程，单线程比较（密集恐惧症慎入）

这篇文章针对同一个任务进行了单线程，多线程和CUDA程序的比较。以显示GPU在并行计算上的时间节省能力。
对GPU编程不了解的同学，这篇文章可能不会有特别大的帮助，因为我不打算对CUDA编程有很详细的讲解，我浏览了一下网上，无论所英文还是中文都有了很详细的入门讲解，比我目前能写出的好，就不重新造轮子了。下面是几篇写地不错的入门博客。内容很类似，看其中任何任意一篇即可。
https://www.jianshu.com/p/34a504af8d51
https://www.cnblogs.com/skyfsm/p/9673960.html
https://zhuanlan.zhihu.com/p/34587739
对于CUDA编程，仅仅列出要了解的一些关键词
kernel
SM(streaming multiprocessor), SP(streaming processor)
Block, Thread, Grid
CUDA编程比较麻烦的点是你需要根据具体的问题去设计你的GPU资源如何分配，但是CUDA编程也有些比较固定的步骤，会在下面的程序中看到。上面参考的文章列举了一些很简单的例子，比如利用CUDA完成矩阵的加法或者矩阵的乘法等，下面要列举的例子会稍稍复杂一丢丢，是处理图像，因为GPU常用来处理图像，我想这个例子更合适。代码存放于我的Github
例子要实现的内容，说来也很简单，我们要找一张图像的区域梯度最大点。区域是一个7X7像素的方块。假设有一张宽739高458的图像，向下取整，横向有739/7 $\approx$ 105，纵向有458/7 $\approx$ 65，所以这张图像我们能划分出105*65 = 6825个方块。
一个像素点在图像中的位置假设是 $(x,y)$ ，该点的灰度值是 $I(x,y)$ ，x方向和y方向梯度的计算方法
$g_x = I(x+1,y) - I(x-1,y) \\ g_y = I(x,y+1) - I(x,y-1)$
我们会遍历每个像素点，计算哪个点的梯度的平方和根最大，即找到每个区域的
$max(\sqrt{g_x^2 + g_y^2})$
原图如下

demo.png

单线程

单线程的计算很简单，程序如下

#include <opencv2/opencv.hpp>
#include <Eigen/Core>

int cell_ = 4;
int pattern_gradient_ = 3;//2*pattern + 1 = 7
std::vector<cv::KeyPoint> hg_keys_;

void ExtractHighGradient(const cv::Mat &im){
    //cell row and cell columns
    int rows = im.rows;
    int cols = im.cols;
    int c_row = im.rows/cell_;
    int c_col = im.cols/cell_;

    int k = 0, l = 0;
    //divided image into cell by cell parts, in each part we'll extract high gradient points in a 7 by 7 pattern and calcuate one NID
    for(int k = 0; k < cell_; k++)
        for(int l = 0; l < cell_; l++)
            for(int i = c_row*k; i < c_row*(k+1); i += (2*pattern_gradient_+1)){
                if(i == 0 || i >= (rows - (2*pattern_gradient_ + 1)))
                    continue;
                for(int j = c_col*l; j < c_col*(l+1); j += (2*pattern_gradient_+1)){
                    if(j == 0 || j>=(cols - (2*pattern_gradient_ + 1)) )
                        continue;
                    //find the pixel that has max gradient in the `2*pattern_gradient+1` by ``2*pattern_gradient+1` pattern
                    float max_gradient_norm = 0.0;
                    float gradient_norm = 0.0;
                    cv::KeyPoint lmgp; //local_max_gradient_point
                    for(int y = i; y < i+2*pattern_gradient_+1; y++){
                        for(int x = j; x< j+2*pattern_gradient_+1; x++){
                            //if(y+1<rows && x+1<cols && y-1>=0 && x-1>=0){
                                Eigen::Vector2d gradient(
                                    im.ptr<uchar>(y+1)[x] - im.ptr<uchar>(y-1)[x],
                                    im.ptr<uchar>(y)[x+1] - im.ptr<uchar>(y)[x-1]
                                );
                                gradient_norm = gradient.norm();
                            if(gradient_norm > max_gradient_norm){
                                lmgp.pt.x = x;
                                lmgp.pt.y = y;
                                max_gradient_norm = gradient_norm;
                            }
                        }
                    }
                    hg_keys_.push_back(lmgp);
                }
            }
    
}

int main(int argc, char* argv[]){
    cv::Mat test;
    std::string add("../demo.png");
    if(argc != 2)
        std::cout<<"use default address "<<add.c_str()<<std::endl;
    else{
        add.clear();
        add.append(argv[1]);
    }
    
    test = cv::imread(add, 0);//0 for gray and 1 for rgb
    
    cv::Mat im_with_keypoints;

    double extract_time = (double)cv::getTickCount();
    ExtractHighGradient(test);

    extract_time = ((double)cv::getTickCount() - extract_time)/cv::getTickFrequency();
    std::cout<<"extract time is "<<extract_time<<std::endl;

    std::cout<<"point size "<<hg_keys_.size()<<std::endl;

    cv::drawKeypoints(test, hg_keys_, im_with_keypoints);

    cv::namedWindow("show key points", cv::WINDOW_AUTOSIZE);
    cv::imshow("show key points", im_with_keypoints);

    cv::waitKey(0);
}

代码里有个变量cell，这是将一整张图片再细分成4X4的16个方块区域，多线程我需要把图片分成数个区域，每个线程并行且处理一个区域。单线程不必这么做，不过为了和接下来的多线程比较罢了。我把每个7X7小方块的局部最大梯度点给画了出来。
运行程序，获得结果如下

use default address ../demo.png
extract time is 0.101059
point size 6996

hg_points.png

上面显示出一个线程找到这些点总共用了0.101058秒。找到6996个点。中间的白色区域像素点全为0，所以没有局部梯度最大点。

多线程

多线程，正如我前面所说，我把图像分为4X4个cell。代码里我使用了8个线程，每个线程负责计算两个cell。
在使用多线程时，我们经常需要注意的问题就是各个线程之间不要冲突。比如我的代码中有一个全局变量hg_keys_向量，每找到一个局部最大梯度点，就会push到hg_keys_里。如果两个线程同时做这个操作，就会起冲突，因此在一个线程往hg_keys_ push新的点的时候，必须要把hg_keys_锁住，不让其他线程在此时对他进行操作。代码中这部分如下

                    mtx.lock();//avoid push_back conflix problem
                    hg_keys_.push_back(lmgp);//lmgp is local maximum gradient point
                    mtx.unlock();

全部代码内容如下，注意多线程函数调用结束之后一定需要调用线程.join()函数。

#include <opencv2/opencv.hpp>
#include <Eigen/Core>
#include <thread>
#include <mutex>

int cell_ = 4;
int pattern_gradient_ = 3;
std::vector<cv::KeyPoint> hg_keys_;
static const int num_threads = 8;
int repeat_ = cell_ * cell_ / num_threads;
std::mutex mtx;

void ExtractHighGradient(int flag, int cell_part, const cv::Mat &im){
    //cell row and cell columns
    int rows = im.rows;
    int cols = im.cols;
    int c_row = im.rows/cell_;
    int c_col = im.cols/cell_;
    if(flag == 0){
        //no consider monocular case for now
    }
    else{
        int k = cell_part/cell_, l = cell_part%cell_;
        int count = 0;
        //divided image into cell by cell parts, in each part we'll extract high gradient points in a 7 by 7 pattern and calcuate one NID
        while(k<cell_){
            for(int i = c_row*k; i < c_row*(k+1); i += (2*pattern_gradient_+1)){
                if(i == 0 || i >= (rows - (2*pattern_gradient_ + 1)))
                    continue;
                for(int j = c_col*l; j < c_col*(l+1); j += (2*pattern_gradient_+1)){
                    if(j == 0 || j>=(cols - (2*pattern_gradient_ + 1)) )
                        continue;
                    //find the pixel that has max gradient in the `2*pattern_gradient+1` by ``2*pattern_gradient+1` pattern
                    float max_gradient_norm = 0.0;
                    float gradient_norm = 0.0;
                    cv::KeyPoint lmgp; //local_max_gradient_point
                    for(int y = i; y < i+2*pattern_gradient_+1; y++){
                        for(int x = j; x< j+2*pattern_gradient_+1; x++){
                                Eigen::Vector2d gradient(
                                    im.ptr<uchar>(y+1)[x] - im.ptr<uchar>(y-1)[x],
                                    im.ptr<uchar>(y)[x+1] - im.ptr<uchar>(y)[x-1]
                                );
                            gradient_norm = gradient.norm();
                            if(gradient_norm > max_gradient_norm){
                                lmgp.pt.x = x;
                                lmgp.pt.y = y;
                                max_gradient_norm = gradient_norm;
                            }
                        }
                    }
                    mtx.lock();//avoid push_back conflix problem
                    hg_keys_.push_back(lmgp);
                    mtx.unlock();
                }
            }
            k += repeat_;
        }
    }
}

//I use 8 threads and the speed is only 2 or 3 times faster = = .....
int main(int argc, char* argv[]){
    cv::Mat test;
    std::string add("../demo.png");
    if(argc != 2)
        std::cout<<"use default address "<<add.c_str()<<std::endl;
    else{
        add.clear();
        add.append(argv[1]);
    }
    
    test = cv::imread(add, 0);//0 for gray and 1 for rgb
    cv::Mat im_with_keypoints;

    std::thread t[num_threads];

    double extract_time = (double)cv::getTickCount();
    for (int i = 0; i < num_threads; ++i) {
        t[i] = std::thread(ExtractHighGradient, 1, i, test);
    }
    for (int i = 0; i < num_threads; ++i) {
        t[i].join();
    }
    //ExtractHighGradient(1, test);
    extract_time = ((double)cv::getTickCount() - extract_time)/cv::getTickFrequency();
    std::cout<<"extract time is "<<extract_time<<std::endl;

    std::cout<<"point size "<<hg_keys_.size()<<std::endl;

    cv::drawKeypoints(test, hg_keys_, im_with_keypoints);

    cv::namedWindow("show key points", cv::WINDOW_AUTOSIZE);
    cv::imshow("show key points", im_with_keypoints);

    cv::waitKey(0);
}

程序运行结果如下

use default address ../demo.png
extract time is 0.0548764
point size 6996

可以看到，同样提取出了6996个点，8个线程把速度提升到了55ms左右，比之前单线程快了一倍的样子。并没有想象中的8倍。可能原因有很多，比如我们锁住hg_keys_时，其他线程必须等待，会耗费一些时间。当然最后的图像结果也基本一样。

multi_thread.png

CUDA

CUDA的程序和前面的单线程多线程看起来就是两个世界的程序了，内容非常不一样。我们可以想象GPU有几千几万个线程可以一起运行，所以我们只需要让每一个GPU的线程单独处理一个7*7的小像素块就行了。
由于GPU和CPU是两个不同的硬件，所以在需要使用GPU时，我们都需要把数据先从CPU拷贝到GPU，在GPU中处理完后，再拷贝回来。
每次使用了GPU后，我们也需要手动释放里面的空间。还有一些小细节，大概写在了代码里面，见代码里的步骤1到8相对比较固定。

#include <opencv2/opencv.hpp>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

__global__ void ExtractHighGradient(uchar* im, cv::KeyPoint* kp,int rows, int cols, int pattern_row, int pattern_col){
    //NOTE THIS.......if there are a lot of pixel that index_x is larger than threadIdx.x_max, then they won't be reached
    int thread_idx = threadIdx.x + blockIdx.x * blockDim.x;
    int thread_idy = threadIdx.y + blockIdx.y * blockDim.y;
    //int pattern = 3;
    int p = 7;//2*3+1
    int index;
    float gradient_x, gradient_y, norm;
    float max_norm = 0.0;
    if(thread_idx>pattern_col || thread_idy>pattern_row)
        return;
    // if(thread_idy == 0)
    //     printf("id x %d and pattern col is %d \n", thread_idx, pattern_col);
    for(int i = 0; i < p; i++)
        for(int j = 0; j< p; j++){
            index = i + p * (thread_idx+1) + (p*(thread_idy+1) + j)*cols;
            gradient_x = std::abs(im[index + 1] - im[index - 1]);//right and left
            gradient_y = std::abs(im[index + cols] - im[index - cols]);//down and up
            norm = sqrt(gradient_x*gradient_x + gradient_y*gradient_y);
            //push_to_key points
            if(norm>max_norm){
                kp[thread_idx + thread_idy*pattern_col].pt.y = p*(thread_idy+1) + j;
                kp[thread_idx + thread_idy*pattern_col].pt.x = p*(thread_idx+1) + i;
                max_norm = norm;
            }
        }
};

int main(int argc, char* argv[]){
    cv::Mat test;
    std::string add("../demo.png");
    if(argc != 2)
        std::cout<<"use default address "<<add.c_str()<<std::endl;
    else{
        add.clear();
        add.append(argv[1]);
    }
    //See https://stackoverflow.com/questions/25346395/why-is-the-first-cudamalloc-the-only-bottleneck
    //First cuda_realted function is responsible for initailize the cuda. It may cost about 100ms. once the initialization is done, copy data or do some other things shoudl be fast, so we just call a cudaFree(0) to initialize the device
    cudaFree(0);
    test = cv::imread(add, 0);//0 for gray and 1 for rgb
    cv::Mat im_with_keypoints;

    int rows = test.rows;
    int cols = test.cols;
    int p = 7;
    int row_pattern_num = rows / p + 1;
    int col_pattern_num = cols / p + 1;
    int keypoint_num = row_pattern_num * col_pattern_num;
    //std::cout<<"keypoint num is "<<keypoint_num<<std::endl; 
    int image_size = sizeof(uchar) * rows * cols;
    int keypoint_size = sizeof(cv::KeyPoint) * keypoint_num;

    //printf("size of keypoint %ld \n", sizeof(cv::KeyPoint));

    double extract_time = (double)cv::getTickCount();
    //1 allocate memory in the host(CPU)
    uchar* host_input = (uchar *)malloc(image_size);
    cv::KeyPoint* host_output = (cv::KeyPoint*)malloc(keypoint_size);

    //2 assign value to the data (In reality, you get your data from a thousand ways)
    host_input = test.data;

    //3 allocate memory in the device (GPU)
    uchar* device_input;
    cv::KeyPoint* device_output;
    
    //double alloc_time = (double)cv::getTickCount();
    cudaMalloc((void**)&device_output, keypoint_size);
    cudaMalloc((void**)&device_input, image_size);
    
    //4 copy the data from the host to the device
    cudaMemcpy(device_input, host_input, image_size, cudaMemcpyHostToDevice);
    cudaMemcpy(device_output, host_output, keypoint_size, cudaMemcpyHostToDevice);

    //5 assign block number and thread number in the block
    // On my lenovo computer
    // Maximum number of threads per multiprocessor:  2048
    // Maximum number of threads per block:           1024
    // 6 multi processor
    dim3 block_number = dim3(4, 4);
    dim3 thread_number = dim3(32, 32);

    //6 call function in the device. During this step, results should be ready.
    ExtractHighGradient<<<block_number, thread_number>>>(device_input, device_output, rows, cols, row_pattern_num, col_pattern_num);//in cpp file '<<<' doesn't make sense and will lead to error

    cudaDeviceSynchronize();

    //7 copy memory from device to host
    cudaMemcpy(host_output, device_output, keypoint_size, cudaMemcpyDeviceToHost);

    extract_time = ((double)cv::getTickCount() - extract_time)/cv::getTickFrequency();
    std::cout<<"extract time is "<<extract_time<<std::endl;

    std::vector<cv::KeyPoint> hg_points(keypoint_num);
    for(int i = 0; i <keypoint_num; i++){
        hg_points[i] = host_output[i];
    }

    std::cout<<"point size "<<hg_points.size()<<std::endl;

    cv::drawKeypoints(test, hg_points, im_with_keypoints);

    cv::namedWindow("show key points", cv::WINDOW_AUTOSIZE);
    cv::imshow("show key points", im_with_keypoints);

    cv::waitKey(0);
    //8 free the momory in device and the host
    //free(host_input);//free host_input will leads to double free or corruption, maybe its because free host_input will free image.data too, which will make the image means nothing. Just a guess. No need to free here in fact, when the program is done, everthing will be freed
    free(host_output);
    cudaFree(device_input);
    cudaFree(device_output);
    return 1; 
}

运行程序,得到结果

use default address ../demo.png
extract time is 0.000418617
point size 6996

最后出来的图像结果大底相同

GPU_result

但是可以看到速度有了爆炸性的提升，提取出相同数量的点只用了大概40us，比起单线程快了大概200倍。我的GPU还只是GTX1050渣渣，如果更好的GPU会更快。

CUDA处理图像的一些注意点

1: 前面我们讲到CUDA编程很麻烦的一点是你需要分配好资源。可以看到我的CUDA代码中大概有下面的内容

...
cv::KeyPoint* host_output = (cv::KeyPoint*)malloc(keypoint_size);

这句话是为最终的输出分配好空间，在CPU编程时，我们多了一个点push进向量就可以了，没有太考虑最终会有多少个点。而在CUDA编程中，我们先得考虑最终会有多少个点并预分配空间，可多不可少。另外我们还得设计好需要多少个block以及多少个thread。我们先前说每一个GPU的线程我们处理一个7*7的像素块，如果我们分配少了线程，肯定有些像素块就会处理不到，比如下面的图，图像右边没处理到

not_enough_threads

如果分配多了一些线程，那些线程我们得做一些其他处理(比如直接返回)。

2:CUDA代码需要初始化。任何一个CUDA函数都可以初始化CUDA。而这需要消耗大概几十到一百毫秒的时间。比如我程序中简单的使用了

cudaFree(0);

进行初始化。前面我给的几个参考链接有举CUDA程序的例子，进行简单的矩阵加法耗费了几十上百ms的时间，其实不是的，作者忘记这其中包含了CUDA初始化的时间。如果使用CPU处理1000张上面类似的图片，一张消耗100ms，1000张消耗的时间就是100s，然而使用GPU，就是初始化的几十ms+1000*400us。

3:在GPU里处理图像时，我们没法再使用opencv之类提供的懒人包了。比如我们要获取图像中位于坐标 $(x,y)$ 的像素点的灰度值，有opencv，我们只需要调用

im.ptr<uchar>(y)[x]

就可以得到像素值。你如果写python文件，用PIL之类的获取像素值也很类似，输入坐标就可以了。而在GPU的函数中，我们只能使用比较底层的储存方式。
比如opencv的矩阵，它底层其实还是一个一维的数组。opencv的Mat里有一个成员是uchar* data，这个成员是一个数组指针，数据的最底层是储存在这个数组里的。如果是一个灰度图，那么data[0]会返回像素点(0,0)的灰度值，如果要获取像素点（1,0）的灰度值，那么对应data的索引就应该是data[0 + img_width * 0]。即像素点(x,y)的灰度值

intensity = im.ptr<uchar>(y)[x]

等价于

uchar* pointer;
pointer = im.data;
intensity = pointer[x + y * img_width]

其中img_width为图像宽度。如果图像不是灰度图(比如16位深度图)，Mat每一个元素类型为ushort，我们不能用im.data来获取某个像素的值，因为data返回的是uchar类型的指针，而需要类似下面的东西

ushort* pointer;
pointer = im.ptr<ushort>(0);
depth = pointer[x + y * img_width]

depth为返回的深度值。
GPU没法处理Mat类型的变量，因此使用GPU处理图像时，都是把Mat里的一维数组的指针提取出来，传入到GPU里的。对应我上面CUDA的代码里的

//分配CPU上需要的一张图片的数据的空间。image_size为图像长*宽，每一个元素储存了一个uchar类型的变量，所以整体分配的空间是uchar的长度，8个字节，乘上image_size。
uchar* host_input = (uchar *)malloc(image_size);
...
//test是图像，test.data返回第一个元素的指针
host_input = test.data;
...
//给GPU分配同样大小的空间
cudaMalloc((void**)&device_input, image_size);
...
//把数据从CPU拷贝到GPU
cudaMemcpy(device_input, host_input, image_size, cudaMemcpyHostToDevice);

最后在global函数中我们获取像素点的方法就是

__global__ void ExtractHighGradient(uchar* im, cv::KeyPoint* kp,int rows, int cols, int pattern_row, int pattern_col){
...
//这里im不是cv::Mat，只是个uchar指针
// index = x + y*cols;
//用im[index]来获取位于(x, y)处像素的灰度值
...
}

总结

可能玩儿深度学习之类同学表示，pytorch里一行代码

img = img.to(device='gpu', dtype=torch.float32)

就把图像拷贝到GPU里去算了，结果没想到背后辣么麻烦。可不是嘛，所以做这些工具简化人类工作的老铁就是很强呢。opencv也提供了一些函数的GPU接口，节省大家的设计精力。不过呢，总有不少时候可没有pytorch, tensorflow这种东西了，得老老实实把数据拷贝到GPU去计算并设计GPU多少线程要用啊什么的，脑壳疼。
通过数据对比，我们还是了解到了GPU强大的并行能力。处理上面一张图片用了400us，这400us中大多数数据拷贝的时间，实际的计算时间更少的。

上一篇关于docker文章的总结给扶洋洋博士找女朋友，还没有着落，那我就再宣传一遍(手动滑稽)：扶洋洋博士，优质靠谱单身男青年一枚，德州A&M优秀博士后科研工作者，攀岩大神，欢迎获取联系方式。