Metal Study Notes

Workflow for performing calculations on a GPU

https://developer.apple.com/documentation/metal/basic_tasks_and_concepts/performing_calculations_on_a_gpu

1. Start from a plain C function (the CPU version of the calculation)

void add_arrays(const float* inA,
                const float* inB,
                float* result,
                int length)
{
    for (int index = 0; index < length ; index++)
    {
        result[index] = inA[index] + inB[index];
    }
}

2. Convert the C function to Metal Shading Language (MSL)

kernel void add_arrays(device const float* inA,
                       device const float* inB,
                       device float* result,
                       uint index [[thread_position_in_grid]])
{
    // the for-loop is replaced with a collection of threads, each of which
    // calls this function.
    result[index] = inA[index] + inB[index];
}
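
To make the thread model concrete, here is a minimal CPU-side sketch in plain C, not Metal code. The names `add_arrays_thread` and `dispatch_1d` are hypothetical: the first plays the role of the kernel body, the second stands in for the GPU launching one thread per grid position. A real GPU runs these threads in parallel rather than in a loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the MSL kernel: handles exactly one element. */
static void add_arrays_thread(const float *inA, const float *inB,
                              float *result, size_t index)
{
    result[index] = inA[index] + inB[index];
}

/* Hypothetical stand-in for a 1D dispatch: calls the "kernel" once per
   grid position. On a GPU these calls run in parallel, in no fixed order. */
static void dispatch_1d(const float *inA, const float *inB,
                        float *result, size_t gridSize)
{
    assert(inA != NULL && inB != NULL && result != NULL);
    for (size_t index = 0; index < gridSize; index++) {
        add_arrays_thread(inA, inB, result, index);
    }
}
```

The loop from step 1 has not disappeared; it has moved out of the function and into the dispatch.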

3. Find a GPU device (MTLDevice)

id<MTLDevice> device = MTLCreateSystemDefaultDevice();

4. Initialize the Metal objects

MetalAdder* adder = [[MetalAdder alloc] initWithDevice:device]; // manages the objects needed to communicate with the GPU

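5. Get a reference to the Metal function

The Metal function lives in the app's default Metal library, so use the MTLDevice to get the MTLLibrary, then use the MTLLibrary to get the MTLFunction (the Metal function).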

- (instancetype) initWithDevice: (id<MTLDevice>) device
{
    self = [super init];
    if (self)
    {
        _mDevice = device;
        
        NSError* error = nil;
        
        // Load the shader files with a .metal file extension in the project

        id<MTLLibrary> defaultLibrary = [_mDevice newDefaultLibrary];
        if (defaultLibrary == nil)
        {
            NSLog(@"Failed to find the default library.");
            return nil;
        }

        id<MTLFunction> addFunction = [defaultLibrary newFunctionWithName:@"add_arrays"];
        if (addFunction == nil)
        {
            NSLog(@"Failed to find the adder function.");
            return nil;
        }
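// The official article shows no closing braces here because the initializer is not finished yet: it continues with the pipeline state (step 6) and the command queue (step 7).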

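6. Prepare a Metal pipeline

A Metal function object is not executable code. A pipeline converts the function into code the GPU can actually run. In Metal, a pipeline is represented by a pipeline state object (PSO), and the function is compiled for the specific GPU when the PSO is created.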

_mAddFunctionPSO = [_mDevice newComputePipelineStateWithFunction: addFunction error:&error];

7. Create a command queue

To send commands to the GPU, you need a command queue.

_mCommandQueue = [_mDevice newCommandQueue];

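8. Create buffers and load data

Metal manages memory through resource objects (MTLResource), which you create from an MTLDevice instance. The buffers created here are MTLBuffer objects; MTLBuffer is a protocol that inherits from MTLResource.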

_mBufferA = [_mDevice newBufferWithLength:bufferSize options:MTLResourceStorageModeShared]; // shared storage: both the CPU and the GPU can access the buffer
_mBufferB = [_mDevice newBufferWithLength:bufferSize options:MTLResourceStorageModeShared];
_mBufferResult = [_mDevice newBufferWithLength:bufferSize options:MTLResourceStorageModeShared];

[self generateRandomFloatData:_mBufferA];
[self generateRandomFloatData:_mBufferB];

- (void) generateRandomFloatData: (id<MTLBuffer>) buffer
{
    float* dataPtr = buffer.contents;
    
    for (unsigned long index = 0; index < arrayLength; index++)
    {
        dataPtr[index] = (float)rand()/(float)(RAND_MAX);
    }
}
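
The snippets above use `bufferSize` and `arrayLength` without defining them. As a sketch in plain C, assuming each buffer holds `arrayLength` floats, the length passed to `newBufferWithLength:` is a byte count rather than an element count. The value `1 << 24` is an assumption for illustration, and `float_buffer_bytes` and `make_random_floats` are hypothetical helpers that use `malloc` in place of a Metal buffer.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed element count, for illustration only. */
static const size_t arrayLength = 1u << 24;

/* Byte length for a buffer holding `count` floats. */
static size_t float_buffer_bytes(size_t count)
{
    return count * sizeof(float);
}

/* Hypothetical stand-in for buffer creation plus generateRandomFloatData:
   allocates storage and fills it with values in [0, 1]. Caller frees. */
static float *make_random_floats(size_t count)
{
    assert(count > 0);
    float *data = malloc(float_buffer_bytes(count));
    if (data == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < count; i++) {
        data[i] = (float)rand() / (float)RAND_MAX;
    }
    return data;
}
```

Under these assumptions, `bufferSize` would be `float_buffer_bytes(arrayLength)`.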

9. Create a command buffer

id<MTLCommandBuffer> commandBuffer = [_mCommandQueue commandBuffer];

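10. Create a command encoder

To write commands into the command buffer, you need a command encoder for the specific kind of command you want to record. This example uses a compute command encoder.
It encodes a compute pass, which holds a list of commands that execute compute pipelines. Each compute command causes the GPU to create a grid of threads to execute.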

id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];

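To encode a command, you make a series of method calls on the encoder. Some of them set state, such as the pipeline state object (PSO) or the arguments passed to the pipeline. After those state changes, you encode a command to execute the pipeline. The encoder writes all of the state changes and command parameters into the command buffer.

11. Set the pipeline state and argument data

First set the pipeline state object that the command should execute, then set the data that add_arrays operates on. The atIndex value corresponds to the position of the argument in add_arrays (0 for inA, 1 for inB, 2 for result). offset is a byte offset into the buffer, so a single buffer can back several arguments at different offsets.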

[computeEncoder setComputePipelineState:_mAddFunctionPSO];
[computeEncoder setBuffer:_mBufferA offset:0 atIndex:0];
[computeEncoder setBuffer:_mBufferB offset:0 atIndex:1];
[computeEncoder setBuffer:_mBufferResult offset:0 atIndex:2];
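
The remark about offsets can be pictured with plain C pointer arithmetic. This is a sketch, not Metal code: `ThreeArrayLayout`, `layout_three_arrays`, and `floats_at` are hypothetical names, and real Metal buffers may impose alignment requirements on offsets that this sketch does not model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: three float arrays packed back to back in one
   allocation. Each field is a byte offset, like the `offset:` argument. */
typedef struct {
    size_t offsetA;
    size_t offsetB;
    size_t offsetResult;
    size_t totalBytes;
} ThreeArrayLayout;

static ThreeArrayLayout layout_three_arrays(size_t count)
{
    size_t bytes = count * sizeof(float);
    ThreeArrayLayout layout = { 0, bytes, 2 * bytes, 3 * bytes };
    return layout;
}

/* View the region starting `offset` bytes into `base` as floats. */
static float *floats_at(void *base, size_t offset)
{
    assert(base != NULL && offset % sizeof(float) == 0);
    return (float *)((char *)base + offset);
}
```

With such a layout, the three `setBuffer:offset:atIndex:` calls would pass the same buffer with `offsetA`, `offsetB`, and `offsetResult` respectively.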

12. Specify the thread count and organization

Metal can create 1D, 2D, or 3D grids. This example works on a 1D array, so the grid size is arrayLength x 1 x 1.

MTLSize gridSize = MTLSizeMake(arrayLength, 1, 1);

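13. Specify the threadgroup size

Metal subdivides the grid into smaller grids called threadgroups. Each threadgroup is computed independently and can be dispatched to a different processing element on the GPU, which speeds up processing. You need to decide how large each threadgroup is.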

NSUInteger threadGroupSize = _mAddFunctionPSO.maxTotalThreadsPerThreadgroup; // largest threadgroup this pipeline allows
if (threadGroupSize > arrayLength)
{
    threadGroupSize = arrayLength;
}
MTLSize threadgroupSize = MTLSizeMake(threadGroupSize, 1, 1);

14. Encode the compute command to execute the threads

[computeEncoder dispatchThreads:gridSize
          threadsPerThreadgroup:threadgroupSize];
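
`dispatchThreads:` accepts a grid size that is not a multiple of the threadgroup size. On GPUs without non-uniform threadgroup support, the usual alternative is `dispatchThreadgroups:threadsPerThreadgroup:`, which launches whole threadgroups, so the group count has to be rounded up and the kernel has to skip out-of-range indices. The arithmetic, sketched in plain C with hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>

/* Number of threadgroups needed to cover `gridSize` threads (rounds up). */
static size_t threadgroup_count(size_t gridSize, size_t groupSize)
{
    assert(groupSize > 0);
    return (gridSize + groupSize - 1) / groupSize;
}

/* Global 1D index of a thread, from its group and its position in it. */
static size_t global_index(size_t group, size_t groupSize, size_t local)
{
    return group * groupSize + local;
}

/* Threads launched beyond the end of the data when whole groups are used;
   a kernel would return early for these (index >= gridSize). */
static size_t excess_threads(size_t gridSize, size_t groupSize)
{
    return threadgroup_count(gridSize, groupSize) * groupSize - gridSize;
}
```

For example, 1000 elements with groups of 256 threads need 4 groups, and the last 24 threads of the fourth group have nothing to do.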

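The same encoder can encode multiple compute commands into the compute pass without repeating any of the earlier steps. For example, you can set the pipeline state object once, then set arguments and encode a command for each set of buffers.

15. End the compute pass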

[computeEncoder endEncoding];

16. Commit the command buffer to execute its commands

[commandBuffer commit];

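Metal executes the commands asynchronously. Once the GPU has finished all of them, the command buffer is marked as completed.

17. Wait for the calculation to complete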

[commandBuffer waitUntilCompleted];

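This method blocks the calling thread until the calculation completes. Alternatively, register a handler on the command buffer with addCompletedHandler: (before committing it), or check the status property to find out whether it has completed.

18. Read the results from the buffer

The sample reads the results back and recomputes them on the CPU to check that the GPU output is correct.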

- (void) verifyResults
{
    float* a = _mBufferA.contents;
    float* b = _mBufferB.contents;
    float* result = _mBufferResult.contents;

    for (unsigned long index = 0; index < arrayLength; index++)
    {
        if (result[index] != (a[index] + b[index]))
        {
            printf("Compute ERROR: index=%lu result=%g vs %g=a+b\n",
                   index, result[index], a[index] + b[index]);
            assert(result[index] == (a[index] + b[index]));
        }
    }
    printf("Compute results as expected\n");
}
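
The exact comparison above suits this sample, where each element is a single addition. For kernels with longer arithmetic chains, GPU and CPU results may differ in the last bits, so a tolerance-based check is a common alternative. A sketch in plain C; `nearly_equal`, `count_mismatches`, and the tolerance values are assumptions for illustration, not part of the sample:

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value without pulling in the math library. */
static float abs_float(float x)
{
    return x < 0.0f ? -x : x;
}

/* True if x and y agree within an absolute or a relative tolerance. */
static int nearly_equal(float x, float y, float absTol, float relTol)
{
    float diff = abs_float(x - y);
    float ax = abs_float(x);
    float ay = abs_float(y);
    float scale = ax > ay ? ax : ay;
    return diff <= absTol || diff <= relTol * scale;
}

/* Counts elements where result[i] is not close to a[i] + b[i]. */
static size_t count_mismatches(const float *a, const float *b,
                               const float *result, size_t count)
{
    assert(a != NULL && b != NULL && result != NULL);
    size_t mismatches = 0;
    for (size_t i = 0; i < count; i++) {
        if (!nearly_equal(result[i], a[i] + b[i], 1e-6f, 1e-5f)) {
            mismatches++;
        }
    }
    return mismatches;
}
```

Returning a mismatch count instead of asserting on the first failure makes it easier to see whether an error is isolated or systematic.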