使用CUDA内核获得堆栈溢出

2019-11-15 原文

我编程的代码存在很大问题.
我不是专家,在来这里之前我问了很多人.也纠正了很多事情.所以,我想我已准备好向您展示代码并向您提问我的问题.
我会把整个代码放在这里,以便让你很好地理解我的问题.
我想做的事情是,如果ARRAY_SIZE对于THREAD_SIZE来说太大了,那么我将大数组的数据放入一个较小的数组中,特别是用THREAD_SIZE大小创建.
然后,我将它发送到内核并做我必须做的任何事情.但是我有问题

isub_matrix[x*THREAD_SIZE+y]=big_matrix[x*ARRAY_SIZE+y];

由于堆栈溢出,代码停止的位置.首先,我制作了big_matrix的双指针.但freenode irc网络#cuda频道的人告诉我,cpu内存太大而无法处理它,我应该创建一个线性指针.我做到了,但我仍然有同样的堆栈溢出问题.所以,在这里…经过一些更改后更新,但还没有工作(堆栈溢出停止,但是链接和清单更新失败)

#define ARRAY_SIZE 2048
#define THREAD_SIZE 32
#define PI 3.14


int main(int argc,char** argv) 
{
        int array_plus=0,x,y;
        float time;
        //unsigned int memsize=sizeof(float)*THREAD_SIZE*THREAD_SIZE;
        //bool array_rest;
        cudaEvent_t start,stop;
        float *d_isub_matrix;

    float *big_matrix = new float[ARRAY_SIZE*ARRAY_SIZE];
    float *big_matrix2 = new float[ARRAY_SIZE*ARRAY_SIZE];
    float *isub_matrix = new float[THREAD_SIZE*THREAD_SIZE];
    float *osub_matrix = new float[THREAD_SIZE*THREAD_SIZE];

        //if the array's size is not compatible with the thread's size,it won't work.

        //array_rest=(ARRAY_SIZE*ARRAY_SIZE)/(THREAD_SIZE*THREAD_SIZE);
        //isub_matrix=(float*) malloc(memsize);
        //osub_matrix=(float*) malloc(memsize);

        if(((ARRAY_SIZE*ARRAY_SIZE)%(THREAD_SIZE*THREAD_SIZE)==0))
        {

            //allocating space in cpu memory and GPU memory for the big matrix and its sub matrixes
            //it has to be like this (lots of loops)



            //populating the big array
            for(x=0;x<ARRAY_SIZE;x++)
            {
                for(y=0;y<ARRAY_SIZE;y++)
                    big_matrix[x*ARRAY_SIZE+y]=rand()%10000;
            }

            //kind of loop for the big array

            //Start counting the time of processing (everything)
            cudaEventCreate(&start);
            cudaEventCreate(&stop);

            cudaEventRecord(start,0);

            while(array_plus<ARRAY_SIZE)
            {

                //putting the big array's values into the sub-matrix

                for(x=0;x<THREAD_SIZE;x++)
                {
                    for(y=0;y<THREAD_SIZE;y++)
                        isub_matrix[x*THREAD_SIZE+y]=big_matrix[(x+array_plus)*ARRAY_SIZE+y];
                }

                cudamalloc((void**)&d_isub_matrix,THREAD_SIZE*THREAD_SIZE*sizeof(float));
            cudamalloc((void**)&osub_matrix,THREAD_SIZE*THREAD_SIZE*sizeof(float));
            cudamemcpy(d_isub_matrix,isub_matrix,((THREAD_SIZE*THREAD_SIZE)*sizeof(float)),cudamemcpyHostToDevice);

                //call the cuda kernel

                twiddle_factor<<<1,256>>>(isub_matrix,osub_matrix);//<----

                cudamemcpy(osub_matrix,cudamemcpyDevicetoHost);

                array_plus=array_plus+THREAD_SIZE;
                for(x=0;x<THREAD_SIZE;x++)
                {
                    for(y=0;y<THREAD_SIZE;y++)
                        big_matrix2[x*THREAD_SIZE+array_plus+y]=osub_matrix[x*THREAD_SIZE+y];
                }

                array_rest=array_plus+(ARRAY_SIZE);

                cudaFree(isub_matrix);
                cudaFree(osub_matrix);
                system("PAUSE");
            }

            //Stop the time

            cudaEventRecord(stop,0);
            cudaEventSynchronize(stop);
            cudaEventelapsedtime(&time,start,stop);

            //Free memory in GPU




            printf("The processing time took... %fms to finish",time);
                    system("PAUSE");

        }
        printf("The processing time took...NAO ENTROU!");
        system("PAUSE");
        return 0;
}

//things to do: TRANSPOSITION!!!!

另一个问题是关于平行部分.
编译器(Visual Studio)说我一次搞了太多pow()和exp().
我该如何解决这个问题？

if((xIndex<THREAD_SIZE)&&(yIndex<THREAD_SIZE))
    {
        block[xIndex][yIndex]=exp(sum_sin[xIndex][yIndex])+exp(sum_cos[xIndex][yIndex]);
    }

原始代码在这里.我评论它是因为我想知道至少我的代码是否在GPU中占据了一些价值.但它甚至没有启动内核……太可悲了)

__global__ void twiddle_factor(float *isub_matrix,float *osub_matrix)
{
    __shared__ float block[THREAD_SIZE][THREAD_SIZE];
    // int x,y,z;
    unsigned int xIndex = threadIdx.x;
    unsigned int yIndex = threadIdx.y;
    /*
    int sum_sines=0.0;
    int sum_cosines=0.0;
    float sum_sin[THREAD_SIZE],sum_cos[THREAD_SIZE];
    float angle=(2*PI)/THREAD_SIZE;

    //put into shared memory the FFT calculation (F(u))

    for(x=0;x<THREAD_SIZE;x++)
    {
        for(y=0;y<THREAD_SIZE;y++)
        {
            for(z=0;z<THREAD_SIZE;z++)
            {
                sum_sines=sum_sin+sin(isub_matrix[y*THREAD_SIZE+z]*(angle*z));
                sum_cosines=sum_cos+cos(isub_matrix[y*THREAD_SIZE+z]*(angle*z));

            }
            sum_sin[x][y]=sum_sines/THREAD_SIZE;
            sum_cos[x][y]=sum_cosines/THREAD_SIZE;

        }
    }
    */

    if((xIndex<THREAD_SIZE)&&(yIndex<THREAD_SIZE))
        block[xIndex][yIndex]=pow(THREAD_SIZE,0.5);

        //block[xIndex][yIndex]=pow(exp(sum_sin[xIndex*THREAD_SIZE+yIndex])+exp(sum_cos[xIndex*THREAD_SIZE+yIndex]),0.5);

        __syncthreads();

    //transposition X x Y
    //transfer back the results into another sub-matrix that is allocated in cpu

    if((xIndex<THREAD_SIZE)&&(yIndex<THREAD_SIZE))
            osub_matrix[yIndex*THREAD_SIZE+xIndex]=block[xIndex][yIndex];



    __syncthreads();
}

感谢您阅读所有内容！

以下是整个代码：

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#define ARRAY_SIZE 2048
#define THREAD_SIZE 32
#define PI 3.14



__global__ void twiddle_factor(float *isub_matrix,float *osub_matrix)
{
    __shared__ float block[THREAD_SIZE][THREAD_SIZE];
    int x,z;
    unsigned int xIndex = threadIdx.x;
    unsigned int yIndex = threadIdx.y;

    float sum_sines=0.0;
    //float expo_sums;
    float sum_cosines=0.0;
    float sum_sin[THREAD_SIZE][THREAD_SIZE],sum_cos[THREAD_SIZE][THREAD_SIZE];
    float angle=(2*PI)/THREAD_SIZE;

    //put into shared memory the FFT calculation (F(u))

    for(x=0;x<THREAD_SIZE;x++)
    {
        for(y=0;y<THREAD_SIZE;y++)
        {
            for(z=0;z<THREAD_SIZE;z++)
            {
                sum_sines=sum_sines+sin(isub_matrix[y*THREAD_SIZE+z]*(angle*z));
                sum_cosines=sum_cosines+cos(isub_matrix[y*THREAD_SIZE+z]*(angle*z));

            }
            sum_sin[x][y]=sum_sines/THREAD_SIZE;
            sum_cos[x][y]=sum_cosines/THREAD_SIZE;

        }
    }


    if((xIndex<THREAD_SIZE)&&(yIndex<THREAD_SIZE))
    {
        block[xIndex][yIndex]=exp(sum_sin[xIndex][yIndex])+exp(sum_cos[xIndex][yIndex]);
    }




        __syncthreads();

    //transposition X x Y
    //transfer back the results into another sub-matrix that is allocated in cpu

    if((xIndex<THREAD_SIZE)&&(yIndex<THREAD_SIZE))
            osub_matrix[yIndex*THREAD_SIZE+xIndex]=block[xIndex][yIndex];



    __syncthreads();
}


int main(int argc,stop;
        float *d_isub_matrix,*d_osub_matrix;

        float *big_matrix = new float[ARRAY_SIZE*ARRAY_SIZE];
        float *big_matrix2 = new float[ARRAY_SIZE*ARRAY_SIZE];
        float *isub_matrix = new float[THREAD_SIZE*THREAD_SIZE];
        float *osub_matrix = new float[THREAD_SIZE*THREAD_SIZE];

        //if the array's size is not compatible with the thread's size,it won't work.

        //array_rest=(ARRAY_SIZE*ARRAY_SIZE)/(THREAD_SIZE*THREAD_SIZE);
        //isub_matrix=(float*) malloc(memsize);
        //osub_matrix=(float*) malloc(memsize);

        if(((ARRAY_SIZE*ARRAY_SIZE)%(THREAD_SIZE*THREAD_SIZE)==0)&&(ARRAY_SIZE>=THREAD_SIZE))
        {

            //allocating space in cpu memory and GPU memory for the big matrix and its sub matrixes
            //it has to be like this (lots of loops)



            //populating the big array
            for(x=0;x<ARRAY_SIZE;x++)
            {
                for(y=0;y<ARRAY_SIZE;y++)
                    big_matrix[x*ARRAY_SIZE+y]=rand()%10000;
            }

            //kind of loop for the big array

            //Start counting the time of processing (everything)
            cudaEventCreate(&start);
            cudaEventCreate(&stop);

            cudaEventRecord(start,0);

            while(array_plus<ARRAY_SIZE)
            {

                //putting the big array's values into the sub-matrix

                for(x=0;x<THREAD_SIZE;x++)
                {
                    for(y=0;y<THREAD_SIZE;y++)
                        isub_matrix[x*THREAD_SIZE+y]=big_matrix[x*ARRAY_SIZE+y];
                }

                cudamalloc((void**)&d_isub_matrix,THREAD_SIZE*THREAD_SIZE*sizeof(float));
                cudamalloc((void**)&d_osub_matrix,THREAD_SIZE*THREAD_SIZE*sizeof(float));
                cudamemcpy(d_isub_matrix,256>>>(d_isub_matrix,d_osub_matrix);//<----

                cudamemcpy(osub_matrix,d_osub_matrix,cudamemcpyDevicetoHost);

                array_plus=array_plus+THREAD_SIZE;
                for(x=0;x<THREAD_SIZE;x++)
                {
                    for(y=0;y<THREAD_SIZE;y++)
                        big_matrix2[x*THREAD_SIZE+array_plus+y]=osub_matrix[x*THREAD_SIZE+y];
                }


                cudaFree(isub_matrix);
                cudaFree(osub_matrix);
                cudaFree(d_osub_matrix);
                cudaFree(d_isub_matrix);
            }

            //Stop the time

            cudaEventRecord(stop,stop);

            //Free memory in GPU

解决方法

我在这段代码中看到了很多问题.

>在将数据从big_matrix复制到isub_matrix之前,您没有为isub_matrix分配内存

for(x=0;x<THREAD_SIZE;x++)
    {
        for(y=0;y<THREAD_SIZE;y++)
            isub_matrix[x*THREAD_SIZE+y]=big_matrix[x*ARRAY_SIZE+y];
    }

>您没有为isub_matrix从主机到设备执行任何cudamemcpy.在设备上为isub_matrix分配内存后,您需要复制数据.
>我在while循环中看到你正在计算相同的数据.

//putting the big array's values into the sub-matrix

        for(x=0;x<THREAD_SIZE;x++)
        {
            for(y=0;y<THREAD_SIZE;y++)
                isub_matrix[x*THREAD_SIZE+y]=big_matrix[x*ARRAY_SIZE+y];
        }

for循环应该依赖于array_plus.

我建议你这样做

for(x=0;x<THREAD_SIZE;x++)
            {
                for(y=0;y<THREAD_SIZE;y++)
                    isub_matrix[x*THREAD_SIZE+y]=big_matrix[(x+array_plus)*ARRAY_SIZE+y];
            }

>而且,我不觉得使用array_rest.那用的是什么？

基于更新版本：

我看到的错误是

>您正在使用osub_matrix作为主机和设备指针.我建议你创建另一个浮点指针并将其用于设备

float *d_osub_matrix;

cudamalloc((void**)&d_osub_matrix,THREAD_SIZE*THREAD_SIZE*sizeof(float));

并打电话.

twiddle_factor<<<1,d_osub_matrix);

然后做

cudamemcpy(osub_matrix,cudamemcpyDevicetoHost);

>顺便说一下,事实并非如此

twiddle_factor<<< 1256>>>(isub_matrix,osub_matrix);

它应该是

twiddle_factor<<< 1256>>>(d_isub_matrix,osub_matrix);

最终和完成的代码：

int main(int argc,char** argv)
{
        int array_plus=0,y;
        int array_plus_x,array_plus_y;
        float time;
        //unsigned int memsize=sizeof(float)*THREAD_SIZE*THREAD_SIZE;
        //bool array_rest;
        cudaEvent_t start,0);
            for(array_plus_x = 0; array_plus_x < ARRAY_SIZE; array_plus_x += THREAD_SIZE)
            for(array_plus_y = 0; array_plus_y < ARRAY_SIZE; array_plus_y += THREAD_SIZE)
            {


                //putting the big array's values into the sub-matrix

                for(x=0;x<THREAD_SIZE;x++)
                {
                    for(y=0;y<THREAD_SIZE;y++)
                        isub_matrix[x*THREAD_SIZE+y]=big_matrix[(x+array_plus_x)*ARRAY_SIZE+(y+array_plus_y)];
                }

                cudamalloc((void**)&d_isub_matrix,cudamemcpyHostToDevice);

                //call the cuda kernel

                dim3 block(32,32);
                twiddle_factor<<<1,block>>>(d_isub_matrix,cudamemcpyDevicetoHost);

                for(x=0;x<THREAD_SIZE;x++)
                {
                    for(y=0;y<THREAD_SIZE;y++)
                        big_matrix2[(x+array_plus_x)*ARRAY_SIZE+(y+array_plus_y)]=osub_matrix[x*THREAD_SIZE+y];
                }

                cudaFree(d_osub_matrix);
                cudaFree(d_isub_matrix);
            }

            //Stop the time

            cudaEventRecord(stop,stop);

            //Free memory in GPU

使用CUDA内核获得堆栈溢出的更多相关文章

iOS绘图圈

我想在我的iOS应用中创建下面的圈子.我知道如何制作圆圈,但我不完全确定如何获得圆弧上的点数.它必须是代码而不是图像.下面是我目前的代码.更新到答案解决方法如果(x,y)是中心而r是大圆的半径,则第i个外圆的中心将是：在0开始cita并为下一个圆增加PI/4弧度工作实施编辑：添加实现并重命名变量.
ios – 检查互联网连接是否可用于swift

目前,当我的应用程序尝试在没有互联网连接的情况下对用户的位置进行地理位置分配我有点新的快速和ios编程–我的道歉.解决方法不是一个完整的网络检查库,但我发现了this简单的方法来检查网络可用性.我设法把它翻译成Swift,并在这里是最终的代码.它适用于3G和WiFi连接.我也将其上传到我的Github一个工作的例子.如果您正在寻找一种简单的方法来检查纯粹在Swift中的网络可用性,您可以使用它.
ios – 检查Internet连接可用性？

在开始与我的iPhone应用程序中的服务通信之前,我只需要检查互联网连接的可用性.我使用Swift1.2和Xcode6作为我的开发环境….我的问题是,我对于iOS开发是相当新鲜的,不太确定使用这个逻辑来完成它是多么好可靠.该课程中的大部分内容完全不清楚,但是我做的小测试工作很好！
swift基础判断网络连接

没网没有网的时候回弹出警告框
与Swift 2中的防火墙指针

我正在尝试检查用户是否具有互联网连接，部分过程涉及使用UnsafePointer调用。与Swift2.x中的防火墙指针一起使用的正确方法是什么？
Android 使用cos和sin绘制复合曲线动画

这篇文章主要介绍了Android 使用cos和sin绘制复合曲线动画的方法，帮助大家更好的理解和学习使用Android，感兴趣的朋友可以了解下
PyTorch中的CUDA的操作方法

这篇文章主要介绍了PyTorch中的CUDA的操作方法，CUDA是NVIDIA推出的异构计算平台，PyTorch中有专门的模块torch.cuda来设置和运行CUDA相关操作，更多相关介绍，需要的朋友可以查看下面文章内容
是否可以在另一个GPU（2 GPU系统）中处理数据

我的算法需要对每个相机的数据进行长期处理，因此每个相机都需要访问相同的GPU内存问题在一个GPU中处理4个摄像头可能会导致内存不足。所以，我认为一个GPU只能处理两个摄像头。但在第一时间，如果cam3在GPU0处被处理，则cam3数据不能在GPU1处处理。我想将cam3数据从GPU0复制到GPU1，但它并没有那么小，所以看起来效率很低。是否可以在GPU1上使用GPU0数据进行处理而无需内存？我在CUDA方面很短，所以如果有好的关键词来解决这个问题，请告诉我。
如何编写CUDA内核来加速python代码

几周来，我一直在学习python作为我的第一种编程语言。我决定用Numba编写一个乐透模拟。该代码在我的CPU上每秒大约250k次迭代时运行得很好。我真的很想看看它是如何在我的英伟达GPU上运行的，但我有点力不从心。如果有人能帮我一把，我将非常感激。我想我应该能够运行float16，因为数字并不复杂。此外，@vectorize似乎很重要。但是，老实说，我在踩水。
为什么cuGraphAddMemCopyNode已经获得了两个上下文，却需要额外的上下文？

考虑CUDA图形API函数在此描述。它采用的CUDA_MEMCPY3D结构是一组非常广泛的参数。实际上，它包含两个上下文句柄字段：srcContext和dstContext，用于定义源和目标内存区域或数组的上下文。然而，该函数需要额外的第三个上下文句柄。但是，这意味着什么？节点是一个图，它通过具有上下文的流启动。除此之外，为什么这很重要？两个端点上下文应该足以让CUDA驱动程序执行复制。虽然大多数节点插入API函数都没有？

随机推荐

从C到C#的zlib(如何将byte []转换为流并将流转换为byte [])

我的任务是使用zlib解压缩数据包(已接收),然后使用算法从数据中生成图片好消息是我在C中有代码,但任务是在C#中完成C我正在尝试使用zlib.NET,但所有演示都有该代码进行解压缩(C#)我的问题：我不想在解压缩后保存文件,因为我必须使用C代码中显示的算法.如何将byte[]数组转换为类似于C#zlib代码中的流来解压缩数据然后如何将流转换回字节数组？
为什么C标准使用不确定的变量未定义？

垃圾价值存储在哪里,为什么目的？解决方法由于效率原因,C选择不将变量初始化为某些自动值.为了初始化这些数据,必须添加指令.以下是一个例子：产生：虽然这段代码：产生：你可以看到,一个完整的额外的指令用来移动1到x.这对于嵌入式系统来说至关重要.
如何使用命名管道从c调用WCF方法？

更新：通过协议here,我无法弄清楚未知的信封记录.我在网上找不到任何例子.原版的：我有以下WCF服务我输出添加5行,所以我知道服务器是否处理了请求与否.我有一个.NET客户端,我曾经测试这一切,一切正常工作预期.现在我想为这个做一个非托管的C客户端.我想出了如何得到管道的名称,并写信给它.我从here下载了协议我可以写信给管道,但我看不懂.每当我尝试读取它,我得到一个ERROR_broKEN_P
“这”是否保证指向C中的对象的开始？

我想使用fwrite将一个对象写入顺序文件.班级就像当我将一个对象写入文件时.我正在游荡,我可以使用fwrite(this,sizeof(int),2,fo)写入前两个整数.问题是：这是否保证指向对象数据的开始,即使对象的最开始可能存在虚拟表.所以上面的操作是安全的.解决方法这提供了对象的地址,这不一定是第一个成员的地址.唯一的例外是所谓的标准布局类型.从C11标准：(9.2/20)Apointe
c – 编译单元之间共享的全局const对象

当我声明并初始化一个const对象时.两个cpp文件包含此标头.和当我构建解决方案时,没有链接错误,你会得到什么如果g_Const是一个非const基本类型！PrintInUnit1()和PrintInUnit2()表明在两个编译单元中有两个独立的“g_Const”具有不同的地址,为什么？
什么是C名称查找在这里？ (&GCC对吗？)

为什么在第三个变体找到func,但是在实例化的时候,原始变体中不合格查找找不到func？解决方法一般规则是,任何不在模板定义上下文中的内容只能通过ADL来获取.换句话说,正常的不合格查找仅在模板定义上下文中执行.因为在定义中间语句时没有声明func,并且func不在与ns::type相关联的命名空间中,所以代码形式不正确.
c – 在输出参数中使用auto

有没有办法在这种情况下使用auto关键字：当然,不可能知道什么类型的.因此,解决方案应该是以某种方式将它们合并为一个句子.这可用吗？解决方法看起来您希望默认初始化给定函数期望作为参数的类型的对象.您无法使用auto执行此操作,但您可以编写一个特征来提取函数所需的类型,然后使用它来声明您的变量：然后你就像这样使用它：当然,只要你重载函数,这一切都会失败.
在C中说“推动一切浮动”的确定性方式

鉴于我更喜欢将程序中的数字保留为int或任何内容,那么使用这些数字的浮点数等效的任意算术最方便的方法是什么？说,我有我想写通过将转换放在解析的运算符树叶中,无需将表达式转化为混乱是否可以使用C风格的宏？应该用新的类和重载操作符完成吗？解决方法这是一个非常复杂的表达.更好地给它一个名字：现在当您使用整数参数调用它时,由于参数的类型为double,因此使用常规的算术转换将参数转换为double用C11lambda……
objective-c – 如何获取未知大小的NSArray的第一个X元素？

在objectiveC中,我有一个NSArray,我们称之为NSArray*largeArray,我想要获得一个新的NSArray*smallArray,只有第一个x对象…
c – Setprecision是混乱

我只是想问一下setprecision,因为我有点困惑.这里是代码：其中x=以下：方程的左边是x的值.1.105=1.10应为1.111.115=1.11应为1.121.125=1.12应为1.131.135=1.14是正确的1.145=1.15也正确但如果x是：2.115=2.12是正确的2.125=2.12应为2.13所以为什么在一定的价值是正确的,但有时是错误的？请启发我谢谢解决方法没有理由期望使用浮点系统可以正确地表示您的帖子中的任何常量.因此,一旦将它们存储在一个双变量中,那么你所拥有的确切的一