AMD GPU Kernel Driver Analysis (3): the gpu scheduler ring fence synchronization model

Tags: linux, AMD, GPU

In the Linux kernel's AMDGPU driver, dma-fence plays an important role. AMDGPU render and decode operations may involve several parties referencing the same buffer at once. Take rendering or video decoding as an example: after the application writes rendering/decode commands into a buffer shared with the GPU, it submits the task to the GPU's run queue and waits for it to execute. At that point the application not only has to monitor how the task in the buffer is progressing, it must also guarantee that the buffer contents are not modified before the task finishes, while the GPU needs exclusive access to execute the commands in the buffer. When the GPU finishes, the application wants to be told promptly so the buffer can be recycled and reused; that notification is usually delivered by an interrupt bound to a particular buffer. If all of this were protected and synchronized with the classic shared-buffer-plus-lock approach, it would not only be very inefficient, it would also mix a pile of kernel mechanisms together with no unified management framework, making the code hard to maintain. dma-fence provides a simple, convenient framework that embeds atomic operations, sleep/wakeup, and synchronous/asynchronous event notification into the various buffer management frameworks, ties these mechanisms together, reduces busy-waiting in user space, and makes buffer usage smarter and more efficient.

Take AMDGPU video decoding as an example: a dma-fence ring buffer decouples the application's decode requests from the decode tasks themselves. Both task submission and decode-completion notification are exchanged through the dma-fence bound to the buffer:

To make the analysis easier, I extracted this part of the AMDGPU implementation from the 5.4 kernel and wrote a standalone loadable kernel module demo, somewhat like a C-model. In the AMDGPU driver the submission context and the completion-notification context are implemented by a kernel thread and an interrupt respectively; in the demo both are implemented with kernel threads (plus timers).

The working model is illustrated below:

The fence array has 256 slots, each holding one fence, and together they form a ring buffer. sync_seq is the write sequence number; it only ever increases, and sync_seq mod 256 is the index into the array, i.e. the ring buffer's write pointer.

last_seq marks how far consumption has progressed (the fences, and the buffers bound to them, that have already been processed). Like sync_seq it is a monotonically increasing sequence, and last_seq mod 256 is the index into the array, i.e. the ring buffer's read pointer.

fence_seq is a sampled snapshot of the write position. In AMDGPU the hardware ring is shared by all processes, and any of them may submit new rendering work at any moment, so sync_seq is constantly changing and is not convenient to work with directly. fence_seq is a snapshot taken at a point in time; every fence (buffer) between last_seq and fence_seq is signaled together in the next batch.
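
To make the index arithmetic concrete, here is a minimal sketch (names and values follow the demo module below, not the in-tree driver) of how a monotonically growing sequence number maps onto a ring slot with a power-of-two mask:

#include <linux/types.h>

#define NUM_HW_SUBMISSION 128                      /* demo value               */
#define NUM_FENCES        (NUM_HW_SUBMISSION * 2)  /* 256 fence slots          */
#define NUM_FENCES_MASK   (NUM_FENCES - 1)

/* "seq mod 256" becomes a cheap mask because the slot count is a power of two */
static inline unsigned int seq_to_slot(uint32_t seq)
{
	return seq & NUM_FENCES_MASK;
}

/* entries currently in flight; unsigned wrap-around keeps this correct */
static inline uint32_t ring_in_flight(uint32_t sync_seq, uint32_t last_seq)
{
	return sync_seq - last_seq;
}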

When submissions are fast but consumption is slow, the writer catches up with the reader and producer-consumer synchronization is needed: the read side backpressures the write side, and the writer calls dma_fence_wait() to block until the slot it wants to reuse is free:

When the consumer finishes the next task, it releases the current slot and calls dma_fence_signal() to wake the writer; once woken, the writer continues and fills new work into the fence array:
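
A condensed sketch of that hand-off, distilled from the demo module further below (error handling and slot bookkeeping omitted):

#include <linux/dma-fence.h>
#include <linux/printk.h>
#include <linux/rcupdate.h>

/* producer: the slot we want to reuse may still hold an unsignaled fence */
static void producer_reuse_slot(struct dma_fence __rcu **ptr,
                                struct dma_fence *new_fence)
{
	struct dma_fence *old;

	rcu_read_lock();
	old = dma_fence_get_rcu_safe(ptr);       /* fence currently in the slot */
	rcu_read_unlock();

	if (old) {
		/* backpressure: block until the old user of the slot completes */
		if (dma_fence_wait(old, false) < 0)
			pr_warn("fence wait failed\n");
		dma_fence_put(old);
	}

	rcu_assign_pointer(*ptr, dma_fence_get(new_fence));
}

/* consumer: the "hardware" finished the job bound to this slot */
static void consumer_complete_slot(struct dma_fence __rcu **ptr)
{
	struct dma_fence *fence = rcu_dereference_protected(*ptr, 1);

	RCU_INIT_POINTER(*ptr, NULL);
	if (fence) {
		dma_fence_signal(fence);         /* wakes the blocked producer  */
		dma_fence_put(fence);            /* drop the ring's reference   */
	}
}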

Freeing the dma fences

One thing pulls the next along with it: freeing the job fence triggers the release of the scheduled fence and the finished fence. The release is driven by the reference count dropping to zero, which invokes the release callback; in that callback, call_rcu() is used to free the fence's memory asynchronously.
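
The shape of that chain, condensed (this mirrors the demo's fence_ops below, with a plain kfree() standing in for the slab free):

#include <linux/dma-fence.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

static const char *demo_fence_name(struct dma_fence *f)
{
	return "demo-fence";
}

static void demo_fence_free(struct rcu_head *rcu)
{
	struct dma_fence *f = container_of(rcu, struct dma_fence, rcu);

	kfree(f);                       /* the memory only goes away here */
}

static void demo_fence_release(struct dma_fence *f)
{
	/* invoked when the refcount drops to zero; defer the free past a grace period */
	call_rcu(&f->rcu, demo_fence_free);
}

static const struct dma_fence_ops demo_fence_ops = {
	.get_driver_name   = demo_fence_name,
	.get_timeline_name = demo_fence_name,
	.release           = demo_fence_release,
};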

These are the main corner cases encountered when using the fence array.

gpu_sched trace points in AMDGPU

The Linux kernel tracing framework provides trace events for the GPU scheduler, so GPU execution can be traced without recompiling the module. The path is:

/sys/kernel/debug/tracing/events/gpu_scheduler

Trace output:

root@zlcao-A520MS:/sys/kernel/debug/tracing/events/gpu_scheduler# echo 1 > drm_run_job/enable 
root@zlcao-A520MS:/sys/kernel/debug/tracing/events/gpu_scheduler# echo 1 > drm_sched_job/enable 
root@zlcao-A520MS:/sys/kernel/debug/tracing/events/gpu_scheduler# echo 1 >
drm_run_job/            drm_sched_job/          drm_sched_job_wait_dep/ drm_sched_process_job/  enable                  filter                  
root@zlcao-A520MS:/sys/kernel/debug/tracing/events/gpu_scheduler# echo 1 > enable
root@zlcao-A520MS:/sys/kernel/debug/tracing/events/gpu_scheduler# cat ../../trace

Fence flow control

The ring structure contains a field named num_hw_submission, and the fence array is allocated at twice that size.

This value originally comes in through amdgpu.ko's amdgpu_sched_hw_submission module parameter; the AMDGPU driver then adjusts it further according to the ring type.

It is later passed to sched->hw_submission_limit when the scheduler thread is created. While the number of job packets already emitted and still executing on a ring is below sched->hw_submission_limit, the job-submission context wakes the ring's scheduler kernel thread. This is how the amount of work in flight on a ring is throttled.
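
For reference, a simplified sketch of that check, modeled on the upstream drm gpu scheduler; the field names hw_rq_count, hw_submission_limit and wake_up_worker follow the 5.x sched_main.c and may differ in other kernel versions:

#include <drm/gpu_scheduler.h>

/* true while fewer jobs are in flight on the hardware than the ring allows */
static bool ring_can_take_more(struct drm_gpu_scheduler *sched)
{
	return atomic_read(&sched->hw_rq_count) < sched->hw_submission_limit;
}

/* called from the submission context: only wake the scheduler thread
 * when the ring still has room, which is the flow control described above */
static void ring_submission_wakeup(struct drm_gpu_scheduler *sched)
{
	if (ring_can_take_more(sched))
		wake_up_interruptible(&sched->wake_up_worker);
}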

Only the KIQ ring has the largest submission count, reaching 256; as the annotation in the earlier figure notes, this is probably because a large number of commands run on it.

Load on the AMDGPU rings while the Ubuntu desktop is doing ordinary work:

With the Ubuntu graphical desktop running, AMDGPU mainly has three rings executing work: gfx_low/gfx_high and sdma. The gfx rings execute the actual rendering commands, while sdma handles data copies.

How KFD rings and GFX rings are handled differently

A KFD ring is the ring type that supports compute; its command packets are submitted from user mode, so it does not depend on the kernel-side gpu scheduler, whereas a gfx ring does. The mailing-list exchange below explains the difference:

> It seems the HSA KFD driver lives in the DRM framework, much like the
> AMD graphics driver. But the graphics driver seems to be independent of
> KFD, because the GPU graphics functions still run well when
> CONFIG_HSA_AMD is disabled and the KFD functions are compiled out.
>
> But does the KFD driver need the DRM driver? Without the DRM
> framework, does the KFD driver work normally?
> And another question: is the gpu_sched module shared between the
> graphics driver and KFD, or is it used only by the graphics driver?
>
> I am not familiar with the AMD GPU driver, but I became very interested
> in it when I found it. Thanks very much for your kind help with my
> misunderstanding.
>
The AMDGPU driver is a unified driver that supports both graphics and 
compute functionality on a wide range of AMD GPUs. You can think of KFD 
as an extension to AMDGPU to support ROCm user mode (HIP, OpenMP, 
PyTorch, Tensorflow, etc.). ROCm user mode still depends on some of the 
DRM functionality. So KFD cannot exist as a standalone driver without 
AMDGPU and DRM.

KFD does not use the gpu_sched scheduler. It is only used by the 
graphics driver. Instead it relies on user-mode-queues, where user mode 
places commands directly into memory-mapped queues that are executed by 
the firmware on the GPU. We do not use a driver ioctl API for command 
submission, so we don't use the gpu_sched scheduler. Instead the user 
mode compute queues are scheduled by the GPU firmware.

Appendix: the demo module code:

#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/types.h>
#include <linux/spinlock.h>
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/sched/signal.h>
#include <linux/dma-fence.h>
#include <linux/slab.h>
#include <linux/delay.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <uapi/linux/sched/types.h>

#define assert(expr) \
	do { \
		if (!(expr)) { \
			printk("Assertion failed! %s,%s,%s,line=%d\n", \
			       #expr, __FILE__, __func__, __LINE__); \
			BUG(); \
		} \
	} while (0)

#define num_hw_submission               128
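
/*
 * Software model of one ring's fence bookkeeping (loosely mirrors
 * amdgpu_fence_driver): sync_seq is the write pointer, last_seq the
 * read pointer, fences[] the ring of in-flight fence slots.
 */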
struct fence_driver {
	uint64_t                        gpu_addr;
	volatile uint32_t               *cpu_addr;
	uint32_t                        sync_seq;
	atomic_t                        last_seq;
	bool                            initialized;
	bool                            initialized_emit;
	bool                            initialized_recv;
	unsigned                        num_fences_mask;
	spinlock_t                      lock;
	struct dma_fence                **fences;
	struct mutex                    mutex;
	struct dma_fence                *last_fence;
	struct timer_list               timer;
	struct timer_list               work_timer;
	wait_queue_head_t               job_scheduled;
};

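/*
 * Pair of scheduler fences (scheduled/finished), analogous to
 * drm_sched_fence in the in-tree gpu scheduler.
 */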
struct fence_set {
	struct dma_fence                scheduled;
	struct dma_fence                finished;
	spinlock_t                      lock;
};

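/*
 * Hardware job fence emitted into the ring; data points to the
 * fence_set whose release is chained off this fence's free.
 */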
struct job_fence {
	struct dma_fence                job;
	void                            *data;
};

static uint32_t fence_seq;
static struct fence_driver *ring;
static struct task_struct *fence_emit_task;
static struct task_struct *fence_recv_task;
static struct kmem_cache *job_fence_slab;
static struct kmem_cache *sched_fence_slab;

static const char *dma_fence_get_name(struct dma_fence *fence)
{
	return "dma-fence-drv";
}

static bool dma_fence_enable_signal(struct dma_fence *fence)
{
	if (!timer_pending(&ring->work_timer)) {
		mod_timer(&ring->work_timer, jiffies + HZ / 10);
	}

	printk("%s line %d, signal fenceno %lld.\n", __func__, __LINE__, fence->seqno);

	return true;
}

void fencedrv_fence_free(struct rcu_head *rcu)
{
	struct fence_set *fs;
	struct dma_fence *f = container_of(rcu, struct dma_fence, rcu);
	struct job_fence *jb = container_of(f, struct job_fence, job);

	assert(jb->data != NULL);

	fs = (struct fence_set *)jb->data;

	// every dma_fence_get() must be balanced by a dma_fence_put() over the fence's lifetime.
	assert((kref_read(&fs->scheduled.refcount) == 1));
	assert((kref_read(&fs->finished.refcount) == 1));

	//dma_fence_put(&fs->scheduled);
	dma_fence_put(&fs->finished);
	kmem_cache_free(job_fence_slab, jb);

	// dump_stack();
}

static void fencedrv_dma_fence_release(struct dma_fence *fence)
{
	// typically usage for dma fence release by rcu.
	call_rcu(&fence->rcu, fencedrv_fence_free);
}

static const struct dma_fence_ops fence_ops = {
	.get_driver_name = dma_fence_get_name,
	.get_timeline_name = dma_fence_get_name,
	.enable_signaling = dma_fence_enable_signal,
	.release = fencedrv_dma_fence_release,
};

static int32_t fencedrv_get_ring_avail(void)
{
	uint32_t read_seq, write_seq;

	do {
		read_seq = atomic_read(&ring->last_seq);
		write_seq = ring->sync_seq;
	} while (atomic_read(&ring->last_seq) != read_seq);

	read_seq &= ring->num_fences_mask;
	write_seq &= ring->num_fences_mask;

	pr_err("%s line %d, read_seq %d, write_seq %d.\n",
	       __func__, __LINE__, read_seq, write_seq);

	if (read_seq <= write_seq) {
		return write_seq - read_seq;
	} else {
		return write_seq + num_hw_submission * 2 - read_seq;
	}
}

static const char *dma_fence_get_name_scheduled(struct dma_fence *fence)
{
	return "dma-fence-scheduled";
}

static const char *dma_fence_get_name_finished(struct dma_fence *fence)
{
	return "dma-fence-finished";
}

static void sched_fence_free(struct rcu_head *head);
static void fencedrv_dma_fence_release_scheduled(struct dma_fence *fence)
{
	struct fence_set *fs = container_of(fence, struct fence_set, scheduled);

	// typically usage for dma fence release by rcu.
	call_rcu(&fs->finished.rcu, sched_fence_free);
}

static void fencedrv_dma_fence_release_finished(struct dma_fence *fence)
{
	struct fence_set *fs = container_of(fence, struct fence_set, finished);
	dma_fence_put(&fs->scheduled);
	//while (1);
}


static const struct dma_fence_ops fence_scheduled_ops = {
	.get_driver_name = dma_fence_get_name_scheduled,
	.get_timeline_name = dma_fence_get_name_scheduled,
	.release = fencedrv_dma_fence_release_scheduled,
};

static const struct dma_fence_ops fence_finished_ops = {
	.get_driver_name = dma_fence_get_name_finished,
	.get_timeline_name = dma_fence_get_name_finished,
	.release = fencedrv_dma_fence_release_finished,
};

static struct fence_set *to_sched_fence(struct dma_fence *f)
{
	if (f->ops == &fence_scheduled_ops) {
		return container_of(f, struct fence_set, scheduled);
	}

	if (f->ops == &fence_finished_ops) {
		return container_of(f, struct fence_set, finished);
	}

	return NULL;
}

static void sched_fence_free(struct rcu_head *head)
{
	struct dma_fence *f = container_of(head, struct dma_fence, rcu);
	struct fence_set *fs = to_sched_fence(f);
	if (fs == NULL)
		return;

	assert(f == &fs->finished);

	kmem_cache_free(sched_fence_slab, fs);
	dump_stack();
}

static struct fence_set *init_fence_set(void)
{
	struct fence_set *fs = kmem_cache_alloc(sched_fence_slab, GFP_KERNEL);
	if (fs == NULL) {
		pr_err("%s line %d, alloc fence set from fence set slab failure.\n",
		       __func__, __LINE__);
		return NULL;
	}

	spin_lock_init(&fs->lock);

	dma_fence_init(&fs->scheduled, &fence_scheduled_ops, &fs->lock, 0, 0);
	dma_fence_init(&fs->finished, &fence_finished_ops, &fs->lock, 0, 0);

	return fs;
}

// ref amdgpu_fence_process
static int fence_recv_task_thread(void *data)
{
	struct sched_param sparam = {.sched_priority = 1};
	sched_setscheduler(current, SCHED_FIFO, &sparam);

	//mutex_lock(&ring->mutex);
	while (ring->initialized == false) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (ring->initialized == true) {
			break;
		}
		//mutex_unlock(&ring->mutex);
		schedule();
		//mutex_lock(&ring->mutex);
	}

	set_current_state(TASK_RUNNING);
	//mutex_unlock(&ring->mutex);

	while (!kthread_should_stop() && ring->initialized_recv == true) {
		uint32_t seqno_next = 0;
		uint32_t seq, last_seq;
		int r;

		do {
			// last_seq is the read pointer of fence ring buffer.
			last_seq = atomic_read(&ring->last_seq);
			seq = *ring->cpu_addr;

			if (kthread_should_stop())
				return 0;
		} while (atomic_cmpxchg(&ring->last_seq, last_seq, seq) != last_seq);

		if (del_timer(&ring->work_timer) &&
		    seq != ring->sync_seq) {
			mod_timer(&ring->work_timer, jiffies + HZ / 10);
		}

		//printk("%s line %d, last_seq %d, seq %d, sync_seq %d.\n", __func__, __LINE__, last_seq, seq, ring->sync_seq);

		if (unlikely(seq == last_seq)) {
			msleep(10);
			continue;
		}

		assert(seq > last_seq);

		last_seq &= ring->num_fences_mask;
		seq &= ring->num_fences_mask;

		//printk("%s line %d, last_seq %d, seq %d, sync_seq %d.\n", __func__, __LINE__, last_seq, seq, ring->sync_seq);
		do {
			struct dma_fence *fence, **ptr;

			++last_seq;
			last_seq &= ring->num_fences_mask;
			ptr = &ring->fences[last_seq];
			fence = rcu_dereference_protected(*ptr, 1);

			RCU_INIT_POINTER(*ptr, NULL);
			if (!fence) {
				continue;
			}

			if (seqno_next == 0 || seqno_next == fence->seqno) {
				seqno_next = fence->seqno + 1;
			} else { /*if (seqno_next != 0 && seqno_next != fence->seqno)*/
				pr_err("%s line %d, seqno is not continue, exptect %d, actual %lld.\n",
				       __func__, __LINE__, seqno_next, fence->seqno);
			}

			printk("%s line %d, last_seq/slot %d, seq %d, signal %lld.\n",
			       __func__, __LINE__, last_seq, seq, fence->seqno);

			if (list_empty(&fence->cb_list)) {
				printk("%s line %d, fence cb list is empty.\n",
				       __func__, __LINE__);
			} else {
				printk("%s line %d, fence cb list is not empty.\n",
				       __func__, __LINE__);
			}

			r = dma_fence_signal(fence);
			if (kthread_should_stop()) {
				dma_fence_put(fence);
				return 0;
			}

			if (r) {
				pr_err("%s line %d, fence already signaled.\n",
				       __func__, __LINE__);
				continue;
				//BUG();
			}

			dma_fence_put(fence);
		} while (last_seq != seq);

		wake_up(&ring->job_scheduled);
	}

	set_current_state(TASK_RUNNING);

	return 0;
}

// ref amdgpu_fence_emit.
static int fence_emit_task_thread(void *data)
{
	int r;
	uint64_t oldwaitseqno = 0;
	struct sched_param sparam = {.sched_priority = 1};

	sched_setscheduler(current, SCHED_FIFO, &sparam);

	//mutex_lock(&ring->mutex);
	while (ring->initialized == false) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (ring->initialized == true) {
			break;
		}

		//mutex_unlock(&ring->mutex);
		schedule();
		//mutex_lock(&ring->mutex);
	}

	set_current_state(TASK_RUNNING);
	//mutex_unlock(&ring->mutex);

	while (!kthread_should_stop() && ring->initialized_emit == true) {
#if 0
		msleep(1000);
		printk("%s line %d.\n", __func__, __LINE__);
#else
		struct dma_fence __rcu **ptr;
		struct job_fence *fence;
		uint32_t seq;
		struct fence_set *fs = init_fence_set();

		fence = kmem_cache_alloc(job_fence_slab, GFP_KERNEL);
		if (fence == NULL) {
			pr_err("%s line %d, alloc fence from fence slab failure.\n",
			       __func__, __LINE__);
			return -1;
		}

		// ring->sync_seq is fence ring write pointer.
		seq = ++ring->sync_seq;
		dma_fence_init(&fence->job, &fence_ops, &ring->lock, 0, seq);
		fence->data = fs;

		ptr = &ring->fences[seq & ring->num_fences_mask];

		//printk("%s line %d, seq = %d.\n", __func__, __LINE__, seq);

		if (kthread_should_stop()) {
			// will call fence_ops.release directly to free the fence.
			dma_fence_put(&fence->job);
			continue;
		}

		if (unlikely(rcu_dereference_protected(*ptr, 1))) {
			struct dma_fence *old;
			int diff;

			rcu_read_lock();
			old = dma_fence_get_rcu_safe(ptr);
			rcu_read_unlock();

			if (old) {
				mutex_lock(&ring->mutex);
				//dma_fence_get(old);
				ring->last_fence = old;
				mutex_unlock(&ring->mutex);

				r = dma_fence_wait(old, false);

				mutex_lock(&ring->mutex);
				ring->last_fence = NULL;
				dma_fence_put(old);
				mutex_unlock(&ring->mutex);

				if (kthread_should_stop() || r) {
					// will call fence_ops.release directly to free the fence.
					dma_fence_put(&fence->job);
					continue;
				}

				// if wrap-around happened, seq and old->seqno must be congruent: seq ≡ old->seqno (mod num_hw_submission * 2),
				// which implies seq = q * (num_hw_submission * 2) + old->seqno, with q = 1 typically.
				diff = seq - old->seqno;
				printk("%s line %d, fence wokenup, wakeseqno %lld, new adding seq %d, slot %d, diff %d, waken interval %lld, latestseq %d, avail %d.\n",
				       __func__, __LINE__, old->seqno, seq, seq & ring->num_fences_mask, diff, old->seqno - oldwaitseqno,
				       ring->sync_seq, fencedrv_get_ring_avail());

				if (diff != num_hw_submission * 2) {
					pr_err("%s line %d, fatal error, diff not match totoal ring.\n",
					       __func__, __LINE__);
				}

				oldwaitseqno = old->seqno;
			}
		}

#if 0
		printk("%s line %d, fence emit, seqno %lld, seq %d, slot %d.\n",
		       __func__, __LINE__, fence->seqno, seq, seq & ring->num_fences_mask);
#endif
		rcu_assign_pointer(*ptr, dma_fence_get(&fence->job));

		// the ring slot took its own reference above; drop the local one here.
		dma_fence_put(&fence->job);
#endif
	}

	set_current_state(TASK_RUNNING);
	return 0;
}

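/*
 * Fallback work timer: periodically scans the fence ring and signals any
 * fences the simulated GPU (see gpu_process_thread below) has completed,
 * playing the role of the fence-processing interrupt in the real driver.
 */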
void work_timer_fn(struct timer_list *timer)
{
	uint32_t seqno_next = 0;
	uint32_t seq, last_seq;
	int r;

	do {
		last_seq = atomic_read(&ring->last_seq);
		seq = *ring->cpu_addr;
	} while (atomic_cmpxchg(&ring->last_seq, last_seq, seq) != last_seq);

	if (unlikely(seq == last_seq)) {
		goto end;
	}

	assert(seq > last_seq);

	last_seq &= ring->num_fences_mask;
	seq &= ring->num_fences_mask;

	do {
		struct dma_fence *fence, **ptr;

		++last_seq;
		last_seq &= ring->num_fences_mask;
		ptr = &ring->fences[last_seq];
		fence = rcu_dereference_protected(*ptr, 1);

		RCU_INIT_POINTER(*ptr, NULL);
		if (!fence) {
			continue;
		}

		if (seqno_next == 0 || seqno_next == fence->seqno) {
			seqno_next = fence->seqno + 1;
		} else { /*if (seqno_next != 0 && seqno_next != fence->seqno)*/
			pr_err("%s line %d, seqno is not continue, exptect %d, actual %lld.\n",
			       __func__, __LINE__, seqno_next, fence->seqno);
		}

		r = dma_fence_signal(fence);
		if (r) {
			pr_err("%s line %d, fence already signaled.\n",
			       __func__, __LINE__);
			continue;
			//BUG();
		}

		dma_fence_put(fence);
	} while (last_seq != seq);
end:
	pr_err("%s line %d, work timer triggerd.\n", __func__, __LINE__);
	mod_timer(timer, jiffies + HZ / 10);
}

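/*
 * Timer callback that simulates the GPU consuming jobs: it periodically
 * advances fence_seq (the value read through ring->cpu_addr), marking a
 * batch of emitted fences as completed.
 */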
void gpu_process_thread(struct timer_list *timer)
{
	uint32_t seq, oldseq;

	seq = ring->sync_seq;
	oldseq = fence_seq;

	// trigger a job done on device.
	if (fence_seq == 0) {
		if (seq > 6)
			fence_seq = seq - 4;
	} else if ((seq - fence_seq) > 10) {
		fence_seq += (seq - fence_seq) / 2;
		assert(fence_seq > oldseq);
	}

	printk("%s line %d, timer trigger job, latest consume fence %d.\n",
	       __func__, __LINE__, fence_seq);

	mod_timer(timer, jiffies + HZ / 2);
}

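/*
 * Drain helper used at unload time: wait for the fence stored at the
 * current sync_seq slot to signal, similar to amdgpu_fence_wait_empty().
 */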
static int fencedrv_wait_empty(void)
{
	uint64_t seq = READ_ONCE(ring->sync_seq);
	struct dma_fence *fence, **ptr;
	int r;

	if (!seq)
		return 0;

	fence_seq = seq;
	ptr = &ring->fences[seq & ring->num_fences_mask];
	rcu_read_lock();

	fence = rcu_dereference(*ptr);
	if (!fence || !dma_fence_get_rcu(fence)) {
		rcu_read_unlock();
		return 0;
	}
	rcu_read_unlock();

	r = dma_fence_wait(fence, false);

	printk("%s line %d, wait last fence %lld, seq %lld, r %d.\n", \
	       __func__, __LINE__, fence->seqno, seq, r);
	dma_fence_put(fence);

	return r;
}

static int __init fencedrv_init(void)
{
	if ((num_hw_submission & (num_hw_submission - 1)) != 0) {
		pr_err("%s line %d, num_hw_submission must be power of two.\n",
		       __func__, __LINE__);
		return -1;
	}

	ring = kzalloc(sizeof(*ring), GFP_KERNEL);
	if (ring == NULL) {
		pr_err("%s line %d, alloc fence driver failure.\n",
		       __func__, __LINE__);
		return -ENOMEM;
	}

	// fence_seq is a snapshot of sync_seq, used to process fences in batches.
	ring->cpu_addr = &fence_seq;
	ring->gpu_addr = (uint64_t)&fence_seq;
	ring->sync_seq = 0;
	atomic_set(&ring->last_seq, 0);
	ring->initialized = false;
	ring->initialized_emit = false;
	ring->initialized_recv = false;
	ring->last_fence = NULL;
	ring->num_fences_mask = num_hw_submission * 2 - 1;
	init_waitqueue_head(&ring->job_scheduled);

	spin_lock_init(&ring->lock);
	ring->fences = kcalloc(num_hw_submission * 2, sizeof(void *), GFP_KERNEL);
	if (!ring->fences) {
		pr_err("%s line %d, alloc fence buffer failure.\n",
		       __func__, __LINE__);
		return -ENOMEM;
	}

	printk("%s line %d, fence mask 0x%x, num_hw_submission 0x%x.\n",
	       __func__, __LINE__, ring->num_fences_mask, num_hw_submission);

	job_fence_slab = kmem_cache_create("job_fence_slab", sizeof(struct job_fence), 0,
	                                   SLAB_HWCACHE_ALIGN, NULL);
	if (!job_fence_slab) {
		pr_err("%s line %d, alloc job_fence_slab falure.\n",
		       __func__, __LINE__);
		return -ENOMEM;
	}

	sched_fence_slab = kmem_cache_create("sched_fence_slab", sizeof(struct fence_set), 0,
	                                     SLAB_HWCACHE_ALIGN, NULL);
	if (!sched_fence_slab) {
		pr_err("%s line %d, alloc sched_fence_slab falure.\n",
		       __func__, __LINE__);
		return -ENOMEM;
	}

	mutex_init(&ring->mutex);

	fence_emit_task = kthread_run(fence_emit_task_thread, NULL, "fence_emit");
	if (IS_ERR(fence_emit_task)) {
		pr_err("%s line %d, create fence emit tsk failure.\n",
		       __func__, __LINE__);
		return -1;
	}

	fence_recv_task = kthread_run(fence_recv_task_thread, NULL, "fence_recv");
	if (IS_ERR(fence_recv_task)) {
		pr_err("%s line %d, create fence recv tsk failure.\n",
		       __func__, __LINE__);
		return -1;
	}

	/* mod_timer() on a freshly set-up timer arms it; no separate add_timer() needed */
	timer_setup(&ring->timer, gpu_process_thread, TIMER_IRQSAFE);
	mod_timer(&ring->timer, jiffies + HZ / 2);

	timer_setup(&ring->work_timer, work_timer_fn, TIMER_IRQSAFE);
	mod_timer(&ring->work_timer, jiffies + HZ / 10);

	printk("%s line %d, module init.\n", __func__, __LINE__);

	ring->initialized = true;
	ring->initialized_emit = true;
	ring->initialized_recv = true;

	wake_up_process(fence_emit_task);
	wake_up_process(fence_recv_task);

	return 0;
}

static void __exit fencedrv_exit(void)
{
	printk("%s line %d, module unload task begin.\n", __func__, __LINE__);

	del_timer_sync(&ring->work_timer);

	mutex_lock(&ring->mutex);
	if ((ring->last_fence != NULL) &&
	    (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &ring->last_fence->flags))) {
		ring->initialized_emit = false;
		dma_fence_signal(ring->last_fence);
		dma_fence_put(ring->last_fence);
	}
	mutex_unlock(&ring->mutex);

	kthread_stop(fence_emit_task);
	printk("%s line %d, module unload task mid.\n", __func__, __LINE__);

	del_timer_sync(&ring->timer);
	fencedrv_wait_empty();

	printk("%s line %d, sync wait avail %d.\n", __func__, __LINE__, fencedrv_get_ring_avail());

	wait_event_killable(ring->job_scheduled, fencedrv_get_ring_avail() <= 1);
	ring->initialized_recv = false;
	kthread_stop(fence_recv_task);

	printk("%s line %d, module unload task end.\n", __func__, __LINE__);

	ring->initialized = false;
	rcu_barrier();
	kmem_cache_destroy(job_fence_slab);
	kmem_cache_destroy(sched_fence_slab);
	kfree(ring->fences);
	kfree(ring);

	fence_emit_task = NULL;
	fence_recv_task = NULL;

	printk("%s line %d, module unload.\n", __func__, __LINE__);
}

module_init(fencedrv_init);
module_exit(fencedrv_exit);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("czl");

Fence release order


Reference blogs

AMD GPU task scheduling (3): the fence mechanism (程序员宅基地)

Fence / dma fence (笔落梦昙's blog)


The End

Copyright notice: this is an original article by the author, distributed under the CC 4.0 BY-SA license; please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/tugouxp/article/details/133519133
