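2.8.5 Trigger Timing

The system triggers load balancing in several different scenarios; this section analyzes three typical ones.

2.8.5.1. Periodic load balancing

The system triggers load balancing periodically, driven by the CPU's clock tick. The entry point is: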

/* file: kernel/sched/core.c */
void scheduler_tick(void)
{
    int cpu = smp_processor_id();
    struct rq *rq = cpu_rq(cpu);

    /* other code omitted */

#ifdef CONFIG_SMP
    rq->idle_balance = idle_cpu(cpu);
    trigger_load_balance(rq);
#endif
}

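We met this function when discussing the CFS scheduler tick; only the code that triggers load balancing is kept here. When CONFIG_SMP is enabled, every clock interrupt gives the CPU a chance to attempt load balancing. The function trigger_load_balance is implemented as follows: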

/* file: kernel/sched/fair.c */

void trigger_load_balance(struct rq *rq)
{
    /*
     * Don't need to rebalance while attached to NULL domain or
     * runqueue CPU is not active
     */
    if (unlikely(on_null_domain(rq) || !cpu_active(cpu_of(rq))))
        return;

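    /* Check the balancing deadline: once the current time has reached the point at which
     * the next load balance is allowed, raise the SCHED_SOFTIRQ softirq. Its handler,
     * run_rebalance_domains, is registered in init_sched_fair_class. */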
    if (time_after_eq(jiffies, rq->next_balance))
        raise_softirq(SCHED_SOFTIRQ);

    /* kick NOHZ load balancing */
    nohz_balancer_kick(rq);
}

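The only statement we care about here is raise_softirq(SCHED_SOFTIRQ). The function checks the balancing deadline: if the current time (jiffies) has reached or passed the preset time of the next load balance (rq->next_balance), it raises a SCHED_SOFTIRQ softirq. The handler for this softirq is registered in the initialization function init_sched_fair_class: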

/* file: kernel/sched/fair.c */
__init void init_sched_fair_class(void)
{
#ifdef CONFIG_SMP
    open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
#endif /* SMP */
}
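
The tick-driven path described so far (deadline check, raising a softirq, running the registered handler) can be sketched as a small user-space model. This is only an illustration under stated assumptions: every model_* name is hypothetical and not a kernel API, the 4-tick balancing interval is made up, and the real softirq core runs handlers at a later, safe point rather than directly from the tick.

```c
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
#define MODEL_SCHED_SOFTIRQ 0u
#define MODEL_NR_SOFTIRQS   8u

typedef void (*model_softirq_fn)(void);

static model_softirq_fn model_handlers[MODEL_NR_SOFTIRQS];
static unsigned int model_pending;        /* bitmask of raised softirqs */
static unsigned long model_jiffies;       /* tick counter */
static unsigned long model_next_balance;  /* earliest tick for the next balance */
static int model_balance_runs;            /* how often the handler has run */

/* Wraparound-safe "a is at or after b", the same idea as time_after_eq(). */
static bool model_time_after_eq(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}

/* Register a handler for one softirq number. */
static void model_open_softirq(unsigned int nr, model_softirq_fn fn)
{
    model_handlers[nr] = fn;
}

/* Mark a softirq as pending; the handler does not run yet. */
static void model_raise_softirq(unsigned int nr)
{
    model_pending |= 1u << nr;
}

/* Run and clear every pending softirq that has a registered handler. */
static void model_do_softirq(void)
{
    for (unsigned int nr = 0; nr < MODEL_NR_SOFTIRQS; nr++) {
        if ((model_pending & (1u << nr)) && model_handlers[nr]) {
            model_pending &= ~(1u << nr);
            model_handlers[nr]();
        }
    }
}

/* Stand-in for run_rebalance_domains(): count the run, set the next deadline. */
static void model_run_rebalance(void)
{
    model_balance_runs++;
    model_next_balance = model_jiffies + 4; /* 4-tick interval is an assumption */
}

/* Stand-in for scheduler_tick() calling trigger_load_balance(). */
static void model_tick(void)
{
    model_jiffies++;
    if (model_time_after_eq(model_jiffies, model_next_balance))
        model_raise_softirq(MODEL_SCHED_SOFTIRQ);
    model_do_softirq();
}
```

To use the model, register model_run_rebalance with model_open_softirq(MODEL_SCHED_SOFTIRQ, ...), set model_next_balance, and call model_tick() repeatedly; the handler runs only on ticks that reach the deadline. The signed-difference comparison is what keeps the check correct when the tick counter wraps around.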

The softirq handler is run_rebalance_domains, whose main job is to call the appropriate load balancing function:

/* file: kernel/sched/fair.c */

static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
{
    struct rq *this_rq = this_rq();
    enum cpu_idle_type idle =
        this_rq->idle_balance ? CPU_IDLE : CPU_NOT_IDLE;

    /*
     * If this CPU has a pending nohz_balance_kick, then do the
     * balancing on behalf of the other idle CPUs whose ticks are
     * stopped. Do nohz_idle_balance *before* rebalance_domains to
     * give the idle CPUs a chance to load balance. Else we may
     * load balance only within the local sched_domain hierarchy
     * and abort nohz_idle_balance altogether if we pull some load.
     */
    if (nohz_idle_balance(this_rq, idle))
        return;

    /* normal load balance */
    update_blocked_averages(this_rq->cpu);
    rebalance_domains(this_rq, idle);
}

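nohz_idle_balance is covered later in this section, and rebalance_domains is the load balancing function described in the previous section.

2.8.5.2. NOHZ load balancing

Periodic load balancing is driven by the CPU's clock tick, as is much of the system's periodic work. However, a CPU with nothing to do (an empty rq) enters the idle state and eventually a power-saving mode. The clock interrupt then keeps waking the CPU from power saving at regular intervals, only for the CPU to find that it still has nothing to do and go back to power saving. For an idle CPU, then, the periodic tick does not help; it is a disturbance and a waste of energy. What is needed is a mechanism that lets a CPU switch off the periodic tick while it is idle.

NOHZ is that mechanism. The kernel uses CONFIG_NO_HZ_COMMON to control whether the feature is enabled; when it is, a CPU stops its clock tick on entering the idle state.

NOHZ is good for CPU power consumption but unfriendly to load balancing. As we saw, load balancing is driven by each CPU's clock tick and always tries to pull tasks from other CPUs' queues into the local queue. So how is this process triggered for an idle CPU whose tick has been stopped? The scheduler calls the load balancing logic that involves idle CPUs NOHZ load balancing. It works in the following three steps: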

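1. Initialize the CPU's IPI handler

Since a CPU stops its clock tick after entering idle, we need a way to wake an idle CPU when some other CPU is busy. That mechanism is the IPI.

IPI stands for Inter-Processor Interrupt, an interrupt mechanism between processors through which one CPU can notify another that an event has occurred. For a detailed introduction to IPIs, see: https://en.wikipedia.org/wiki/Inter-processor_interrupt

When the scheduler is initialized, it sets up the NOHZ handler function as follows: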

    /* file: kernel/sched/core.c */

    void __init sched_init(void)
    {
        /* when NOHZ is enabled, the following code initializes the related fields */
    #ifdef CONFIG_NO_HZ_COMMON
        rq->last_blocked_load_update_tick = jiffies;
        atomic_set(&rq->nohz_flags, 0);

        /* set up the IPI handler for NOHZ load balancing; it is triggered from kick_ilb */
        INIT_CSD(&rq->nohz_csd, nohz_csd_func, rq);
    #endif
    }

The function nohz_csd_func looks like this:

    /* file: kernel/sched/core.c */

    static void nohz_csd_func(void *info)
    {
        struct rq *rq = info;
        int cpu = cpu_of(rq);
        unsigned int flags;

        /*
        * Release the rq::nohz_csd.
        */
        flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));
        WARN_ON(!(flags & NOHZ_KICK_MASK));

        rq->idle_balance = idle_cpu(cpu);
        if (rq->idle_balance && !need_resched()) {
            rq->nohz_idle_balance = flags;
            /* raise the SCHED_SOFTIRQ softirq to perform load balancing */
            raise_softirq_irqoff(SCHED_SOFTIRQ);
        }
    }

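If the CPU is still idle and no reschedule is pending, the function finishes by raising the SCHED_SOFTIRQ softirq, which makes the CPU perform load balancing. The logic that sends the IPI lives in the function kick_ilb, described below.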
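
The flag handshake in nohz_csd_func (atomically take and clear the kick bits, then defer the work to the softirq only if the CPU is still idle) can be sketched with C11 atomics. This is a hedged user-space model: the MODEL_* flag values and the struct are hypothetical, and the kicker side shown here, which sets the bits and skips the kick when one is already pending, is an assumption about how the flags get set rather than something shown in the listing above.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag values for illustration; the kernel defines its own NOHZ_* bits. */
#define MODEL_KICK_BALANCE 0x1u
#define MODEL_KICK_STATS   0x2u
#define MODEL_KICK_MASK    (MODEL_KICK_BALANCE | MODEL_KICK_STATS)

struct model_rq {
    atomic_uint nohz_flags;         /* set by the kicker, cleared by the kickee */
    unsigned int nohz_idle_balance; /* work recorded for the softirq handler */
    bool softirq_raised;            /* stands in for raise_softirq_irqoff() */
};

/* Kicker side (assumed): publish the request.
 * Returns false if a kick was already pending, so no second IPI is needed. */
static bool model_kick(struct model_rq *rq, unsigned int flags)
{
    unsigned int old = atomic_fetch_or(&rq->nohz_flags, flags);
    return (old & MODEL_KICK_MASK) == 0;
}

/* Kickee side: atomically fetch the old flags and clear the kick bits,
 * then hand the work to the softirq only if the CPU is still idle. */
static void model_csd_func(struct model_rq *rq, bool cpu_is_idle)
{
    unsigned int flags = atomic_fetch_and(&rq->nohz_flags, ~MODEL_KICK_MASK);

    if (cpu_is_idle) {
        rq->nohz_idle_balance = flags & MODEL_KICK_MASK;
        rq->softirq_raised = true;
    }
}
```

Fetching and clearing in one atomic step means a kick that arrives while the handler is running is not lost: it either lands before the clear and is consumed, or after it and stays pending for the next round.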

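2. A busy CPU issues the request and wakes an idle CPU via IPI

The entry point for this logic is nohz_balancer_kick, which a busy CPU calls periodically, on each clock tick, from the function trigger_load_balance: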

    /* file: kernel/sched/fair.c */

    void trigger_load_balance(struct rq *rq)
    {
        /*
        * Don't need to rebalance while attached to NULL domain or
        * runqueue CPU is not active
        */
        if (unlikely(on_null_domain(rq) || !cpu_active(cpu_of(rq))))
            return;

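        /* Check the balancing deadline: once the current time has reached the point at which
         * the next load balance is allowed, raise the SCHED_SOFTIRQ softirq. Its handler,
         * run_rebalance_domains, is registered in init_sched_fair_class. */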
        if (time_after_eq(jiffies, rq->next_balance))
            raise_softirq(SCHED_SOFTIRQ);

        /* kick the NOHZ load balancing logic */
        nohz_balancer_kick(rq);
    }

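The function nohz_balancer_kick first looks at the tasks on its own queue to decide whether it needs to wake an idle CPU to share the load. If so, it wakes the idle CPU via IPI, and the woken CPU pulls tasks from busy CPUs. The kernel calls the busy CPU the kicker and the target idle CPU the kickee; intuitively, the busy CPU kicks a sleeping idle CPU awake and puts it to work.

nohz_balancer_kick runs on the busy CPU. Its main logic inspects the CPU's own queue (rq) and, if necessary, ends by calling kick_ilb to wake an idle CPU. Here is a brief look at kick_ilb: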

    /* file: kernel/sched/fair.c */

    /*
    * Kick a CPU to do the nohz balancing, if it is time for it. We pick any
    * idle CPU in the HK_FLAG_MISC housekeeping set (if there is one).
    */
    static void kick_ilb(unsigned int flags)
    {
        int ilb_cpu;

        /* find an idle CPU */
        ilb_cpu = find_new_ilb();

        /* notify the target idle CPU via IPI */
        smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
    }

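kick_ilb ultimately causes the function registered in the target CPU's rq->nohz_csd (nohz_csd_func) to be invoked asynchronously on that CPU, which completes the wakeup.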
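
The kicker/kickee selection can be sketched as a search over a bitmask of idle CPUs. This is a simplified user-space model under stated assumptions: the model_* names are hypothetical, the idle set is a plain 32-bit mask instead of the kernel's cpumask, housekeeping restrictions are ignored, and setting a bit in kicked_mask stands in for sending the IPI.

```c
#include <stdint.h>

#define MODEL_NR_CPUS 8

/* Return the first idle CPU other than the caller, or MODEL_NR_CPUS if none.
 * Skipping the caller is a modelling choice: the kicker is busy by definition. */
static int model_find_new_ilb(uint32_t idle_mask, int this_cpu)
{
    for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
        if (cpu != this_cpu && (idle_mask & (UINT32_C(1) << cpu)))
            return cpu;
    }
    return MODEL_NR_CPUS;
}

/* Pick a kickee and "send the IPI". Returns the kicked CPU, or -1 if
 * there is no idle CPU to kick. */
static int model_kick_ilb(uint32_t idle_mask, int this_cpu, uint32_t *kicked_mask)
{
    int ilb_cpu = model_find_new_ilb(idle_mask, this_cpu);

    if (ilb_cpu >= MODEL_NR_CPUS)
        return -1;

    *kicked_mask |= UINT32_C(1) << ilb_cpu; /* stands in for the async IPI */
    return ilb_cpu;
}
```

For example, with CPUs 2 and 3 idle (mask 0x0C) and CPU 0 as the kicker, model_kick_ilb selects CPU 2. Only one idle CPU is kicked; that CPU then balances on behalf of the other idle CPUs.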

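3. The woken idle CPU performs load balancing through nohz_idle_balance

From the previous two steps we know that NOHZ load balancing also ends by waking an idle CPU and entering the SCHED_SOFTIRQ handler. The entry point is the same as for the periodic load balancing analyzed in the previous subsection, run_rebalance_domains. Here is that function again: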

    /* file: kernel/sched/fair.c */

    static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
    {
        struct rq *this_rq = this_rq();
        enum cpu_idle_type idle =
            this_rq->idle_balance ? CPU_IDLE : CPU_NOT_IDLE;

        /* a woken idle CPU balances through this function, after which we return directly */
        if (nohz_idle_balance(this_rq, idle))
            return;

        /* normal load balance */
        update_blocked_averages(this_rq->cpu);
        rebalance_domains(this_rq, idle);
    }

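A woken idle CPU performs load balancing through nohz_idle_balance, which balances on behalf of all idle CPUs. The balancing logic is broadly the same as in the previous section, so we do not expand on it here.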
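
The idea of one woken CPU balancing on behalf of all idle CPUs can be sketched as a loop over the idle set. This is a rough user-space model, not the kernel's implementation: the model_* names, the per-CPU struct, and the fixed interval are all hypothetical, and incrementing a counter stands in for calling rebalance_domains for that CPU.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_CPUS 4

struct model_cpu {
    unsigned long next_balance; /* earliest tick at which this CPU may balance */
    int balance_runs;           /* how many times it has been balanced */
};

/* Wraparound-safe deadline check. */
static bool model_due(unsigned long now, unsigned long deadline)
{
    return (long)(now - deadline) >= 0;
}

/* Balance every idle CPU whose deadline has passed.
 * Returns the number of CPUs that were balanced. */
static int model_nohz_idle_balance(struct model_cpu cpus[], uint32_t idle_mask,
                                   unsigned long now, unsigned long interval)
{
    int balanced = 0;

    for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
        if (!(idle_mask & (UINT32_C(1) << cpu)))
            continue; /* busy CPUs balance from their own tick */
        if (!model_due(now, cpus[cpu].next_balance))
            continue; /* balanced recently enough */

        cpus[cpu].balance_runs++; /* stands in for rebalance_domains() */
        cpus[cpu].next_balance = now + interval;
        balanced++;
    }
    return balanced;
}
```

Each idle CPU keeps its own deadline, so a second call at the same tick does no work; the tickless CPUs still get balanced at roughly their usual rate even though none of them is running a timer.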

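2.8.5.3. newidle balance

A CPU can also initiate load balancing itself just before going idle, trying to pull some tasks over from other CPUs; if nothing can be pulled, it is not too late to enter idle then. The entry point for this logic is newidle_balance, whose main logic is: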

/* file: kernel/sched/fair.c */

/*
 * newidle_balance is called by schedule() if this_cpu is about to become
 * idle. Attempts to pull tasks from other CPUs.
 *
 * Returns:
 *   < 0 - we released the lock and there are !fair tasks present
 *     0 - failed, no new tasks
 *   > 0 - success, new (fair) tasks present
 */
static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
{
    unsigned long next_balance = jiffies + HZ;
    int this_cpu = this_rq->cpu;
    struct sched_domain *sd;
    int pulled_task = 0;
    u64 curr_cost = 0;

    for_each_domain(this_cpu, sd)
    {
        int continue_balancing = 1;
        u64 t0, domain_cost;

        if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
            update_next_balance(sd, &next_balance);
            break;
        }

        if (sd->flags & SD_BALANCE_NEWIDLE) {
            t0 = sched_clock_cpu(this_cpu);

            pulled_task = load_balance(this_cpu, this_rq, sd,
                                       CPU_NEWLY_IDLE,
                                       &continue_balancing);

            domain_cost = sched_clock_cpu(this_cpu) - t0;
            if (domain_cost > sd->max_newidle_lb_cost)
                sd->max_newidle_lb_cost = domain_cost;

            curr_cost += domain_cost;
        }

        update_next_balance(sd, &next_balance);

        /*
         * Stop searching for tasks to pull if there are
         * now runnable tasks on this rq.
         */
        if (pulled_task || this_rq->nr_running > 0)
            break;
    }

    /* other code omitted */

    return pulled_task;
}

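Only the core logic is kept here. The function also walks the CPU's scheduling domains from the bottom up and balances each level in turn, much like rebalance_domains described earlier. Before each level it checks whether the CPU's expected idle time (avg_idle) still covers the cost accumulated so far plus that domain's worst observed balancing cost; if not, it stops, because the attempt would likely cost more than the idle time available.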
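
The cost budget in that loop can be sketched as follows. This is a user-space model under stated assumptions: the model_* names are hypothetical, sim_cost and sim_pulled replace real measurements and the real load_balance call, and the SD_BALANCE_NEWIDLE flag check and the this_rq->nr_running check are left out.

```c
#include <stdint.h>

struct model_domain {
    uint64_t max_newidle_lb_cost; /* worst balancing cost seen so far, in ns */
    uint64_t sim_cost;            /* simulated cost of this attempt, in ns */
    int sim_pulled;               /* simulated number of tasks pulled */
};

/* Walk the domains bottom-up. Returns the number of pulled tasks and
 * reports through *visited how many domains were actually balanced. */
static int model_newidle_balance(struct model_domain *sd, int nr_domains,
                                 uint64_t avg_idle, int *visited)
{
    uint64_t curr_cost = 0;
    int pulled = 0;

    *visited = 0;
    for (int i = 0; i < nr_domains; i++) {
        /* Stop when the expected idle time cannot pay for another attempt. */
        if (avg_idle < curr_cost + sd[i].max_newidle_lb_cost)
            break;

        pulled = sd[i].sim_pulled; /* stands in for load_balance() */
        (*visited)++;

        /* Remember the worst cost so future budget checks are conservative. */
        if (sd[i].sim_cost > sd[i].max_newidle_lb_cost)
            sd[i].max_newidle_lb_cost = sd[i].sim_cost;
        curr_cost += sd[i].sim_cost;

        /* Stop climbing as soon as something was pulled. */
        if (pulled)
            break;
    }
    return pulled;
}
```

With avg_idle = 500 and three domains whose recorded costs are 100, 300 and 500, the model balances the first two levels (spending 150 + 200) and then stops, because 350 + 500 exceeds the budget.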

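A typical caller of newidle_balance is the scheduler's task selection path: if the CPU's queue is empty when the scheduler picks the next task, it calls newidle_balance directly to pull tasks from other CPUs: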

/* file: kernel/sched/fair.c */

struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev,
                                        struct rq_flags *rf)
{
    struct cfs_rq *cfs_rq = &rq->cfs;
    struct sched_entity *se;
    struct task_struct *p;
    int new_tasks;

again:
    if (!sched_fair_runnable(rq))
        goto idle;

    /* main task selection code omitted */

idle:
    if (!rf)
        return NULL;

    /* trigger load balancing and pull tasks from other CPUs */
    new_tasks = newidle_balance(rq, rf);

    /*
     * Because newidle_balance() releases (and re-acquires) rq->lock, it is
     * possible for any higher priority task to appear. In that case we
     * must re-start the pick_next_entity() loop.
     */
    if (new_tasks < 0)
        return RETRY_TASK;

    /* if tasks were pulled successfully, try picking a task again */
    if (new_tasks > 0)
        goto again;

    return NULL;
}
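
The three-way handling of newidle_balance's return value in the listing above can be sketched as a small state machine. This is a hedged user-space model: the model_* names and the enum are hypothetical, and the results array scripts what each successive balancing attempt would return instead of calling real balancing code.

```c
enum model_pick {
    MODEL_PICK_NONE,  /* nothing to run: the CPU goes idle (NULL in the kernel) */
    MODEL_PICK_RETRY, /* a higher-priority class has work (RETRY_TASK) */
    MODEL_PICK_FAIR   /* a fair task was found and picked */
};

/* results[] scripts the return value of each newidle balance attempt:
 * < 0 means non-fair tasks appeared, 0 means nothing was pulled,
 * > 0 is the number of fair tasks pulled. *attempts counts the calls made. */
static enum model_pick model_pick_next(int nr_running, const int *results,
                                       int nr_results, int *attempts)
{
    *attempts = 0;
    for (;;) {
        if (nr_running > 0)
            return MODEL_PICK_FAIR;  /* runnable fair task: pick it */
        if (*attempts >= nr_results)
            return MODEL_PICK_NONE;  /* script exhausted: treat as idle */

        int new_tasks = results[(*attempts)++];

        if (new_tasks < 0)
            return MODEL_PICK_RETRY; /* restart the pick from a higher class */
        if (new_tasks == 0)
            return MODEL_PICK_NONE;  /* nothing pulled: go idle */

        nr_running += new_tasks;     /* pulled tasks: loop, like "goto again" */
    }
}
```

The loop makes the ordering explicit: balancing is attempted only when the queue is empty, and a successful pull sends control back to the normal selection path rather than returning a task directly.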
