[Kernel] Why is swapper non-preemptible??

A few days ago a colleague asked me why swapper always shows up as non-preemptible in their logs (preempt_count > 0).

Which does sound pretty weird, right? If the idle task can't be preempted, doesn't that mean nobody can ever take the CPU away from it, and the CPU just sits there daydreaming forever? XD
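
Quick refresher first: "non-preemptible" here just means the task's preempt_count is non-zero. The check in include/linux/preempt.h boils down to roughly this (simplified, CONFIG_PREEMPT_COUNT case):

/* Simplified: a context may only be preempted when preempt_count is 0
 * and interrupts are enabled. */
#define preemptible()  (preempt_count() == 0 && !irqs_disabled())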

So let's go trace the code!

Once a CPU is brought up it ends up in here, and from then on it just keeps running cpu_idle_loop():

void cpu_startup_entry(enum cpuhp_state state)
{
  /*
   * This #ifdef needs to die, but it's too late in the cycle to
   * make this generic (arm and sh have never invoked the canary
   * init for the non boot cpus!). Will be fixed in 3.11
   */
#ifdef CONFIG_X86
  /*
   * If we're the non-boot CPU, nothing set the stack canary up
   * for us. The boot CPU already has it initialized but no harm
   * in doing it again. This is a good place for updating it, as
   * we wont ever return from this function (so the invalid
   * canaries already on the stack wont ever trigger).
   */
  boot_init_stack_canary();
#endif
  arch_cpu_idle_prepare();
  cpuhp_online_idle(state);
  cpu_idle_loop();
}
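
For context, the boot CPU reaches this function from the tail of rest_init() in init/main.c (secondary CPUs get here from their arch-specific startup code), and the comment there already hints that preemption is disabled on the way in. From roughly the same kernel era, trimmed down:

  /* Tail of rest_init(), init/main.c (trimmed) */

  /*
   * The boot idle thread must execute schedule()
   * at least once to get things moving:
   */
  init_idle_bootup_task(current);
  schedule_preempt_disabled();
  /* Call into cpu_idle with preempt disabled */
  cpu_startup_entry(CPUHP_ONLINE);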

Next comes cpu_idle_loop() itself: the inner loop keeps calling need_resched() to check whether a reschedule is needed.

When one is needed, it falls out of that inner while loop and then sets the PREEMPT_NEED_RESCHED bit.
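
Before looking at the loop itself, note that need_resched() is just a check of the TIF_NEED_RESCHED flag on the current task. The generic definitions look roughly like this (simplified; some architectures instead keep this flag folded into preempt_count, which is what the PREEMPT_NEED_RESCHED comment inside the loop is about):

/* Simplified from include/linux/thread_info.h and include/linux/sched.h */
#define tif_need_resched() test_thread_flag(TIF_NEED_RESCHED)

static __always_inline bool need_resched(void)
{
  return unlikely(tif_need_resched());
}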

/*
 * Generic idle loop implementation
 *
 * Called with polling cleared.
 */
static void cpu_idle_loop(void)
{
  int cpu = smp_processor_id();

  while (1) {
    /*
     * If the arch has a polling bit, we maintain an invariant:
     *
     * Our polling bit is clear if we're not scheduled (i.e. if
     * rq->curr != rq->idle).  This means that, if rq->idle has
     * the polling bit set, then setting need_resched is
     * guaranteed to cause the cpu to reschedule.
     */

    __current_set_polling();
    quiet_vmstat();
    tick_nohz_idle_enter();

    while (!need_resched()) {
      check_pgt_cache();
      rmb();

      if (cpu_is_offline(cpu)) {
        cpuhp_report_idle_dead();
        arch_cpu_idle_dead();
      }

      local_irq_disable();
      arch_cpu_idle_enter();

      /*
       * In poll mode we reenable interrupts and spin.
       *
       * Also if we detected in the wakeup from idle
       * path that the tick broadcast device expired
       * for us, we don't want to go deep idle as we
       * know that the IPI is going to arrive right
       * away
       */
      if (cpu_idle_force_poll || tick_check_broadcast_expired())
        cpu_idle_poll();
      else
        cpuidle_idle_call();

      arch_cpu_idle_exit();
    }

    /*
     * Since we fell out of the loop above, we know
     * TIF_NEED_RESCHED must be set, propagate it into
     * PREEMPT_NEED_RESCHED.
     *
     * This is required because for polling idle loops we will
     * not have had an IPI to fold the state for us.
     */
    preempt_set_need_resched();
    tick_nohz_idle_exit();
    __current_clr_polling();

    /*
     * We promise to call sched_ttwu_pending and reschedule
     * if need_resched is set while polling is set.  That
     * means that clearing polling needs to be visible
     * before doing these things.
     */
    smp_mb__after_atomic();

    sched_ttwu_pending();
    schedule_preempt_disabled();
  }
}

The last line of the loop is the call to schedule_preempt_disabled().

That eventually goes into schedule() to hand the CPU over to some other task; what happens inside schedule() is another story.

And then, some day, some month, some year, the idle task (swapper) gets the CPU back after one of those schedule() calls, and preemption gets disabled all over again!!!

void __sched schedule_preempt_disabled(void)
{
  sched_preempt_enable_no_resched();
  schedule();
  preempt_disable();
}
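
This pairing is exactly where the preempt_count > 0 from the logs comes from: sched_preempt_enable_no_resched() drops the count just long enough to run schedule(), and preempt_disable() bumps it right back up the moment the idle task is chosen to run again. Both are plain preempt_count adjustments, roughly (simplified from include/linux/preempt.h):

/* Decrement preempt_count without checking for a pending reschedule */
#define sched_preempt_enable_no_resched() \
do { \
  barrier(); \
  preempt_count_dec(); \
} while (0)

/* Increment preempt_count, making the current context non-preemptible */
#define preempt_disable() \
do { \
  preempt_count_inc(); \
  barrier(); \
} while (0)

So whenever swapper is actually running in the idle loop, i.e. anywhere outside schedule(), its preempt_count is at least 1, which matches what showed up in the logs.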

So it seems fair to conclude that the whole point of this code is to have the idle task constantly asking itself one question: “should I give the CPU to somebody else now?”

Looked at that way, being non-preemptible seems perfectly reasonable: it would be weird to get preempted halfway through deciding whether to give up the CPU… if that could happen, wouldn't it mean the “should I give the CPU to somebody else” logic above is broken? XD