Reposted from https://blog.csdn.net/shift_wwx/article/details/121593698
with additional reference to https://jekton.github.io/2019/03/21/android9-lmk-lmkd/

This walkthrough uses Android 12 as the example.

1. Starting lmkd

Like most daemons, lmkd is started by the init process:

service lmkd /system/bin/lmkd
    class core
    user lmkd
    group lmkd system readproc
    capabilities DAC_OVERRIDE KILL IPC_LOCK SYS_NICE SYS_RESOURCE
    critical
    socket lmkd seqpacket+passcred 0660 system system
    writepid /dev/cpuset/system-background/tasks

on property:lmkd.reinit=1
    exec_background /system/bin/lmkd --reinit

The lmkd socket created here has user and group system and mode 0660, so only processes running as system can read and write it (typically the activity manager).
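To make the socket line concrete, here is a minimal sketch of how a client could connect to a seqpacket UNIX socket such as /dev/socket/lmkd. The helper name connect_seqpacket is hypothetical and is not part of lmkd or the framework; on a device the call would fail with a permission error unless the caller runs as user or group system.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Hypothetical helper: connect to a SOCK_SEQPACKET UNIX socket at `path`.
 * Returns the connected fd, or -1 on failure (for example EACCES when the
 * caller lacks permission, or ENOENT when the socket does not exist).
 */
int connect_seqpacket(const char *path) {
    struct sockaddr_un addr;
    int fd;

    if (strlen(path) >= sizeof(addr.sun_path)) {
        return -1; /* path does not fit into sockaddr_un */
    }

    /* seqpacket matches the "socket lmkd seqpacket+passcred" line in the rc file */
    fd = socket(AF_UNIX, SOCK_SEQPACKET, 0);
    if (fd < 0) {
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

On a device the path would be /dev/socket/lmkd; anywhere else the function simply returns -1.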

The writepid line that follows is related to Linux cgroups: it writes lmkd's PID into the tasks file of the system-background cpuset.

2. The main function

int main(int argc, char **argv) {
    if ((argc > 1) && argv[1] && !strcmp(argv[1], "--reinit")) {
        if (property_set(LMKD_REINIT_PROP, "0")) {
            ALOGE("Failed to reset " LMKD_REINIT_PROP " property");
        }
        return issue_reinit();
    }
    // step 1: read the configured properties in preparation for init
    update_props();

    ctx = create_android_logger(KILLINFO_LOG_TAG);

    if (!init()) { // step 2: init performs all of the core initialization
        if (!use_inkernel_interface) { // step 3: only when the in-kernel LMK driver is not used
            /*
             * MCL_ONFAULT pins pages as they fault instead of loading
             * everything immediately all at once. (Which would be bad,
             * because as of this writing, we have a lot of mapped pages we
             * never use.) Old kernels will see MCL_ONFAULT and fail with
             * EINVAL; we ignore this failure.
             *
             * N.B. read the man page for mlockall. MCL_CURRENT | MCL_ONFAULT
             * pins ⊆ MCL_CURRENT, converging to just MCL_CURRENT as we fault
             * in pages.
             */
            /* CAP_IPC_LOCK required */
            // step 4: lock the virtual address space to prevent swapping
            if (mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT) && (errno != EINVAL)) {
                ALOGW("mlockall failed %s", strerror(errno));
            }

            /* CAP_NICE required */
            // step 4 (continued): switch to the SCHED_FIFO (first in, first out) scheduling policy
            struct sched_param param = {
                    .sched_priority = 1,
            };
            if (sched_setscheduler(0, SCHED_FIFO, &param)) {
                ALOGW("set SCHED_FIFO failed %s", strerror(errno));
            }
        }
        // step 5: enter the main loop and wait for epoll events
        mainloop();
    }

    android_log_destroy(&ctx);

    ALOGI("exiting");
    return 0;
}

The basic flow is described by the comments in the code above. The core of lmkd lies in step 2 and step 5, which are covered separately below.

The mlockall call deserves a closer look:

    if (mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT) && (errno != EINVAL)) {
        ALOGW("mlockall failed %s", strerror(errno));
    }

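mlockall locks the calling process's entire virtual address space into RAM, so that none of it can be swapped out to secondary storage.
The locked pages include the code, data, and stack segments, shared libraries, shared memory, user space kernel data, and memory-mapped files. Without MCL_ONFAULT, all mapped pages are resident in memory when the call returns successfully.
flags is a bitwise OR of the following:
- MCL_CURRENT: lock all pages currently mapped into the process's address space.
- MCL_FUTURE: lock all pages that will be mapped into the process's address space in the future.
- MCL_ONFAULT (Linux 4.4 and later): used together with the flags above; pages are locked as they are faulted in instead of all being loaded up front. This is the flag that the comment in main describes.
Return value: 0 on success, -1 on error (with errno set).
The call has two main applications: real-time algorithms and high-security data processing.
- real-time algorithms: these have very strict timing requirements, and a page fault would add unpredictable delay.
- high-security data processing: locking keeps sensitive data from being written to swap.
If the process calls one of the execve family of functions, all locks are removed.
Memory locks are not inherited by child processes.
Memory locks do not stack: even if mlockall is called several times, a single munlockall (or munlock on the corresponding range) unlocks the pages.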
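The rules above can be tried out with a small sketch. The two helper names below are hypothetical; the logic follows the lmkd snippet in that an EINVAL failure (an old kernel rejecting MCL_ONFAULT) is ignored. On a regular desktop the lock may fail with EPERM or ENOMEM because of RLIMIT_MEMLOCK, which is why lmkd is granted the IPC_LOCK capability.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: lock current and future pages, preferring on-fault
 * locking when the headers know about it.
 * Returns 0 when the lock succeeded or the kernel rejected the flags with
 * EINVAL (ignored, as in lmkd), and -1 on any other failure.
 */
int lock_all_memory(void) {
    int flags = MCL_CURRENT | MCL_FUTURE;
#ifdef MCL_ONFAULT
    flags |= MCL_ONFAULT; /* lock pages lazily, as they are faulted in */
#endif
    if (mlockall(flags) != 0 && errno != EINVAL) {
        fprintf(stderr, "mlockall failed: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}

/*
 * Locks do not stack, so one munlockall() undoes any number of earlier
 * mlockall() calls. Returns 0 on success, -1 on error.
 */
int unlock_all_memory(void) {
    return munlockall();
}
```

Calling lock_all_memory() twice and unlock_all_memory() once leaves the process fully unlocked, which is the "no stacking" rule in practice.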

3. update_props()

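This function is fairly simple: it reads the values of system properties and falls back to defaults for any that are unset. Note that ro.config.per_app_memcg defaults to the value of ro.config.low_ram, so on a low-memory device it is also true unless set otherwise.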

static void update_props() {
    /* By default disable low level vmpressure events */
    level_oomadj[VMPRESS_LEVEL_LOW] =
        property_get_int32("ro.lmk.low", OOM_SCORE_ADJ_MAX + 1);
    level_oomadj[VMPRESS_LEVEL_MEDIUM] =
        property_get_int32("ro.lmk.medium", 800);
    level_oomadj[VMPRESS_LEVEL_CRITICAL] =
        property_get_int32("ro.lmk.critical", 0);
    debug_process_killing = property_get_bool("ro.lmk.debug", false);

    /* By default disable upgrade/downgrade logic */
    enable_pressure_upgrade =
        property_get_bool("ro.lmk.critical_upgrade", false);
    upgrade_pressure =
        (int64_t)property_get_int32("ro.lmk.upgrade_pressure", 100);
    downgrade_pressure =
        (int64_t)property_get_int32("ro.lmk.downgrade_pressure", 100);
    kill_heaviest_task =
        property_get_bool("ro.lmk.kill_heaviest_task", false);
    low_ram_device = property_get_bool("ro.config.low_ram", false);
    kill_timeout_ms =
        (unsigned long)property_get_int32("ro.lmk.kill_timeout_ms", 100);
    use_minfree_levels =
        property_get_bool("ro.lmk.use_minfree_levels", false);
    per_app_memcg =
        property_get_bool("ro.config.per_app_memcg", low_ram_device);
    swap_free_low_percentage = clamp(0, 100, property_get_int32("ro.lmk.swap_free_low_percentage",
        DEF_LOW_SWAP));
    psi_partial_stall_ms = property_get_int32("ro.lmk.psi_partial_stall_ms",
        low_ram_device ? DEF_PARTIAL_STALL_LOWRAM : DEF_PARTIAL_STALL);
    psi_complete_stall_ms = property_get_int32("ro.lmk.psi_complete_stall_ms",
        DEF_COMPLETE_STALL);
    thrashing_limit_pct = max(0, property_get_int32("ro.lmk.thrashing_limit",
        low_ram_device ? DEF_THRASHING_LOWRAM : DEF_THRASHING));
    thrashing_limit_decay_pct = clamp(0, 100, property_get_int32("ro.lmk.thrashing_limit_decay",
        low_ram_device ? DEF_THRASHING_DECAY_LOWRAM : DEF_THRASHING_DECAY));
    thrashing_critical_pct = max(0, property_get_int32("ro.lmk.thrashing_limit_critical",
        thrashing_limit_pct * 2));
    swap_util_max = clamp(0, 100, property_get_int32("ro.lmk.swap_util_max", 100));
    filecache_min_kb = property_get_int64("ro.lmk.filecache_min_kb", 0);
}
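The pattern used throughout update_props is "read a property, fall back to a default, then clamp the result into a valid range". The sketch below imitates that pattern on a plain string so it can run off-device; int_prop_or_default and clamp_i32 are hypothetical stand-ins, not the Android property_get_int32 API or lmkd's clamp macro, although clamp_i32 keeps the same (low, high, value) argument order seen in the source.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for a property read: parse `value` as a decimal
 * integer, or return `def` when the property is unset (NULL), empty, or
 * not a clean integer. Overflow handling is omitted for brevity.
 */
int32_t int_prop_or_default(const char *value, int32_t def) {
    char *end;
    long parsed;

    if (value == NULL || *value == '\0') {
        return def;
    }
    parsed = strtol(value, &end, 10);
    if (*end != '\0') {
        return def; /* trailing garbage: treat as unset */
    }
    return (int32_t)parsed;
}

/* Limit `value` to the inclusive range [low, high]. */
int32_t clamp_i32(int32_t low, int32_t high, int32_t value) {
    if (value < low) {
        return low;
    }
    if (value > high) {
        return high;
    }
    return value;
}
```

For a percentage such as ro.lmk.swap_util_max, clamp_i32(0, 100, int_prop_or_default(raw, 100)) guarantees a value between 0 and 100 even if the property was set to something out of range.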

4. init

init contains a lot of code, so it is analyzed step by step here.

4.1 step1: create the epoll instance

 epollfd = epoll_create(MAX_EPOLL_EVENTS);
    if (epollfd == -1) {
        ALOGE("epoll_create failed (errno=%d)", errno);
        return -1;
    }

All of lmkd is built on the epoll mechanism. Room is reserved here for 9 events (1 + 3 + 3 + 1 + 1):

/*
 * 1 ctrl listen socket, 3 ctrl data socket, 3 memory pressure levels,
 * 1 lmk events + 1 fd to wait for process death
 */
#define MAX_EPOLL_EVENTS (1 + MAX_DATA_CONN + VMPRESS_LEVEL_COUNT + 1 + 1)

4.2 step2: initialize the lmkd socket

    ctrl_sock.sock = android_get_control_socket("lmkd");
    if (ctrl_sock.sock < 0) {
        ALOGE("get lmkd control socket failed");
        return -1;
    }

    ret = listen(ctrl_sock.sock, MAX_DATA_CONN);
    if (ret < 0) {
        ALOGE("lmkd control socket listen failed (errno=%d)", errno);
        return -1;
    }

    epev.events = EPOLLIN;
    ctrl_sock.handler_info.handler = ctrl_connect_handler;
    epev.data.ptr = (void *)&(ctrl_sock.handler_info);
    if (epoll_ctl(epollfd, EPOLL_CTL_ADD, ctrl_sock.sock, &epev) == -1) {
        ALOGE("epoll_ctl for lmkd control socket failed (errno=%d)", errno);
        return -1;
    }
    maxevents++;

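ctrl_sock mainly stores the fd and the handler info of the lmkd socket. Pay particular attention to ctrl_connect_handler() here.

This function is the handler invoked when there is activity on the listening socket /dev/socket/lmkd, that is, when a client connects. lmkd's client, AMS.mProcessList, communicates with lmkd through /dev/socket/lmkd.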
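The registration above stores a pointer to the handler info in epev.data.ptr, and the main loop later calls through that pointer. The sketch below shows the same dispatch pattern in isolation, using a pipe instead of the lmkd socket so that it runs anywhere on Linux. The struct and function names are illustrative assumptions and only mirror the general shape of lmkd's handler info.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Illustrative handler record: a callback plus an int payload. */
struct demo_handler_info {
    int data;
    void (*handler)(int data, uint32_t events);
};

static int last_data = -1; /* records which payload the handler saw */

static void demo_handler(int data, uint32_t events) {
    if (events & EPOLLIN) {
        last_data = data;
    }
}

/*
 * Register the read end of a pipe with epoll, make it readable, then
 * dispatch through the pointer stored in epoll_event.data.ptr.
 * Returns the payload seen by the handler, or -1 on error.
 */
int dispatch_once(void) {
    static struct demo_handler_info hinfo = { 42, demo_handler };
    struct epoll_event epev, fired;
    int pipefd[2];
    int epollfd;
    int result = -1;

    last_data = -1;
    if (pipe(pipefd) < 0) {
        return -1;
    }
    epollfd = epoll_create1(0);
    if (epollfd < 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    epev.events = EPOLLIN;
    epev.data.ptr = &hinfo; /* same idea as epev.data.ptr = &ctrl_sock.handler_info */
    if (epoll_ctl(epollfd, EPOLL_CTL_ADD, pipefd[0], &epev) == 0 &&
        write(pipefd[1], "x", 1) == 1 &&
        epoll_wait(epollfd, &fired, 1, 1000) == 1) {
        struct demo_handler_info *h = fired.data.ptr;
        h->handler(h->data, fired.events); /* dispatch to the registered callback */
        result = last_data;
    }

    close(epollfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return result;
}
```

Storing a pointer rather than an fd in the event lets one loop serve sockets, pressure monitors, and other sources without a lookup table.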

4.3 Determine whether the in-kernel LMK driver is used

 #define INKERNEL_MINFREE_PATH "/sys/module/lowmemorykiller/parameters/minfree"

   has_inkernel_module = !access(INKERNEL_MINFREE_PATH, W_OK);
   use_inkernel_interface = has_inkernel_module;

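The access call checks whether the legacy node still exists and is writable, which tells lmkd whether the kernel still uses the LMK driver (deprecated as of kernel 4.12).
This is presumably done so that Android stays compatible with older kernels.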
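Because access returns 0 on success, the source negates its result to get a boolean. A minimal sketch of that idiom, with a hypothetical helper name, looks like this:

```c
#include <stdbool.h>
#include <unistd.h>

/*
 * Hypothetical helper: true when `path` exists and is writable by the
 * calling process. access() returns 0 on success and -1 on failure, which
 * is why the lmkd source writes !access(...).
 */
bool node_is_writable(const char *path) {
    return access(path, W_OK) == 0;
}
```

On a kernel without the lowmemorykiller module, node_is_writable("/sys/module/lowmemorykiller/parameters/minfree") is false and lmkd does the killing itself in userspace.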

4.4 init_monitors

This function is the core of init. It registers either the PSI monitors (Android 11 and later) or the common adj strategy (vmpressure, before Android 11) and adds them to epoll.

static bool init_monitors() {
    /* Try to use psi monitor first if kernel has it */
    use_psi_monitors = property_get_bool("ro.lmk.use_psi", true) &&
        init_psi_monitors();
    /* Fall back to vmpressure */
    if (!use_psi_monitors &&
        (!init_mp_common(VMPRESS_LEVEL_LOW) ||
        !init_mp_common(VMPRESS_LEVEL_MEDIUM) ||
        !init_mp_common(VMPRESS_LEVEL_CRITICAL))) {
        ALOGE("Kernel does not support memory pressure events or in-kernel low memory killer");
        return false;
    }
    if (use_psi_monitors) {
        ALOGI("Using psi monitors for memory pressure detection");
    } else {
        ALOGI("Using vmpressure for memory pressure detection");
    }
    return true;
}

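The variable use_psi_monitors records whether PSI or vmpressure is used.

With vmpressure, the kill strategy is initialized through init_mp_common;
with PSI, the kill strategy is initialized through init_psi_monitors.

So for lmkd to use PSI, ro.lmk.use_psi must be true (note: the original author's statement is slightly off; the true argument passed to property_get_bool is the default value, so when the property is unset the call returns true and there is no need to set it explicitly).

In addition, lmkd supports the old-style kill strategy: either set ro.lmk.use_new_strategy to false, or set ro.lmk.use_minfree_levels to true (on non-low-memory devices, i.e. where ro.config.low_ram is not true).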
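The interaction between these properties is easier to see as a small decision function. This is a sketch under assumptions: the enum and function names are invented for illustration, each property is modeled as unset, false, or true, and kernel_has_psi stands in for init_psi_monitors succeeding.

```c
#include <stdbool.h>

/* A boolean property that may be unset, in which case a default applies. */
enum prop_state { PROP_UNSET, PROP_FALSE, PROP_TRUE };

static bool prop_bool(enum prop_state p, bool def) {
    return p == PROP_UNSET ? def : (p == PROP_TRUE);
}

enum lmk_mode {
    MODE_VMPRESSURE,       /* fallback when PSI is disabled or unavailable */
    MODE_PSI_OLD_STRATEGY, /* PSI events handled by the old-style handler */
    MODE_PSI_NEW_STRATEGY  /* PSI events handled by the new strategy */
};

/*
 * Illustrative model of the selection described in the text:
 * ro.lmk.use_psi defaults to true, and ro.lmk.use_new_strategy defaults to
 * (low_ram || !use_minfree_levels).
 */
enum lmk_mode select_mode(enum prop_state use_psi,
                          enum prop_state use_new_strategy,
                          bool low_ram, bool use_minfree_levels,
                          bool kernel_has_psi) {
    if (!(prop_bool(use_psi, true) && kernel_has_psi)) {
        return MODE_VMPRESSURE;
    }
    return prop_bool(use_new_strategy, low_ram || !use_minfree_levels)
               ? MODE_PSI_NEW_STRATEGY
               : MODE_PSI_OLD_STRATEGY;
}
```

With every property unset on a PSI-capable kernel the result is the new strategy; setting ro.lmk.use_minfree_levels to true on a non-low-memory device flips it to the old one.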

Next, let's dig into the init_psi_monitors() function.

static bool init_psi_monitors() {
    /*
     * When PSI is used on low-ram devices or on high-end devices without memfree levels
     * use new kill strategy based on zone watermarks, free swap and thrashing stats
     */
    bool use_new_strategy =
        property_get_bool("ro.lmk.use_new_strategy", low_ram_device || !use_minfree_levels);

    /* In default PSI mode override stall amounts using system properties */
    if (use_new_strategy) {
        /* Do not use low pressure level */
        psi_thresholds[VMPRESS_LEVEL_LOW].threshold_ms = 0;
        psi_thresholds[VMPRESS_LEVEL_MEDIUM].threshold_ms = psi_partial_stall_ms;
        psi_thresholds[VMPRESS_LEVEL_CRITICAL].threshold_ms = psi_complete_stall_ms;
    }
    // "mp" presumably stands for memory pressure
    if (!init_mp_psi(VMPRESS_LEVEL_LOW, use_new_strategy)) {
        return false;
    }
    if (!init_mp_psi(VMPRESS_LEVEL_MEDIUM, use_new_strategy)) {
        destroy_mp_psi(VMPRESS_LEVEL_LOW);
        return false;
    }
    if (!init_mp_psi(VMPRESS_LEVEL_CRITICAL, use_new_strategy)) {
        destroy_mp_psi(VMPRESS_LEVEL_MEDIUM);
        destroy_mp_psi(VMPRESS_LEVEL_LOW);
        return false;
    }
    return true;
}

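The function is fairly simple. The use_new_strategy variable at the top decides whether the new kill strategy or the old one is used under PSI. With the new strategy, threshold_ms in the psi_thresholds array is overwritten with the values given by props (in other words, the thresholds are configurable). Finally, init_mp_psi performs the actual registration for each level. PSI itself only has the some and full stall types, which correspond to the medium and critical levels respectively.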

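The threshold_ms values in the psi_thresholds array come from these props:

ro.lmk.psi_partial_stall_ms: defaults to 200 ms on low_ram devices and 70 ms otherwise;
ro.lmk.psi_complete_stall_ms: defaults to 700 ms.

Next, let's look at what init_mp_psi actually does: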

static bool init_mp_psi(enum vmpressure_level level, bool use_new_strategy) {
    int fd;

    /* Do not register a handler if threshold_ms is not set */
    if (!psi_thresholds[level].threshold_ms) {
        return true;
    }
    // note this call
    fd = init_psi_monitor(psi_thresholds[level].stall_type,
        psi_thresholds[level].threshold_ms * US_PER_MS,
        PSI_WINDOW_SIZE_MS * US_PER_MS);

    if (fd < 0) {
        return false;
    }

    vmpressure_hinfo[level].handler = use_new_strategy ? mp_event_psi : mp_event_common;
    vmpressure_hinfo[level].data = level;
    if (register_psi_monitor(epollfd, fd, &vmpressure_hinfo[level]) < 0) {
        destroy_psi_monitor(fd);
        return false;
    }
    maxevents++;
    mpevfd[level] = fd;

    return true;
}

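The function is fairly simple and has three main steps:

- init_psi_monitor writes the threshold for the given level to the node /proc/pressure/memory; from then on, whenever the stall time exceeds that threshold, an epoll event fires;
- depending on use_new_strategy, the handler is either the new strategy mp_event_psi or the old-style mp_event_common (see sections 8 and 10 for the detailed strategies);
- register_psi_monitor adds the /proc/pressure/memory fd to epoll for monitoring.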
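The first step can be sketched with the plain kernel PSI interface: a trigger is a string of the form "<some|full> <stall in us> <window in us>" written to /proc/pressure/memory. The helper names below are hypothetical and are not the functions lmkd actually calls; US_PER_MS is redefined locally, and the 1000 ms window in the usage note is an assumed value for illustration.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define US_PER_MS 1000

/*
 * Format a PSI trigger "<stall_type> <threshold us> <window us>" into buf.
 * Returns 0 on success, -1 if the string does not fit.
 */
int format_psi_trigger(char *buf, size_t size, const char *stall_type,
                       int threshold_ms, int window_ms) {
    int n = snprintf(buf, size, "%s %d %d", stall_type,
                     threshold_ms * US_PER_MS, window_ms * US_PER_MS);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/*
 * Open the PSI memory node and install the trigger.
 * Returns the fd to monitor, or -1 on failure (for example on kernels
 * built without PSI support, or without sufficient privileges).
 */
int open_psi_memory_monitor(const char *stall_type, int threshold_ms,
                            int window_ms) {
    char trigger[64];
    int fd;

    if (format_psi_trigger(trigger, sizeof(trigger), stall_type,
                           threshold_ms, window_ms) < 0) {
        return -1;
    }
    fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
    if (fd < 0) {
        return -1;
    }
    /* The kernel expects the NUL-terminated trigger string. */
    if (write(fd, trigger, strlen(trigger) + 1) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

For the default medium level, format_psi_trigger(buf, sizeof(buf), "some", 70, 1000) produces "some 70000 1000000". The returned fd would then be added to epoll with EPOLLPRI, which is how the kernel signals that a trigger fired.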

static void mp_event_psi(int data, uint32_t events, struct polling_params *poll_params) {
    enum reclaim_state {
        NO_RECLAIM = 0,
        KSWAPD_RECLAIM,
        DIRECT_RECLAIM,
    };

4.5 Flag that lmkd is active

/* let the others know it does support reporting kills */
property_set("sys.lmk.reportkills", "1");

4.6 Other initialization

    memset(killcnt_idx, KILLCNT_INVALID_IDX, sizeof(killcnt_idx));

    /*
     * Read zoneinfo as the biggest file we read to create and size the initial
     * read buffer and avoid memory re-allocations during memory pressure
     */
    if (reread_file(&file_data) == NULL) {
        ALOGE("Failed to read %s: %s", file_data.filename, strerror(errno));
    }

    /* check if kernel supports pidfd_open syscall */
    pidfd = TEMP_FAILURE_RETRY(pidfd_open(getpid(), 0));
    if (pidfd < 0) {
        pidfd_supported = (errno != ENOSYS);
    } else {
        pidfd_supported = true;
        close(pidfd);
    }

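The key piece here is reread_file, which reserves a buffer up front. By reading /proc/zoneinfo, it creates a buffer of the largest size that will be needed; reads of other nodes later reuse that buffer without calling malloc again. See the buf variable in reread_file() for details.

In addition, the pidfd_open call determines whether the kernel supports the pidfd_open syscall.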
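The probing trick is that any failure other than ENOSYS still proves the syscall exists. A standalone sketch of the same check, using the raw syscall so that it does not depend on a libc wrapper, might look as follows; kernel_has_pidfd_open is a hypothetical name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() under strict compiler modes */
#endif
#include <errno.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: report whether the running kernel implements
 * pidfd_open by opening a pidfd for the current process.
 */
bool kernel_has_pidfd_open(void) {
#ifdef SYS_pidfd_open
    long fd = syscall(SYS_pidfd_open, getpid(), 0);
    if (fd < 0) {
        /* Any error other than ENOSYS means the syscall itself exists. */
        return errno != ENOSYS;
    }
    close((int)fd);
    return true;
#else
    /* Build headers predate pidfd_open; treat it as unsupported. */
    return false;
#endif
}
```

When pidfd_open is available, lmkd can wait for a killed process to die through a file descriptor in epoll, which is the "fd to wait for process death" counted in MAX_EPOLL_EVENTS.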

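That largely completes the analysis of init. In summary:

- create the epoll instance, sized to monitor 9 events;
- initialize the socket /dev/socket/lmkd and add it to epoll;
- use prop ro.lmk.use_psi to decide between PSI and vmpressure;
- use prop ro.lmk.use_new_strategy, or props ro.lmk.use_minfree_levels and ro.config.low_ram, to decide between the new and the old strategy when PSI is in use;
- the new and old strategies differ mainly in the mp_event_psi and mp_event_common handlers; both ultimately rely on the node /proc/pressure/memory to learn whether memory pressure has reached the configured some/full thresholds and therefore whether an event fires;
- afterwards, the main handler for epoll events is mp_event_psi or mp_event_common.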

Android LMKD (2): Source Code Analysis 1