Reposted from https://justinwei.blog.csdn.net/article/details/122268437

10. mp_event_psi

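Corresponding to mp_event_common in Section 8, event handling under the new strategy goes through mp_event_psi. The function contains a fair amount of logic, so it is broken down step by step below.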

10.1 step0: Note the static variables

    static int64_t init_ws_refault;
    static int64_t prev_workingset_refault;
    static int64_t base_file_lru;
    static int64_t init_pgscan_kswapd;
    static int64_t init_pgscan_direct;
    static int64_t swap_low_threshold;
    static bool killing;
    static int thrashing_limit = thrashing_limit_pct;
    static struct zone_watermarks watermarks;
    static struct timespec wmark_update_tm;
    static struct wakeup_info wi;
    static struct timespec thrashing_reset_tm;
    static int64_t prev_thrash_growth = 0;
    static bool check_filecache = false;
    static int max_thrashing = 0;

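lmkd keeps state across events, and for the PSI strategy these static variables play a crucial role.

init_ws_refault: the baseline workingset refault value. Every event re-reads a subset of the fields in /proc/vmstat, including the workingset refault counter; the baseline is refreshed from that reading whenever the thrashing window is reset (see step 4);
prev_workingset_refault: the workingset refault value seen at the previous event, used to check whether workingset_refault is unchanged between two events (see step 3);
base_file_lru: the sum of inactive file and active file pages read from vmstat;
init_pgscan_kswapd: the pgscan_kswapd value from vmstat at the previous event, used at the next event to determine the reclaim state; see step 3;
init_pgscan_direct: the pgscan_direct value from vmstat at the previous event, used at the next event to determine the reclaim state together with init_pgscan_kswapd above; see step 3;
swap_low_threshold: the amount of swap to keep in reserve. Swap counts as low when free_swap read from /proc/meminfo falls below this value; see step 2 and step 6;
killing: records that the previous event found a process and a kill is in progress;
thrashing_limit: a key variable in PSI event handling that records the thrashing limit. As the code above shows, it normally equals prop ro.lmk.thrashing_limit (see Part 1 of the lmkd mechanism series), and it is restored to that value every time thrashing is reset. Under memory pressure, however, several events may fire within a short time; thrashing is then heavy and the thrashing value may exceed thrashing_limit. After a process has been chosen and killed, the limit is decayed by the percentage given by prop ro.lmk.thrashing_limit_decay; see step 7;
watermarks: the zone watermarks. They are re-read from /proc/zoneinfo once per minute and stored in this variable; see step 5;
wmark_update_tm: the time of the last watermark update; see step 5;
wi: records when the event handler was woken up;
thrashing_reset_tm: the time at which the thrashing value was last reset; see step 4;
prev_thrash_growth: the growth of workingset_refault between two vmstat readings, carried over from the previous window; see step 4.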

10.2 step1: Parse vmstat and meminfo


    if (vmstat_parse(&vs) < 0) {
        ALOGE("Failed to parse vmstat!");
        return;
    }
    /* Starting 5.9 kernel workingset_refault vmstat field was renamed workingset_refault_file */
    workingset_refault_file = vs.field.workingset_refault ? : vs.field.workingset_refault_file;

    if (meminfo_parse(&mi) < 0) {
        ALOGE("Failed to parse meminfo!");
        return;
    }

10.3 step2: Determine whether swap is sufficient

    /* Check free swap levels */
    if (swap_free_low_percentage) {
        if (!swap_low_threshold) {
            swap_low_threshold = mi.field.total_swap * swap_free_low_percentage / 100;
        }
        swap_is_low = mi.field.free_swap < swap_low_threshold;
    }

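The variable swap_free_low_percentage comes from prop ro.lmk.swap_free_low_percentage and gives the minimum percentage of swap that should remain free, in the range 0 to 100.

If the currently free swap is below that minimum amount, swap is marked as low.

The threshold swap_low_threshold is computed only once.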
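
To make the arithmetic concrete, here is a minimal sketch of the same check as a pair of standalone functions. The function names and the sample numbers are hypothetical and not part of lmkd; the values are assumed to be in pages, as mi.field.total_swap and mi.field.free_swap appear to be.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: reserve threshold for a given total swap and percentage. */
static int64_t swap_low_threshold_pages(int64_t total_swap_pages, int free_low_percentage) {
    return total_swap_pages * free_low_percentage / 100;
}

/*
 * Mirrors the step 2 check: swap is low when free swap drops below the threshold.
 * A percentage of 0 disables the check, like `if (swap_free_low_percentage)`.
 */
static bool swap_is_low(int64_t free_swap_pages, int64_t total_swap_pages,
                        int free_low_percentage) {
    if (free_low_percentage == 0) {
        return false;
    }
    return free_swap_pages < swap_low_threshold_pages(total_swap_pages, free_low_percentage);
}
```

With 524288 pages of swap and a percentage of 10, the threshold is 52428 pages (integer division), so 50000 free pages count as low while 60000 do not.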

10.4 step3: Determine the reclaim state

    /* Identify reclaim state */
    // vs.field.pgscan_direct is parsed from /proc/vmstat; init_pgscan_direct defaults to 0
    if (vs.field.pgscan_direct > init_pgscan_direct) {
        init_pgscan_direct = vs.field.pgscan_direct;
        init_pgscan_kswapd = vs.field.pgscan_kswapd;
        reclaim = DIRECT_RECLAIM;
    } else if (vs.field.pgscan_kswapd > init_pgscan_kswapd) {
        init_pgscan_kswapd = vs.field.pgscan_kswapd;
        reclaim = KSWAPD_RECLAIM;
    } else if (workingset_refault_file == prev_workingset_refault) {
        /*
         * Device is not thrashing and not reclaiming, bail out early until we see these stats
         * changing
         */
        goto no_kill;
    }

    prev_workingset_refault = workingset_refault_file;

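The current pgscan_direct and pgscan_kswapd are compared with the values recorded at the previous event to determine whether the system is in direct reclaim or kswapd reclaim. If neither counter has grown and workingset_refault has not changed either, the device is neither reclaiming nor thrashing, and the handler bails out early through no_kill.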
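
The classification can be sketched as a small function that carries the remembered counters in a struct. This is only an illustration: struct scan_baseline, identify_reclaim, and the NO_RECLAIM enumerator are names assumed for the sketch, and the early-exit refault check is left out.

```c
#include <stdint.h>

/* NO_RECLAIM is assumed here as the "nothing changed" value. */
enum reclaim_state { NO_RECLAIM = 0, KSWAPD_RECLAIM, DIRECT_RECLAIM };

/* Hypothetical container for the counters remembered between events. */
struct scan_baseline {
    int64_t pgscan_direct;
    int64_t pgscan_kswapd;
};

/* Classify reclaim activity since the previous event and update the baseline. */
static enum reclaim_state identify_reclaim(struct scan_baseline *base,
                                           int64_t pgscan_direct, int64_t pgscan_kswapd) {
    if (pgscan_direct > base->pgscan_direct) {
        /* Direct reclaim wins; both counters are refreshed. */
        base->pgscan_direct = pgscan_direct;
        base->pgscan_kswapd = pgscan_kswapd;
        return DIRECT_RECLAIM;
    }
    if (pgscan_kswapd > base->pgscan_kswapd) {
        base->pgscan_kswapd = pgscan_kswapd;
        return KSWAPD_RECLAIM;
    }
    return NO_RECLAIM;
}
```

Because the direct-reclaim branch is tested first, an event where both counters grew is reported as DIRECT_RECLAIM.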

10.5 step4: Compute thrashing

    /*
     * It's possible we fail to find an eligible process to kill (ex. no process is
     * above oom_adj_min). When this happens, we should retry to find a new process
     * for a kill whenever a new eligible process is available. This is especially
     * important for a slow growing refault case. While retrying, we should keep
     * monitoring new thrashing counter as someone could release the memory to mitigate
     * the thrashing. Thus, when thrashing reset window comes, we decay the prev thrashing
     * counter by window counts. If the counter is still greater than thrashing limit,
     * we preserve the current prev_thrash counter so we will retry kill again. Otherwise,
     * we reset the prev_thrash counter so we will stop retrying.
     */
    since_thrashing_reset_ms = get_time_diff_ms(&thrashing_reset_tm, &curr_tm);
    if (since_thrashing_reset_ms > THRASHING_RESET_INTERVAL_MS) {
        long windows_passed;
        /* Calculate prev_thrash_growth if we crossed THRASHING_RESET_INTERVAL_MS */
        prev_thrash_growth = (workingset_refault_file - init_ws_refault) * 100
                            / (base_file_lru + 1);
        windows_passed = (since_thrashing_reset_ms / THRASHING_RESET_INTERVAL_MS);
        /*
         * Decay prev_thrashing unless over-the-limit thrashing was registered in the window we
         * just crossed, which means there were no eligible processes to kill. We preserve the
         * counter in that case to ensure a kill if a new eligible process appears.
         */
        if (windows_passed > 1 || prev_thrash_growth < thrashing_limit) {
            prev_thrash_growth >>= windows_passed;
        }

        /* Record file-backed pagecache size when crossing THRASHING_RESET_INTERVAL_MS */
        base_file_lru = vs.field.nr_inactive_file + vs.field.nr_active_file;
        init_ws_refault = workingset_refault_file;
        thrashing_reset_tm = curr_tm;
        thrashing_limit = thrashing_limit_pct;
    } else {
        /* Calculate what % of the file-backed pagecache refaulted so far */
        thrashing = (workingset_refault_file - init_ws_refault) * 100 / (base_file_lru + 1);
    }
    /* Add previous cycle's decayed thrashing amount */
    thrashing += prev_thrash_growth;
    if (max_thrashing < thrashing) {
        max_thrashing = thrashing;
    }

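Overall, this code maintains the thrashing value and resets it periodically. If more than THRASHING_RESET_INTERVAL_MS (1000 by default, i.e. 1 s) has passed since the last reset, the thrashing baselines (base_file_lru, init_ws_refault, thrashing_reset_tm, thrashing_limit) are all reset.

The core computation is the thrashing percentage, i.e. the workingset refaults as a percentage of the file-backed page cache:

thrashing = (workingset_refault_file - init_ws_refault) * 100 / (base_file_lru + 1);

workingset_refault_file is the current refault value (read from vs.field.workingset_refault, or from vs.field.workingset_refault_file on kernel 5.9 and later, where the field was renamed), init_ws_refault is the refault value recorded at the last reset, and base_file_lru is the number of file pages (inactive plus active).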
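
As a worked example of this formula, the sketch below wraps it in a function; thrashing_pct is a hypothetical name and the numbers are made up for illustration.

```c
#include <stdint.h>

/*
 * Percentage of the file-backed page cache that refaulted since the baseline.
 * The +1 in the divisor avoids a division by zero when the file LRU is empty.
 */
static int64_t thrashing_pct(int64_t refault_now, int64_t refault_base, int64_t base_file_lru) {
    return (refault_now - refault_base) * 100 / (base_file_lru + 1);
}
```

With a baseline of 100000 refaults, a current count of 130000, and 99999 file pages, the result is 30000 * 100 / 100000 = 30, i.e. 30% of the file cache has refaulted within the current window.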

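Sometimes no eligible process can be found (for example, no process is above oom_adj_min). In that case the kill has to be retried as soon as a new eligible process appears. For this reason, when exactly one window has passed and prev_thrash_growth is still at or above thrashing_limit, the counter is preserved; otherwise it is decayed by right-shifting it by the number of windows passed.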
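
The decay rule for prev_thrash_growth can be sketched as follows. carry_over_growth and RESET_INTERVAL_MS are hypothetical names; the guard against very large shift counts is an addition of this sketch (shifting a 64-bit value by 64 or more is undefined in C) and is not taken from the code above. The function assumes it is only called once the interval has been exceeded and that the growth is non-negative.

```c
#include <stdint.h>

#define RESET_INTERVAL_MS 1000 /* stands in for THRASHING_RESET_INTERVAL_MS */

/* Returns the amount of thrashing growth carried into the next window. */
static int64_t carry_over_growth(int64_t growth_pct, int64_t since_reset_ms,
                                 int thrashing_limit) {
    int64_t windows_passed = since_reset_ms / RESET_INTERVAL_MS;

    /* Preserve over-the-limit growth only if exactly one window was crossed. */
    if (windows_passed > 1 || growth_pct < thrashing_limit) {
        if (windows_passed >= 63) {
            return 0; /* sketch-only guard: everything has decayed away */
        }
        growth_pct >>= windows_passed;
    }
    return growth_pct;
}
```

For a limit of 100: a growth of 120 after 1.5 s stays at 120 (kill retry is still wanted), a growth of 80 after 1.5 s is halved to 40, and a growth of 120 after 3.2 s is shifted by 3 to 15.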

10.6 step5: Recompute the watermarks once per minute

    /*
     * Refresh watermarks once per min in case user updated one of the margins.
     * TODO: b/140521024 replace this periodic update with an API for AMS to notify LMKD
     * that zone watermarks were changed by the system software.
     */
    if (watermarks.high_wmark == 0 || get_time_diff_ms(&wmark_update_tm, &curr_tm) > 60000) {
        struct zoneinfo zi;

        if (zoneinfo_parse(&zi) < 0) {
            ALOGE("Failed to parse zoneinfo!");
            return;
        }

        calc_zone_watermarks(&zi, &watermarks);
        wmark_update_tm = curr_tm;
    }

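The min, low, and high watermarks and the protection values read from /proc/zoneinfo are combined into the final watermarks, which are stored in the static struct watermarks. This is done once per minute (the first time because high_wmark is still 0, afterwards once a minute).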
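
The body of calc_zone_watermarks is not shown here. One plausible way to combine the per-zone values, assuming each zone contributes its own watermark plus its largest protection value, is sketched below. The struct layouts and the name sum_zone_watermarks are hypothetical simplifications, not lmkd's actual types.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one zone from /proc/zoneinfo. */
struct zone_sketch {
    int64_t min, low, high;  /* per-zone watermarks, in pages */
    int64_t max_protection;  /* largest entry of the zone's protection array */
};

struct watermarks_sketch {
    int64_t min_wmark, low_wmark, high_wmark;
};

/* Sum watermark plus protection over all zones (assumed combination rule). */
static void sum_zone_watermarks(const struct zone_sketch *zones, size_t count,
                                struct watermarks_sketch *out) {
    out->min_wmark = 0;
    out->low_wmark = 0;
    out->high_wmark = 0;
    for (size_t i = 0; i < count; i++) {
        out->min_wmark += zones[i].max_protection + zones[i].min;
        out->low_wmark += zones[i].max_protection + zones[i].low;
        out->high_wmark += zones[i].max_protection + zones[i].high;
    }
}
```

For two zones with (min, low, high, protection) of (100, 200, 300, 50) and (1000, 2000, 3000, 0), this yields system-wide watermarks of 1150, 2250, and 3350 pages.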

Once the watermarks are available, the code determines which watermark is breached at the time the event fires:

    /* Find out which watermark is breached if any */
    wmark = get_lowest_watermark(&mi, &watermarks);

    /*
     * Returns lowest breached watermark or WMARK_NONE.
     */
    static enum zone_watermark get_lowest_watermark(union meminfo *mi,
                                                    struct zone_watermarks *watermarks)
    {
        int64_t nr_free_pages = mi->field.nr_free_pages - mi->field.cma_free;

        if (nr_free_pages < watermarks->min_wmark) {
            return WMARK_MIN;
        }
        if (nr_free_pages < watermarks->low_wmark) {
            return WMARK_LOW;
        }
        if (nr_free_pages < watermarks->high_wmark) {
            return WMARK_HIGH;
        }
        return WMARK_NONE;
    }

The value nr_free_pages - cma_free, taken from the parsed /proc/meminfo, is compared against the watermarks.
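
The comparisons in step 6 such as wmark < WMARK_LOW rely on the ordering of the enum values. The sketch below assumes the ordering WMARK_MIN < WMARK_LOW < WMARK_HIGH < WMARK_NONE, which is consistent with the "min"/"low" strings printed in step 6; struct wmarks and lowest_breached are simplified stand-ins that take plain numbers instead of the meminfo union.

```c
#include <stdint.h>

/* Assumed ordering: the more severe the breach, the smaller the value. */
enum zone_watermark { WMARK_MIN = 0, WMARK_LOW, WMARK_HIGH, WMARK_NONE };

struct wmarks {
    int64_t min_wmark, low_wmark, high_wmark;
};

/* Same logic as get_lowest_watermark, on an already computed free page count. */
static enum zone_watermark lowest_breached(int64_t free_pages, const struct wmarks *w) {
    if (free_pages < w->min_wmark) {
        return WMARK_MIN;
    }
    if (free_pages < w->low_wmark) {
        return WMARK_LOW;
    }
    if (free_pages < w->high_wmark) {
        return WMARK_HIGH;
    }
    return WMARK_NONE;
}
```

With watermarks of 1000/2000/3000 pages, 1500 free pages give WMARK_LOW. Under this ordering, wmark < WMARK_HIGH therefore means "the low or the min watermark is breached", and wmark < WMARK_LOW means "the min watermark is breached".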

10.7 step6: Determine the kill reason and min_score_adj

    /*
     * TODO: move this logic into a separate function
     * Decide if killing a process is necessary and record the reason
     */
    if (cycle_after_kill && wmark < WMARK_LOW) {
        /*
         * Prevent kills not freeing enough memory which might lead to OOM kill.
         * This might happen when a process is consuming memory faster than reclaim can
         * free even after a kill. Mostly happens when running memory stress tests.
         */
        kill_reason = PRESSURE_AFTER_KILL;
        strncpy(kill_desc, "min watermark is breached even after kill", sizeof(kill_desc));
    } else if (level == VMPRESS_LEVEL_CRITICAL && events != 0) {
        /*
         * Device is too busy reclaiming memory which might lead to ANR.
         * Critical level is triggered when PSI complete stall (all tasks are blocked because
         * of the memory congestion) breaches the configured threshold.
         */
        kill_reason = NOT_RESPONDING;
        strncpy(kill_desc, "device is not responding", sizeof(kill_desc));
    } else if (swap_is_low && thrashing > thrashing_limit_pct) {
        /* Page cache is thrashing while swap is low */
        kill_reason = LOW_SWAP_AND_THRASHING;
        snprintf(kill_desc, sizeof(kill_desc), "device is low on swap (%" PRId64
            "kB < %" PRId64 "kB) and thrashing (%" PRId64 "%%)",
            mi.field.free_swap * page_k, swap_low_threshold * page_k, thrashing);
        /* Do not kill perceptible apps unless below min watermark or heavily thrashing */
        if (wmark > WMARK_MIN && thrashing < thrashing_critical_pct) {
            min_score_adj = PERCEPTIBLE_APP_ADJ + 1;
        }
        check_filecache = true;
    } else if (swap_is_low && wmark < WMARK_HIGH) {
        /* Both free memory and swap are low */
        kill_reason = LOW_MEM_AND_SWAP;
        snprintf(kill_desc, sizeof(kill_desc), "%s watermark is breached and swap is low (%"
            PRId64 "kB < %" PRId64 "kB)", wmark < WMARK_LOW ? "min" : "low",
            mi.field.free_swap * page_k, swap_low_threshold * page_k);
        /* Do not kill perceptible apps unless below min watermark or heavily thrashing */
        if (wmark > WMARK_MIN && thrashing < thrashing_critical_pct) {
            min_score_adj = PERCEPTIBLE_APP_ADJ + 1;
        }
    } else if (wmark < WMARK_HIGH && swap_util_max < 100 &&
               (swap_util = calc_swap_utilization(&mi)) > swap_util_max) {
        /*
         * Too much anon memory is swapped out but swap is not low.
         * Non-swappable allocations created memory pressure.
         */
        kill_reason = LOW_MEM_AND_SWAP_UTIL;
        snprintf(kill_desc, sizeof(kill_desc), "%s watermark is breached and swap utilization"
            " is high (%d%% > %d%%)", wmark < WMARK_LOW ? "min" : "low",
            swap_util, swap_util_max);
    } else if (wmark < WMARK_HIGH && thrashing > thrashing_limit) {
        /* Page cache is thrashing while memory is low */
        kill_reason = LOW_MEM_AND_THRASHING;
        snprintf(kill_desc, sizeof(kill_desc), "%s watermark is breached and thrashing (%"
            PRId64 "%%)", wmark < WMARK_LOW ? "min" : "low", thrashing);
        cut_thrashing_limit = true;
        /* Do not kill perceptible apps unless thrashing at critical levels */
        if (thrashing < thrashing_critical_pct) {
            min_score_adj = PERCEPTIBLE_APP_ADJ + 1;
        }
        check_filecache = true;
    } else if (reclaim == DIRECT_RECLAIM && thrashing > thrashing_limit) {
        /* Page cache is thrashing while in direct reclaim (mostly happens on lowram devices) */
        kill_reason = DIRECT_RECL_AND_THRASHING;
        snprintf(kill_desc, sizeof(kill_desc), "device is in direct reclaim and thrashing (%"
            PRId64 "%%)", thrashing);
        cut_thrashing_limit = true;
        /* Do not kill perceptible apps unless thrashing at critical levels */
        if (thrashing < thrashing_critical_pct) {
            min_score_adj = PERCEPTIBLE_APP_ADJ + 1;
        }
        check_filecache = true;
    } else if (check_filecache) {
        int64_t file_lru_kb = (vs.field.nr_inactive_file + vs.field.nr_active_file) * page_k;

        if (file_lru_kb < filecache_min_kb) {
            /* File cache is too low after thrashing, keep killing background processes */
            kill_reason = LOW_FILECACHE_AFTER_THRASHING;
            snprintf(kill_desc, sizeof(kill_desc),
                "filecache is low (%" PRId64 "kB < %" PRId64 "kB) after thrashing",
                file_lru_kb, filecache_min_kb);
            min_score_adj = PERCEPTIBLE_APP_ADJ + 1;
        } else {
            /* File cache is big enough, stop checking */
            check_filecache = false;
        }
    }

The kill reasons are roughly:

PRESSURE_AFTER_KILL
NOT_RESPONDING
LOW_SWAP_AND_THRASHING
LOW_MEM_AND_SWAP
LOW_MEM_AND_SWAP_UTIL
LOW_MEM_AND_THRASHING
DIRECT_RECL_AND_THRASHING
LOW_FILECACHE_AFTER_THRASHING

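(1) PRESSURE_AFTER_KILL

Condition: cycle_after_kill && wmark < WMARK_LOW

cycle_after_kill being true means a kill was issued in a previous cycle, and wmark < WMARK_LOW means the min watermark is still breached after that kill. This state typically occurs during memory stress tests.

wmark is the lowest breached watermark returned by get_lowest_watermark in step 5. It is obtained by comparing nr_free_pages - cma_free with the watermarks computed from /proc/zoneinfo.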

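(2) NOT_RESPONDING

Condition: level == VMPRESS_LEVEL_CRITICAL && events != 0

Memory pressure has exceeded the threshold configured for the PSI complete stall, i.e. the full state in which all tasks are blocked on memory. The device is busy reclaiming memory, which may lead to an ANR.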

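(3) LOW_SWAP_AND_THRASHING

Condition: swap_is_low && thrashing > thrashing_limit_pct

swap_is_low means free swap has dropped below the reserve threshold; see step 2.
thrashing is the workingset refault as a percentage of the file-backed page cache; see step 4.
thrashing_limit_pct comes from prop ro.lmk.thrashing_limit; it is 30 on low-RAM devices and 100 otherwise.

However, if the min watermark is not breached (wmark > WMARK_MIN) and thrashing is below thrashing_critical_pct (set by prop ro.lmk.thrashing_limit_critical; if that property is undefined, it defaults to twice thrashing_limit_pct), perceptible apps are spared: min_score_adj is raised to PERCEPTIBLE_APP_ADJ + 1, so only processes at or above that adj are candidates.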

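(4) LOW_MEM_AND_SWAP

Condition: swap_is_low && wmark < WMARK_HIGH

Free swap is below the configured threshold, and free pages are below the low watermark (possibly already below min). As the kill_desc string shows, wmark < WMARK_HIGH means that the low or the min watermark is breached.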

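(5) LOW_MEM_AND_SWAP_UTIL

Condition: wmark < WMARK_HIGH && swap_util_max < 100 && (swap_util = calc_swap_utilization(&mi)) > swap_util_max

Free pages are below the low watermark. swap_util_max is set by prop ro.lmk.swap_util_max and defaults to 100 (percent); it is the maximum allowed swap utilization. With the default of 100 the test swap_util_max < 100 fails, so this branch is effectively disabled.
In addition, the swap utilization computed from meminfo must exceed swap_util_max.

This means that a large amount of anon memory has already been swapped out although swap itself is not low, so the memory pressure is being created by non-swappable allocations.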
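
calc_swap_utilization is not shown in this section. One plausible definition, assuming utilization is the share of swappable memory (anon pages, shmem, and what is already swapped) that currently sits in swap, is sketched below; the name swap_utilization_pct and the flat parameter list are hypothetical.

```c
#include <stdint.h>

/*
 * Assumed definition: swapped-out pages as a percentage of all swappable pages.
 * All arguments are page counts as they would come from /proc/meminfo.
 */
static int swap_utilization_pct(int64_t total_swap, int64_t free_swap, int64_t active_anon,
                                int64_t inactive_anon, int64_t shmem) {
    int64_t swap_used = total_swap - free_swap;
    int64_t total_swappable = active_anon + inactive_anon + shmem + swap_used;

    /* No swappable memory at all: report 0 rather than dividing by zero. */
    return total_swappable > 0 ? (int)(swap_used * 100 / total_swappable) : 0;
}
```

With 600 of 1000 swap pages in use and 600 pages of anon plus shmem still resident, utilization is 600 * 100 / 1200 = 50. If ro.lmk.swap_util_max were set to 40, this branch would trigger once the low watermark is breached.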

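(6) LOW_MEM_AND_THRASHING

Condition: wmark < WMARK_HIGH && thrashing > thrashing_limit

The low watermark is breached (possibly min as well) and the thrashing value exceeds thrashing_limit, so the device is both low on memory and thrashing. This branch also sets cut_thrashing_limit, which lets step 7 decay thrashing_limit after a successful kill.

If thrashing has not reached the value set by ro.lmk.thrashing_limit_critical (by default twice ro.lmk.thrashing_limit), perceptible apps are not killed.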

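(7) DIRECT_RECL_AND_THRASHING

Condition: reclaim == DIRECT_RECLAIM && thrashing > thrashing_limit

When thrashing exceeds the limit while the device is in direct reclaim, apps are killed. According to the code comment, this mostly happens on low-RAM devices.

By default min_score_adj starts at 0. When the situation is not too severe, min_score_adj starts at PERCEPTIBLE_APP_ADJ + 1 instead. The resulting min_score_adj is passed to find_and_kill_process, which looks for a suitable process to kill.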
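
The two variants of the min_score_adj decision used by the branches above can be sketched as follows. The function names are hypothetical, PERCEPTIBLE_APP_ADJ is assumed to be 200, and the enum ordering WMARK_MIN < WMARK_LOW < WMARK_HIGH < WMARK_NONE is assumed as well.

```c
#include <stdint.h>

#define PERCEPTIBLE_APP_ADJ 200 /* assumed value for this sketch */

enum zone_watermark { WMARK_MIN = 0, WMARK_LOW, WMARK_HIGH, WMARK_NONE };

/* LOW_SWAP_AND_THRASHING and LOW_MEM_AND_SWAP: spare perceptible apps unless
 * the min watermark is breached or thrashing is critical. */
static int min_adj_for_swap_kill(enum zone_watermark wmark, int64_t thrashing,
                                 int thrashing_critical_pct) {
    if (wmark > WMARK_MIN && thrashing < thrashing_critical_pct) {
        return PERCEPTIBLE_APP_ADJ + 1;
    }
    return 0; /* default: any killable process is a candidate */
}

/* LOW_MEM_AND_THRASHING and DIRECT_RECL_AND_THRASHING: only thrashing matters. */
static int min_adj_for_thrashing_kill(int64_t thrashing, int thrashing_critical_pct) {
    return thrashing < thrashing_critical_pct ? PERCEPTIBLE_APP_ADJ + 1 : 0;
}
```

For a critical threshold of 200, a thrashing value of 120 with only the low watermark breached yields 201, so perceptible apps are spared; once the min watermark is breached or thrashing reaches 200, the result drops to 0.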

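(8) LOW_FILECACHE_AFTER_THRASHING

Condition: check_filecache is set (by an earlier thrashing kill) and file_lru_kb < filecache_min_kb

where:

file_lru_kb = (vs.field.nr_inactive_file + vs.field.nr_active_file) * page_k;

filecache_min_kb is set by prop ro.lmk.filecache_min_kb and defaults to 0.

In this state thrashing has left the file cache too small, so background processes keep being killed until the file cache is large enough again, at which point check_filecache is cleared.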
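
The check_filecache flag behaves like a latch: a thrashing kill sets it, and it stays set until the file cache has recovered. A minimal sketch of that behaviour, with a hypothetical function name and with page_k assumed to be the page size in kB (4 for 4 KiB pages):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when a LOW_FILECACHE_AFTER_THRASHING kill is warranted.
 * *check_filecache is the latch set by an earlier thrashing kill; it is
 * cleared once the file cache is at or above the configured minimum.
 */
static bool filecache_kill_needed(bool *check_filecache, int64_t inactive_file_pages,
                                  int64_t active_file_pages, int64_t page_k,
                                  int64_t filecache_min_kb) {
    if (!*check_filecache) {
        return false;
    }
    int64_t file_lru_kb = (inactive_file_pages + active_file_pages) * page_k;
    if (file_lru_kb < filecache_min_kb) {
        return true; /* keep killing background processes */
    }
    *check_filecache = false; /* file cache is big enough, stop checking */
    return false;
}
```

With a minimum of 100000 kB, 15000 file pages (60000 kB) keep the latch set and request a kill, while 30000 pages (120000 kB) clear it. With the default filecache_min_kb of 0 the comparison can never be true, so the latch is simply cleared on the first check.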

10.8 step7: Kill a process

    /* Kill a process if necessary */
    if (kill_reason != NONE) {
        struct kill_info ki = {
            .kill_reason = kill_reason,
            .kill_desc = kill_desc,
            .thrashing = (int)thrashing,
            .max_thrashing = max_thrashing,
        };
        int pages_freed = find_and_kill_process(min_score_adj, &ki, &mi, &wi, &curr_tm);
        if (pages_freed > 0) {
            killing = true;
            max_thrashing = 0;
            if (cut_thrashing_limit) {
                /*
                 * Cut thrasing limit by thrashing_limit_decay_pct percentage of the current
                 * thrashing limit until the system stops thrashing.
                 */
                thrashing_limit = (thrashing_limit * (100 - thrashing_limit_decay_pct)) / 100;
            }
        }
    }

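See Section 9 for details.

Note that thrashing_limit may be decayed here, because the thrashing value had already exceeded thrashing_limit. This typically happens when thrashing continues over a short period of time.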
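
The effect of repeated decay is easy to see with numbers. The sketch below isolates the expression from the code above in a hypothetical helper; the starting limit of 100 and the decay percentage of 10 are example values only.

```c
/* One decay step: cut the limit by decay_pct percent, using integer division. */
static int decay_thrashing_limit(int thrashing_limit, int decay_pct) {
    return (thrashing_limit * (100 - decay_pct)) / 100;
}
```

Starting from 100 with a decay of 10, three consecutive thrashing kills lower the limit to 90, 81, and then 72 (81 * 90 / 100 truncates to 72). The next thrashing reset in step 4 restores the limit to thrashing_limit_pct.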
