关于watermark的计算

作者 by adtxl / 2022-07-20 / 暂无评论 / 272 个足迹

关于watermark的概念请参考zoned page frame allocator--3(对watermark的理解)

本文主要根据一个实例来加深理解

1. zoneinfo的信息

通过/proc/zoneinfo节点可以查看zone的相关信息,下面只截取了一部分,该开发平台只有1个node,也只有一直zone类型,即zone DMA32;

Node 0, zone    DMA32
......
  pages free     56388
        min      1139
        low      7295
        high     7621
        spanned  521952
        present  345696
        managed  326626
        protection: (0, 0, 0)
      nr_free_pages 56388

......

这里的min、low、high的值是怎么计算出来的的呢?主要是依据min_free_kbytes的值,下面先分析下这个值是怎么计算出来的。

2. min_free_kbytes值的计算

根据文档Documentation/sysctl/vm.txt,

==============================================================

min_free_kbytes:

This is used to force the Linux VM to keep a minimum number
of kilobytes free.  The VM uses this number to compute a
watermark[WMARK_MIN] value for each lowmem zone in the system.
Each lowmem zone gets a number of reserved free pages based
proportionally on its size.

Some minimal amount of memory is needed to satisfy PF_MEMALLOC
allocations; if you set this to lower than 1024KB, your system will
become subtly broken, and prone to deadlock under high loads.

Setting this too high will OOM your machine instantly.

=============================================================

从上面的话得出了以下几点:

  • min_free_kbyes代表的是系统保留空闲内存的最低限
  • watermark[WMARK_MIN]的值是通过min_free_kbytes计算出来的

系统中min_free_kbytes的值可以通过下面的节点查看。可以看到,这个4556/4=1139,就是上面min的值(因为我们这里只有一个zone DMA32)。

console:/ # cat /proc/sys/vm/min_free_kbytes
4556

下面的函数对每个zone做计算,将每个zone中超过high水位的值放到sum中。超过高水位的页数计算方法是:managed_pages减去watermark[WMARK_HIGH], 这样就可以获取到系统中各个zone超过高水位页的总和

其中,high_wmark_pages(zone)获取的是watermark[WMARK_HIGH]的值,这里有个让我疑惑的地方是watermark[WMARK_HIGH]的值是根据watermark[WMARK_MIN]确定的,我们这里是为了计算watermark[WMARK_MIN],怎么会需要watermark[WMARK_HIGH]呢,在这里加个打印,发现在我的平台一开始获取的high值就是0,所以这里返回的就是managed_pages值,即326626

/**
 * nr_free_zone_pages - count number of pages beyond high watermark
 * @offset: The zone index of the highest zone
 *
 * nr_free_zone_pages() counts the number of counts pages which are beyond the
 * high watermark within all zones at or below a given zone index.  For each
 * zone, the number of pages is calculated as:
 *
 *     nr_free_zone_pages = managed_pages - high_pages
 */
static unsigned long nr_free_zone_pages(int offset)
{
    struct zoneref *z;
    struct zone *zone;

    /* Just pick one node, since fallback list is circular */
    unsigned long sum = 0;

    struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);

    for_each_zone_zonelist(zone, z, zonelist, offset) {
        unsigned long size = zone->managed_pages;
        unsigned long high = high_wmark_pages(zone);
        if (size > high)
            sum += size - high;
    }

    return sum;
}

/**
 * nr_free_buffer_pages - count number of pages beyond high watermark
 *
 * nr_free_buffer_pages() counts the number of pages which are beyond the high
 * watermark within ZONE_DMA and ZONE_NORMAL.
 */
unsigned long nr_free_buffer_pages(void)
{
    return nr_free_zone_pages(gfp_zone(GFP_USER));
}

watermark的初始化是由下面的函数来完成的,将上面获取的值,即326626代入,可以算得在我们得平台上,min_free_kbytes得值为4572.与上面得4556大小基本相等。

/*
 * Initialise min_free_kbytes.
 *
 * For small machines we want it small (128k min).  For large machines
 * we want it large (64MB max).  But it is not linear, because network
 * bandwidth does not increase linearly with machine size.  We use
 *
 *  min_free_kbytes = 4 * sqrt(lowmem_kbytes), for better accuracy:
 *  min_free_kbytes = sqrt(lowmem_kbytes * 16)
 *
 * which yields
 *
 * 16MB:    512k
 * 32MB:    724k
 * 64MB:    1024k
 * 128MB:   1448k
 * 256MB:   2048k
 * 512MB:   2896k
 * 1024MB:  4096k
 * 2048MB:  5792k
 * 4096MB:  8192k
 * 8192MB:  11584k
 * 16384MB: 16384k
 */
int __meminit init_per_zone_wmark_min(void)
{
    unsigned long lowmem_kbytes;
    int new_min_free_kbytes;

    lowmem_kbytes = nr_free_buffer_pages() * (PAGE_SIZE >> 10);
    new_min_free_kbytes = int_sqrt(lowmem_kbytes * 16);

    if (new_min_free_kbytes > user_min_free_kbytes) {
        min_free_kbytes = new_min_free_kbytes;
        if (min_free_kbytes < 128)
            min_free_kbytes = 128;
        if (min_free_kbytes > 65536)
            min_free_kbytes = 65536;
    } else {
        pr_warn("min_free_kbytes is not updated to %d because user defined value %d is preferred\n",
                new_min_free_kbytes, user_min_free_kbytes);
    }
    // 建立各个zone得水位值
    setup_per_zone_wmarks();
    refresh_zone_stat_thresholds();
    setup_per_zone_lowmem_reserve();

#ifdef CONFIG_NUMA
    setup_min_unmapped_ratio();
    setup_min_slab_ratio();
#endif

    khugepaged_min_free_kbytes_update();

    return 0;
}

3. 建立各个zone得水位值

具体每个zone得水位值是由下面得函数确认的,

static void __setup_per_zone_wmarks(void)
{
    // pages_min = 4556/4=1139
    unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
    // extra_free_kbytes机制应该是Android得patch,看了下5.18得kernel,还不包含这个patch,本平台这个值大小为24300
    // page_low=24300/4=6075
    unsigned long pages_low = extra_free_kbytes >> (PAGE_SHIFT - 10);
    unsigned long lowmem_pages = 0;
    struct zone *zone;
    unsigned long flags;

    // 计算除了ZONE_HIGHMEM之外所有的managed_pages之和,即lowmem_pages=326626
    /* Calculate total number of !ZONE_HIGHMEM pages */
    for_each_zone(zone) {
        if (!is_highmem(zone))
            lowmem_pages += zone->managed_pages;
    }

    for_each_zone(zone) {
        u64 min, low;

        spin_lock_irqsave(&zone->lock, flags);
        // min = 1139*326626=372027014
        min = (u64)pages_min * zone->managed_pages;
        // 取商,min = 1139,
        do_div(min, lowmem_pages);
        low = (u64)pages_low * zone->managed_pages;
        // vm_total_pages的值就是nr_free_pagecache_pages()的返回值,打log发现这个值是340294
        // 也不是managed_pages值
        // 所以下面计算后low的值在我们的平台还是pages_low=6075*326626/340294=5830
        do_div(low, vm_total_pages);

        if (is_highmem(zone)) {
            /*
             * __GFP_HIGH and PF_MEMALLOC allocations usually don't
             * need highmem pages, so cap pages_min to a small
             * value here.
             *
             * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_MIN)
             * deltas control asynch page reclaim, and so should
             * not be capped for highmem.
             */
            unsigned long min_pages;

            min_pages = zone->managed_pages / 1024;
            min_pages = clamp(min_pages, SWAP_CLUSTER_MAX, 128UL);
            zone->watermark[WMARK_MIN] = min_pages;
        } else {
            /*
             * If it's a lowmem zone, reserve a number of pages
             * proportionate to the zone's size.
             */
             // 在我们的平台,即走到这里,即求得了zone->watermark[WMARK_MIN]=1139
            zone->watermark[WMARK_MIN] = min;
        }

        /*
         * Set the kswapd watermarks distance according to the
         * scale factor in proportion to available memory, but
         * ensure a minimum size on small systems.
         */
         min = max_t(284,326),即min=326.
        min = max_t(u64, min >> 2,
                mult_frac(zone->managed_pages,
                      watermark_scale_factor, 10000));
// Android打的patch,low的值在这里起了作用
        // zone->watermark[WMARK_LOW]=1139+5830+326=7286
        zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) +
                    low + min;
        // watermark[WMARK_HIGH]即比zone->watermark[WMARK_LOW]多了个min的值
        zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) +
                    low + min * 2;

        spin_unlock_irqrestore(&zone->lock, flags);
    }

    /* update totalreserve_pages */
    calculate_totalreserve_pages();
}

/**
 * setup_per_zone_wmarks - called when min_free_kbytes changes
 * or when memory is hot-{added|removed}
 *
 * Ensures that the watermark[min,low,high] values for each zone are set
 * correctly with respect to min_free_kbytes.
 */
void setup_per_zone_wmarks(void)
{
    static DEFINE_SPINLOCK(lock);

    spin_lock(&lock);
    __setup_per_zone_wmarks();
    spin_unlock(&lock);
}

本平台得extra_free_kbytes值为24300

console:/ # cat /proc/sys/vm/extra_free_kbytes
24300

本平台watermark_scale_factor的值为10

console:/ # cat /proc/sys/vm/watermark_scale_factor
10

Android为什么要增加extra_free_kbytes的值呢,主要目的是为了拉开watermark[WMARK_LOW]和watermark[WMARK_MIN]的距离。
下面是我从本地的ubuntu开发机截取的zoneinfo信息,可以看到,min和low的差值并不大,其中low是min的1.25倍,high是min的1.5倍。

Node 0, zone      DMA
......
  pages free     3977
        min      16
        low      20
        high     24
        spanned  4095
        present  3998
        managed  3977
        protection: (0, 3116, 15719, 15719, 15719)
......
Node 0, zone    DMA32
  pages free     417377
        min      3346
        low      4182
        high     5018
        spanned  1044480
        present  825004
        managed  808543
        protection: (0, 0, 12602, 12602, 12602)
......

image.png

根据上图可知

  • 我们分配页的第一次尝试是从LOW水位开始分配的,当所剩余的空闲页小于LOW水位的时候,就会唤醒kswapd内核线程进行内存回收

  • 如果回收内存效果很显著,当空闲页大于HIGH水位的时候,则会停止kswapd内核线程回收

  • 如果回收内存效果不明显,当空闲内存直接小于MIN水位的时候,则会进行直接的内存回收(Direct reclaim),这样空闲内存就会逐渐增大

  • 当回收效果依然不明显的时候,则会启动OOM杀死进程

  • kswapd周期回收机制
    kswapd是linux中用于页面回收的内核线程。当空闲内存的值低于low时,内存面临着一定的压力,在这次分配结束后kswapd就会被唤醒,这时内核线程kswapd开始进行页面回收,当kswapd回收页面发现此时内存终于到达了high水位,那么系统认为内存已经不再紧张了,所以将会停止进一步的操作。要注意,在这种情况下,内存分配虽然会触发内存回收,但不存在被内存回收所阻塞的问题,两者的执行关系是异步的。这里所说的“空闲内存”其实是一个zone总的空闲内存减去其lowmem_reserve的值。对于kswapd来说,要回收多少内存才算完成任务呢?只要把空闲内存的大小恢复到high对应的watermark值就可以了,当然,这取决于当前空闲内存和high值之间的差距,差距越大,需要回收的内存也就越多。low可以被认为是一个警戒水位线,而high则是一个安全的水位线。

  • 内存紧缺直接回收机制
    如果内存达到或者低于min时,这时说明现在内存严重不足,会触发内核直接回收操作(direct reclaim),这是一种默认的操作,此时分配器将同步等待内存回收完成,再进行内存分配。还有一种特殊情况,如果内存分配的请求是带了PF_MEMALLOC标志位的,并且现在空闲内存的大小可以满足本次内存分配的需求,那么也将是先分配,再回收。

使用PF_MEMALLOC(PF表示per-process flag)相当于是忽略了watermark,因此它对应的内存分配的标志是ALLOC_NO_WATERMARk。能够获取"min"值以下的内存,也意味着该process有动用几乎所有内存的权力,因此它也对应GFP的标志__GFP_MEMALLOC

if (gfp_mask & __GFP_MEMALLOC)
    return ALLOC_NO_WATERMARKS;

if (alloc_flags & ALLOC_NO_WATERMARKS)
        set_page_pfmemalloc(page);

可在内存严重短缺的时候,拥有不等待回收而强行分配内存的权力的进程其实可以想到kswapd,因为kswapd本身就是负责回收内存的,它只需要占用一小部分内存支撑其正常运行,就可以去回收更多的内存。

虽然kswapd是在low到min的这段区间被唤醒加入调度队列的,但当它真正执行的时候,空闲内存的值可能已经掉到min以下了。可见,min值存在得一个意义是保证像kswapd这样得特殊任务能够在需要得时候立刻获取所需得内存。

比如下面ubunut的水位,其中min=16, low=20, high=24

image.png

比如当前空闲内存是在LOW水位以及MIN以上,这时候后台会启动kswapd内核线程进行进程内存回收,假设这时候突然有个很大的进程需要很大的内存请求,这样一来kswapd回收速度赶不上分配速度,内存一下掉到了MIN水位,这样直接就触发了直接内存回收,直接回收很影响系统性能的。这样看来linux原生的代码涉及MIN-LOW之间的间隙太小,很容易导致进入直接回收的情况的。所以在android的版本上增加了一个变量extra_free_kbytes.

独特见解