On the "PFNs busy" warning from cma_alloc()

adtxl
2022-08-16

Kernel version: Android common kernel 4.19.176

1. Background

When allocating memory with cma_alloc(), the kernel prints "PFNs busy" and the allocation ultimately fails.
The error log looks like this:

01-01 09:03:01.490     0     0 I .(1)[2074:input@1.0-servi]alloc_contig_range: [c3400, c51aa) PFNs busy
01-01 09:03:01.498     0     0 D .(1)[2074:input@1.0-servi]cma: cma_alloc(): memory range at 00000000b5c4cb24 is busy, retrying
01-01 09:03:01.502     0     0 I .(1)[2074:input@1.0-servi]alloc_contig_range: [c3400, c52aa) PFNs busy
01-01 09:03:01.511     0     0 D .(1)[2074:input@1.0-servi]cma: cma_alloc(): memory range at 000000002ba86bc3 is busy, retrying
01-01 09:03:01.515     0     0 I .(3)[2074:input@1.0-servi]alloc_contig_range: [c3400, c53aa) PFNs busy
01-01 09:03:01.523     0     0 D .(3)[2074:input@1.0-servi]cma: cma_alloc(): memory range at 00000000b20006ec is busy, retrying
......
01-01 09:03:01.555     0     0 I .(3)[2074:input@1.0-servi]alloc_contig_range: [c3800, c56aa) PFNs busy
01-01 09:03:01.563     0     0 D .(3)[2074:input@1.0-servi]cma: cma_alloc(): memory range at 0000000091767e24 is busy, retrying
01-01 09:03:01.570     0     0 D .(3)[2074:input@1.0-servi]cma: cma_alloc(): memory range at 00000000292a9760 is busy, retrying
01-01 09:03:01.575     0     0 D .(1)[2074:input@1.0-servi]cma: cma_alloc(): memory range at 000000006f53cfb1 is busy, retrying
01-01 09:03:01.579     0     0 D .(1)[2074:input@1.0-servi]cma: cma_alloc(): memory range at 0000000031c9858d is busy, retrying
01-01 09:03:01.583     0     0 D .(1)[2074:input@1.0-servi]cma: cma_alloc(): memory range at 0000000045679819 is busy, retrying
01-01 09:03:01.589     0     0 D .(1)[2074:input@1.0-servi]cma: cma_alloc(): memory range at 00000000852192bc is busy, retrying
01-01 09:03:01.594     0     0 D .(1)[2074:input@1.0-servi]cma: cma_alloc(): memory range at 00000000fd982ebd is busy, retrying
......
01-01 09:03:03.884     0     0 E .(2)[2074:input@1.0-servi]platform media: Fail to allocate buffer

2. Analysis

First, locate where the "PFNs busy" message is printed. There is exactly one call site, in alloc_contig_range().
The call chain is cma_alloc() --> alloc_contig_range(). As the code below shows, when alloc_contig_range() returns -EBUSY, cma_alloc() retries the allocation at a slightly different spot in the CMA area. So a "PFNs busy" message by itself does not mean the cma_alloc() call failed. The log bears this out: in some cases a usable range is found after a few retries, but in the log above no range was found even after retrying, so the allocation ultimately failed.


/**
 * cma_alloc() - allocate pages from contiguous area
 * @cma:   Contiguous memory region for which the allocation is performed.
 * @count: Requested number of pages.
 * @align: Requested alignment of pages (in PAGE_SIZE order).
 * @no_warn: Avoid printing message about failed allocation
 *
 * This function allocates part of contiguous memory on specific
 * contiguous memory area.
 */
struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
               bool no_warn)
{
......
    for (;;) {
        mutex_lock(&cma->lock);
        bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
                bitmap_maxno, start, bitmap_count, mask,
                offset);
        if (bitmap_no >= bitmap_maxno) {
            mutex_unlock(&cma->lock);
            break;
        }
        bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
        /*
         * It's safe to drop the lock here. We've marked this region for
         * our exclusive use. If the migration fails we will take the
         * lock again and unmark it.
         */
        mutex_unlock(&cma->lock);

        pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
        mutex_lock(&cma_mutex);
        ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
                     GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0));
        mutex_unlock(&cma_mutex);
        if (ret == 0) {
            page = pfn_to_page(pfn);
            break;
        }

        cma_clear_bitmap(cma, pfn, count);
        if (ret != -EBUSY)
            break;

        pr_debug("%s(): memory range at %p is busy, retrying\n",
             __func__, pfn_to_page(pfn));
        /* try again with a bit different memory target */
        start = bitmap_no + mask + 1;
    }
    ......
}

int alloc_contig_range(unsigned long start, unsigned long end,
               unsigned migratetype, gfp_t gfp_mask)
{
    unsigned long outer_start, outer_end;
    unsigned int order;
    int ret = 0;
.......
    // Before allocating the requested range, isolate the pages in it
    // so they can be migrated out later.
    ret = start_isolate_page_range(pfn_max_align_down(start),
                       pfn_max_align_up(end), migratetype,
                       false);

......
    /*
     * In case of -EBUSY, we'd like to know which page causes problem.
     * So, just fall through. test_pages_isolated() has a tracepoint
     * which will report the busy page.
     *
     * It is possible that busy pages could become available before
     * the call to test_pages_isolated, and the range will actually be
     * allocated.  So, if we fall through be sure to clear ret so that
     * -EBUSY is not accidentally used or returned to caller.
     */
    ret = __alloc_contig_migrate_range(&cc, start, end);
    if (ret && ret != -EBUSY)
        goto done;
    ret = 0;
......
    /* Make sure the range is really isolated. */
    if (test_pages_isolated(outer_start, end, false)) {
        pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
            __func__, outer_start, end);
        ret = -EBUSY;
        goto done;
    }
......

}

There is little to analyze in test_pages_isolated() itself: "PFNs busy" simply means some pages in the range could not be isolated. The interesting logic is in isolate_migratepages_block(), shown below:

/**
 * isolate_migratepages_block() - isolate all migrate-able pages within
 *                  a single pageblock
 * @cc:        Compaction control structure.
 * @low_pfn:    The first PFN to isolate
 * @end_pfn:    The one-past-the-last PFN to isolate, within same pageblock
 * @isolate_mode: Isolation mode to be used.
 *
 * Isolate all pages that can be migrated from the range specified by
 * [low_pfn, end_pfn). The range is expected to be within same pageblock.
 * Returns zero if there is a fatal signal pending, otherwise PFN of the
 * first page that was not scanned (which may be both less, equal to or more
 * than end_pfn).
 *
 * The pages are isolated on cc->migratepages list (not required to be empty),
 * and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field
 * is neither read nor updated.
 */
static unsigned long
isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
            unsigned long end_pfn, isolate_mode_t isolate_mode)
{
    struct zone *zone = cc->zone;
    unsigned long nr_scanned = 0, nr_isolated = 0;
    struct lruvec *lruvec;
    unsigned long flags = 0;
    bool locked = false;
    struct page *page = NULL, *valid_page = NULL;
    unsigned long start_pfn = low_pfn;
    bool skip_on_failure = false;
    unsigned long next_skip_pfn = 0;

    /*
     * Ensure that there are not too many pages isolated from the LRU
     * list by either parallel reclaimers or compaction. If there are,
     * delay for some time until fewer pages are isolated
     */
     // too_many_isolated(): if too many pages are already detached from the LRU
     // lists (by parallel reclaim or compaction), sleep for ~100 ms and retry.
     // In asynchronous migration mode, give up immediately instead.
    while (unlikely(too_many_isolated(zone))) {
        /* async migration should just abort */
        if (cc->mode == MIGRATE_ASYNC)
            return 0;

        congestion_wait(BLK_RW_ASYNC, HZ/10);

        if (fatal_signal_pending(current))
            return 0;
    }

    if (compact_should_abort(cc))
        return 0;

    if (cc->direct_compaction && (cc->mode == MIGRATE_ASYNC)) {
        skip_on_failure = true;
        next_skip_pfn = block_end_pfn(low_pfn, cc->order);
    }

    /* Time to isolate some pages for migration */
    for (; low_pfn < end_pfn; low_pfn++) {

        if (skip_on_failure && low_pfn >= next_skip_pfn) {
            /*
             * We have isolated all migration candidates in the
             * previous order-aligned block, and did not skip it due
             * to failure. We should migrate the pages now and
             * hopefully succeed compaction.
             */
            if (nr_isolated)
                break;

            /*
             * We failed to isolate in the previous order-aligned
             * block. Set the new boundary to the end of the
             * current block. Note we can't simply increase
             * next_skip_pfn by 1 << order, as low_pfn might have
             * been incremented by a higher number due to skipping
             * a compound or a high-order buddy page in the
             * previous loop iteration.
             */
            next_skip_pfn = block_end_pfn(low_pfn, cc->order);
        }

        /*
         * Periodically drop the lock (if held) regardless of its
         * contention, to give chance to IRQs. Abort async compaction
         * if contended.
         */
        if (!(low_pfn % SWAP_CLUSTER_MAX)
            && compact_unlock_should_abort(zone_lru_lock(zone), flags,
                                &locked, cc))
            break;

        if (!pfn_valid_within(low_pfn))
            goto isolate_fail;
        nr_scanned++;

        page = pfn_to_page(low_pfn);

        if (!valid_page)
            valid_page = page;

        /*
         * Skip if free. We read page order here without zone lock
         * which is generally unsafe, but the race window is small and
         * the worst thing that can happen is that we skip some
         * potential isolation targets.
         */
         // A page still in the buddy system is free and needs no migration; skip the whole free block.
        if (PageBuddy(page)) {
            unsigned long freepage_order = page_order_unsafe(page);

            /*
             * Without lock, we cannot be sure that what we got is
             * a valid page order. Consider only values in the
             * valid order range to prevent low_pfn overflow.
             */
            if (freepage_order > 0 && freepage_order < MAX_ORDER)
                low_pfn += (1UL << freepage_order) - 1;
            continue;
        }

        /*
         * Regardless of being on LRU, compound pages such as THP and
         * hugetlbfs are not to be compacted. We can potentially save
         * a lot of iterations if we skip them at once. The check is
         * racy, but we can consider only valid values and the only
         * danger is skipping too much.
         */
         // Compound pages are not candidates for migration here.
        if (PageCompound(page)) {
            const unsigned int order = compound_order(page);

            if (likely(order < MAX_ORDER))
                low_pfn += (1UL << order) - 1;
            goto isolate_fail;
        }

        /*
         * Check may be lockless but that's ok as we recheck later.
         * It's possible to migrate LRU and non-lru movable pages.
         * Skip any other type of page
         */
         // Pages on an LRU list, and non-LRU movable pages (e.g. balloon pages),
         // can be migrated; any other page type is skipped.
        if (!PageLRU(page)) {
            /*
             * __PageMovable can return false positive so we need
             * to verify it under page_lock.
             */
            if (unlikely(__PageMovable(page)) &&
                    !PageIsolated(page)) {
                if (locked) {
                    spin_unlock_irqrestore(zone_lru_lock(zone),
                                    flags);
                    locked = false;
                }

                if (!isolate_movable_page(page, isolate_mode))
                    goto isolate_success;
            }

            goto isolate_fail;
        }

        /*
         * Migration will fail if an anonymous page is pinned in memory,
         * so avoid taking lru_lock and isolating it unnecessarily in an
         * admittedly racy check.
         */
         // For an anonymous page, normally page_count(page) == page_mapcount(page),
         // i.e. page->_refcount == page->_mapcount + 1. If page_count is larger,
         // someone in the kernel holds an extra (pin) reference on the page,
         // so it cannot be migrated.
        if (!page_mapping(page) &&
            page_count(page) > page_mapcount(page))
            goto isolate_fail;

        /*
         * Only allow to migrate anonymous pages in GFP_NOFS context
         * because those do not depend on fs locks.
         */
        if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
            goto isolate_fail;

        /* If we already hold the lock, we can skip some rechecking */
        // Take zone->lru_lock and recheck under the lock that the page is still on an LRU list.
        if (!locked) {
            locked = compact_trylock_irqsave(zone_lru_lock(zone),
                                &flags, cc);
            if (!locked)
                break;

            /* Recheck PageLRU and PageCompound under lock */
            if (!PageLRU(page))
                goto isolate_fail;

            /*
             * Page become compound since the non-locked check,
             * and it's on LRU. It can only be a THP so the order
             * is safe to read and it's 0 for tail pages.
             */
            if (unlikely(PageCompound(page))) {
                low_pfn += (1UL << compound_order(page)) - 1;
                goto isolate_fail;
            }
        }

        lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);

        /* Try isolate the page */
        if (__isolate_lru_page(page, isolate_mode) != 0)
            goto isolate_fail;

        VM_BUG_ON_PAGE(PageCompound(page), page);

        /* Successfully isolated */
        del_page_from_lru_list(page, lruvec, page_lru(page));
        inc_node_page_state(page,
                NR_ISOLATED_ANON + page_is_file_cache(page));

isolate_success:
        list_add(&page->lru, &cc->migratepages);
        cc->nr_migratepages++;
        nr_isolated++;

        /*
         * Record where we could have freed pages by migration and not
         * yet flushed them to buddy allocator.
         * - this is the lowest page that was isolated and likely be
         * then freed by migration.
         */
        if (!cc->last_migrated_pfn)
            cc->last_migrated_pfn = low_pfn;

        /* Avoid isolating too much */
        if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) {
            ++low_pfn;
            break;
        }

        continue;
isolate_fail:
        if (!skip_on_failure)
            continue;

        /*
         * We have isolated some pages, but then failed. Release them
         * instead of migrating, as we cannot form the cc->order buddy
         * page anyway.
         */
        if (nr_isolated) {
            if (locked) {
                spin_unlock_irqrestore(zone_lru_lock(zone), flags);
                locked = false;
            }
            putback_movable_pages(&cc->migratepages);
            cc->nr_migratepages = 0;
            cc->last_migrated_pfn = 0;
            nr_isolated = 0;
        }

        if (low_pfn < next_skip_pfn) {
            low_pfn = next_skip_pfn - 1;
            /*
             * The check near the loop beginning would have updated
             * next_skip_pfn too, but this is a bit simpler.
             */
            next_skip_pfn += 1UL << cc->order;
        }
    }

    /*
     * The PageBuddy() check could have potentially brought us outside
     * the range to be scanned.
     */
    if (unlikely(low_pfn > end_pfn))
        low_pfn = end_pfn;

    if (locked)
        spin_unlock_irqrestore(zone_lru_lock(zone), flags);

    /*
     * Update the pageblock-skip information and cached scanner pfn,
     * if the whole pageblock was scanned without isolating any page.
     */
    if (low_pfn == end_pfn)
        update_pageblock_skip(cc, valid_page, nr_isolated, true);

    trace_mm_compaction_isolate_migratepages(start_pfn, low_pfn,
                        nr_scanned, nr_isolated);

    cc->total_migrate_scanned += nr_scanned;
    if (nr_isolated)
        count_compact_events(COMPACTISOLATED, nr_isolated);

    return low_pfn;
}

3. Debug

The allocation most likely fails because some pages in the range are pinned, but I have not worked out why, or by whom, they are pinned.
Below is a quick look with ftrace; it did not solve the problem, but it does deepen one's understanding of CMA.
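The trace output below comes from the test_pages_isolated tracepoint. A sketch of how to enable it; the path assumes a debugfs-mounted tracefs and root privileges (on newer kernels tracefs lives at /sys/kernel/tracing).

```shell
# Enable the test_pages_isolated tracepoint, reproduce the failing
# allocation, then dump the failed-isolation events.
T=/sys/kernel/debug/tracing
if [ -w "$T/trace" ]; then
    echo 1 > "$T/events/page_isolation/test_pages_isolated/enable"
    echo 1 > "$T/tracing_on"
    # ... trigger the failing cma_alloc() here ...
    grep fail "$T/trace" || echo "no failed isolations recorded"
else
    echo "tracefs not writable; mount debugfs and run as root"
fi
```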

For example, the allocator process below failed on the range shown, so I added some logging in isolate_migratepages_block() to see why the isolation failed. Unfortunately, the log that captured the actual failure has since been lost; only the traces below remain.

#:/sys/kernel/debug/tracing # cat trace | grep fail
 allocator@2.0-s-2100  [002] ....  2086.979016: test_pages_isolated: start_pfn=0xc3dc0 end_pfn=0xc3de4 fin_pfn=0xc3dc2 ret=fail
 input@1.0-servi-2071  [001] ....   573.011902: test_pages_isolated: start_pfn=0xc2400 end_pfn=0xc41aa fin_pfn=0xc41aa ret=success
 input@1.0-servi-2071  [002] ....   573.170969: test_pages_isolated: start_pfn=0xc4200 end_pfn=0xc5faa fin_pfn=0xc5faa ret=success
 input@1.0-servi-2071  [002] ....   573.448709: test_pages_isolated: start_pfn=0xc6000 end_pfn=0xc7daa fin_pfn=0xc7daa ret=success
 input@1.0-servi-2071  [001] ....   573.736047: test_pages_isolated: start_pfn=0xc7e00 end_pfn=0xc9baa fin_pfn=0xc9baa ret=success
 input@1.0-servi-2071  [000] ....   726.651208: test_pages_isolated: start_pfn=0xc2400 end_pfn=0xc41aa fin_pfn=0xc41aa ret=success
 input@1.0-servi-2071  [000] ....   726.669588: test_pages_isolated: start_pfn=0xc4200 end_pfn=0xc5faa fin_pfn=0xc5fac ret=success
 input@1.0-servi-2071  [000] ....   726.688339: test_pages_isolated: start_pfn=0xc6000 end_pfn=0xc7daa fin_pfn=0xc7dac ret=success
 input@1.0-servi-2071  [000] ....   726.709690: test_pages_isolated: start_pfn=0xc7e00 end_pfn=0xc9baa fin_pfn=0xc9baa ret=success
 input@1.0-servi-2071  [003] ....   737.278719: test_pages_isolated: start_pfn=0xc2400 end_pfn=0xc41aa fin_pfn=0xc41aa ret=success
 input@1.0-servi-2071  [003] ....   737.296976: test_pages_isolated: start_pfn=0xc4200 end_pfn=0xc5faa fin_pfn=0xc5fac ret=success
 input@1.0-servi-2071  [003] ....   737.314510: test_pages_isolated: start_pfn=0xc6000 end_pfn=0xc7daa fin_pfn=0xc7dac ret=success
 input@1.0-servi-2071  [003] ....   737.334471: test_pages_isolated: start_pfn=0xc7e00 end_pfn=0xc9baa fin_pfn=0xc9baa ret=success

4. Summary

The only fix I can think of for this kind of problem is to enlarge the reserved CMA region. If you know a better fix, or a better way to debug it, please let me know.
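For reference, two common ways to enlarge the reserved CMA area are the `cma=` kernel command-line parameter (e.g. `cma=256M`) and a devicetree reserved-memory node. The fragment below is only an example: the node layout follows the shared-dma-pool binding, but the 256 MiB size and address-cell values are placeholders to adapt to your platform.

```
/* Example only: reserve a larger default CMA pool via devicetree.
 * The 0x10000000 (256 MiB) size is a placeholder. */
reserved-memory {
    #address-cells = <2>;
    #size-cells = <2>;
    ranges;

    linux,cma {
        compatible = "shared-dma-pool";
        reusable;
        size = <0x0 0x10000000>;
        linux,cma-default;
    };
};
```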

Comments (3)

1. makebobo:
   Any new ideas on this problem?

   adtxl (author), replying to makebobo:
   No, I have not looked at it for a long time. You could check the recent CMA patches on the latest kernel branches; there seem to be some fixes for CMA utilization.

2. sparkle:
   I hit the same problem when allocating huge pages: the allocation took half an hour, extremely slow, and dmesg likewise showed "PFNs busy".