Kernel version: Android common kernel 4.19.176
1. Problem Background
When allocating memory with cma_alloc(), the kernel reports "PFNs busy" and the allocation does not fully succeed.
The error log looks like this:
01-01 09:03:01.490 0 0 I .(1)[2074:input@1.0-servi]alloc_contig_range: [c3400, c51aa) PFNs busy
01-01 09:03:01.498 0 0 D .(1)[2074:input@1.0-servi]cma: cma_alloc(): memory range at 00000000b5c4cb24 is busy, retrying
01-01 09:03:01.502 0 0 I .(1)[2074:input@1.0-servi]alloc_contig_range: [c3400, c52aa) PFNs busy
01-01 09:03:01.511 0 0 D .(1)[2074:input@1.0-servi]cma: cma_alloc(): memory range at 000000002ba86bc3 is busy, retrying
01-01 09:03:01.515 0 0 I .(3)[2074:input@1.0-servi]alloc_contig_range: [c3400, c53aa) PFNs busy
01-01 09:03:01.523 0 0 D .(3)[2074:input@1.0-servi]cma: cma_alloc(): memory range at 00000000b20006ec is busy, retrying
......
01-01 09:03:01.555 0 0 I .(3)[2074:input@1.0-servi]alloc_contig_range: [c3800, c56aa) PFNs busy
01-01 09:03:01.563 0 0 D .(3)[2074:input@1.0-servi]cma: cma_alloc(): memory range at 0000000091767e24 is busy, retrying
01-01 09:03:01.570 0 0 D .(3)[2074:input@1.0-servi]cma: cma_alloc(): memory range at 00000000292a9760 is busy, retrying
01-01 09:03:01.575 0 0 D .(1)[2074:input@1.0-servi]cma: cma_alloc(): memory range at 000000006f53cfb1 is busy, retrying
01-01 09:03:01.579 0 0 D .(1)[2074:input@1.0-servi]cma: cma_alloc(): memory range at 0000000031c9858d is busy, retrying
01-01 09:03:01.583 0 0 D .(1)[2074:input@1.0-servi]cma: cma_alloc(): memory range at 0000000045679819 is busy, retrying
01-01 09:03:01.589 0 0 D .(1)[2074:input@1.0-servi]cma: cma_alloc(): memory range at 00000000852192bc is busy, retrying
01-01 09:03:01.594 0 0 D .(1)[2074:input@1.0-servi]cma: cma_alloc(): memory range at 00000000fd982ebd is busy, retrying
......
01-01 09:03:03.884 0 0 E .(2)[2074:input@1.0-servi]platform media: Fail to allocate buffer
2. Problem Analysis
First, find where the "PFNs busy" message is printed. It appears in exactly one place, in alloc_contig_range().
The call chain is cma_alloc() --> alloc_contig_range(). As the code below shows, when alloc_contig_range() returns -EBUSY, cma_alloc() retries the allocation in the next candidate region, so a "PFNs busy" message does not by itself mean that cma_alloc() failed. The log also shows that, in some cases, the repeated retrying does eventually find an allocatable region; in the log above, however, no usable region was found even after retrying, so the allocation finally failed with an error.
/**
 * cma_alloc() - allocate pages from contiguous area
 * @cma:     Contiguous memory region for which the allocation is performed.
 * @count:   Requested number of pages.
 * @align:   Requested alignment of pages (in PAGE_SIZE order).
 * @no_warn: Avoid printing message about failed allocation
 *
 * This function allocates part of contiguous memory on specific
 * contiguous memory area.
 */
struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
		       bool no_warn)
{
	......
	for (;;) {
		mutex_lock(&cma->lock);
		bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
				bitmap_maxno, start, bitmap_count, mask,
				offset);
		if (bitmap_no >= bitmap_maxno) {
			mutex_unlock(&cma->lock);
			break;
		}
		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
		/*
		 * It's safe to drop the lock here. We've marked this region for
		 * our exclusive use. If the migration fails we will take the
		 * lock again and unmark it.
		 */
		mutex_unlock(&cma->lock);

		pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
		mutex_lock(&cma_mutex);
		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
				GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0));
		mutex_unlock(&cma_mutex);
		if (ret == 0) {
			page = pfn_to_page(pfn);
			break;
		}

		cma_clear_bitmap(cma, pfn, count);
		if (ret != -EBUSY)
			break;

		pr_debug("%s(): memory range at %p is busy, retrying\n",
			 __func__, pfn_to_page(pfn));
		/* try again with a bit different memory target */
		start = bitmap_no + mask + 1;
	}
	......
}
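For context, a minimal sketch of how a driver typically reaches cma_alloc() on 4.19: through the DMA contiguous helper. The device pointer, order and fallback policy below are illustrative assumptions, not taken from the failing driver in the log.

	/*
	 * Illustrative caller: allocate 2^order physically contiguous pages
	 * from the device's CMA area; fall back to the buddy allocator if
	 * CMA cannot satisfy the request.
	 */
	struct page *page;

	page = dma_alloc_from_contiguous(dev, 1UL << order, order,
					 false /* no_warn */);
	if (!page)
		page = alloc_pages(GFP_KERNEL, order);	/* hypothetical fallback */

On 4.19, dma_alloc_from_contiguous() is a thin wrapper that clamps align to CONFIG_CMA_ALIGNMENT and then calls cma_alloc(dev_get_cma_area(dev), count, align, no_warn).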
int alloc_contig_range(unsigned long start, unsigned long end,
		       unsigned migratetype, gfp_t gfp_mask)
{
	unsigned long outer_start, outer_end;
	unsigned int order;
	int ret = 0;
	......
	// Before allocating from the requested range, the range must first be
	// isolated, so that the pages in it can be migrated away below.
	ret = start_isolate_page_range(pfn_max_align_down(start),
				       pfn_max_align_up(end), migratetype,
				       false);
	......
	/*
	 * In case of -EBUSY, we'd like to know which page causes problem.
	 * So, just fall through. test_pages_isolated() has a tracepoint
	 * which will report the busy page.
	 *
	 * It is possible that busy pages could become available before
	 * the call to test_pages_isolated, and the range will actually be
	 * allocated. So, if we fall through be sure to clear ret so that
	 * -EBUSY is not accidentally used or returned to caller.
	 */
	ret = __alloc_contig_migrate_range(&cc, start, end);
	if (ret && ret != -EBUSY)
		goto done;
	ret = 0;
	......
	/* Make sure the range is really isolated. */
	if (test_pages_isolated(outer_start, end, false)) {
		pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
				    __func__, outer_start, end);
		ret = -EBUSY;
		goto done;
	}
	......
}
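One detail worth noting before moving on: start_isolate_page_range() is applied to a range widened by pfn_max_align_down()/pfn_max_align_up(), i.e. aligned to the larger of the pageblock size and the MAX_ORDER block size. The 4.19 helper reads:

static unsigned long pfn_max_align_down(unsigned long pfn)
{
	return pfn & ~(max_t(unsigned long, MAX_ORDER_NR_PAGES,
			     pageblock_nr_pages) - 1);
}

On a typical arm64 4 KB-page configuration (MAX_ORDER = 11), that alignment is 1024 pages (0x400), which is consistent with the busy ranges in the log starting at 0x400-aligned PFNs such as 0xc3400 and 0xc3800 rather than at the exact PFN that cma_alloc() requested.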
There is not much to analyze in test_pages_isolated() itself: since "PFNs busy" is reported, some pages in the range must have failed to be isolated. It merely rechecks the range under zone->lock and fires a tracepoint naming the first busy PFN (excerpted below; the Debug section later relies on exactly this tracepoint). The main isolation logic is in isolate_migratepages_block(), shown after the excerpt.
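For reference, the tail of test_pages_isolated() in mm/page_isolation.c (4.19, elided to the part that matters here):

int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
			bool skip_hwpoisoned_pages)
{
	unsigned long pfn, flags;
	struct page *page;
	struct zone *zone;
	......
	/* Check all pages are free or marked as ISOLATED */
	zone = page_zone(page);
	spin_lock_irqsave(&zone->lock, flags);
	pfn = __test_page_isolated_in_pageblock(start_pfn, end_pfn,
						skip_hwpoisoned_pages);
	spin_unlock_irqrestore(&zone->lock, flags);

	trace_test_pages_isolated(start_pfn, end_pfn, pfn);

	return pfn < end_pfn ? -EBUSY : 0;
}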
/**
 * isolate_migratepages_block() - isolate all migrate-able pages within
 *				  a single pageblock
 * @cc:		Compaction control structure.
 * @low_pfn:	The first PFN to isolate
 * @end_pfn:	The one-past-the-last PFN to isolate, within same pageblock
 * @isolate_mode: Isolation mode to be used.
 *
 * Isolate all pages that can be migrated from the range specified by
 * [low_pfn, end_pfn). The range is expected to be within same pageblock.
 * Returns zero if there is a fatal signal pending, otherwise PFN of the
 * first page that was not scanned (which may be both less, equal to or more
 * than end_pfn).
 *
 * The pages are isolated on cc->migratepages list (not required to be empty),
 * and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field
 * is neither read nor updated.
 */
static unsigned long
isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
			   unsigned long end_pfn, isolate_mode_t isolate_mode)
{
	struct zone *zone = cc->zone;
	unsigned long nr_scanned = 0, nr_isolated = 0;
	struct lruvec *lruvec;
	unsigned long flags = 0;
	bool locked = false;
	struct page *page = NULL, *valid_page = NULL;
	unsigned long start_pfn = low_pfn;
	bool skip_on_failure = false;
	unsigned long next_skip_pfn = 0;

	/*
	 * Ensure that there are not too many pages isolated from the LRU
	 * list by either parallel reclaimers or compaction. If there are,
	 * delay for some time until fewer pages are isolated
	 */
	// If too_many_isolated() finds that too many pages are currently
	// detached from the LRU lists, it is best to sleep and wait 100 ms.
	// If the migration mode is asynchronous, bail out right away instead.
	while (unlikely(too_many_isolated(zone))) {
		/* async migration should just abort */
		if (cc->mode == MIGRATE_ASYNC)
			return 0;

		congestion_wait(BLK_RW_ASYNC, HZ/10);

		if (fatal_signal_pending(current))
			return 0;
	}

	if (compact_should_abort(cc))
		return 0;

	if (cc->direct_compaction && (cc->mode == MIGRATE_ASYNC)) {
		skip_on_failure = true;
		next_skip_pfn = block_end_pfn(low_pfn, cc->order);
	}
	/* Time to isolate some pages for migration */
	for (; low_pfn < end_pfn; low_pfn++) {

		if (skip_on_failure && low_pfn >= next_skip_pfn) {
			/*
			 * We have isolated all migration candidates in the
			 * previous order-aligned block, and did not skip it due
			 * to failure. We should migrate the pages now and
			 * hopefully succeed compaction.
			 */
			if (nr_isolated)
				break;

			/*
			 * We failed to isolate in the previous order-aligned
			 * block. Set the new boundary to the end of the
			 * current block. Note we can't simply increase
			 * next_skip_pfn by 1 << order, as low_pfn might have
			 * been incremented by a higher number due to skipping
			 * a compound or a high-order buddy page in the
			 * previous loop iteration.
			 */
			next_skip_pfn = block_end_pfn(low_pfn, cc->order);
		}

		/*
		 * Periodically drop the lock (if held) regardless of its
		 * contention, to give chance to IRQs. Abort async compaction
		 * if contended.
		 */
		if (!(low_pfn % SWAP_CLUSTER_MAX)
		    && compact_unlock_should_abort(zone_lru_lock(zone), flags,
						   &locked, cc))
			break;

		if (!pfn_valid_within(low_pfn))
			goto isolate_fail;
		nr_scanned++;

		page = pfn_to_page(low_pfn);

		if (!valid_page)
			valid_page = page;

		/*
		 * Skip if free. We read page order here without zone lock
		 * which is generally unsafe, but the race window is small and
		 * the worst thing that can happen is that we skip some
		 * potential isolation targets.
		 */
		// A page that is still in the buddy system cannot be migrated;
		// skip it (and the rest of its free block).
		if (PageBuddy(page)) {
			unsigned long freepage_order = page_order_unsafe(page);

			/*
			 * Without lock, we cannot be sure that what we got is
			 * a valid page order. Consider only values in the
			 * valid order range to prevent low_pfn overflow.
			 */
			if (freepage_order > 0 && freepage_order < MAX_ORDER)
				low_pfn += (1UL << freepage_order) - 1;
			continue;
		}

		/*
		 * Regardless of being on LRU, compound pages such as THP and
		 * hugetlbfs are not to be compacted. We can potentially save
		 * a lot of iterations if we skip them at once. The check is
		 * racy, but we can consider only valid values and the only
		 * danger is skipping too much.
		 */
		// Compound pages are not suitable for migration.
		if (PageCompound(page)) {
			const unsigned int order = compound_order(page);

			if (likely(order < MAX_ORDER))
				low_pfn += (1UL << order) - 1;
			goto isolate_fail;
		}

		/*
		 * Check may be lockless but that's ok as we recheck later.
		 * It's possible to migrate LRU and non-lru movable pages.
		 * Skip any other type of page
		 */
		// Pages on an LRU list, or non-LRU movable pages such as
		// balloon pages, can be migrated; all other page types are
		// skipped.
		if (!PageLRU(page)) {
			/*
			 * __PageMovable can return false positive so we need
			 * to verify it under page_lock.
			 */
			if (unlikely(__PageMovable(page)) &&
			    !PageIsolated(page)) {
				if (locked) {
					spin_unlock_irqrestore(zone_lru_lock(zone),
							       flags);
					locked = false;
				}

				if (!isolate_movable_page(page, isolate_mode))
					goto isolate_success;
			}

			goto isolate_fail;
		}
		/*
		 * Migration will fail if an anonymous page is pinned in memory,
		 * so avoid taking lru_lock and isolating it unnecessarily in an
		 * admittedly racy check.
		 */
		// For an anonymous page, normally page_count(page) ==
		// page_mapcount(page), i.e. page->_refcount == page->_mapcount + 1.
		// If they differ, someone in the kernel is quietly holding an
		// extra reference to this anonymous page (it is pinned), so it
		// is not suitable for migration.
		if (!page_mapping(page) &&
		    page_count(page) > page_mapcount(page))
			goto isolate_fail;

		/*
		 * Only allow to migrate anonymous pages in GFP_NOFS context
		 * because those do not depend on fs locks.
		 */
		if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
			goto isolate_fail;

		/* If we already hold the lock, we can skip some rechecking */
		// Take zone->lru_lock, then recheck under the lock that the
		// page is still an LRU page.
		if (!locked) {
			locked = compact_trylock_irqsave(zone_lru_lock(zone),
							 &flags, cc);
			if (!locked)
				break;

			/* Recheck PageLRU and PageCompound under lock */
			if (!PageLRU(page))
				goto isolate_fail;

			/*
			 * Page become compound since the non-locked check,
			 * and it's on LRU. It can only be a THP so the order
			 * is safe to read and it's 0 for tail pages.
			 */
			if (unlikely(PageCompound(page))) {
				low_pfn += (1UL << compound_order(page)) - 1;
				goto isolate_fail;
			}
		}

		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);

		/* Try isolate the page */
		if (__isolate_lru_page(page, isolate_mode) != 0)
			goto isolate_fail;

		VM_BUG_ON_PAGE(PageCompound(page), page);

		/* Successfully isolated */
		del_page_from_lru_list(page, lruvec, page_lru(page));
		inc_node_page_state(page,
				    NR_ISOLATED_ANON + page_is_file_cache(page));
isolate_success:
		list_add(&page->lru, &cc->migratepages);
		cc->nr_migratepages++;
		nr_isolated++;

		/*
		 * Record where we could have freed pages by migration and not
		 * yet flushed them to buddy allocator.
		 * - this is the lowest page that was isolated and likely be
		 * then freed by migration.
		 */
		if (!cc->last_migrated_pfn)
			cc->last_migrated_pfn = low_pfn;

		/* Avoid isolating too much */
		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) {
			++low_pfn;
			break;
		}

		continue;
isolate_fail:
		if (!skip_on_failure)
			continue;

		/*
		 * We have isolated some pages, but then failed. Release them
		 * instead of migrating, as we cannot form the cc->order buddy
		 * page anyway.
		 */
		if (nr_isolated) {
			if (locked) {
				spin_unlock_irqrestore(zone_lru_lock(zone), flags);
				locked = false;
			}
			putback_movable_pages(&cc->migratepages);
			cc->nr_migratepages = 0;
			cc->last_migrated_pfn = 0;
			nr_isolated = 0;
		}

		if (low_pfn < next_skip_pfn) {
			low_pfn = next_skip_pfn - 1;
			/*
			 * The check near the loop beginning would have updated
			 * next_skip_pfn too, but this is a bit simpler.
			 */
			next_skip_pfn += 1UL << cc->order;
		}
	}
	/*
	 * The PageBuddy() check could have potentially brought us outside
	 * the range to be scanned.
	 */
	if (unlikely(low_pfn > end_pfn))
		low_pfn = end_pfn;

	if (locked)
		spin_unlock_irqrestore(zone_lru_lock(zone), flags);

	/*
	 * Update the pageblock-skip information and cached scanner pfn,
	 * if the whole pageblock was scanned without isolating any page.
	 */
	if (low_pfn == end_pfn)
		update_pageblock_skip(cc, valid_page, nr_isolated, true);

	trace_mm_compaction_isolate_migratepages(start_pfn, low_pfn,
						 nr_scanned, nr_isolated);

	cc->total_migrate_scanned += nr_scanned;
	if (nr_isolated)
		count_compact_events(COMPACTISOLATED, nr_isolated);

	return low_pfn;
}
3. Debug
My feeling is that the allocation fails because some page in the target range is pinned, but I never worked out why it is pinned or who pinned it. From the code walk in section 2, the paths that leave a page un-isolated are: compound pages, anonymous pages holding an extra reference (page_count() > page_mapcount()), page-cache pages when __GFP_FS is not set, and non-LRU pages that are not movable.
I did a quick analysis with ftrace. It did not solve the problem either, but it is a good way to deepen one's understanding of CMA. The tracepoint used for this can be enabled as shown below.
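The test_pages_isolated trace event belongs to the page_isolation event group, so it can be enabled through the usual tracefs files (standard paths; adjust if tracefs is mounted elsewhere):

#:/sys/kernel/debug/tracing # echo 1 > events/page_isolation/test_pages_isolated/enable
#:/sys/kernel/debug/tracing # echo 1 > tracing_on

After reproducing the failure, the captured events can be read back from the trace file, as in the capture below.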
Take the allocator process in the trace below: it failed to isolate the range [0xc3dc0, 0xc3de4). The next step was to add some logging in isolate_migratepages_block() to see the exact reason, along the lines of the sketch that follows. My added log did catch the failure once, but I failed to save the output and cannot find it any more...
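Since the original logging is gone, here is a sketch (an assumption, not the patch actually used at the time) of the kind of instrumentation that helps: dump the state of every page that takes the isolate_fail path in isolate_migratepages_block(). dump_page() prints the page's refcount, mapcount, mapping and flags, which is usually enough to tell whether, and roughly how, the page is pinned.

isolate_fail:
		if (!skip_on_failure) {
			/*
			 * Debug only: report why this pfn could not be
			 * isolated. Rate-limited because this path can be
			 * very hot. Note: "page" is stale here if we came
			 * from the pfn_valid_within() check, which jumps to
			 * isolate_fail before "page" is assigned.
			 */
			if (page && printk_ratelimit())
				dump_page(page, "isolate_fail");
			continue;
		}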
#:/sys/kernel/debug/tracing # cat trace | grep fail
allocator@2.0-s-2100 [002] .... 2086.979016: test_pages_isolated: start_pfn=0xc3dc0 end_pfn=0xc3de4 fin_pfn=0xc3dc2 ret=fail
input@1.0-servi-2071 [001] .... 573.011902: test_pages_isolated: start_pfn=0xc2400 end_pfn=0xc41aa fin_pfn=0xc41aa ret=success
input@1.0-servi-2071 [002] .... 573.170969: test_pages_isolated: start_pfn=0xc4200 end_pfn=0xc5faa fin_pfn=0xc5faa ret=success
input@1.0-servi-2071 [002] .... 573.448709: test_pages_isolated: start_pfn=0xc6000 end_pfn=0xc7daa fin_pfn=0xc7daa ret=success
input@1.0-servi-2071 [001] .... 573.736047: test_pages_isolated: start_pfn=0xc7e00 end_pfn=0xc9baa fin_pfn=0xc9baa ret=success
input@1.0-servi-2071 [000] .... 726.651208: test_pages_isolated: start_pfn=0xc2400 end_pfn=0xc41aa fin_pfn=0xc41aa ret=success
input@1.0-servi-2071 [000] .... 726.669588: test_pages_isolated: start_pfn=0xc4200 end_pfn=0xc5faa fin_pfn=0xc5fac ret=success
input@1.0-servi-2071 [000] .... 726.688339: test_pages_isolated: start_pfn=0xc6000 end_pfn=0xc7daa fin_pfn=0xc7dac ret=success
input@1.0-servi-2071 [000] .... 726.709690: test_pages_isolated: start_pfn=0xc7e00 end_pfn=0xc9baa fin_pfn=0xc9baa ret=success
input@1.0-servi-2071 [003] .... 737.278719: test_pages_isolated: start_pfn=0xc2400 end_pfn=0xc41aa fin_pfn=0xc41aa ret=success
input@1.0-servi-2071 [003] .... 737.296976: test_pages_isolated: start_pfn=0xc4200 end_pfn=0xc5faa fin_pfn=0xc5fac ret=success
input@1.0-servi-2071 [003] .... 737.314510: test_pages_isolated: start_pfn=0xc6000 end_pfn=0xc7daa fin_pfn=0xc7dac ret=success
input@1.0-servi-2071 [003] .... 737.334471: test_pages_isolated: start_pfn=0xc7e00 end_pfn=0xc9baa fin_pfn=0xc9baa ret=success
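Reading the one failing line above: the tracepoint prints ret=fail whenever fin_pfn < end_pfn, and fin_pfn is the PFN at which the isolation recheck stopped. So in this capture the page at PFN 0xc3dc2 is the one that was neither free nor isolated; that is exactly the page whose owner the dump_page() instrumentation sketched above would identify. (The success lines where fin_pfn is slightly larger than end_pfn are normal: the scan can step past end_pfn when it lands inside a high-order free buddy block.)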
4. Summary
For this kind of problem, the only workaround I have come up with is to enlarge the reserved CMA area (for example via the cma= kernel command-line parameter, or the size of the shared-dma-pool reserved-memory node in the device tree). If you know of a better fix or a better debugging method, please share it.
Any new ideas on this problem?
No, I haven't looked at it for a long time. You could check the recent CMA patches on the latest kernel branches; there seem to be some fixes that improve CMA utilization.
I hit this problem too when allocating huge pages: the hugepage allocation took half an hour, extremely slow, and dmesg likewise showed PFNs busy.