cgroup (3): subsystems

By adtxl / 2021-09-29

1. Introduction

A subsystem is a kernel component that modifies the behavior of the processes in a cgroup. Various subsystems have been implemented, making it possible to do things such as limiting the amount of CPU time and memory available to a cgroup, accounting for the CPU time used by a cgroup, and freezing and resuming execution of the processes in a cgroup. Subsystems are sometimes also known as resource controllers (or simply, controllers).

2. Commonly used subsystems

blkio - block I/O quotas » this subsystem sets limits on input/output access to and from block devices such as physical drives (disk, solid state, USB, etc.).
cpu - CPU time allocation limits » this subsystem uses the scheduler to provide cgroup tasks access to the CPU.
cpuacct - CPU usage reporting » this subsystem generates automatic reports on CPU resources used by tasks in a cgroup.
cpuset - CPU binding » this subsystem assigns individual CPUs (on a multicore system) and memory nodes to tasks in a cgroup.
devices - device access control » this subsystem allows or denies access to devices by tasks in a cgroup.
freezer - suspending/resuming a cgroup » this subsystem suspends or resumes tasks in a cgroup.
memory - memory limits » this subsystem sets limits on memory use by tasks in a cgroup, and generates automatic reports on memory resources used by those tasks.
net_cls - network control together with tc » this subsystem tags network packets with a class identifier (classid) that allows the Linux traffic controller (tc) to identify packets originating from a particular cgroup task.
net_prio - network interface priority » this subsystem provides a way to dynamically set the priority of network traffic per network interface.
ns - resource namespace restriction » the namespace subsystem.

2.1 blkio

  • common

    • blkio.reset_stats - resets the statistics; write an integer to this file
    • blkio.time - reports the time the cgroup had I/O access to each device - device_types:node_numbers milliseconds (device_types:node_numbers is the device's major:minor number)
    • blkio.sectors - reports the number of sectors the cgroup transferred to or from each device - device_types:node_numbers sector_count
    • blkio.avg_queue_size - reports the average I/O queue size (requires CONFIG_DEBUG_BLK_CGROUP=y)
    • blkio.group_wait_time - reports the total time the cgroup spent waiting (requires CONFIG_DEBUG_BLK_CGROUP=y, in ns)
    • blkio.empty_time - reports the total time the cgroup spent without any pending I/O (requires CONFIG_DEBUG_BLK_CGROUP=y, in ns)
    • blkio.idle_time - reports the total time (in nanoseconds, ns) the scheduler spent idling for a cgroup in anticipation of a better request than those requests already in other queues or from other groups.
    • blkio.dequeue - reports the number of times the cgroup's I/O requests were dequeued by each device (requires CONFIG_DEBUG_BLK_CGROUP=y) - device_types:node_numbers number
    • blkio.io_serviced - reports the number of I/O operations (read, write, sync, or async) the cgroup performed on each device, as counted by the CFQ scheduler - device_types:node_numbers operation number
    • blkio.io_service_bytes - reports the number of bytes the cgroup transferred to or from each device per operation type (read, write, sync, or async), as counted by the CFQ scheduler - device_types:node_numbers operation bytes
    • blkio.io_service_time - reports the time (in ns) spent servicing the cgroup's I/O operations (read, write, sync, or async) on each device, as counted by the CFQ scheduler - device_types:node_numbers operation time
    • blkio.io_wait_time - reports the time (in ns) the cgroup's I/O operations (read, write, sync, or async) spent waiting on each device - device_types:node_numbers operation time
    • blkio.io_merged - reports the number of BIO requests merged into requests for I/O operations (read, write, sync, or async) by this cgroup - number operation
    • blkio.io_queued - reports the number of requests queued for I/O operations (read, write, sync, or async) by this cgroup - number operation
  • Proportional weight division policy - allocates block I/O resources proportionally

    • blkio.weight - a relative weight from 100 to 1000; overridden for specific devices by blkio.weight_device
    • blkio.weight_device - the weight for a specific device - device_types:node_numbers weight
  • I/O throttling (upper limit) policy - sets upper limits on I/O operations

    • Upper limit on bytes read/written per second:
      blkio.throttle.read_bps_device - device_types:node_numbers bytes_per_second
      blkio.throttle.write_bps_device - device_types:node_numbers bytes_per_second
    • Upper limit on read/write operations per second:
      blkio.throttle.read_iops_device - device_types:node_numbers operations_per_second
      blkio.throttle.write_iops_device - device_types:node_numbers operations_per_second
    • Statistics per operation type (read, write, sync, or async) as seen by the throttling policy:
      blkio.throttle.io_serviced - device_types:node_numbers operation number
      blkio.throttle.io_service_bytes - device_types:node_numbers operation bytes
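
To make the throttling file format concrete, here is a minimal sketch that builds one "device_types:node_numbers bytes_per_second" line and writes it to blkio.throttle.read_bps_device. The helper names and the cgroup directory argument are hypothetical; on a real machine the directory would be wherever the blkio hierarchy is mounted, and writing to it needs root.

```python
from pathlib import Path


def format_throttle_rule(major: int, minor: int, limit: int) -> str:
    """Build one 'major:minor value' line in the blkio.throttle.*_device format."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return f"{major}:{minor} {limit}"


def set_read_bps(cgroup_dir: Path, major: int, minor: int, bytes_per_second: int) -> None:
    """Write a read-bandwidth limit for one device into the given cgroup directory.

    cgroup_dir is an assumption: the caller passes the directory of an
    existing blkio cgroup.
    """
    rule = format_throttle_rule(major, minor, bytes_per_second)
    (cgroup_dir / "blkio.throttle.read_bps_device").write_text(rule + "\n")
```

For example, set_read_bps(Path("/sys/fs/cgroup/blkio/demo"), 8, 0, 1048576) would cap reads from device 8:0 at 1 MiB per second, assuming a cgroup named demo exists under that (assumed) mount point.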

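2.2 cpu - CPU time quota

  • CFS (Completely Fair Scheduler) policy - upper limit on CPU resources

    • cpu.cfs_period_us, cpu.cfs_quota_us - required - used together: the former sets the length of a period (in microseconds) and the latter sets the maximum CPU time (in microseconds) the cgroup may use within each period, which caps the tasks' CPU usage (setting cfs_quota_us to twice cfs_period_us lets the cgroup fully use two cores).
    • cpu.stat - records CPU statistics: nr_periods (number of cfs_period_us periods that have elapsed), nr_throttled (number of times the tasks in the cgroup were throttled), throttled_time (total time, in nanoseconds, the tasks in the cgroup were throttled)
    • cpu.shares - optional - relative weight for sharing CPU time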
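  • RT (Real-Time scheduler) policy - CPU time allocation for real-time tasks

    • cpu.rt_period_us, cpu.rt_runtime_us - used together: in every cpu.rt_period_us (microseconds), the real-time tasks in the cgroup are allocated cpu.rt_runtime_us (microseconds) of CPU time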
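
The relationship between cpu.cfs_period_us and cpu.cfs_quota_us, and the counters in cpu.stat, can be sketched as below. This is only an illustration: the function names are hypothetical, and the cpu.stat text is assumed to be the "key value" lines described above.

```python
from typing import Dict


def quota_for_cpus(period_us: int, cpus: float) -> int:
    """Return a cpu.cfs_quota_us value worth `cpus` CPUs of time per period."""
    if period_us <= 0 or cpus <= 0:
        raise ValueError("period_us and cpus must be positive")
    return round(period_us * cpus)


def parse_cpu_stat(text: str) -> Dict[str, int]:
    """Parse 'key value' lines such as nr_periods, nr_throttled, throttled_time."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            stats[key] = int(value)
    return stats


def throttled_ratio(stats: Dict[str, int]) -> float:
    """Fraction of elapsed periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0
```

With a period of 100000 microseconds, quota_for_cpus(100000, 2) gives 200000, which matches the "twice the period for two cores" rule above.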

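2.3 cpuacct - CPU usage reporting

  • cpuacct.usage - total CPU time (in nanoseconds) consumed by all tasks in the cgroup
  • cpuacct.stat - CPU time consumed by all tasks in the cgroup, split into user mode and kernel mode
  • cpuacct.usage_percpu - CPU time consumed by all tasks in the cgroup on each CPU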
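
As a small sketch of how these reports might be consumed: cpuacct.usage_percpu holds one nanosecond counter per CPU separated by spaces, and cpuacct.stat reports user and system time in USER_HZ ticks. The function names here are hypothetical, and the tick rate is taken from os.sysconf unless the caller passes one.

```python
import os
from typing import Dict, List


def percpu_seconds(usage_percpu: str) -> List[float]:
    """Convert the space-separated nanosecond counters to seconds, one per CPU."""
    return [int(ns) / 1e9 for ns in usage_percpu.split()]


def stat_seconds(stat_text: str, ticks_per_second: int = 0) -> Dict[str, float]:
    """Convert 'user N' / 'system N' lines from USER_HZ ticks to seconds."""
    if ticks_per_second <= 0:
        # Ask the OS for the tick rate when the caller did not supply one.
        ticks_per_second = os.sysconf("SC_CLK_TCK")
    result: Dict[str, float] = {}
    for line in stat_text.splitlines():
        if line.strip():
            key, ticks = line.split()
            result[key] = int(ticks) / ticks_per_second
    return result
```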

2.4 cpuset - CPU binding

  • cpuset.cpus - required - the CPUs the cgroup may use; for example, 0-2,16 means the four CPUs 0, 1, 2, and 16

  • cpuset.mems - required - the memory nodes the cgroup may use

  • cpuset.memory_migrate - optional - whether the data in pages is migrated when cpuset.mems changes, default 0

  • cpuset.cpu_exclusive - optional - whether the CPUs are used exclusively, default 0

  • cpuset.mem_exclusive - optional - whether the memory nodes are used exclusively, default 0

  • cpuset.mem_hardwall - optional - whether the memory of the tasks in the cgroup is isolated, default 0

  • cpuset.memory_pressure - optional - a read-only file that contains a running average of the memory pressure created by the processes in this cpuset

  • cpuset.memory_pressure_enabled - optional - switch for cpuset.memory_pressure, default 0

  • cpuset.memory_spread_page - optional - contains a flag (0 or 1) that specifies whether file system buffers should be spread evenly across the memory nodes allocated to this cpuset, default 0

  • cpuset.memory_spread_slab - optional - contains a flag (0 or 1) that specifies whether kernel slab caches for file input/output operations should be spread evenly across the cpuset, default 0

  • cpuset.sched_load_balance - optional - whether the cgroup's CPU load is balanced across the CPUs in the cpuset, default 1

  • cpuset.sched_relax_domain_level - optional - the policy used by cpuset.sched_load_balance

    • -1 = Use the system default value for load balancing
    • 0 = Do not perform immediate load balancing; balance loads only periodically
    • 1 = Immediately balance loads across threads on the same core
    • 2 = Immediately balance loads across cores in the same package
    • 3 = Immediately balance loads across CPUs on the same node or blade
    • 4 = Immediately balance loads across several CPUs on architectures with non-uniform memory access (NUMA)
    • 5 = Immediately balance loads across all CPUs on architectures with NUMA
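
The list syntax used by cpuset.cpus (for example 0-2,16) is easy to get wrong by hand, so here is a minimal parser sketch. The function name is hypothetical; it only expands the comma-and-range notation described above and does not check that the CPUs actually exist.

```python
from typing import List


def parse_cpu_list(spec: str) -> List[int]:
    """Expand a cpuset list such as '0-2,16' into individual CPU numbers."""
    cpus: List[int] = []
    for part in spec.strip().split(","):
        if not part:
            continue  # tolerate an empty file or a trailing comma
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))  # ranges are inclusive
        else:
            cpus.append(int(part))
    return cpus
```

The same notation is used for cpuset.mems, so the helper should work for memory node lists as well.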

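2.5 devices - device access control for a cgroup

  • Device blacklist/whitelist
  • devices.allow - the allow list
  • devices.deny - the deny list
  • Syntax - type device_types:node_numbers access
    • type - b (block device), c (character device), a (all devices)
    • access - r (read), w (write), m (create)
  • devices.list - reports the devices the tasks in the cgroup may access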
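
A minimal sketch of building an entry in the syntax above, suitable for writing to devices.allow or devices.deny. The function name is hypothetical. The major and minor numbers are accepted as either integers or "*", the wildcard the devices subsystem uses for "any".

```python
from typing import Union


def device_rule(dev_type: str, major: Union[int, str], minor: Union[int, str], access: str) -> str:
    """Build a 'type major:minor access' entry for devices.allow / devices.deny."""
    if dev_type not in ("a", "b", "c"):
        raise ValueError("type must be a (all), b (block) or c (character)")
    if not access or set(access) - set("rwm"):
        raise ValueError("access must be a non-empty combination of r, w and m")
    return f"{dev_type} {major}:{minor} {access}"
```

For instance, device_rule("c", 1, 3, "rwm") yields the entry for character device 1:3 with read, write, and create access.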

2.6 freezer - suspending/resuming a cgroup

  • Not available in the root cgroup
  • freezer.state - FROZEN (suspended), FREEZING (suspension in progress), THAWED (resumed)
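
Because a cgroup can pass through FREEZING before it reaches FROZEN, a caller usually writes the target state and then re-reads freezer.state. The sketch below shows one way to do that; the function name, the polling interval, and the timeout are all assumptions, and the cgroup directory is supplied by the caller.

```python
import time
from pathlib import Path


def set_freezer_state(cgroup_dir: Path, state: str, timeout: float = 5.0) -> str:
    """Write FROZEN or THAWED to freezer.state and wait until it takes effect.

    Returns the last state read, which may still be FREEZING if the
    timeout expired first.
    """
    if state not in ("FROZEN", "THAWED"):
        raise ValueError("state must be FROZEN or THAWED")
    state_file = cgroup_dir / "freezer.state"
    state_file.write_text(state + "\n")
    deadline = time.monotonic() + timeout
    while True:
        current = state_file.read_text().strip()
        if current == state or time.monotonic() >= deadline:
            return current
        time.sleep(0.05)  # hypothetical polling interval
```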

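2.7 memory - memory limits

  • memory.usage_in_bytes - reports the current memory usage of the cgroup, in bytes
  • memory.memsw.usage_in_bytes - reports the memory plus swap space currently used by the processes in the cgroup
  • memory.max_usage_in_bytes - reports the maximum memory used by the cgroup
  • memory.memsw.max_usage_in_bytes - reports the maximum memory plus swap used
  • memory.limit_in_bytes - the maximum memory limit for the cgroup; accepts the suffixes k, m, g; -1 removes the limit
  • memory.memsw.limit_in_bytes - the maximum memory plus swap limit; accepts the suffixes k, m, g; -1 removes the limit
  • memory.failcnt - reports the number of times the memory limit was reached
  • memory.memsw.failcnt - reports the number of times the memory plus swap limit was reached
  • memory.force_empty - when set to 0 while the cgroup has no tasks, frees the cgroup's memory pages
  • memory.swappiness - the swapping policy; 60 is the baseline, values below 60 make swapping out less likely and values above 60 make it more likely
  • memory.use_hierarchy - whether the accounting and limits also apply to child cgroups
  • memory.oom_control - 0 means the OOM killer is enabled, so a process is killed when the cgroup runs out of memory; 1 disables it
  • memory.stat - reports memory statistics for the cgroup, including: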
    • cache - page cache, including tmpfs (shmem), in bytes
    • rss - anonymous and swap cache, not including tmpfs (shmem), in bytes
    • mapped_file - size of memory-mapped files, including tmpfs (shmem), in bytes
    • pgpgin - number of pages paged into memory
    • pgpgout - number of pages paged out of memory
    • swap - swap usage, in bytes
    • active_anon - anonymous and swap cache on active least-recently-used (LRU) list, including tmpfs (shmem), in bytes
    • inactive_anon - anonymous and swap cache on inactive LRU list, including tmpfs (shmem), in bytes
    • active_file - file-backed memory on active LRU list, in bytes
    • inactive_file - file-backed memory on inactive LRU list, in bytes
    • unevictable - memory that cannot be reclaimed, in bytes
    • hierarchical_memory_limit - memory limit for the hierarchy that contains the memory cgroup, in bytes
    • hierarchical_memsw_limit - memory plus swap limit for the hierarchy that contains the memory cgroup, in bytes
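
The k, m, g suffixes accepted by memory.limit_in_bytes are powers of 1024. A rough sketch of turning such a value into bytes and writing a limit follows; the function names are hypothetical and the cgroup directory is whatever the caller passes in.

```python
from pathlib import Path

_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text: str) -> int:
    """Convert values such as '512m' or '2g' to bytes; '-1' (no limit) is kept as -1."""
    text = text.strip().lower()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)


def set_memory_limit(cgroup_dir: Path, limit: str) -> int:
    """Write the limit, in bytes, to memory.limit_in_bytes and return the value written."""
    value = parse_size(limit)
    (cgroup_dir / "memory.limit_in_bytes").write_text(f"{value}\n")
    return value
```

The file accepts the suffixed form directly, so converting first is a design choice: it lets the caller compare the requested limit with memory.usage_in_bytes before writing.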

2.8 net_cls

  • net_cls.classid - specifies the tc handle; network control is then done through tc
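
The value written to net_cls.classid is a hexadecimal number of the form 0xAAAABBBB, where AAAA is the major and BBBB the minor part of the tc handle. The conversion can be sketched as below; the function names are hypothetical.

```python
def classid_from_handle(handle: str) -> int:
    """Convert a tc handle such as '10:1' (hex major:minor) to a classid integer."""
    major, minor = handle.split(":")
    return (int(major, 16) << 16) | int(minor, 16)


def handle_from_classid(classid: int) -> str:
    """Convert a classid integer back to the 'major:minor' tc handle notation."""
    return f"{classid >> 16:x}:{classid & 0xFFFF:x}"
```

For the handle 10:1 this gives 0x100001, which is the value one would write to net_cls.classid.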

net_prio - sets the network interface priority for tasks

  • net_prio.prioidx - a read-only file which contains a unique integer value that the kernel uses as an internal representation of this cgroup.
  • net_prio.ifpriomap - the priority of traffic on each network interface - interface_name priority

2.9 Other

  • tasks - the PIDs of all tasks in the cgroup
  • cgroup.event_control - event API
  • cgroup.procs - thread group IDs
  • release_agent (present in the root cgroup only) - the script to run when the cgroup has no tasks left, depending on notify_on_release
  • notify_on_release - whether release_agent is run when the cgroup has no tasks left
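
Membership is managed through these files: reading cgroup.procs lists the thread group IDs, and writing an ID to it moves that process into the cgroup. A minimal sketch, with hypothetical function names and a caller-supplied cgroup directory:

```python
from pathlib import Path
from typing import List


def list_procs(cgroup_dir: Path) -> List[int]:
    """Return the thread group IDs listed in cgroup.procs, one per line."""
    return [int(line) for line in (cgroup_dir / "cgroup.procs").read_text().split()]


def attach_proc(cgroup_dir: Path, pid: int) -> None:
    """Move a process into the cgroup by writing its ID to cgroup.procs."""
    (cgroup_dir / "cgroup.procs").write_text(f"{pid}\n")
```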
