1. Introduction
A subsystem is a kernel component that modifies the behavior of the processes in a cgroup. Various subsystems have been implemented, making it possible to do things such as limiting the amount of CPU time and memory available to a cgroup, accounting for the CPU time used by a cgroup, and freezing and resuming execution of the processes in a cgroup. Subsystems are sometimes also known as resource controllers (or simply, controllers).
2. Common subsystems
blkio - block I/O quotas » this subsystem sets limits on input/output access to and from block devices such as physical drives (disk, solid state, USB, etc.).
cpu - CPU time allocation limits » this subsystem uses the scheduler to provide cgroup tasks access to the CPU.
cpuacct - CPU usage reports » this subsystem generates automatic reports on CPU resources used by tasks in a cgroup.
cpuset - CPU binding » this subsystem assigns individual CPUs (on a multicore system) and memory nodes to tasks in a cgroup.
devices - device access control » this subsystem allows or denies access to devices by tasks in a cgroup.
freezer - cgroup suspend/resume » this subsystem suspends or resumes tasks in a cgroup.
memory - memory limits » this subsystem sets limits on memory use by tasks in a cgroup, and generates automatic reports on memory resources used by those tasks.
net_cls - network limits in combination with tc » this subsystem tags network packets with a class identifier (classid) that allows the Linux traffic controller (tc) to identify packets originating from a particular cgroup task.
net_prio - network interface priority » this subsystem provides a way to dynamically set the priority of network traffic per network interface.
ns - namespace restrictions » the namespace subsystem.
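Which of these subsystems the running kernel actually provides can be checked in /proc/cgroups. That file is a table with the columns subsys_name, hierarchy, num_cgroups and enabled. The following is a minimal Python sketch of a parser for it. The function name and the sample contents are illustrative assumptions, not output captured from a real machine.

```python
def parse_proc_cgroups(text: str) -> dict[str, dict[str, int]]:
    """Parse /proc/cgroups text into {subsystem: {"hierarchy", "num_cgroups", "enabled"}}."""
    subsystems: dict[str, dict[str, int]] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "#subsys_name hierarchy num_cgroups enabled" header
        if not line or line.startswith("#"):
            continue
        name, hierarchy, num_cgroups, enabled = line.split()
        subsystems[name] = {
            "hierarchy": int(hierarchy),
            "num_cgroups": int(num_cgroups),
            "enabled": int(enabled),
        }
    return subsystems


# Illustrative sample only; on a real host use open("/proc/cgroups").read()
SAMPLE = (
    "#subsys_name\thierarchy\tnum_cgroups\tenabled\n"
    "cpuset\t2\t1\t1\n"
    "cpu\t3\t64\t1\n"
    "memory\t4\t70\t1\n"
    "net_prio\t0\t1\t0\n"
)

if __name__ == "__main__":
    info = parse_proc_cgroups(SAMPLE)
    # Names of the subsystems whose "enabled" column is 1
    print([name for name, fields in info.items() if fields["enabled"]])
```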
2.1 blkio
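Common files
blkio.reset_stats
- resets the statistics recorded in the other files; write an integer to this file
blkio.time
- time the cgroup had I/O access to each device; entries: device_types:node_numbers milliseconds
blkio.sectors
- number of sectors the cgroup transferred to or from each device; entries: device_types:node_numbers sector_count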
blkio.avg_queue_size
- average size of the cgroup's I/O queue (requires CONFIG_DEBUG_BLK_CGROUP=y)
blkio.group_wait_time
- total time the cgroup spent waiting for a timeslice for its queues, in ns (requires CONFIG_DEBUG_BLK_CGROUP=y)
blkio.empty_time
- total time the cgroup spent without any pending I/O requests, in ns (requires CONFIG_DEBUG_BLK_CGROUP=y)
blkio.idle_time
- reports the total time (in nanoseconds, ns) the scheduler spent idling for a cgroup in anticipation of a better request than those requests already in other queues or from other groups.
blkio.dequeue
- number of times the cgroup's I/O requests were dequeued by each device (requires CONFIG_DEBUG_BLK_CGROUP=y); entries: device_types:node_numbers number
blkio.io_serviced
- number of I/O operations (read, write, sync, or async) the cgroup performed on each device, as counted by the CFQ scheduler; entries: device_types:node_numbers operation number
blkio.io_service_bytes
- bytes the cgroup transferred to or from each device per operation type (read, write, sync, or async), as counted by the CFQ scheduler; entries: device_types:node_numbers operation bytes
blkio.io_service_time
- time spent servicing the cgroup's I/O operations (read, write, sync, or async) on each device, as counted by the CFQ scheduler, in ns; entries: device_types:node_numbers operation time
blkio.io_wait_time
- time the cgroup's I/O operations (read, write, sync, or async) on each device spent waiting, in ns; entries: device_types:node_numbers operation time
blkio.io_merged
- number of BIO requests merged into the cgroup's I/O requests (read, write, sync, or async); entries: number operation
blkio.io_queued
- number of the cgroup's I/O requests (read, write, sync, or async) that were queued; entries: number operation
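The per-device report files above share a simple line format. Below is a minimal Python sketch that parses blkio.io_serviced or blkio.io_service_bytes output. It assumes lines of the form major:minor Operation value, followed by a final Total line. The function name and the sample numbers are hypothetical.

```python
def parse_blkio_ops(text: str) -> tuple[dict[str, dict[str, int]], int]:
    """Parse blkio.io_serviced / blkio.io_service_bytes style output.

    Returns ({"major:minor": {"Read": n, "Write": n, ...}}, grand_total).
    """
    per_device: dict[str, dict[str, int]] = {}
    grand_total = 0
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            # Per-device line, e.g. "8:0 Read 1234"
            device, operation, value = fields
            per_device.setdefault(device, {})[operation] = int(value)
        elif len(fields) == 2 and fields[0] == "Total":
            # Final summary line, e.g. "Total 1801"
            grand_total = int(fields[1])
    return per_device, grand_total


# Made-up sample for illustration
SAMPLE = """8:0 Read 1234
8:0 Write 567
8:0 Sync 1500
8:0 Async 301
8:0 Total 1801
Total 1801
"""

if __name__ == "__main__":
    devices, total = parse_blkio_ops(SAMPLE)
    print(devices["8:0"]["Read"], total)
```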
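Proportional weight division policy - distributes block I/O by relative weight
blkio.weight
- relative weight from 100 to 1000; overridden for individual devices by blkio.weight_device
blkio.weight_device
- weight for a specific device; entries: device_types:node_numbers weight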
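Under contention, weights translate into proportional shares of disk time. For example, a cgroup with weight 500 competing with one at 250 gets roughly two thirds. The sketch below uses hypothetical helper names. It builds a blkio.weight_device entry and computes that idealized split. Actual throughput also depends on the I/O scheduler in use and on whether the other cgroups are busy.

```python
def weight_device_entry(major: int, minor: int, weight: int) -> str:
    """Build a blkio.weight_device line such as "8:0 500"."""
    if not 100 <= weight <= 1000:
        raise ValueError(f"weight must be in 100-1000, got {weight}")
    return f"{major}:{minor} {weight}"


def expected_shares(weights: dict[str, int]) -> dict[str, float]:
    """Idealized share of disk time per cgroup when all of them compete for the device."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    print(weight_device_entry(8, 0, 500))
    print(expected_shares({"A": 500, "B": 250}))  # A about 0.667, B about 0.333
```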
I/O throttling (upper limit) policy - caps I/O operations
- Maximum bytes read/written per second
blkio.throttle.read_bps_device
- entries: device_types:node_numbers bytes_per_second
blkio.throttle.write_bps_device
- entries: device_types:node_numbers bytes_per_second
- Maximum read/write operations per second
blkio.throttle.read_iops_device
- entries: device_types:node_numbers operations_per_second
blkio.throttle.write_iops_device
- entries: device_types:node_numbers operations_per_second
- Per-operation (read, write, sync, or async) statistics as seen by the throttling policy
blkio.throttle.io_serviced
- number of operations; entries: device_types:node_numbers operation number
blkio.throttle.io_service_bytes
- bytes transferred; entries: device_types:node_numbers operation bytes
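A throttling rule is written as a single "major:minor value" line into the matching file. Here is a minimal sketch, assuming a cgroup v1 blkio hierarchy. The helper names are hypothetical, and so is the /sys/fs/cgroup/blkio/mygroup path in the commented-out call.

```python
import os

THROTTLE_KNOBS = {
    "read_bps_device", "write_bps_device",
    "read_iops_device", "write_iops_device",
}


def throttle_entry(major: int, minor: int, limit: int) -> str:
    """Build a blkio.throttle.* line such as "8:0 1048576"."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return f"{major}:{minor} {limit}"


def set_throttle(cgroup_dir: str, knob: str, major: int, minor: int, limit: int) -> None:
    """Write one rule to e.g. blkio.throttle.read_bps_device under cgroup_dir."""
    if knob not in THROTTLE_KNOBS:
        raise ValueError(f"unknown throttle knob: {knob}")
    path = os.path.join(cgroup_dir, f"blkio.throttle.{knob}")
    with open(path, "w") as f:
        f.write(throttle_entry(major, minor, limit))


if __name__ == "__main__":
    # Hypothetical path; needs root and a mounted blkio hierarchy:
    # set_throttle("/sys/fs/cgroup/blkio/mygroup", "read_bps_device", 8, 0, 1024 * 1024)
    print(throttle_entry(8, 0, 1024 * 1024))  # limit reads on 8:0 to 1 MiB/s
```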
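2.2 cpu - CPU time quotas
CFS (Completely Fair Scheduler) policy - upper limit on CPU usage
- cpu.cfs_period_us, cpu.cfs_quota_us - required - used together. The former sets the length of a period (µs). The latter sets the maximum CPU time (µs) the cgroup's tasks may use in each period (-1, the default, means no limit). The quota is measured against a single CPU, so setting cfs_quota_us to twice cfs_period_us lets the cgroup fully use two CPUs.
- cpu.stat - CPU statistics: nr_periods (number of cfs_period_us periods elapsed), nr_throttled (number of times tasks in the cgroup were throttled), throttled_time (total time, in ns, that tasks in the cgroup were throttled)
- cpu.shares - optional - relative weight for sharing CPU time between cgroups (default 1024)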
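With the default cfs_period_us of 100000 µs (100 ms), the quota for N CPUs is simply N × 100000. The sketch below computes that value and derives a throttling ratio from cpu.stat. The function names are illustrative, and the cpu.stat sample is made up.

```python
def cfs_quota_us(cpus: float, period_us: int = 100_000) -> int:
    """cpu.cfs_quota_us value that caps a cgroup at `cpus` CPUs per period."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return round(cpus * period_us)


def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse cpu.stat lines such as "nr_throttled 25" into a dict."""
    stats: dict[str, int] = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            stats[key] = int(value)
    return stats


def throttled_ratio(stats: dict[str, int]) -> float:
    """Fraction of elapsed periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


if __name__ == "__main__":
    print(cfs_quota_us(2))    # two full CPUs with the 100 ms period
    print(cfs_quota_us(0.5))  # half of one CPU
    sample = "nr_periods 100\nnr_throttled 25\nthrottled_time 123456789\n"
    print(throttled_ratio(parse_cpu_stat(sample)))
```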
RT (Real-Time scheduler) policy - CPU time for real-time tasks
- cpu.rt_period_us, cpu.rt_runtime_us
Used together: the real-time tasks in the cgroup may run for at most cpu.rt_runtime_us microseconds in every cpu.rt_period_us microseconds.
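The RT parameters follow the same arithmetic. For example, to let the cgroup's real-time tasks run 0.2 s out of every second, set cpu.rt_period_us to 1000000 and cpu.rt_runtime_us to 200000. The tiny helper below is hypothetical and is shown only to make the calculation explicit.

```python
def rt_runtime_us(fraction: float, period_us: int = 1_000_000) -> int:
    """cpu.rt_runtime_us value letting real-time tasks run `fraction` of each period."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return round(fraction * period_us)


if __name__ == "__main__":
    # 0.2 s of every 1 s period
    print(rt_runtime_us(0.2), 1_000_000)
```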
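2.3 cpuacct - CPU usage reports
cpuacct.usage
- total CPU time consumed by all tasks in the cgroup, in ns
cpuacct.stat
- CPU time consumed by all tasks in the cgroup, split into user mode and kernel (system) mode, in USER_HZ units
cpuacct.usage_percpu
- CPU time consumed on each CPU by all tasks in the cgroup, in ns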
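Because cpuacct.usage is a cumulative counter, a utilization figure comes from sampling it twice. Here is a minimal sketch, assuming a cgroup v1 cpuacct directory is passed in as cgroup_dir. The function names are hypothetical.

```python
import os
import time


def cores_used(usage_start_ns: int, usage_end_ns: int, elapsed_s: float) -> float:
    """Average number of CPUs kept busy between two cpuacct.usage samples."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (usage_end_ns - usage_start_ns) / (elapsed_s * 1e9)


def sample_cores_used(cgroup_dir: str, interval_s: float = 1.0) -> float:
    """Read cpuacct.usage twice, interval_s apart, and convert the delta to CPUs."""
    path = os.path.join(cgroup_dir, "cpuacct.usage")

    def read_usage() -> int:
        with open(path) as f:
            return int(f.read().strip())

    start_usage, start_time = read_usage(), time.monotonic()
    time.sleep(interval_s)
    end_usage, end_time = read_usage(), time.monotonic()
    return cores_used(start_usage, end_usage, end_time - start_time)


if __name__ == "__main__":
    # 2e9 ns of CPU time within 1 s of wall-clock time means two CPUs were busy
    print(cores_used(0, 2_000_000_000, 1.0))
```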
2.4 cpuset - CPU binding
cpuset.cpus
- required - CPUs the cgroup may use; e.g. 0-2,16 means the four CPUs 0, 1, 2 and 16
cpuset.mems
- required - memory nodes the cgroup may use
cpuset.memory_migrate
- optional - whether pages are migrated to the new nodes when cpuset.mems changes, default 0
cpuset.cpu_exclusive
- optional - whether the CPUs are reserved exclusively for this cpuset, default 0
cpuset.mem_exclusive
- optional - whether the memory nodes are reserved exclusively for this cpuset, default 0
cpuset.mem_hardwall
- optional - whether kernel allocations of memory pages and buffer data for the cgroup's tasks are restricted to its memory nodes, default 0
cpuset.memory_pressure
- read-only - a running average of the memory pressure created by the processes in this cpuset
cpuset.memory_pressure_enabled
- optional - switch for cpuset.memory_pressure, default 0
cpuset.memory_spread_page
- optional - a flag (0 or 1) that specifies whether file system buffers should be spread evenly across the memory nodes allocated to this cpuset, default 0
cpuset.memory_spread_slab
- optional - a flag (0 or 1) that specifies whether kernel slab caches for file input/output operations should be spread evenly across the cpuset, default 0
cpuset.sched_load_balance
- optional - whether the kernel balances load across the CPUs in this cpuset, default 1
cpuset.sched_relax_domain_level
- optional - how widely load balancing searches when cpuset.sched_load_balance is enabled:
- -1 = Use the system default value for load balancing
- 0 = Do not perform immediate load balancing; balance loads only periodically
- 1 = Immediately balance loads across threads on the same core
- 2 = Immediately balance loads across cores in the same package
- 3 = Immediately balance loads across CPUs on the same node or blade
- 4 = Immediately balance loads across several CPUs on architectures with non-uniform memory access (NUMA)
- 5 = Immediately balance loads across all CPUs on architectures with NUMA
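The list syntax used by cpuset.cpus (and cpuset.mems) can be expanded and re-compressed as follows. This sketch covers only plain numbers and a-b ranges. The function names are illustrative.

```python
def parse_cpu_list(spec: str) -> list[int]:
    """Expand a cpuset list such as "0-2,16" into [0, 1, 2, 16]."""
    cpus: set[int] = set()
    for part in spec.strip().split(","):
        if not part:
            continue  # empty file content or trailing comma
        if "-" in part:
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def format_cpu_list(cpus) -> str:
    """Compress CPU numbers back into range syntax, e.g. [0, 1, 2, 16] -> "0-2,16"."""
    ids = sorted(set(cpus))
    parts: list[str] = []
    i = 0
    while i < len(ids):
        j = i
        # Extend the run while the numbers stay consecutive
        while j + 1 < len(ids) and ids[j + 1] == ids[j] + 1:
            j += 1
        parts.append(str(ids[i]) if i == j else f"{ids[i]}-{ids[j]}")
        i = j + 1
    return ",".join(parts)


if __name__ == "__main__":
    print(parse_cpu_list("0-2,16"))
    print(format_cpu_list([16, 0, 2, 1]))
```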
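2.5 devices - device access control for a cgroup
- Device whitelist/blacklist
devices.allow
- whitelist: devices the cgroup's tasks may access
devices.deny
- blacklist: devices the cgroup's tasks may not access
- Syntax: type device_types:node_numbers access
- type - b (block device), c (character device), a (all devices)
- access - r (read), w (write), m (mknod, i.e. create device files)
- devices.list - reports the current access rules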
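A rule for devices.allow or devices.deny is one line of the form "type major:minor access". In such a rule, * may stand for any major or minor number. Below is a minimal sketch that validates and builds such lines. The helper name is hypothetical. The /dev/null example uses character device 1:3.

```python
def device_rule(dev_type: str, major: str, minor: str, access: str) -> str:
    """Build a devices.allow / devices.deny line such as "c 1:3 rw".

    major and minor are strings so that the wildcard "*" can be used.
    """
    if dev_type not in ("a", "b", "c"):
        raise ValueError("type must be a, b or c")
    if not access or set(access) - set("rwm"):
        raise ValueError("access must be a combination of r, w and m")
    return f"{dev_type} {major}:{minor} {access}"


if __name__ == "__main__":
    print(device_rule("a", "*", "*", "rwm"))  # every device, full access
    print(device_rule("c", "1", "3", "rw"))   # /dev/null
```

A common pattern is to write the first rule to devices.deny and then whitelist individual devices through devices.allow.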
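2.6 freezer - suspend/resume a cgroup
- freezer.state is not available in the root cgroup
- freezer.state - FROZEN (suspended), FREEZING (being suspended), THAWED (running); only FROZEN and THAWED can be written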
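Freezing is asynchronous: after FROZEN is written, the file may report FREEZING until every task has stopped. The sketch below writes the target state and then polls freezer.state until it matches or a timeout expires. The function name, timeout and polling interval are assumptions. The demo runs against a temporary directory rather than a real cgroup.

```python
import os
import tempfile
import time


def set_freezer_state(cgroup_dir: str, target: str,
                      timeout_s: float = 5.0, poll_s: float = 0.05) -> bool:
    """Write FROZEN or THAWED to freezer.state and wait until it is reported back.

    Returns False if the state still differs (e.g. stuck in FREEZING)
    when timeout_s expires.
    """
    if target not in ("FROZEN", "THAWED"):
        raise ValueError("only FROZEN and THAWED can be written")
    path = os.path.join(cgroup_dir, "freezer.state")
    with open(path, "w") as f:
        f.write(target)
    deadline = time.monotonic() + timeout_s
    while True:
        with open(path) as f:
            if f.read().strip() == target:
                return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


if __name__ == "__main__":
    # Dry run: a temporary directory stands in for a freezer cgroup
    with tempfile.TemporaryDirectory() as fake_cgroup:
        print(set_freezer_state(fake_cgroup, "FROZEN"))
```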
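2.7 memory - memory limits
- memory.usage_in_bytes - reports the total memory currently used by processes in the cgroup, in bytes
- memory.memsw.usage_in_bytes - reports the memory plus swap space currently used by processes in the cgroup
- memory.max_usage_in_bytes - reports the maximum memory used by the cgroup
- memory.memsw.max_usage_in_bytes - reports the maximum memory plus swap used by the cgroup
- memory.limit_in_bytes - maximum memory for the cgroup, in bytes; the suffixes k, m and g are accepted; -1 removes the limit
- memory.memsw.limit_in_bytes - maximum memory plus swap, in bytes; the suffixes k, m and g are accepted; -1 removes the limit
- memory.failcnt - reports how many times memory usage hit the limit
- memory.memsw.failcnt - reports how many times memory plus swap usage hit the limit
- memory.force_empty - writing 0 while the cgroup has no tasks frees all memory pages used by the cgroup
- memory.swappiness - tendency to swap out pages; 60 is the default, lower values make swapping out less likely and higher values make it more likely
- memory.use_hierarchy - whether memory usage is accounted hierarchically, i.e. whether this cgroup's limits also cover its child cgroups
- memory.oom_control - 0 (default) enables the OOM killer, so tasks that exceed the limit are killed; 1 disables it, and such tasks are paused until memory becomes available
- memory.stat - reports a wide range of memory statistics: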
- cache - page cache, including tmpfs (shmem), in bytes
- rss - anonymous and swap cache, not including tmpfs (shmem), in bytes
- mapped_file - size of memory-mapped files, including tmpfs (shmem), in bytes
- pgpgin - number of pages paged into memory
- pgpgout - number of pages paged out of memory
- swap - swap usage, in bytes
- active_anon - anonymous and swap cache on active least-recently-used (LRU) list, including tmpfs (shmem), in bytes
- inactive_anon - anonymous and swap cache on inactive LRU list, including tmpfs (shmem), in bytes
- active_file - file-backed memory on active LRU list, in bytes
- inactive_file - file-backed memory on inactive LRU list, in bytes
- unevictable - memory that cannot be reclaimed, in bytes
- hierarchical_memory_limit - memory limit for the hierarchy that contains the memory cgroup, in bytes
- hierarchical_memsw_limit - memory plus swap limit for the hierarchy that contains the memory cgroup, in bytes
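Values written to memory.limit_in_bytes may use the k, m and g suffixes, which are binary multiples (1k = 1024 bytes). memory.stat is a list of name-value pairs. Below is a minimal Python sketch of both conversions. The function names are hypothetical, and the memory.stat sample is made up.

```python
_SUFFIXES = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_limit(value: str) -> int:
    """Convert a memory.limit_in_bytes style value ("512m", "1g", "-1") to bytes."""
    v = value.strip().lower()
    if v == "-1":
        return -1  # no limit
    if v and v[-1] in _SUFFIXES:
        return int(v[:-1]) * _SUFFIXES[v[-1]]
    return int(v)


def parse_memory_stat(text: str) -> dict[str, int]:
    """Parse memory.stat lines such as "rss 8192" into a dict."""
    stats: dict[str, int] = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            stats[key] = int(value)
    return stats


if __name__ == "__main__":
    print(parse_limit("512m"))
    stats = parse_memory_stat("cache 4096\nrss 8192\nswap 0\n")
    print(stats["cache"] + stats["rss"])
```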
2.8 net_cls
- net_cls.classid - the tc class handle assigned to the cgroup's packets; the traffic is then controlled with tc
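net_cls.classid is a 32-bit value 0xAAAABBBB. The upper 16 bits are the tc major handle and the lower 16 bits are the minor handle, both hexadecimal in tc notation. So tc class 10:1 corresponds to classid 0x100001. The conversion as a small sketch, with illustrative function names:

```python
def tc_handle_to_classid(handle: str) -> int:
    """Convert a tc class handle such as "10:1" (hex major:minor) to a net_cls.classid value."""
    major_s, minor_s = handle.split(":")
    major = int(major_s, 16)
    minor = int(minor_s, 16) if minor_s else 0
    if not (0 <= major <= 0xFFFF and 0 <= minor <= 0xFFFF):
        raise ValueError("major and minor must each fit in 16 bits")
    return (major << 16) | minor


def classid_to_tc_handle(classid: int) -> str:
    """Inverse conversion: 0x100001 -> "10:1"."""
    return f"{classid >> 16:x}:{classid & 0xFFFF:x}"


if __name__ == "__main__":
    classid = tc_handle_to_classid("10:1")
    print(hex(classid), classid)
```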
net_prio - sets the priority of the tasks' network traffic per network interface
- net_prio.prioidx - a read-only file which contains a unique integer value that the kernel uses as an internal representation of this cgroup.
- net_prio.ifpriomap - priority assigned to the cgroup's traffic on each network interface; entries: interface_name priority
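Each line of net_prio.ifpriomap pairs an interface name with a priority, and a new mapping is set by writing one such line. Here is a minimal sketch with hypothetical helper names. The interface eth0 is only an example.

```python
def parse_ifpriomap(text: str) -> dict[str, int]:
    """Parse net_prio.ifpriomap lines such as "eth0 5" into {"eth0": 5}."""
    mapping: dict[str, int] = {}
    for line in text.splitlines():
        if line.strip():
            interface, priority = line.split()
            mapping[interface] = int(priority)
    return mapping


def ifpriomap_entry(interface: str, priority: int) -> str:
    """Build the line to write into net_prio.ifpriomap, e.g. "eth0 5"."""
    if priority < 0:
        raise ValueError("priority must be non-negative")
    return f"{interface} {priority}"


if __name__ == "__main__":
    print(ifpriomap_entry("eth0", 5))
    print(parse_ifpriomap("lo 0\neth0 5\n"))
```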
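2.9 Other files
- tasks - IDs of all tasks (threads) in the cgroup
- cgroup.event_control - interface for registering event notifications (used, for example, for memory threshold and OOM notifications)
- cgroup.procs - thread group IDs (process IDs) of the processes in the cgroup
- release_agent (present in the root cgroup only) - the program run when a cgroup with notify_on_release enabled loses its last task
- notify_on_release - whether release_agent is run when the cgroup has no tasks left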
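Moving a process into a cgroup amounts to writing its PID to cgroup.procs (or a thread ID to tasks). The sketch below wraps those writes, and the notify_on_release switch, in small helpers. The function names are hypothetical. The demo uses a temporary directory, so it does not touch a real hierarchy.

```python
import os
import tempfile


def attach_process(cgroup_dir: str, pid: int) -> None:
    """Move a process (all of its threads) into the cgroup via cgroup.procs."""
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(str(pid))


def list_pids(cgroup_dir: str, filename: str = "cgroup.procs") -> list[int]:
    """Read cgroup.procs (process IDs) or tasks (thread IDs) as a list of ints."""
    with open(os.path.join(cgroup_dir, filename)) as f:
        return [int(line) for line in f if line.strip()]


def set_notify_on_release(cgroup_dir: str, enabled: bool) -> None:
    """Enable or disable running release_agent when the cgroup becomes empty."""
    with open(os.path.join(cgroup_dir, "notify_on_release"), "w") as f:
        f.write("1" if enabled else "0")


if __name__ == "__main__":
    # Dry run: a temporary directory stands in for a real cgroup
    with tempfile.TemporaryDirectory() as fake_cgroup:
        attach_process(fake_cgroup, os.getpid())
        print(list_pids(fake_cgroup) == [os.getpid()])
```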