Linux Kernel (2.6.13.2) Source Code Analysis

Miao Yanchao
Summary:
1 System startup
1.1 head.S and the assembly code before it
This code sets the initial CPU state, creates process 0, and sets up that process's stack:
movq init_rsp(%rip),%rsp
init_rsp is defined as:
.globl init_rsp
init_rsp:
.quad init_thread_union+THREAD_SIZE-8
The virtual address init_thread_union+THREAD_SIZE-8 thus becomes the bottom of the kernel stack of the current process (process 0).
init_thread_union is defined in the file arch/x86_64/kernel/init_task.c:
union thread_union init_thread_union __attribute__((__section__(".data.init_task"))) =
{ INIT_THREAD_INFO(init_task) };
INIT_THREAD_INFO is defined in the file include/asm-x86_64/thread_info.h and initializes init_thread_union.thread_info.task = &init_task. init_task is defined in the same file, init_task.c, and is initialized as:
struct task_struct init_task = INIT_TASK(init_task);
The INIT_TASK macro is defined in include/linux/init_task.h.
All initial values are fixed at compile time: the process control structure of process 0 is set up statically, so that process 0 can be handled like an ordinary kernel process.
init_task.mm = NULL; init_task.active_mm = &init_mm (init_mm itself is initialized with INIT_MM(init_mm)); init_task.comm = "swapper"
INIT_MM initializes init_mm.pgd to swapper_pg_dir, i.e. init_level4_pgt, which is defined in head.S. Process 0 is named swapper.
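The stack arrangement described above can be illustrated with a minimal user-space sketch. It assumes THREAD_SIZE is 8 KB, and the types thread_info_sketch and thread_union_sketch and the helper initial_rsp are hypothetical, reduced stand-ins that only mirror the names in the text; this is not the kernel's own definition.

```c
#include <assert.h>
#include <stdint.h>

#define THREAD_SIZE 8192UL  /* assumption: two 4 KB pages per kernel stack */

/* Hypothetical, reduced stand-in for struct thread_info. */
struct thread_info_sketch {
    void *task;             /* points back to the owning task structure */
};

/* thread_info and the kernel stack share one THREAD_SIZE block. */
union thread_union_sketch {
    struct thread_info_sketch thread_info;            /* at the lowest address */
    unsigned long stack[THREAD_SIZE / sizeof(long)];  /* stack grows down from the top */
};

/* Initial stack pointer: 8 bytes below the end of the union, as in init_rsp. */
static uintptr_t initial_rsp(const union thread_union_sketch *u)
{
    assert(u != 0);
    return (uintptr_t)u + THREAD_SIZE - 8;
}
```

Because the union is exactly THREAD_SIZE bytes, the stack pointer starts near the high end of the block while thread_info sits at the low end.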
The following assembly code jumps to the C function:
movl %esi,%edi    // pass the function parameter
movq initial_code(%rip),%rax
jmp *%rax
initial_code:
.quad x86_64_start_kernel
Execution continues in the C function x86_64_start_kernel(char *real_mode_data) in the file arch/x86_64/kernel/head64.c.
1.2 The function x86_64_start_kernel(char *real_mode_data)
Set the initial entry of all interrupt vectors to early_idt_handler and load the interrupt descriptor table from idt_descr
clear_bss(): clears the BSS segment
pda_init(0): sets up the per-processor information of processor 0 (the PDA, the per-processor data structure) and reloads CR3 with init_level4_pgt
copy_bootdata: copies the BIOS boot parameters into the kernel variable x86_boot_params, then copies the boot command line from x86_boot_params to saved_command_line and displays saved_command_line with printk; the real-mode data is not used again after this point
cpu_set: marks CPU 0 as started and running
Process the command-line arguments "earlyprintk=", "numa", "disableapic" and others
setup_boot_cpu_data(): fills in the CPU information structure boot_cpu_data using the cpuid instruction
Execute the start_kernel() function

1.3 The start_kernel function
1.3.1 Early architecture-independent initialization
lock_kernel(): the file lib/kernel_lock.c implements the BKL (big kernel lock), used through lock_kernel/unlock_kernel. If PREEMPT_BKL is enabled, the lock is implemented with the semaphore kernel_sem; otherwise it is implemented with the spinlock kernel_flag. PREEMPT_BKL is usually enabled by default.
lock_kernel performs down(&kernel_sem) when current->lock_depth equals -1, and then increments current->lock_depth.
unlock_kernel performs --current->lock_depth and, once the depth is back at -1, up(&kernel_sem).
page_address_init(): an empty function on x86-64.
printk(linux_banner): prints the kernel banner
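The nesting behaviour of lock_kernel/unlock_kernel described above can be sketched as follows. This is a single-threaded model under stated assumptions: the integer kernel_sem_count stands in for the semaphore kernel_sem, and task_sketch, lock_kernel_sketch and unlock_kernel_sketch are hypothetical names, not the kernel's code.

```c
#include <assert.h>

/* Hypothetical, reduced task structure: -1 means the BKL is not held. */
struct task_sketch {
    int lock_depth;
};

static int kernel_sem_count = 1;   /* stand-in for the semaphore kernel_sem */

static void lock_kernel_sketch(struct task_sketch *t)
{
    int depth = t->lock_depth + 1;

    if (depth == 0) {                   /* outermost acquisition: down(&kernel_sem) */
        assert(kernel_sem_count == 1);  /* single-threaded model: must be free */
        kernel_sem_count--;
    }
    t->lock_depth = depth;              /* nested calls only bump the depth */
}

static void unlock_kernel_sketch(struct task_sketch *t)
{
    assert(t->lock_depth >= 0);         /* unlocking without holding is a bug */
    if (--t->lock_depth < 0)            /* outermost release: up(&kernel_sem) */
        kernel_sem_count++;
}
```

Only the outermost lock and unlock touch the semaphore, which is what makes the lock recursive for its holder.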
1.3.2 Architecture-specific initialization: setup_arch(&command_line)
9 setup_memory_region():
  I sanitize_e820_map(E820_MAP, &E820_MAP_NR): cleans up the E820 map; E820_MAP and E820_MAP_NR are parameters taken from x86_boot_params
  II copy_e820_map(): calls add_memory_region() to add each valid address range to the structure struct e820map e820, the system's final E820 map
  III e820_print_map(): displays the final E820 map with printk, together with its data source: BIOS-e820, BIOS-88 or BIOS-e801
10 copy_edd(): if the build option EDD is enabled, copies the EDD information into the variable struct edd edd; the data (EDD_MBR_SIGNATURE and so on) is supplied by the boot parameters x86_boot_params. EDD: Enhanced Disk Drive Services; the data is passed on to drivers/firmware/edd.c, see include/linux/edd.h
11 Set up init_mm, code_resource and data_resource
12 parse_cmdline_early: parses the command-line parameters that are needed early
13 Set the CPU information structure boot_cpu_data again
14 end_pfn = e820_end_of_ram(): analyzes the E820 map and sets the memory-related global variables:
  end_user_pfn: the page count set by the boot parameter mem=xx;
  end_pfn_map: the page count of system RAM (main memory), i.e. the number of pages covered by the direct mapping, whose addresses are converted with the __va and __pa macros;
  end_pfn: the number of pages managed directly by the operating system
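How an end page frame number can be derived from an E820-style map may be sketched as below. The structure e820entry_sketch, the function end_of_ram_pfn and the limit_pfn parameter (modelling a mem= cap, 0 meaning no cap) are assumptions for illustration, not the kernel's e820_end_of_ram.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define E820_RAM   1

/* Hypothetical, reduced E820 entry. */
struct e820entry_sketch {
    uint64_t addr;   /* start of the range in bytes */
    uint64_t size;   /* length of the range in bytes */
    uint32_t type;   /* E820_RAM or some reserved type */
};

/* Highest page frame (exclusive) covered by usable RAM, optionally capped. */
static uint64_t end_of_ram_pfn(const struct e820entry_sketch *map, int n,
                               uint64_t limit_pfn)
{
    uint64_t end_pfn = 0;

    assert(map != 0 && n >= 0);
    for (int i = 0; i < n; i++) {
        /* Only whole pages count: round the start up and the end down. */
        uint64_t start = (map[i].addr + (1ULL << PAGE_SHIFT) - 1) >> PAGE_SHIFT;
        uint64_t end = (map[i].addr + map[i].size) >> PAGE_SHIFT;

        if (map[i].type != E820_RAM || start >= end)
            continue;
        if (end > end_pfn)
            end_pfn = end;
    }
    if (limit_pfn != 0 && end_pfn > limit_pfn)
        end_pfn = limit_pfn;            /* models a mem= style limit */
    return end_pfn;
}
```

For a map with RAM up to 1 GB the function returns 0x40000 pages when no limit is given.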
15 check_efer: reads the MSR register MSR_EFER (the extended feature enable register) and checks the extended features
16 init_memory_mapping(0, end_pfn_map << PAGE_SHIFT): builds the direct-mapped page tables
  I find_early_table_space: from the amount of memory to be mapped, computes the total size tables, in bytes, of the pud and pmd tables (2M pages, three page-table levels), each rounded up to an integer multiple of PAGE_SIZE. Using the E820 map, it then searches from physical address 8000h for tables bytes of contiguous physical memory, skipping the reserved interval [640KB, _end]. Under normal circumstances the area found starts right at 8000h. It sets the global variables table_end = table_start = start >> PAGE_SHIFT; the interval [table_start, table_end] holds the direct-mapping page tables.
  II Establishes the direct mapping of the interval [0, end_pfn_map << PAGE_SHIFT): allocates a page with alloc_low_page and maps it temporarily, fills the pud with phys_pud_init, sets the pgd entry, and removes the temporary mapping with unmap_low_page.
  III alloc_low_page(&map, &pud_phys): allocates the physical page table_end and increments table_end; the allocated page is temporarily mapped at virtual address 40M or 42M through the temporary page table temp_boot_pmds (defined in the file head.S).
  IV unmap_low_page(map): removes the temporary mapping at 40M or 42M set up by alloc_low_page
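The size calculation in find_early_table_space can be sketched as follows, assuming 8-byte table entries, 4 KB pages, 2 MB pmd mappings and 1 GB per pud entry. The names round_up_to and early_table_bytes are hypothetical; the sketch only reproduces the arithmetic described above.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL
#define PMD_SHIFT 21          /* one pmd entry maps 2 MB */
#define PUD_SHIFT 30          /* one pud entry maps 1 GB */
#define ENTRY_SIZE 8ULL       /* assumption: 8-byte page-table entries */

/* Round x up to a multiple of align (align must be a power of two). */
static uint64_t round_up_to(uint64_t x, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (x + align - 1) & ~(align - 1);
}

/* Bytes of pud and pmd tables needed to direct-map [0, end). */
static uint64_t early_table_bytes(uint64_t end)
{
    uint64_t puds = (end + (1ULL << PUD_SHIFT) - 1) >> PUD_SHIFT;
    uint64_t pmds = (end + (1ULL << PMD_SHIFT) - 1) >> PMD_SHIFT;

    /* Each table level is rounded up to whole pages separately. */
    return round_up_to(puds * ENTRY_SIZE, PAGE_SIZE) +
           round_up_to(pmds * ENTRY_SIZE, PAGE_SIZE);
}
```

Mapping 4 GB this way needs one page of pud entries and four pages of pmd entries, which is why a small area starting at 8000h normally suffices.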
17 acpi_boot_table_init (in arch/i386/kernel/acpi/boot.c): ACPI initialization
  I acpi_table_init (drivers/acpi/tables.c): ACPI table initialization (Initialize the ACPI boot-time table parser)
    i acpi_find_rsdp: locates the RSDP (Root System Description Pointer)
      A acpi_scan_rsdp(0, 0x400): searches the interval [0, 3FFh] for the RSDP signature "RSD PTR ".
      B acpi_scan_rsdp(0xE0000, 0x20000): searches the interval [E_0000h, F_FFFFh] for the RSDP signature "RSD PTR ".
      C Returns the address of the signature if the search succeeds, otherwise returns 0.
    ii Displays the "RSDP (rsdp->version, rsdp->oem_id, rsdp_phys)" information with printk.
    iii acpi_table_compute_checksum: computes the checksum of the rsdp
    iv acpi_table_get_sdt(rsdp), taking version 2.0 and above as the example:
      A sdt_pa = ((struct acpi20_table_rsdp *)rsdp)->xsdt_address: gets the physical address of the XSDT
      B header = __acpi_map_table(sdt_pa): gets the virtual address of the ACPI table header. x86-64 uses the direct mapping (__va); i386 also uses the __va direct mapping when sdt_pa does not exceed 8M, and a fixed mapping above 8M
      C mapped_xsdt = __acpi_map_table(sdt_pa): maps the entire XSDT (Extended System Description Table)
      D Checks the XSDT header signature "XSDT" and the checksum
      E Sets sdt_count and stores the physical address of each XSDT entry in sdt_entry[i].pa
      F acpi_table_print(header, sdt_pa): displays the header information with printk
      G For the physical address sdt_entry[i].pa of each XSDT entry: maps it with __acpi_map_table(sdt_entry[i].pa) to a struct acpi_table_header, calls acpi_table_print to display it, and computes its checksum. Sets the sdt_entry[i].size field, and compares the signature header->signature with the names in the array acpi_table_signatures to set the sdt_entry[i].id field. The array acpi_table_signatures is defined in a rather unusual form.
      H acpi_get_table_header_early: looks up ACPI_DSDT and calls acpi_table_print to print it; the physical address is unknown and is set to 0
  II acpi_table_parse(ACPI_BOOT, acpi_parse_sbf): searches sdt_entry for the ACPI_BOOT entry and calls acpi_parse_sbf(sdt_entry[?].pa, sdt_entry[?].size), which maps sb = __acpi_map_table(sdt_entry[?].pa, size) and sets sbf_port = sb->sbf_cmos
  III acpi_blacklisted (drivers/acpi/blacklist.c): checks whether any ACPI table in sdt_entry[*] matches an ID listed in acpi_blacklist[]; a match is reported as an error and may cause ACPI to be switched off by calling acpi_disable
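The signature search and checksum test used when locating the RSDP can be sketched like this. It assumes the signature sits on a 16-byte boundary and that a valid structure sums to zero modulo 256; scan_rsdp_sketch and table_checksum_sketch are hypothetical names. Unlike the routine described above, the sketch returns -1 rather than 0 on failure, because offset 0 is a valid hit inside a plain buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sum of all bytes modulo 256; a valid ACPI structure sums to 0. */
static unsigned char table_checksum_sketch(const unsigned char *p, size_t len)
{
    unsigned char sum = 0;

    while (len--)
        sum = (unsigned char)(sum + *p++);
    return sum;
}

/* Search [mem, mem + length) on 16-byte boundaries for "RSD PTR ".
 * Returns the offset of the signature, or -1 if it is not found. */
static long scan_rsdp_sketch(const unsigned char *mem, size_t length)
{
    assert(mem != NULL);
    for (size_t off = 0; off + 8 <= length; off += 16)
        if (memcmp(mem + off, "RSD PTR ", 8) == 0)
            return (long)off;
    return -1;
}
```

A caller would scan each candidate region in turn and accept the first hit whose checksum is zero.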

18 acpi_numa_init: requires the build option ACPI_NUMA
  I acpi_table_parse(ACPI_SRAT, acpi_parse_srat): parses the SRAT (System Resource Affinity Table)
  II acpi_table_parse_srat(ACPI_SRAT_PROCESSOR_AFFINITY, acpi_parse_processor_affinity, NR_CPUS);
    i acpi_table_parse_madt_family(ACPI_SRAT, sizeof(struct acpi_table_srat), ACPI_SRAT_PROCESSOR_AFFINITY, acpi_parse_processor_affinity, NR_CPUS):
      A Locates the MADT-family table; here its ID in sdt_entry[*] is ACPI_SRAT
      B Finds the entries with ID = ACPI_SRAT_PROCESSOR_AFFINITY in that table and calls acpi_parse_processor_affinity for each entry
      C acpi_parse_processor_affinity:
        a acpi_table_print_srat_entry: prints the entry
        b acpi_numa_processor_affinity_init(processor_affinity):
          (1) pxm = processor_affinity->proximity_domain;
          (2) setup_node(pxm):
            (I) nodes_weight(nodes_found): ends up calling generic_hweight64(nodes_found) to count the bits set to 1 in nodes_found
            (II) first_unset_node: finds the first node number whose bit is zero
            (III) node_set(node, nodes_found)
            (IV) pxm2node[pxm] = node
          (3) cpu_to_node[num_processors] = node, acpi_numa = 1
          (4) Displays: printk(KERN_INFO "SRAT: PXM %u -> APIC %u -> CPU %u -> Node %u\n", pxm, pa->apic_id, num_processors, node)
          (5) Increments the processor count: num_processors++
  III acpi_table_parse_srat(ACPI_SRAT_MEMORY_AFFINITY, acpi_parse_memory_affinity, NR_NODE_MEMBLKS): similar to the previous step, except that the function called at the end is acpi_parse_memory_affinity, which in turn calls acpi_numa_memory_affinity_init to handle the memory of a node and to set the nodes and nodes_parsed fields. A comment says this is mainly used for IA64.
  IV acpi_table_parse(ACPI_SLIT, acpi_parse_slit): parses the SLIT (System Locality Information Table)
  V acpi_numa_arch_fixup: an empty function
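The proximity-domain bookkeeping performed by setup_node can be sketched as follows. The 64-bit mask nodes_found_sk, the table pxm2node_sk, the marker 0xff for "unassigned" and all function names are assumptions for illustration; the sketch only mirrors the four sub-steps listed above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_NUMNODES_SK 64
#define PXM_NONE        0xff     /* assumption: marks an unassigned domain */

static uint64_t nodes_found_sk;          /* bit n set: node n is in use */
static unsigned char pxm2node_sk[256];   /* proximity domain -> node number */

static void srat_reset_sk(void)
{
    nodes_found_sk = 0;
    memset(pxm2node_sk, PXM_NONE, sizeof pxm2node_sk);
}

/* Count the bits set to 1 (what nodes_weight boils down to). */
static int hweight64_sk(uint64_t w)
{
    int n = 0;

    while (w) {
        w &= w - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Return the node for a proximity domain, assigning the first free one. */
static int setup_node_sk(int pxm)
{
    assert(pxm >= 0 && pxm < 256);
    if (pxm2node_sk[pxm] == PXM_NONE) {
        int node = 0;

        if (hweight64_sk(nodes_found_sk) >= MAX_NUMNODES_SK)
            return -1;                        /* no free node number left */
        while (nodes_found_sk & (1ULL << node))
            node++;                           /* first unset node */
        nodes_found_sk |= 1ULL << node;       /* node_set(node, nodes_found) */
        pxm2node_sk[pxm] = (unsigned char)node;
    }
    return pxm2node_sk[pxm];
}
```

Repeated calls with the same proximity domain return the same node, so several processors of one domain end up on one node.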
19 With the build option NUMA enabled, numa_initmem_init(0, end_pfn) is called; otherwise contig_initmem_init(0, end_pfn) is called
20 numa_initmem_init:
  I If the build option NUMA_EMU is enabled, calls numa_emulation(0, end_pfn); on success numa_initmem_init returns. This option is mainly for debugging
  II If the build option ACPI_NUMA is enabled, calls acpi_scan_nodes(0, end_pfn << PAGE_SHIFT); on success numa_initmem_init returns
    i compute_hash_shift: computes memnode_shift
    ii Displays: printk(KERN_DEBUG "Using %d for the hash shift. Max adder is %lx\n", shift, maxend);
    iii According to the nodes array, calls setup_node_bootmem(i, nodes[i].start, nodes[i].end) for each node
    iv numa_init_array: sets cpu_to_node, the node_to_cpumask mapping and other fields.
  III If the build option K8_NUMA is enabled, calls k8_scan_nodes(0, end_pfn << PAGE_SHIFT); on success numa_initmem_init returns. The structures defined in k8_scan_nodes support at most 8 nodes.
    i find_northbridge: finds the device of the CPU northbridge that provides the memory address map (function 0: HyperTransport Technology Configuration; function 1: Address Map), (VendorID:DeviceID) = (1022:1100/1101), and returns the device number
    ii Displays: printk(KERN_INFO "Scanning NUMA topology in Northbridge %d\n", nb);
    iii Reads the information at offset 60h (NodeID) of northbridge device function 0 (1022:1100) and computes the number of nodes in the system, that is, the number of processors.
    iv Displays: printk(KERN_INFO "Number of nodes %d\n", numnodes)
    v Reads offsets 40h-7Ch of northbridge device function 1 to obtain the memory layout and the nodeid that each memory address range belongs to, and records them in the local variable nodes: nodes[nodeid].start = base, nodes[nodeid].end = limit; valid nodeids are marked in nodes_parsed.
    vi memnode_shift = compute_hash_shift(nodes, numnodes)
      A maxend = MAX{nodes[*].end}
      B Increments shift while (1UL << shift) < maxend / NODEMAPSIZE holds, NODEMAPSIZE = 0xFF, and returns the resulting shift, i.e. the smallest value that no longer satisfies the condition
      C For every memory address addr, at a granularity of (1UL << shift), sets memnodemap[addr >> shift] = i, where i is the node number (0-7)
      Note: memnode_shift divides the total physical memory into at most 255 segments of capacity (1UL << shift) each; memnodemap[0..254] records the node number that each segment belongs to
    vii Displays: printk(KERN_INFO "Using node hash shift of %d\n", memnode_shift)
    viii For every node that has physical memory configured: sets cpu_to_node[i] = i and calls setup_node_bootmem(i, nodes[i].start, nodes[i].end):
      A start = round_up(start, ZONE_ALIGN): rounds up the start address; ZONE_ALIGN: 1 << (MAX_ORDER + PAGE_SHIFT) = 8MB
      B Displays: printk("Bootmem setup node %d %016lx-%016lx\n", nodeid, start, end)
      C memory_present: marks mem_section[section].section_mem_map as valid for every section of this node
        NOTE: struct mem_section mem_section[NR_MEM_SECTIONS (8192)] occupies 64KB in total
      D nodedata_phys = find_e820_area(start, end, pgdat_size): allocates memory for the pg_data_t structure from this node's own memory, sized and aligned to whole pages.
      E node_data[nodeid] = phys_to_virt(nodedata_phys); node_data has 64 elements
      F node_data[nodeid]->bdata = &plat_node_bdata[nodeid] (a global variable)
      G node_data[nodeid]->node_start_pfn = start_pfn (the first page number of this node)
      H node_data[nodeid]->node_spanned_pages = end_pfn - start_pfn (the number of pages of this node)
      I bootmap_pages = bootmem_bootmap_pages(end_pfn - start_pfn): computes the number of pages needed for a bitmap covering all memory of this node
      J bootmap_start = round_up(nodedata_phys + pgdat_size, PAGE_SIZE): the page-aligned base address of the free physical memory of this node
      K bootmap_start = find_e820_area(bootmap_start, end, bootmap_pages << PAGE_SHIFT): allocates the bitmap memory of this node
      L bootmap_size = init_bootmem_node(node_data[nodeid], bootmap_start >> PAGE_SHIFT, start_pfn, end_pfn): directly calls init_bootmem_core(pgdat, freepfn/mapstart, startpfn, endpfn), passing the parameters through in the same order
        a bdata = node_data[nodeid]->bdata (previously set to &plat_node_bdata[nodeid])
        b bdata->node_bootmem_map = phys_to_virt(mapstart << PAGE_SHIFT)
        c bdata->node_boot_start = (start << PAGE_SHIFT) (set again)
        d bdata->node_low_pfn = end
        e Sets the whole bitmap area bdata->node_bootmem_map to 1, which reserves all memory
        f Returns the bitmap size, aligned to 8 bytes
      M e820_bootmem_free(node_data[nodeid], start, end): for every area of this node that the e820 map marks as valid memory of type E820_RAM, calls free_bootmem_node(node_data[nodeid], addr, last-addr), which directly calls free_bootmem_core(node_data[nodeid]->bdata, physaddr, size); the bits of all valid pages in bdata->node_bootmem_map are cleared, marking the memory as free.
      N Reserves the memory occupied by the pg_data_t structure node_data[nodeid]
      O Reserves the bitmap memory at bootmap_start
      P Marks this node as valid in node_online_map
    ix numa_init_array: for the other CPUs (those without physical memory configured), sets cpu_to_node[i], marks the node number of each such CPU as valid in node_online_map, and sets bit 0 of node_to_cpumask[cpu_to_node(0)] to 1
  IV Without a NUMA configuration:
    i memnode_shift = 63, memnodemap[0] = 0; node_online_map is cleared and only NODE 0 is marked valid
    ii cpu_to_node[*] = 0, node_to_cpumask[0] = cpumask_of_cpu(0)
    iii setup_node_bootmem(0, 0, end_pfn << PAGE_SHIFT): puts all physical memory under the management of NODE 0
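The hash described for compute_hash_shift and memnodemap can be sketched as below. The starting shift of 20 (1 MB granularity) is an assumption, the map has one spare slot so that the sketch cannot index out of range, and all names ending in _sk are hypothetical. The alignment check that a real implementation would need is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NODEMAPSIZE 0xff

/* Hypothetical, reduced node descriptor: [start, end) in bytes. */
struct node_sketch {
    uint64_t start, end;
};

static unsigned char memnodemap_sk[NODEMAPSIZE + 1];  /* spare slot for safety */
static int memnode_shift_sk;

/* Choose the segment size and fill the segment -> node table. */
static int compute_hash_shift_sk(const struct node_sketch *nodes, int numnodes)
{
    int shift = 20;          /* assumption: start at 1 MB granularity */
    uint64_t maxend = 0;

    assert(nodes != 0 && numnodes > 0 && numnodes <= 8);
    for (int i = 0; i < numnodes; i++)
        if (nodes[i].end > maxend)
            maxend = nodes[i].end;
    while ((1ULL << shift) < maxend / NODEMAPSIZE)
        shift++;             /* grow segments until the table can cover maxend */
    memset(memnodemap_sk, 0xff, sizeof memnodemap_sk);   /* 0xff: no node */
    for (int i = 0; i < numnodes; i++)
        for (uint64_t addr = nodes[i].start; addr < nodes[i].end;
             addr += 1ULL << shift)
            memnodemap_sk[addr >> shift] = (unsigned char)i;
    return shift;
}

/* Look up the node of a physical address, as phys_to_nid does. */
static int phys_to_nid_sk(uint64_t addr)
{
    return memnodemap_sk[addr >> memnode_shift_sk];
}
```

With two nodes of 2 GB each, the segments come out as 32 MB (shift 25), and a lookup costs one shift and one table access.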
21 contig_initmem_init(0, end_pfn):
  I memory_present(0, start_pfn, end_pfn)
  II bootmap_size = bootmem_bootmap_pages(end_pfn) << PAGE_SHIFT: computes the bitmap size needed to cover the whole memory
  III bootmap = find_e820_area(0, end_pfn << PAGE_SHIFT, bootmap_size): allocates the bitmap space
  IV bootmap_size = init_bootmem(bootmap >> PAGE_SHIFT, end_pfn):
    i max_low_pfn = pages, min_low_pfn = start
    ii init_bootmem_core(NODE_DATA(0), start, 0, pages): sets up the bitmap
  V e820_bootmem_free(NODE_DATA(0), 0, end_pfn << PAGE_SHIFT): frees all valid memory according to the e820 map
  VI reserve_bootmem(bootmap, bootmap_size): reserves the bitmap memory
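The boot-time bitmap convention used by these steps (everything reserved at first, a cleared bit means free, a set bit means reserved) can be sketched as follows. The 64-page map and all names are assumptions chosen for illustration; this is not the kernel's bootmem allocator.

```c
#include <assert.h>
#include <string.h>

#define NPAGES_SK 64

static unsigned char bootmem_map_sk[NPAGES_SK / 8];   /* one bit per page */

/* Like init_bootmem_core: start with every page marked reserved. */
static void bootmem_init_sk(void)
{
    memset(bootmem_map_sk, 0xff, sizeof bootmem_map_sk);
}

/* Set (reserve) or clear (free) the bits of count pages from start. */
static void bootmem_mark_sk(unsigned long start, unsigned long count, int reserved)
{
    assert(start + count <= NPAGES_SK);
    for (unsigned long i = start; i < start + count; i++) {
        unsigned char bit = (unsigned char)(1u << (i % 8));

        if (reserved)
            bootmem_map_sk[i / 8] |= bit;
        else
            bootmem_map_sk[i / 8] &= (unsigned char)~bit;
    }
}

static void bootmem_free_sk(unsigned long start, unsigned long count)
{
    bootmem_mark_sk(start, count, 0);
}

static void bootmem_reserve_sk(unsigned long start, unsigned long count)
{
    bootmem_mark_sk(start, count, 1);
}

static int bootmem_is_free_sk(unsigned long pfn)
{
    assert(pfn < NPAGES_SK);
    return !(bootmem_map_sk[pfn / 8] & (1u << (pfn % 8)));
}
```

Starting from "all reserved" means that only ranges explicitly released from the e820 map ever become allocatable, and later reservations simply set bits again.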

22 reserve_bootmem_generic(table_start << PAGE_SHIFT, (table_end - table_start) << PAGE_SHIFT): reserves the memory of the direct-mapping page tables
  I int nid = phys_to_nid(phys): obtains nid from memnodemap[addr >> memnode_shift]
  II reserve_bootmem_node(NODE_DATA(nid), phys, len): directly calls reserve_bootmem_core(pgdat->bdata, physaddr, size), which sets the corresponding bits in bdata->node_bootmem_map to mark the pages as reserved
23 Reserves the memory of the kernel image, i.e. the region [1M, __pa(_end)], and reserves physical page 0
24 reserve_ebda_region(): reserves the EBDA area
25 If the SMP option is enabled: reserves the memory of the trampoline area, namely page 6
26 If the option ACPI_SLEEP is enabled, acpi_reserve_bootmem: calls alloc_bootmem_low to allocate physical memory and saves its address in acpi_wakeup_address
27 Searches for the SMP configuration:
  I find_smp_config: directly calls find_intel_smp, which uses smp_scan_config to search the intervals [0, 1K), [639K, 640K) and [960K, 1024K) for the SMP configuration and returns on success. On failure it reads the data addr at physical address 40Eh, multiplies it by 16 to obtain the base address base, and calls smp_scan_config again to search the interval [base, base + 4K)
  II smp_scan_config: looks for a region that starts with the MP table signature "_MP_"; if the other parameters and the checksum also match, the MP table has been found: the page containing it is reserved and smp_found_config = 1 is set. If the MP table also carries a configuration table address, i.e. the first field after the signature is not 0, the page of the configuration table is reserved as well
28 If the option BLK_DEV_INITRD is enabled and the initrd region is valid, reserves the initrd region; base address: INITRD_START; length: INITRD_SIZE
29 If the option KEXEC is enabled, reserves the area specified by crashk_res. The area is specified by the boot parameter "crashkernel="
  KEXEC NOTE: kexec is a system call that implements the ability to shutdown your current kernel, and to start another kernel. It is like a reboot but it is independent of the system firmware. And like a reboot you can start any kernel with it, not just Linux.
30 sparse_init: effective when the build option SPARSEMEM is enabled. For every valid memory section, say the pnum-th one, calls alloc_bootmem_node to allocate a map area from the memory of the node that the section belongs to, holding one page structure for each page of the section, and sets mem_section[pnum].section_mem_map |= map - section_nr_to_pfn(pnum)
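Storing map - section_nr_to_pfn(pnum) means that a page structure can later be found by adding the global page frame number directly, without first subtracting the section's starting pfn. A minimal sketch of that encoding follows; it assumes 2^15 pages per section, does the arithmetic on integers to stay within defined C behaviour, and uses hypothetical names throughout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PFN_SECTION_SHIFT_SK 15   /* assumption: 2^15 pages per section */

/* Hypothetical, reduced page structure. */
struct page_sketch {
    unsigned long flags;
};

static unsigned long section_nr_to_pfn_sk(unsigned long pnum)
{
    return pnum << PFN_SECTION_SHIFT_SK;
}

/* Encode the map so that adding a global pfn yields the page structure. */
static uintptr_t encode_mem_map_sk(struct page_sketch *map, unsigned long pnum)
{
    assert(map != NULL);
    return (uintptr_t)map -
           (uintptr_t)section_nr_to_pfn_sk(pnum) * sizeof(struct page_sketch);
}

/* Decode: one multiply and one add, no per-section base lookup. */
static struct page_sketch *pfn_to_page_sk(uintptr_t encoded, unsigned long pfn)
{
    return (struct page_sketch *)(encoded +
           (uintptr_t)pfn * sizeof(struct page_sketch));
}
```

The subtraction is paid once at initialization time, and every later pfn-to-page conversion becomes cheaper.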
31 paging_init (NUMA version): according to node_possible_map, calls setup_node_zones(i) for every valid node i:
  I start_pfn, end_pfn: the first and last page number of this node's memory; the range may contain holes
  II Sets zones[ZONE_DMA] and zones[ZONE_NORMAL]: the size of the part of the memory interval [start_pfn, end_pfn] that is valid in the e820 map
  III Sets holes[ZONE_DMA] and holes[ZONE_NORMAL]: the size of the part of the memory interval [start_pfn, end_pfn] that is invalid in the e820 map, i.e. the memory holes inside this node
    Note: if start_pfn >= dma_end_pfn (16M), then zones[ZONE_DMA] and holes[ZONE_DMA] are both 0
  IV free_area_init_node(nodeid, NODE_DATA(nodeid), zones, start_pfn, holes)
    i pgdat = NODE_DATA(nodeid)
    ii pgdat->node_id = nid; pgdat->node_start_pfn = node_start_pfn;
    iii calculate_zone_totalpages(pgdat, zones_size, zholes_size):
      A Sets pgdat->node_spanned_pages = SUM{zones_size[*]} again; the value is the same as before
      B Sets pgdat->node_present_pages = SUM{zones_size[*]} - SUM{zholes_size[*]}
      C Displays: printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id, realtotalpages)
    iv alloc_node_mem_map(pgdat)
      A If the build option FLAT_NODE_MEM_MAP is enabled, the following steps are performed; otherwise the function is empty
      B Sets pgdat->node_mem_map = alloc_bootmem_node(pgdat, size), allocating memory from this node to hold the page structures
      C If the build option FLATMEM is also enabled and pgdat == NODE_DATA(0), sets mem_map = NODE_DATA(0)->node_mem_map
    v free_area_init_core(pgdat, zones_size, zholes_size): initializes the wait queue head pgdat->kswapd_wait and sets pgdat->kswapd_max_order = 0; then, for every zone of this node (x86-64 has at most 2 valid zones), performs the following steps:
      A Adds the actual page count of this zone to nr_kernel_pages and nr_all_pages
      B Sets zone->spanned_pages to the page count including holes and zone->present_pages to the count of actually valid pages
      C Sets the zone name: zone->name points to zone_names
      D Sets up the per-CPU page set of each CPU: zone->pageset[cpu] = &boot_pageset[cpu], and calls setup_pageset to initialize it: pcp->high = 255 * 6, pcp->low = 255 * 2, pcp->batch = 255.
      E Displays: printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n", zone_names[j], realsize, batch). Note: batch should be 255
      F Initializes the wait table: zone->wait_table_xx and so on
      G Sets zone->zone_start_pfn = zone_start_pfn
      H Sets zone->zone_mem_map = pfn_to_page(zone_start_pfn)
      I memmap_init(size, nid, j, zone_start_pfn): directly calls memmap_init_zone with exactly the same parameters
      J For every physical page of this zone, sets the attributes of its page structure:
        a Sets page->flags
        b Marks the page as Reserved
      K zonetable_add(zone, nid, j, zone_start_pfn, size): sets the global variable zone_table
      L zone_init_free_lists(pgdat, zone, zone->spanned_pages): initializes zone->free_area to empty
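The zone sizing in setup_node_zones and the page totals in calculate_zone_totalpages can be sketched as below. The sketch assumes a DMA boundary of 16 MB with 4 KB pages (4096 page frames) and only two zones; split_zones_sk and present_pages_sk are hypothetical names, and the hole sizes are taken as given rather than derived from an e820 map.

```c
#include <assert.h>

#define MAX_DMA_PFN_SK 4096UL   /* 16 MB / 4 KB pages */

enum { ZONE_DMA_SK, ZONE_NORMAL_SK, NR_ZONES_SK };

/* Split the page range [start_pfn, end_pfn) of a node into the two zones. */
static void split_zones_sk(unsigned long start_pfn, unsigned long end_pfn,
                           unsigned long zones[NR_ZONES_SK])
{
    assert(start_pfn <= end_pfn);
    zones[ZONE_DMA_SK] = 0;
    zones[ZONE_NORMAL_SK] = 0;
    if (start_pfn < MAX_DMA_PFN_SK) {
        unsigned long dma_end = end_pfn < MAX_DMA_PFN_SK ? end_pfn : MAX_DMA_PFN_SK;

        zones[ZONE_DMA_SK] = dma_end - start_pfn;
        zones[ZONE_NORMAL_SK] = end_pfn - dma_end;
    } else {
        /* Node starts above 16 MB: no DMA zone at all. */
        zones[ZONE_NORMAL_SK] = end_pfn - start_pfn;
    }
}

/* present pages = spanned pages minus the pages lost to holes. */
static unsigned long present_pages_sk(const unsigned long zones[NR_ZONES_SK],
                                      const unsigned long holes[NR_ZONES_SK])
{
    unsigned long spanned = 0, absent = 0;

    for (int i = 0; i < NR_ZONES_SK; i++) {
        spanned += zones[i];
        absent += holes[i];
    }
    assert(absent <= spanned);
    return spanned - absent;
}
```

A node that begins above the DMA boundary gets an empty DMA zone, matching the note under step III.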
32 check_ioapic: handles some quirks of VIA and Nvidia motherboards
33 If the build option ACPI_BOOT is enabled, acpi_boot_init is executed: it processes ACPI_BOOT, ACPI_FADT, the MADT (Multiple APIC Description Table) and ACPI_HPET (which requires the build option HPET_TIMER)
34 get_smp_config:
  I If acpi_lapic && acpi_ioapic are both set, prints printk(KERN_INFO "Using ACPI (MADT) for SMP configuration information\n") and returns
  II Otherwise prints: printk("Intel MultiProcessor Specification v1.%d\n", mpf->mpf_specification)
  III Calls smp_read_mpc to parse the MPC (multiple processor config) table and handle the information about the processors
    i Checks that the signature of the MPC table is "PCMP", and checks the checksum, the version number and whether a LAPIC is present
    ii Prints: printk(KERN_INFO "OEM ID: %s", str)
    iii Prints: printk(KERN_INFO "Product ID: %s", str)
    iv Prints: printk(KERN_INFO "APIC at: 0x%X\n", mpc->mpc_lapic)
    v Parses the entries of the MPC table
    vi For a processor entry, calls MP_processor_info:
      A Prints: printk(KERN_INFO "Processor #%d %d:%d APIC version %d\n", ...);
      B Increments the total CPU count num_processors
      C physid_set(m->mpc_apicid, phys_cpu_present_map): records the physical APIC ID of the current CPU in the global variable phys_cpu_present_map
      D Sets bios_cpu_apicid[cpu] = x86_cpu_to_apicid[cpu] = m->mpc_apicid, where cpu = 0 for the BP and, for an AP, cpu is the CPU's sequence number in the MPC table
      E Marks the current CPU as valid in the global variables cpu_possible_map and cpu_present_map
    vii For a bus entry, calls MP_bus_info
    viii For an IOAPIC entry, calls MP_ioapic_info, which prints: printk("I/O APIC #%d Version %d at 0x%X.\n", ...);
    ix For an interrupt source entry, calls MP_intsrc_info
    x For a local interrupt source entry, calls MP_lintsrc_info
35 init_apic_mappings
  I Establishes the fixed mapping FIX_APIC_BASE
  II Establishes the fixed mappings FIX_IO_APIC_BASE_0 and so on for the IOAPICs
36 probe_roms: records the resources (address space and so on) occupied by ROMs
37 e820_reserve_resources: according to the e820 information, reserves the e820 memory ranges in iomem_resource (whose initial value is the whole address space), and records within each e820 resource the space occupied by the code and data segments of the kernel image
38 Reserves in iomem_resource the resource space occupied by video RAM
39 Reserves in ioport_resource the resource space occupied by the standard I/O devices
40 If the build option GART_IOMMU is enabled, calls iommu_hole_init
41 If the build option VGA_CONSOLE is enabled (normally the case), sets conswitchp = &vga_con; otherwise, if DUMMY_CONSOLE is enabled, sets conswitchp = &dummy_con
End of setup_arch
1.3.3 Late architecture-independent initialization in start_kernel
42 setup_per_cpu_areas: for every CPU of the system, calls alloc_bootmem to allocate a memory area ptr, copies the contents of the data area [__per_cpu_start, __per_cpu_end] into ptr, and sets cpu_pda[cpu].data_offset = ptr - __per_cpu_start
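The data_offset scheme of setup_per_cpu_areas can be sketched in user space as follows. The template structure percpu_block stands in for the region [__per_cpu_start, __per_cpu_end], malloc stands in for alloc_bootmem, and every name is a hypothetical illustration. Offsets are kept as unsigned integers so that the arithmetic is well defined in portable C.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NR_CPUS_SK 4

/* Stand-in for the per-CPU data section and its initial values. */
struct percpu_block {
    int counter;
    long stamp;
};

static struct percpu_block percpu_template = { 7, 42 };
static uintptr_t data_offset_sk[NR_CPUS_SK];   /* copy address - template address */

/* Give every CPU a private copy of the template and remember the offset. */
static int setup_per_cpu_areas_sk(void)
{
    for (int cpu = 0; cpu < NR_CPUS_SK; cpu++) {
        void *ptr = malloc(sizeof percpu_template);

        if (ptr == NULL)
            return -1;
        memcpy(ptr, &percpu_template, sizeof percpu_template);
        data_offset_sk[cpu] = (uintptr_t)ptr - (uintptr_t)&percpu_template;
    }
    return 0;
}

/* Address of one CPU's copy of a variable: template address + offset. */
static int *per_cpu_counter_sk(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_SK);
    return (int *)((uintptr_t)&percpu_template.counter + data_offset_sk[cpu]);
}
```

Code refers to a per-CPU variable by its address in the template section; adding the current CPU's offset redirects the access to that CPU's private copy.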
43 smp_prepare_boot_cpu: marks this processor as valid in the global variables cpu_online_map, cpu_callout_map, cpu_sibling_map[0] and cpu_core_map
44 sched_init: initializes the run queue (runqueue) of every CPU, increments the reference count init_mm.mm_count, and initializes the idle process of the current CPU
45 build_all_zonelists: for every node i of the system, calls build_zonelists(NODE_DATA(i)), with pgdat = NODE_DATA(i):
  I Initializes pgdat->node_zonelists[*].zones[0] = NULL
  II Traverses all nodes of the system from near to far, according to the distance between this node and every other node, calling build_zonelists_node so that pgdat->node_zonelists[*].zones[i] point to the zones of the corresponding type in each node
    Note: pgdat (of type struct pglist_data {}) defines:
    struct zone node_zones[MAX_NR_ZONES]; // MAX_NR_ZONES = 3
    struct zonelist node_zonelists[GFP_ZONETYPES]; // GFP_ZONETYPES = 3
    and zonelist is defined as:
    struct zonelist {
    struct zone *zones[MAX_NUMNODES * MAX_NR_ZONES + 1];
    };
    That is, a zonelist is an array of zone pointers whose length is the number of zones per node multiplied by the number of nodes in the system, plus one terminating NULL pointer, so the pointers of a zonelist can reach every zone of every node in the system.
    The initialization described above makes the pointers in each node's zonelists point to the zones of all nodes in the system, ordered by node distance with the nearest node first.
  III Displays: printk("Built %i zonelists\n", num_online_nodes())
  IV cpuset_init_current_mems_allowed: sets current->mems_allowed = NODE_MASK_ALL
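The ordering that build_zonelists produces, nearest node first and within a node the higher zone types before the lower ones, can be sketched as follows. The fixed sizes (3 nodes, 2 zone types), the distance matrix passed in by the caller, the terminator value -1 and all names are assumptions; the real structure holds zone pointers and ends with NULL.

```c
#include <assert.h>

#define NODES_SK 3
#define ZONES_SK 2   /* 0 = DMA, 1 = NORMAL in this sketch */

/* Hypothetical reference to one zone of one node. */
struct zone_ref {
    int node;
    int zone;
};

/* Build the fallback list for allocations on node `local` that may use
 * zone types up to max_zone. Returns the number of entries written. */
static int build_zonelist_sk(int local, int max_zone,
                             int dist[NODES_SK][NODES_SK],
                             struct zone_ref list[NODES_SK * ZONES_SK + 1])
{
    int used[NODES_SK] = { 0 };
    int n = 0;

    assert(local >= 0 && local < NODES_SK);
    assert(max_zone >= 0 && max_zone < ZONES_SK);
    for (int k = 0; k < NODES_SK; k++) {
        int best = -1;

        /* Pick the nearest node that has not been added yet. */
        for (int i = 0; i < NODES_SK; i++)
            if (!used[i] && (best < 0 || dist[local][i] < dist[local][best]))
                best = i;
        used[best] = 1;
        /* Within a node, try the highest allowed zone type first. */
        for (int z = max_zone; z >= 0; z--) {
            list[n].node = best;
            list[n].zone = z;
            n++;
        }
    }
    list[n].node = -1;   /* terminator, standing in for the NULL pointer */
    list[n].zone = -1;
    return n;
}
```

An allocator walking such a list falls back from local memory to progressively more distant nodes without consulting the distance table again.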
46 page_alloc_init: directly invokes the macro hotcpu_notifier(page_alloc_cpu_notify, 0), which defines the static notifier block struct notifier_block page_alloc_cpu_notify_nb = { page_alloc_cpu_notify, 0 } and calls register_cpu_notifier to register page_alloc_cpu_notify_nb on the global CPU notifier chain cpu_chain
47 Displays: printk(KERN_NOTICE "Kernel command line: %s\n", saved_command_line)
48 parse_early_param: parses the early boot parameters
49 parse_args: parses the command-line parameters: parse_args("Booting kernel", command_line, __start___param, __stop___param - __start___param, &unknown_bootoption);
50 sort_main_extable: directly calls sort_extable(__start___ex_table, __stop___ex_table), which in turn calls sort to sort the entries of the exception table
51 trap_init: exception initialization
  I Initializes the exception vectors (the interrupt vectors below 32)
  II cpu_init(): initializes the CPU
    i If the CPU is not CPU 0, calls pda_init(cpu) to set the basic CPU information
    ii Displays: printk("Initializing CPU#%d\n", cpu)
    iii Sets the GDT and IDT
    iv syscall_init(): initializes the system call entries
      wrmsrl(MSR_STAR, ((u64)__USER32_CS) << 48 | ((u64)__KERNEL_CS) << 32): sets the STAR register used for x86 legacy mode system calls; the value written carries the code segment selectors
      wrmsrl(MSR_LSTAR, system_call): sets the system call entry address for 64-bit software in long mode
      syscall32_cpu_init(): requires the build option IA32_EMULATION; sets the system call entry address for compatibility-mode software in long mode
      Note: x86-64 uses the syscall/sysret instructions to enter and return from system calls; the entry addresses are held in the model-specific registers STAR (C000_0081h), LSTAR (C000_0082h) and CSTAR (C000_0083h). The corresponding assembly entry points are ia32_syscall, ia32_cstar_target and system_call. The system call tables are located in the file arch/x86_64/ia32/ia32entry.S and in include/asm-x86_64/unistd.h. The int 80h software interrupt is no longer used for this, but it remains available after the build option IA32_EMULATION is enabled.
  III fpu_init(): initializes the floating-point unit
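The value that syscall_init writes to MSR_STAR is composed of two segment selectors, as the wrmsrl call above shows. A small sketch of that composition follows; the selector values 0x23 and 0x10 used in the usage note are assumptions for illustration, and star_value_sk is a hypothetical helper rather than a kernel function.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_STAR_SK  0xc0000081u   /* register numbers as listed in the text */
#define MSR_LSTAR_SK 0xc0000082u
#define MSR_CSTAR_SK 0xc0000083u

/* Compose the STAR value: bits 63..48 hold the selector base used on
 * return to user mode, bits 47..32 the selector loaded on kernel entry. */
static uint64_t star_value_sk(uint16_t user32_cs, uint16_t kernel_cs)
{
    assert((kernel_cs & 3) == 0);   /* a kernel selector has privilege level 0 */
    return ((uint64_t)user32_cs << 48) | ((uint64_t)kernel_cs << 32);
}
```

With the assumed selectors, star_value_sk(0x23, 0x10) yields 0x0023001000000000; the low 32 bits stay zero because 64-bit entry goes through LSTAR instead.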
52 rcu_init: RCU initialization
  I Calls rcu_cpu_notify;
  II Calls register_cpu_notifier to register the RCU notifier block rcu_nb on the cpu_chain list; the callback of rcu_nb is rcu_cpu_notify;
53 init_IRQ:
  I init_ISA_irqs:
    i Calls init_bsp_APIC: returns directly if an SMP configuration is present or the CPU has no APIC; otherwise sets up the local APIC
    ii Calls init_8259A(0): initializes the 8259A
    iii Initializes the interrupt descriptor array irq_desc[224] to the empty state; the first 16 interrupts are handled as 8259A-type interrupts
    iv Sets the interrupt vectors [32..255] up as interrupt gates
  II For SMP configurations, sets up the inter-processor interrupts and the APIC interrupts
  III Calls setup_timer: initializes the timer by accessing the I/O ports 0x43 and 0x40
  IV If acpi_ioapic is 0, initializes interrupt request 2 (interrupt vector 34) with irq2 as its handler
54 pidhash_init: PID hash table initialization
55 init_timers: timer initialization
  I Calls timer_cpu_notify;
  II Calls register_cpu_notifier to register the timer notifier block timers_nb on the cpu_chain list; the callback of timers_nb is timer_cpu_notify.
  III open_softirq(TIMER_SOFTIRQ, run_timer_softirq, NULL): initializes the timer soft interrupt (TIMER_SOFTIRQ): softirq_vec[TIMER_SOFTIRQ].data = NULL; softirq_vec[TIMER_SOFTIRQ].action = run_timer_softirq;
56 softirq_init: calls open_softirq to initialize the other soft interrupts:
  I open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
  II open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL);
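What open_softirq does, filling one slot of a handler table that is later walked for pending entries, can be sketched as follows. The reduced enumeration, the pending bitmask and the helpers raise_softirq_sk and do_softirq_sk are assumptions that model the mechanism in a single thread; they are not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced, hypothetical set of soft interrupt numbers. */
enum { HI_SOFTIRQ_SK, TIMER_SOFTIRQ_SK, TASKLET_SOFTIRQ_SK, NR_SOFTIRQS_SK };

struct softirq_action_sk {
    void (*action)(struct softirq_action_sk *);
    void *data;
};

static struct softirq_action_sk softirq_vec_sk[NR_SOFTIRQS_SK];
static unsigned int pending_sk;    /* bit nr set: soft interrupt nr is raised */
static int timer_runs_sk;

/* Register a handler: just fill the table slot, as described in the text. */
static void open_softirq_sk(int nr, void (*action)(struct softirq_action_sk *),
                            void *data)
{
    assert(nr >= 0 && nr < NR_SOFTIRQS_SK);
    softirq_vec_sk[nr].data = data;
    softirq_vec_sk[nr].action = action;
}

static void raise_softirq_sk(int nr)
{
    assert(nr >= 0 && nr < NR_SOFTIRQS_SK);
    pending_sk |= 1u << nr;
}

/* Run every raised handler in ascending priority order; returns the count. */
static int do_softirq_sk(void)
{
    unsigned int pending = pending_sk;
    int ran = 0;

    pending_sk = 0;
    for (int nr = 0; nr < NR_SOFTIRQS_SK; nr++)
        if ((pending & (1u << nr)) && softirq_vec_sk[nr].action != NULL) {
            softirq_vec_sk[nr].action(&softirq_vec_sk[nr]);
            ran++;
        }
    return ran;
}

/* Example handler standing in for run_timer_softirq. */
static void run_timer_softirq_sk(struct softirq_action_sk *a)
{
    (void)a;
    timer_runs_sk++;
}
```

Registration is cheap and happens once at boot; the work is deferred until the soft interrupt is raised and the table is processed.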
time_init: system initialization time, set the global variable xtime, wall_to_monotonic, vxtime_hz, cpu_khz etc.
I
get_cmos_time: CMOS time to get set in the xtime.tv_sec
II
set_normalized_timespec: Setting wall_to_monotonic
III
hpet_init: Initialization HPET
i Setting fixed mapping FIX_HPET_BASE and VSYSCALL_HPET
ii obtain information through the interface HPET hpet_readl, if there is HPET proceed
iii hpet_timer_stop_set_go: Initialization HPET
If HPET exist, namely hpet_use_timer set, then set cpu_khz = hpet_calibrate_tsc (), set the name of the clock
IV
V
VI
VII

timename as "HPET", or else, perform


If you start the compilation options X86_PM_TIMER and initialization process will ACPI pmtmr_ioport set, the execution pit_init,
Juxtaposed cpu_khz = pit_calibrate_tsc (), set the clock name timename as "PM", or else, the Executive
pit_init, juxtaposed cpu_khz = pit_calibrate_tsc (), set the clock name timename as "PIT"
Flag information: printk (KERN_INFO "time.c:. . Using% ld% 06ld MHz% s timer \ n", vxtime_hz /

1000000, vxtime_hz% 1000000, timename) ; the name given here to use the clock timename
printk (KERN_INFO "time.c:. Detected % d% 03d MHz processor \ n.", cpu_khz / 1000, cpu_khz%
1000);
IX
rdtscll_sync (& vxtime.last_tsc): The vxtime.last_tsc set to the current value of TSC
X
setup_irq (0, & irq0); set the timer interrupt handler for irq0
XI
set_cyc2ns_scale (cpu_khz / 1000)
XII
time_init_gtod: need to open SMP option, "Decide after all CPUs are booted what mode gettimeofday
should use "
i unsynchronized_tsc ()
ii flag information: printk (KERN_INFO "time.c:. Using% s based timekeeping \ n", timetype)
58
console_init: early console initialization
I tty_register_ldisc(N_TTY, &tty_ldisc_N_TTY): set tty_ldiscs[N_TTY], i.e. set up the default TTY line discipline
II disable_early_printk: if the compile option EARLY_PRINTK is enabled, turn off the early printk output
III Call the function pointers in [__con_initcall_start, __con_initcall_end], i.e. run every function defined with console_initcall
59
profile_init:
I If prof_on = 0, return directly; otherwise continue. prof_on is turned on by the boot parameter "profile=xx"; see the note below
II prof_len = (_etext - _stext) >> prof_shift
III prof_buffer = alloc_bootmem(prof_len * sizeof(atomic_t))
Note: if the command line contains the parameter "profile=schedule,n" or "profile=n" (n is a number), the function profile_setup is executed. It sets up the profile feature:
i If "profile=schedule,n", then
Set prof_on = SCHED_PROFILING (equal to 2)
prof_shift = n
Flag information: printk(KERN_INFO "kernel schedule profiling enabled (shift: %ld)\n", prof_shift);
ii If "profile=n", then
Set prof_on = CPU_PROFILING (equal to 1)
prof_shift = n
Flag information: printk(KERN_INFO "kernel profiling enabled (shift: %ld)\n", prof_shift)
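The parameter handling in the note and the prof_len formula can be sketched as below. This is a simplified user-space illustration under stated assumptions, not the kernel's profile_setup: the helper names parse_profile_param and profile_slots and the struct profile_cfg are hypothetical, and the parsing is stricter than the kernel's own option parser.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Values of prof_on as given in the text (1 and 2); 0 means profiling is off. */
enum { NO_PROFILING = 0, CPU_PROFILING = 1, SCHED_PROFILING = 2 };

struct profile_cfg {
    int prof_on;              /* which kind of profiling is enabled */
    unsigned long prof_shift; /* granularity: one counter per 2^shift bytes of text */
};

/* Parse the value part of "profile=...": either "schedule,n" or "n". */
static struct profile_cfg parse_profile_param(const char *str)
{
    struct profile_cfg cfg = { NO_PROFILING, 0 };

    assert(str != NULL);
    if (strncmp(str, "schedule", 8) == 0) {
        cfg.prof_on = SCHED_PROFILING;
        if (str[8] == ',')
            cfg.prof_shift = strtoul(str + 9, NULL, 0);
    } else {
        char *end;
        unsigned long n = strtoul(str, &end, 0);

        if (end != str) { /* at least one digit was consumed */
            cfg.prof_on = CPU_PROFILING;
            cfg.prof_shift = n;
        }
    }
    return cfg;
}

/* prof_len = (_etext - _stext) >> prof_shift: number of profiling counters. */
static unsigned long profile_slots(unsigned long stext, unsigned long etext,
                                   unsigned long prof_shift)
{
    assert(etext >= stext);
    return (etext - stext) >> prof_shift;
}
```

With prof_shift = 2, for example, a kernel text range of 0x4000 bytes needs 0x1000 counters, so a larger shift trades resolution for a smaller buffer.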

60
local_irq_enable: enable interrupts
61
vfs_caches_init_early:
I dcache_init_early: dentry hash table initialization: calls dentry_hashtable = alloc_large_system_hash("Dentry cache", ...) to allocate the dentry hash table and initializes every list head to empty. The function alloc_large_system_hash displays the flag information: printk("%s hash table entries: %d (order: %d, %lu bytes)\n", tablename, (1U << log2qty), long_log2(size) - PAGE_SHIFT, size);
II inode_init_early: inode hash table initialization: calls inode_hashtable = alloc_large_system_hash("Inode-cache", ...) to allocate the hash table and initializes every list head to empty. The function alloc_large_system_hash displays the flag information: printk("%s hash table entries: %d (order: %d, %lu bytes)\n", tablename, (1U << log2qty), long_log2(size) - PAGE_SHIFT, size);
62
mem_init: initialization of the memory page management mechanism
I Set the global variables max_low_pfn = max_pfn = num_physpages = end_pfn;
II high_memory = (void *) __va(end_pfn * PAGE_SIZE)
III memset(empty_zero_page, 0, PAGE_SIZE): clear the reserved zero page empty_zero_page
IV If the NUMA compile option is enabled, totalram_pages += numa_free_all_bootmem(). The function numa_free_all_bootmem follows the marks in node_online_map and, for each online NODE i, calls free_all_bootmem_node(NODE_DATA(i)), which directly calls free_all_bootmem_core(pgdat) to build this NODE's page table and Buddy memory management mechanism:
i Starting physical page number of the NODE: pfn = bdata->node_boot_start >> PAGE_SHIFT (bdata = pgdat->bdata)
ii Number of pages of the NODE: bdata->node_low_pfn - (bdata->node_boot_start >> PAGE_SHIFT)
iii Base address of the NODE bitmap: map = bdata->node_bootmem_map
iv If the start is 0 or aligned to 64 pages (the bitmap is read as long words), set gofast = 1, so that 64 consecutive pages can be handled at a time.
v __ClearPageReserved: clear the Reserved mark of the page being processed
vi __free_pages(page, order); order is determined by the number of consecutive pages.
A free_hot_page(page): called if order equals 0; it directly calls free_hot_cold_page(page, 0)
a zone = page_zone(page): get the pointer to the zone that contains the page
b free_pages_check: test the page flags to verify that the page is safe to release; if the PG_dirty mark is set, call the macro ClearPageDirty(page) to remove the "dirty" mark directly.
c per_cpu_pages *pcp = zone->pageset[get_cpu()]->pcp[0], i.e. get the single-page list management structure pcp of this page's zone for the current processor
d Use the page->lru pointer to add the page to the list pcp->list and increase the pcp count. If the count exceeds the maximum number of pages pcp->high, call free_pages_bulk(zone, pcp->batch, &pcp->list, 0) to release pcp->batch pages.
e free_pages_bulk: while the queue pcp->list is not empty and fewer than pcp->batch pages have been processed, call the function __free_pages_bulk, which releases one page per call. __free_pages_bulk:
(1) destroy_compound_page(struct page *page, unsigned long order): called when the compile option HUGETLB_PAGE is enabled and order > 0
(I) If the PG_compound flag is not set, the page is not a compound page; return directly.
(II) if (page[1].index != order) is an error: the page structure of the second 4KB subpage of a large page uses its index member to record the size (order) of the large page
(III) ClearPageCompound: clear the compound-page mark of every subpage of the large page, while checking that the private member of each subpage points to the page structure of the first page
(IV) page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1): the number of the current page within the BUDDY range
(V) while (order < MAX_ORDER - 1) {...}: the loop implements the memory-coalescing process of the buddy algorithm
1 combined_idx = __find_combined_index(page_idx, order) = (page_idx & ~(1 << order)): the first page number after merging. page_idx is a page number aligned to 2^order pages (for order = 0 it can be any value)
2 buddy_idx = page_idx ^ (1 << order)
3 bad_range(zone, buddy): check that buddy_idx lies within the valid range and belongs to this zone; if not, leave the loop
4 page_is_buddy(buddy, order): check whether buddy is a valid buddy; if not, leave the loop
5 (zone->free_area + order)->nr_free--; rmv_page_order(buddy): buddy has been merged into a larger block, so remove it from the current order
6 page_idx = combined_idx; order++: continue with the next loop iteration
(VI) set_page_order(page, order): page->private = order; set the PG_private flag in page->flags
(VII) list_add(&page->lru, &zone->free_area[order].free_list): add the page structure to the corresponding free list
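The index arithmetic of the coalescing loop can be tried out with a small model. The sketch below is an assumption-laden simplification: it tracks only 8 pages in an array free_order (a hypothetical stand-in for the per-order free lists), uses DEMO_MAX_ORDER = 4 instead of the kernel's MAX_ORDER, and omits zones, locking, and the bad_range check.

```c
#include <assert.h>

#define DEMO_MAX_ORDER 4                       /* illustrative, not the kernel value */
#define DEMO_PAGES (1 << (DEMO_MAX_ORDER - 1)) /* 8 pages in this toy zone */

/* free_order[i] is the order of the free block starting at page i, or -1. */
static int free_order[DEMO_PAGES];

static void buddy_reset(void)
{
    for (int i = 0; i < DEMO_PAGES; i++)
        free_order[i] = -1;
}

/* The buddy of a block differs from it only in bit `order` of the index. */
static unsigned long find_buddy_index(unsigned long page_idx, unsigned int order)
{
    return page_idx ^ (1UL << order);
}

/* The merged block starts at the index with bit `order` cleared. */
static unsigned long find_combined_index(unsigned long page_idx, unsigned int order)
{
    return page_idx & ~(1UL << order);
}

/* Free a block and merge it with free buddies; returns the final order. */
static unsigned int free_block(unsigned long page_idx, unsigned int order)
{
    assert(order < DEMO_MAX_ORDER && page_idx < DEMO_PAGES);
    assert((page_idx & ((1UL << order) - 1)) == 0); /* aligned to 2^order */

    while (order < DEMO_MAX_ORDER - 1) {
        unsigned long buddy_idx = find_buddy_index(page_idx, order);

        if (free_order[buddy_idx] != (int)order)
            break;                  /* buddy is not a free block of this order */
        free_order[buddy_idx] = -1; /* remove the buddy from its order */
        page_idx = find_combined_index(page_idx, order);
        order++;
    }
    free_order[page_idx] = (int)order; /* record the (possibly merged) block */
    return order;
}
```

Freeing pages 0 and 1 at order 0 yields one order-1 block at index 0; freeing the order-1 block at index 2 and then the order-2 block at index 4 merges everything into a single order-3 block.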

B __free_pages_ok(page, order): called when order is not equal to 0
a LIST_HEAD(list): define a temporary list head
b mod_page_state(pgfree, 1 << order): page_states.pgfree += 1 << order; the structure variable page_states defines a separate instance for each CPU
c free_pages_check: check the validity of each page
d list_add(&page->lru, &list): add the first page to the temporary list, which contains only this one element
e free_pages_bulk(page_zone(page), 1, &list, order): release the pages, as described above
63
kmem_cache_init: initialization of the kmem_cache mechanism
I Initialize the list cache_chain
II Initialize cache_cache and add it to cache_chain; every time the function kmem_cache_create creates a kmem_cache, the new cache is linked into the list cache_chain through its cache.next field
III Initialize malloc_sizes: for the cs_cachep and cs_dmacachep members of each malloc_sizes[*] entry, call kmem_cache_create to set up the memory caches
IV Set cache_cache.array[smp_processor_id()] and malloc_sizes[0].cs_cachep->array[smp_processor_id()]
V register_cpu_notifier(&cpucache_notifier)
64
setup_per_cpu_pageset:
I process_zones(smp_processor_id()): set up the per_cpu_pageset for CPU 0
i For each zone in the system, set zone->pageset[cpu] = kmalloc_node(...). kmalloc_node(size_t size, unsigned int __nocast flags, int node) works like kmalloc and allocates memory of the requested size from the malloc_sizes[*] caches, except that it allocates from the specified node (NODE) first and falls back to other nodes only when that node has no memory
ii setup_pageset: set zone->pageset[cpu]->pcp[*]
II register_cpu_notifier(&pageset_notifier): register pageset_notifier on the CPU startup notification chain
65
numa_policy_init:
I Initialize a dedicated cache: policy_cache = kmem_cache_create("numa_policy", sizeof(struct mempolicy), ...)
II Initialize a dedicated cache: sn_cache = kmem_cache_create("shared_policy_node", sizeof(struct sp_node), ...)
III sys_set_mempolicy(MPOL_INTERLEAVE, ...): set the interleaved (MPOL_INTERLEAVE) memory policy, so that the memory allocated during startup is not all placed on NODE 0;
66

calibrate_delay: calibrate the delay loop; sets the nominal computing-power value BogoMIPS and the global variable loops_per_jiffy
I If the command-line parameter "lpj=xxx" is used, set loops_per_jiffy = preset_lpj directly and display the flag information: printk("Calibrating delay loop (skipped)... %lu.%02lu BogoMIPS preset\n", loops_per_jiffy / (500000 / HZ), (loops_per_jiffy / (5000 / HZ)) % 100);
II loops_per_jiffy = calibrate_delay_direct(): if the return value is not 0, display the flag information: printk("Calibrating delay using timer specific routine.. "); printk("%lu.%02lu BogoMIPS (lpj=%lu)\n", loops_per_jiffy / (500000 / HZ), (loops_per_jiffy / (5000 / HZ)) % 100, loops_per_jiffy);
III Otherwise run the calibration code to calculate loops_per_jiffy and display the flag information: printk(KERN_DEBUG "Calibrating delay loop... "); printk("%lu.%02lu BogoMIPS (lpj=%lu)\n", same arguments as above);
Note: the command-line parameter "lpj=xxx" modifies the global variable preset_lpj, which takes effect in this step
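The two printk arguments above split loops_per_jiffy into the integer part and a two-digit fraction of the BogoMIPS value (loops_per_jiffy * HZ / 500000) using integer arithmetic only. A minimal sketch, assuming HZ divides 5000 evenly; the helper names bogomips_parts and format_bogomips are hypothetical:

```c
#include <assert.h>
#include <stdio.h>

/* Split loops_per_jiffy into whole BogoMIPS and hundredths of a BogoMIPS. */
static void bogomips_parts(unsigned long lpj, unsigned long hz,
                           unsigned long *whole, unsigned long *frac)
{
    assert(hz > 0 && hz <= 5000);
    *whole = lpj / (500000 / hz);       /* integer part */
    *frac = (lpj / (5000 / hz)) % 100;  /* two decimal digits */
}

/* Format the value the way the boot message does; returns snprintf's result. */
static int format_bogomips(char *buf, size_t len, unsigned long lpj, unsigned long hz)
{
    unsigned long whole, frac;

    bogomips_parts(lpj, hz, &whole, &frac);
    return snprintf(buf, len, "%lu.%02lu BogoMIPS (lpj=%lu)", whole, frac, lpj);
}
```

For example, with HZ = 1000 and loops_per_jiffy = 1998848 the parts are 3997 and 69, which matches 1998848 * 1000 / 500000 = 3997.696 truncated to two decimals.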

67
pidmap_init: for the global variable pidmap_array[0], allocate one memory page for pidmap_array[0].page, then call the function attach_pid(current, ...) to mark PID 0 (process 0) as in use
68
pgtable_cache_init: empty function
69
prio_tree_init: initialization of the radix priority search tree; initializes the static variable index_bits_to_maxindex
70
anon_vma_init: initialize a dedicated cache: anon_vma_cachep = kmem_cache_create("anon_vma", sizeof(struct anon_vma), ...)
71
If efi_enabled is set, call the function efi_enter_virtual_mode; requires the compile option EFI
72
fork_init(num_physpages): initialization of process management
I Initialize a dedicated cache: task_struct_cachep = kmem_cache_create("task_struct", sizeof(struct task_struct), ...);
II Set max_threads: max_threads = mempages / (8 * THREAD_SIZE / PAGE_SIZE)
III Set init_task.signal->rlim[RLIMIT_NPROC], etc.
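The max_threads formula in fork_init limits thread structures to roughly one eighth of memory. The sketch below evaluates it in user space; the lower bound of 20 and the RLIMIT_NPROC value of max_threads / 2 are stated from memory of kernel/fork.c rather than from the text above, so treat them as assumptions. The helper and struct names are hypothetical.

```c
#include <assert.h>

struct thread_limits {
    unsigned long max_threads; /* system-wide thread limit */
    unsigned long nproc_limit; /* assumed initial RLIMIT_NPROC: max_threads / 2 */
};

/* max_threads = mempages / (8 * THREAD_SIZE / PAGE_SIZE), as in the text. */
static struct thread_limits compute_thread_limits(unsigned long mempages,
                                                  unsigned long thread_size,
                                                  unsigned long page_size)
{
    struct thread_limits lim;

    assert(page_size > 0 && thread_size >= page_size);
    lim.max_threads = mempages / (8 * thread_size / page_size);
    if (lim.max_threads < 20) /* assumed floor so a small machine can still boot */
        lim.max_threads = 20;
    lim.nproc_limit = lim.max_threads / 2;
    return lim;
}
```

With 4 KB pages and an 8 KB kernel stack, a machine with 262144 pages (1 GB) gets max_threads = 262144 / 16 = 16384.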
73
proc_caches_init: call kmem_cache_create to initialize the caches associated with processes
I sighand_cachep = kmem_cache_create("sighand_cache", sizeof(struct sighand_struct), ...);
II signal_cachep = kmem_cache_create("signal_cache", sizeof(struct signal_struct), ...);
III files_cachep = kmem_cache_create("files_cache", sizeof(struct files_struct), ...);
IV fs_cachep = kmem_cache_create("fs_cache", sizeof(struct fs_struct), ...);
V vm_area_cachep = kmem_cache_create("vm_area_struct", sizeof(struct vm_area_struct), ...);
VI mm_cachep = kmem_cache_create("mm_struct", sizeof(struct mm_struct), ...);
74
buffer_init:
I Initialize a dedicated cache: bh_cachep = kmem_cache_create("buffer_head", sizeof(struct buffer_head), ...);
II Set the static variable max_buffer_heads
III hotcpu_notifier(buffer_cpu_notify, 0): register buffer_cpu_notify on the CPU startup notification chain
75
unnamed_dev_init: only calls the function idr_init(&unnamed_dev_idr):
I init_id_cache(): initialize a dedicated cache: idr_layer_cache = kmem_cache_create("idr_layer_cache", sizeof(struct idr_layer), ...);
II Clear the unnamed_dev_idr structure and initialize the spin lock unnamed_dev_idr.lock;
76
key_init: initialization of key management
I Initialize a dedicated cache: key_jar = kmem_cache_create("key_jar", sizeof(struct key), ...)
II Add key_type_keyring.link, key_type_dead.link and key_type_user.link to the tail of the list key_types_list
III Initialize the relevant global variables
77
security_init: security initialization
I Display flag information: printk(KERN_INFO "Security Framework v" SECURITY_FRAMEWORK_VERSION " initialized\n");
II verify(&dummy_security_ops): calls the function security_fixup_ops(&dummy_security_ops), which applies the macro set_to_dummy_if_null to each member of dummy_security_ops, so that every unset member points to its corresponding empty function dummy_xx (in security/dummy.c)
III Initialize the pointer security_ops = &dummy_security_ops
IV do_security_initcalls: call each function pointer in [__security_initcall_start, __security_initcall_end] to initialize the security mechanisms, i.e. the initialization functions registered with the macro security_initcall(fn). In the current source these are only capability_init() in security/capability.c, selinux_init() in security/selinux/hooks.c and rootplug_init() in security/root_plug.c
78
vfs_caches_init(num_physpages): VFS layer file system initialization
I Calculate the reserved memory: reserve = min((num_physpages - nr_free_pages()) * 3/2, mempages - 1), i.e. reserve 1.5 times the memory pages already in use; the remaining memory mempages = num_physpages - reserve is the amount used by the following initialization steps;
II Dedicated cache initialization: names_cachep = kmem_cache_create("names_cache", PATH_MAX, ...)
III Dedicated cache initialization: filp_cachep = kmem_cache_create("filp", sizeof(struct file), ...);
IV dcache_init(mempages):
i Dedicated cache initialization: dentry_cache = kmem_cache_create("dentry_cache", sizeof(struct dentry), ...)
ii set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory): allocate a struct shrinker structure shrinker, set shrinker->shrinker = shrink_dcache_memory, and add shrinker->list to the global list shrinker_list
iii dcache_init_early() has already initialized dentry_hashtable, so it is not initialized here. Why is it initialized early?
V inode_init(mempages):
i Dedicated cache initialization: inode_cachep = kmem_cache_create("inode_cache", sizeof(struct inode), ...);
ii set_shrinker(DEFAULT_SEEKS, shrink_icache_memory): same role as in the previous step
iii inode_init_early() has already initialized inode_hashtable, so it is not initialized again here
VI files_init(mempages): set the initial value of the max_files member of the global variable files_stat, calculated from the remaining memory capacity
VII mnt_init(mempages):
i Dedicated cache initialization: mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct vfsmount), ...);
ii Allocate one page for mount_hashtable and initialize the hash table heads
iii sysfs_init(): sysfs file system initialization; requires the compile option SYSFS
A Dedicated cache initialization: sysfs_dir_cachep = kmem_cache_create("sysfs_dir_cache", sizeof(struct sysfs_dirent), ...);
B register_filesystem(&sysfs_fs_type): register the file system sysfs_fs_type
C sysfs_mount = kern_mount(&sysfs_fs_type): mount the internal file system sysfs; directly calls the function do_kern_mount(type->name, 0, type->name, NULL); see the later part on file system mounting
iv init_rootfs(): directly calls the function register_filesystem(&rootfs_fs_type) to register the file system rootfs_fs_type
v init_mount_tree(): initialize the file system mount tree
A mnt = do_kern_mount("rootfs", 0, "rootfs", NULL): mount the internal file system rootfs
B Allocate a struct namespace structure namespace and initialize it
C list_add(&mnt->mnt_list, &namespace->list): add the rootfs mount point to the member list of namespace
D namespace->root = mnt, mnt->mnt_namespace = namespace: the rootfs mount point becomes the root directory
E Set init_task.namespace = namespace
F For each thread p currently in the system, set p->namespace = namespace; p is of type task_struct
G set_fs_pwd: set the current directory of the current process; its mount point and directory are namespace->root and namespace->root->mnt_root respectively
H set_fs_root: set the root directory of the current process; its mount point and directory are namespace->root and namespace->root->mnt_root respectively
VIII bdev_cache_init(): block device initialization
i Dedicated cache initialization: bdev_cachep = kmem_cache_create("bdev_cache", sizeof(struct bdev_inode), ...);
ii register_filesystem(&bd_type): register the block device file system bd_type
iii bd_mnt = kern_mount(&bd_type): mount the block device file system internally
iv blockdev_superblock = bd_mnt->mnt_sb: set the block device superblock
IX chrdev_init(): directly calls cdev_map = kobj_map_init(base_probe, &chrdevs_lock)
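The reserve calculation in step I of vfs_caches_init can be checked with a few numbers. A minimal sketch, assuming free_pages stands in for the value returned by nr_free_pages(); the helper names are hypothetical.

```c
#include <assert.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

/*
 * reserve = min((mempages - free_pages) * 3/2, mempages - 1);
 * the caches are then sized from mempages - reserve.
 */
static unsigned long pages_after_reserve(unsigned long mempages, unsigned long free_pages)
{
    unsigned long used, reserve;

    assert(mempages > 0 && free_pages <= mempages);
    used = mempages - free_pages;                   /* pages already in use */
    reserve = min_ul(used * 3 / 2, mempages - 1);   /* never reserve everything */
    return mempages - reserve;
}
```

With 1000 pages of which 900 are free, 150 pages are reserved and 850 remain for sizing the caches; the cap of mempages - 1 guarantees that at least one page is always left.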
79
radix_tree_init: radix tree initialization
I Dedicated cache initialization: radix_tree_node_cachep = kmem_cache_create("radix_tree_node", sizeof(struct radix_tree_node), ...);
II radix_tree_init_maxindex(): initialize each element of the static array height_to_maxindex[*]
III hotcpu_notifier(radix_tree_callback, 0): register on the CPU startup notification chain
80
signals_init(): signal initialization; only initializes the dedicated cache sigqueue_cachep = kmem_cache_create("sigqueue", sizeof(struct sigqueue), ...);
81
page_writeback_init(): initialize the page write-back mechanism
I Set the related global variables dirty_background_ratio and vm_dirty_ratio according to the memory capacity
II mod_timer(&wb_timer, ...): modify the wb_timer timeout
III set_ratelimit(): set the rate-limit parameter ratelimit_pages
IV register_cpu_notifier(&ratelimit_nb): register ratelimit_nb on the CPU startup notification chain

82
proc_root_init: proc file system initialization
I proc_init_inodecache(): initialize the dedicated cache proc_inode_cachep = kmem_cache_create("proc_inode_cache", sizeof(struct proc_inode), ...);
II register_filesystem(&proc_fs_type): register the proc file system
III proc_mnt = kern_mount(&proc_fs_type): mount the proc file system internally
IV proc_misc_init(): create the regular files under the /proc directory, usually with the read-only attribute
V Create the subdirectories under /proc: net, net/stat, sysvipc (requires the compile option SYSVIPC), sys (requires the compile option SYSCTL), fs, driver, fs/nfsd, bus, etc.
VI proc_tty_init: initialize the subdirectory /proc/tty
83
cpuset_init(): requires the compile option CPUSETS; CPU set initialization
I Initialize the global variable top_cpuset and set init_task.cpuset = &top_cpuset
II register_filesystem(&cpuset_fs_type): register the cpuset file system
III cpuset_mount = kern_mount(&cpuset_fs_type): mount the cpuset file system internally
IV Set top_cpuset.dentry = cpuset_mount->mnt_sb->s_root and the related members
V cpuset_populate_dir(cpuset_mount->mnt_sb->s_root): call the function cpuset_add_file several times to add the related files under the dentry
84

check_bugs(): further initialization of architecture-related functions
I identify_cpu(c = &boot_cpu_data): further initialization of the CPU parameters
i early_identify_cpu(c): preliminary CPU identification; first sets phys_proc_id[smp_processor_id()] to the APIC_ID
ii cpuid_eax(): further CPU identification
iii init_amd(c): if the processor is from AMD
A get_model_name(c): set the CPU model string, recorded in boot_cpu_data.x86_model_id
B display_cacheinfo(c): display the CPU cache information
C Set the number of cores of the current CPU in c->x86_num_cores
D amd_detect_cmp(c): detect the multi-core CPU configuration
a cpu_core_id[cpu] = phys_proc_id[cpu] & ((1 << bits) - 1): set the core ID of the current CPU within its package
b phys_proc_id[cpu] >>= bits: set the package ID of the current CPU (the APIC_ID with the core number removed)
c If acpi_numa <= 0, set cpu_to_node[cpu] = phys_proc_id[cpu]
d Display flag information: printk(KERN_INFO "CPU %d(%d) -> Node %d -> Core %d\n", ...);
iv init_intel(c): if the processor is from Intel
v display_cacheinfo(c): if the x86 processor is from another vendor
vi select_idle_routine(c): if the CPU supports the Monitor/Mwait feature and pm_idle is empty, set pm_idle = mwait_idle
Note: the command-line parameter idle="poll" sets pm_idle = poll_idle
vii detect_ht(c): set up the Hyper-Threading feature; requires the compile option SMP
A cpuid(1, ...): get the number of siblings per CPU package, from which the Hyper-Threading configuration of the CPU can be calculated
B Global variable phys_proc_id[smp_processor_id()] = phys_pkg_id(index_msb);
C Display flag information: printk(KERN_INFO "CPU: Physical Processor ID: %d\n", phys_proc_id[cpu]);
D Global variable cpu_core_id[smp_processor_id()]
viii mcheck_init(c): requires the compile option X86_MCE
A mce_init: MCE (Machine Check Exception) initialization function; calls functions such as do_machine_check and accesses the relevant internal CPU registers
B mce_cpu_features: if the processor is from Intel, call the function mce_intel_feature_init(), which calls intel_init_thermal() for further initialization
II If the compile option SMP is not enabled, display the CPU feature flag information: printk("CPU: "); ...
III alternative_instructions(): call the function apply_alternatives(__alt_instructions, __alt_instructions_end) to replace the instructions described by the structures in the interval [__alt_instructions, __alt_instructions_end] with faster instructions specific to this CPU; the link script specifies the location of the replacement area. If the command line contains the parameter noreplacement, the replacement is not performed
85
acpi_early_init: ACPI early initialization
86
rest_init(): prepare to run the next stage of kernel initialization
I kernel_thread(init, NULL, CLONE_FS | CLONE_SIGHAND): create the kernel thread init
II numa_default_policy(): call the function sys_set_mempolicy(MPOL_DEFAULT, ...) to set the NUMA memory allocation policy to the default MPOL_DEFAULT
III unlock_kernel(): unlock the kernel
IV preempt_enable_no_resched(): macro, barrier(); dec_preempt_count(): the kernel enters the preemptive scheduling state
V schedule(): perform one round of process scheduling
VI cpu_idle(): process 0 runs the idle loop

1.4 init process


87
lock_kernel(): lock the kernel
88
set_cpus_allowed(current, CPU_MASK_ALL): allow the init process to run on all CPUs
I rq = task_rq_lock(p = current, &flags): get the run queue rq of the current process (init)
II cpus_intersects(new_mask = CPU_MASK_ALL, cpu_online_map): test whether the intersection of the new CPU mask and the online CPU mask (cpu_online_map) is empty; an empty intersection is an error
III p->cpus_allowed = new_mask: set the new CPU mask of the process
IV cpu_isset(task_cpu(p), new_mask): test whether the CPU on which init is currently running is in the new CPU mask; if so, return successfully
V migrate_task(p, any_online_cpu(new_mask), req): the CPU on which init is currently running is not in the new mask, so migrate init to any online CPU in the new CPU mask
i If init is not in a run queue (p->array == NULL && !task_running(rq, p))
A Call the function set_task_cpu() to set p->thread_info->cpu to the destination CPU
ii If init is in a run queue, fill in the migration request req
A Add req->list to the migration request queue rq->migration_queue of rq
B wake_up_process(rq->migration_thread): wake the migration thread
C wait_for_completion(&req.done): wait for the migration to finish
D tlb_migrate_finish(p->mm): the migration is finished; flush the TLB
89
child_reaper = current
90
smp_prepare_cpus(max_cpus): prepare to start the other CPUs of the SMP system; the parameter max_cpus defaults to NR_CPUS, and the boot parameter "maxcpus=xx" resets it to xx:
I nmi_watchdog_default(): set the global variable nmi_watchdog. If its current value is not the default NMI_DEFAULT, return directly (the command-line parameter nmi_watchdog=xxx sets the nmi_watchdog value); otherwise, if the CPU is from Intel or AMD and its family (boot_cpu_data.x86) is 15, set nmi_watchdog = NMI_LOCAL_APIC, else set nmi_watchdog = NMI_IO_APIC
II current_cpu_data = boot_cpu_data: set the characteristic parameters of the current CPU; with the compile option SMP enabled, the macro current_cpu_data is defined as cpu_data[smp_processor_id()]
III current_thread_info()->cpu = 0
IV enforce_max_cpus(max_cpus): clear the CPUs beyond max_cpus from the global variables cpu_possible_map and cpu_present_map, i.e. mark the CPUs whose number exceeds max_cpus as unavailable
V prefill_possible_map(): requires the compile option HOTPLUG_CPU; adds all NR_CPUS CPUs supported by the system to cpu_possible_map
VI smp_sanity_check(max_cpus): verify that the SMP configuration is feasible; if the function fails, SMP is turned off
i Test whether the BP is marked in the global variable phys_cpu_present_map; if not, mark it
ii If smp_found_config is 0, the SMP configuration fails; return directly
iii Test whether boot_cpu_id is marked in the global variable phys_cpu_present_map; if not, mark it
iv Check whether an APIC is present
VII connect_bsp_APIC(): if the current mode is already APIC mode, do nothing; otherwise switch from PIC mode to APIC mode: call the function clear_local_APIC() to reset the local APIC and write data through ports 22h and 23h
VIII setup_local_APIC(): local APIC initialization

IX setup_IO_APIC(): start the I/O APIC:
i enable_IO_APIC():
A Initialize the global variable irq_2_pin[*]
B If the command-line parameter pirq=xx is not set, initialize the global variable pirq_entries[*]
C For each IOAPIC counted by the global variable nr_ioapics, read its number of interrupt pins and store it in the global variable nr_ioapic_registers[*]; the value of nr_ioapics is set by the functions mp_register_ioapic() and MP_ioapic_info()
D clear_IO_APIC(): for every pin of every IOAPIC, call the function clear_IO_APIC_pin() to clear the interrupt
ii If the ACPI parsing function acpi_process_madt() has set up the IOAPIC (acpi_ioapic = 1), set io_apic_irqs = ~0, i.e. all IRQs go through the IOAPIC; otherwise set io_apic_irqs = ~PIC_IRQS
iii If the ACPI parsing function acpi_process_madt() has not set up the IOAPIC (acpi_ioapic = 0), call the function setup_ioapic_ids_from_mpc to parse the IOAPICs from the MPC table and set the relevant registers of every IOAPIC; displays the flag information: printk(KERN_INFO "Using IO-APIC %d\n", mp_ioapics[apic].mpc_apicid)
iv sync_Arb_IDs()
v setup_IO_APIC_irqs(): set the interrupt vector of each IOAPIC pin
Note 1: the command-line parameter "apic=debug" or "apic=verbose" makes the kernel print APIC-related information during boot; debug prints more
Note 2: for the interrupt pin count see the definition of the structure union IO_APIC_reg_01 {}; for the other IOAPIC registers see the definitions of the series of structures IO_APIC_reg_xx {}
Note 3: for the setting of the IOAPIC interrupt pin registers, see also the function setup_IO_APIC_irqs()
vi init_IO_APIC_traps(): initialize the IOAPIC interrupt vector entries; for interrupt vectors below 16 call the function make_8259A_irq, otherwise set irq_desc[*].handler = no_irq_type
vii check_timer(): verify the timer; the code is rather complex
viii print_IO_APIC(): if the ACPI parsing function acpi_process_madt() has not set up the IOAPIC (acpi_ioapic = 0), print the current IOAPIC settings
X setup_boot_APIC_clock(): set up the APIC timer of the BP
i Display flag information: printk(KERN_INFO "Using local APIC timer interrupts.\n")
ii calibration_result = calibrate_APIC_clock(): calibrate the local APIC clock; displays the flag information: printk(KERN_INFO "Detected %d.%03d MHz APIC timer.\n", result / 1000 / 1000, result / 1000 % 1000);
iii setup_APIC_timer(calibration_result): start the local APIC timer
91
do_pre_smp_initcalls():
I migration_init():
i migration_call(&migration_notifier, CPU_UP_PREPARE, cpu):
A p = kthread_create(migration_thread, hcpu, "migration/%d", cpu): create a process migration thread for the current CPU, named migration/N, where N is the CPU number (here N = 0); the thread entry function is migration_thread
B kthread_bind(p, cpu): bind the migration thread migration/N so that it can run only on the current CPU
C set_task_cpu(p, cpu): p->thread_info->cpu = cpu
D p->cpus_allowed = cpumask_of_cpu(cpu)
E __setscheduler(p, SCHED_FIFO, MAX_RT_PRIO - 1): set the FIFO scheduling policy and the highest scheduling priority
F cpu_rq(cpu)->migration_thread = p: set the migration thread of the current CPU to the thread just created
ii migration_call(&migration_notifier, CPU_ONLINE, cpu): only calls the function wake_up_process(cpu_rq(cpu)->migration_thread) to wake the migration thread of the current CPU
iii register_cpu_notifier(&migration_notifier): call the function notifier_chain_register(&cpu_chain, nb) to register the notifier block migration_notifier on the CPU startup/shutdown notification chain
II spawn_ksoftirqd():
i cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu)
A p = kthread_create(ksoftirqd, hcpu, "ksoftirqd/%d", hotcpu): create the kernel thread ksoftirqd/N for the current CPU, where N is the CPU number; the entry function is ksoftirqd
B kthread_bind(p, hotcpu): bind the thread ksoftirqd/N so that it can run only on the current CPU
C per_cpu(ksoftirqd, hotcpu) = p: record the thread p in the per-CPU management structure
ii cpu_callback(&cpu_nfb, CPU_ONLINE, cpu): only calls the function wake_up_process(per_cpu(ksoftirqd, hotcpu)) to wake the soft interrupt processing thread of the current CPU
iii register_cpu_notifier(&cpu_nfb): register the notifier block cpu_nfb on the CPU startup/shutdown notification chain
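Both migration_init and spawn_ksoftirqd follow the same pattern: call the callback directly for the boot CPU (CPU_UP_PREPARE, then CPU_ONLINE), then register the notifier block so that later CPUs get the same treatment through the chain. The sketch below models that pattern in user space. It is an illustration under stated assumptions: the action values, the thread_state array, and demo_cpu_callback are hypothetical, and the real callbacks create and wake kernel threads instead of updating an array.

```c
#include <assert.h>
#include <stddef.h>

#define NOTIFY_OK 1
enum { CPU_ONLINE = 2, CPU_UP_PREPARE = 3 }; /* illustrative action codes */

struct notifier_block {
    int (*notifier_call)(struct notifier_block *self, unsigned long action, void *hcpu);
    struct notifier_block *next;
    int priority;
};

/* Insert n into the list, keeping higher priorities first. */
static int notifier_chain_register(struct notifier_block **list, struct notifier_block *n)
{
    assert(n != NULL && n->notifier_call != NULL);
    while (*list && (*list)->priority >= n->priority)
        list = &(*list)->next;
    n->next = *list;
    *list = n;
    return 0;
}

/* Call every block on the chain with the given action; returns the last result. */
static int notifier_call_chain(struct notifier_block **list, unsigned long action, void *hcpu)
{
    int ret = NOTIFY_OK;

    for (struct notifier_block *nb = *list; nb; nb = nb->next)
        ret = nb->notifier_call(nb, action, hcpu);
    return ret;
}

/* Per-CPU thread state for the demo: 0 = none, 1 = created, 2 = woken. */
static int thread_state[4];

static int demo_cpu_callback(struct notifier_block *self, unsigned long action, void *hcpu)
{
    int cpu = (int)(long)hcpu;

    (void)self;
    if (action == CPU_UP_PREPARE)
        thread_state[cpu] = 1; /* stands in for kthread_create + kthread_bind */
    else if (action == CPU_ONLINE && thread_state[cpu] == 1)
        thread_state[cpu] = 2; /* stands in for wake_up_process */
    return NOTIFY_OK;
}

static struct notifier_block demo_nb = { demo_cpu_callback, NULL, 0 };
static struct notifier_block *cpu_chain;
```

The direct calls cover CPU 0, which is already running when the block is registered; every CPU started afterwards is handled when the chain is run for it.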
fixup_cpu_present_map (): if the global variable cpu_present_map is empty, cpu_possible_map recorded
Each CPU to mark the cpu_present_map

smp_init (): SMP function to start another AP


I
cpu_up (i): For each CPU cpu_present_map recorded, if the current the CPU i does not start, and the current is enabled
Less than the total number of CPU dynamic global variables specified max_cpus a function is called cpu_up (i) start the current CPU:
i

notifier_call_chain (& cpu_chain, CPU_UP_PREPARE, hcpu): CPU execution start chain cpu_chain
Each notification block

ii __cpu_up (i): Start the i-th CPU


A
apicid = cpu_present_to_apicid (i): according to the global variable bios_cpu_apicid [i] to get the current record
The physical ID CPU
B
per_cpu (cpu_state, i) = CPU_UP_PREPARE: CPU is ready to start setting state

do_boot_cpu (i, apicid): Start a CPU, a logical number i, physics ID: apicid
a Define an idle-thread management structure c_idle
b Define a work-queue structure work whose handler is do_fork_idle
c c_idle.idle = get_idle_for_cpu(i): get the idle process control structure for this CPU
d If the idle process already exists (c_idle.idle != 0), set its stack pointer c_idle.idle->thread.rsp and call init_idle() to initialize the idle process: give it a very low priority, restrict its allowed-CPU mask to this CPU, and add it to this CPU's scheduler queue
e If the idle process does not exist yet (c_idle.idle == 0), call schedule_work(&work) to create it through the work queue, or call do_fork_idle() (the function work.func points to) directly, and record the new idle process in the idle-thread table: the macro set_idle_for_cpu(i, c_idle.idle) sets idle_thread_array[i] = c_idle.idle. do_fork_idle() calls fork_idle():
(I) task = copy_process(CLONE_VM, 0 ...): copy the current process (init) to create the new idle process
(II) init_idle(task, cpu): set the parameters of the newly created idle process
(i) rq = cpu_rq(cpu): get the run-queue pointer of the target CPU
(ii) idle->cpus_allowed = cpumask_of_cpu(cpu): allow the idle process to run only on this CPU
(iii) set_task_cpu(idle, cpu): set p->thread_info->cpu = cpu
(iv) rq->curr = rq->idle = idle: make the idle process both the current and the idle task of this CPU's run queue
(III) unhash_process(task): remove the newly created idle process from the PID hash table
f cpu_pda[i].pcurrent = c_idle.idle: make the idle process the CPU's current process
g start_rip = setup_trampoline(): get the physical address of the trampoline area, SMP_TRAMPOLINE_BASE (6000h), and copy the trampoline code [trampoline_data, trampoline_end] to it
h init_rsp = c_idle.idle->thread.rsp: point init_rsp, the initial stack defined in head.S, at the idle process's stack
i per_cpu(init_tss, cpu).rsp0 = init_rsp: set up the TSS structure
j initial_code = start_secondary: make start_secondary the first C function entered from head.S
k clear_ti_thread_flag(c_idle.idle->thread_info, TIF_FORK): clear the idle process's TIF_FORK flag
l Print a progress message: printk(KERN_INFO "Booting processor %d/%d APIC 0x%x\n", ...);
m CMOS_WRITE(0xa, 0xf); *((unsigned short *)phys_to_virt(0x469)) = start_rip >> 4; *((unsigned short *)phys_to_virt(0x467)) = start_rip & 0xf: writing 0xA to CMOS address 0xF sets the shutdown status to "after reset, resume execution at 40:67h", so the CPU executes the code at start_rip
n wakeup_secondary_via_INIT(apicid, start_rip): send the start-up command to the target CPU through inter-processor interrupts (IPIs), passing the start address start_rip, i.e. the trampoline code
o The BP polls in a loop to check whether the AP has started; it returns 0 on success and an error code otherwise
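The two 16-bit stores at 467h and 469h above form a real-mode far pointer (offset word first, then segment word). The following is a minimal sketch of that split, not kernel code; the struct and function names are hypothetical. For the trampoline address 6000h it yields segment 0600h and offset 0, consistent with the CS:IP value 0600:0000 that the AP starts with.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the far pointer stored at 40:67h. */
struct warm_reset_vector {
    uint16_t offset;   /* word written at physical 0x467 */
    uint16_t segment;  /* word written at physical 0x469 */
};

/* Split a physical start address the same way the two stores do. */
static struct warm_reset_vector make_warm_reset_vector(uint32_t start_rip)
{
    struct warm_reset_vector v;
    v.segment = (uint16_t)(start_rip >> 4);
    v.offset  = (uint16_t)(start_rip & 0xf);
    return v;
}

/* Real-mode address translation: segment * 16 + offset. */
static uint32_t real_mode_linear(struct warm_reset_vector v)
{
    return ((uint32_t)v.segment << 4) + v.offset;
}
```

Because segment * 16 + offset reproduces the original value, any start address below 1 MB survives the round trip.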
AP start-up sequence, beginning in the real-mode assembly code trampoline_data:
The initial CS:IP is 0600:0000
Set DS to 0600 so that the code and data segments coincide
movl $0xA5A5A5A5, trampoline_data - r_base: overwrite the start of the trampoline code with the marker value A5A5_A5A5h to signal that the AP is running
Set up the IDT and GDT
%ax = 1, lmsw %ax: enter protected mode
ljmpl $__KERNEL32_CS, $(startup_32 - __START_KERNEL_map): jump to startup_32 in the file head.S
Set CR3, CR4, etc. and enter long mode
Continue in 64-bit code at startup_64: set CR3 to use the page table init_level4_pgt
Set the stack to init_rsp, i.e. the idle process stack set in step h above
Jump to initial_code and execute the C code there, i.e. the function start_secondary() set in step j above
cpu_init(): initialize the CPU; calls pda_init(cpu) to initialize this AP's pda structure, setting pda->cpunumber = cpu so that smp_processor_id() can later return the CPU number, and prints a progress message: printk("Initializing CPU#%d\n", cpu)
smp_callin(): report to the BP that the AP has started running
(I) cpuid = smp_processor_id(): get the logical number of the current CPU
(II) setup_local_APIC(): set up the current AP's local APIC
(III) calibrate_delay(): calibrate the current AP's BogoMIPS
(IV) disable_APIC_timer(): turn off the current APIC timer
(V) smp_store_cpu_info(cpuid): store the current AP's information in cpu_data[cpuid]
(i) Copy the boot_cpu_data structure into cpu_data[cpuid]
(ii) identify_cpu(cpu_data + cpuid):
(VI) cpu_set(cpuid, cpu_callin_map): mark the current CPU's ID in the global variable cpu_callin_map
setup_secondary_APIC_clock(): calls setup_APIC_timer(calibration_result) to program the current AP's local APIC timer
If nmi_watchdog is set to NMI_IO_APIC, use LVT0 as the NMI source
enable_APIC_timer(): enable the local APIC timer
set_cpu_sibling_map(smp_processor_id()):
tsc_sync_wait(): calls sync_tsc(0) to synchronize the TSC
cpu_set(smp_processor_id(), cpu_online_map): mark the current CPU in the global variable cpu_online_map
cpu_idle(): the current AP enters the idle state
(I) while (1) {}: the function is an infinite loop; everything below runs inside the loop body
(II) while (!need_resched()) {}: as long as no reschedule is needed, perform the idle operations below; otherwise call schedule() to run other processes. Inner loop body:
(i) If cpu_is_offline(smp_processor_id()) is true, call play_dead()
(A) idle_task_exit(): if the current active_mm is not init_mm, switch active_mm to init_mm
(B) mmdrop(init_mm): release the init_mm structure
(C) Set the current CPU state to CPU_DEAD
(D) Enter a loop that executes safe_halt(), i.e. the assembly statement ("sti; hlt" ::: "memory")
(ii) If pm_idle is not NULL, call pm_idle(); otherwise call default_idle()
D while (!cpu_isset(i, cpu_online_map)): wait for the AP to finish starting; the AP sets its bit in the global variable cpu_online_map during start-up (see step 18 of the AP start-up sequence)
iii notifier_call_chain(&cpu_chain, CPU_ONLINE, hcpu): notify that the CPU has finished starting and is running normally
II Print a progress message: printk(KERN_INFO "Brought up %ld CPUs\n", (long)num_online_cpus())
III smp_cpus_done(max_cpus): finish SMP AP start-up
i zap_low_mappings(): executed only when the compile option HOTPLUG_CPU is off; clears the entry for virtual address 0 in the page table init_level4_pgt, so that init_mm can no longer access user space, and calls flush_tlb_all() to flush all TLBs
ii smp_cleanup_boot(): clear CMOS register 0xF and the physical address 467h
iii When the compile option HOTPLUG_CPU is off, also free the first page (1000h) and the SMP trampoline page (6000h)
IV setup_ioapic_dest(): calls set_ioapic_affinity_irq() to set the CPU affinity of each IOAPIC interrupt pin
V time_init_gtod(): set up the time structures and the function pointer do_gettimeoffset, then print a progress message: printk(KERN_INFO "time.c: Using %s based timekeeping.\n", timetype)
VI check_nmi_watchdog(): verify that the NMI watchdog works
i Print a progress message: printk(KERN_INFO "testing NMI watchdog ...")
ii Perform the watchdog check
iii Print a progress message: printk("OK.\n")
94 sched_init_smp(): set up scheduling for the SMP architecture
I arch_init_sched_domains(&cpu_online_map): build the scheduling domains
i check_sibling_maps(): executed only when the compile options SCHED_SMT and NUMA are both enabled; checks that the SMT units inside one CPU belong to the same NODE
ii cpus_andnot(cpu_default_map, cpu_online_map, cpu_isolated_map): compute the mask of usable CPUs, i.e. cpu_default_map = cpu_online_map & ~cpu_isolated_map
iii build_sched_domains(&cpu_default_map): build the scheduling domains for the usable CPUs
A For each CPU in cpu_default_map, perform the following:
a nodemask = node_to_cpumask(cpu_to_node(i)): get the CPU mask of the NODE containing this CPU
b Get the address sd of the i-th CPU's scheduling domain in node_domains
c group = cpu_to_node_group(i): get the NODE group of the i-th CPU
d sd->span = *cpu_map: this scheduling domain spans all usable processors
e sd->groups = &sched_group_nodes[group]
f This scheduling domain requires the compile option NUMA
g p = sd: remember this scheduling domain as the parent for the next level
h Get the address sd of the i-th CPU's scheduling domain in phys_domains
i group = cpu_to_phys_group(i): get the CPU-package group of the i-th CPU
j sd->span = nodemask: this scheduling domain spans only the CPUs of this NODE
k sd->parent = p: the parent of this scheduling domain is the NODE scheduling domain set up above
l sd->groups = &sched_group_phys[group]: set the scheduling group of this domain
m p = sd
n Get the address sd of the i-th CPU's scheduling domain in cpu_domains
o group = cpu_to_cpu_group(i): get the in-core group of the i-th CPU
p sd->span = cpu_sibling_map[i]: this scheduling domain spans only the CPUs of this core
q sd->parent = p: the parent of this scheduling domain is the CPU-package scheduling domain set up above
r sd->groups = &sched_group_cpus[group]: set the scheduling group of this domain
s This scheduling domain requires the compile option SCHED_SMT
B For each CPU above, call init_sched_build_groups(sched_group_cpus, this_sibling_map): link the sched_group_cpus entries of the different CPUs in this_sibling_map into a circular linked list
C For every NODE in the system, link the sched_group_phys entries of the different CPUs belonging to the same NODE into a circular linked list
D Link the sched_group_nodes entries of all online CPUs into a circular linked list
E Set the computing capacity of every CPU in the system, i.e. groups->cpu_power in cpu_domains, phys_domains and node_domains
F For each online CPU, call cpu_attach_domain(sd, i): set sd as that CPU's current scheduling domain
II hotcpu_notifier(update_sched_domains, 0): register the notifier block that updates the scheduling domains on CPU hotplug
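The circular group lists built above can be illustrated with a small stand-alone sketch. It is a simplified model, assuming CPU masks fit in one unsigned long and that a caller-supplied table group_of[] plays the role of the cpu_to_*_group() helpers; the type and function names are hypothetical and this is not the kernel's init_sched_build_groups().

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 8

/* Hypothetical, simplified scheduling group. */
struct sched_group_sketch {
    struct sched_group_sketch *next; /* circular list link */
    unsigned long cpumask;           /* CPUs covered by this group */
};

/*
 * For every CPU in 'span', find its group, collect all CPUs of the span
 * that map to the same group, and chain the groups into a circular list.
 * Returns the first group, or NULL if the span is empty.
 */
static struct sched_group_sketch *
build_groups(struct sched_group_sketch *groups, unsigned long span,
             const int *group_of)
{
    struct sched_group_sketch *first = NULL, *last = NULL;
    unsigned long covered = 0;

    for (int i = 0; i < NR_CPUS; i++) {
        /* Skip CPUs outside the span or already placed in a group. */
        if (!(span & (1UL << i)) || (covered & (1UL << i)))
            continue;

        struct sched_group_sketch *sg = &groups[group_of[i]];
        sg->cpumask = 0;
        for (int j = 0; j < NR_CPUS; j++) {
            if ((span & (1UL << j)) && group_of[j] == group_of[i])
                sg->cpumask |= 1UL << j;
        }
        covered |= sg->cpumask;

        if (!first)
            first = sg;
        if (last)
            last->next = sg;
        last = sg;
    }
    if (last)
        last->next = first; /* close the ring */
    return first;
}
```

Walking next from any group eventually returns to the starting group, which is what lets a balancing pass visit every group of a domain from whichever group it starts in.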
95 cpuset_init_smp(): set the global variable top_cpuset.cpus_allowed to cpu_online_map and top_cpuset.mems_allowed to node_online_map
96 populate_rootfs():
I unpack_to_rootfs(__initramfs_start, __initramfs_end - __initramfs_start, 0): unpack the initramfs
II Set up the initrd (more content to come)
97 do_basic_setup()
I init_workqueues():
i hotcpu_notifier(workqueue_cpu_callback, 0): register the CPU hotplug notifier block workqueue_cpu_callback
ii keventd_wq = create_workqueue("events"): create the work queue keventd_wq, with one worker thread per CPU
II usermodehelper_init(): simply executes khelper_wq = create_singlethread_workqueue("khelper"), creating a helper kernel thread
III driver_init():
i devices_init(): calls subsystem_register(&devices_subsys) to register the device subsystem devices_subsys
ii buses_init(): calls subsystem_register(&bus_subsys) to register the bus subsystem bus_subsys
iii classes_init(): subsystem_register(&class_subsys) registers the class subsystem class_subsys; subsystem_init(&class_obj_subsys) initializes the subsystem class_obj_subsys
iv firmware_init(): calls subsystem_register(&firmware_subsys) to register the firmware subsystem firmware_subsys
v platform_bus_init(): calls device_register(&platform_bus) and bus_register(&platform_bus_type)
vi system_bus_init():
vii cpu_dev_init(): calls sysdev_class_register(&cpu_sysdev_class)
viii attribute_container_init(): initialize the global list attribute_container_list
IV sysctl_init(): requires the compile options SYSCTL and PROC_FS
i register_proc_table(root_table, proc_sys_root): register one proc entry for each entry of the table root_table
ii init_irq_proc(): register the directory /proc/irq and, for each active interrupt vector (irq_desc[irq].handler != &no_irq_type), the file /proc/irq/xx/smp_affinity, through which the affinity of that interrupt vector can be set
V sock_init(): initialize the network protocol layer (BSD sockets)
i sk_init(): set the related global variables according to the memory size
ii skb_init(): initialize the dedicated sk_buff cache, skbuff_head_cache = kmem_cache_create("skbuff_head_cache", sizeof(struct sk_buff), ...)
iii init_inodecache(): initialize the dedicated cache sock_inode_cachep = kmem_cache_create("sock_inode_cache", sizeof(struct socket_alloc), ...)
iv register_filesystem(&sock_fs_type): register the sockfs file system
v sock_mnt = kern_mount(&sock_fs_type): mount the sockfs file system inside the kernel
vi netfilter_init(): requires the compile option NETFILTER; initializes network filtering, i.e. the lists nf_hooks[i][j]
VI do_initcalls(): run the initialization functions located between __initcall_start and __initcall_end, i.e. those defined with the macros core_initcall(fn), postcore_initcall(fn), arch_initcall(fn), subsys_initcall(fn), fs_initcall(fn), device_initcall(fn), late_initcall(fn) and __initcall(fn)
98 If the file /init exists, it is run as the user-space initialization program; otherwise prepare_namespace() is called. Usually /init does not exist, so prepare_namespace() has to be executed
I mount_devfs(): calls sys_mount("devfs", "/dev", "devfs", 0, NULL) to mount devfs
II md_run_setup(): set up RAID
i create_dev("/dev/md0", MKDEV(MD_MAJOR, 0), "md/0"): create the device /dev/md0
ii md_setup_drive():
III initrd_load(): read and use the initrd
IV mount_root()
99 free_initmem(): release the memory used during initialization
100 numa_default_policy(): set the NUMA scheduling policy
101 sys_open("/dev/console"): open the console
102 sys_dup(0), sys_dup(0): set up standard output and standard error
103 run_init_process("/sbin/init"): start the init process

2 File system
2.1 Mount operation
2.1.1 Physical file systems (ext2 as an example)
1 module_init(init_ext2_fs): begin installing the ext2 file system
I init_ext2_xattr(): requires the compile option EXT2_FS_XATTR; calls ext2_xattr_cache = mb_cache_create("ext2_xattr", ...) to initialize the file system's dedicated metadata cache, the Filesystem Meta Information Block Cache (mbcache)
II init_inodecache(): initialize the dedicated cache ext2_inode_cachep = kmem_cache_create("ext2_inode_cache", sizeof(struct ext2_inode_info), ...)
III register_filesystem(fs = &ext2_fs_type): register the ext2 file system
i Initialize the list fs->fs_supers to empty
ii file_system_type **p = find_filesystem(fs->name): search the singly linked list headed by file_systems for an already registered file system named "ext2"; if there is none, return the address of the next field of the last node
iii *p = fs: add the structure ext2_fs_type to the list of file system types
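The pointer-to-pointer search used by find_filesystem() and the final *p = fs store can be sketched as follows. This is a simplified stand-alone model with hypothetical names (fs_type_sketch, find_fs, register_fs); locking and the kernel's actual error codes are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced file system type: only name and list link. */
struct fs_type_sketch {
    const char *name;
    struct fs_type_sketch *next;
};

static struct fs_type_sketch *file_systems_sketch; /* list head */

/*
 * Return the address of the pointer that refers to the entry named
 * 'name', or the address of the terminating NULL 'next' pointer if no
 * such entry exists.
 */
static struct fs_type_sketch **find_fs(const char *name)
{
    struct fs_type_sketch **p;

    for (p = &file_systems_sketch; *p; p = &(*p)->next)
        if (strcmp((*p)->name, name) == 0)
            break;
    return p;
}

/* Append 'fs' unless a type with the same name is already registered. */
static int register_fs(struct fs_type_sketch *fs)
{
    struct fs_type_sketch **p = find_fs(fs->name);

    if (*p)
        return -1;      /* already registered */
    fs->next = NULL;
    *p = fs;            /* works for an empty list and for the tail alike */
    return 0;
}
```

Returning the address of the link rather than the node means the empty-list case and the append-at-tail case need no special handling.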
NOTE:
1 If the file is built as a module, i.e. the compile option MODULE is defined, module_init(initfn) is defined as:
static inline initcall_t __inittest(void) { return initfn; } // defines a function that returns the address of initfn
int init_module(void) __attribute__((alias(#initfn))); // defines the function init_module as an alias of initfn
2 If the file is built into the kernel, i.e. the compile option MODULE is not defined, module_init(initfn) is defined as "__initcall(initfn)", "__initcall(initfn)" is defined as "device_initcall(initfn)", and "device_initcall(initfn)" is finally defined as __define_initcall("6", initfn). In the end do_initcalls() executes the functions in the region [__initcall_start, __initcall_end]
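How do_initcalls() walks the region between __initcall_start and __initcall_end can be sketched in portable C. In the kernel the table is assembled by the linker from the per-level sections; here an ordinary array stands in for it, and all function names are hypothetical. The recorded numbers mirror the level strings, with device_initcall at level "6" as stated in the note.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*initcall_t)(void);

static int init_order[4]; /* levels in the order they actually ran */
static int init_count;

/* Hypothetical init functions, one per initcall level used here. */
static int core_init(void)   { init_order[init_count++] = 1; return 0; }
static int subsys_init(void) { init_order[init_count++] = 4; return 0; }
static int device_init(void) { init_order[init_count++] = 6; return 0; }

/* Stand-in for the linker-built table, already sorted by level. */
static initcall_t initcall_table[] = { core_init, subsys_init, device_init };

/* Call every function in [start, end) and count non-zero results. */
static int run_initcalls(initcall_t *start, initcall_t *end)
{
    int failures = 0;

    for (initcall_t *call = start; call < end; call++)
        if ((*call)() != 0)
            failures++;
    return failures;
}
```

Calling run_initcalls(initcall_table, initcall_table + 3) runs the three functions in level order, which is why a built-in driver registered through module_init() runs after the core and subsystem initializers.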
2 The command "mount -t ext2 device dir" mounts a device holding an ext2 file system by executing the sys_mount system call
3 sys_mount():
I copy_mount_options(): copy the parameters from user space to kernel space; each parameter occupies one page
II getname(dir_name): copy the path name of the mount point
i tmp = __getname() = kmem_cache_alloc(names_cachep, SLAB_KERNEL): allocate memory from the cache names_cachep, whose objects were sized PATH_MAX (4KB) when the SLAB cache was initialized
ii do_getname(dir_name, tmp): check the address bounds so that illegal content cannot enter kernel space, then call strncpy_from_user() to perform the copy
III lock_kernel(): disable kernel preemption
IV do_mount(): perform the mount
V unlock_kernel(): allow kernel preemption again

do_mount(dev_name, dir_name, type_page, flags, data_page): the body of the mount operation
I Validate the input parameters
II Analyze the flags parameter
III path_lookup(dir_name, LOOKUP_FOLLOW, &nd): look up the nameidata structure of the target mount point
IV do_remount(): remount; the command carries the MS_REMOUNT flag
V do_loopback(): mount a loopback device; the command carries the MS_BIND flag
VI do_move_mount(): move a mount; the command carries the MS_MOVE flag
VII do_new_mount(&nd, type_page, flags, mnt_flags, dev_name, data_page): perform a new mount
VIII path_release(&nd): release nd
NOTE: description of struct nameidata:
struct dentry *dentry: points to the dentry structure of the target
struct vfsmount *mnt: points to the mount on which the target resides
struct qstr last: used for lookup
do_new_mount(): mount a new device
I Verify the input parameters
II capable(CAP_SYS_ADMIN): check that the process has administrator privileges
III mnt = do_kern_mount(type_page, flags, dev_name, data_page): perform the kernel mount
IV do_add_mount(mnt, nd, ...): add the mount to the mount tree of the namespace
i while (d_mountpoint(nd->dentry) && follow_down(&nd->mnt, &nd->dentry)):
A d_mountpoint(nd->dentry): returns nd->dentry->d_mounted, i.e. the number of file systems already mounted on this dentry's directory
B follow_down: step down to the root of the device already mounted here; the while loop then tests the new mount point in turn, stopping when it reaches the root of a device on which nothing further is mounted
ii if (nd->mnt->mnt_sb == newmnt->mnt_sb && nd->mnt->mnt_root == nd->dentry) goto fail: the same file system device cannot be mounted twice on the same directory
iii mnt->mnt_namespace = current->namespace
iv graft_tree(mnt, nd)
A if (S_ISDIR(nd->dentry->d_inode->i_mode) != S_ISDIR(mnt->mnt_root->d_inode->i_mode)) return: the mount point and the mounted root must both be directories or both be non-directories
B The operation continues only if nd->dentry is a root directory or its d_flags do not carry DCACHE_UNHASHED; a file system cannot be mounted on a directory marked DCACHE_UNHASHED
C attach_mnt(mnt, nd): link the mount to the mount point
a mnt->mnt_parent = mntget(nd->mnt)
b mnt->mnt_mountpoint = dget(nd->dentry)
c list_add(&mnt->mnt_hash, mount_hashtable + hash(..)): add the mount to the global mount hash table
d list_add_tail(&mnt->mnt_child, &nd->mnt->mnt_mounts): add the mount to the list of child mounts of the mount that contains the mount point
e nd->dentry->d_mounted++: increment the mounted count of the mount-point directory
D list_add_tail(&head, &mnt->mnt_list): add a temporary head node to mnt->mnt_list
E list_splice(&head, current->namespace->list.prev): splice the list that head belongs to (i.e. mnt->mnt_list) into current->namespace->list.prev
v mntput(mnt): release the reference count. At this point the mount is complete

do_kern_mount(type_page, flags, dev_name, data_page): perform the mount inside the kernel
I type = get_fs_type(fstype):
i type = find_filesystem(fstype): search the registered file systems (the singly linked list headed by file_systems) for a file system named fstype
ii If that fails, try to load the file system module fstype and call find_filesystem again
II mnt = alloc_vfsmnt(dev_name): allocate a vfsmount structure from the dedicated cache mnt_cache and initialize it; allocate memory to hold the device name dev_name and point mnt->mnt_devname at it
III sb = type->get_sb(): call the file system's superblock read function. For ext2, ext2_fs_type defines get_sb as ext2_get_sb(), which directly calls get_sb_bdev(fs_type, flags, dev_name, data, ext2_fill_super)
IV Set up the vfsmount structure mnt:
i mnt->mnt_sb = sb;
ii mnt->mnt_root = sb->s_root;
iii mnt->mnt_mountpoint = sb->s_root: indicates that this is a root mount
iv mnt->mnt_parent = mnt: indicates that this is a root mount
V put_filesystem(type): release the file system type
VI return mnt
7 get_sb_bdev(..., ext2_fill_super): set up the superblock structure


I bdev = open_bdev_excl(dev_name, flags, fs_type): open the block device dev_name with the file system fs_type as its holder
i bdev = lookup_bdev(dev_name)
A path_lookup(dev_name, ..., &nd): look up the nameidata structure of the device dev_name
B inode = nd.dentry->d_inode: the inode structure of the device dev_name
C bdev = bd_acquire(inode): get the block_device structure bdev, the other member of the bdev_inode structure that this inode belongs to
a If inode->i_bdev is not NULL and its inode is still available, simply return inode->i_bdev
b Otherwise call bdev = bdget(inode->i_rdev):
(1) inode = iget5_locked(bd_mnt->mnt_sb, hash(dev), bdev_test, bdev_set, &i_rdev): get the device's inode from the bdev (bd_type) file system
(I) head = inode_hashtable + hash(sb, hashval): locate the hash-table head for the block device inode in the inode hash table; the inode in question is bdev_inode.vfs_inode
(II) inode = ifind(sb, head, bdev_test, ...): calls find_inode() to look for the target inode; if the lookup fails, get_new_inode() is called to create the inode
(III) get_new_inode(): create the inode
(i) inode = alloc_inode(sb): calls sb->s_op->alloc_inode = bdev_alloc_inode(), which allocates a bdev_inode structure ei from the dedicated cache bdev_cachep and returns its member ei->vfs_inode as the VFS inode; the inode then receives its basic initialization, such as inode->i_mapping = &inode->i_data
(ii) list_add(&inode->i_list, &inode_in_use): add the inode to the inode_in_use list
(iii) list_add(&inode->i_sb_list, &sb->s_inodes): add the inode to the list bd_mnt->mnt_sb->s_inodes
(2) bdev = &BDEV_I(inode)->bdev: get the address of the block_device structure that shares a bdev_inode with the inode
(3) bdev->bd_inode = inode
(4) inode->i_rdev = dev: record the device number of the target device in the inode that represents the block device
(5) inode->i_bdev = bdev: the block device
(6) inode->i_data.a_ops = &def_blk_aops
(7) inode->i_data.backing_dev_info = &default_backing_dev_info
(8) list_add(&bdev->bd_list, &all_bdevs): record the block device
Note 1: the inode here represents a block device. It is one member of a bdev_inode structure, and it and the structure's other member, the block_device, point to each other (bdev->bd_inode, inode->i_bdev). The inode is also added to the system's inode hash table.
Note 2: Block device management: block devices are managed through a virtual file system, bdev (bd_type). This file system is registered by start_kernel() through vfs_caches_init() and bdev_cache_init(), and is mounted inside the kernel (kern_mount(), do_kern_mount()). Each block device corresponds to one bdev_inode structure, which is allocated and released through the virtual file system bdev. The structure's member vfs_inode is added to the inode hash table as the device's inode and is accessed in the usual way; the functions struct bdev_inode *BDEV_I(inode) and struct block_device *I_BDEV(inode) take bdev_inode.vfs_inode as their input and return the address of the bdev_inode and the address of its other member, bdev, respectively.
struct bdev_inode {
struct block_device bdev;
struct inode vfs_inode;
}
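The step from the embedded vfs_inode back to its enclosing bdev_inode, as BDEV_I() and I_BDEV() do, is plain pointer arithmetic on the member offset. The sketch below is a stand-alone illustration with reduced, hypothetical structure definitions (the _sketch types); only the offset technique is the point.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-ins for the kernel structures. */
struct block_device_sketch { int bd_dev; };
struct inode_sketch { int i_ino; };

struct bdev_inode_sketch {
    struct block_device_sketch bdev;
    struct inode_sketch vfs_inode;
};

/* Recover the enclosing structure from a pointer to its vfs_inode. */
static struct bdev_inode_sketch *bdev_i(struct inode_sketch *inode)
{
    return (struct bdev_inode_sketch *)
        ((char *)inode - offsetof(struct bdev_inode_sketch, vfs_inode));
}

/* Reach the sibling member bdev through the enclosing structure. */
static struct block_device_sketch *i_bdev(struct inode_sketch *inode)
{
    return &bdev_i(inode)->bdev;
}
```

This is why the VFS can hand around a generic inode pointer while the block layer still reaches its own data without any extra lookup.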
c inode->i_bdev = bdev. Note: here inode is the inode of the device name (e.g. /dev/hda1)
d inode->i_mapping = bdev->bd_inode->i_mapping; that mapping still has the initial value set by alloc_inode, with no other initialization
e list_add(&inode->i_devices, &bdev->bd_inodes): add the inode that represents the device path to the list of inodes managed by the device
ii blkdev_get(bdev, mode, 0): performs do_open(bdev, &fake_file) using the local variables fake_file and fake_dentry
disk = get_gendisk(bdev->bd_dev, &part)
To be continued
iii bd_claim(bdev, holder):
II s = sget(fs_type, test_bdev_super, set_bdev_super, bdev): allocate a superblock structure; the function set_bdev_super sets s->s_bdev = bdev and s->s_dev = s->s_bdev->bd_dev
III Call ext2_fill_super(fs_type, ...): enter the ext2 file system's own interface
8 ext2_fill_super(struct super_block *sb, void *data, int silent): read the superblock from the device, fill in the superblock structure and set the superblock operations:
I Allocate an ext2_sb_info structure sbi, zero it, and set sb->s_fs_info = sbi
II sb_block = get_sb_block(&data): set the location of the superblock; if the parameter data contains the string "sb=xx", set sb_block = xx, otherwise use the default of 1
III Set the logical block number logic_sb_block of the superblock location
IV bh = sb_bread(sb, logic_sb_block): read the superblock by directly calling __bread(sb->s_bdev, block, sb->s_blocksize):
i bh = __getblk(bdev, block, size): (to be continued) find the target bh in the Buffer Cache
A bh = __find_get_block(bdev, block, size)
a bh = lookup_bh_lru(bdev, block, size):
(1) bh_lru_lock(): a macro, defined as local_irq_disable() when SMP is enabled and as preempt_disable() otherwise
(2) lru = &__get_cpu_var(bh_lrus): get the address of the current CPU's Buffer Cache LRU queue
(3) Search the LRU array (BH_LRU_SIZE = 8 entries) for the target Buffer Cache block, i.e. the one with bh->b_bdev == bdev && bh->b_blocknr == block && bh->b_size == size; a block that is found is moved to entry 0 of the LRU array
(4) bh_lru_unlock(): a macro that releases the lock
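The search-and-move-to-front step of lookup_bh_lru() can be sketched on a small per-CPU array. This is a simplified model with hypothetical types (bh_sketch, bh_lru_sketch); locking and reference counting are omitted.

```c
#include <assert.h>
#include <stddef.h>

#define BH_LRU_SIZE 8

/* Reduced buffer head: just the three fields used for matching. */
struct bh_sketch {
    int bdev;
    unsigned long blocknr;
    unsigned size;
};

struct bh_lru_sketch {
    struct bh_sketch *bhs[BH_LRU_SIZE];
};

/*
 * Scan the array for a matching buffer. On a hit, shift the entries in
 * front of it down by one slot and place the hit in slot 0.
 */
static struct bh_sketch *lru_lookup(struct bh_lru_sketch *lru, int bdev,
                                    unsigned long block, unsigned size)
{
    for (int i = 0; i < BH_LRU_SIZE; i++) {
        struct bh_sketch *bh = lru->bhs[i];

        if (bh && bh->bdev == bdev && bh->blocknr == block &&
            bh->size == size) {
            for (; i > 0; i--)
                lru->bhs[i] = lru->bhs[i - 1];
            lru->bhs[0] = bh;
            return bh;
        }
    }
    return NULL;
}
```

Keeping the most recent hit in slot 0 makes repeated lookups of the same block terminate on the first comparison.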


b bh = __find_get_block_slow(bdev, block, size)
(1) bd_mapping = bdev->bd_inode->i_mapping: the address_space structure of the device
(2) index = block >> (PAGE_CACHE_SHIFT - bdev->bd_inode->i_blkbits): convert the target device block number into a Buffer Cache page index
(3) page = find_get_page(bd_mapping, index): calls radix_tree_lookup(&mapping->page_tree, index) to look up the target page in the radix tree of the target device; if it is found, the macro page_cache_get(page), i.e. get_page(page), increments the page reference count
(4) If the previous step returned NULL, exit; otherwise continue
(5) bh = page_buffers(page) = page->private: private points to the head of the page's Buffer Cache list
(6) Check every bh of this page; the one with bh->b_blocknr == block is the target Buffer Cache block. Lock the bh and unlock the page structure
c If bh was found, call bh_lru_install(bh): add bh to the front of the current processor's Buffer Cache LRU queue, which may release the last bh in the queue
d touch_buffer(bh): mark bh as accessed
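The block-number-to-page-index conversion in __find_get_block_slow() is a single shift by the difference between the page shift and the block-size shift. A minimal sketch with hypothetical helper names, assuming the block size is a power of two no larger than the page size:

```c
#include <assert.h>

/* Page-cache index of the page that holds a given device block. */
static unsigned long block_to_page_index(unsigned long block,
                                         unsigned page_shift,
                                         unsigned blkbits)
{
    return block >> (page_shift - blkbits);
}

/* Byte offset of that block inside its page. */
static unsigned long block_offset_in_page(unsigned long block,
                                          unsigned page_shift,
                                          unsigned blkbits)
{
    unsigned long blocks_per_page_mask = (1UL << (page_shift - blkbits)) - 1;

    return (block & blocks_per_page_mask) << blkbits;
}
```

With 4 KB pages (shift 12) and 1 KB blocks (shift 10), four blocks share a page, so block 9 lives in page 2 at byte offset 1024.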

B might_sleep(): a reschedule may take place here
C bh = __getblk_slow(bdev, block, size): allocate a bh in an infinite loop
a Call __find_get_block(bdev, block, size) again to look for bh
b grow_buffers(bdev, block, size): allocate a physical page for the Buffer Cache to cache blocks of the block device, and add it to the radix tree of the device bdev
ii bh = __bread_slow(bh): if the contents of bh are not up to date (possibly because bh was newly allocated), read the target bh from the disk
A lock_buffer(bh): lock bh
B If bh is already up to date, return immediately
C get_bh(bh): increment the reference count of bh
D bh->b_end_io = end_buffer_read_sync: set the I/O completion handler
E submit_bh(READ, bh): submit a read request to the driver; allocates a struct bio, fills it in, and calls submit_bio(READ, bio) to pass the request to the driver
F wait_on_buffer(bh): wait for the read to complete
Comment: Radix tree: the Buffer Cache is managed with a 64-way radix tree whose data structures are shown below. The maximum height is 12. The top level uses bits 63-60 as its index, level 10 uses bits 59-54, and so on down to level 1, which uses bits 5-0. Leaf nodes store the actual data.
struct radix_tree_root {
uint height;
int gfp_mask;
struct radix_tree_node *rnode;
}
struct radix_tree_node {
uint count;
void *slot[MAP_SIZE]; // 64
ulong tags[TAGS][TAG_LONGS]; // 2, 1
}
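How a 64-way tree slices an index into 6-bit slot numbers, one per level, can be sketched as follows. The helper names and the level numbering (level 1 is the lowest level, using bits 5-0) are assumptions made for this illustration; it is not the kernel's radix tree code.

```c
#include <assert.h>
#include <stdint.h>

#define MAP_SHIFT 6
#define MAP_SIZE  (1u << MAP_SHIFT)   /* 64 slots per node */
#define MAP_MASK  (MAP_SIZE - 1)

/* Slot selected at a given level; level 1 consumes bits 5-0. */
static unsigned slot_at_level(uint64_t index, unsigned level)
{
    return (unsigned)((index >> ((level - 1) * MAP_SHIFT)) & MAP_MASK);
}

/* Smallest number of levels whose slots cover every set bit of index. */
static unsigned height_for_index(uint64_t index)
{
    unsigned height = 1;

    while (height * MAP_SHIFT < 64 && (index >> (height * MAP_SHIFT)) != 0)
        height++;
    return height;
}
```

A lookup walks from the highest level in use down to level 1, following slot[slot_at_level(index, level)] at each node; small indices need only a shallow tree.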
V Initialize sbi; call parse_options(data, sbi) to set sbi further according to the parameter data
VI Analyze the superblock that was read and apply the related settings; if the device's actual block size differs from the one originally used for reading, call sb_bread again to re-read the superblock
VII Initialize the other superblock parameters
VIII sb->s_export_op = &ext2_export_ops: only the two methods get_parent and get_dentry
IX sb->s_op = &ext2_sops: defines the inode operation methods, among others:
alloc_inode = ext2_alloc_inode
destroy_inode = ext2_destroy_inode
read_inode = ext2_read_inode
write_inode = ext2_write_inode
X root = iget(sb, ino = EXT2_ROOT_INO): look for the target inode in the global inode hash table; if that fails, allocate a new inode and give it its basic initialization
i iget_locked(sb, ino): get the address of the inode
A head = inode_hashtable + hash(sb, ino): prepare to look up the target inode in the inode hash table
B ifind_fast(sb, head, ino): look up the target in the inode hash table
C inode = get_new_inode_fast(sb, head, ino):
a inode = alloc_inode(sb): calls sb->s_op->alloc_inode(sb), i.e. the function ext2_alloc_inode(sb), which allocates an ext2_inode_info structure ei from the dedicated cache ext2_inode_cachep and returns the address of its member vfs_inode (&ei->vfs_inode) as the newly allocated inode structure. The new inode receives the common initialization
b Add the inode to the list sb->s_inodes and to the global inode hash table
ii Call sb->s_op->read_inode = ext2_read_inode(inode): read the contents of the inode
A ei = EXT2_I(inode): get the address of the ext2_inode_info structure that contains the inode
B raw_inode = ext2_get_inode(inode->i_sb, ino, &bh): read the contents of the target inode from disk
a From the values recorded in the superblock, compute the group that holds the target inode, the offset within that group, and the disk block number
b *bh = sb_bread(sb, block): read the target block; *bh holds the Buffer Cache address
c Compute the offset of the ext2_inode within the Buffer Cache block
d return (struct ext2_inode *)(bh->b_data + offset): return the target address
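The arithmetic behind the group, block and offset computation in ext2_get_inode() can be sketched as below. The function and parameter names are hypothetical; the sketch assumes inode numbers start at 1 and that the caller supplies the first block of the inode table of the group in question (in the kernel that value comes from the group descriptor).

```c
#include <assert.h>

/* Where an on-disk inode lives. */
struct inode_location {
    unsigned long group;   /* block group holding the inode */
    unsigned long block;   /* disk block to read */
    unsigned long offset;  /* byte offset of the inode in that block */
};

static struct inode_location
locate_inode(unsigned long ino, unsigned long inodes_per_group,
             unsigned long inode_size, unsigned long block_size,
             unsigned long inode_table_block)
{
    struct inode_location loc;
    unsigned long index = (ino - 1) % inodes_per_group; /* slot in group */
    unsigned long byte_offset = index * inode_size;     /* within table */

    loc.group  = (ino - 1) / inodes_per_group;
    loc.block  = inode_table_block + byte_offset / block_size;
    loc.offset = byte_offset % block_size;
    return loc;
}
```

For the root inode (EXT2_ROOT_INO, number 2) with 128-byte inodes the result is group 0, the first block of the inode table, and offset 128.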
C inode->i_mode = raw_inode->i_mode; the inode type is determined from inode->i_mode and handled as follows:
D Regular file (S_ISREG(inode->i_mode)):
a inode->i_op = &ext2_file_inode_operations;
b If XIP (execute in place, compile option EXT2_FS_XIP) is supported:
(1) inode->i_mapping->a_ops = &ext2_aops_xip;
(2) inode->i_fop = &ext2_xip_file_operations;
c If the file system does not use buffer heads (mount option "nobh", i.e. EXT2_MOUNT_NOBH is set):
(1) inode->i_mapping->a_ops = &ext2_nobh_aops;
(2) inode->i_fop = &ext2_file_operations;
d Normal file system:
(1) inode->i_mapping->a_ops = &ext2_aops;
(2) inode->i_fop = &ext2_file_operations;
E Directory (S_ISDIR(inode->i_mode)):
a inode->i_op = &ext2_dir_inode_operations;
b inode->i_fop = &ext2_dir_operations;
c If the file system does not use buffer heads (mount option "nobh"), inode->i_mapping->a_ops = &ext2_nobh_aops; otherwise inode->i_mapping->a_ops = &ext2_aops;
F Symbolic link (S_ISLNK(inode->i_mode)):
a Fast symbolic link (fast symlink): inode->i_op = &ext2_fast_symlink_inode_operations
b Other links:
(1) inode->i_op = &ext2_symlink_inode_operations;
(2) If the file system does not use buffer heads (mount option "nobh"), inode->i_mapping->a_ops = &ext2_nobh_aops; otherwise inode->i_mapping->a_ops = &ext2_aops
G Other types (devices, etc.):
a inode->i_op = &ext2_special_inode_operations;
b Compute the device number:
(1) If val = raw_inode->i_block[0] is not 0, the old 16-bit device number is used: rdev = MKDEV((val >> 8) & 255, val & 255), i.e. the high 8 bits are the major number and the low 8 bits the minor number
(2) If raw_inode->i_block[0] is 0, the newer 32-bit device number is used: dev = raw_inode->i_block[1]; major = (dev & 0xfff00) >> 8, minor = (dev & 0xff) | ((dev >> 12) & 0xfff00); rdev = MKDEV(major, minor); i.e. bits 19-8 are the 12-bit major number, and bits 31-20 together with bits 7-0 form the 20-bit minor number
c init_special_inode(inode, inode->i_mode, rdev): set up the inode according to the device type
(1) Character device:
inode->i_fop = &def_chr_fops
inode->i_rdev = rdev
(2) Block device:
inode->i_fop = &def_blk_fops
inode->i_rdev = rdev
(3) FIFO:
inode->i_fop = &def_fifo_fops
(4) SOCK:
inode->i_fop = &bad_sock_fops
(5) Any other type is an error
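The two device-number encodings described above can be checked with a short stand-alone sketch. The MKDEV, MAJOR and MINOR macros are re-defined locally with a 20-bit minor field, and the function names are hypothetical; the bit manipulation follows the formulas given in the text.

```c
#include <assert.h>
#include <stdint.h>

#define MINORBITS 20
#define MKDEV(ma, mi) (((uint32_t)(ma) << MINORBITS) | (uint32_t)(mi))
#define MAJOR(dev) ((unsigned)((dev) >> MINORBITS))
#define MINOR(dev) ((unsigned)((dev) & ((1u << MINORBITS) - 1)))

/* Old format: high 8 bits major, low 8 bits minor. */
static uint32_t decode_old_dev(uint16_t val)
{
    return MKDEV((val >> 8) & 255, val & 255);
}

/* New format: bits 19-8 major, bits 31-20 and 7-0 minor. */
static uint32_t decode_new_dev(uint32_t dev)
{
    unsigned major = (dev & 0xfff00) >> 8;
    unsigned minor = (dev & 0xff) | ((dev >> 12) & 0xfff00);

    return MKDEV(major, minor);
}

/* Choose the format the way the text describes: i_block[0] != 0 is old. */
static uint32_t decode_rdev(uint32_t i_block0, uint32_t i_block1)
{
    return i_block0 ? decode_old_dev((uint16_t)i_block0)
                    : decode_new_dev(i_block1);
}
```

Splitting the minor number around the major field keeps the low 16 bits of the new format laid out like the old 8:8 format for small device numbers.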
Note: XIP : The main value of XIP (eXecute In Place) lies in providing a means of allowing
several copies of a program to be running without duplicating the text segment. Indeed
the text segment can reside in flash memory and need not be copied to the system Ram
at all. This is useful for tasks that have large program bodies with many executable
instances running in the system.
Only the Stack, BSS and data segments of an executable needs to be produced for each
running program. The text segment can then reside in flash memory or, if execution
speed is an issue, then copy the file system to ram first and mount it from there. If
executables in the file system are compiled to support XIP and also flagged in their
headers as XIP they will load and execute with just a single copy of the text segment.
(from http://www.ucdot.org/article.pl?sid=02/08/28/0434210)
H ext2_set_inode_flags(inode): set the flags associated with the inode
sb->s_root = d_alloc_root(root):
i Set up the path name: struct qstr name = {.name = "/", .len = 1}
ii res = d_alloc(NULL, &name): allocate a dentry structure res, set its name and initialize its other members and lists; res->d_op = NULL and res->d_flags = DCACHE_UNHASHED, which means the dentry has not been added to the hash table (the flag is cleared once the dentry is hashed). The ext2 file system never initializes the d_op pointer, so it is always NULL
iii res->d_sb = root->i_sb: set the superblock pointer of the dentry
iv res->d_parent = res: the parent points to the dentry itself, which marks it as a root node
v d_instantiate(res, root): link the dentry and the inode:
A list_add(&res->d_alias, &root->i_dentry): add the dentry's d_alias member to the inode's i_dentry list
B res->d_inode = root: point the dentry's d_inode at the target inode
XI Check the root directory; on failure one of the following error messages is printed: printk(KERN_ERR "EXT2-fs: get root inode failed\n"); printk(KERN_ERR "EXT2-fs: corrupt root inode, run e2fsck\n")

XII

XIII

ext2_setup_super(sb, es, sb->s_flags & MS_RDONLY):
i Check the contents of sb and es and print the relevant messages
ii ext2_write_super(sb): write the superblock
iii If the compile option CONFIG_EXT2_CHECK is enabled and the mount flags contain EXT2_MOUNT_CHECK ("check"), ext2_check_blocks_bitmap(sb) and ext2_check_inodes_bitmap(sb) are executed to verify validity
This completes the execution of the function get_sb_bdev()

2.1.2 Internal virtual file systems

2.1.2.1 bdev file system

This file system is used to manage block devices. It is loaded along the call path start_kernel() → vfs_caches_init() → bdev_cache_init() and is mounted internally by the kernel (kern_mount() → do_kern_mount()). The analysis below starts from the function bdev_cache_init()
1 bdev_cachep = kmem_cache_create("bdev_cache", sizeof(struct bdev_inode), ...): create a dedicated cache for allocating the block device management structure bdev_inode
register_filesystem(&bd_type): register the dedicated file system bdev (bd_type), adding the structure bd_type to the list of registered file systems, file_systems
bd_mnt = kern_mount(type = &bd_type): mount the file system internally; this directly calls do_kern_mount(type->name, 0, type->name, NULL), i.e. the file system type is "bdev" and the name of the device being mounted is also "bdev"
I type = get_fs_type(fstype)
II mnt = alloc_vfsmnt(dev_name): allocate a vfsmount structure from the dedicated cache mnt_cache and initialize it; allocate memory to hold the device name dev_name and point mnt->mnt_devname at it
III sb = type->get_sb(): call the file system's superblock read function. For the bdev file system, bd_type defines get_sb as the function bd_get_sb(), and bd_get_sb() directly calls get_sb_pseudo(fs_type, "bdev:", &bdev_sops, ...):
i s = sget(fs_type, NULL, set_anon_super, NULL): obtain a superblock
A s = alloc_super(): allocate a superblock structure and initialize it
B set_anon_super(s, NULL): use the idr data structure to generate a device ID number and store it in s->s_dev
a idr_pre_get(&unnamed_dev_idr, ...): prepare a node in unnamed_dev_idr
b idr_get_new(&unnamed_dev_idr, NULL, &dev): request a node in unnamed_dev_idr; the node ID is returned in dev and used as the minor number
c s->s_dev = MKDEV(0, dev)
C s->s_type = type: set the file system type this superblock belongs to
D strlcpy(s->s_id, type->name, ...): set the superblock ID to the name of the file system
E list_add_tail(&s->s_list, &super_blocks): add the superblock to the global superblock list
F list_add(&s->s_instances, &type->fs_supers): add the superblock to the list of superblocks belonging to this file system type
ii s->s_op = &bdev_sops; set the other members of the superblock s
iii root = new_inode(sb = s): allocate an inode structure and initialize it
A inode = sb->s_op->alloc_inode(sb): calls bdev_sops.alloc_inode = bdev_alloc_inode(), which calls kmem_cache_alloc() to allocate a bdev_inode structure ei from the dedicated cache bdev_cachep and returns its member vfs_inode as the newly allocated inode, i.e. return &ei->vfs_inode
B Set each member of the inode
C inode->i_data.a_ops = &empty_aops
D inode->i_mapping = &inode->i_data
iv dentry = d_alloc(NULL, &d_name): calls kmem_cache_alloc() to allocate a dentry structure from the dedicated cache dentry_cachep, sets its name to the value of d_name and initializes the other members
v dentry->d_sb = s: set the superblock of the dentry
vi dentry->d_parent = dentry: there is no parent directory; this is already the top-level directory
vii d_instantiate(dentry, root): link the dentry and root: add dentry->d_alias to the list root->i_dentry and set dentry->d_inode = root
viii s->s_root = dentry: set the root dentry of the superblock
IV mnt->mnt_sb = sb; mnt->mnt_root = sb->s_root; mnt->mnt_mountpoint = sb->s_root
V mnt->mnt_parent = mnt: this is a root mount node
VI return mnt: return the mount point structure
blockdev_superblock = bd_mnt->mnt_sb: save the superblock of this file system
Note: in the end the global variable bd_mnt holds the mount point information and the global variable blockdev_superblock holds the superblock information
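The allocation pattern used by bdev_alloc_inode() above, where a file-system-specific structure embeds the generic inode and only the address of the embedded member is handed back, can be sketched in user-space C. All names here (demo_inode, demo_bdev_inode, demo_container_of and the demo_ functions) are hypothetical stand-ins, and calloc() replaces the kernel's slab cache; the sketch only shows how the outer structure is recovered from the inner pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the generic VFS inode. */
struct demo_inode {
    unsigned long i_ino;
};

/* Stand-in for struct bdev_inode: private data plus the embedded inode. */
struct demo_bdev_inode {
    int private_data;
    struct demo_inode vfs_inode;
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Allocate the outer structure, but return only the embedded inode. */
static struct demo_inode *demo_alloc_inode(void)
{
    struct demo_bdev_inode *ei = calloc(1, sizeof(*ei));

    if (!ei)
        return NULL;
    ei->private_data = 42;
    return &ei->vfs_inode;
}

/* Go back from the generic inode to the file-system-specific structure. */
static struct demo_bdev_inode *demo_bdev_i(struct demo_inode *inode)
{
    assert(inode != NULL);
    return demo_container_of(inode, struct demo_bdev_inode, vfs_inode);
}

/* Free the outer structure that contains the inode. */
static void demo_destroy_inode(struct demo_inode *inode)
{
    free(demo_bdev_i(inode));
}
```

The same idea appears to apply to the sockfs and procfs allocators described in the following sections, which return the vfs_inode member of a larger structure.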

2.1.2.2 sockfs file system

This file system is used to manage network connections. It is loaded along the call path start_kernel() → rest_init() → init() → do_basic_setup() → sock_init(), which calls register_filesystem(&sock_fs_type) to register the sockfs file system and then executes sock_mnt = kern_mount(&sock_fs_type) to mount it internally. The registration process was described in the previous section; the following mainly analyzes the execution of kern_mount(&sock_fs_type):
1 sock_mnt = kern_mount(type = &sock_fs_type): mount the file system internally; this directly calls do_kern_mount(type->name, 0, type->name, NULL), i.e. the file system type is "sockfs" and the name of the device being mounted is also "sockfs"
2 sb = type->get_sb(): call the file system's superblock read function. For the sockfs file system, sock_fs_type defines get_sb as the function sockfs_get_sb(), and sockfs_get_sb() directly calls get_sb_pseudo(fs_type, "socket:", &sockfs_ops, ...)
3 Compared with mounting the bd_type file system, the only difference is the superblock operations interface sockfs_ops; the rest of the process is identical. During the mount only alloc_inode = sock_alloc_inode is used; the other steps do not depend on super_operations. sock_alloc_inode works as follows: it calls kmem_cache_alloc() to allocate a socket_alloc structure from the dedicated cache sock_inode_cachep, initializes it, and returns socket_alloc.vfs_inode as the VFS inode structure. Once the inode has been handed over to inode management, the enclosing socket_alloc structure, and through it the other member of that structure, socket, can later be found from the inode.
4 Finally, the global variable sock_mnt holds the vfsmount structure with the mount point information.

2.1.2.3 procfs file system

The loading path is start_kernel() → proc_root_init(). It first calls register_filesystem(&proc_fs_type) to register the procfs file system and then executes proc_mnt = kern_mount(&proc_fs_type) to mount it internally. The registration process was described in the previous section. The following mainly analyzes how the execution of kern_mount(&proc_fs_type) differs from mounting the bd_type file system:
1 sb = type->get_sb(): call the file system's superblock read function. For the proc file system, proc_fs_type defines get_sb as the function proc_get_sb(), and proc_get_sb() directly calls get_sb_single(fs_type, flags, data, proc_fill_super), which executes as follows:
I s = sget(fs_type, compare_single, set_anon_super, ...): essentially the same as when mounting the bdev file system
II proc_fill_super(s, data, ...): fill in the superblock s and set up the root inode
i s->s_op = &proc_sops: set the superblock operations interface
ii root_inode = proc_get_inode(s, ino = PROC_ROOT_INO, de = &proc_root)
A de_get(de): increase the reference count
B inode = iget(sb, ino): standard inode allocation and initialization
a inode = iget_locked(sb, ino): allocate the inode
(1) head = inode_hashtable + hash(sb, ino): prepare to search the inode hash table
(2) inode = ifind_fast(sb, head, ino): proc has only just been initialized, so this search is bound to fail
(3) get_new_inode_fast(sb, head, ino): allocate an inode and initialize it
(I) inode = alloc_inode(sb): calls sb->s_op->alloc_inode = proc_alloc_inode() to allocate the inode. proc_alloc_inode() allocates a proc_inode structure ei, initializes the other members of ei, and returns its member vfs_inode as the standard inode. Basic initialization is then completed; the process is similar to that of the bdev file system
(II) Set the related pointer members of the inode
b sb->s_op->read_inode(inode): calls proc_read_inode() to fill in the inode; this function actually only sets the inode's time fields to the current time
c unlock_new_inode(inode): unlock the inode so that it can be used normally
C PROC_I(inode)->pde = de: set the pde member of the proc_inode that contains the inode, i.e. pde = &proc_root
D Set the inode parameters:
a inode->i_size = de->size
b inode->i_nlink = de->nlink (= 2)
c inode->i_op = de->proc_iops = &proc_root_inode_operations
d inode->i_fop = de->proc_fops = &proc_root_operations
iii s->s_root = d_alloc_root(root_inode)
A res = d_alloc(NULL, &name): allocate a dentry structure, set its path name and initialize the associated lists
B res->d_sb = root_inode->i_sb
C res->d_parent = res: mark it as a root directory
D d_instantiate(res, root_inode): establish the link between the dentry and the inode
III do_remount_sb(s, flags, data, 0): to be continued

2.2 open
The kernel entry function is sys_open(const char __user *filename, int flags, int mode); the analysis starts here
1 tmp = getname(filename)
I tmp = __getname(): a macro defined as kmem_cache_alloc(names_cachep, ...), which allocates memory from the dedicated cache
II do_getname(filename, page = tmp):
i If get_fs() = current_thread_info()->addr_limit is not equal to KERNEL_DS, the access comes from a user-space process and the address range must be checked as follows:
A If filename >= TASK_SIZE (= TASK_SIZE64 = 8000_0000_0000h - 1000h), the file name lies in kernel space and the error -EFAULT is returned immediately
B If the distance from filename to TASK_SIZE is less than PATH_MAX (4096), the maximum length of the file name is set to len = TASK_SIZE - filename, which ensures that no kernel-space data is copied
ii strncpy_from_user(page, filename, len)
A access_ok(VERIFY_READ, src, 1): a macro that tests whether the target data lies entirely within the user address space of the process. It is defined as (__range_not_ok(addr, size) == 0); __range_not_ok(addr, size) is implemented with assembly instructions and returns 0 if addr + size does not exceed current_thread_info()->addr_limit.seg, otherwise 1.
B __do_strncpy_from_user(page, filename, len, res): uses assembly instructions to copy data from filename in the user address space to page in the kernel address space, at most len bytes; the error code is stored in res
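The address-range check in do_getname() above can be modelled in plain C. This is a sketch only: demo_name_limit() and the DEMO_ constants are illustrative names, addresses are treated as plain 64-bit integers, and the TASK_SIZE and PATH_MAX values are the ones quoted in the text.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Values quoted in the text for x86_64. */
#define DEMO_TASK_SIZE (UINT64_C(0x800000000000) - UINT64_C(0x1000))
#define DEMO_PATH_MAX  UINT64_C(4096)

/*
 * Return the maximum number of bytes that may be copied from the user
 * address `filename`, or -EFAULT if the address is not in user space.
 */
static int64_t demo_name_limit(uint64_t filename)
{
    uint64_t len = DEMO_PATH_MAX;

    if (filename >= DEMO_TASK_SIZE)
        return -EFAULT;
    /* Near the top of user space, clamp so the copy cannot cross TASK_SIZE. */
    if (DEMO_TASK_SIZE - filename < DEMO_PATH_MAX)
        len = DEMO_TASK_SIZE - filename;

    assert(len <= DEMO_PATH_MAX);
    return (int64_t)len;
}
```

A name that starts 100 bytes below the limit is therefore copied with len = 100 rather than 4096.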

fd = get_unused_fd():
I Get the files_struct structure of the current process: files = current->files
II fd = find_next_zero_bit(files->open_fds->fds_bits, ...): find a free bit in the bitmap of open files
III expand_files(files, fd):
i If fd exceeds the size of the bitmaps (files->max_fdset), call expand_fdset(files, fd) to enlarge the bitmaps files->open_fds and files->close_on_exec
ii If fd exceeds the number of file slots (files->max_fds), call expand_fd_array(files, fd) to enlarge the file pointer array files->fd
IV FD_SET(fd, files->open_fds): set bit fd in the open_fds bitmap, indicating that this file number is no longer free
V FD_CLR(fd, files->close_on_exec): clear bit fd in the close_on_exec bitmap, which means the file does not have to be closed when the current process runs an executable through the exec() system call; this bit can be changed with the ioctl() system call
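The bitmap bookkeeping of get_unused_fd() above can be sketched with a fixed-size table. The structure and function names below are hypothetical, the table is limited to 64 descriptors, and the expansion step (expand_files) is left out; only the find-free-bit, set open_fds, clear close_on_exec sequence is shown.

```c
#include <assert.h>
#include <limits.h>

#define DEMO_MAX_FDS 64U
#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_WORDS ((DEMO_MAX_FDS + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

/* Simplified stand-in for the two bitmaps kept in files_struct. */
struct demo_files {
    unsigned long open_fds[DEMO_WORDS];
    unsigned long close_on_exec[DEMO_WORDS];
};

static int demo_test_bit(const unsigned long *map, unsigned int nr)
{
    return (map[nr / DEMO_BITS_PER_LONG] >> (nr % DEMO_BITS_PER_LONG)) & 1UL;
}

static void demo_set_bit(unsigned long *map, unsigned int nr)
{
    map[nr / DEMO_BITS_PER_LONG] |= 1UL << (nr % DEMO_BITS_PER_LONG);
}

static void demo_clear_bit(unsigned long *map, unsigned int nr)
{
    map[nr / DEMO_BITS_PER_LONG] &= ~(1UL << (nr % DEMO_BITS_PER_LONG));
}

/* Return the lowest free descriptor, or -1 when the fixed table is full. */
static int demo_get_unused_fd(struct demo_files *files)
{
    unsigned int fd;

    for (fd = 0; fd < DEMO_MAX_FDS; fd++) {
        if (!demo_test_bit(files->open_fds, fd)) {
            demo_set_bit(files->open_fds, fd);        /* no longer free */
            demo_clear_bit(files->close_on_exec, fd); /* survives exec by default */
            return (int)fd;
        }
    }
    return -1;
}

/* Give a descriptor number back to the pool. */
static void demo_put_unused_fd(struct demo_files *files, unsigned int fd)
{
    assert(fd < DEMO_MAX_FDS);
    demo_clear_bit(files->open_fds, fd);
}
```

Because the search always starts from the lowest bit, a released descriptor number is the first one to be reused.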


f = filp_open(tmp, flags, mode)
I open_namei(filename, ..., &nd)
i ACC_MODE(x): a macro defined as ("\000\004\002\006"[(x) & O_ACCMODE]). The constant O_ACCMODE is defined as 3, so for x = 0, 1, 2, 3 the macro ACC_MODE(x) yields 0, 4, 2, 6 respectively, i.e. the numeric values of the characters in the string
ii If flag does not contain O_CREAT, call path_lookup(pathname, lookup_flags(flag) | LOOKUP_OPEN, nd), then go to step xi, may_open
iii Otherwise call path_lookup(pathname, LOOKUP_PARENT | LOOKUP_OPEN | LOOKUP_CREATE, nd), then continue
iv If nd->last_type is not LAST_NORM, or nd->last.name[nd->last.len] is not 0, it is an error
v path.dentry = __lookup_hash(name = &nd->last, base = nd->dentry, nd): look up the target file in the parent directory. At this point nd refers to the parent directory of the target file/directory, and the result path.dentry points to the dentry structure of the target file/directory
A inode = base->d_inode
B permission(inode, MAY_EXEC, nd): permission check on the parent directory, i.e. test the permission bits in inode->i_mode
C dentry = cached_lookup(base, name, nd): look up the target file/directory name in the directory base
a dentry = __d_lookup(base, name):
(1) head = d_hash(base, name->hash): locate the list head in the global dentry hash table dentry_hashtable
(2) Search all nodes on the list head for a dentry whose hash key matches (dentry->d_name.hash == name->hash), whose parent directory matches (dentry->d_parent == base), whose name length matches (dentry->d_name.len == name->len) and whose name matches (the contents of dentry->d_name.name equal those of name->name). On success, increase the reference count of the dentry and return it; otherwise return NULL
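The four comparisons made while walking a hash chain in __d_lookup() can be illustrated with a simplified singly linked chain. The structures and the function demo_d_lookup() are hypothetical reductions that keep only the fields named above; locking, the real hash function and the kernel's list types are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reduced versions of struct qstr and struct dentry. */
struct demo_qstr {
    const char *name;
    unsigned int len;
    unsigned int hash;
};

struct demo_dentry {
    struct demo_dentry *d_parent;
    struct demo_qstr d_name;
    struct demo_dentry *d_hash_next; /* next entry on the same hash chain */
    int d_count;                     /* reference count */
};

/* Walk one hash chain and return the matching dentry, or NULL. */
static struct demo_dentry *demo_d_lookup(struct demo_dentry *head,
                                         struct demo_dentry *base,
                                         const struct demo_qstr *name)
{
    struct demo_dentry *d;

    assert(name != NULL);
    for (d = head; d != NULL; d = d->d_hash_next) {
        if (d->d_name.hash != name->hash)   /* cheapest test first */
            continue;
        if (d->d_parent != base)            /* same directory? */
            continue;
        if (d->d_name.len != name->len)     /* same length? */
            continue;
        if (memcmp(d->d_name.name, name->name, name->len) != 0)
            continue;                       /* same bytes? */
        d->d_count++;                       /* caller now holds a reference */
        return d;
    }
    return NULL;
}
```

Ordering the tests from cheapest to most expensive means the byte comparison only runs for entries that already agree on hash, parent and length.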
b d_lookup(base, name): look up the target dentry structure again
(1) seq = read_seqbegin(&rename_lock): returns the value of rename_lock.sequence
(2) dentry = __d_lookup(base, name): look up the target dentry structure again
(3) read_seqretry(&rename_lock, seq): if the return value is not 0, repeat from step (1)
D new = d_alloc(base, name): if the cached_lookup lookup fails, allocate a new dentry structure and initialize it
E dentry = inode->i_op->lookup(dir = inode, new, nd): look up the target in the current inode and fill in the dentry structure new; for the ext2 file system this is the function ext2_lookup()
a If dentry->d_name.len > EXT2_NAME_LEN (255), return -ENAMETOOLONG: the name is too long, since the ext2 file system supports names of at most 255
b ino = ext2_inode_by_name(dir, new): find the inode number of the target described by the dentry new
(1) de = ext2_find_entry(dir, new, &page)
(I) ei = EXT2_I(dir): get the ext2_inode_info structure that contains the inode dir
(II) ext2_get_page(dir, n)
(i) page = read_cache_page(mapping, n, mapping->a_ops->readpage, NULL): for the ext2 file system the readpage function is ext2_readpage
(A) page = __read_cache_page(mapping, index, filler = ext2_readpage, NULL)
(a) page = find_get_page(mapping, index): calls radix_tree_lookup(mapping->page_tree, index) to look up the page at offset index in the radix tree; if it is found, page_cache_get(page) = get_page(page) is called to increase the page reference count and the page structure is returned
(b) cache_page = page_cache_alloc_cold(mapping): if the previous step fails, this step calls alloc_pages() to allocate a page
(c) add_to_page_cache_lru(cache_page, mapping, index, ...):
add_to_page_cache(cache_page, ...): calls radix_tree_insert() to add the page cache_page to the radix tree mapping->page_tree and sets cache_page->mapping = mapping, cache_page->index = index
lru_cache_add(cache_page): add the page to the corresponding list
pvec = &get_cpu_var(lru_add_pvecs): disable kernel preemption and get the address of the current CPU's lru_add_pvecs structure
pagevec_add(pvec, page): store the pointer cache_page in pvec->pages[pvec->nr++], i.e. the page cache_page is temporarily managed by pvec; returns the space remaining in the pvec->pages[] array
If the remaining space is 0, __pagevec_lru_add() is called, which adds each page in pvec to the inactive list of its zone (implemented by add_page_to_inactive_list(zone, page))
release_pages(): check the pvec->pages array again; pages whose reference count (page_count(page)) is 0 are collected into a new pagevec structure and released
pagevec_reinit(pvec): reinitialize the pvec structure so that it is ready to be used again
(d) page = cache_page
(e) filler / ext2_readpage(NULL, page): simply calls mpage_readpage(page, ext2_get_block)
do_mpage_readpage(bio, page, 1, &last_block_in_bio, get_block): to be continued
(B) mark_page_accessed(page): mark the page as accessed
(ii) ext2_check_page(): check the characteristic parameters
(III) de = page_address(page): the start address of the page is taken as the first ext2_dir_entry_2 entry
(IV) ext2_match(namelen, name, de): compare the target file/directory name with the directory entry just read; on a match, exit and return the de structure, otherwise test the next directory entry
(2) ino = le32_to_cpu(de->inode): get the inode number of the target file/directory

inode = iget(dir->i_sb, ino): read inode number ino; see step 8.IX of the ext2 file system mount process
d_splice_alias(inode, dentry): set up the relationship between the inode and the dentry
vi path.mnt = nd->mnt
vii vfs_create(dir->d_inode, path.dentry, mode, nd): create the target file/directory
A may_create(dir, dentry, nd): check whether the target file/directory exists and check permissions
B dir->i_op->create(dir, dentry, mode, nd), i.e. a call to ext2_create() to create the file:
a inode = ext2_new_inode(dir, mode): allocate and set up the inode
(1) sb = dir->i_sb
(2) inode = new_inode(sb): allocate an inode and initialize it
(I) alloc_inode(sb): calls sb->s_op->alloc_inode(sb), i.e. the function ext2_alloc_inode(), which allocates an ext2_inode_info structure ei from the dedicated cache ext2_inode_cachep and returns the address of its member vfs_inode as the allocated inode; the initial value of each member of the inode structure is then set
(II) The members i_list and i_sb_list are used to add the inode to the lists inode_in_use and sb->s_inodes
(III) list_add(&inode->i_list, &inode_in_use): add the inode to the global in-use list
(IV) list_add(&inode->i_sb_list, &sb->s_inodes): add the inode to the list of the current superblock
(V) The static variable last_ino records the number of inodes allocated: inode->i_ino = ++last_ino
(3) If a directory is being created and the superblock has the EXT2_MOUNT_OLDALLOC flag, find_group_dir is called; if a directory is being created without that flag, find_group_orlov is called; if the target is not a directory, find_group_other is called. These are explained below:

(4) group = find_group_dir(sb, dir): select the block group in which to create a directory
(I) ngroups = EXT2_SB(sb)->s_groups_count: the number of block groups on the device
(II) avefreei = ext2_count_free_inodes(sb) / ngroups: the average number of free inodes per block group
(i) desc = ext2_get_group_desc(sb, i, NULL): get the descriptor structure of the i-th block group
(A) struct ext2_sb_info *sbi = (struct ext2_sb_info *)sb->s_fs_info
(B) group_desc = block_group >> sbi->s_desc_per_block_bits: from the block group number, compute the index of the buffer (buffer_head) that holds its group descriptor
(C) offset = block_group & (sbi->s_desc_per_block - 1): the index of the descriptor within that buffer
(D) return (struct ext2_group_desc *)sbi->s_group_desc[group_desc]->b_data + offset
(ii) desc_count += desc->bg_free_inodes_count;
(iii) Repeat the operation for all block groups on the device
(III) Examine every block group on the device and, among those whose number of free inodes is at least the average, select the one with the most free blocks (desc->bg_free_blocks_count)
(5) group = find_group_orlov(sb, dir): select the block group in which to create a directory
(I) More complex, but more effective than the previous method
(6) group = find_group_other(sb, dir): select the block group for a non-directory (file or link)
(I) If the block group containing the directory dir has both free inodes and free blocks, select that group directly
(II) If that condition is not met, use quadratic hashing to select another block group that has free inodes (desc->bg_free_inodes_count > 0) and free blocks
(III) If none has been found yet, choose a block group that has a free inode, regardless of free blocks
(7) The block group selection is now complete; the selected block group is number group
(8) gdp = ext2_get_group_desc(sb, group, &bh2): get the address of the group descriptor and the buffer_head structure that holds it
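The index arithmetic inside ext2_get_group_desc(), as described in steps (B) and (C) of find_group_dir above, splits a block group number into "which descriptor buffer" and "which slot in that buffer". The sketch below uses a hypothetical helper, demo_locate_group_desc(), and assumes the number of descriptors per block is a power of two, which is what makes the shift and mask valid.

```c
#include <assert.h>

/* Where a group descriptor lives: buffer index and slot within that buffer. */
struct demo_desc_location {
    unsigned int block;
    unsigned int offset;
};

/*
 * desc_per_block_bits is log2 of the number of descriptors per block,
 * mirroring s_desc_per_block_bits in the text.
 */
static struct demo_desc_location
demo_locate_group_desc(unsigned int block_group, unsigned int desc_per_block_bits)
{
    struct demo_desc_location loc;
    unsigned int desc_per_block;

    assert(desc_per_block_bits < 32);
    desc_per_block = 1U << desc_per_block_bits;

    loc.block = block_group >> desc_per_block_bits;   /* which buffer_head */
    loc.offset = block_group & (desc_per_block - 1);  /* slot inside it */
    return loc;
}
```

As an example under assumed parameters, with 1 KiB blocks and 32-byte descriptors there are 32 descriptors per block (desc_per_block_bits = 5), so block group 70 is found in descriptor buffer 2 at slot 6.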
(9) bitmap_bh = read_inode_bitmap(sb, group): read the disk block that holds the inode bitmap of the current block group
(I) desc = ext2_get_group_desc(sb, group, NULL): get the address of the group descriptor
(II) bh = sb_bread(sb, desc->bg_inode_bitmap): read the disk block holding the inode bitmap of the current group
(10) ino = ext2_find_next_zero_bit((unsigned long *)bitmap_bh->b_data, EXT2_INODES_PER_GROUP(sb), ino): find a free bit in the bitmap and allocate the corresponding on-disk inode
(11) If the search fails, try to allocate the inode in the next block group
(12) ext2_set_bit_atomic(sb_bgl_lock(sbi, group), ino, bitmap_bh->b_data): mark inode number ino as used in the bitmap bitmap_bh->b_data
(13) mark_buffer_dirty(bitmap_bh): mark the buffer as modified
(14) sync_dirty_buffer(bitmap_bh): if MS_SYNCHRONOUS is set in sb->s_flags, meaning synchronous updates, submit_bh(WRITE, bh) is called to write back the inode bitmap
(15) brelse(bitmap_bh): release the bitmap buffer
(16) ino += group * EXT2_INODES_PER_GROUP(sb) + 1: convert the inode number into a number that is valid for the whole device
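The conversion in step (16) turns a bit position inside one group's bitmap into a device-wide inode number; the + 1 is there because inode numbers start at 1 while bit positions start at 0. The sketch below uses hypothetical helper names, and the two inverse helpers are an added illustration of the same arithmetic rather than something taken from the text.

```c
#include <assert.h>

/* Bit `bit` of block group `group` becomes a device-wide inode number. */
static unsigned long demo_global_ino(unsigned long group, unsigned long bit,
                                     unsigned long inodes_per_group)
{
    assert(bit < inodes_per_group);
    return group * inodes_per_group + bit + 1;
}

/* Inverse mapping: which block group an inode number belongs to. */
static unsigned long demo_ino_group(unsigned long ino, unsigned long inodes_per_group)
{
    assert(ino >= 1);
    return (ino - 1) / inodes_per_group;
}

/* Inverse mapping: which bit of that group's bitmap it occupies. */
static unsigned long demo_ino_bit(unsigned long ino, unsigned long inodes_per_group)
{
    assert(ino >= 1);
    return (ino - 1) % inodes_per_group;
}
```

With an assumed 2048 inodes per group, bit 10 of group 3 corresponds to inode number 6155.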

(17) percpu_counter_mod(&sbi->s_freeinodes_counter, -1): decrease the free inode count in the superblock by one
(18) percpu_counter_inc(&sbi->s_dirs_counter): if a directory is being created, increase the directory count
(19) gdp->bg_free_inodes_count -= 1: decrease the free inode count of the block group by 1
(20) sb->s_dirt = 1: mark the superblock as modified
(21) mark_buffer_dirty(bh2): mark the buffer holding the group descriptor as modified
(22) inode->i_ino = ino: store the inode number in the inode itself
(23) Initialize the other members of the inode structure
(24) mark_inode_dirty(inode): mark the inode as modified
b Set the inode operation function pointers: inode->i_op, inode->i_mapping->a_ops, inode->i_fop
c mark_inode_dirty(inode): mark the inode as modified
d ext2_add_nondir(dentry, inode):
(1) ext2_add_link(dentry, inode)
(I) dir = dentry->d_parent->d_inode: the inode structure of the parent directory
(II) Look up the target file name in the directory; if it already exists, return the error -EEXIST; otherwise fill in an ext2_dir_entry_2 structure de and set its members: de->inode = inode->i_ino
(III) mark_inode_dirty(dir): mark the parent directory as modified
(2) d_instantiate(dentry, inode): on success, establish the link between the dentry and the inode structure; otherwise release the inode and return an error
viii nd->dentry = path.dentry: nd->dentry now points to the new directory/file
ix __follow_mount(&path): move the path to the file system currently mounted there
xi may_open(nd, acc_mode, flag): check the attribute flags and permissions; if O_TRUNC is set, obtain write access, call locks_verify_locked() to check for locks, and set the file length to 0
II dentry_open(nd.dentry, nd.mnt, ...)
i f = get_empty_filp(): allocate a struct file structure and perform basic initialization
ii If write access is requested, call get_write_access(inode) to obtain write permission; this increments the inode->i_writecount counter, which is used for mutual exclusion with mmap
iii Set the relevant members of f: f->f_op = inode->i_fop, f->f_dentry = nd.dentry, f->f_vfsmnt = nd.mnt, f->f_mapping = inode->i_mapping
iv Call f->f_op->open(inode, f), i.e. the function generic_file_open
v Check the O_DIRECT flag
vi Return the f structure
4 fsnotify_open(f->f_dentry): only executed when the compile option CONFIG_INOTIFY is enabled
5 fd_install(fd, f): set files->fd[fd] = f
At this point, opening the file is complete
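The ACC_MODE macro used by open_namei() in this section indexes a four-character string literal, so the octal escapes in the string act as a tiny lookup table. The sketch below reproduces that trick in user space; the DEMO_ names are illustrative, and the permission values DEMO_MAY_WRITE = 2 and DEMO_MAY_READ = 4 are assumed to match the kernel's MAY_WRITE and MAY_READ.

```c
#include <assert.h>

#define DEMO_O_ACCMODE 3
#define DEMO_MAY_WRITE 2
#define DEMO_MAY_READ  4

/* Index into the string literal: element k is the character with value \00k. */
#define DEMO_ACC_MODE(x) ("\000\004\002\006"[(x) & DEMO_O_ACCMODE])

/* Wrapper so the lookup can be called like an ordinary function. */
static int demo_acc_mode(int flag)
{
    assert(flag >= 0);
    return DEMO_ACC_MODE(flag);
}
```

Only the two low bits of the argument are used, so any other open flags that happen to be set in the same word do not affect the result.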

2.3 read
The read operation is carried out by the system call sys_read. It first calls file = fget_light(fd, &fput_needed) to obtain the file structure for the descriptor, then calls the main function vfs_read(), and afterwards calls file_pos_write() and fput_light() to update the file position and release the file reference. Only the main function vfs_read(file, buf, count, &pos) is analyzed below
1 Check the parameters, and check that the pointer file->f_op is non-NULL and that at least one of file->f_op->read and file->f_op->aio_read is non-NULL
2 access_ok(VERIFY_WRITE, buf, count): verify that the user-space area [buf, buf + count) is writable
3 rw_verify_area(READ, file, pos, count): verify that the target region of the file may be read
I The amount of data to read, count, must not exceed the limit file->f_maxcount
II The start and end positions of the read must be valid, i.e. pos >= 0 && pos + count >= 0
III If the file has locks (inode->i_flock) and mandatory locking is in effect, i.e. the macro MANDATORY_LOCK(inode) is true, meaning that MS_MANDLOCK is set in inode->i_sb->s_flags and the set-group-ID bit is set without the group execute bit, i.e. (inode->i_mode & (S_ISGID | S_IXGRP)) == S_ISGID, then locks_mandatory_area(read_write == READ ? FLOCK_VERIFY_READ : FLOCK_VERIFY_WRITE, inode, file, pos, count) is called to check the mandatory locks
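The mandatory-locking condition described in step III of rw_verify_area above is a pure bit test and can be checked in isolation. The function below is a hypothetical user-space model: the DEMO_ constants are assumed to carry the usual Linux values of MS_MANDLOCK, S_ISGID and S_IXGRP, and the mount flags and file mode are passed in as plain integers.

```c
#include <assert.h>

#define DEMO_MS_MANDLOCK 64U     /* mount flag: mandatory locking allowed */
#define DEMO_S_ISGID     02000U  /* set-group-ID bit */
#define DEMO_S_IXGRP     00010U  /* group execute bit */

/*
 * Mandatory locking applies only when the file system was mounted with
 * MS_MANDLOCK and the file has set-group-ID set without group execute.
 */
static int demo_mandatory_lock(unsigned long sb_flags, unsigned int mode)
{
    assert((mode & ~07777U) == 0); /* permission bits only in this sketch */
    return (sb_flags & DEMO_MS_MANDLOCK) != 0 &&
           (mode & (DEMO_S_ISGID | DEMO_S_IXGRP)) == DEMO_S_ISGID;
}
```

A mode of 02644 on a file system mounted with the flag therefore selects mandatory locking, while 02654 (group execute also set) does not.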
4 If file->f_op->read is non-NULL, file->f_op->read(file, buf, count, pos) is called; otherwise do_sync_read(file, buf, count, pos) is called. Both are explained below
5 file->f_op->read(), i.e. generic_file_read(file, buf, count, ppos)
I struct iovec local_iov = {.iov_base = buf, .iov_len = count}: describes the user-space buffer that receives the data
II init_sync_kiocb(&kiocb, filp): initialize the kiocb data structure
III ret = __generic_file_aio_read(iocb = &kiocb, iov = &local_iov, nr_segs = 1, ppos): read the data through the asynchronous I/O path. The arguments indicate a single user buffer, i.e. iov has only one segment
i
ii

iii
iv
v

A for loop verifies that each buffer segment iov[..] lies in user space, accumulates the total amount of data to read and checks that no negative value occurs. If a segment is invalid, that segment and all following ones are discarded and only the segments before it are serviced
If O_DIRECT is set in filp->f_flags, the file system's cache mechanism is not used and data is transferred directly between the device and the user buffer: continue below; otherwise go to step vii
mapping = filp->f_mapping, inode = mapping->host, size = inode->i_size
retval = generic_file_direct_IO(READ, iocb, iov, offset = pos, nr_segs = 1):
A
file = iocb-> ki_filp, mapping = file-> f_mapping
If this is a write operation, perform the following steps:
B
a
b

write_len = iov_length(iov, nr_segs): compute the total amount of data in the current I/O operation
If the file has been memory-mapped (mmap), i.e. mapping_mapped(mapping) is true, meaning that mapping->i_mmap.prio_tree_node is not NULL or the list mapping->i_mmap_nonlinear is not empty, unmap_mapping_range(mapping, offset, write_len, 0) is called to remove the address mappings of the region [offset, offset + write_len]

filemap_write_and_wait(mapping)
If the mapping has no pages, i.e. mapping->nrpages = 0, return 0 immediately; otherwise continue

a
b

filemap_fdatawrite(mapping): directly calls __filemap_fdatawrite(mapping, WB_SYNC_ALL), which in turn directly calls __filemap_fdatawrite_range(mapping, start = 0, end = 0, sync_mode = WB_SYNC_ALL)
(1) Declare a local variable struct writeback_control wbc = {.nr_to_write = mapping->nrpages * 2, ...}, initialized from the input parameters
(2) mapping_cap_writeback_dirty(mapping): tests whether BDI_CAP_NO_WRITEBACK is set in mapping->backing_dev_info->capabilities; returns 0 if it is set, 1 otherwise.
(3) do_writepages(mapping, &wbc): calls the function pointer mapping->a_ops->writepages(mapping, wbc); if the pointer is NULL, generic_writepages(mapping, wbc) is called instead. For the ext2 file system the pointer mapping->a_ops->writepages refers to ext2_writepages, which directly calls mpage_writepages(mapping, wbc, ext2_get_block)
(I) to be continued
filemap_fdatawait(mapping): directly calls wait_on_page_writeback_range(mapping, 0, (i_size - 1) >> PAGE_CACHE_SHIFT) and waits for the data to be written


to be continued
(1)
mapping->a_ops->direct_IO(rw, iocb, iov, offset, nr_segs): i.e. a call to ext2_direct_IO, which directly calls blockdev_direct_IO(rw, iocb, inode = iocb->ki_filp->f_mapping->host, inode->i_sb->s_bdev, iov, offset, nr_segs, ext2_get_blocks, NULL), which in turn directly calls __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, nr_segs, get_blocks = ext2_get_blocks, end_io = NULL, DIO_LOCKING)
a to be continued
vi After the direct access (O_DIRECT set) is complete, the function returns
vii If O_DIRECT is not set in filp->f_flags, the file system's cache mechanism is used
A Set up a for loop that performs the read for each of the nr_segs user buffers, one buffer per iteration; the read of a single buffer proceeds as follows
B Declare a local variable read_descriptor_t desc and initialize it from the input parameters
C do_generic_file_read(filp, ppos, &desc, file_read_actor): directly calls do_generic_mapping_read(filp->f_mapping, &filp->f_ra, filp, ppos, desc, actor = file_read_actor)
a to be continued
IV If the return value ret = -EIOCBQUEUED, wait_on_sync_kiocb(&kiocb) is called, which puts the current process into the TASK_UNINTERRUPTIBLE state and schedules, waiting for the read to complete
6 do_sync_read(file, buf, count, pos)

I
II

III

IV
V

init_sync_kiocb (& kiocb, filp): initialize data structures kiocb, and set kiocb.ki_pos = * ppos
filp-> f_op-> aio_read (& kiocb, buf, len, kiocb.ki_pos): That is the calling function ret = generic_file_aio_read ()
i
struct iovec local_iov = {.iov_base = buf, .iov_len = count};
ii
__generic_file_aio_read (iocb, & local_iov, 1, & iocb-> ki_pos)
If the return value ret = -EIOCBRETRY function is called wait_on_retry_sync_kiocb (& kiocb), and repeat on
The steps until the return value is not equal -EIOCBRETRY. During the execution of the function __generic_file_aio_read See 5.III
step
If the return value ret = -EIOCBQUEUED function is called wait_on_sync_kiocb (& kiocb)
* Ppos = kiocb.ki_pos
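
The retry and wait logic of do_sync_read described above can be sketched in user-space C. This is only an illustration under stated assumptions: the sketch_ types and helpers are hypothetical stand-ins for the kernel's kiocb, aio_read and wait functions, and the numeric error codes are placeholders chosen for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder error codes named after the kernel's; values are assumptions. */
#define SKETCH_EIOCBQUEUED 529
#define SKETCH_EIOCBRETRY  530

/* Hypothetical, heavily reduced stand-in for struct kiocb. */
struct sketch_kiocb {
    long ki_pos;        /* current file position */
    int  retries_left;  /* how often the stub asks for a retry */
    int  queued;        /* stub completes asynchronously if nonzero */
    long result;        /* result delivered by the "completion" */
};

/* Stand-in for filp->f_op->aio_read(): retries, then completes or queues. */
static long sketch_aio_read(struct sketch_kiocb *k, size_t len)
{
    if (k->retries_left > 0) {
        k->retries_left--;
        return -SKETCH_EIOCBRETRY;
    }
    k->ki_pos += (long)len;
    if (k->queued) {
        k->result = (long)len;
        return -SKETCH_EIOCBQUEUED;
    }
    return (long)len;
}

/* Stand-ins for wait_on_retry_sync_kiocb() and wait_on_sync_kiocb(). */
static void sketch_wait_on_retry(struct sketch_kiocb *k) { (void)k; }
static long sketch_wait_on_sync(struct sketch_kiocb *k) { return k->result; }

/* Mirrors steps I to V: init, retry loop, wait if queued, write back pos. */
long sketch_do_sync_read(size_t len, long *ppos, int retries, int queued)
{
    struct sketch_kiocb kiocb = { .ki_pos = 0, .retries_left = retries,
                                  .queued = queued, .result = 0 };
    long ret;

    assert(ppos != NULL);
    kiocb.ki_pos = *ppos;                               /* step I */
    while ((ret = sketch_aio_read(&kiocb, len)) == -SKETCH_EIOCBRETRY)
        sketch_wait_on_retry(&kiocb);                   /* steps II and III */
    if (ret == -SKETCH_EIOCBQUEUED)
        ret = sketch_wait_on_sync(&kiocb);              /* step IV */
    *ppos = kiocb.ki_pos;                               /* step V */
    return ret;
}
```

The same loop shape applies to do_sync_write in section 2.4, with aio_write in place of aio_read.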

2.4 write
The entry function is the system call sys_write. Like sys_read, it calls the VFS-layer function vfs_write. If the function pointer file->f_op->write != NULL, file->f_op->write(file, buf, count, pos) is called, which for the ext2 file system is the function generic_file_write; otherwise do_sync_write(file, buf, count, pos) is called
generic_file_write(file, buf, count, pos):
1 inode = file->f_mapping->host
2 Define the local variable struct iovec local_iov = {.iov_base = buf, .iov_len = count};
3 down(&inode->i_sem); ret = __generic_file_write_nolock(file, &local_iov, 1, pos)
I init_sync_kiocb(&kiocb, file)
II ret = __generic_file_aio_write_nolock(&kiocb, iov, nr_segs = 1, pos)
i Use a for loop to verify that the buffer of every iov[..] component lies in user space and that neither a component length nor the running total of data to write is negative; if a component is invalid, it and all following components are dropped and only the components before it are serviced
ii vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE): the macro is defined as wait_event((inode->i_sb)->s_wait_unfrozen, ((inode->i_sb)->s_frozen < (SB_FREEZE_WRITE)))
iii current->backing_dev_info = mapping->backing_dev_info
iv generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode)): performs the checks required before data is written, such as the amount of data and the write position; some errors may raise SIGXFSZ
v remove_suid(file->f_dentry): after preparing the parameters, calls notify_change(dentry, &newattrs)
vi inode_update_time(inode, 1): sets inode->i_mtime and inode->i_ctime to the current system time and calls mark_inode_dirty_sync(inode) to mark the inode as needing write-back
vii If O_DIRECT is set in file->f_flags, generic_file_direct_write(iocb, iov, &nr_segs, pos, ppos, count, ocount) is called to write the data
A If count != ocount, i.e. not all of the data can be written, *nr_segs = iov_shorten(iov, *nr_segs, count) is called to adjust the amount of data to write
B written = generic_file_direct_IO(WRITE, iocb, iov, pos, *nr_segs), see step 5.III.v of the read process
C If the end position of the written data lies beyond the end of file recorded in the inode, i_size_write(inode, end) resets inode->i_size and the inode is marked as modified
D If the file is accessed synchronously (the condition (written >= 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode)))), generic_osync_inode(inode, mapping, OSYNC_METADATA) is called to write all changes to the inode back to disk
a to be continued
viii If O_DIRECT is not set in file->f_flags, generic_file_buffered_write(iocb, iov, nr_segs, pos, ppos, count, written) is called to write the data
III If the return value ret = -EIOCBQUEUED, wait_on_sync_kiocb(&kiocb) is called
Finally, if the write is synchronous (i.e. ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))), sync_page_range(inode, mapping, *ppos - ret, ret) is called to wait for the data to be written

do_sync_write(file, buf, count, pos):
1 init_sync_kiocb(&kiocb, filp): initialize the kiocb data structure and set kiocb.ki_pos = *ppos
2 ret = filp->f_op->aio_write(&kiocb, buf, len, kiocb.ki_pos): i.e. the function generic_file_aio_write() is called
I struct iovec local_iov = {.iov_base = buf, .iov_len = count};
II __generic_file_aio_write_nolock(iocb, &local_iov, 1, &iocb->ki_pos)
III If the write is synchronous (i.e. ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))), sync_page_range(inode, mapping, *ppos - ret, ret) is called to wait for the data to be written
3 If the return value ret = -EIOCBRETRY, wait_on_retry_sync_kiocb(&kiocb) is called and execution goes back to step 2, until some other value is returned
4 If the return value ret = -EIOCBQUEUED, wait_on_sync_kiocb(&kiocb) is called: the process enters the TASK_UNINTERRUPTIBLE state and is scheduled out
5 *ppos = kiocb.ki_pos
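
The per-segment check that __generic_file_aio_write_nolock performs before writing (step II.i of generic_file_write above) can be sketched as follows. This is a user-space illustration only: struct sketch_iovec and sketch_access_ok() are hypothetical stand-ins, and the real kernel check tests whether the buffer lies in user space rather than testing for NULL.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for struct iovec. */
struct sketch_iovec {
    void  *base;
    size_t len;
};

/* Stand-in for the kernel's user-space access check (assumption: a NULL
 * base with a nonzero length counts as an invalid buffer). */
static int sketch_access_ok(const struct sketch_iovec *iv)
{
    return iv->base != NULL || iv->len == 0;
}

/*
 * Walks the segments, accumulating the total length. Returns the number of
 * bytes in the usable prefix and shrinks *nr_segs to that prefix, or returns
 * a negative errno value if the lengths overflow or the first segment is bad.
 */
long sketch_check_segments(const struct sketch_iovec *iov, unsigned long *nr_segs)
{
    size_t ocount = 0;
    unsigned long seg;

    for (seg = 0; seg < *nr_segs; seg++) {
        const struct sketch_iovec *iv = &iov[seg];

        ocount += iv->len;
        /* A length or running total that would be negative as a signed value. */
        if (iv->len > (size_t)LONG_MAX || ocount > (size_t)LONG_MAX)
            return -EINVAL;
        if (sketch_access_ok(iv))
            continue;
        if (seg == 0)
            return -EFAULT;     /* nothing usable at all */
        *nr_segs = seg;         /* keep only the segments before this one */
        ocount -= iv->len;      /* this segment does not count */
        break;
    }
    return (long)ocount;
}
```

With three segments of which the second is invalid, only the first segment is kept and its length is returned.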

2.5 mmap
The corresponding system call is sys_mmap. If the mapping is not anonymous (MAP_ANONYMOUS is not set in flags), i.e. a file mapping is used, file = fget(fd) is called to obtain the file structure; the main mapping function do_mmap_pgoff(file, addr, len, prot, flags, off >> PAGE_SHIFT) is then executed. Its implementation is analysed below
1 Validate the parameters and round the length of the mapped range up to an integer multiple of the page size
2 addr = get_unmapped_area(file, addr, len, pgoff, flags)
I If MAP_FIXED is not set in flags, i.e. the specified address addr does not have to be used, the following function pointer is used to allocate the address addr
II If p = file->f_op->get_unmapped_area != NULL, the function p points to is executed, otherwise the function q = current->mm->get_unmapped_area. For the ext2 file system the pointer p = NULL, so the function q = arch_get_unmapped_area() is executed, located in the file arch/x86_64/kernel/sys_x86_64.c
i find_start_end(flags, *begin, *end): sets the range of virtual addresses available for mapping; 32-bit applications: *begin = 4000_0000h, *end = 8000_0000h; 64-bit programs: *begin = TASK_UNMAPPED_BASE = TASK_SIZE / 3, *end = TASK_SIZE = 8000_0000_0000h - 1000h
ii If addr != 0, i.e. a target address for the mapping was specified, find_vma(mm, addr) is called to look it up; if the free space there satisfies the request, the virtual address addr is returned directly, otherwise continue
iii Set the initial addr: addr = mm->free_area_cache < begin ? begin : mm->free_area_cache
iv In a for loop, find_vma(mm, addr) is called repeatedly to look for free virtual address space; if it is found, addr is returned, otherwise the error -ENOMEM


Note: the pointer current->mm->get_unmapped_area is set by the function arch_pick_mmap_layout(struct mm_struct *mm), in the file include/linux/sched.h, which sets three fields at the same time:
mm->mmap_base = TASK_UNMAPPED_BASE
mm->get_unmapped_area = arch_get_unmapped_area
mm->unmap_area = arch_unmap_area
The call chain of the function arch_pick_mmap_layout(mm) is as follows ("<-" reads "is called by"):
arch_pick_mmap_layout() <- load_elf_binary() <- elf_format.load_binary
arch_pick_mmap_layout() <- exec_mmap() <- flush_old_exec(), where flush_old_exec() is called by:
load_aout_binary() <- aout_format.load_binary
load_elf_binary() <- elf_format.load_binary
load_elf_fdpic_binary() <- elf_fdpic_format.load_binary
load_som_binary() <- som_format.load_binary
ia32_aout.c/load_aout_binary() <- ia32_aout.c/aout_format.load_binary
load_flat_file() <- load_flat_binary() <- flat_format.load_binary
load_flat_shared_library() <- calc_reloc() <- load_flat_file()
III Verify that the address is valid
IV If it is a file mapping on the hugetlbfs file system, prepare_hugepage_range(addr, len) is called to verify that the address is hugepage-aligned; otherwise is_hugepage_only_range(current->mm, addr, len) is called, which is an empty function on x86-64 systems
3 can_do_mlock(): if MAP_LOCKED is set in flags, this function is called to test whether the process is allowed to lock mapped memory; if it has the CAP_IPC_LOCK capability or current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur != 0, continue and set the VM_LOCKED flag, otherwise return the error -EPERM
4 If the VM_LOCKED flag is set, check whether the total locked area mm->locked_vm would exceed the limit defined in rlim; if it does, return the error -EAGAIN, otherwise continue
5 Check that the flags and prot attributes match the file attribute file->f_mode
6 locks_verify_locked(inode): make sure the file does not use mandatory locks
7 vma = find_vma_prepare(): look up the vma in preparation for establishing the mapping
8 do_munmap(mm, addr, len): if the vma region found overlaps the range to be mapped, this function is called to release the overlapping range; if the release fails, return the error -ENOMEM, otherwise continue
9 may_expand_vm(): verify that the total mapped address space does not exceed the system limit
10 vma_merge(): for a private anonymous mapping (i.e. file == NULL and VM_SHARED not set), this function is called to extend an existing mapped region; on success, return
11 Allocate a vma structure from the dedicated cache vm_area_cachep and initialize it
12 For a file mapping (file != NULL), do the following:
I The growth-direction flags VM_GROWSDOWN and VM_GROWSUP are not allowed; if the deny-write flag VM_DENYWRITE is set, deny_write_access(file) is called to check whether the file is currently open for writing
II vma->vm_file = file
III get_file(file): increase the file usage count file->f_count
IV file->f_op->mmap(file, vma): for the ext2 file system, the function generic_file_mmap() is called:
V mapping = file->f_mapping
VI If mapping->a_ops->readpage = NULL, the error -ENOEXEC is returned
VII file_accessed(file): mark the file as accessed; if O_NOATIME is not set in file->f_flags, touch_atime(file->f_vfsmnt, file->f_dentry) is called, which directly calls update_atime(dentry->d_inode) to update the access time inode->i_atime
VIII vma->vm_ops = &generic_file_vm_ops: the key operations of the address mapping; generic_file_vm_ops has only two valid members: .nopage = filemap_nopage; .populate = filemap_populate
13 shmem_zero_setup(vma): if it is not a file mapping and VM_SHARED is set, this function is called to establish a shared anonymous mapping
I file = shmem_file_setup("dev/zero", size, vma->vm_flags): create a shared memory file
i root = shm_mnt->mnt_root;
ii dentry = d_alloc(root, &this): allocate a dentry structure in the shmfs file system
iii file = get_empty_filp(): get a file structure
iv inode = shmem_get_inode(root->d_sb, ...): allocate an inode structure in the shmfs file system
v d_instantiate(dentry, inode): link the dentry and the inode
vi file->f_vfsmnt = shm_mnt; file->f_dentry = dentry; file->f_mapping = inode->i_mapping; file->f_op = &shmem_file_operations
II vma->vm_file = file
III vma->vm_ops = &shmem_vm_ops
14 vma_merge(): for a mapping that is not anonymous, this function is called to try to merge the vma
15 atomic_inc(&inode->i_writecount): for a file mapping on which write access was denied above, increase the write count again
16 mm->total_vm += len >> PAGE_SHIFT: record the amount of mapped memory
17 If VM_LOCKED is set, increase the amount of locked memory mm->locked_vm and call make_pages_present(addr, addr + len): after verifying the parameters, this function calls get_user_pages(current, current->mm, addr, len, write, force = 0, pages = NULL, vmas = NULL)
I vma = find_extend_vma(mm, addr = start):
i vma = find_vma(mm, addr): look for the first vma satisfying addr < vma->vm_end; if none is found, return NULL
ii If vma->vm_start <= addr, i.e. addr lies inside the vma region, return vma directly
iii If VM_GROWSDOWN is not set, i.e. addr lies in no vma region and the vma found is not a downward-growing region (stack), return NULL directly
iv expand_stack(vma, addr): at this point the vma manages a stack; it is expanded to include the address addr
A anon_vma_prepare(vma): find or allocate an anon_vma; if vma->anon_vma != NULL, return directly, otherwise continue
a anon_vma = find_mergeable_anon_vma(vma): look for a neighbouring vma whose anon_vma can be shared, so that a later merge is not prevented; if one is found, the anon_vma of that neighbouring vma is returned
b anon_vma = anon_vma_alloc(): if none was found, this function is called to allocate an anon_vma
c vma->anon_vma = anon_vma
d list_add(&vma->anon_vma_node, &anon_vma->head): add the vma to the list of the anon_vma
B Align the address addr to a page boundary
C size = vma->vm_end - address; grow = (vma->vm_start - address) >> PAGE_SHIFT
D acct_stack_growth(vma, size, grow): verify that the stack space may be increased
E vma->vm_start = address; vma->vm_pgoff -= grow: accept the extension
v If the vma has the VM_LOCKED flag, make_pages_present(addr, start) is called to allocate physical pages
II At this point vma is either NULL or contains the address start
III If the condition (vma == NULL && in_gate_area(tsk, start)) holds, i.e. the address start lies in gate_vma (the range [VSYSCALL_START, VSYSCALL_END]): for pages that exist, fill in the pages and vmas parameters; return when a missing page is encountered; as long as this condition holds, continue with the next page
IV page = follow_page(mm, start, write): directly calls __follow_page(mm, address, 0, write, 1) to obtain the page structure of the page containing the address; if page = NULL, __handle_mm_fault(mm, vma, start, write) is called to allocate a physical page
i to be continued
V If pages != NULL, fill in pages[i] = page; if vmas != NULL, fill in vmas[i] = vma
VI start += PAGE_SIZE, len--; if this vma still has pages to process, i.e. len != 0 && start < vma->vm_end, go back to IV to process the next page
VII If pages remain to be processed, i.e. len != 0, go back to I for the next vma
VIII After completion, every virtual address has a corresponding physical page
18 If MAP_POPULATE is set in flags, sys_remap_file_pages(addr, len, 0, pgoff, flags & MAP_NONBLOCK) is called: remapping
I to be continued
For the ext2 file system, once the address mapping is complete, a page fault will be generated when the pages are accessed: .nopage = filemap_nopage; .populate = filemap_populate
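
The address search in arch_get_unmapped_area (steps 2.II.iii and 2.II.iv of the do_mmap_pgoff analysis above) is a first-fit scan over the address-ordered vma list. A minimal user-space sketch, assuming a sorted array of hypothetical struct sketch_vma entries in place of the kernel's vma list, and returning 0 where the kernel would return -ENOMEM:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for vm_area_struct: the range [vm_start, vm_end). */
struct sketch_vma {
    unsigned long vm_start;
    unsigned long vm_end;
};

/* Stand-in for find_vma(): first region with addr < vm_end, or NULL.
 * Assumes the array is sorted by address and the regions do not overlap. */
static const struct sketch_vma *sketch_find_vma(const struct sketch_vma *v,
                                                size_t n, unsigned long addr)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (addr < v[i].vm_end)
            return &v[i];
    return NULL;
}

/* First-fit search for len free bytes inside [begin, end), starting at
 * max(hint, begin). Returns the start address, or 0 if nothing fits. */
unsigned long sketch_get_unmapped_area(const struct sketch_vma *v, size_t n,
                                       unsigned long begin, unsigned long end,
                                       unsigned long hint, unsigned long len)
{
    unsigned long addr = hint < begin ? begin : hint;

    if (len == 0 || end < begin || len > end - begin)
        return 0;
    for (;;) {
        const struct sketch_vma *vma;

        if (addr > end - len)
            return 0;                       /* ran past the usable range */
        vma = sketch_find_vma(v, n, addr);
        if (vma == NULL || addr + len <= vma->vm_start)
            return addr;                    /* gap before vma is big enough */
        addr = vma->vm_end;                 /* skip this region, try again */
    }
}
```

The hint parameter plays the role of mm->free_area_cache in the text; the sketch does not update it after a successful search.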
2.6 path_lookup
2.7 file locking and counter
2.8 File System Summary

3 Process
Linux has three system calls for creating processes: fork, vfork and clone. In the kernel they are sys_fork, sys_vfork and sys_clone respectively; all of them go on to call the internal function do_fork(), and the only difference is the arguments passed to do_fork().
do_fork parameters:
unsigned long clone_flags: flags describing the new process
unsigned long stack_start: start address of the child process's stack
struct pt_regs *regs: pointer to the register structure
unsigned long stack_size: stack size; this parameter is not used
int __user *parent_tidptr: pointer to the tid variable of the parent process
int __user *child_tidptr: pointer to the tid variable of the child process
sys_fork parameters:
struct pt_regs *regs
Arguments passed to do_fork:
clone_flags = SIGCHLD: the signal sent to the parent process when the child process terminates or stops
stack_start = regs->rsp: the parent's stack is shared and copied through the COW mechanism
regs = regs
stack_size = 0
parent_tidptr = NULL
child_tidptr = NULL
sys_vfork parameters:
struct pt_regs *regs
Arguments passed to do_fork:
clone_flags = CLONE_VFORK | CLONE_VM | SIGCHLD: the address space is shared with the parent process, and the parent process is suspended in a wait state until the child process releases the address space, i.e. exits or executes a new program;
the others are the same as for sys_fork
sys_clone parameters:
unsigned long clone_flags
unsigned long newsp
void __user *parent_tid
void __user *child_tid
struct pt_regs *regs
Arguments passed to do_fork:
stack_start = newsp ? newsp : regs->rsp
stack_size = 0
the other parameters are passed through unchanged
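
The way the three entry points differ only in the arguments they hand to do_fork() can be sketched as follows. This is an illustration under stated assumptions: struct sketch_fork_args and the sketch_sys_* functions are hypothetical, they only compute the argument set described above instead of creating a process, and the two SKETCH_CLONE_* constants are defined locally for the sketch.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Flag values defined locally for this sketch. */
#define SKETCH_CLONE_VM    0x00000100UL
#define SKETCH_CLONE_VFORK 0x00004000UL

/* Hypothetical record of the arguments each system call passes to do_fork(). */
struct sketch_fork_args {
    unsigned long clone_flags;
    unsigned long stack_start;
    unsigned long stack_size;
    int *parent_tidptr;
    int *child_tidptr;
};

/* fork: only the exit signal is set; the child starts on the parent's rsp. */
struct sketch_fork_args sketch_sys_fork(unsigned long rsp)
{
    struct sketch_fork_args a = { (unsigned long)SIGCHLD, rsp, 0, NULL, NULL };
    return a;
}

/* vfork: like fork, plus a shared address space and a waiting parent. */
struct sketch_fork_args sketch_sys_vfork(unsigned long rsp)
{
    struct sketch_fork_args a = sketch_sys_fork(rsp);
    a.clone_flags |= SKETCH_CLONE_VFORK | SKETCH_CLONE_VM;
    return a;
}

/* clone: the caller supplies the flags; newsp == 0 means "use regs->rsp". */
struct sketch_fork_args sketch_sys_clone(unsigned long clone_flags,
                                         unsigned long newsp, unsigned long rsp,
                                         int *parent_tid, int *child_tid)
{
    struct sketch_fork_args a = { clone_flags, newsp ? newsp : rsp, 0,
                                  parent_tid, child_tid };
    return a;
}
```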


The do_fork() procedure is analysed below
1 pid = alloc_pidmap(): allocate a free pid number
2 Check the current->ptrace flag; if the child process needs to be traced, CLONE_PTRACE is added to clone_flags
3 p = copy_process(): create the process descriptor
I If clone_flags has both CLONE_NEWNS and CLONE_FS set, it is an error. CLONE_NEWNS means that a new namespace is used; CLONE_FS means that current->fs, the fs_struct structure, is shared with the parent process
II If clone_flags has CLONE_THREAD set but not CLONE_SIGHAND, it is an error. CLONE_THREAD: the child process is added to the parent's thread group and is forced to share the parent's signal descriptor. CLONE_SIGHAND: the signal handler table is shared, including the signal handlers, blocked signals and pending signals
III If clone_flags has CLONE_SIGHAND set but not CLONE_VM, it is an error. CLONE_VM: parent and child share the virtual address space

IV p = dup_task_struct(orig = current): copy the process control block task_struct
i prepare_to_copy(orig): directly calls unlazy_fpu(orig); if the current process is using the floating-point unit (FPU), i.e. orig->thread_info->status has the TS_USEDFPU flag, save_init_fpu(orig) is called to save the floating-point registers to orig->thread.i387.fxsave and clear the TS_USEDFPU flag in orig->thread_info->status, and the stts macro is called to set the TS flag (bit 3) in CR0, indicating that a process switch is imminent; executing an x87 or multimedia instruction will then raise a device-not-available exception
ii tsk = alloc_task_struct(): allocate a process control structure from the dedicated cache task_struct_cachep
iii ti = alloc_thread_info(): calls the macro __get_free_pages(, 1) to allocate two contiguous physical pages for the thread control block and the kernel stack
iv *ti = *orig->thread_info: copy the thread control structure
v *tsk = *orig: copy the process control structure
vi tsk->thread_info = ti; ti->task = tsk: link the process control block and the thread control block
vii Set tsk->usage = 2: the reference count of the control block, one reference for the new process itself and one for whoever performs release_task(), usually the parent process
V Verify whether the total number of processes of the new process's owner exceeds the limit, i.e. p->user->processes >= p->signal->rlim[RLIMIT_NPROC].rlim_cur; if it does, and the owner has no administrator privileges and is not the root user, it is an error
VI Increment the counts p->user->__count and p->user->processes
VII get_group_info(p->group_info): increase the count p->group_info->usage
VIII copy_flags(clone_flags, p): clear the PF_SUPERPRIV flag in p->flags and set the PF_FORKNOEXEC flag; if clone_flags has no CLONE_PTRACE flag, set p->ptrace = 0. PF_SUPERPRIV: superuser flag; PF_FORKNOEXEC: forked but has not yet called exec
IX p->pid = pid: set the PID of the child process
X If CLONE_PARENT_SETTID is set in clone_flags, p->pid is written to the user-space variable parent_tidptr
XI Set p->proc_dentry = NULL; initialize the lists p->children and p->sibling; initialize the spin locks p->alloc_lock and p->proc_lock; init_sigpending(&p->pending): initialize the pending-signal management structure; initialize the other members of p
XII copy_semundo(clone_flags, p):
i If clone_flags has no CLONE_SYSVSEM flag, set p->sysvsem.undo_list = NULL and return, otherwise continue
ii get_undo_list(undo_listp = &undo_list):
A If the undo_list of the current process (current->sysvsem.undo_list) = NULL, allocate a semundo_list structure
B *undo_listp = current->sysvsem.undo_list
iii Increase the undo_list count undo_list->refcnt
iv p->sysvsem.undo_list = undo_list
XIII copy_files(clone_flags, p)
i If CLONE_FILES is set in clone_flags, increase the reference count current->files->count and return
ii Allocate a files_struct structure from the dedicated cache files_cachep and initialize it
iii open_files = count_open_files(): compute the largest file descriptor fd used by the current process
iv Copy open_fds and close_on_exec in files_struct; the unused part is cleared
v Use a for loop to copy all open files in current->files->fd[..] to p->files->fd[..]; if a file is not open, clear the corresponding bit in p->files->open_fds
XIV copy_fs(clone_flags, p): if CLONE_FS is set in clone_flags, increase the count current->fs->count and return; otherwise set p->fs = __copy_fs_struct(current->fs): allocate an fs_struct structure from the dedicated cache fs_cachep and set its initial values equal to those of the current process's fs_struct
XV copy_sighand(clone_flags, p): copy the signal handlers
i If CLONE_SIGHAND or CLONE_THREAD is set, simply increase the count current->sighand->count and return
ii Allocate a sighand_struct structure sig from the dedicated cache sighand_cachep and initialize it
iii Copy current->sighand->action to sig->action
iv Set p->sighand = sig, sig->count = 1
XVI copy_signal(clone_flags, p)
i If CLONE_THREAD is set in clone_flags, increase the reference counts current->signal->count and current->signal->live and return
ii Allocate a signal_struct structure sig from the dedicated cache signal_cachep
iii p->signal = sig; perform other basic initialization
iv Copy current->signal->rlim to sig->rlim
XVII copy_mm(clone_flags, p): copy the virtual address space; the kernel-space part of the page table is copied from init_level4_pgt
i If CLONE_VM is set in clone_flags, increase the reference count current->mm->mm_users; otherwise continue
ii mm = allocate_mm(): allocate an mm_struct structure from the dedicated cache mm_cachep
iii memcpy(mm, current->mm, sizeof(*mm)): copy the mm_struct structure
iv mm = mm_init(mm): allocate the pgd and copy the kernel-space part of the page table from init_level4_pgt
A Initialize the basic fields of the mm structure
B mm_alloc_pgd(mm): set mm->pgd = pgd_alloc(mm):
a pgd = __get_free_page(): allocate one physical page
b boundary = pgd_index(__PAGE_OFFSET): compute the position of the address __PAGE_OFFSET in the pgd
c Clear the first boundary entries of the pgd (entries 0 to boundary - 1)
d Copy init_level4_pgt + boundary to pgd + boundary, PTRS_PER_PGD - boundary entries in total, i.e. copy the pgd entries covering kernel space, which sets up the kernel-space page table of the child process
v init_new_context(p, mm): calls the function copy_ldt(new = &mm->context, old = &current->mm->context)
A alloc_ldt(pc = new, mincount = old->size, reload = 0)
a If new->size >= old->size, return directly and go straight to the copy step
b Round mincount up to an integer multiple of 512 entries
c The new LDT has a capacity of mincount * LDT_ENTRY_SIZE bytes; if this value does not exceed one page, kmalloc is called to allocate the memory newldt, otherwise vmalloc
d Copy the contents of the original LDT (i.e. pc->ldt) and clear the remaining space
e Release the original LDT (i.e. pc->ldt)
f Set pc->ldt = newldt
g If reload != 0, call load_LDT(pc), which calls load_LDT_nolock(pc, cpu), which in turn calls set_ldt_desc(cpu, pc->ldt, count) and load_LDT_desc()
B memcpy(new->ldt, old->ldt, old->size * LDT_ENTRY_SIZE): copy the contents of the parent process's mm->context.ldt to the child process's mm->context.ldt, so that the contents the two pointers refer to are identical
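
The grow-and-copy behaviour of alloc_ldt and copy_ldt described in step XVII.v can be sketched in user-space C. This is a minimal illustration under stated assumptions: struct sketch_ldt is a hypothetical stand-in for mm_context_t, malloc() and free() replace the kmalloc/vmalloc choice, the reload step is omitted, and the 8-byte entry size is an assumption of the sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_LDT_ENTRY_SIZE 8u   /* assumed size of one LDT entry in bytes */

/* Hypothetical stand-in for the LDT part of mm_context_t. */
struct sketch_ldt {
    unsigned char *ldt;   /* entry storage, NULL when size == 0 */
    unsigned size;        /* capacity in entries */
};

/* Grows pc to hold at least mincount entries, rounded up to a multiple of
 * 512; old entries are kept and the rest is zeroed. Returns 0 or -1. */
int sketch_alloc_ldt(struct sketch_ldt *pc, unsigned mincount)
{
    unsigned oldsize = pc->size;
    unsigned char *newldt;

    if (mincount <= oldsize)
        return 0;                                   /* already big enough */
    mincount = (mincount + 511u) & ~511u;           /* round up to 512 */
    newldt = malloc((size_t)mincount * SKETCH_LDT_ENTRY_SIZE);
    if (newldt == NULL)
        return -1;
    if (oldsize)
        memcpy(newldt, pc->ldt, (size_t)oldsize * SKETCH_LDT_ENTRY_SIZE);
    memset(newldt + (size_t)oldsize * SKETCH_LDT_ENTRY_SIZE, 0,
           (size_t)(mincount - oldsize) * SKETCH_LDT_ENTRY_SIZE);
    free(pc->ldt);                                  /* release the old table */
    pc->ldt = newldt;
    pc->size = mincount;
    return 0;
}

/* Gives dst a table at least as large as src's, then copies src's entries. */
int sketch_copy_ldt(struct sketch_ldt *dst, const struct sketch_ldt *src)
{
    if (sketch_alloc_ldt(dst, src->size) < 0)
        return -1;
    if (src->size)
        memcpy(dst->ldt, src->ldt, (size_t)src->size * SKETCH_LDT_ENTRY_SIZE);
    return 0;
}
```

After sketch_copy_ldt() the child has its own storage with the same contents as the parent's table, which is the outcome step B above describes.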

4 locking mechanism

5 Memory Management
5.1 swap mechanism
Types of memory reclaimed by the swap mechanism:
page_launder(): reclaims the pages on inactive_dirty_list
refill_inactive_scan(): moves pages on active_list to the inactive state
swap_out(): starting from init_mm.mmlist, scans all mm_struct structures and swaps out the pages of the ranges managed by their vm_area_struct structures
shrink_dcache_memory(): reclaims dentry structures
shrink_icache_memory(): reclaims inode structures
kmem_cache_reap(): reclaims free slab space
1
index:
address_space 587
dentry 428
dentry_operations 418
ext2_aops 587
ext2_dir_entry_2 427
ext2_dir_inode_operations 448
ext2_group_desc 533
ext2_inode 425
ext2_inode_info 458
ext2_sb_info 526
ext2_sops 455
ext2_super_block 527
file 541
file_operations 416
file_system_type 497
files_struct 543
fs_struct 417
inode 423
nameidata 433
proc_dir_entry 657
proc_sops 657
super_block 524
vfsmount 510

6 Kernel threads
migration_thread()
ksoftirqd()
7 Special functions
schedule()
path_lookup()
8 Global variables
CPU unit: the smallest unit the operating system recognises as a CPU; if SCHED_SMT is enabled it refers to each CPU thread, otherwise to each CPU core
cpumask_t cpu_sibling_map[NR_CPUS]: mask of CPU units, taking hyperthreading into account
cpumask_t cpu_core_map[NR_CPUS]: CPU core mask
cpumask_t cpu_online_map: mask of all active CPU units
u8 phys_proc_id[NR_CPUS]: APIC_ID of the CPU package each CPU unit belongs to, with the core-number part removed
u8 cpu_core_id[NR_CPUS]: number of the CPU core each CPU unit belongs to, excluding the package number
int smp_num_siblings: number of CPU units contained in the current CPU package; note that with SMT each CPU thread counts as one CPU unit
u8 cpu_to_node[NR_CPUS]: if phys_proc_id[i] is online (set in the bitmap node_online_map), then cpu_to_node[i] = phys_proc_id[i], otherwise it equals first_node(node_online_map)
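
The rule for filling cpu_to_node can be sketched as follows. This is an illustration only: it assumes node_online_map fits into a single unsigned long used as a bitmap, and the sketch_ helpers are hypothetical stand-ins for the kernel's node mask operations.

```c
#include <assert.h>
#include <limits.h>

/* Number of node bits an unsigned long bitmap can hold in this sketch. */
#define SKETCH_MAX_NODES ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Stand-in for node_online(): is bit `node` set in the bitmap? */
static int sketch_node_online(unsigned long map, int node)
{
    return node >= 0 && node < SKETCH_MAX_NODES && ((map >> node) & 1UL);
}

/* Stand-in for first_node(): lowest set bit, or -1 if the map is empty. */
static int sketch_first_node(unsigned long map)
{
    int node;

    for (node = 0; node < SKETCH_MAX_NODES; node++)
        if ((map >> node) & 1UL)
            return node;
    return -1;
}

/* A CPU maps to its package id if that id is an online node, otherwise to
 * the first online node. Requires at least one online node. */
void sketch_fill_cpu_to_node(const unsigned char *phys_proc_id,
                             unsigned char *cpu_to_node, int ncpus,
                             unsigned long node_online_map)
{
    int first = sketch_first_node(node_online_map);
    int i;

    assert(first >= 0);
    for (i = 0; i < ncpus; i++) {
        if (sketch_node_online(node_online_map, phys_proc_id[i]))
            cpu_to_node[i] = phys_proc_id[i];
        else
            cpu_to_node[i] = (unsigned char)first;
    }
}
```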
