Linux Kernel (2.6.13.2)
Miao Yanchao
Summary:
1 System startup
1.1 Early assembly code: head.S
head.S sets the initial CPU state and creates process 0 by building its stack:
movq init_rsp(%rip),%rsp
where init_rsp is defined as:
.globl init_rsp
init_rsp:
.quad init_thread_union+THREAD_SIZE-8
That is, the virtual address init_thread_union+THREAD_SIZE-8 becomes the bottom of the kernel stack of the current process (process 0).
init_thread_union is defined in arch/x86_64/kernel/init_task.c:
union thread_union init_thread_union __attribute__((__section__(".data.init_task"))) =
{ INIT_THREAD_INFO(init_task) };
INIT_THREAD_INFO is defined in include/asm-x86_64/thread_info.h and initializes init_thread_union.thread_info.task = &init_task. init_task is defined in the same file, init_task.c:
struct task_struct init_task = INIT_TASK(init_task);
The INIT_TASK macro is defined in include/linux/init_task.h.
The process control structure of process 0 is therefore set up statically at compile time, so that process 0 can be accessed like an ordinary kernel process.
init_task.mm = NULL; init_task.active_mm = &init_mm (initialized by INIT_MM(init_mm)); init_task.comm = "swapper"
INIT_MM initializes init_mm.pgd to swapper_pg_dir, that is, init_level4_pgt, which is defined in head.S. Process 0 is named swapper.
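The stack placement described above can be sketched in plain C. This is only an illustration: the `_sketch` types are hypothetical stand-ins for the kernel's `union thread_union` and `struct thread_info`, and the value 8192 for THREAD_SIZE is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

#define THREAD_SIZE 8192  /* assumed: two 4 KB pages per kernel stack */

struct task_sketch;       /* stand-in for struct task_struct */

struct thread_info_sketch {
    struct task_sketch *task;  /* back-pointer to the owning process */
};

/* thread_info sits at the lowest address; the stack grows down toward it. */
union thread_union_sketch {
    struct thread_info_sketch thread_info;
    unsigned long stack[THREAD_SIZE / sizeof(unsigned long)];
};

/* The value head.S loads into %rsp: 8 bytes below the end of the union. */
uintptr_t initial_stack_pointer(const union thread_union_sketch *u)
{
    return (uintptr_t)u + THREAD_SIZE - 8;
}
```

Because thread_info and the stack share one block, the owning task can be found from any stack address inside that block.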
The following assembly code jumps to the C function:
movl %esi,%edi // pass the function argument
movq initial_code(%rip),%rax
jmp *%rax
initial_code:
.quad x86_64_start_kernel
Execution continues in the C function x86_64_start_kernel(char *real_mode_data) in arch/x86_64/kernel/head64.c.
1.2 Function x86_64_start_kernel(char *real_mode_data)
1
2
3
4
Point every interrupt vector at the initial entry early_idt_handler and load the interrupt descriptor table register from idt_descr
5
6
7
setup_boot_cpu_data(): fill in the CPU information structure boot_cpu_data, using the cpuid instruction
1.3.1
lock_kernel(): lib/kernel_lock.c implements the BKL (big kernel lock), which is used through lock_kernel/unlock_kernel.
If PREEMPT_BKL is enabled, the lock is implemented with the semaphore kernel_sem; otherwise it is implemented with the spinlock kernel_flag. PREEMPT_BKL is usually enabled by default.
lock_kernel: when task->lock_depth equals -1, execute down(&kernel_sem); in every case execute current->lock_depth++
unlock_kernel: execute --current->lock_depth and, once the depth drops back below 0, up(&kernel_sem)
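The depth counting described above can be illustrated with a small sketch. It assumes a single task and replaces the semaphore kernel_sem with a boolean flag; every `_sketch` name is hypothetical and not part of the kernel.

```c
#include <stdbool.h>

/* Hypothetical stand-ins: one task, and a flag in place of kernel_sem. */
struct bkl_task_sketch { int lock_depth; };  /* -1 means "BKL not held" */
static bool kernel_sem_held;

static void down_sketch(void) { kernel_sem_held = true; }
static void up_sketch(void)   { kernel_sem_held = false; }

void lock_kernel_sketch(struct bkl_task_sketch *t)
{
    int depth = t->lock_depth + 1;
    if (depth == 0)            /* outermost acquisition takes the semaphore */
        down_sketch();
    t->lock_depth = depth;     /* nested acquisitions only count */
}

void unlock_kernel_sketch(struct bkl_task_sketch *t)
{
    if (--t->lock_depth < 0)   /* outermost release gives it back */
        up_sketch();
}

bool bkl_held_sketch(void) { return kernel_sem_held; }
```

The counter is what makes the lock recursive: a task that already holds the BKL can call lock_kernel again without blocking on itself.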
page_address_init(): an empty function on x86-64 systems.
printk(linux_banner): print the kernel banner
setup_arch(&command_line): architecture-specific initialization
1.3.2
9
setup_memory_region ():
I
II
III
10
11
12
13
14
15
16
end_pfn = e820_end_of_ram(): parse the E820 map and set the memory-related global variables: end_user_pfn: the number of pages set by the boot parameter mem=xx; end_pfn_map: the number of pages of system RAM (main memory) for which the direct mapping is established, accessed through the __va and __pa address macros; end_pfn: the number of pages directly managed by the operating system
check_efer: read the MSR register MSR_EFER and check the extended feature register
init_memory_mapping(0, end_pfn_map << PAGE_SHIFT): build the direct-mapping page tables
I
find_early_table_space: from the amount of memory to be mapped, compute the total size in bytes (tables) of the pud and pmd tables (2M pages, 3-level page tables), rounded up to an integer multiple of PAGE_SIZE. Then use the E820 map to search, starting from physical address 8000h, for tables bytes of contiguous physical memory, skipping the reserved interval [640KB, _end]. Under normal circumstances the search result start is 8000h itself. Set the global variables table_end = table_start = start >>
II
III
17
IV
unmap_low_page(map): remove the temporary mapping at 40M or 42M set up by alloc_low_page
acpi_boot_table_init (in arch/i386/kernel/acpi/boot.c): ACPI initialization
I
acpi_table_init (drivers/acpi/tables.c): ACPI table initialization (Initialize the ACPI boot-time table parser)
i acpi_find_rsdp: locate the RSDP (Root System Description Pointer),
A acpi_scan_rsdp(0, 0x400): search the interval [0, 3FFh] for the RSDP signature "RSD PTR ".
B acpi_scan_rsdp(0xE0000, 0x20000): search the interval [E_0000h, F_FFFFh] for the RSDP signature "RSD PTR ".
C Return the address of the signature if the search succeeds; otherwise return 0.
ii Display the "RSDP (rsdp->version, rsdp->oem_id, rsdp_phys)" information through printk.
iii acpi_table_compute_checksum: compute the checksum of the rsdp
iv acpi_table_get_sdt(rsdp): taking version 2.0 and above as an example:
A sdt_pa = ((struct acpi20_table_rsdp *)rsdp)->xsdt_address: get the physical address of the XSDT
B header = __acpi_map_table(sdt_pa): get the virtual address of the ACPI table header. x86-64 uses the __va direct mapping; i386 also uses the __va direct mapping when sdt_pa is below 8M and a fixed mapping above 8M
C mapped_xsdt = __acpi_map_table(sdt_pa): map the entire XSDT (Extended System Description Table)
D Check the XSDT header signature "XSDT" and the checksum
E Set sdt_count and store the physical address of each XSDT table entry in sdt_entry[i].pa
F acpi_table_print(header, sdt_pa): display the header information through printk
G __acpi_map_table(sdt_entry[i].pa): for each XSDT entry, map the acpi_table_header structure at the physical address sdt_entry[i].pa, call acpi_table_print to display it, and compute its checksum. Set the sdt_entry[i].size field, compare the signature header->signature with the names in the array acpi_table_signatures, and set the sdt_entry[i].id field. The array acpi_table_signatures is defined in a rather unusual form.
H acpi_get_table_header_early: look up ACPI_DSDT and call acpi_table_print to print it; its physical address is unknown, so it is set to 0
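The RSDP search and checksum in steps i to iii above can be sketched as follows. This is a user-space illustration over an ordinary byte buffer, not the kernel routine: the `_sketch` names are hypothetical, and the sketch returns -1 instead of 0 on failure so that offset 0 remains a valid hit.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sum of all bytes modulo 256; a valid ACPI structure sums to 0. */
uint8_t acpi_checksum_sketch(const uint8_t *p, size_t len)
{
    uint8_t sum = 0;
    while (len--)
        sum += *p++;
    return sum;
}

/*
 * Scan buf[0..len) on 16-byte boundaries for the 8-byte signature
 * "RSD PTR " (note the trailing space). Returns the offset of the
 * signature, or -1 when it is absent.
 */
long scan_rsdp_sketch(const uint8_t *buf, size_t len)
{
    static const char sig[8] = { 'R', 'S', 'D', ' ', 'P', 'T', 'R', ' ' };

    for (size_t off = 0; off + sizeof(sig) <= len; off += 16)
        if (memcmp(buf + off, sig, sizeof(sig)) == 0)
            return (long)off;
    return -1;
}
```

Stepping by 16 bytes reflects the ACPI rule that the RSDP is placed on a 16-byte boundary.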
II
III
18
acpi_blacklisted (drivers/acpi/blacklist.c): check whether any ACPI table in sdt_entry[*] matches an ID listed in acpi_blacklist[]; if one matches, an error is reported and ACPI may be turned off by calling acpi_disable
sizeof (struct
acpi_table_srat),
setup_node(pxm):
(I) nodes_weight(nodes_found): ends up calling generic_hweight64(nodes_found), which counts the bits set to 1 in nodes_found
(II) first_unset_node: find the first node number whose bit is zero
(III) node_set(node, nodes_found)
(IV) pxm2node[pxm] = node
(3) cpu_to_node[num_processors] = node, acpi_numa = 1
(4) Display information: printk(KERN_INFO "SRAT: PXM %u -> APIC %u -> CPU %u ->
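The node allocation in setup_node can be sketched as below. The sketch assumes a 64-bit node mask and a 256-entry PXM table; both limits and all `_sketch` names are illustrative assumptions rather than kernel definitions.

```c
#include <stdint.h>

#define MAX_NODES_SKETCH 64   /* assumed: one bit per node in a 64-bit mask */
#define MAX_PXM_SKETCH   256  /* assumed table size */

static uint64_t nodes_found;            /* bit n set => node n is allocated */
static int pxm2node[MAX_PXM_SKETCH];    /* -1 => proximity domain unmapped */

void pxm_map_reset_sketch(void)
{
    nodes_found = 0;
    for (int i = 0; i < MAX_PXM_SKETCH; i++)
        pxm2node[i] = -1;
}

/* Count the bits set to 1, which is what generic_hweight64 computes. */
int hweight64_sketch(uint64_t w)
{
    int n = 0;
    for (; w; w &= w - 1)   /* clear the lowest set bit each round */
        n++;
    return n;
}

/* Lowest node number whose bit is still zero, or -1 if none is left. */
int first_unset_node_sketch(uint64_t mask)
{
    for (int n = 0; n < MAX_NODES_SKETCH; n++)
        if (!(mask & (UINT64_C(1) << n)))
            return n;
    return -1;
}

/* Return the node for proximity domain pxm, allocating one on first use. */
int setup_node_sketch(int pxm)
{
    if (pxm < 0 || pxm >= MAX_PXM_SKETCH)
        return -1;
    if (pxm2node[pxm] >= 0)
        return pxm2node[pxm];
    if (hweight64_sketch(nodes_found) >= MAX_NODES_SKETCH)
        return -1;                         /* every node number is taken */
    int node = first_unset_node_sketch(nodes_found);
    nodes_found |= UINT64_C(1) << node;    /* node_set(node, nodes_found) */
    pxm2node[pxm] = node;
    return node;
}
```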
19
20
IV
acpi_table_parse(ACPI_SLIT, acpi_parse_slit): parse the SLIT (System Locality Information Table)
V
acpi_numa_arch_fixup: an empty function
If the NUMA compile option is enabled, numa_initmem_init(0, end_pfn) is called; otherwise contig_initmem_init(0, end_pfn) is called
numa_initmem_init:
If the NUMA_EMU compile option is enabled, call numa_emulation(0, end_pfn); on success,
I
II
III
i find_northbridge: find the CPU northbridge device that provides the memory address map (function 0: HyperTransport Technology Configuration, function 1: Address Map), (VendorID:DeviceID) = (1022:1100/1101); return its device number
ii Display: printk(KERN_INFO "Scanning NUMA topology in Northbridge %d\n", nb);
iii Read offset 60h (NodeID) of northbridge device function 0 (1022:1100) and compute the number of nodes in the system, that is, the number of processors.
iv Display: printk(KERN_INFO "Number of nodes %d\n", numnodes)
v Read offsets 40h-7Ch of northbridge function 1 to obtain the memory layout and the nodeid that owns each memory range. Record them in the local variable nodes (nodes[nodeid].start = base, nodes[nodeid].end = limit) and mark each valid nodeid in nodes_parsed.
vi memnode_shift = compute_hash_shift(nodes, numnodes)
A maxend = MAX{nodes[*].end}
B Find the minimum shift for which the condition (1UL << shift) < maxend / NODEMAPSIZE no longer holds, with NODEMAPSIZE = 0xFF; return shift
C For every memory address addr, at a granularity of (1UL << shift), set memnodemap[addr >> shift] = i, where i is the NODE number (0-7)
Note: memnode_shift divides the total physical memory into at most 255 segments, each of capacity (1UL << shift), so that every segment belongs to exactly one NODE; the NODE number of each segment is recorded in memnodemap[0..254].
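The hash construction in step vi can be sketched as follows. It follows the loop condition quoted in the text; starting the search at shift 0 and the explicit bounds guard are assumptions of this sketch, as are the `_sketch` names.

```c
#include <stdint.h>

#define NODEMAPSIZE 0xff

struct node_range_sketch { uint64_t start, end; };  /* [start, end) in bytes */

static uint8_t memnodemap[NODEMAPSIZE];

/* Pick the segment size and record which node owns each segment. */
int compute_hash_shift_sketch(const struct node_range_sketch *nodes,
                              int numnodes)
{
    uint64_t maxend = 0;
    int shift = 0;  /* assumption: the search starts at 0 */

    for (int i = 0; i < numnodes; i++)
        if (nodes[i].end > maxend)
            maxend = nodes[i].end;

    /* Grow the segment until about NODEMAPSIZE segments cover maxend. */
    while ((UINT64_C(1) << shift) < maxend / NODEMAPSIZE)
        shift++;

    for (int i = 0; i < numnodes; i++) {
        for (uint64_t addr = nodes[i].start; addr < nodes[i].end;
             addr += UINT64_C(1) << shift) {
            uint64_t slot = addr >> shift;
            if (slot < NODEMAPSIZE)   /* guard added by this sketch */
                memnodemap[slot] = (uint8_t)i;
        }
    }
    return shift;
}

/* Node lookup for a physical address, as phys_to_nid does. */
int phys_to_nid_sketch(uint64_t addr, int shift)
{
    return memnodemap[addr >> shift];
}
```

With two nodes covering [0, 2G) and [2G, 4G), this yields a shift of 25, so each table slot describes 32 MB of physical memory.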
vii Display: printk(KERN_INFO "Using node hash shift of %d\n", memnode_shift)
viii For every NODE configured with physical memory: set cpu_to_node[i] = i and call setup_node_bootmem(i, nodes[i].start, nodes[i].end):
A start = round_up(start, ZONE_ALIGN): round up the starting address; ZONE_ALIGN:
K bootmap_start = find_e820_area(bootmap_start, end, bootmap_pages << PAGE_SHIFT): allocate the memory for this NODE's bitmap
L bootmap_size = init_bootmem_node(node_data[nodeid], bootmap_start >> PAGE_SHIFT, start_pfn, end_pfn): directly calls init_bootmem_core(pgdat, freepfn/mapstart, startpfn, endpfn), passing the parameters through in the same order
a bdata = node_data[nodeid]->bdata (previously set to &plat_node_bdata[nodeid])
b bdata->node_bootmem_map = phys_to_virt(mapstart << PAGE_SHIFT)
c bdata->node_boot_start = (start << PAGE_SHIFT) (reset)
d bdata->node_low_pfn = end
e Set every bit of the bitmap bdata->node_bootmem_map to 1, which reserves all memory
f Return the bitmap size, aligned to 8 bytes
M e820_bootmem_free(node_data[nodeid], start, end): for every region of the e820 map that belongs to this node and is of type E820_RAM, that is, marked as usable memory, call free_bootmem_node
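The bitmap convention used by the boot memory allocator (a set bit means reserved, a clear bit means free) can be sketched as below. The 64-page map and the `_sketch` names are assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

#define PAGES_SKETCH 64                       /* assumed size of the node */
static unsigned char bootmem_map[PAGES_SKETCH / 8];

/* Bitmap bytes for `pages` pages, rounded up to a multiple of sizeof(long). */
size_t bootmap_bytes_sketch(size_t pages)
{
    size_t bytes = (pages + 7) / 8;
    return (bytes + sizeof(long) - 1) & ~(sizeof(long) - 1);
}

/* Step e: start with every page marked reserved. */
void init_bootmem_sketch(void)
{
    memset(bootmem_map, 0xff, sizeof(bootmem_map));
}

/* Step M: clear the bits of usable RAM in [start_pfn, end_pfn). */
void free_bootmem_sketch(size_t start_pfn, size_t end_pfn)
{
    for (size_t pfn = start_pfn; pfn < end_pfn; pfn++)
        bootmem_map[pfn / 8] &= (unsigned char)~(1u << (pfn % 8));
}

/* Reserving a range sets its bits again. */
void reserve_bootmem_sketch(size_t start_pfn, size_t end_pfn)
{
    for (size_t pfn = start_pfn; pfn < end_pfn; pfn++)
        bootmem_map[pfn / 8] |= (unsigned char)(1u << (pfn % 8));
}

int page_is_reserved_sketch(size_t pfn)
{
    return (bootmem_map[pfn / 8] >> (pfn % 8)) & 1;
}
```

Starting from "everything reserved" means that memory the e820 map never mentions can never be handed out by mistake.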
IV
bootmap_size = bootmem_bootmap_pages(end_pfn) << PAGE_SHIFT: compute the total size required for the memory bitmap
III
IV
bootmap = find_e820_area(0, end_pfn << PAGE_SHIFT, bootmap_size): allocate space for the bitmap
bootmap_size = init_bootmem(bootmap >> PAGE_SHIFT, end_pfn):
i max_low_pfn = pages, min_low_pfn = start
ii init_bootmem_core(NODE_DATA(0), start, 0, pages): set up the bitmap
V
VI
e820_bootmem_free(NODE_DATA(0), 0, end_pfn << PAGE_SHIFT): free all usable memory according to the e820 map
reserve_bootmem(bootmap, bootmap_size): reserve the memory occupied by the bitmap
22 reserve_bootmem_generic(table_start << PAGE_SHIFT, (table_end - table_start) << PAGE_SHIFT): reserve the memory that holds the direct-mapping page tables
I int nid = phys_to_nid(phys): look up nid through memnodemap[addr >> memnode_shift]
II reserve_bootmem_node(NODE_DATA(nid), phys, len): directly calls reserve_bootmem_core(pgdat->bdata, physaddr, size), which sets the corresponding bits in bdata->node_bootmem_map to mark the range reserved
23 Reserve the memory of the kernel image, that is, the region [1M, __pa(_end)], and reserve physical page 0
24 reserve_ebda_region(): reserve the EBDA area
25 If the SMP option is enabled: reserve the trampoline memory area, namely page 6
26 If the ACPI_SLEEP option is enabled, acpi_reserve_bootmem calls alloc_bootmem_low to allocate a block of physical memory and saves its address in acpi_wakeup_address
I find_smp_config: directly calls find_intel_smp, which uses smp_scan_config to search the intervals [0, 1K), [639K, 640K) and [960K, 1024K) for the SMP configuration and returns on success. If that fails, it reads the value addr stored at physical address 40Eh, multiplies it by 16 to obtain the base address base, and calls smp_scan_config again to search the interval [base, base + 4K)
II smp_scan_config: search the given area for the MP table signature "_MP_". If the other fields match and the checksum is correct, the MP table has been found: reserve the page that contains it and set smp_found_config = 1. If the MP table also carries a configuration table address, that is, the first field after the signature is not 0, reserve the page of the configuration table as well
27
If the BLK_DEV_INITRD option is enabled and the initrd region is valid, reserve the initrd region; base address: INITRD_START; length: INITRD_SIZE
28
If the KEXEC option is enabled, reserve the region described by crashk_res. The region is specified by the boot parameter "crashkernel="
29
30
KEXEC NOTE: kexec is a system call that implements the ability to shut down your current kernel and to start another kernel. It is like a reboot, but it is independent of the system firmware. And like a reboot, you can start any kernel with it, not just Linux.
sparse_init: takes effect when the SPARSEMEM compile option is enabled. For every valid memory SECTION, perform the following: for SECTION number pnum, call alloc_bootmem_node to allocate a map area from the memory of the NODE that the SECTION belongs to, holding one page structure for each page of the SECTION, and set mem_section[pnum].section_mem_map |= map - section_nr_to_pfn(pnum)
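Storing map - section_nr_to_pfn(pnum) lets a global page frame number index the section's page array directly. A minimal sketch of that encoding follows; it assumes 16 pages per section, uses hypothetical `_sketch` names, and does the subtraction on integers so that the C code stays well defined.

```c
#include <stddef.h>
#include <stdint.h>

#define PFN_SECTION_SHIFT_SKETCH 4   /* assumed: 16 pages per section */

struct page_sketch { unsigned long flags; };

unsigned long section_nr_to_pfn_sketch(unsigned long nr)
{
    return nr << PFN_SECTION_SHIFT_SKETCH;
}

/* Bias the map address by the first pfn of the section. */
uintptr_t encode_mem_map_sketch(struct page_sketch *map, unsigned long pnum)
{
    return (uintptr_t)map
         - section_nr_to_pfn_sketch(pnum) * sizeof(struct page_sketch);
}

/* With the bias in place, a global pfn indexes the section map directly. */
struct page_sketch *pfn_to_page_sketch(uintptr_t encoded, unsigned long pfn)
{
    return (struct page_sketch *)(encoded + pfn * sizeof(struct page_sketch));
}
```

The kernel combines the biased pointer with "|=" because the low bits of section_mem_map also carry flags; the sketch leaves those out.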
paging_init (NUMA version): for each valid NODE i in node_possible_map, call setup_node_zones(i):
I start_pfn, end_pfn: the first and last page numbers of the NODE's memory; the range may contain holes
II Set zones[ZONE_DMA] and zones[ZONE_NORMAL]: the valid size, according to e820, within the memory interval [start_pfn, end_pfn]
III Set holes[ZONE_DMA] and holes[ZONE_NORMAL]: the invalid size, according to e820, within the memory interval [start_pfn, end_pfn], that is, the size of the memory holes within the NODE
Note: if start_pfn is not below dma_end_pfn (16M), zones[ZONE_DMA] and holes[ZONE_DMA] are both 0
IV
memmap_init(size, nid, j, zone_start_pfn): directly calls memmap_init_zone with exactly the same parameters
For each physical page within this zone, set the attributes of its page structure:
a Set page->flags
b Mark the page Reserved
31
32
33
specification)
Call smp_read_mpc to parse the MPC (multiprocessor configuration) table and process the per-processor information
Check that the MPC table signature is "PCMP", and check the checksum, the version number and whether a LAPIC is present
i
Display: printk(KERN_INFO "OEM ID: %s", str)
ii
Display: printk(KERN_INFO "Product ID: %s", str)
iii
Display: printk(KERN_INFO "APIC at: 0x%X\n", mpc->mpc_lapic)
iv
v
vi
vii
viii
ix
D Set bios_cpu_apicid[cpu] = x86_cpu_to_apicid[cpu] = m->mpc_apicid, where cpu = 0 for the BP and, for an AP, cpu is the CPU's sequence number in the MPC table
E Mark the current CPU valid in the global variables cpu_possible_map and cpu_present_map
For a bus entry, call MP_bus_info:
For an IOAPIC entry, call MP_ioapic_info:
Display: printk("I/O APIC #%d Version %d at 0x%X.\n", ...);
For an interrupt source entry, call MP_intsrc_info:
34
35
36
37
38
39
40
41
x
For a local interrupt source entry, call MP_lintsrc_info:
init_apic_mappings
Establish the fixed mapping FIX_APIC_BASE
I
Establish the fixed mappings FIX_IO_APIC_BASE_0 etc. for the IOAPICs
II
probe_roms: record the resources (address ranges, etc.) occupied by ROMs
e820_reserve_resources: according to the e820 map, reserve the e820 memory regions in iomem_resource (whose initial value is the whole address space), and record within each e820 region the space occupied by the code and data segments of the kernel image
Reserve in iomem_resource the resource space occupied by video RAM
Reserve in ioport_resource the resource space occupied by the standard I/O devices
If the GART_IOMMU compile option is enabled, call iommu_hole_init
If the VGA_CONSOLE compile option is enabled (it normally is), set conswitchp = &vga_con; otherwise, if DUMMY_CONSOLE is enabled, set conswitchp = &dummy_con
End of setup_arch
Later architecture-independent initialization in start_kernel
1.3.3
42
43
44
45
setup_per_cpu_areas: for each CPU in the system, call alloc_bootmem to allocate a memory area ptr, copy the contents of the data area [__per_cpu_start, __per_cpu_end] into ptr, and set cpu_pda[cpu].data_offset = ptr - __per_cpu_start
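The offset scheme used by setup_per_cpu_areas can be sketched in user space. Here malloc stands in for alloc_bootmem, a small static struct stands in for the linker section [__per_cpu_start, __per_cpu_end], and all names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NR_CPUS_SKETCH 4

/* Stand-in for the per-CPU template section emitted by the linker. */
static struct { int counter; long stamp; } per_cpu_template = { 7, 42 };

/* Plays the role of cpu_pda[cpu].data_offset. */
static intptr_t data_offset[NR_CPUS_SKETCH];

/* Give every CPU a private copy of the template and remember its offset. */
int setup_per_cpu_areas_sketch(void)
{
    for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
        void *ptr = malloc(sizeof(per_cpu_template));
        if (!ptr)
            return -1;
        memcpy(ptr, &per_cpu_template, sizeof(per_cpu_template));
        data_offset[cpu] = (intptr_t)ptr - (intptr_t)&per_cpu_template;
    }
    return 0;
}

/* Address of one CPU's private copy of a variable from the template. */
void *per_cpu_ptr_sketch(void *template_var, int cpu)
{
    return (void *)((intptr_t)template_var + data_offset[cpu]);
}
```

Because only one offset is stored per CPU, any per-CPU variable is reached by adding that offset to the variable's address in the template.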
smp_prepare_boot_cpu: mark this processor valid in the global variables cpu_online_map, cpu_callout_map, cpu_sibling_map[0] and cpu_core_map
sched_init: initialize the run queue (runqueue) of each CPU, increment the reference count init_mm.mm_count, and initialize the idle process for the current CPU
build_all_zonelists: for each NODE i in the system, call build_zonelists(NODE_DATA(i)), whose argument is pgdat:
Initialize pgdat->node_zonelists[*].zones[0] = NULL
Traverse all NODEs of the system from nearest to farthest, according to the distance between this NODE and every other NODE,
I
II
calling build_zonelists_node so that pgdat->node_zonelists[*].zones[i] point to the zones of the corresponding type in each NODE
Note: pgdat (of type struct pglist_data {}) defines:
struct zone node_zones[MAX_NR_ZONES]; // MAX_NR_ZONES = 3
struct zonelist node_zonelists[GFP_ZONETYPES]; // GFP_ZONETYPES = 3
and zonelist is defined as:
struct zonelist {
struct zone *zones[MAX_NUMNODES * MAX_NR_ZONES + 1];
};
That is, the number of zone pointers equals the number of zones per NODE multiplied by the number of NODEs in the system, plus one terminating NULL pointer, so the pointers of a zonelist can point to every zone of every NODE in the system.
The initialization described above makes the pointers in each NODE's zonelists point to the zones of all NODEs in the system, in ascending order of NODE distance, nearest first.
Display: printk("Built %i zonelists\n", num_online_nodes())
46
47
48
49
50
51
III
IV
cpuset_init_current_mems_allowed: set current->mems_allowed = NODE_MASK_ALL
page_alloc_init: directly invokes the macro hotcpu_notifier(page_alloc_cpu_notify, 0), which defines the static notifier block struct notifier_block page_alloc_cpu_notify_nb = { page_alloc_cpu_notify, 0 } and calls register_cpu_notifier to register page_alloc_cpu_notify_nb on the global CPU notifier chain cpu_chain
Display: printk(KERN_NOTICE "Kernel command line: %s\n", saved_command_line)
parse_early_param: parse the early boot parameters
parse_args: parse the command-line parameters: parse_args("Booting kernel", command_line, __start___param, __stop___param - __start___param, &unknown_bootoption);
sort_main_extable: directly calls sort_extable(__start___ex_table, __stop___ex_table), which in turn calls the sort function to sort the entries of the exception table
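Sorting the exception table by faulting-instruction address is what allows a later lookup by binary search. The sketch below shows the idea with the C library's qsort standing in for the kernel's sort function; the entry layout and the `_sketch` names are assumptions.

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumed layout: faulting instruction address and its fixup address. */
struct extable_entry_sketch { unsigned long insn, fixup; };

static int cmp_extable_sketch(const void *a, const void *b)
{
    const struct extable_entry_sketch *x = a, *y = b;
    return (x->insn > y->insn) - (x->insn < y->insn);
}

/* Sort [start, finish) by insn; qsort replaces the kernel's sort(). */
void sort_extable_sketch(struct extable_entry_sketch *start,
                         struct extable_entry_sketch *finish)
{
    qsort(start, (size_t)(finish - start), sizeof(*start),
          cmp_extable_sketch);
}

/* Binary search on the sorted table; returns the fixup address or 0. */
unsigned long search_extable_sketch(const struct extable_entry_sketch *start,
                                    const struct extable_entry_sketch *finish,
                                    unsigned long addr)
{
    size_t lo = 0, hi = (size_t)(finish - start);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (start[mid].insn == addr)
            return start[mid].fixup;
        if (start[mid].insn < addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}
```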
trap_init: exception initialization
Initialize the exception vectors (the interrupt vectors below 32)
I
II
cpu_init(): initialize the CPU
If this is not CPU 0 (CPU 0 is initialized in head64.c), call pda_init(cpu) to set up the basic CPU information
i
and display: printk("Initializing CPU#%d\n", cpu)
ii
Set up the GDT and IDT
iii
iv
wrmsrl(MSR_STAR, ((u64)__USER32_CS) << 48 | ((u64)__KERNEL_CS) << 32): set the x86 legacy mode system call entry address
wrmsrl(MSR_LSTAR, system_call): set the long mode 64-bit software entry address
syscall32_cpu_init(): requires the IA32_EMULATION compile option; sets the system call entry addresses for compatibility-mode software under long mode
Note: x86-64 uses the syscall/sysret instructions to enter and return from system calls, taking the entry addresses from the STAR (C000_0081h), LSTAR (C000_0082h) and CSTAR (C000_0083h) mode registers. The associated assembly entry points are ia32_syscall, ia32_cstar_target and system_call. The system call tables are located in the files arch/x86_64/ia32/ia32entry.S and include/asm-x86_64/unistd.h. The 80h software interrupt is no longer used, but it remains available after the IA32_EMULATION compile option is enabled.
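The value written to MSR_STAR packs two segment selectors into the upper half of a 64-bit word. A small sketch of that packing follows; the selector values 0x10 and 0x23 are assumptions chosen for illustration, and the function names are hypothetical.

```c
#include <stdint.h>

#define KERNEL_CS_SKETCH 0x10   /* assumed selector value */
#define USER32_CS_SKETCH 0x23   /* assumed selector value */

/* Bits 63..48 hold the SYSRET selector base, bits 47..32 the SYSCALL one. */
uint64_t make_star_sketch(uint16_t user32_cs, uint16_t kernel_cs)
{
    return ((uint64_t)user32_cs << 48) | ((uint64_t)kernel_cs << 32);
}

uint16_t star_syscall_cs_sketch(uint64_t star)
{
    return (uint16_t)(star >> 32);
}

uint16_t star_sysret_cs_sketch(uint64_t star)
{
    return (uint16_t)(star >> 48);
}
```

The low 32 bits stay zero in this sketch; they hold the legacy-mode SYSCALL target address, which the 64-bit entry path does not use.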
52
53
III
fpu_init(): initialize the floating-point unit
rcu_init: RCU initialization,
Call rcu_cpu_notify;
I
Call register_cpu_notifier to register the RCU notifier block rcu_nb on the cpu_chain list; the callback of rcu_nb
II
is rcu_cpu_notify;
init_IRQ:
I init_ISA_irqs:
i Call init_bsp_APIC: return immediately if an SMP configuration was found or the CPU has no APIC; otherwise set up the local APIC
ii Call init_8259A(0): initialize the 8259
iii Initialize the interrupt descriptor array irq_desc[224] to the empty state; the first 16 interrupts are handled by the 8259A type
iv Set the interrupt vectors [32..255] up as interrupt gates
56
57
1000000, vxtime_hz % 1000000, timename); timename is the name of the clock in use
printk(KERN_INFO "time.c: Detected %d.%03d MHz processor.\n", cpu_khz / 1000, cpu_khz % 1000);
IX
rdtscll_sync(&vxtime.last_tsc): set vxtime.last_tsc to the current value of the TSC
X
setup_irq(0, &irq0): install irq0 as the timer interrupt handler
XI
set_cyc2ns_scale(cpu_khz / 1000)
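set_cyc2ns_scale prepares a fixed-point factor so that TSC cycles can later be converted to nanoseconds without a division. The sketch below assumes a 10-bit scale factor and uses hypothetical `_sketch` names; it illustrates the technique and is not a copy of the kernel code.

```c
#include <stdint.h>

#define NS_SCALE_SKETCH 10   /* assumed: 2^10 fixed-point scaling */

static uint64_t cyc2ns_scale;

/* Precompute nanoseconds per cycle, scaled by 2^10; cpu_mhz must be nonzero. */
void set_cyc2ns_scale_sketch(unsigned long cpu_mhz)
{
    cyc2ns_scale = (UINT64_C(1000) << NS_SCALE_SKETCH) / cpu_mhz;
}

/* Convert cycles to nanoseconds with one multiply and one shift. */
uint64_t cycles_2_ns_sketch(uint64_t cyc)
{
    return (cyc * cyc2ns_scale) >> NS_SCALE_SKETCH;
}
```

On a 2000 MHz CPU the factor is 512, so 2000 cycles convert to (2000 * 512) >> 10 = 1000 ns.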
XII
time_init_gtod: requires the SMP option; "Decide after all CPUs are booted what mode gettimeofday should use"
i unsynchronized_tsc()
ii Display: printk(KERN_INFO "time.c: Using %s based timekeeping.\n", timetype)
58
console_init: early console initialization
VIII
59
tty_register_ldisc(N_TTY, &tty_ldisc_N_TTY): set tty_ldiscs[N_TTY] ("Setup the default TTY line discipline")
II
disable_early_printk: if the EARLY_PRINTK compile option is enabled, turn off the early printk output
Execute the function pointers in [__con_initcall_start, __con_initcall_end], that is, run every function defined with console_initcall
III
profile_init:
If prof_on = 0, return immediately; otherwise continue. prof_on is turned on by the boot parameter profile=xx; see the note in this section
prof_len = (_etext - _stext) >> prof_shift
prof_buffer = alloc_bootmem(prof_len * sizeof(atomic_t))
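The profile buffer holds one counter per 2^prof_shift bytes of kernel text. The sizing and the mapping from a program counter to a slot can be sketched as follows; clamping out-of-range addresses to the last slot is an assumption of this sketch, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of counters needed for the text range [stext, etext). */
size_t prof_len_sketch(uintptr_t stext, uintptr_t etext, unsigned prof_shift)
{
    return (size_t)((etext - stext) >> prof_shift);
}

/* Slot that a sampled program counter falls into, clamped to the buffer. */
size_t prof_slot_sketch(uintptr_t pc, uintptr_t stext,
                        unsigned prof_shift, size_t prof_len)
{
    size_t slot = (size_t)((pc - stext) >> prof_shift);
    return slot < prof_len ? slot : prof_len - 1;
}
```

A larger prof_shift gives a smaller buffer at the cost of coarser attribution of samples to code addresses.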
ii
With the boot parameter "profile=n" (n is a number), profile_setup is executed
60
61
62
struct per_cpu_pages *pcp = &zone->pageset[get_cpu()]->pcp[0]: get the single-page list structure pcp of the current processor in the zone that this page belongs to
d Add the page to the list pcp->list through the page->lru pointer and increment the pcp count. If the count exceeds the pcp maximum pcp->high, call free_pages_bulk(zone, pcp->batch, &pcp->list, 0) to release pcp->batch pages.
e free_pages_bulk: while the list pcp->list is not empty and fewer than pcp->batch pages have been processed, call __free_pages_bulk, which releases one page at a time. __free_pages_bulk:
(1)
destroy_compound_page(struct page *page, unsigned long order): called when the HUGETLB_PAGE compile option is enabled and order > 0
(I)
(II)
If the PG_compound flag is not set, the page is not a compound page; return immediately.
if (page[1].index != order) is an error: the page structure of the second 4KB subpage of a large page should record the page order in its index member
(III)
ClearPageCompound: clear the compound-page flag of every subpage of the large page, and check that the private member of each subpage points to the page structure of the first page
(IV)
page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1): the index of the current page within its BUDDY range
(V)
(VI)
(VII)
while (order < MAX_ORDER - 1) {...}: this loop implements the memory reclaiming step of the buddy algorithm
1
bad_range(zone, buddy): check whether buddy lies within the valid range and belongs to this zone; if not, exit the loop
4
5
page_is_buddy(buddy, order): check whether buddy is a valid buddy; if not, exit the loop
(zone->free_area + order)->nr_free--; rmv_page_order(buddy): buddy has been merged into a larger page, so remove it from the current order
6
page_idx = combined_idx; order++: process the next iteration of the loop
set_page_order(page, order): page->private = order; set the PG_private flag in page->flags
list_add(&page->lru, &zone->free_area[order].free_list): add the page structure to the corresponding free list
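The merging loop described above can be reduced to index arithmetic. The sketch below models a single 8-page region with an array instead of page structures and free lists; the sizes and every `_sketch` name are assumptions made for illustration.

```c
#define MAX_ORDER_SKETCH 4
#define NPAGES_SKETCH (1u << (MAX_ORDER_SKETCH - 1))  /* 8 pages */

/* free_order[i]: order of the free block that starts at page i, or -1. */
static int free_order[NPAGES_SKETCH];

void buddy_reset_sketch(void)
{
    for (unsigned i = 0; i < NPAGES_SKETCH; i++)
        free_order[i] = -1;
}

/*
 * Free the block of 2^order pages starting at page_idx, merging it with
 * its buddy for as long as the buddy is free at the same order.
 * Returns the order of the block that finally lands on the free list.
 */
int buddy_free_sketch(unsigned page_idx, int order)
{
    while (order < MAX_ORDER_SKETCH - 1) {
        unsigned buddy_idx = page_idx ^ (1u << order);

        if (buddy_idx >= NPAGES_SKETCH)        /* plays bad_range() */
            break;
        if (free_order[buddy_idx] != order)    /* plays page_is_buddy() */
            break;
        free_order[buddy_idx] = -1;            /* plays rmv_page_order() */
        page_idx &= ~(1u << order);            /* combined_idx */
        order++;
    }
    free_order[page_idx] = order;              /* set_page_order + list_add */
    return order;
}
```

Flipping bit `order` of the index gives the buddy, and clearing that bit gives the start of the merged block, which is why the loop needs no search.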
63
64
C
D
E
free_pages_bulk(page_zone(page), 1, &list, order): release the page, as described above
kmem_cache_init: initialize the kmem_cache mechanism
I Initialize the list cache_chain
II Initialize cache_cache and add it to cache_chain; every time kmem_cache_create creates a kmem_cache, it links the new cache into the list cache_chain through its next field
III Initialize malloc_sizes: for the cs_cachep and cs_dmacachep members of each malloc_sizes[*], call the function
IV
_id ()]
V
register_cpu_notifier (& cpucache_notifier)
setup_per_cpu_pageset:
I process_zones(smp_processor_id()): set up per_cpu_pageset for CPU 0
i For each zone in the system, set zone->pageset[cpu] = kmalloc_node(...); kmalloc_node(size_t size, unsigned int __nocast flags, int node): like kmalloc, it allocates memory of the specified size from malloc_sizes[*], except that it allocates preferentially from the memory of the specified node (NODE) and falls back to other nodes only when that node has no memory
65
ii setup_pageset: set zone->pageset[cpu]->pcp[*]
II register_cpu_notifier(&pageset_notifier): register pageset_notifier on the CPU startup notifier chain
numa_policy_init:
I Initialize a dedicated cache: policy_cache = kmem_cache_create("numa_policy", sizeof(struct mempolicy), ...)
II Initialize a dedicated cache: sn_cache = kmem_cache_create("shared_policy_node", sizeof(struct sp_node), ...)
66
III sys_set_mempolicy(MPOL_INTERLEAVE, ...): set the interleave (MPOL_INTERLEAVE) memory policy, so that the memory allocated during startup does not all end up on NODE 0;
calibrate_delay: calibrate the delay loop, compute the nominal performance value BogoMIPS, and set the global variable loops_per_jiffy
I If the command-line parameter "lpj=xxx" is used, set loops_per_jiffy = preset_lpj directly and display: printk("Calibrating delay loop (skipped)... %lu.%02lu BogoMIPS preset\n", loops_per_jiffy / (500000 / HZ), (loops_per_jiffy / (5000 / HZ)) % 100);
II loops_per_jiffy = calibrate_delay_direct(): if the return value is not 0, display: printk("Calibrating delay using timer specific routine.. "); printk("%lu.%02lu BogoMIPS (lpj=%lu)\n", loops_per_jiffy / (500000 / HZ), (loops_per_jiffy / (5000 / HZ)) % 100, loops_per_jiffy);
III Otherwise run the calibration code to compute loops_per_jiffy and display: printk(KERN_DEBUG "Calibrating delay loop... "); printk("%lu.%02lu BogoMIPS (lpj=%lu)\n", as above);
Note: the command-line parameter "lpj=xxx" works by setting the global variable preset_lpj
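The two expressions passed to printk split the BogoMIPS value into an integer part and a two-digit fraction. A short sketch, assuming HZ = 1000 and using hypothetical function names:

```c
#define HZ_SKETCH 1000   /* assumed timer tick rate */

/* Integer part of the BogoMIPS value. */
unsigned long bogomips_int_sketch(unsigned long loops_per_jiffy)
{
    return loops_per_jiffy / (500000 / HZ_SKETCH);
}

/* Two-digit fractional part of the BogoMIPS value. */
unsigned long bogomips_frac_sketch(unsigned long loops_per_jiffy)
{
    return (loops_per_jiffy / (5000 / HZ_SKETCH)) % 100;
}
```

For loops_per_jiffy = 1998848 this gives 3997 and 69, which would be printed as 3997.69 BogoMIPS.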
67
pidmap_init: allocate one memory page for pidmap_array[0].page of the global variable pidmap_array[0], and call attach_pid(current, ...) to mark PID 0 as in use
68
69
70
prio_tree_init: initialize the radix priority search tree by initializing the static variable index_bits_to_maxindex
anon_vma_init: initialize a dedicated cache: anon_vma_cachep = kmem_cache_create("anon_vma", sizeof(struct anon_vma), ...)
71
If efi_enabled is set, call efi_enter_virtual_mode; this requires the EFI compile option
72
73
74
VI
mm_cachep = kmem_cache_create("mm_struct", sizeof(struct mm_struct), ...);
buffer_init:
I Initialize a dedicated cache: bh_cachep = kmem_cache_create("buffer_head", sizeof(struct buffer_head), ...);
II Set the static variable max_buffer_heads
75
76
77
III hotcpu_notifier(buffer_cpu_notify, 0): register buffer_cpu_notify on the CPU startup notifier chain
unnamed_dev_init: only calls idr_init(&unnamed_dev_idr):
I init_id_cache(): initialize a dedicated cache: idr_layer_cache = kmem_cache_create("idr_layer_cache", sizeof(struct idr_layer), ...);
II Zero the unnamed_dev_idr structure and initialize the spinlock unnamed_dev_idr.lock;
key_init: initialize key management
I Initialize a dedicated cache: key_jar = kmem_cache_create("key_jar", sizeof(struct key)
II Add key_type_keyring.link, key_type_dead.link and key_type_user.link to the tail of the list key_types_list
III Initialize the related global variables
security_init: security initialization
I Display: printk(KERN_INFO "Security Framework v" SECURITY_FRAMEWORK_VERSION " initialized\n");
II verify(&dummy_security_ops): calls security_fixup_ops(&dummy_security_ops), which applies the macro set_to_dummy_if_null to each member of dummy_security_ops so that every unset member points to the corresponding empty function dummy_xx (in security/dummy.c)
III Initialize the pointer security_ops = &dummy_security_ops
III
IV
78
root_plug.c / rootplug_init ()
vfs_caches_init(num_physpages): VFS layer initialization
I Compute the reserved memory reserve = min((num_physpages - nr_free_pages()) * 3/2, mempages - 1), that is, reserve 1.5 times the number of memory pages already in use; the initialization below then uses the remaining memory mempages = num_physpages - reserve;
II
III
IV
dentry), ...)
ii
iii
V
dcache_init_early() has already initialized dentry_hashtable, so it is not initialized here. Why is it initialized so early?
inode_init(mempages):
i Initialize a dedicated cache: inode_cachep = kmem_cache_create("inode_cache", sizeof(struct inode), ...);
ii
iii
VI
VII
files_init(mempages): set the initial value of the max_files member of the global variable files_stat, computed from the remaining memory size
mnt_init(mempages):
i Initialize a dedicated cache: mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct vfsmount), ...);
ii Allocate one page for mount_hashtable and initialize the hash table heads
iii sysfs_init(): sysfs file system initialization; requires the SYSFS compile option
A Initialize a dedicated cache: sysfs_dir_cachep = kmem_cache_create("sysfs_dir_cache",
do_kern_mount(type->name, 0, type->name, NULL): see the later part on file system mounting
init_rootfs(): directly calls register_filesystem(&rootfs_fs_type) to register the file system rootfs_fs_type
init_mount_tree(): initialize the file system mount tree
A mnt = do_kern_mount("rootfs", 0, "rootfs", NULL): mount the internal file system rootfs
B Allocate and initialize a struct namespace structure, namespace
C list_add(&mnt->mnt_list, &namespace->list): add the rootfs mount point to the list member of namespace
VIII
79
80
81
D namespace->root = mnt, mnt->mnt_namespace = namespace: the rootfs mount point becomes the root of the namespace
E Set init_task.namespace = namespace
F For every thread p in the system, set p->namespace = namespace; p is of type task_struct
G set_fs_pwd: set the mount point and the directory of the current working directory of the current process to namespace->root and namespace->root->mnt_root respectively
H set_fs_root: set the mount point and the directory of the root directory of the current process to namespace->root and namespace->root->mnt_root respectively
bdev_cache_init(): block device initialization
i Initialize a dedicated cache: bdev_cachep = kmem_cache_create("bdev_cache", sizeof(struct bdev_inode), ...);
ii register_filesystem(&bd_type): register the block device file system bd_type
iii bd_mnt = kern_mount(&bd_type): mount the block device file system internally
iv blockdev_superblock = bd_mnt->mnt_sb: set the block device superblock
IX chrdev_init(): directly calls cdev_map = kobj_map_init(base_probe, &chrdevs_lock)
radix_tree_init: radix tree initialization
I Initialize a dedicated cache: radix_tree_node_cachep = kmem_cache_create("radix_tree_node", sizeof(struct radix_tree_node), ...);
II radix_tree_init_maxindex(): initialize each element of the static array height_to_maxindex[*]
III hotcpu_notifier(radix_tree_callback, 0): register on the CPU startup notifier chain
signals_init(): signal initialization; it only initializes the dedicated cache sigqueue_cachep = kmem_cache_create("sigqueue", sizeof(struct sigqueue), ...);
page_writeback_init(): initialize the page write-back mechanism
Set the related global variables dirty_background_ratio and vm_dirty_ratio according to the memory size
I
II
III
IV
82
83
III
IV
V
VI
cpuset_init(): requires the CPUSETS compile option; initializes CPU sets
I
II
III
IV
V
84
early_identify_cpu(c): preliminary CPU identification; first set phys_proc_id[smp_processor_id()] to the APIC_ID
cpuid_eax(): further identify the CPU
init_amd(c): if this is an AMD processor
A get_model_name(c): set the CPU model name, recorded in boot_cpu_data.x86_model_id
B display_cacheinfo(c): display the CPU cache information
C Store the number of cores of the current CPU in c->x86_num_cores
D amd_detect_cmp(c): detect the multi-core CPU configuration
a cpu_core_id[cpu] = phys_proc_id[cpu] & ((1 << bits) - 1): set the core ID of the current CPU within its package
b phys_proc_id[cpu] >>= bits: set the package ID of the current CPU (the APIC_ID with the core number removed)
c If acpi_numa <= 0, set cpu_to_node[cpu] = phys_proc_id[cpu]
d Display the message: printk(KERN_INFO "CPU %d(%d) -> Node %d -> Core %d\n", ...);
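Steps a and b of amd_detect_cmp() can be tried out in user space. The following is a minimal sketch, not the kernel code: the struct cpu_topology and both function names are hypothetical, and the rule for deriving bits (the smallest bit count that can number all cores) is an assumption made for illustration.

```c
/* Result of splitting an APIC ID. The kernel stores these two values in
 * cpu_core_id[cpu] and phys_proc_id[cpu]; the struct is illustrative only. */
struct cpu_topology {
    unsigned core_id;     /* core number inside the package */
    unsigned package_id;  /* package number */
};

/* Smallest bit count that can number num_cores cores (assumed rule). */
unsigned core_id_bits(unsigned num_cores)
{
    unsigned bits = 0;

    while ((1u << bits) < num_cores)
        bits++;
    return bits;
}

/* Split an APIC ID into core ID (low bits) and package ID (high bits). */
struct cpu_topology split_apic_id(unsigned apic_id, unsigned num_cores)
{
    unsigned bits = core_id_bits(num_cores);
    struct cpu_topology t;

    t.core_id = apic_id & ((1u << bits) - 1);  /* step a */
    t.package_id = apic_id >> bits;            /* step b */
    return t;
}
```

With two cores per package, APIC ID 5 splits into core 1 of package 2.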
phys_proc_id [cpu]);
D global variable cpu_core_id [smp_processor_id ()]
mcheck_init(c): requires the compile option X86_MCE
A mce_init: MCE (Machine Check Exception) initialization function; calls functions such as do_machine_check and accesses the relevant internal CPU registers
set_cpus_allowed(current, CPU_MASK_ALL): allow the init process to run on all CPUs
I rq = task_rq_lock(p = current, &flags): get the run queue rq of the current process (init)
II cpus_intersects(new_mask = CPU_MASK_ALL, cpu_online_map): test whether the new CPU mask and the online CPU mask (cpu_online_map) intersect; an empty intersection is an error
III p->cpus_allowed = new_mask: set the new CPU mask of the process
IV cpu_isset(task_cpu(p), new_mask): test whether the CPU that init is currently running on is in the new CPU mask; if so, return success
V migrate_task(p, any_online_cpu(new_mask), req): the CPU that init is currently running on is not in the new mask, so migrate init to any online CPU in the new mask
i If init is not on a run queue (p->array == NULL && !task_running(rq, p))
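The decision sequence of set_cpus_allowed() can be sketched with plain bit masks. This is an illustration under assumptions, not the kernel implementation: cpu_bits, struct task_sketch and both functions are hypothetical names, run-queue locking is left out, and the migration is reduced to updating the cpu field (in the kernel the move goes through migrate_task()).

```c
#include <stdint.h>

typedef uint64_t cpu_bits;            /* one bit per CPU, stand-in for cpumask_t */

struct task_sketch {
    cpu_bits cpus_allowed;            /* CPUs the task may run on */
    int cpu;                          /* CPU the task currently runs on */
};

/* Lowest-numbered CPU set in mask, or -1 if the mask is empty. */
int first_cpu_in(cpu_bits mask)
{
    int cpu;

    for (cpu = 0; cpu < 64; cpu++)
        if (mask & ((cpu_bits)1 << cpu))
            return cpu;
    return -1;
}

/* Returns 0 on success, -1 if new_mask contains no online CPU. */
int set_cpus_allowed_sketch(struct task_sketch *p, cpu_bits new_mask,
                            cpu_bits online)
{
    if ((new_mask & online) == 0)                  /* cpus_intersects() fails */
        return -1;
    p->cpus_allowed = new_mask;
    if (new_mask & ((cpu_bits)1 << p->cpu))        /* cpu_isset(): may stay */
        return 0;
    p->cpu = first_cpu_in(new_mask & online);      /* simplified migration */
    return 0;
}
```

A task on CPU 3 that is restricted to the mask 0x3 ends up on CPU 0, while a mask without any online CPU leaves the task unchanged.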
nmi_watchdog_default(): set the global variable nmi_watchdog. If the current value is not the default NMI_DEFAULT, return directly (the command line parameter nmi_watchdog=xxx sets the value of nmi_watchdog); otherwise, if the CPU is INTEL or AMD and the family (boot_cpu_data.x86) is 15, set nmi_watchdog = NMI_LOCAL_APIC, else set nmi_watchdog = NMI_IO_APIC
current_cpu_data = boot_cpu_data: set the feature parameters of the current CPU; with the compile option SMP enabled, the macro current_cpu_data is defined as cpu_data[smp_processor_id()]
current_thread_info()->cpu = 0
enforce_max_cpus(max_cpus): clear the CPUs numbered above max_cpus from the global variables cpu_possible_map and cpu_present_map, that is, mark CPUs numbered above max_cpus as unavailable
prefill_possible_map(): requires the compile option HOTPLUG_CPU; add all NR_CPUS CPUs supported by the system to cpu_possible_map
VI smp_sanity_check(max_cpus): verify that the SMP configuration is feasible; if the function fails, SMP is turned off
i Test whether the current BP is marked in the global variable phys_cpu_present_map; if not, mark it
ii If smp_found_config is 0, the SMP configuration failed; return directly
iii Test whether boot_cpu_id is marked in the global variable phys_cpu_present_map; if not, mark it
iv Check whether an APIC is present
VII connect_bsp_APIC(): if the system is already in APIC mode, do nothing; otherwise switch from PIC mode to APIC mode: call clear_local_APIC() to reset the local APIC and write data through ports 22h and 23h
If the ACPI parsing function acpi_process_madt() has set up the IOAPIC (acpi_ioapic = 1), set io_apic_irqs = ~0, that is, all IRQs go through the IOAPIC; otherwise set io_apic_irqs = ~PIC_IRQS
If the ACPI parsing function acpi_process_madt() has not set up the IOAPIC (acpi_ioapic = 0), call setup_ioapic_ids_from_mpc to parse the IOAPICs from the MPC table, set the related registers of every IOAPIC, and display the message: printk(KERN_INFO "Using IO-APIC %d\n", mp_ioapics[apic].mpc_apicid)
iv sync_Arb_IDs()
v setup_IO_APIC_irqs(): set the interrupt vector of each IOAPIC pin
Note 1: with the command line parameter "apic=debug" or "apic=verbose" the kernel prints APIC-related information during boot; "debug" prints more
Note 2: for the interrupt pin count, see the definition of union IO_APIC_reg_01 {}; for the other IOAPIC registers, see the definitions of the IO_APIC_reg_xx {} series of structures
Note 3: for the IOAPIC interrupt pin registers, see also the function setup_IO_APIC_irqs()
vi init_IO_APIC_traps(): initialize the IOAPIC interrupt vector entries; for interrupt vectors below 16 call make_8259A_irq, otherwise set irq_desc[*].handler = no_irq_type
setup_APIC_timer (calibration_result): Start the local APIC timer
do_pre_smp_initcalls():
I migration_init():
iii Register the notifier block migration_notifier on the CPU startup/shutdown notification chain
II spawn_ksoftirqd():
i cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu)
A p = kthread_create(ksoftirqd, hcpu, "ksoftirqd/%d", hotcpu): create a kernel thread ksoftirqd/N for the current CPU, where N is the CPU number and the entry function is ksoftirqd
B kthread_bind(p, hotcpu): bind the thread ksoftirqd/N so that it can only run on the current CPU
C per_cpu(ksoftirqd, hotcpu) = p: record the thread p in the per-CPU management structure
ii cpu_callback(&cpu_nfb, CPU_ONLINE, cpu): only calls wake_up_process(per_cpu(ksoftirqd, hotcpu)) to wake the soft-interrupt thread of the current CPU
iii register_cpu_notifier(&cpu_nfb): register the notifier block cpu_nfb on the CPU startup/shutdown notification chain
fixup_cpu_present_map(): if the global variable cpu_present_map is empty, mark every CPU recorded in cpu_possible_map in cpu_present_map
notifier_call_chain(&cpu_chain, CPU_UP_PREPARE, hcpu): run every notifier block on the CPU startup chain cpu_chain
do_boot_cpu(i, apicid): start one CPU with logical number i and physical ID apicid
a Define an idle thread management structure c_idle
b Define a work queue structure work whose function is do_fork_idle
c c_idle.idle = get_idle_for_cpu(i): get the idle process control structure of the current CPU
If the idle process has already been created (c_idle.idle is not 0), set the stack c_idle.idle->thread.rsp and call the idle process initialization function init_idle: set a very low priority and restrict the CPU mask so that it can only run on the current CPU
c_idle.idle): set idle_thread_array[i] = c_idle.idle
do_fork_idle(): calls the function fork_idle():
(I) task = copy_process(CLONE_VM, 0, ...): copy the current process init as a new idle process
(II) init_idle(task, cpu): set the parameters of the newly created idle process
(III) unhash_process(task): remove the newly created idle process from the process pid hash table
Set the current process of the CPU to the idle process: cpu_pda[i].pcurrent = c_idle.idle
g start_rip = setup_trampoline(): get the physical address of the trampoline code, SMP_TRAMPOLINE_BASE (6000h), and copy the trampoline code [trampoline_data, trampoline_end] to the trampoline area at 6000h
h init_rsp = c_idle.idle->thread.rsp: change the stack defined in the file head.S to the idle process stack
0xA: sets the system state to "after reset, start executing from the address at 40:67h", that is, execute the code at start_rip
wakeup_secondary_via_INIT(apicid, start_rip): send a start command to the target CPU through an inter-processor interrupt (IPI); the target CPU starts at the code address start_rip, i.e. the trampoline code
o The BP loops checking whether the current AP has started successfully; returns 0 on success, otherwise a failure code
The AP startup process is as follows, beginning with the real-mode assembly code trampoline_data:
The current CS:IP value is 0600:0000
movl $0xA5A5A5A5, trampoline_data - r_base: write the marker value A5A5_A5A5h at the start of the trampoline code to signal that it is running
Set up idt/gdt
%ax = 1, lmsw %ax: enter protected mode
ljmpl $__KERNEL32_CS, $(startup_32 - __START_KERNEL_map): jump to startup_32 in the file head.S
Set CR3, CR4, etc. and enter long mode
Start the 64-bit code at startup_64; set CR3 to use the page table init_level4_pgt
Set the stack to init_rsp, that is, the idle process stack set in step h above
Jump to the C code at initial_code, that is, the function start_secondary() set in step j above
cpu_init(): CPU initialization; calls pda_init(cpu) to initialize the pda structure of the AP, setting pda->cpunumber = cpu so that smp_processor_id() can later return the CPU number; displays the message printk("Initializing CPU#%d\n", cpu)
smp_callin(): report to the BP that the AP has started running
(I) cpuid = smp_processor_id(): get the logical number of the current CPU
(II) setup_local_APIC(): set up the local APIC of the current AP
(III) calibrate_delay(): calibrate the bogomips of the current AP
(IV) disable_APIC_timer(): disable the APIC timer of the current AP
(V) smp_store_cpu_info(cpuid): copy the boot_cpu_data structure into cpu_data[cpuid] for the current AP
identify_cpu(cpu_data + cpuid):
(VI) cpu_set(cpuid, cpu_callin_map): mark the current CPU ID in the global variable cpu_callin_map
setup_secondary_APIC_clock(): calls setup_APIC_timer(calibration_result) to set up the local APIC timer of the current AP
If nmi_watchdog is NMI_IO_APIC, use LVT0 as NMI
enable_APIC_timer(): enable the local APIC timer
set_cpu_sibling_map(smp_processor_id()):
tsc_sync_wait(): calls sync_tsc(0) to synchronize the TSC
cpu_set(smp_processor_id(), cpu_online_map): mark the current CPU in the global variable cpu_online_map
cpu_idle(): the current AP enters the idle state
(I) while (1) {}: the function is an infinite loop; everything below is inside the loop body
(II) while (!need_resched()) {}: while no rescheduling is needed, perform the idle operation below; otherwise call schedule() to run other processes. Inside the inner loop:
Execute the function safe_halt(), i.e. the assembly instructions ("sti;
while (!cpu_isset(i, cpu_online_map)): wait for the AP startup to complete, that is, wait until the starting AP sets the global variable cpu_online_map; see step 18 of the AP startup.
iii notifier_call_chain(&cpu_chain, CPU_ONLINE, hcpu): mark the current CPU as started and running normally
Display the message: printk(KERN_INFO "Brought up %ld CPUs\n", (long) num_online_cpus())
smp_cpus_done(max_cpus): finish starting the SMP APs
i zap_low_mappings(): executed when the compile option HOTPLUG_CPU is not enabled; clears the entry for virtual address 0 in the page table init_level4_pgt, so that user space can no longer be accessed through init_mm, and calls flush_tlb_all() to flush all TLBs
ii smp_cleanup_boot(): clear CMOS register 0xF and clear physical address 467h; when the compile option HOTPLUG_CPU is not enabled, also release the first page (1000h) and the SMP trampoline page (6000h)
iii setup_ioapic_dest(): calls set_ioapic_affinity_irq() to set the CPU affinity of each IOAPIC interrupt pin
time_init_gtod(): set up the time structures and the function pointer do_gettimeoffset; displays the message: printk(KERN_INFO "time.c: Using %s based timekeeping.\n", timetype)
check_nmi_watchdog(): verify that the NMI watchdog works
Display the message: printk(KERN_INFO "testing NMI watchdog ...")
sd->groups = &sched_group_phys[group]: set the scheduling group of the scheduling domain
m p = sd
n Get the address sd of the cpu_domains scheduling domain for the i-th CPU
o group = cpu_to_cpu_groups(i): get the group within the CPU core that corresponds to the i-th CPU
p sd->span = cpu_sibling_map[i]: this domain contains only the CPU units within this CPU core
q sd->parent = p: the parent of this scheduling domain is the CPU package scheduling domain set up above
r sd->groups = &sched_group_cpus[group]: set the scheduling group of the scheduling domain
s These scheduling domains require the compile option SCHED_SMT
For each of the CPU units above, call init_sched_build_groups(sched_group_cpus, this_sibling_map): link the sched_group_cpus of each CPU in this_sibling_map into a circular linked list
For all NODEs in the system, link the sched_group_phys of the CPUs belonging to the same NODE into a circular linked list
Link the sched_group_nodes of all online CPUs into a circular linked list
Set the computing capacity for each CPU unit in the system, including cpu_domain, phys_domains and
II hotcpu_notifier(update_sched_domains, 0): set up the notifier block that updates the scheduling domains on CPU hotplug
sysctl_init(): takes effect only with the compile options SYSCTL and PROC_FS enabled
i register_proc_table(root_table, proc_sys_root): register a proc entry for each entry of the table root_table
ii init_irq_proc(): register the directory /proc/irq and, for each active interrupt vector (irq_desc[irq].handler != &no_irq_type), register the file /proc/irq/xx/smp_affinity; the affinity of the interrupt vector is set by writing to that file
do_initcalls(): run the initialization functions between [__initcall_start, __initcall_end], that is, the functions declared with the macros core_initcall(fn), postcore_initcall(fn), arch_initcall(fn), subsys_initcall(fn), fs_initcall(fn),
mount_devfs(): directly calls sys_mount("devfs", "/dev", "devfs", 0, NULL) to mount devfs
md_run_setup(): set up RAID
i create_dev("/dev/md0", MKDEV(MD_MAJOR, 0), "md/0"): create the device /dev/md0
ii md_setup_drive():
initrd_load(): read and use the initrd
mount_root()
Release the memory used during initialization: free_initmem()
2 file system
2.1 mount operation
2.1.1
1
file_system_type **p = find_filesystem(fs->name): search file_systems, the head of a singly linked list of registered file systems, for one named "ext2"; if none is found, return the address of the next field of the last node
iii *p = fs: add the ext2_fs_type structure to the list of file system types
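Returning the address of the last next field is what lets the single assignment *p = fs append to the list. A minimal user-space sketch of this pointer-to-pointer idiom follows; the struct carries only the two fields needed here, register_filesystem_sketch is a hypothetical name, and the -1 error value stands in for the kernel's error code.

```c
#include <stddef.h>
#include <string.h>

struct file_system_type {
    const char *name;
    struct file_system_type *next;
};

struct file_system_type *file_systems = NULL;   /* head of the list */

/* Return the address of the link that points at the entry called name,
 * or the address of the terminating NULL link if there is no such entry. */
struct file_system_type **find_filesystem(const char *name)
{
    struct file_system_type **p;

    for (p = &file_systems; *p != NULL; p = &(*p)->next)
        if (strcmp((*p)->name, name) == 0)
            break;
    return p;
}

/* Append fs unless a file system of the same name is already registered. */
int register_filesystem_sketch(struct file_system_type *fs)
{
    struct file_system_type **p = find_filesystem(fs->name);

    if (*p != NULL)
        return -1;          /* already registered */
    fs->next = NULL;
    *p = fs;                /* works for an empty and a non-empty list alike */
    return 0;
}
```

Because p always points at a link rather than at a node, the empty list needs no special case.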
NOTE 1: if the compile option MODULE is enabled, that is, the definition appears in a module, then module_init(initfn) is defined as the function
static inline initcall_t __inittest(void) { return initfn; } // defines a function that returns the address of initfn
Command "mount -t ext2 device dir": mount a device with an ext2 file system; executes the sys_mount system call
sys_mount():
I copy_mount_options(): copy the parameters from user space to kernel space; each parameter occupies one page
II getname(dir_name): copy the mount point pathname
do_getname(dir_name, tmp): check the address bounds to keep illegal content out of kernel space, then call strncpy_from_user() to perform the copy
lock_kernel(): take the big kernel lock, disabling kernel preemption
do_mount(dev_name, dir_name, type_page, flags, data_page): performs the actual mount
Validate the input parameters
Analyze the flags parameter
path_lookup(dir_name, LOOKUP_FOLLOW, &nd): look up the nameidata structure of the target mount point
do_remount(): remount; the command contains the MS_REMOUNT flag
do_loopback(): loopback (bind) mount; the command contains the MS_BIND flag
do_move_mount(): move a mount; the command contains the MS_MOVE flag
do_new_mount(&nd, type_page, flags, mnt_flags, dev_name, data_page): a new mount
path_release(&nd): release nd
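The choice among the four mount paths depends only on the flags. The sketch below shows that dispatch in isolation; the enum and classify_mount are hypothetical names, the test order follows the list above, and the three flag values are the ones defined in the Linux headers.

```c
/* Flag values as defined in the Linux headers. */
#define MS_REMOUNT 32
#define MS_BIND    4096
#define MS_MOVE    8192

enum mount_action {
    ACT_REMOUNT,    /* do_remount() */
    ACT_LOOPBACK,   /* do_loopback() */
    ACT_MOVE,       /* do_move_mount() */
    ACT_NEW         /* do_new_mount() */
};

/* Pick the mount path for a flags word; the first matching flag wins. */
enum mount_action classify_mount(unsigned long flags)
{
    if (flags & MS_REMOUNT)
        return ACT_REMOUNT;
    else if (flags & MS_BIND)
        return ACT_LOOPBACK;
    else if (flags & MS_MOVE)
        return ACT_MOVE;
    else
        return ACT_NEW;
}
```

A plain "mount -t ext2 device dir" carries none of the three flags and therefore takes the do_new_mount() path.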
NOTE: description of struct nameidata:
struct dentry *dentry: points to the dentry structure of the target point
struct vfsmount *mnt: points to the mounted device where the target point is located
struct qstr last: used for indexing
do_new_mount(): mount a new device
Verify the correctness of the input parameters
follow_down: advance to the root of a device already mounted on this point, and keep testing in a while loop for further mounts until the end, that is, until no further device is mounted on the root of the mounted device.
if (nd->mnt->mnt_sb == newmnt->mnt_sb && nd->mnt->mnt_root == nd->dentry) goto fail: the same file system device cannot be mounted repeatedly on the same directory
mnt->mnt_namespace = current->namespace
graft_tree(mnt, nd)
A if (S_ISDIR(nd->dentry->d_inode->i_mode) != S_ISDIR(mnt->mnt_root->d_inode->i_mode)) return: the mount point and the mounted root must agree on being a directory
B Follow-up operations can proceed only if nd->dentry is the root directory or its d_flags does not carry DCACHE_UNHASHED; a file system cannot be mounted on a directory marked DCACHE_UNHASHED
list_add_tail(&mnt->mnt_child, &nd->mnt->mnt_mounts): add the new mount to the child mount list of the mount that contains the mount point directory
e nd->dentry->d_mounted++: increase the mounted count of the mount point directory
list_add_tail(&head, &mnt->mnt_list): add a temporary head node to mnt->mnt_list
list_splice(&head, current->namespace->list.prev): splice the list containing head (i.e. mnt->mnt_list) into current->namespace->list.prev
mntput(mnt): release the reference count. At this point the mount process is complete
mnt = alloc_vfsmnt(dev_name): allocate a vfsmount structure from the dedicated cache mnt_cache and initialize it; allocate memory to store the device name dev_name and point mnt->mnt_devname at it
sb = type->get_sb(): call the superblock read function of the file system; for the ext2 file system, get_sb in ext2_fs_type is the function ext2_get_sb(), which directly calls get_sb_bdev(fs_type, flags, dev_name, data, ext2_fill_super)
Set up the vfsmount structure mnt:
iv mnt->mnt_parent = mnt: indicates that this is a root mount
put_filesystem(type): release the file system
return mnt
inode = iget5_locked(bd_mnt->mnt_sb, hash(dev), bdev_test, bdev_set, &i_rdev): obtain the inode of the device from the bdev (bd_type) file system
(I) head = inode_hashtable + hash(sb, hashval): look for the block device inode in the inode hash table; set the hash list head, i.e. that of bdev_inode.vfs_inode
(II) inode = ifind(sb, head, bdev_test, ...): calls find_inode() to find the target inode; if the lookup fails, call get_new_inode() to create the inode
(III) get_new_inode(): create the inode
(I) inode = alloc_inode(sb): calls sb->s_op->alloc_inode = bdev_alloc_inode() to allocate a bdev_inode structure ei from the dedicated cache bdev_cachep and returns the member ei->vfs_inode as the VFS inode; the inode also gets its basic initialization, such as inode->i_mapping = &inode->i_data
list_add(&inode->i_list, &inode_in_use): add the inode to the inode_in_use list
list_add(&inode->i_sb_list, &sb->s_inodes): add the inode to the bd_mnt->mnt_sb->s_inodes list
bdev = &BDEV_I(inode)->bdev: get the address of the block_device structure that shares a bdev_inode with the inode
bdev->bd_inode = inode
inode->i_rdev = dev: set the device number of the target device in the inode that represents the block device
inode->i_bdev = bdev: block device
inode->i_data.a_ops = &def_blk_aops
inode->i_data.backing_dev_info = &default_backing_dev_info
list_add(&bdev->bd_list, &all_bdevs): record the block device
Note 1: here the inode represents a block device. The structure bdev_inode has one inode member, and that member and the other member, the block_device structure, point to each other (bdev->bd_inode, inode->i_bdev). The inode is also added to the system inode hash table.
Note 2: block device management. Block devices are managed through a virtual file system, bdev (bd_type). The file system is registered by start_kernel() -> vfs_caches_init() -> bdev_cache_init() and mounted internally (kern_mount() -> do_kern_mount()). Each block device corresponds to one bdev_inode structure, which is allocated and released through the virtual file system bdev. Its member vfs_inode joins the inode hash table as the device inode and is accessed in the conventional way, while the function struct bdev_inode *BDEV_I(inode) and a companion function returning struct block_device * give the address of the bdev_inode structure and the address of its other member bdev.
struct bdev_inode {
    struct block_device bdev;
    struct inode vfs_inode;
};
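Because both members live in one bdev_inode, the address of either can be computed from the address of the other. The following sketch shows the usual offsetof-based way to do this; the field contents of block_device and inode are placeholders, the container_of macro is written out locally, and inode_to_bdev is a hypothetical helper name, not a kernel function.

```c
#include <stddef.h>

struct block_device { int bd_dev; };       /* placeholder field */
struct inode { unsigned long i_ino; };     /* placeholder field */

struct bdev_inode {
    struct block_device bdev;
    struct inode vfs_inode;
};

/* Step back from a member pointer to the enclosing structure. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* From the VFS inode to the bdev_inode that contains it. */
struct bdev_inode *BDEV_I(struct inode *inode)
{
    return container_of(inode, struct bdev_inode, vfs_inode);
}

/* From the VFS inode to the block_device stored next to it. */
struct block_device *inode_to_bdev(struct inode *inode)
{
    return &BDEV_I(inode)->bdev;
}
```

No pointer has to be stored for this direction of the lookup; the layout of bdev_inode alone is enough.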
c inode->i_bdev = bdev. Note: here inode is the inode of the device name (e.g. /dev/hda1)
d inode->i_mapping = bdev->bd_inode->i_mapping; until now it only held the initial value from alloc_inode, with no other initialization
list_add(&inode->i_devices, &bdev->bd_inodes): add the inode that represents the device path to the inode list managed by the device
blkdev_get(bdev, mode, 0): uses the local variables fake_file and fake_dentry to perform do_open(bdev, &fake_file)
disk = get_gendisk(bdev->bd_dev, &part)
To be continued
bd_claim(bdev, holder):
ext2_fill_super(struct super_block *sb, void *data, int silent): read the device to fill in the superblock and set the superblock operation method structures:
Allocate an ext2_sb_info structure sbi, clear it, and set sb->s_fs_info = sbi
sb_block = get_sb_block(&data): set the start position of the superblock; if the parameter data contains the string "sb=xx", set sb_block = xx, otherwise default to 1
Set the logical block number logic_sb_block of the superblock location
bh = sb_bread(sb, logic_sb_block): read the superblock; directly calls __bread(sb->s_bdev, block, sb->s_blocksize):
i bh = __getblk(bdev, block, size): to be continued; find the target bh in the buffer cache
A bh = __find_get_block(bdev, block, size)
Placed at the front of the LRU cache queue; the last bh may be released
d touch_buffer(bh): mark the current bh as accessed
ii bh = __bread_slow(bh): if the content of bh is not up to date (possibly a newly allocated bh), read the target bh from the hard disk
submit_bh(READ, bh): submit a read request to the driver: allocate a struct bio, set it up, and call submit_bio(READ, bio) to submit the request to the driver
F wait_on_buffer(bh): wait for the read to complete
Comment: radix tree. The buffer cache is managed with a 64-way radix tree; the data structures are as follows. Maximum height 12. The top level uses bits 63-60 as its index, level 10 uses bits 59-54, and so on, so level 1 uses bits 5-0 as its index. Leaf nodes store the valid data.
struct radix_tree_root {
    uint height;
    int gfp_mask;
    struct radix_tree_node *rnode;
};
struct radix_tree_node {
    uint count;
    void *slot[MAP_SIZE];            // 64
    ulong tags[TAGS][TAG_LONGS];     // 2, 1
};
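The bit ranges in the comment follow from taking 6 bits of the index per level. The sketch below computes the slot chosen at a given level and the largest index a tree of a given height can hold; the function names are hypothetical, levels are counted from the leaves (level 1 uses bits 5-0), and a 64-bit index is assumed.

```c
#include <stdint.h>

#define MAP_SHIFT 6                      /* 6 index bits per level */
#define MAP_SIZE  (1u << MAP_SHIFT)      /* 64 slots per node */
#define MAP_MASK  (MAP_SIZE - 1)

/* Slot selected at the given level; level 1 is the lowest level. */
unsigned radix_slot(uint64_t index, unsigned level)
{
    unsigned shift;

    if (level == 0)
        return 0;
    shift = (level - 1) * MAP_SHIFT;
    if (shift >= 64)                     /* beyond the width of the index */
        return 0;
    return (unsigned)((index >> shift) & MAP_MASK);
}

/* Largest index that a tree of the given height can address. */
uint64_t radix_maxindex(unsigned height)
{
    unsigned bits = height * MAP_SHIFT;

    if (bits >= 64)
        return UINT64_MAX;
    return ((uint64_t)1 << bits) - 1;
}
```

Level 11 only has bits 63-60 left, so at most 16 of its 64 slots can ever be used.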
sbi initialization: call parse_options(data, sbi) to further set up sbi from the parameters in data
Analyze the superblock data that was read and apply the relevant settings; if the actual block size of the device differs from the one used for the first read, call sb_bread again to re-read the superblock
Initialize the other superblock parameters
sb->s_export_op = &ext2_export_ops: only the two operation methods get_parent and get_dentry
sb->s_op = &ext2_sops: defines the inode operation methods and other methods:
alloc_inode = ext2_alloc_inode
destroy_inode = ext2_destroy_inode
read_inode = ext2_read_inode
write_inode = ext2_write_inode
root = iget(sb, ino = EXT2_ROOT_INO): look for the target inode in the global inode hash table; if that fails, allocate a new inode and perform basic initialization
i ifind_fast(sb, head, ino): look for the target in the inode hash table
inode = get_new_inode_fast(sb, head, ino):
a inode = alloc_inode(sb): calls sb->s_op->alloc_inode(sb) to allocate an inode, i.e. the function ext2_alloc_inode(sb): allocate an ext2_inode_info structure ei from the dedicated cache ext2_inode_cachep and return the address of its member vfs_inode (&ei->vfs_inode) as the allocated inode structure; then perform the common initialization of the new inode
The inode is added to the sb->s_inodes list and to the global inode hash table
b Call sb->s_op->read_inode = ext2_read_inode(inode): read the contents of the inode
A ei = EXT2_I(inode): get the address of the ext2_inode_info structure that contains the inode
B raw_inode = ext2_get_inode(inode->i_sb, ino, &bh): read the content of the target inode from disk
a From the superblock records, compute the group that contains the target inode, its offset within the group, and the disk block number
b *bh = sb_bread(sb, block): read the target block; *bh holds the buffer cache address
c Compute the offset of the ext2_inode within the buffer cache block
d return (struct ext2_inode *)(bh->b_data + offset): return the target address
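The arithmetic behind steps a and c can be written down separately from the disk access. This is a sketch under assumptions: the two structs and locate_inode are hypothetical, inode numbers are taken to start at 1, and the block number is relative to the start of the group's inode table (the absolute block would come from the group descriptor, which is not modeled here).

```c
/* The superblock values the computation needs (hypothetical container). */
struct ext2_geometry {
    unsigned long inodes_per_group;
    unsigned long inode_size;        /* bytes per on-disk inode */
    unsigned long block_size;        /* bytes per block */
};

struct inode_location {
    unsigned long group;             /* block group holding the inode */
    unsigned long block_in_table;    /* block index inside the inode table */
    unsigned long offset_in_block;   /* byte offset inside that block */
};

/* Locate inode number ino (ino >= 1) inside its group's inode table. */
struct inode_location locate_inode(const struct ext2_geometry *g,
                                   unsigned long ino)
{
    struct inode_location loc;
    unsigned long index = (ino - 1) % g->inodes_per_group;
    unsigned long byte = index * g->inode_size;

    loc.group = (ino - 1) / g->inodes_per_group;
    loc.block_in_table = byte / g->block_size;
    loc.offset_in_block = byte % g->block_size;
    return loc;
}
```

With 8192 inodes per group, 128-byte inodes and 1024-byte blocks, inode 2 lies in group 0 at byte 128 of the first inode table block.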
inode->i_mode = raw_inode->i_mode; depending on the inode type given by inode->i_mode, proceed as follows:
Regular file (S_ISREG(inode->i_mode)):
a File system without buffer heads (mounted with the "nobh" option, i.e. EXT2_MOUNT_NOBH set)
(1) inode->i_mapping->a_ops = &ext2_nobh_aops;
(2) inode->i_fop = &ext2_file_operations;
b Normal file system
Other links
(1) inode->i_op = &ext2_symlink_inode_operations;
(2) File system without buffer heads (mounted with the "nobh" option), inode->i_mapping->a_ops =
init_special_inode(inode, inode->i_mode, devt): set according to the device type
(1) Character devices:
inode->i_fop = &def_chr_fops
A list_add(&res->d_alias, &root->i_dentry): add the d_alias member of the dentry structure to the i_dentry list of the inode structure
B res->d_inode = inode: point the d_inode pointer of the dentry at the target inode structure
Check the file system; display the messages: printk(KERN_ERR "EXT2-fs: get root inode failed\n"); printk(KERN_ERR "EXT2-fs: corrupt root inode, run e2fsck\n")
2.1.2
2.1.2.1
The file system used to manage block devices is loaded through start_kernel() -> vfs_caches_init() -> bdev_cache_init() and mounted internally through kern_mount() -> do_kern_mount(); the analysis below starts from the function bdev_cache_init()
bdev_cachep = kmem_cache_create("bdev_cache", sizeof(struct bdev_inode), ...): create a dedicated cache for the block device management structure bdev_inode
register_filesystem(&bd_type): register the dedicated file system bdev (bd_type); the structure bd_type is added to the list of registered file systems, file_systems
bd_mnt = kern_mount(type = &bd_type): mount the file system internally; directly calls do_kern_mount(type->name, 0, type->name, NULL), i.e. the file system type is "bdev" and the name of the target device is "bdev"
I type = get_fs_type(fstype)
mnt = alloc_vfsmnt(dev_name): allocate a vfsmount structure from the dedicated cache mnt_cache and initialize it; allocate memory to store the device name dev_name and point mnt->mnt_devname at it
sb = type->get_sb(): call the superblock read function of the file system; for the bdev file system, get_sb in bd_type is the function bd_get_sb(), which directly calls get_sb_pseudo(fs_type, "bdev:", &bdev_sops, ...):
set_anon_super(s, NULL): use the idr data structure to generate a device ID number and store it in s->s_dev
a idr_pre_get(&unnamed_dev_idr, ...): prepare a node in unnamed_dev_idr
b idr_get_new(&unnamed_dev_idr, NULL, &dev): request a node in unnamed_dev_idr; the node ID is returned in dev and used as the minor number
c s->s_dev = MKDEV(0, dev)
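The device number built in step c packs major 0 and the allocated ID into one word. The following sketch illustrates this under assumptions: a fixed-size table replaces the idr allocator and hands out the lowest free ID, the 20-bit minor field reflects the in-kernel dev_t layout of the 2.6 series, and both function names are hypothetical.

```c
#define MINORBITS 20
#define MINORMASK ((1u << MINORBITS) - 1)
#define MKDEV(ma, mi) (((unsigned)(ma) << MINORBITS) | (unsigned)(mi))
#define MAJOR(dev)    ((unsigned)(dev) >> MINORBITS)
#define MINOR(dev)    ((unsigned)(dev) & MINORMASK)

#define MAX_ANON 64                       /* size of the stand-in ID table */
unsigned char anon_in_use[MAX_ANON];

/* Hand out the lowest free ID, or -1 if the table is full. */
int alloc_anon_id(void)
{
    int id;

    for (id = 0; id < MAX_ANON; id++) {
        if (!anon_in_use[id]) {
            anon_in_use[id] = 1;
            return id;
        }
    }
    return -1;
}

/* Store an anonymous device number (major 0) in *s_dev; 0 on success. */
int set_anon_dev_sketch(unsigned *s_dev)
{
    int id = alloc_anon_id();

    if (id < 0)
        return -1;
    *s_dev = MKDEV(0, id);
    return 0;
}
```

Each internally mounted file system thus gets major number 0 and a distinct minor number.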
s->s_type = type: set the file system type of the superblock
strlcpy(s->s_id, type->name, ...): set the file system name of the superblock
list_add_tail(&s->s_list, &super_blocks): add the superblock to the global superblock list
list_add(&s->s_instance, &type->fs_supers): add the superblock to the superblock list of its file system
s->s_root = dentry: set the root dentry structure of the superblock
mnt->mnt_sb = sb; mnt->mnt_root = sb->s_root; mnt->mnt_mountpoint = sb->s_root
mnt->mnt_parent = mnt: indicates that this is a root node
return mnt: return the mount point structure
blockdev_superblock = bd_mnt->mnt_sb: save the superblock of this file system
Note: in the end the global variable bd_mnt holds the mount point information and the global variable blockdev_superblock holds the superblock information
2.1.2.2
The file system used to manage network connections is loaded through start_kernel() -> rest_init() -> init() -> do_basic_setup() -> sock_init(), which calls register_filesystem(&sock_fs_type) to register the sockfs file system and then executes sock_mnt = kern_mount(&sock_fs_type) to mount it internally. The registration process was described in the previous section; the following mainly analyzes how kern_mount(&sock_fs_type) is carried out:
sock_mnt = kern_mount(type = &sock_fs_type): mount the file system internally; directly calls do_kern_mount(type->name, 0, type->name, NULL), i.e. the file system type is "sockfs" and the name of the target device is "sockfs"
sb = type->get_sb(): call the superblock read function of the file system; for the sockfs file system, get_sb in sock_fs_type is the function sockfs_get_sb(), which directly calls get_sb_pseudo(fs_type,
Finally, the global variable sock_mnt holds the vfsmount structure with the mount point information.
2.1.2.3
The loading process is start_kernel() -> proc_root_init(): first call register_filesystem(&proc_fs_type) to register the procfs file system, then execute proc_mnt = kern_mount(&proc_fs_type) to mount it internally. The registration process was described in the previous section.
The following mainly analyzes how kern_mount(&proc_fs_type) differs from mounting the bd_type file system:
sb = type->get_sb(): call the superblock read function of the file system; for the proc file system, get_sb in proc_fs_type is the function proc_get_sb(), which directly calls get_sb_single(fs_type, flags, data, proc_fill_super) and performs the following:
I s = sget(fs_type, compare_single, set_anon_super, ...): this step is basically the same as when mounting the bdev file system
II proc_fill_super(s, data, ...): fill in the superblock structure s and set up the root inode
i s->s_op = &proc_sops: set the superblock operation interface
ii root_inode = proc_get_inode(s, ino = PROC_ROOT_INO, de = &proc_root)
sb->s_op->read_inode(inode): calls proc_read_inode() to fill in the inode content; this function actually only sets the inode time fields to the current time.
unlock_new_inode(inode): unlock the inode so that it can be used normally
PROC_I(inode)->pde = de: set the member pde of the proc_inode structure that contains the inode to &proc_root
Set the inode parameters:
c inode->i_op = de->proc_iops = proc_root_inode_operations
d inode->i_fop = de->proc_fops = proc_root_operations
iii s->s_root = d_alloc_root(root_inode)
A res = d_alloc(NULL, &name): allocate a dentry structure, set the path name and initialize the related lists
2.2 open
Kernel entry function sys_open(const char __user *filename, int flags, int mode); the analysis starts here
tmp = __getname(): a macro defined as kmem_cache_alloc(names_cachep, ...); allocates memory from the dedicated cache
do_getname(filename, page = tmp):
i If get_fs() = current_thread_info()->addr_limit is not equal to KERNEL_DS, the access comes from a user-space process and the address range must be checked as follows:
If filename > TASK_SIZE (= TASK_SIZE64 = 8000_0000_0000h - 1000h) indicates the file
ii __do_strncpy_from_user(page, filename, len, res): assembly instructions that copy len bytes from filename in the user address space to page in the kernel address space; the error code is stored in res
fd = get_unused_fd():
I files = current->files: the files_struct structure of the current process
II fd = find_next_zero_bit(files->open_fds->fds_bits, ...): find a free bit in the bitmap of open files
FD_SET(fd, files->open_fds): set bit fd in the open_fds bitmap, indicating that the corresponding file number is no longer free
FD_CLR(fd, files->close_on_exec): clear bit fd in the bitmap that close_on_exec points to, meaning that the file need not be closed when the current process runs an executable through the exec() system call; the contents of the bitmap can be
dentry = cached_lookup(base, name, nd): look up the target file/directory name in the directory base
a dentry = __d_lookup(base, name):
(1) head = d_hash(base, name->hash): locate the hash chain in the global dentry hash table dentry_hashtable
dentry = __d_lookup(base, name): look up the target dentry structure again
read_seqretry(&rename_lock, seq): if the return value is not 0, repeat from step (1)
new = d_alloc(base, name): if the cached_lookup lookup fails, allocate and initialize a dentry structure
dentry = inode->i_op->lookup(dir = inode, new, nd): look up the target in the current inode and fill in the dentry structure new; for the ext2 filesystem this is the function ext2_lookup()
a If dentry->d_name.len > EXT2_NAME_LEN (255), return -ENAMETOOLONG: the name is too long; the ext2 filesystem supports a maximum name length of 255
b ino = ext2_inode_by_name(dir, new): find the inode number of the target named by the dentry new
(1) de = ext2_find_entry(dir, new, &page)
(I) ei = EXT2_I(dir): get the ext2_inode_info structure that contains the inode dir
(II) ext2_get_page(dir, n)
(I) page = read_cache_page(mapping, n, mapping->a_ops->readpage, NULL): for the ext2 filesystem, the readpage function is ext2_readpage
(A) page = __read_cache_page(mapping, index, filler = ext2_readpage, NULL)
(A) page = find_get_page(mapping, index): calls radix_tree_lookup(mapping->page_tree, index) to look up the page at offset index in the radix tree; if it is found, calls page_cache_get(page) = get_page(page) to increase the page reference count and returns the page structure
(B) cache_page = page_cache_alloc_cold(mapping): if the previous step fails, this step calls alloc_pages() to allocate a page
(C) add_to_page_cache_lru(cache_page, mapping, index, ..):
Compare the directory entry names: on a match, exit and return the de structure; otherwise test the next directory entry
(2) ino = le32_to_cpu(de->inode): get the inode number of the target file/directory
inode = iget(dir->i_sb, ino): read inode number ino; for details, refer to step 8.IX of the ext2 filesystem mount procedure
d_splice_alias(inode, dentry): establish the relationship between the inode and the dentry
may_create(dir, dentry, nd): check whether the target file/directory exists and check permissions
dir->i_op->create(dir, dentry, mode, nd): that is, call ext2_create() to create the file:
a inode = ext2_new_inode(dir, mode): allocate and set up the inode
sb = dir->i_sb
inode = new_inode(sb): allocate and initialize the inode
alloc_inode(sb): calls sb->s_op->alloc_inode(sb), namely ext2_alloc_inode(), which allocates an ext2_inode_info structure ei from the dedicated cache ext2_inode_cachep and returns the address of its vfs_inode member as the allocated inode. The members of the inode structure are then set to their initial values
The inode is added to the lists inode_in_use and sb->s_inodes through its i_list and i_sb_list members:
list_add(&inode->i_list, &inode_in_use): add the inode to the global in-use list
list_add(&inode->i_sb_list, &sb->s_inodes): add the inode to the list of the current superblock
The static variable last_ino records the number of inodes allocated: inode->i_ino = ++last_ino
If a directory is being created and the superblock carries the EXT2_MOUNT_OLDALLOC flag, find_group_dir is called; if a directory is being created without that flag, find_group_orlov is called; if the target is not a directory, find_group_other is called. These are explained below:
group = find_group_dir(sb, dir): select the block group in which to create a directory
ngroups = EXT2_SB(sb)->s_groups_count: the number of block groups on the current device
avefreei = ext2_count_free_inodes(sb) / ngroups: the average number of free inodes per block group
desc = ext2_get_group_desc(sb, i, NULL): get the descriptor structure of the i-th block group
group = find_group_other(sb, dir): select the block group for creating a non-directory (file or link)
(I) If the block group that contains the directory dir has a free inode and a free block, select the block group of dir directly
(II) If the condition above is not met, use quadratic hashing to select another block group that has a free inode (desc->bg_free_inodes_count > 0) and a free block
(III) If none has been found yet, choose a block group with a free inode, regardless of free blocks
At this point block group selection is complete, and the selected block group is number group
gdp = ext2_get_group_desc(sb, group, &bh2): get the address of the block group descriptor and the buffer cache structure that holds it
bitmap_bh = read_inode_bitmap(sb, group): read the disk block that holds the inode bitmap of the current block group
(I) desc = ext2_get_group_desc(sb, group, NULL): get the address of the block group descriptor
(II) bh = sb_bread(sb, desc->bg_inode_bitmap): read the inode bitmap disk block of the current block group
ino = ext2_find_next_zero_bit((unsigned long *)bitmap_bh->b_data, EXT2_INODES_PER_GROUP(sb), ino): find a free bit in the bitmap and allocate the on-disk inode
If the search fails, try to allocate the inode in the next block group
ext2_set_bit_atomic(sb_bgl_lock(sbi, group), ino, bitmap_bh->b_data): mark bit ino in the bitmap bitmap_bh->b_data to record that this inode number is in use
mark_buffer_dirty(bitmap_bh): mark the buffer cache as modified
sync_dirty_buffer(bitmap_bh): if sb->s_flags carries MS_SYNCHRONOUS, modifications are synchronous, so submit_bh(WRITE, bh) is called to write back the inode bitmap
ino += group * EXT2_INODES_PER_GROUP(sb) + 1: convert the inode number to a number within the block device
percpu_counter_inc(&sbi->s_dirs_counter): if a directory is being created, increment the directory count
gdp->bg_free_inodes_count -= 1: decrement the number of free inodes in the block group
sb->s_dirt = 1: the superblock has been modified
mark_buffer_dirty(bh2): mark the buffer cache of the block group descriptor as modified
inode->i_ino = ino: set the inode number in the inode itself
Initialize the other members of the inode structure
mark_inode_dirty(inode): mark the inode as modified
Set the inode operation function pointers: inode->i_op, inode->i_mapping->a_ops, inode->i_fop
mark_inode_dirty(inode): mark the inode as modified
ext2_add_nondir(dentry, inode):
(1) ext2_add_link(dentry, inode)
mark_inode_dirty(dir): mark the parent directory as modified
(2) d_instantiate(dentry, inode): on success, link the dentry to the inode structure; otherwise release the inode and return an error
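The bitmap steps of ext2_new_inode() described above (find a zero bit, set it, decrement the free count, convert to a device-wide number) can be sketched in user space as follows. This is only an illustration under stated assumptions: struct sk_group, SK_INODES_PER_GROUP and sk_new_inode_number() are hypothetical names, the group size of 16 is arbitrary, and locking, buffer heads and write-back are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical group size; ext2 reads the real value from the superblock. */
#define SK_INODES_PER_GROUP 16u

/* Hypothetical in-memory model of one block group. */
struct sk_group {
    uint8_t inode_bitmap[SK_INODES_PER_GROUP / 8]; /* bit set: inode in use */
    unsigned free_inodes_count;
};

/*
 * Allocate one inode in block group `group`.
 * Returns the device-wide inode number (numbering starts at 1),
 * or 0 if the group has no free inode.
 */
uint32_t sk_new_inode_number(struct sk_group *g, uint32_t group)
{
    unsigned bit = 0;

    assert(g != NULL);

    /* Linear scan standing in for ext2_find_next_zero_bit(). */
    while (bit < SK_INODES_PER_GROUP &&
           (g->inode_bitmap[bit / 8] & (1u << (bit % 8))))
        bit++;
    if (bit == SK_INODES_PER_GROUP)
        return 0; /* the caller would try the next block group */

    /* Mark the inode as used, as ext2_set_bit_atomic() does. */
    g->inode_bitmap[bit / 8] |= (uint8_t)(1u << (bit % 8));
    if (g->free_inodes_count > 0)
        g->free_inodes_count--;

    /* ino += group * EXT2_INODES_PER_GROUP(sb) + 1 */
    return group * SK_INODES_PER_GROUP + bit + 1;
}
```

With 16 inodes per group, the first allocation in group 3 yields inode number 3 * 16 + 0 + 1 = 49.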
__follow_mount(&path): move path to the filesystem currently mounted on it
may_open(nd, acc_mode, flag): check the attribute flags and permissions; if O_TRUNC is set, obtain write permission, call locks_verify_locked to check locking, and set the file length to 0
nd->dentry = path.dentry: nd->dentry points to the new directory/file
Set the associated members of f: f->f_op = inode->i_fop, f->f_dentry = nd.dentry, f->f_vfsmnt = nd.mnt, f->f_mapping = inode->i_mapping
Call f->f_op->open(inode, f), namely the function generic_file_open; check the O_DIRECT flag
Return the structure f
fsnotify_open(f->f_dentry): only takes effect when the compile option CONFIG_INOTIFY is enabled
fd_install(fd, f): set files->fd[fd] = f
At this point, opening the file is complete
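The descriptor bookkeeping done by get_unused_fd() and fd_install() in the walkthrough above can be illustrated with a small user-space model. It is a sketch under assumptions, not the kernel code: struct sk_files, SK_MAX_FDS and the sk_ helper names are hypothetical, the table has a fixed size, and there is no locking or table expansion.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SK_MAX_FDS 64 /* hypothetical table size for this sketch */
#define SK_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
#define SK_NWORDS ((SK_MAX_FDS + SK_WORD_BITS - 1) / SK_WORD_BITS)

/* Hypothetical, simplified stand-in for the kernel's files_struct. */
struct sk_files {
    unsigned long open_fds[SK_NWORDS];      /* bit set: descriptor in use */
    unsigned long close_on_exec[SK_NWORDS]; /* bit set: close on exec() */
    void *fd[SK_MAX_FDS];                   /* descriptor -> open file */
};

static int sk_test_bit(const unsigned long *map, size_t n)
{
    return (int)((map[n / SK_WORD_BITS] >> (n % SK_WORD_BITS)) & 1UL);
}

static void sk_set_bit(unsigned long *map, size_t n)
{
    map[n / SK_WORD_BITS] |= 1UL << (n % SK_WORD_BITS);
}

static void sk_clear_bit(unsigned long *map, size_t n)
{
    map[n / SK_WORD_BITS] &= ~(1UL << (n % SK_WORD_BITS));
}

/* Returns the lowest free descriptor, or -1 if the table is full. */
int sk_get_unused_fd(struct sk_files *files)
{
    size_t fd = 0;

    /* Linear scan standing in for find_next_zero_bit(). */
    while (fd < SK_MAX_FDS && sk_test_bit(files->open_fds, fd))
        fd++;
    if (fd == SK_MAX_FDS)
        return -1;

    sk_set_bit(files->open_fds, fd);        /* FD_SET: no longer free */
    sk_clear_bit(files->close_on_exec, fd); /* FD_CLR: keep across exec() */
    return (int)fd;
}

/* Counterpart of fd_install(): files->fd[fd] = f. */
void sk_fd_install(struct sk_files *files, int fd, void *f)
{
    assert(fd >= 0 && fd < SK_MAX_FDS);
    assert(sk_test_bit(files->open_fds, (size_t)fd));
    files->fd[fd] = f;
}
```

Reserving the number first and installing the file pointer last mirrors the order in the walkthrough: the descriptor is allocated early, but it only becomes usable once the open has fully succeeded.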
2.3 read
The read operation is carried out by the system call sys_read. The system call first calls file = fget_light(fd, &fput_needed)
to get the file structure for the descriptor, then calls the main function vfs_read(), and afterwards calls file_pos_write() and fput_light() to update
the file position and release the file reference. The following analyzes only the main function vfs_read(file, buf, count, &pos)
Check that the parameters are valid and that the pointers file->f_op, and file->f_op->read or file->f_op->aio_read, are non-NULL
access_ok(VERIFY_WRITE, buf, count): verify that the user-space region [buf, buf + count) is writable
rw_verify_area(READ, file, pos, count): verify that the target region of the file is readable
I The amount of data to read, count, must not exceed the file's limit file->f_maxcount
II The file positions before and after the read must be valid, that is, pos >= 0 && pos + count >= 0
III If the inode has file locks (inode->i_flock) and mandatory file locking is enabled, that is, the macro MANDATORY_LOCK(inode) is true
(inode->i_sb->s_flags has MS_MANDLOCK set, and the set-group-ID bit is set without the group execute bit, namely
inode->i_mode & (S_ISGID | S_IXGRP) == S_ISGID), then call the function locks_mandatory_area
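The three checks attributed to rw_verify_area() above can be sketched as a plain C function. Everything named sk_ here is hypothetical: struct sk_file flattens the file, inode and superblock fields that the checks read, the mode-bit constants are local copies of the usual octal values, and the return codes (0, -1, 1) are a convention of this sketch rather than the kernel's. The overflow test is written without signed wraparound, which the C standard leaves undefined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_S_ISGID 02000 /* set-group-ID bit */
#define SK_S_IXGRP 00010 /* group execute bit */

/* Hypothetical flattened view of the fields the checks look at. */
struct sk_file {
    size_t f_maxcount;  /* largest amount a single read may request */
    int has_locks;      /* stands in for inode->i_flock != NULL */
    int mandlock_mount; /* stands in for MS_MANDLOCK in s_flags */
    unsigned i_mode;    /* inode mode bits */
};

/* Mirrors the MANDATORY_LOCK condition quoted in the text. */
static int sk_mandatory_lock(const struct sk_file *f)
{
    return f->mandlock_mount &&
           (f->i_mode & (SK_S_ISGID | SK_S_IXGRP)) == SK_S_ISGID;
}

/*
 * Returns 0 if the range is acceptable, -1 if it is invalid, and 1 if
 * the caller would additionally have to consult locks_mandatory_area().
 */
int sk_rw_verify_area(const struct sk_file *f, int64_t pos, size_t count)
{
    assert(f != NULL);

    if (count > f->f_maxcount)
        return -1;
    if (pos < 0)
        return -1;
    /* pos + count must not overflow the signed file offset. */
    if ((uint64_t)count > (uint64_t)(INT64_MAX - pos))
        return -1;
    if (f->has_locks && sk_mandatory_lock(f))
        return 1;
    return 0;
}
```

A file with mode 02644 on a mount with mandatory locking enabled satisfies the condition, because the set-group-ID bit is set while the group execute bit is clear.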
struct iovec local_iov = {.iov_base = buf, .iov_len = count}: the user-space buffer that receives the data
init_sync_kiocb(&kiocb, filp): initialize the kiocb data structure
ret = __generic_file_aio_read(iocb = &kiocb, iov = &local_iov, nr_segs = 1, ppos): read the data in the asynchronous I/O manner
The input parameters describe a single user-space buffer, that is, iov has only one component
Using a loop, verify that each buffer component iov[..] lies in user space, accumulate the total amount of data to read, and check whether it becomes negative; if a component is invalid, that component and those after it are dropped, and only the components before it are serviced
If filp->f_flags has O_DIRECT set, the filesystem cache mechanism is not used and data is transferred directly between the user buffer and the device; continue below. Otherwise go to step vii
mapping = filp->f_mapping, inode = mapping->host, size = inode->i_size
retval = generic_file_direct_IO(READ, iocb, iov, offset = pos, nr_segs = 1):
A file = iocb->ki_filp, mapping = file->f_mapping
B For a write operation, perform the following steps:
write_len = iov_length(iov, nr_segs): calculate the total amount of data in the current I/O operation
If the file has address mappings (mmap), that is, mapping_mapped(mapping) is true (mapping->i_mmap->prio_tree_node is not NULL or mapping->i_mmap_nonlinear is not empty), call unmap_mapping_range(mapping, offset, write_len, 0) to release the address mappings of the region [offset, offset + write_len]
filemap_write_and_wait(mapping)
If the mapping has no pages, that is, mapping->nrpages = 0, return 0 directly; otherwise continue
B Declare the local variable read_descriptor_t desc and initialize it from the input parameters
C do_generic_file_read(filp, ppos, &desc, file_read_actor) directly calls do_generic_mapping_read(filp->f_mapping, &filp->f_ra, filp, ppos, desc, actor = file_read_actor)
to be continued
If the return value ret = -EIOCBQUEUED, call wait_on_sync_kiocb(&kiocb), which puts the current process into the
TASK_UNINTERRUPTIBLE state and schedules, waiting for the read to complete
do_sync_read(file, buf, count, pos)
init_sync_kiocb(&kiocb, filp): initialize the kiocb data structure and set kiocb.ki_pos = *ppos
filp->f_op->aio_read(&kiocb, buf, len, kiocb.ki_pos): that is, call ret = generic_file_aio_read()
i struct iovec local_iov = {.iov_base = buf, .iov_len = count};
ii __generic_file_aio_read(iocb, &local_iov, 1, &iocb->ki_pos)
If the return value ret = -EIOCBRETRY, call wait_on_retry_sync_kiocb(&kiocb) and repeat the previous
step until the return value is no longer -EIOCBRETRY. For the execution of __generic_file_aio_read, see step 5.III
If the return value ret = -EIOCBQUEUED, call wait_on_sync_kiocb(&kiocb)
*ppos = kiocb.ki_pos
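The control flow of do_sync_read() described above (retry while the operation reports -EIOCBRETRY, wait if it was queued, then copy the position back) can be sketched in user space. The sk_ names, the struct layout and the numeric error values are assumptions made for this sketch; the wait helpers are stubs, and sk_demo_aio_read() is an invented stand-in for a filesystem's aio_read method that asks for a configurable number of retries before delivering data.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder values for this sketch; only their distinctness matters. */
#define SK_EIOCBQUEUED 529
#define SK_EIOCBRETRY  530

/* Hypothetical, trimmed-down I/O control block. */
struct sk_kiocb {
    long ki_pos;        /* current file position */
    long result;        /* result delivered on completion */
    void *private_data; /* state for the demo aio_read below */
};

typedef long (*sk_aio_read_t)(struct sk_kiocb *iocb, char *buf,
                              size_t len, long pos);

/* Stubs: the kernel versions put the process to sleep. */
static void sk_wait_on_retry_sync_kiocb(struct sk_kiocb *iocb) { (void)iocb; }
static long sk_wait_on_sync_kiocb(struct sk_kiocb *iocb) { return iocb->result; }

long sk_do_sync_read(sk_aio_read_t aio_read, void *private_data,
                     char *buf, size_t len, long *ppos)
{
    struct sk_kiocb kiocb;
    long ret;

    assert(aio_read != NULL && ppos != NULL);
    kiocb.ki_pos = *ppos;
    kiocb.result = 0;
    kiocb.private_data = private_data;

    /* Repeat the call until it no longer asks for a retry. */
    while ((ret = aio_read(&kiocb, buf, len, kiocb.ki_pos)) == -SK_EIOCBRETRY)
        sk_wait_on_retry_sync_kiocb(&kiocb);
    /* A queued request completes later; wait for its result. */
    if (ret == -SK_EIOCBQUEUED)
        ret = sk_wait_on_sync_kiocb(&kiocb);

    *ppos = kiocb.ki_pos;
    return ret;
}

/* Demo method: requests *retries retries, then fills buf with 'x'. */
long sk_demo_aio_read(struct sk_kiocb *iocb, char *buf, size_t len, long pos)
{
    int *retries = iocb->private_data;
    size_t i;

    if (*retries > 0) {
        (*retries)--;
        return -SK_EIOCBRETRY;
    }
    for (i = 0; i < len; i++)
        buf[i] = 'x';
    iocb->ki_pos = pos + (long)len;
    return (long)len;
}
```

The caller sees a plain synchronous read: retries and queued completions are hidden inside the wrapper, and only the final byte count and the updated position are returned.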
2.4 write
The entry function is the system call sys_write. Like sys_read, it calls the VFS layer function vfs_write. If the function
pointer file->f_op->write is not NULL, file->f_op->write(file, buf, count, pos) is called, which for the ext2 filesystem is the
function generic_file_write; otherwise the function do_sync_write(file, buf, count, pos) is called
generic_file_write(file, buf, count, pos):
vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE): the macro is defined as wait_event((inode->i_sb)->s_wait_unfrozen, ((inode->i_sb)->s_frozen < (SB_FREEZE_WRITE)))
current->backing_dev_info = mapping->backing_dev_info
generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode)): perform the necessary checks before writing data, such as the amount of data to write and the position; some errors may raise SIGXFSZ
remove_suid(file->f_dentry): after preparing the parameters, call notify_change(dentry, &newattrs)
inode_update_time(inode, 1): set inode->i_mtime and inode->i_ctime to the current system time and
call mark_inode_dirty_sync(inode) to mark the inode as needing write-back
If file->f_flags has O_DIRECT set, call the function generic_file_direct_write(iocb, iov,
D If the file is accessed synchronously (the condition (written >= 0 && ((file->f_flags & O_SYNC) ||
IS_SYNC(inode)))), call generic_osync_inode(inode, mapping, OSYNC_METADATA)
to write all changes to the inode and the file data back to disk
to be continued
If file->f_flags does not have O_DIRECT set, call the function generic_file_buffered_write(iocb, iov,
If the return value ret = -EIOCBQUEUED, call wait_on_sync_kiocb(&kiocb); the process enters the
TASK_UNINTERRUPTIBLE state and schedules.
*ppos = kiocb.ki_pos
2.5 mmap
The corresponding system call is sys_mmap. If the mapping is not anonymous (flags does not have MAP_ANONYMOUS set), that is, a file
mapping is used, file = fget(fd) is called to get the file structure, and then the main mapping function do_mmap_pgoff(file, addr, len, prot,
flags, off >> PAGE_SHIFT) is executed. The following analyzes the implementation of this function
Validate the parameters and round the mapped range up to a whole multiple of the page size
addr = get_unmapped_area(file, addr, len, pgoff, flags)
If flags does not have MAP_FIXED set, that is, the specified address addr need not be used, the following procedure is used
Initial setting of addr: addr = mm->free_area_cache < begin ? begin : mm->free_area_cache
In a for loop, call find_vma(mm, addr) repeatedly to find a free range of virtual address space, and return it once found
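The search loop just described (start at a candidate address, ask find_vma for the next mapping, and skip past it until a gap of len bytes appears) is a first-fit scan. The sketch below models it over a sorted array instead of the kernel's vma list; struct sk_vma, sk_find_vma() and sk_get_unmapped_area() are hypothetical names, and returning 0 for failure is a convention of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mapping descriptor covering [vm_start, vm_end). */
struct sk_vma {
    unsigned long vm_start;
    unsigned long vm_end;
};

/*
 * Like find_vma(): index of the first mapping whose end lies above addr,
 * or n if there is none. The array must be sorted and non-overlapping.
 */
static size_t sk_find_vma(const struct sk_vma *v, size_t n, unsigned long addr)
{
    size_t i = 0;

    while (i < n && v[i].vm_end <= addr)
        i++;
    return i;
}

/*
 * First-fit search for len free bytes inside [begin, end).
 * Returns the start of the free range, or 0 if none exists.
 */
unsigned long sk_get_unmapped_area(const struct sk_vma *v, size_t n,
                                   unsigned long begin, unsigned long end,
                                   unsigned long len)
{
    unsigned long addr = begin;
    size_t i;

    assert(begin > 0 && begin < end && len > 0);

    for (i = sk_find_vma(v, n, addr); ; i++) {
        /* Not enough room left below the upper limit. */
        if (addr > end || end - addr < len)
            return 0;
        /* No mapping above addr, or the gap before it is large enough. */
        if (i == n || addr + len <= v[i].vm_start)
            return addr;
        /* Otherwise continue searching after this mapping. */
        addr = v[i].vm_end;
    }
}
```

For mappings at [0x1000, 0x3000), [0x4000, 0x5000) and [0x5000, 0x9000), a request for 0x1000 bytes lands in the hole at 0x3000, while a request for 0x2000 bytes has to go above the last mapping, at 0x9000.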
shmem_zero_setup(vma): if this is not a file mapping and VM_SHARED is set, call this function to set up a shared
anonymous mapping
vma->vm_ops = &generic_file_vm_ops: the key operations for the address mapping; generic_file_vm_ops has only two
effective members: .nopage = filemap_nopage; .populate = filemap_populate
file = shmem_file_setup("dev/zero", size, vma->vm_flags): create a shared memory file
i root = shm_mnt->mnt_root;
ii dentry = d_alloc(root, &this): allocate a dentry structure in the shmfs filesystem
d_instantiate(dentry, inode): establish the link between the dentry and the inode
file->f_vfsmnt = shm_mnt; file->f_dentry = dentry; file->f_mapping = inode->i_mapping;
file->f_op = &shmem_file_operations
vma_merge(): for an anonymous mapping, try to merge the vma
atomic_inc(&inode->i_writecount): if this is a file mapping and a write mapping, increase the write count
mm->total_vm += len >> PAGE_SHIFT: record the amount of mapped memory
If VM_LOCKED is set, increase the amount of locked memory mm->locked_vm and call make_pages_present(addr, addr + len):
after verifying the parameters, this function calls get_user_pages(current, current->mm, addr, len, write,
force = 0, pages = NULL, vmas = NULL)
I expand_stack(vma, addr): at this point vma is the vma that manages the stack; expand it to include the address addr
A anon_vma_prepare(vma): find or allocate an anonymous vma; if vma->anon_vma is not NULL,
return directly, otherwise continue
a anon_vma = find_mergeable_anon_vma(vma): check whether an adjacent vma has an anon_vma that can be shared,
so that the vmas can still be merged later; if one is found, return the anon_vma of the adjacent vma
b anon_vma = anon_vma_alloc(): if none is found, call this function to allocate an anon_vma
c vma->anon_vma = anon_vma
d list_add(&vma->anon_vma_node, &anon_vma->head): add the vma to the list of the anon_vma
B Align the address addr to a whole multiple of the page size
C size = vma->vm_end - address; grow = (vma->vm_start - address) >> PAGE_SHIFT
D acct_stack_growth(vma, size, grow): verify that the stack space may be increased
E vma->vm_start = address; vma->vm_pgoff -= grow: accept the expansion
If vma has the VM_LOCKED flag, call the function make_pages_present(addr, start) to allocate
physical pages
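The arithmetic of the stack expansion above can be checked with a small model. It assumes a stack that grows toward lower addresses (as on x86_64) and a 4 KiB page; struct sk_stack_vma and sk_expand_stack() are hypothetical names, and a simple size limit stands in for the accounting done by acct_stack_growth().

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SHIFT 12 /* assumes 4 KiB pages */
#define SK_PAGE_SIZE  (1UL << SK_PAGE_SHIFT)
#define SK_PAGE_MASK  (~(SK_PAGE_SIZE - 1))

/* Hypothetical stack region covering [vm_start, vm_end). */
struct sk_stack_vma {
    unsigned long vm_start;
    unsigned long vm_end;
    unsigned long vm_pgoff; /* offset of vm_start in pages */
};

/*
 * Extend a downward-growing stack region so that it covers address.
 * Returns 0 on success and -1 if the new size would exceed stack_limit.
 */
int sk_expand_stack(struct sk_stack_vma *vma, unsigned long address,
                    unsigned long stack_limit)
{
    unsigned long size, grow;

    assert(vma != NULL && vma->vm_start < vma->vm_end);

    address &= SK_PAGE_MASK;       /* align down to a page boundary */
    if (address >= vma->vm_start)
        return 0;                  /* already covered, nothing to do */

    size = vma->vm_end - address;  /* size of the region after growth */
    grow = (vma->vm_start - address) >> SK_PAGE_SHIFT; /* pages added */

    if (size > stack_limit)        /* stand-in for acct_stack_growth() */
        return -1;

    vma->vm_start = address;       /* accept the expansion */
    vma->vm_pgoff -= grow;
    return 0;
}
```

Starting from [0x7000, 0x9000) with vm_pgoff = 16, a fault at 0x5234 moves vm_start down to 0x5000, which adds two pages and leaves vm_pgoff at 14.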
At this point either vma is NULL or it contains the address start
If the condition (vma == NULL && in_gate_area(tsk, start)) holds, that is, the address start lies in gate_vma (the range
[VSYSCALL_START, VSYSCALL_END]): for pages that exist, fill in the pages and vmas
parameters; if a page that does not exist is encountered, return; while this condition continues to hold,
process the next page
page = follow_page(mm, start, write): directly calls __follow_page(mm, address, 0, write, 1) to get
the page structure of the page that contains the address; if page = NULL, call the function __handle_mm_fault(mm,
If pages remain to be processed, that is, len > 0, go back to step I and continue with the next vma
After completion, the virtual addresses have corresponding physical pages
If the flags parameter has MAP_POPULATE set, call sys_remap_file_pages(addr, len, 0, pgoff, flags &
MAP_NONBLOCK): remap
to be continued
For the ext2 filesystem, the address mapping is completed later, when accessing a page generates a page fault
.nopage = filemap_nopage; .populate = filemap_populate
2.6 path_lookup
2.7 file locking and counter
2.8 File System Summary
3 Process
Linux has three system calls for creating processes: fork, vfork, and clone. In the kernel they are sys_fork, sys_vfork, and
sys_clone, all of which call the internal function do_fork() to do the work; the only difference is the parameters they pass to do_fork().
do_fork parameters:
unsigned long clone_flags: feature flags
unsigned long stack_start: starting address of the child process stack
struct pt_regs *regs: pointer to the register structure
unsigned long stack_size: stack size; this parameter is not used
int __user *parent_tidptr: pointer to the tid variable of the parent process
int __user *child_tidptr: pointer to the tid variable of the child process
sys_fork parameters:
struct pt_regs *regs
Arguments passed to do_fork:
clone_flags = SIGCHLD: a signal is sent to the parent process when the child process terminates or stops
stack_start = regs->rsp: shares the stack of the parent process, copied through the COW mechanism
regs = regs
stack_size = 0
parent_tidptr = NULL
child_tidptr = NULL
sys_vfork parameters:
struct pt_regs *regs
Arguments passed to do_fork:
clone_flags = CLONE_VFORK | CLONE_VM | SIGCHLD: shares the address space with the parent process;
the parent process is suspended and waits until the child process releases the address space, that is, until the child exits or executes a new program;
The others are the same as for sys_fork
sys_clone parameters:
unsigned long clone_flags
unsigned long newsp
void __user *parent_tid
void __user *child_tid
struct pt_regs *regs
Arguments passed to do_fork:
stack_start = newsp ? newsp : regs->rsp
stack_size = 0
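The way the three entry points differ only in the arguments they hand to do_fork() can be summarized in code. This is a user-space sketch: struct sk_fork_args and the sk_sys_* functions are hypothetical and only build the argument set instead of creating a process, the register structure is reduced to the saved stack pointer rsp, and the numeric flag values are local copies of the usual Linux x86 definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Local copies of the usual Linux x86 values, for illustration only. */
#define SK_SIGCHLD     17UL
#define SK_CLONE_VM    0x00000100UL
#define SK_CLONE_VFORK 0x00004000UL

/* Hypothetical bundle of the arguments passed on to do_fork(). */
struct sk_fork_args {
    unsigned long clone_flags;
    unsigned long stack_start;
    unsigned long stack_size;
    int *parent_tidptr;
    int *child_tidptr;
};

/* fork: notify the parent with SIGCHLD, keep using the same stack address. */
struct sk_fork_args sk_sys_fork(unsigned long rsp)
{
    struct sk_fork_args a = { SK_SIGCHLD, rsp, 0, NULL, NULL };
    return a;
}

/* vfork: additionally share the address space and block the parent. */
struct sk_fork_args sk_sys_vfork(unsigned long rsp)
{
    struct sk_fork_args a = sk_sys_fork(rsp);
    a.clone_flags = SK_CLONE_VFORK | SK_CLONE_VM | SK_SIGCHLD;
    return a;
}

/* clone: the caller chooses the flags and, optionally, a new stack. */
struct sk_fork_args sk_sys_clone(unsigned long clone_flags, unsigned long newsp,
                                 int *parent_tid, int *child_tid,
                                 unsigned long rsp)
{
    struct sk_fork_args a;

    assert(rsp != 0);
    a.clone_flags = clone_flags;
    a.stack_start = newsp ? newsp : rsp; /* fall back to the current stack */
    a.stack_size = 0;
    a.parent_tidptr = parent_tid;
    a.child_tidptr = child_tid;
    return a;
}
```

Calling sk_sys_clone() with newsp = 0 falls back to the caller's stack pointer, which is the same stack_start that fork uses.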
CLONE_THREAD: the child process is added to the thread group of the parent and is forced to share the signal descriptor of the parent
process. CLONE_SIGHAND: share the signal handler table, including the signal handlers and the blocked and pending signals
If clone_flags has CLONE_SIGHAND set but not CLONE_VM, it is an error. CLONE_VM:
parent and child share the virtual address space
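The flag dependencies described above can be written as a small validation function. The second check is the one the text states directly (CLONE_SIGHAND without CLONE_VM is an error). The first check, rejecting CLONE_THREAD without CLONE_SIGHAND, is an assumption of this sketch about how "forced to share" is enforced. The sk_ names are hypothetical and the flag values are local copies of the usual Linux definitions.

```c
#include <assert.h>
#include <errno.h>

/* Local copies of the usual Linux values, for illustration only. */
#define SK_CLONE_VM      0x00000100UL
#define SK_CLONE_SIGHAND 0x00000800UL
#define SK_CLONE_THREAD  0x00010000UL

/* Returns 0 if the combination is acceptable, -EINVAL otherwise. */
int sk_check_clone_flags(unsigned long clone_flags)
{
    /* Assumed rule: a thread-group member must share signal handlers. */
    if ((clone_flags & SK_CLONE_THREAD) && !(clone_flags & SK_CLONE_SIGHAND))
        return -EINVAL;

    /* Stated rule: shared signal handlers require a shared address space. */
    if ((clone_flags & SK_CLONE_SIGHAND) && !(clone_flags & SK_CLONE_VM))
        return -EINVAL;

    return 0;
}

/* The combination a thread library would need under these rules. */
unsigned long sk_thread_flags(void)
{
    unsigned long flags = SK_CLONE_VM | SK_CLONE_SIGHAND | SK_CLONE_THREAD;

    assert(sk_check_clone_flags(flags) == 0);
    return flags;
}
```

Taken together, the two rules form a chain: CLONE_THREAD implies CLONE_SIGHAND, which in turn implies CLONE_VM.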
tsk = alloc_task_struct(): allocate a process control structure from the dedicated cache task_struct_cachep
ti = alloc_thread_info(): calls the macro __get_free_pages(, 1) to allocate two consecutive physical pages for the thread
management structure and the kernel stack
Check whether the total number of processes of the new process's owner exceeds the limit, that is, p->user->processes >= p->signal->rlim
[RLIMIT_NPROC].rlim_cur; if it is exceeded and the caller has no administrator privileges and is not the root user, return an error
Increment the counts p->user->__count and p->user->processes
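The per-user process limit check and the two counter increments described above can be sketched as follows. struct sk_user and sk_charge_process() are hypothetical; the privilege test is reduced to two flags; and returning -EAGAIN on failure is an assumption of this sketch, since the text only says that an error is returned.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, trimmed-down per-user accounting structure. */
struct sk_user {
    unsigned long processes; /* processes currently owned by the user */
    unsigned long count;     /* reference count on the structure */
    int is_root;             /* non-zero for the root user */
};

/*
 * Charge one more process to the user.
 * rlim_cur stands in for rlim[RLIMIT_NPROC].rlim_cur, and
 * has_admin_cap for the administrator-privilege test.
 */
int sk_charge_process(struct sk_user *u, unsigned long rlim_cur,
                      int has_admin_cap)
{
    assert(u != NULL);

    /* Over the limit: only privileged callers and root may continue. */
    if (u->processes >= rlim_cur && !has_admin_cap && !u->is_root)
        return -EAGAIN;

    u->count++;
    u->processes++;
    return 0;
}
```

The comparison uses >=, so a user who already owns exactly rlim_cur processes cannot create another one without privileges.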
get_group_info(p->group_info): increase the count p->group_info->usage
copy_flags(clone_flags, p): clear the PF_SUPERPRIV flag in p->flags and set the PF_FORKNOEXEC
flag; if clone_flags does not have CLONE_PTRACE set, set p->ptrace = 0. PF_SUPERPRIV:
superuser privileges were used; PF_FORKNOEXEC: forked but has not executed (exec) yet
p->pid = pid: set the PID of the child process
If clone_flags has CLONE_PARENT_SETTID set, write p->pid to the user-space variable
parent_tidptr
Set p->proc_dentry = NULL, initialize the lists p->children and p->sibling, initialize the spin locks p->alloc_lock and
p->proc_lock, call init_sigpending(&p->pending) to initialize the pending-signal management structure, and initialize the other members of p
copy_semundo(clone_flags, p):
i If clone_flags does not have CLONE_SYSVSEM set, set p->sysvsem.undo_list = NULL
and return; otherwise continue
p->sysvsem.undo_list = undo_list
copy_files(clone_flags, p)
i If clone_flags has CLONE_FILES set, increase the reference count current->files->count and return
ii Allocate a files_struct structure from the dedicated cache files_cachep and initialize it
open_files = count_open_files(): compute the largest file descriptor fd used by the current process
Copy open_fds and close_on_exec from files_struct and clear the unused part
Using a for loop, copy all open files in current->files->fd[..] to p->files->fd[..];
if a file is not open, clear the corresponding bit in p->files->open_fds
copy_fs(clone_flags, p): if clone_flags has CLONE_FS set, increase the count current->fs->count
and return; otherwise set p->fs = __copy_fs_struct(current->fs): allocate an fs_struct structure from the dedicated cache fs_cachep
and set its initial values equal to those of the fs_struct structure of the current process
copy_sighand(clone_flags, p): copy the signal handlers
If CLONE_SIGHAND or CLONE_THREAD is set, directly increase current->sighand->
copy_signal(clone_flags, p)
i If clone_flags has CLONE_THREAD set, increase the reference counts current->signal->count
and current->signal->live and return
ii Allocate a signal_struct structure sig from the dedicated cache signal_cachep
mm = allocate_mm(): allocate an mm_struct structure from the dedicated cache mm_cachep
init_new_context(p, mm): calls the function copy_ldt(new = &mm->context, old = &current->
mm->context)
A alloc_ldt(pc = new, mincount = old->size, reload = 0)
a If new->size >= old->size, return directly and go on to the copy step
b Round mincount up to a whole multiple of 512 entries
c The new LDT capacity is mincount * LDT_ENTRY_SIZE, if the value is less than a call to
B memcpy(new->ldt, old->ldt, old->size * LDT_ENTRY_SIZE): copy the contents of the parent process's mm->context.ldt
to the child process's mm->context.ldt, so that the contents the pointers refer to are identical
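The two steps of copy_ldt() described above (grow the child's table to a rounded-up capacity, then copy the parent's entries) can be modelled in user space. This is a sketch under assumptions: struct sk_ldt and the sk_ functions are hypothetical, calloc() replaces the kernel allocators, the entry size of 8 bytes is assumed, and the rounding is applied to the entry count.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SK_LDT_ENTRY_SIZE 8u /* assumed descriptor size in bytes */

/* Hypothetical stand-in for the LDT part of mm->context. */
struct sk_ldt {
    void *ldt;     /* table storage, NULL if none */
    unsigned size; /* capacity in entries */
};

/* Round an entry count up to the next multiple of 512. */
static unsigned sk_round_up_512(unsigned n)
{
    return (n + 511u) & ~511u;
}

/* Ensure pc can hold at least mincount entries. Returns 0 or -1. */
int sk_alloc_ldt(struct sk_ldt *pc, unsigned mincount)
{
    void *newldt;

    if (mincount <= pc->size)
        return 0; /* already large enough */

    mincount = sk_round_up_512(mincount);
    newldt = calloc(mincount, SK_LDT_ENTRY_SIZE);
    if (newldt == NULL)
        return -1;

    /* Preserve existing entries, then replace the old table. */
    if (pc->size > 0)
        memcpy(newldt, pc->ldt, (size_t)pc->size * SK_LDT_ENTRY_SIZE);
    free(pc->ldt);
    pc->ldt = newldt;
    pc->size = mincount;
    return 0;
}

/* Give new_ctx a private copy of the entries in old_ctx. */
int sk_copy_ldt(struct sk_ldt *new_ctx, const struct sk_ldt *old_ctx)
{
    assert(new_ctx != NULL && old_ctx != NULL);

    if (sk_alloc_ldt(new_ctx, old_ctx->size) < 0)
        return -1;
    if (old_ctx->size > 0)
        memcpy(new_ctx->ldt, old_ctx->ldt,
               (size_t)old_ctx->size * SK_LDT_ENTRY_SIZE);
    return 0;
}
```

Because the child receives its own copy, later changes to the parent's table do not affect the child; only the contents are identical at the time of the copy.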
4 locking mechanism
5 Memory Management
5.1 swap mechanism
Types of memory reclaimed by the swap mechanism:
page_launder(): reclaim pages on inactive_dirty_list
refill_inactive_scan(): move pages on active_list to the inactive state
swap_out(): starting from init_mm.mmlist, scan all mm_struct structures and swap out pages in the ranges managed by vm_area_struct
shrink_dcache_memory(): reclaim dentry structures
shrink_icache_memory(): reclaim inode structures
kmem_cache_reap(): reclaim the space of slab structures
index:
address_space
dentry
dentry_operations
ext2_aops
ext2_dir_entry_2
ext2_dir_inode_operations
ext2_group_desc
ext2_inode
ext2_inode_info
ext2_sb_info
ext2_sops
ext2_super_block
file
file_operations
file_system_type
files_struct
fs_struct
inode
nameidata
proc_dir_entry
proc_sops
super_block
vfsmount
6 Kernel threads
migration_thread()
ksoftirqd()
7 Special functions
schedule()
path_lookup()
8 Global variables
CPU Unit
The smallest unit that the operating system recognizes as a CPU: if SCHED_SMT is enabled, it refers to each CPU thread; otherwise it refers to each
CPU core
cpu_core_id[NR_CPUS]: the CPU core number of each CPU unit, excluding the package number
smp_num_siblings: the number of CPU units contained in the current CPU package; note that with SMT each
thread counts as a CPU unit