
Kernel Internals

Santosh Sam Koshy santoshk@cdac.in Centre for Development of Advanced Computing, Hyderabad

Agenda
IOCTL
Kernel Synchronization Techniques
Wait Queues
Time Delays
Deferred Execution


IOCTL
Most drivers need, in addition to the ability to read and write the device, the ability to perform various types of hardware control through the device driver. These operations are normally supported via the ioctl method.
In user space, the ioctl call has the following format:

int ioctl(int fd, unsigned long cmd, ...);

The ioctl driver method has the prototype:

int (*ioctl) (struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg);



Magic Numbers
Magic numbers are the mechanism for identifying the commands of a particular device. They must be unique across the system. The kernel encodes an ioctl command number in four bit-fields:
type: the magic number, registered in Documentation/ioctl-number.txt; 8 bits wide
number: the ordinal (sequential) number; also 8 bits wide
direction: the direction of data transfer; 2 bits
size: the size of the user data involved; architecture dependent, generally limited to 13 or 14 bits
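As a sketch of how these bit-fields are normally filled in, the kernel provides the _IO, _IOR, _IOW and _IOWR macros (via linux/ioctl.h). The magic value and command names below are hypothetical examples, not part of any real driver:

#include <linux/ioctl.h>

/* Hypothetical magic number for an example driver; a real driver should
 * pick one that is not already claimed in Documentation/ioctl-number.txt */
#define EXAMPLE_IOC_MAGIC   'k'

/* type + number (+ direction and size, derived from the third argument) */
#define EXAMPLE_IOCRESET    _IO(EXAMPLE_IOC_MAGIC, 0)
#define EXAMPLE_IOCGRATE    _IOR(EXAMPLE_IOC_MAGIC, 1, int)  /* read an int from the driver */
#define EXAMPLE_IOCSRATE    _IOW(EXAMPLE_IOC_MAGIC, 2, int)  /* write an int to the driver */

/* The driver can later recover the individual fields with
 * _IOC_TYPE(cmd), _IOC_NR(cmd), _IOC_DIR(cmd) and _IOC_SIZE(cmd). */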


Kernel Synchronization


Agenda
Sources of Concurrency in the Kernel
Mechanisms to manage concurrency:
Semaphores
RW Semaphores
Spinlocks
RW Spinlocks
Completions
Atomic Variables
Sequential Locks
RCU

What is Synchronization
In the kernel, many tasks can execute pseudo-concurrently. This may lead to data inconsistencies when they access a common resource. A well-defined coordination between tasks accessing shared data is therefore a must, and this coordination is what synchronization provides.


Sources of Concurrency
In a Linux system, numerous processes may be executing in user space, making system calls into the kernel
SMP systems can run your code concurrently on several processors
Kernel code is preemptible
Interrupts are asynchronous events that can cause concurrent execution
The kernel provides deferred-execution mechanisms that run code at a later time
Hot-pluggable devices can disappear while your code is still using them


Mechanisms to manage concurrency


Any time a hardware or software resource is shared beyond a single thread of execution, there is a possibility that one thread gets an inconsistent view of that resource.

This calls for some form of resource access management, brought about by mechanisms called locking or mutual exclusion: making sure that only one thread of execution can manipulate a shared resource at a time.


Semaphores
At its core, a semaphore is a single integer value combined with a pair of functions that are typically called up and down. To use semaphores, the code must include asm/semaphore.h. The kernel's semaphore implementation is simply struct semaphore:

struct semaphore {
    atomic_t count;
    int sleepers;
    wait_queue_head_t wait;
};

Semaphores
There are two ways of creating a semaphore. The dynamic way uses the function:

void sema_init(struct semaphore *sem, int val);

Statically, semaphores may be created with the macro:

static DECLARE_SEMAPHORE_GENERIC(name, count);

The count or val in both cases specifies the initial value of the semaphore. Setting it to 1 creates the semaphore as a binary semaphore, or mutex (mutual exclusion semaphore).


Semaphores
Semaphores may also be created in mutex mode with the following macros:

DECLARE_MUTEX(name);
DECLARE_MUTEX_LOCKED(name);

They may be initialized at runtime with:

void init_MUTEX(struct semaphore *sem);
void init_MUTEX_LOCKED(struct semaphore *sem);


Semaphores
A semaphore is acquired by calling one of the following functions:

void down(struct semaphore *sem);
int down_interruptible(struct semaphore *sem);
int down_trylock(struct semaphore *sem);

Once access to the critical section is complete, the semaphore is released with:

void up(struct semaphore *sem);
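A minimal usage sketch, assuming a hypothetical driver-private structure protected by a semaphore that was initialized with sema_init(&dev->sem, 1) in the driver's setup path (2.6-era API, as used in these slides):

#include <asm/semaphore.h>
#include <linux/errno.h>

struct example_dev {
    struct semaphore sem;       /* protects the fields below */
    int count;
};

static int example_update(struct example_dev *dev)
{
    /* sleep until the semaphore is available; back out if a signal arrives */
    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;

    dev->count++;               /* critical section */

    up(&dev->sem);              /* release the semaphore */
    return 0;
}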


Reader/Writer Semaphores
Code using rwsems must include linux/rwsem.h. The relevant data type is struct rw_semaphore. An rwsem must be explicitly initialized at run time using:

void init_rwsem(struct rw_semaphore *sem);

For read-only access:

void down_read(struct rw_semaphore *sem);
int down_read_trylock(struct rw_semaphore *sem);
void up_read(struct rw_semaphore *sem);

For situations where a long period of read-only access is required immediately after a quick write:

void downgrade_write(struct rw_semaphore *sem);
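The write-side counterparts are down_write(), down_write_trylock() and up_write(). A minimal sketch with a hypothetical configuration value, showing downgrade_write() in the pattern described above:

#include <linux/rwsem.h>

static struct rw_semaphore cfg_rwsem;   /* init_rwsem(&cfg_rwsem) at load time */
static int cfg_value;

static int cfg_get(void)
{
    int v;

    down_read(&cfg_rwsem);              /* many readers may hold this concurrently */
    v = cfg_value;
    up_read(&cfg_rwsem);
    return v;
}

static void cfg_set(int v)
{
    down_write(&cfg_rwsem);             /* exclusive access for the quick write */
    cfg_value = v;
    downgrade_write(&cfg_rwsem);        /* keep reading without blocking other readers */
    /* ... long read-only work on the new value ... */
    up_read(&cfg_rwsem);
}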

Spinlocks
A spinlock is a mutual exclusion device that can have only two values: locked and unlocked. It is usually implemented as a single bit in an integer value. Code wishing to take a particular lock tests the relevant bit and sets it atomically. Unlike semaphores, spinlocks may be used in code that cannot sleep. If the lock is held by somebody else, the code goes into a tight loop, repeatedly checking the lock until it becomes available.


Spinlocks
Spinlocks are intended for use on multiprocessor systems, although a uniprocessor workstation running a preemptive kernel behaves like SMP as far as locking is concerned. If a non-preemptive uniprocessor ever spun on a lock, it would spin forever; no other thread would ever be able to obtain the CPU to release the lock. For this reason, the Linux implementation compiles the spinlock operations away on non-preemptive uniprocessor systems.


Spinlocks
The required include file for the spinlock primitives is linux/spinlock.h. A spinlock has the type spinlock_t and has to be initialized before it is used. The static initialization of a spinlock is done with:

spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

or at runtime with:

void spin_lock_init(spinlock_t *lock);

A spinlock is obtained and released with:

void spin_lock(spinlock_t *lock);
void spin_unlock(spinlock_t *lock);

Spinlocks and Interrupts


Spinlocks can be used in interrupt handlers, whereas semaphores cannot, because semaphores sleep. If a lock is shared with an interrupt handler, local interrupts must be disabled before acquiring the lock. The kernel provides a separate interface for this, which disables local interrupts when the spinlock is acquired and restores them when it is released:

void spin_lock_irqsave(spinlock_t *lock, unsigned long flags);
void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags);
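A minimal sketch of a lock shared between an interrupt handler and ordinary driver code. The counter, handler name and 2.6-era static initializer are illustrative assumptions:

#include <linux/spinlock.h>
#include <linux/interrupt.h>

static spinlock_t fifo_lock = SPIN_LOCK_UNLOCKED;   /* or spin_lock_init() at runtime */
static int fifo_count;

/* called from process context: must disable local interrupts around the lock */
static void fifo_push(void)
{
    unsigned long flags;

    spin_lock_irqsave(&fifo_lock, flags);
    fifo_count++;
    spin_unlock_irqrestore(&fifo_lock, flags);
}

/* called from the interrupt handler: local interrupts are already off */
static irqreturn_t fifo_irq(int irq, void *dev_id)
{
    spin_lock(&fifo_lock);
    fifo_count--;
    spin_unlock(&fifo_lock);
    return IRQ_HANDLED;
}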


Reader/Writer Spinlocks
Using reader/writer spinlocks is similar to using rwsems. They are initialized with:

rwlock_t my_rwlock = RW_LOCK_UNLOCKED;      /* static way */

rwlock_t my_rwlock;                         /* dynamic way */
rwlock_init(&my_rwlock);

Reader and writer locks are acquired and released with:

void read_lock(rwlock_t *lock);
void read_unlock(rwlock_t *lock);
void write_lock(rwlock_t *lock);
void write_unlock(rwlock_t *lock);

Semaphores vs Spinlocks
Requirement                              Recommended Lock
Low overhead locking                     Spinlock
Short lock hold time                     Spinlock
Long lock hold time                      Semaphore
Need to lock from interrupt context      Spinlock
Need to sleep while holding lock         Semaphore


Completions
A common pattern in kernel programming is to initiate some activity outside the current execution flow and then wait for that activity to complete. Consider the following snippet, which uses a semaphore for that purpose:

struct semaphore sem;

init_MUTEX_LOCKED(&sem);
start_external_task(&sem);
down(&sem);

This works, but semaphores are not optimized for this "wait for completion" case; completions are.


Completions
Completions are a simple, lightweight mechanism with one task: allowing one thread to tell another that the job is done. A completion can be created statically with:

DECLARE_COMPLETION(my_completion);

Waiting for the completion is simply a matter of calling:

void wait_for_completion(struct completion *c);

The actual completion event is signalled with one of:

void complete(struct completion *c);
void complete_all(struct completion *c);
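A minimal sketch of one context signalling a completion that another is waiting on; the function names are hypothetical:

#include <linux/completion.h>

static DECLARE_COMPLETION(setup_done);

/* runs in some other thread of execution, e.g. a kernel thread */
static void do_setup(void)
{
    /* ... perform the long-running initialization ... */
    complete(&setup_done);              /* wake one waiter */
}

static void wait_for_setup(void)
{
    wait_for_completion(&setup_done);   /* sleeps until complete() is called */
    /* safe to use the initialized state here */
}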


Atomic Variables
Atomic variables are special data types provided by the kernel for performing simple operations atomically. The kernel provides an atomic integer type called atomic_t, together with a set of functions that must be used to operate on it. The operations are very fast, because they compile down to a single machine instruction whenever possible.


Atomic Integer Operations


Some important integer operations are:

void atomic_set(atomic_t *v, int i);
int atomic_read(atomic_t *v);
void atomic_add(int i, atomic_t *v);
void atomic_sub(int i, atomic_t *v);
void atomic_inc(atomic_t *v);
void atomic_dec(atomic_t *v);
int atomic_inc_and_test(atomic_t *v);
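A minimal sketch using an atomic reference count; the counter name and the dec-and-test variant shown here are illustrative assumptions:

#include <asm/atomic.h>

static atomic_t nreaders = ATOMIC_INIT(0);  /* compile-time initialization */

static void reader_open(void)
{
    atomic_inc(&nreaders);                  /* no lock needed for this counter */
}

static void reader_release(void)
{
    if (atomic_dec_and_test(&nreaders)) {
        /* last reader gone; safe to tear down the shared state */
    }
}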


Atomic Bit Operations


The atomic_t type works well for integer arithmetic, but it does not cover operations on individual bits. For that, the kernel provides a set of functions that act atomically on single bits; they are declared in asm/bitops.h. The available bit operations include:

void set_bit(nr, void *addr);
void clear_bit(nr, void *addr);
void change_bit(nr, void *addr);


seqlocks
Seqlocks are a feature added in the 2.6 kernel, intended to provide fast, lockless read access to a shared resource. They work in situations where write access is rare but must be fast. They operate by allowing readers free access to the resource, but requiring those readers to check for collisions with writers and, when a collision occurs, retry their access. Seqlocks cannot be used to protect data structures involving pointers, because a reader may be following a pointer that becomes invalid while a writer is changing the data structure.


seqlocks
Seqlocks are defined in linux/seqlock.h. A seqlock may be initialized with:

seqlock_t lock1 = SEQLOCK_UNLOCKED;

The write path is entered and exited with:

void write_seqlock(seqlock_t *lock);
/* write lock held here ... make changes */
void write_sequnlock(seqlock_t *lock);

Readers follow this pattern:

do {
    seq = read_seqbegin(&lock);
    /* read the data here */
} while (read_seqretry(&lock, seq));
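Putting both sides together, a minimal sketch protecting a pair of values that must always be read consistently (the variable names are hypothetical):

#include <linux/seqlock.h>

static seqlock_t pos_lock = SEQLOCK_UNLOCKED;
static unsigned long pos_x, pos_y;

static void set_position(unsigned long x, unsigned long y)
{
    write_seqlock(&pos_lock);           /* writers are serialized internally */
    pos_x = x;
    pos_y = y;
    write_sequnlock(&pos_lock);
}

static void get_position(unsigned long *x, unsigned long *y)
{
    unsigned int seq;

    do {
        seq = read_seqbegin(&pos_lock); /* lockless read attempt */
        *x = pos_x;
        *y = pos_y;
    } while (read_seqretry(&pos_lock, seq));    /* retry if a writer interfered */
}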



Read Copy Update

[Diagram: readers 1-5 access the shared resource through a pointer. The writer makes a copy of the shared resource, modifies the copy, and then switches the pointer, so that new readers see the updated copy while existing readers finish with the old version.]
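The diagram corresponds to the usual RCU pattern: readers follow the pointer inside an rcu_read_lock()/rcu_read_unlock() section, while the writer publishes a new copy and then waits for pre-existing readers before freeing the old one. A minimal sketch with a hypothetical structure; writers would additionally need ordinary locking among themselves:

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct my_cfg {
    int a, b;
};

static struct my_cfg *cfg_ptr;          /* the shared pointer */

static int read_cfg(void)
{
    struct my_cfg *p;
    int val;

    rcu_read_lock();                    /* marks a read-side critical section */
    p = rcu_dereference(cfg_ptr);       /* safely fetch the pointer */
    val = p ? p->a : 0;
    rcu_read_unlock();
    return val;
}

static void update_cfg(int a, int b)
{
    struct my_cfg *newp, *oldp;

    newp = kmalloc(sizeof(*newp), GFP_KERNEL);
    if (!newp)
        return;
    newp->a = a;
    newp->b = b;

    oldp = cfg_ptr;
    rcu_assign_pointer(cfg_ptr, newp);  /* publish the new copy */
    synchronize_rcu();                  /* wait for pre-existing readers to finish */
    kfree(oldp);                        /* now safe to free the old copy */
}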

Wait Queues, Delays and Deferred Execution


Agenda
Wait Queues
HZ
Jiffies
Long Delays
Kernel Timers
Tasklets
Work Queues


Wait Queues

Wait queues are a mechanism for putting a user-space process to sleep whenever the kernel driver is not able to satisfy the process's request immediately. When a process is put to sleep, it is marked as being in a special state and removed from the scheduler's run queue; it will not be scheduled again until some event wakes it up. The Linux scheduler maintains two special states that represent a sleeping process: TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE.


Declaration of Wait Queues


A wait queue can be thought of as a list of processes, all waiting for a specific event. Wait queues are managed by means of a wait queue head, a structure of type wait_queue_head_t. It may be defined and initialized as:

DECLARE_WAIT_QUEUE_HEAD(name);          /* static declaration */

or

wait_queue_head_t my_queue;             /* dynamic */
init_waitqueue_head(&my_queue);

This only creates a wait queue head to which sleeping tasks can later be appended.

Using Wait Queues


When a process sleeps, it does so in the expectation that some condition will become true in the future. The simplest way of sleeping is to call one of the wait_event macros:

wait_event(queue, condition);
wait_event_interruptible(queue, condition);
wait_event_timeout(queue, condition, timeout);
wait_event_interruptible_timeout(queue, condition, timeout);

The wake-up comes from some other thread of execution, either another process or an interrupt handler. It makes the condition true and then calls one of the following functions:

void wake_up(wait_queue_head_t *queue);
void wake_up_interruptible(wait_queue_head_t *queue);
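A minimal sketch of a reader sleeping until data arrives, with the wake-up coming from an interrupt handler or another process. The flag, names, and the way the flag is protected are illustrative assumptions (a real driver would guard the condition with appropriate locking):

#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/errno.h>

static DECLARE_WAIT_QUEUE_HEAD(read_wq);
static int data_ready;                      /* the condition the reader waits for */

static int example_wait_for_data(void)
{
    /* sleep until data_ready becomes non-zero; bail out on a signal */
    if (wait_event_interruptible(read_wq, data_ready != 0))
        return -ERESTARTSYS;

    data_ready = 0;
    /* ... copy the data to user space here ... */
    return 0;
}

/* called by an interrupt handler or another process when data arrives */
static void example_data_arrived(void)
{
    data_ready = 1;
    wake_up_interruptible(&read_wq);        /* sleepers re-evaluate the condition */
}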

When to use Wait Queues


Two behaviors in particular warrant the use of wait queues:

If a process calls read but no data is available, the process must block. The process is awakened as soon as some data arrives, and that data is returned to the caller, even if there is less than the amount requested in the count argument of the method.

If a process calls write and there is no space in the buffer, the process must block, and it must sleep on a different wait queue from the one used for reading. When some data has been written to the hardware device and space becomes free in the output buffer, the process is awakened and the write call succeeds, although the data may be only partially written if there isn't room in the buffer for the count bytes that were requested.

Exclusive Waits
Thundering Herd
When using wait queues, we may encounter a situation in which many processes are waiting for the same event. During wake-up, all of the processes waiting for the event are made runnable, and this herd of processes thunders in together to compete for exclusive access to the shared resource. Only one of them wins; the rest must go back to sleep. If this happens frequently, the repeated wake-ups degrade overall system performance. This is known as the thundering herd problem, and it is addressed with the exclusive wait mechanism.

Exclusive Waits
In response to the thundering herd problem, kernel developers added an exclusive wait option to the kernel. There are two important differences in the behavior of an exclusive wait:
When a wait queue entry has the WQ_FLAG_EXCLUSIVE flag set, it is added to the end of the wait queue; entries without that flag are added to the beginning
When wake_up is called on a wait queue, it stops after waking the first process that has the WQ_FLAG_EXCLUSIVE flag set

Putting a process into an exclusive wait is a matter of calling:

void prepare_to_wait_exclusive(wait_queue_head_t *queue, wait_queue_t *wait, int state);


HZ
The kernel keeps track of the flow of time by means of timer interrupts, generated by the system's timing hardware at regular intervals. This interval is programmed at boot time by the kernel according to the value of HZ, an architecture-dependent constant. Default values range from 50 to 1200 across architectures and are typically 100 or 1000 on x86 machines. Changing the value of HZ takes effect only after recompiling the kernel with the new value.


Jiffies
Every time a timer interrupt occurs, the value of an internal kernel counter is incremented. The counter is initialized to 0 at system boot and therefore represents the number of timer ticks since the last boot. The counter is a 64-bit variable called jiffies_64. Driver writers, however, normally access the jiffies variable, an unsigned long that is either the same as jiffies_64 or its least significant bits.


Using the Jiffies Counter


The jiffies counter can be used to read the present time and from it compute a future timestamp, as follows:

j = jiffies;
stamp_1 = j + HZ;       /* stamp_1 refers to one second in the future */
stamp_2 = j + HZ / 2;   /* stamp_2 refers to half a second in the future */


Delaying Execution
Long Delays:
Occasionally a driver needs to delay execution for relatively long periods, i.e. more than one clock tick. There are a few ways of implementing this.
Busy Waiting:

j = jiffies;
delay = j + 5 * HZ;     /* a delay of 5 seconds from now */

while (time_before(jiffies, delay))
    ; /* do nothing */


Delaying Execution
This method busy-loops in the while statement, hogging the CPU with no productive outcome.
Yielding the Processor:

while (time_before(jiffies, delay))
    schedule();         /* yield the CPU */

The advantage of this method is that another process may get access to the CPU. The requested delay is honored as a minimum, but the process may not be scheduled again exactly when the delay expires.


Delaying Execution
Short Delays:
The kernel implements functions that provide delays shorter than what the jiffies counter can resolve. These are implemented as busy-waiting loops whose calibration depends on the architecture:

void ndelay(unsigned long nsecs);
void udelay(unsigned long usecs);
void mdelay(unsigned long msecs);


Kernel Timers
Kernel timers are used to schedule the execution of a function at a later time, based on the clock tick. A kernel timer is a data structure that instructs the kernel to execute a user-defined function, with a user-defined argument, at a user-defined time. The declarations are in linux/timer.h and the implementation is in kernel/timer.c.


Kernel Timers
The scheduled function does not run in the context of the process that registered it; it runs asynchronously, much like an interrupt handler. Kernel timers can therefore be considered a kind of software interrupt handler, and the usual constraints apply: the function must be atomic, and it faces the additional restrictions of code running in interrupt context. A timer function runs on the same CPU that registered the timer.


The Timer API


The kernel provides drivers with a number of functions to declare, register and remove kernel timers. The interesting fields of the timer structure are:

struct timer_list {
    /* ... opaque internal fields ... */
    unsigned long expires;
    void (*function)(unsigned long);
    unsigned long data;
};

The expires field holds the jiffies value at which the timer should fire; function is a pointer to the user-defined callback, and data is the argument passed to that callback.


The Timer API


The timer is initialized with:

void init_timer(struct timer_list *timer);

The public fields of the structure may be set after this function returns.

The timer is then added to, and removed from, the kernel's active timer list with:

void add_timer(struct timer_list *timer);
void del_timer(struct timer_list *timer);

A timer is a one-shot mechanism: it is taken off the active list before its function runs.
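A minimal sketch of a polling timer using the 2.6-era interface described above (later kernels replaced it with timer_setup and related helpers); the names and the half-second period are illustrative assumptions:

#include <linux/timer.h>
#include <linux/jiffies.h>

static struct timer_list poll_timer;

static void poll_timeout(unsigned long data)        /* runs in atomic context */
{
    /* ... poll the hardware status registers here ... */

    /* timers are one-shot, so re-arm for another tick half a second away */
    mod_timer(&poll_timer, jiffies + HZ / 2);
}

static void start_polling(void)
{
    init_timer(&poll_timer);
    poll_timer.function = poll_timeout;
    poll_timer.data     = 0;                        /* argument passed to the callback */
    poll_timer.expires  = jiffies + HZ / 2;         /* first run half a second from now */
    add_timer(&poll_timer);
}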


Other Timer APIs


int mod_timer(struct timer_list *timer, unsigned long expires);
Updates the expiration time of the timer, re-registering it if necessary

int timer_pending(const struct timer_list *timer);
Returns true or false to indicate whether the timer is currently scheduled to run, by reading one of the opaque fields of the structure


Applications of Kernel Timers


Kernel timers find various applications, such as polling a device by checking its status registers at regular intervals when the hardware cannot fire interrupts. Other uses include turning off the floppy motor after a period of inactivity, or shutting down a processor fan at system shutdown.


Tasklets
Tasklets are another kernel facility for deferring the execution of a function to a later time. They resemble kernel timers in that they run at interrupt time, they always run on the same CPU that scheduled them, and they receive an unsigned long argument. They differ from kernel timers in that they cannot be scheduled for a particular time; they simply run at some later instant chosen by the system.


The Tasklet Data Structure


A tasklet exists as a data structure that must be initialized before use. Its interesting fields are:

struct tasklet_struct {
    /* ... */
    void (*func)(unsigned long);
    unsigned long data;
};

Initialization is done with the function:

void tasklet_init(struct tasklet_struct *t, void (*func)(unsigned long), unsigned long data);


Tasklets
A tasklet can be disabled and re-enabled later; it won't execute until it has been enabled as many times as it has been disabled
A tasklet can reschedule (re-register) itself
A tasklet can be scheduled to execute at normal priority or at high priority
Tasklets may run immediately if the system is not under heavy load, but never later than the next timer tick


Tasklet APIs
void tasklet_disable(struct tasklet_struct *t);
void tasklet_enable(struct tasklet_struct *t);
void tasklet_schedule(struct tasklet_struct *t);
void tasklet_hi_schedule(struct tasklet_struct *t);
void tasklet_kill(struct tasklet_struct *t);
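A minimal sketch of a tasklet scheduled from an interrupt handler so that the heavier processing runs later, in softirq context. It uses the classic DECLARE_TASKLET static initializer (the counterpart of tasklet_init); the names are hypothetical:

#include <linux/interrupt.h>

static void example_do_tasklet(unsigned long data);
static DECLARE_TASKLET(example_tasklet, example_do_tasklet, 0);

static void example_do_tasklet(unsigned long data)
{
    /* deferred work: runs in atomic (softirq) context, must not sleep */
}

static irqreturn_t example_irq(int irq, void *dev_id)
{
    /* acknowledge the hardware quickly, then defer the rest */
    tasklet_schedule(&example_tasklet);
    return IRQ_HANDLED;
}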


Work Queues
Work queues allow kernel code to request that a function be called at some future time. They differ from tasklets in that:
Work queue functions run in the context of a special kernel process
These functions can therefore sleep
Kernel code can request that the execution of a work queue function be delayed for an explicit interval

The key difference between tasklets and work queues is that tasklets execute quickly, soon after being scheduled, and must be atomic; none of this holds for work queue functions. A sketch follows below.
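A minimal sketch using the shared (default) work queue. The work-queue API changed around kernel 2.6.20, so this assumes the newer form in which the work function receives the work_struct itself; the names and the one-second delay are illustrative assumptions:

#include <linux/workqueue.h>
#include <linux/jiffies.h>

static void example_work_fn(struct work_struct *work)
{
    /* runs in the context of a kernel worker thread, so it may sleep */
}

static DECLARE_WORK(example_work, example_work_fn);
static DECLARE_DELAYED_WORK(example_dwork, example_work_fn);

static void kick_off_work(void)
{
    schedule_work(&example_work);               /* run as soon as convenient */
    schedule_delayed_work(&example_dwork, HZ);  /* run roughly one second later */
}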
