6. Linux Multitasking
6.1 Overview
This section will analyze data structures--the mechanism used to manage multitasking environment under Linux.
Task States
A Linux Task can be one of the following states (according to [include/linux.h]):
- TASK_RUNNING, it means that it is in the "Ready List"
- TASK_INTERRUPTIBLE, task waiting for a signal or a resource (sleeping)
- TASK_UNINTERRUPTIBLE, task waiting for a resource (sleeping), it is in same "Wait Queue"
- TASK_ZOMBIE, task child without father
- TASK_STOPPED, task being debugged
Graphical Interaction
______________ CPU Available ______________ | | ----------------> | | | TASK_RUNNING | | Real Running | |______________| <---------------- |______________| CPU Busy | /|\ Waiting for | | Resource Resource | | Available \|/ | ______________________ | | | TASK_INTERRUPTIBLE / | | TASK-UNINTERRUPTIBLE | |______________________| Main Multitasking Flow
6.2 Timeslice
PIT 8253 Programming
Each 10 ms (depending on HZ value) an IRQ0 comes, which helps us in a multitasking environment. This signal comes from PIC 8259 (in arch 386+) which is connected to PIT 8253 with a clock of 1.19318 MHz.
_____ ______ ______ | CPU |<------| 8259 |------| 8253 | |_____| IRQ0 |______| |___/|\| |_____ CLK 1.193.180 MHz // From include/asm/param.h #ifndef HZ #define HZ 100 #endif // From include/asm/timex.h #define CLOCK_TICK_RATE 1193180 /* Underlying HZ */ // From include/linux/timex.h #define LATCH ((CLOCK_TICK_RATE + HZ/2) / HZ) /* For divider */ // From arch/i386/kernel/i8259.c outb_p(0x34,0x43); /* binary, mode 2, LSB/MSB, ch 0 */ outb_p(LATCH & 0xff , 0x40); /* LSB */ outb(LATCH >> 8 , 0x40); /* MSB */
So we program 8253 (PIT, Programmable Interval Timer) with LATCH = (1193180/HZ) = 11931.8 when HZ=100 (default). LATCH indicates the frequency divisor factor.
LATCH = 11931.8 gives to 8253 (in output) a frequency of 1193180 / 11931.8 = 100 Hz, so period = 10ms
So Timeslice = 1/HZ.
With each Timeslice we temporarily interrupt current process execution (without task switching), and we do some housekeeping work, after which we'll return back to our previous process.
Linux Timer IRQ ICA
Linux Timer IRQ IRQ 0 [Timer] | \|/ |IRQ0x00_interrupt // wrapper IRQ handler |SAVE_ALL --- |do_IRQ | wrapper routines |handle_IRQ_event --- |handler() -> timer_interrupt // registered IRQ 0 handler |do_timer_interrupt |do_timer |jiffies++; |update_process_times |if (--counter <= 0) { // if time slice ended then |counter = 0; // reset counter |need_resched = 1; // prepare to reschedule |} |do_softirq |while (need_resched) { // if necessary |schedule // reschedule |handle_softirq |} |RESTORE_ALL
Functions can be found under:
- IRQ0x00_interrupt, SAVE_ALL [include/asm/hw_irq.h]
- do_IRQ, handle_IRQ_event [arch/i386/kernel/irq.c]
- timer_interrupt, do_timer_interrupt [arch/i386/kernel/time.c]
- do_timer, update_process_times [kernel/timer.c]
- do_softirq [kernel/soft_irq.c]
- RESTORE_ALL, while loop [arch/i386/kernel/entry.S]
Notes:
- Function "IRQ0x00_interrupt" (like others IRQ0xXY_interrupt) is directly pointed by IDT (Interrupt Descriptor Table, similar to Real Mode Interrupt Vector Table, see Cap 11 for more), so EVERY interrupt coming to the processor is managed by "IRQ0x#NR_interrupt" routine, where #NR is the interrupt number. We refer to it as "wrapper irq handler".
- wrapper routines are executed, like "do_IRQ","handle_IRQ_event" [arch/i386/kernel/irq.c].
- After this, control is passed to official IRQ routine (pointed by "handler()"), previously registered with "request_irq" [arch/i386/kernel/irq.c], in this case "timer_interrupt" [arch/i386/kernel/time.c].
- "timer_interrupt" [arch/i386/kernel/time.c] routine is executed and, when it ends,
- control backs to some assembler routines [arch/i386/kernel/entry.S].
Description:
To manage Multitasking, Linux (like every other Unix) uses a ''counter'' variable to keep track of how much CPU was used by the task. So, on each IRQ 0, the counter is decremented (point 4) and, when it reaches 0, we need to switch task to manage timesharing (point 4 "need_resched" variable is set to 1, then, in point 5 assembler routines control "need_resched" and call, if needed, "schedule" [kernel/sched.c]).
6.3 Scheduler
The scheduler is the piece of code that chooses what Task has to be executed at a given time.
Any time you need to change running task, select a candidate. Below is the ''schedule [kernel/sched.c]'' function.
|schedule |do_softirq // manages post-IRQ work |for each task |calculate counter |prepare_to__switch // does anything |switch_mm // change Memory context (change CR3 value) |switch_to (assembler) |SAVE ESP |RESTORE future_ESP |SAVE EIP |push future_EIP *** push parameter as we did a call |jmp __switch_to (it does some TSS work) |__switch_to() .. |ret *** ret from call using future_EIP in place of call address new_task
6.4 Bottom Half, Task Queues. and Tasklets
Overview
In classic Unix, when an IRQ comes (from a device), Unix makes "task switching" to interrogate the task that requested the device.
To improve performance, Linux can postpone the non-urgent work until later, to better manage high speed event.
This feature is managed since kernel 1.x by the "bottom half" (BH). The irq handler "marks" a bottom half, to be executed later, in scheduling time.
In the latest kernels there is a "task queue"that is more dynamic than BH and there is also a "tasklet" to manage multiprocessor environments.
BH schema is:
- Declaration
- Mark
- Execution
Declaration
#define DECLARE_TASK_QUEUE(q) LIST_HEAD(q) #define LIST_HEAD(name) \ struct list_head name = LIST_HEAD_INIT(name) struct list_head { struct list_head *next, *prev; }; #define LIST_HEAD_INIT(name) { &(name), &(name) } ''DECLARE_TASK_QUEUE'' [include/linux/tqueue.h, include/linux/list.h]
"DECLARE_TASK_QUEUE(q)" macro is used to declare a structure named "q" managing task queue.
Mark
Here is the ICA schema for "mark_bh" [include/linux/interrupt.h] function:
|mark_bh(NUMBER) |tasklet_hi_schedule(bh_task_vec + NUMBER) |insert into tasklet_hi_vec |__cpu_raise_softirq(HI_SOFTIRQ) |soft_active |= (1 << HI_SOFTIRQ) ''mark_bh''[include/linux/interrupt.h]
For example, when an IRQ handler wants to "postpone" some work, it would "mark_bh(NUMBER)", where NUMBER is a BH declarated (see section before).
Execution
We can see this calling from "do_IRQ" [arch/i386/kernel/irq.c] function:
|do_softirq |h->action(h)-> softirq_vec[TASKLET_SOFTIRQ]->action -> tasklet_action |tasklet_vec[0].list->func
"h->action(h);" is the function has been previously queued.
6.5 Very low level routines
set_intr_gate
set_trap_gate
set_task_gate (not used).
(*interrupt)[NR_IRQS](void) = { IRQ0x00_interrupt, IRQ0x01_interrupt, ..}
NR_IRQS = 224 [kernel 2.4.2]
6.6 Task Switching
When does Task switching occur?
Now we'll see how the Linux Kernel switchs from one task to another.
Task Switching is needed in many cases, such as the following:
- when TimeSlice ends, we need to give access to some other task
- when a task decide to access a resource, it sleeps for it, so we have to choose another task
- when a task waits for a pipe, we have to give access to other task, which would write to pipe
Task Switching
TASK SWITCHING TRICK #define switch_to(prev,next,last) do { \ asm volatile("pushl %%esi\n\t" \ "pushl %%edi\n\t" \ "pushl %%ebp\n\t" \ "movl %%esp,%0\n\t" /* save ESP */ \ "movl %3,%%esp\n\t" /* restore ESP */ \ "movl $1f,%1\n\t" /* save EIP */ \ "pushl %4\n\t" /* restore EIP */ \ "jmp __switch_to\n" \ "1:\t" \ "popl %%ebp\n\t" \ "popl %%edi\n\t" \ "popl %%esi\n\t" \ :"=m" (prev->thread.esp),"=m" (prev->thread.eip), \ "=b" (last) \ :"m" (next->thread.esp),"m" (next->thread.eip), \ "a" (prev), "d" (next), \ "b" (prev)); \ } while (0)
Trick is here:
- ''pushl %4'' which puts future_EIP into the stack
- ''jmp __switch_to'' which execute ''__switch_to'' function, but in opposite of ''call'' we will return to valued pushed in point 1 (so new Task!)
U S E R M O D E K E R N E L M O D E | | | | | | | | | | | | Timer | | | | | | | Normal | IRQ | | | | | | | Exec |------>|Timer_Int.| | | | | | | | | .. | | | | | | \|/ | |schedule()| | Task1 Ret| | | | | |_switch_to|<-- | Address | |__________| |__________| | | | | | | | |S | | Task1 Data/Stack Task1 Code | | |w | | | | T|i | | | | a|t | | | | | | | | s|c | | | | | | Timer | | k|h | | | | | Normal | IRQ | | |i | | | | | Exec |------>|Timer_Int.| |n | | | | | | | | .. | |g | | | | | \|/ | |schedule()| | | Task2 Ret| | | | | |_switch_to|<-- | Address | |__________| |__________| |__________| |__________| Task2 Data/Stack Task2 Code Kernel Code Kernel Data/Stack
6.7 Fork
Overview
Fork is used to create another task. We start from a Task Parent, and we copy many data structures to Task Child.
| | | .. | Task Parent | | | | | | | fork |---------->| CREATE | | | /| NEW | |_________| / | TASK | / | | --- / | | --- / | .. | / | | Task Child / | | / | fork |<-/ | | |_________| Fork SysCall
What is not copied
New Task just created (''Task Child'') is almost equal to Parent (''Task Parent''), there are only few differences:
- obviously PID
- child ''fork()'' will return 0, while parent ''fork()'' will return PID of Task Child, to distinguish them each other in User Mode
- All child data pages are marked ''READ + EXECUTE'', no "WRITE'' (while parent has WRITE right for its own pages) so, when a write request comes, a ''Page Fault'' exception is generated which will create a new independent page: this mechanism is called ''Copy on Write'' (see Cap.10 for more).
Fork ICA
|sys_fork |do_fork |alloc_task_struct |__get_free_pages |p->state = TASK_UNINTERRUPTIBLE |copy_flags |p->pid = get_pid |copy_files |copy_fs |copy_sighand |copy_mm // should manage CopyOnWrite (I part) |allocate_mm |mm_init |pgd_alloc -> get_pgd_fast |get_pgd_slow |dup_mmap |copy_page_range |ptep_set_wrprotect |clear_bit // set page to read-only |copy_segments // For LDT |copy_thread |childregs->eax = 0 |p->thread.esp = childregs // child fork returns 0 |p->thread.eip = ret_from_fork // child starts from fork exit |retval = p->pid // parent fork returns child pid |SET_LINKS // insertion of task into the list pointers |nr_threads++ // Global variable |wake_up_process(p) // Now we can wake up just created child |return retval fork ICA
- sys_fork [arch/i386/kernel/process.c]
- do_fork [kernel/fork.c]
- alloc_task_struct [include/asm/processor.c]
- __get_free_pages [mm/page_alloc.c]
- get_pid [kernel/fork.c]
- copy_files
- copy_fs
- copy_sighand
- copy_mm
- allocate_mm
- mm_init
- pgd_alloc -> get_pgd_fast [include/asm/pgalloc.h]
- get_pgd_slow
- dup_mmap [kernel/fork.c]
- copy_page_range [mm/memory.c]
- ptep_set_wrprotect [include/asm/pgtable.h]
- clear_bit [include/asm/bitops.h]
- copy_segments [arch/i386/kernel/process.c]
- copy_thread
- SET_LINKS [include/linux/sched.h]
- wake_up_process [kernel/sched.c]
Copy on Write
To implement Copy on Write for Linux:
- Mark all copied pages as read-only, causing a Page Fault when a Task tries to write to them.
- Page Fault handler creates a new page.
| Page | Fault | Exception | | -----------> |do_page_fault |handle_mm_fault |handle_pte_fault |do_wp_page |alloc_page // Allocate a new page |break_cow |copy_cow_page // Copy old page to new one |establish_pte // reconfig Page Table pointers |set_pte Page Fault ICA
- do_page_fault [arch/i386/mm/fault.c]
- handle_mm_fault [mm/memory.c]
- handle_pte_fault
- do_wp_page
- alloc_page [include/linux/mm.h]
- break_cow [mm/memory.c]
- copy_cow_page
- establish_pte
- set_pte [include/asm/pgtable-3level.h]
Next Previous Contents