5. Linux Peculiarities
5.1 Overview
Linux has some peculiarities that distinguish it from other OSs. These peculiarities include:
- Pagination only
- Softirq
- Kernel threads
- Kernel modules
- ''Proc'' directory
Flexibility Elements
Points 4 and 5 (kernel modules and the ''proc'' directory) give system administrators enormous flexibility in configuring the system from user mode, allowing them to fix even critical kernel bugs or specific problems without having to reboot the machine. For example, if you needed to change something on a big server and didn't want to reboot it, you could prepare the kernel to talk to a module that you would then write.
5.2 Pagination only
Linux doesn't use segmentation to distinguish Tasks from each other; it uses pagination. (Only 2 segments are used for all Tasks, CODE and DATA/STACK.)
We can also say that an interTask page fault never occurs, because each Task uses its own set of Page Tables. There are some cases where different Tasks point to the same Page Tables, as with shared libraries: this is needed to reduce memory usage. Remember that shared libraries are CODE only, because all data is stored on the stack of the calling Task.
Linux segments
Under the Linux kernel only 4 segments exist:
- Kernel Code [0x10]
- Kernel Data / Stack [0x18]
- User Code [0x23]
- User Data / Stack [0x2b]
[syntax is ''Purpose [Segment]'']
Under Intel architecture, the segment registers used are:
- CS for Code Segment
- DS for Data Segment
- SS for Stack Segment
- ES for Alternative Segment (for example used to make a memory copy between 2 different segments)
So, every Task uses 0x23 for code and 0x2b for data/stack.
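As a side note on the selector values listed above: an x86 segment selector packs three fields, namely the descriptor index (bits 3-15), a table indicator (bit 2, where 0 means the GDT) and the requested privilege level (bits 0-1). The following user-space sketch decodes them; the struct and function names are invented for this illustration and are not kernel names.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an x86 segment selector (illustrative names only). */
struct selector_fields {
    unsigned index; /* descriptor index in the table (bits 3-15) */
    unsigned ti;    /* table indicator: 0 = GDT, 1 = LDT (bit 2) */
    unsigned rpl;   /* requested privilege level (bits 0-1)      */
};

struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;

    f.index = sel >> 3;
    f.ti    = (sel >> 2) & 1u;
    f.rpl   = sel & 3u;
    return f;
}
```

Decoded this way, 0x10 and 0x18 are GDT entries 2 and 3 at privilege level 0 (kernel), while 0x23 and 0x2b are GDT entries 4 and 5 at privilege level 3 (user).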
Linux pagination
Linux supports up to 3 levels of page tables, depending on the architecture. On Intel (i386) only 2 levels are used. Linux also supports Copy on Write mechanisms (see Chapter 10 for more information).
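To make the 2-level Intel case concrete, here is a small user-space sketch of how a 32-bit linear address is split on i386 with 4 KB pages: the top 10 bits index the page directory, the next 10 bits index a page table, and the low 12 bits are the offset inside the page. The helper function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT   12     /* 4 KB pages */
#define PGDIR_SHIFT  22     /* top 10 bits select the directory entry */
#define PTRS_PER_TBL 1024u  /* entries per directory / per page table */

/* Index into the page directory (first level). */
unsigned pgd_index_of(uint32_t linear)
{
    return linear >> PGDIR_SHIFT;
}

/* Index into the page table (second level). */
unsigned pte_index_of(uint32_t linear)
{
    return (linear >> PAGE_SHIFT) & (PTRS_PER_TBL - 1);
}

/* Byte offset inside the 4 KB page. */
unsigned page_offset_of(uint32_t linear)
{
    return linear & ((1u << PAGE_SHIFT) - 1);
}
```

For the linear address 0xC0101234 this gives directory entry 768, table entry 257 and offset 0x234.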
Why don't interTask address conflicts exist?
The answer is very simple: the linear -> physical mapping is done by pagination, separately for each Task, so the kernel only needs to assign physical pages in an unambiguous (one owner per page) fashion.
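The idea can be tried out with a toy model, which is only a sketch and not how the kernel stores its tables: every Task gets its own (here single-level) page table, and a frame is handed out only once. Two Tasks can then use the same linear address without ever touching the same physical memory. All names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGES      16    /* pages per toy address space */
#define NO_FRAME       (-1)

struct toy_task {
    int frame_of_page[TOY_PAGES]; /* per-Task table: page -> frame */
};

static int next_free_frame = 0;   /* each frame is handed out only once */

void toy_task_init(struct toy_task *t)
{
    for (int i = 0; i < TOY_PAGES; i++)
        t->frame_of_page[i] = NO_FRAME;
}

/* Translate a linear address, assigning a frame on first use. */
long toy_translate(struct toy_task *t, uint32_t linear)
{
    uint32_t page = linear >> TOY_PAGE_SHIFT;

    assert(page < TOY_PAGES);
    if (t->frame_of_page[page] == NO_FRAME)
        t->frame_of_page[page] = next_free_frame++;
    return ((long)t->frame_of_page[page] << TOY_PAGE_SHIFT)
           | (long)(linear & ((1u << TOY_PAGE_SHIFT) - 1));
}
```

If two freshly initialised toy Tasks both translate linear address 0x1234, the first gets physical address 0x234 (frame 0) and the second 0x1234 (frame 1).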
Do we need to defragment memory?
No. Page assignment is a dynamic process. We need a page only when a Task asks for it, so we pick one, in order, from the free memory pages. When we want to release the page, we only have to add it back to the free pages list.
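A free pages list of this kind can be sketched in a few lines of user-space C. This is a deliberately simplified model (the real kernel allocator is more elaborate), and every name in it is invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PAGES 4

struct page {
    struct page *next;   /* link to the next free page */
};

static struct page pages[NUM_PAGES];
static struct page *free_list = NULL;

/* Put every page on the free list, lowest page first. */
void free_list_init(void)
{
    free_list = NULL;
    for (int i = NUM_PAGES - 1; i >= 0; i--) {
        pages[i].next = free_list;
        free_list = &pages[i];
    }
}

/* Take a page off the list; returns NULL when no page is free. */
struct page *toy_alloc_page(void)
{
    struct page *p = free_list;

    if (p != NULL)
        free_list = p->next;
    return p;
}

/* Releasing a page is just pushing it back on the list. */
void toy_free_page(struct page *p)
{
    assert(p != NULL);
    p->next = free_list;
    free_list = p;
}
```

Because any free page is as good as any other, nothing ever has to be moved around: there is no defragmentation step.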
What about Kernel Pages?
Kernel pages have a problem: they can be allocated dynamically, but we have no guarantee that they form a contiguous area, because linear kernel space is equivalent to physical kernel space.
For the Code Segment there is no problem. Boot code is allocated at boot time (so we have a fixed amount of memory to allocate), and for modules we only have to allocate a memory area large enough to contain the module code.
The real problem is the stack segment, because each Task uses some kernel stack pages. Stack segments must be contiguous (by the definition of a stack), so we have to establish a maximum size for each Task's stack. If we exceed this limit bad things happen: we overwrite the kernel mode data structures of the process.
The structure of the Kernel helps us, because kernel functions are never:
- recursive
- nested (calling one another) more than N levels deep.
Once we know N, and we know the average size of the local variables of all kernel functions, we can estimate a stack limit.
If you want to try the problem out, you can create a module containing a function that calls itself many times. After a certain number of calls, the kernel will hang in the page fault exception handler (typically because of a write to a read-only page).
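The estimate described above is plain arithmetic, sketched here as a helper. The function name and the numbers in the usage note are assumptions chosen for illustration, not values taken from the kernel sources.

```c
#include <assert.h>

/* Deepest call nesting that fits in the stack area, given how many
 * bytes are reserved for other data and the average stack frame size.
 * Returns 0 if nothing fits. */
unsigned long max_call_depth(unsigned long stack_bytes,
                             unsigned long reserved_bytes,
                             unsigned long avg_frame_bytes)
{
    assert(avg_frame_bytes > 0);
    if (reserved_bytes >= stack_bytes)
        return 0;
    return (stack_bytes - reserved_bytes) / avg_frame_bytes;
}
```

With an assumed 8192-byte area, 2048 bytes reserved and 64-byte frames, max_call_depth(8192, 2048, 64) yields 96 nested calls; a recursive function goes past such a bound very quickly.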
5.3 Softirq
When an IRQ arrives, task switching is deferred to get better performance. Some jobs (that would have to be done right after the IRQ and that could take a lot of CPU in interrupt time, like building up a TCP/IP packet) are queued and done at scheduling time (once a time-slice ends).
In recent kernels (2.4.x) the softirq mechanism is handed to a kernel thread: ''ksoftirqd_CPUn'', where n is the number of the CPU executing the kernel thread (on a single-processor system ''ksoftirqd_CPU0'' uses PID 3).
Preparing Softirq
Enabling Softirq
''cpu_raise_softirq'' is a routine that wakes up the ''ksoftirqd_CPU0'' kernel thread, to let it manage the enqueued job.
|cpu_raise_softirq
   |__cpu_raise_softirq
   |wakeup_softirqd
      |wake_up_process
- cpu_raise_softirq [kernel/softirq.c]
- __cpu_raise_softirq [include/linux/interrupt.h]
- wakeup_softirqd [kernel/softirq.c]
- wake_up_process [kernel/sched.c]
The ''__cpu_raise_softirq'' routine sets the appropriate bit in the vector describing pending softirqs.
''wakeup_softirqd'' uses ''wake_up_process'' to wake up the ''ksoftirqd_CPU0'' kernel thread.
Executing Softirq
TODO: describing data structures involved in softirq mechanism.
When the kernel thread ''ksoftirqd_CPU0'' has been woken up, it executes the queued jobs.
The code of ''ksoftirqd_CPU0'' is (main endless loop):
   for (;;) {
      if (!softirq_pending(cpu))
         schedule();
      __set_current_state(TASK_RUNNING);
      while (softirq_pending(cpu)) {
         do_softirq();
         if (current->need_resched)
            schedule();
      }
      __set_current_state(TASK_INTERRUPTIBLE);
   }
- ksoftirqd [kernel/softirq.c]
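The raise/execute split described in this section can be imitated in user space with a bitmask of pending jobs. This is only a toy model under simplifying assumptions (one CPU, no real interrupts, no sleeping thread); all ''toy_'' names are invented and only loosely mirror the kernel routines named above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_TOY_SOFTIRQS 32

typedef void (*toy_handler_t)(void);

static uint32_t pending;                       /* one bit per softirq */
static toy_handler_t handlers[NR_TOY_SOFTIRQS];

/* Install the handler for softirq number nr. */
void toy_open_softirq(unsigned nr, toy_handler_t fn)
{
    assert(nr < NR_TOY_SOFTIRQS);
    handlers[nr] = fn;
}

/* Like __cpu_raise_softirq: only mark the job as pending. */
void toy_raise_softirq(unsigned nr)
{
    assert(nr < NR_TOY_SOFTIRQS);
    pending |= (uint32_t)1 << nr;
}

/* Like the inner loop of ksoftirqd: run handlers until nothing is
 * pending. Returns the number of handlers executed. */
unsigned toy_do_softirq(void)
{
    unsigned ran = 0;

    while (pending != 0) {
        uint32_t snapshot = pending;

        pending = 0;            /* handlers may raise new softirqs */
        for (unsigned nr = 0; nr < NR_TOY_SOFTIRQS; nr++) {
            if ((snapshot & ((uint32_t)1 << nr)) && handlers[nr] != NULL) {
                handlers[nr]();
                ran++;
            }
        }
    }
    return ran;
}

/* Example handler: counts how many times it ran. */
static unsigned rx_runs;

static void toy_net_rx(void)
{
    rx_runs++;
}

/* Raise the same softirq twice, then process it; returns run count. */
unsigned toy_demo(void)
{
    toy_open_softirq(3, toy_net_rx);
    toy_raise_softirq(3);
    toy_raise_softirq(3);       /* raising twice still sets one bit */
    toy_do_softirq();
    return rx_runs;
}
```

The point of the model is that raising is cheap (one bit is set) and can happen in interrupt context, while the expensive handler runs later, once per pending bit.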
5.4 Kernel Threads
Even though Linux is a monolithic OS, a few ''kernel threads'' exist to do housekeeping work.
These Tasks don't use USER memory; they share KERNEL memory. They also operate at the highest privilege level (RING 0 on an i386 architecture), like any other piece of kernel mode code.
Kernel threads are created by the ''kernel_thread'' [arch/i386/kernel/process.c] function, which invokes the ''clone'' [arch/i386/kernel/process.c] system call from assembler (''clone'' is a ''fork''-like system call):
int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags)
{
        long retval, d0;

        __asm__ __volatile__(
                "movl %%esp,%%esi\n\t"
                "int $0x80\n\t"         /* Linux/i386 system call */
                "cmpl %%esp,%%esi\n\t"  /* child or parent? */
                "je 1f\n\t"             /* parent - jump */
                /* Load the argument into eax, and push it.  That way, it does
                 * not matter whether the called function is compiled with
                 * -mregparm or not.  */
                "movl %4,%%eax\n\t"
                "pushl %%eax\n\t"
                "call *%5\n\t"          /* call fn */
                "movl %3,%0\n\t"        /* exit */
                "int $0x80\n"
                "1:\t"
                :"=&a" (retval), "=&S" (d0)
                :"0" (__NR_clone), "i" (__NR_exit),
                 "r" (arg), "r" (fn),
                 "b" (flags | CLONE_VM)
                : "memory");
        return retval;
}
Once called, we have a new Task (usually with a very low PID, like 2, 3, etc.) waiting for a very slow resource, like swap or a USB event. A very slow resource is used because otherwise we would have task switching overhead.
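The effect of the CLONE_VM flag that ''kernel_thread'' adds can be observed from user mode with glibc's clone() wrapper. The following is only a user-space analogy, Linux-specific, and not how kernel threads are started; ''child_fn'' and ''run_shared_vm_child'' are names made up for the sketch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for clone() and CLONE_VM; must precede all includes */
#endif
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_BYTES (64 * 1024)

static volatile int shared_value = 0;

/* Runs in the child; with CLONE_VM the store is visible to the parent. */
static int child_fn(void *arg)
{
    shared_value = *(int *)arg;
    return 0;
}

/* Start a child sharing our address space, wait for it, and return
 * what it stored in shared_value (or -1 on error). */
int run_shared_vm_child(int value)
{
    char *stack = malloc(CHILD_STACK_BYTES);
    pid_t pid;

    if (stack == NULL)
        return -1;
    /* The stack grows downwards on x86, so pass its highest address. */
    pid = clone(child_fn, stack + CHILD_STACK_BYTES,
                CLONE_VM | SIGCHLD, &value);
    if (pid == -1 || waitpid(pid, NULL, 0) == -1) {
        free(stack);
        return -1;
    }
    free(stack);
    return shared_value;
}
```

Calling run_shared_vm_child(42) is expected to return 42, because parent and child write to the same memory; without CLONE_VM the child would modify only its own copy.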
Below is a list of the most common kernel threads (from the ''ps x'' command):
PID      COMMAND
  1        init
  2        keventd
  3        kswapd
  4        kreclaimd
  5        bdflush
  6        kupdated
  7        kacpid
 67        khubd
The 'init' kernel thread is the first process created, at boot time. It starts all other User Mode Tasks (from the file /etc/inittab) like console daemons, tty daemons and network daemons (''rc'' scripts).
Example of Kernel Threads: kswapd [mm/vmscan.c].
''kswapd'' is created by ''clone() [arch/i386/kernel/process.c]''
Initialisation routines:
|do_initcalls
   |kswapd_init
      |kernel_thread
         |syscall fork (in assembler)
- do_initcalls [init/main.c]
- kswapd_init [mm/vmscan.c]
- kernel_thread [arch/i386/kernel/process.c]
5.5 Kernel Modules
Overview
Linux kernel modules are pieces of code (for example filesystem, network and hardware drivers) running in kernel mode that you can add at runtime.
The core of Linux cannot be modularized: scheduling, interrupt management, the core network code, and so on.
Under "/lib/modules/KERNEL_VERSION/" you can find all the modules installed on your system.
Module loading and unloading
To load a module, type the following:
   insmod MODULE_NAME parameters

   example: insmod ne io=0x300 irq=9
NOTE: You can use modprobe in place of insmod if you want some parameters to be looked up automatically (for example when using a PCI driver, or if you have specified parameters in the /etc/conf.modules file).
To unload a module, type the following:
rmmod MODULE_NAME
Module definition
A module always contains:
- "init_module" function, executed at insmod (or modprobe) command
- "cleanup_module" function, executed at rmmod command
If functions with these names are not in the module, you need to add 2 macros to specify which functions act as the module's init and exit routines:
- module_init(FUNCTION_NAME)
- module_exit(FUNCTION_NAME)
NOTE: a module can "see" a kernel variable only if it has been exported (with macro EXPORT_SYMBOL).
A useful trick for adding flexibility to your kernel
// kernel sources side
// (the pointer must be exported with EXPORT_SYMBOL to be visible to modules)
void (*foo_function_pointer)(void *);

if (foo_function_pointer)
  (foo_function_pointer)(parameter);

// module side
extern void (*foo_function_pointer)(void *);

void my_function(void *parameter) {
  //My code
}

int init_module() {
  foo_function_pointer = &my_function;
  return 0;  // 0 tells insmod that loading succeeded
}

void cleanup_module() {
  foo_function_pointer = NULL;
}
This simple trick gives you very high flexibility in your kernel, because the "my_function" routine is executed only once you have loaded the module. This routine can do anything you want: for example the ''rshaper'' module, which controls the bandwidth of incoming network traffic, works in this kind of manner.
Notice that the whole module mechanism is possible thanks to some global variables exported to modules, such as list heads (allowing you to extend a list as much as you want). Typical examples are fs and generic devices (char, block, net, telephony). You have to prepare the kernel to accept your new module; in some cases you have to create an infrastructure (like the telephony one, which was recently created) to be as standard as possible.
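The "exported list head" idea can be modelled in user space: the core owns a list head, and each module links its own entry into the list at load time and unlinks it at unload time. The sketch below is loosely inspired by how filesystems register themselves, but every name in it is hypothetical and the real kernel code differs (locking, for one, is left out).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_fs_type {
    const char *name;
    struct toy_fs_type *next;
};

static struct toy_fs_type *toy_fs_list = NULL;  /* the "exported" head */

/* Append an entry; returns 0, or -1 if the name is already present. */
int toy_register_fs(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    assert(fs != NULL && fs->name != NULL);
    for (p = &toy_fs_list; *p != NULL; p = &(*p)->next)
        if (strcmp((*p)->name, fs->name) == 0)
            return -1;
    fs->next = NULL;
    *p = fs;
    return 0;
}

/* Unlink an entry; returns 0, or -1 if it was not in the list. */
int toy_unregister_fs(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    for (p = &toy_fs_list; *p != NULL; p = &(*p)->next) {
        if (*p == fs) {
            *p = fs->next;
            fs->next = NULL;
            return 0;
        }
    }
    return -1;
}

/* Look an entry up by name, as the core would when it needs it. */
struct toy_fs_type *toy_find_fs(const char *name)
{
    struct toy_fs_type *p;

    for (p = toy_fs_list; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}
```

In this model a module's init routine would call toy_register_fs() and its cleanup routine toy_unregister_fs(); the core never needs to know in advance which entries will exist.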
5.6 Proc directory
Proc fs is located in the /proc directory, a special directory that allows you to talk directly with the kernel.
Linux uses the ''proc'' directory to support direct kernel communication. This is necessary in many cases: for example when you want to see the main process data structures, enable the ''proxy-arp'' feature on one interface and not on others, change the maximum number of threads, or debug some bus state, like ISA or PCI, to know which cards are installed and which I/O addresses and IRQs are assigned to them.
|-- bus
|   |-- pci
|   |   |-- 00
|   |   |   |-- 00.0
|   |   |   |-- 01.0
|   |   |   |-- 07.0
|   |   |   |-- 07.1
|   |   |   |-- 07.2
|   |   |   |-- 07.3
|   |   |   |-- 07.4
|   |   |   |-- 07.5
|   |   |   |-- 09.0
|   |   |   |-- 0a.0
|   |   |   `-- 0f.0
|   |   |-- 01
|   |   |   `-- 00.0
|   |   `-- devices
|   `-- usb
|-- cmdline
|-- cpuinfo
|-- devices
|-- dma
|-- dri
|   `-- 0
|       |-- bufs
|       |-- clients
|       |-- mem
|       |-- name
|       |-- queues
|       |-- vm
|       `-- vma
|-- driver
|-- execdomains
|-- filesystems
|-- fs
|-- ide
|   |-- drivers
|   |-- hda -> ide0/hda
|   |-- hdc -> ide1/hdc
|   |-- ide0
|   |   |-- channel
|   |   |-- config
|   |   |-- hda
|   |   |   |-- cache
|   |   |   |-- capacity
|   |   |   |-- driver
|   |   |   |-- geometry
|   |   |   |-- identify
|   |   |   |-- media
|   |   |   |-- model
|   |   |   |-- settings
|   |   |   |-- smart_thresholds
|   |   |   `-- smart_values
|   |   |-- mate
|   |   `-- model
|   |-- ide1
|   |   |-- channel
|   |   |-- config
|   |   |-- hdc
|   |   |   |-- capacity
|   |   |   |-- driver
|   |   |   |-- identify
|   |   |   |-- media
|   |   |   |-- model
|   |   |   `-- settings
|   |   |-- mate
|   |   `-- model
|   `-- via
|-- interrupts
|-- iomem
|-- ioports
|-- irq
|   |-- 0
|   |-- 1
|   |-- 10
|   |-- 11
|   |-- 12
|   |-- 13
|   |-- 14
|   |-- 15
|   |-- 2
|   |-- 3
|   |-- 4
|   |-- 5
|   |-- 6
|   |-- 7
|   |-- 8
|   |-- 9
|   `-- prof_cpu_mask
|-- kcore
|-- kmsg
|-- ksyms
|-- loadavg
|-- locks
|-- meminfo
|-- misc
|-- modules
|-- mounts
|-- mtrr
|-- net
|   |-- arp
|   |-- dev
|   |-- dev_mcast
|   |-- ip_fwchains
|   |-- ip_fwnames
|   |-- ip_masquerade
|   |-- netlink
|   |-- netstat
|   |-- packet
|   |-- psched
|   |-- raw
|   |-- route
|   |-- rt_acct
|   |-- rt_cache
|   |-- rt_cache_stat
|   |-- snmp
|   |-- sockstat
|   |-- softnet_stat
|   |-- tcp
|   |-- udp
|   |-- unix
|   `-- wireless
|-- partitions
|-- pci
|-- scsi
|   |-- ide-scsi
|   |   `-- 0
|   `-- scsi
|-- self -> 2069
|-- slabinfo
|-- stat
|-- swaps
|-- sys
|   |-- abi
|   |   |-- defhandler_coff
|   |   |-- defhandler_elf
|   |   |-- defhandler_lcall7
|   |   |-- defhandler_libcso
|   |   |-- fake_utsname
|   |   `-- trace
|   |-- debug
|   |-- dev
|   |   |-- cdrom
|   |   |   |-- autoclose
|   |   |   |-- autoeject
|   |   |   |-- check_media
|   |   |   |-- debug
|   |   |   |-- info
|   |   |   `-- lock
|   |   `-- parport
|   |       |-- default
|   |       |   |-- spintime
|   |       |   `-- timeslice
|   |       `-- parport0
|   |           |-- autoprobe
|   |           |-- autoprobe0
|   |           |-- autoprobe1
|   |           |-- autoprobe2
|   |           |-- autoprobe3
|   |           |-- base-addr
|   |           |-- devices
|   |           |   |-- active
|   |           |   `-- lp
|   |           |       `-- timeslice
|   |           |-- dma
|   |           |-- irq
|   |           |-- modes
|   |           `-- spintime
|   |-- fs
|   |   |-- binfmt_misc
|   |   |-- dentry-state
|   |   |-- dir-notify-enable
|   |   |-- dquot-nr
|   |   |-- file-max
|   |   |-- file-nr
|   |   |-- inode-nr
|   |   |-- inode-state
|   |   |-- jbd-debug
|   |   |-- lease-break-time
|   |   |-- leases-enable
|   |   |-- overflowgid
|   |   `-- overflowuid
|   |-- kernel
|   |   |-- acct
|   |   |-- cad_pid
|   |   |-- cap-bound
|   |   |-- core_uses_pid
|   |   |-- ctrl-alt-del
|   |   |-- domainname
|   |   |-- hostname
|   |   |-- modprobe
|   |   |-- msgmax
|   |   |-- msgmnb
|   |   |-- msgmni
|   |   |-- osrelease
|   |   |-- ostype
|   |   |-- overflowgid
|   |   |-- overflowuid
|   |   |-- panic
|   |   |-- printk
|   |   |-- random
|   |   |   |-- boot_id
|   |   |   |-- entropy_avail
|   |   |   |-- poolsize
|   |   |   |-- read_wakeup_threshold
|   |   |   |-- uuid
|   |   |   `-- write_wakeup_threshold
|   |   |-- rtsig-max
|   |   |-- rtsig-nr
|   |   |-- sem
|   |   |-- shmall
|   |   |-- shmmax
|   |   |-- shmmni
|   |   |-- sysrq
|   |   |-- tainted
|   |   |-- threads-max
|   |   `-- version
|   |-- net
|   |   |-- 802
|   |   |-- core
|   |   |   |-- hot_list_length
|   |   |   |-- lo_cong
|   |   |   |-- message_burst
|   |   |   |-- message_cost
|   |   |   |-- mod_cong
|   |   |   |-- netdev_max_backlog
|   |   |   |-- no_cong
|   |   |   |-- no_cong_thresh
|   |   |   |-- optmem_max
|   |   |   |-- rmem_default
|   |   |   |-- rmem_max
|   |   |   |-- wmem_default
|   |   |   `-- wmem_max
|   |   |-- ethernet
|   |   |-- ipv4
|   |   |   |-- conf
|   |   |   |   |-- all
|   |   |   |   |   |-- accept_redirects
|   |   |   |   |   |-- accept_source_route
|   |   |   |   |   |-- arp_filter
|   |   |   |   |   |-- bootp_relay
|   |   |   |   |   |-- forwarding
|   |   |   |   |   |-- log_martians
|   |   |   |   |   |-- mc_forwarding
|   |   |   |   |   |-- proxy_arp
|   |   |   |   |   |-- rp_filter
|   |   |   |   |   |-- secure_redirects
|   |   |   |   |   |-- send_redirects
|   |   |   |   |   |-- shared_media
|   |   |   |   |   `-- tag
|   |   |   |   |-- default
|   |   |   |   |   |-- accept_redirects
|   |   |   |   |   |-- accept_source_route
|   |   |   |   |   |-- arp_filter
|   |   |   |   |   |-- bootp_relay
|   |   |   |   |   |-- forwarding
|   |   |   |   |   |-- log_martians
|   |   |   |   |   |-- mc_forwarding
|   |   |   |   |   |-- proxy_arp
|   |   |   |   |   |-- rp_filter
|   |   |   |   |   |-- secure_redirects
|   |   |   |   |   |-- send_redirects
|   |   |   |   |   |-- shared_media
|   |   |   |   |   `-- tag
|   |   |   |   |-- eth0
|   |   |   |   |   |-- accept_redirects
|   |   |   |   |   |-- accept_source_route
|   |   |   |   |   |-- arp_filter
|   |   |   |   |   |-- bootp_relay
|   |   |   |   |   |-- forwarding
|   |   |   |   |   |-- log_martians
|   |   |   |   |   |-- mc_forwarding
|   |   |   |   |   |-- proxy_arp
|   |   |   |   |   |-- rp_filter
|   |   |   |   |   |-- secure_redirects
|   |   |   |   |   |-- send_redirects
|   |   |   |   |   |-- shared_media
|   |   |   |   |   `-- tag
|   |   |   |   |-- eth1
|   |   |   |   |   |-- accept_redirects
|   |   |   |   |   |-- accept_source_route
|   |   |   |   |   |-- arp_filter
|   |   |   |   |   |-- bootp_relay
|   |   |   |   |   |-- forwarding
|   |   |   |   |   |-- log_martians
|   |   |   |   |   |-- mc_forwarding
|   |   |   |   |   |-- proxy_arp
|   |   |   |   |   |-- rp_filter
|   |   |   |   |   |-- secure_redirects
|   |   |   |   |   |-- send_redirects
|   |   |   |   |   |-- shared_media
|   |   |   |   |   `-- tag
|   |   |   |   `-- lo
|   |   |   |       |-- accept_redirects
|   |   |   |       |-- accept_source_route
|   |   |   |       |-- arp_filter
|   |   |   |       |-- bootp_relay
|   |   |   |       |-- forwarding
|   |   |   |       |-- log_martians
|   |   |   |       |-- mc_forwarding
|   |   |   |       |-- proxy_arp
|   |   |   |       |-- rp_filter
|   |   |   |       |-- secure_redirects
|   |   |   |       |-- send_redirects
|   |   |   |       |-- shared_media
|   |   |   |       `-- tag
|   |   |   |-- icmp_echo_ignore_all
|   |   |   |-- icmp_echo_ignore_broadcasts
|   |   |   |-- icmp_ignore_bogus_error_responses
|   |   |   |-- icmp_ratelimit
|   |   |   |-- icmp_ratemask
|   |   |   |-- inet_peer_gc_maxtime
|   |   |   |-- inet_peer_gc_mintime
|   |   |   |-- inet_peer_maxttl
|   |   |   |-- inet_peer_minttl
|   |   |   |-- inet_peer_threshold
|   |   |   |-- ip_autoconfig
|   |   |   |-- ip_conntrack_max
|   |   |   |-- ip_default_ttl
|   |   |   |-- ip_dynaddr
|   |   |   |-- ip_forward
|   |   |   |-- ip_local_port_range
|   |   |   |-- ip_no_pmtu_disc
|   |   |   |-- ip_nonlocal_bind
|   |   |   |-- ipfrag_high_thresh
|   |   |   |-- ipfrag_low_thresh
|   |   |   |-- ipfrag_time
|   |   |   |-- neigh
|   |   |   |   |-- default
|   |   |   |   |   |-- anycast_delay
|   |   |   |   |   |-- app_solicit
|   |   |   |   |   |-- base_reachable_time
|   |   |   |   |   |-- delay_first_probe_time
|   |   |   |   |   |-- gc_interval
|   |   |   |   |   |-- gc_stale_time
|   |   |   |   |   |-- gc_thresh1
|   |   |   |   |   |-- gc_thresh2
|   |   |   |   |   |-- gc_thresh3
|   |   |   |   |   |-- locktime
|   |   |   |   |   |-- mcast_solicit
|   |   |   |   |   |-- proxy_delay
|   |   |   |   |   |-- proxy_qlen
|   |   |   |   |   |-- retrans_time
|   |   |   |   |   |-- ucast_solicit
|   |   |   |   |   `-- unres_qlen
|   |   |   |   |-- eth0
|   |   |   |   |   |-- anycast_delay
|   |   |   |   |   |-- app_solicit
|   |   |   |   |   |-- base_reachable_time
|   |   |   |   |   |-- delay_first_probe_time
|   |   |   |   |   |-- gc_stale_time
|   |   |   |   |   |-- locktime
|   |   |   |   |   |-- mcast_solicit
|   |   |   |   |   |-- proxy_delay
|   |   |   |   |   |-- proxy_qlen
|   |   |   |   |   |-- retrans_time
|   |   |   |   |   |-- ucast_solicit
|   |   |   |   |   `-- unres_qlen
|   |   |   |   |-- eth1
|   |   |   |   |   |-- anycast_delay
|   |   |   |   |   |-- app_solicit
|   |   |   |   |   |-- base_reachable_time
|   |   |   |   |   |-- delay_first_probe_time
|   |   |   |   |   |-- gc_stale_time
|   |   |   |   |   |-- locktime
|   |   |   |   |   |-- mcast_solicit
|   |   |   |   |   |-- proxy_delay
|   |   |   |   |   |-- proxy_qlen
|   |   |   |   |   |-- retrans_time
|   |   |   |   |   |-- ucast_solicit
|   |   |   |   |   `-- unres_qlen
|   |   |   |   `-- lo
|   |   |   |       |-- anycast_delay
|   |   |   |       |-- app_solicit
|   |   |   |       |-- base_reachable_time
|   |   |   |       |-- delay_first_probe_time
|   |   |   |       |-- gc_stale_time
|   |   |   |       |-- locktime
|   |   |   |       |-- mcast_solicit
|   |   |   |       |-- proxy_delay
|   |   |   |       |-- proxy_qlen
|   |   |   |       |-- retrans_time
|   |   |   |       |-- ucast_solicit
|   |   |   |       `-- unres_qlen
|   |   |   |-- route
|   |   |   |   |-- error_burst
|   |   |   |   |-- error_cost
|   |   |   |   |-- flush
|   |   |   |   |-- gc_elasticity
|   |   |   |   |-- gc_interval
|   |   |   |   |-- gc_min_interval
|   |   |   |   |-- gc_thresh
|   |   |   |   |-- gc_timeout
|   |   |   |   |-- max_delay
|   |   |   |   |-- max_size
|   |   |   |   |-- min_adv_mss
|   |   |   |   |-- min_delay
|   |   |   |   |-- min_pmtu
|   |   |   |   |-- mtu_expires
|   |   |   |   |-- redirect_load
|   |   |   |   |-- redirect_number
|   |   |   |   `-- redirect_silence
|   |   |   |-- tcp_abort_on_overflow
|   |   |   |-- tcp_adv_win_scale
|   |   |   |-- tcp_app_win
|   |   |   |-- tcp_dsack
|   |   |   |-- tcp_ecn
|   |   |   |-- tcp_fack
|   |   |   |-- tcp_fin_timeout
|   |   |   |-- tcp_keepalive_intvl
|   |   |   |-- tcp_keepalive_probes
|   |   |   |-- tcp_keepalive_time
|   |   |   |-- tcp_max_orphans
|   |   |   |-- tcp_max_syn_backlog
|   |   |   |-- tcp_max_tw_buckets
|   |   |   |-- tcp_mem
|   |   |   |-- tcp_orphan_retries
|   |   |   |-- tcp_reordering
|   |   |   |-- tcp_retrans_collapse
|   |   |   |-- tcp_retries1
|   |   |   |-- tcp_retries2
|   |   |   |-- tcp_rfc1337
|   |   |   |-- tcp_rmem
|   |   |   |-- tcp_sack
|   |   |   |-- tcp_stdurg
|   |   |   |-- tcp_syn_retries
|   |   |   |-- tcp_synack_retries
|   |   |   |-- tcp_syncookies
|   |   |   |-- tcp_timestamps
|   |   |   |-- tcp_tw_recycle
|   |   |   |-- tcp_window_scaling
|   |   |   `-- tcp_wmem
|   |   `-- unix
|   |       `-- max_dgram_qlen
|   |-- proc
|   `-- vm
|       |-- bdflush
|       |-- kswapd
|       |-- max-readahead
|       |-- min-readahead
|       |-- overcommit_memory
|       |-- page-cluster
|       `-- pagetable_cache
|-- sysvipc
|   |-- msg
|   |-- sem
|   `-- shm
|-- tty
|   |-- driver
|   |   `-- serial
|   |-- drivers
|   |-- ldisc
|   `-- ldiscs
|-- uptime
`-- version
The directory also contains one entry per task, using the PID as file name (giving you access to all Task information, like the path of the binary file, memory used, and so on).
The interesting point is that you can not only see kernel values (for example, info about any task or about the network options enabled in your TCP/IP stack) but also modify some of them, typically the ones under the /proc/sys directory:
/proc/sys/
          acpi
          dev
          debug
          fs
          proc
          net
          vm
          kernel
/proc/sys/kernel
Below are very important and well-known kernel values, ready to be modified:
overflowgid
overflowuid
random
threads-max // Max number of threads, typically 16384
sysrq // kernel hack: you can view instant register values and more
sem
msgmnb
msgmni
msgmax
shmmni
shmall
shmmax
rtsig-max
rtsig-nr
modprobe // modprobe file location
printk
ctrl-alt-del
cap-bound
panic
domainname // domain name of your Linux box
hostname // host name of your Linux box
version // date info about kernel compilation
osrelease // kernel version (e.g. 2.4.5)
ostype // Linux!
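Since these entries behave like small text files, a program can read and change them with ordinary file I/O. The two helpers below are a minimal sketch of that; their names are made up, and writing under /proc/sys normally requires root privileges.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read the first line of `path` into buf, without the trailing newline.
 * Returns 0 on success, -1 on error. */
int read_value(const char *path, char *buf, size_t len)
{
    FILE *f;

    assert(buf != NULL && len > 0);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Write `text` followed by a newline, as `echo text > path` would.
 * Returns 0 on success, -1 on error. */
int write_value(const char *path, const char *text)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (f == NULL)
        return -1;
    ok = fprintf(f, "%s\n", text) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

For example, read_value("/proc/sys/kernel/threads-max", buf, sizeof buf) fetches the current thread limit as a string, and write_value() on the same path would change it.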
/proc/sys/net
This can be considered the most useful proc subdirectory. It allows you to change very important settings for your network kernel configuration.
core
ipv4
ipv6
unix
ethernet
802
/proc/sys/net/core
Listed below are general net settings, like "netdev_max_backlog" (typically 300), the maximum length of the queue of received network packets. This value can limit your network bandwidth when receiving packets, because Linux has to wait until scheduling time to flush the buffers (due to the bottom half mechanism), which is about 1000/HZ ms:
  300    *        100             =     30 000
packets     HZ (Timeslice freq)        packets/s

30 000   *       1000             =      30 M
packets     average (Bytes/packet)   throughput Bytes/s
If you want to get higher throughput, you need to increase netdev_max_backlog, by typing:
echo 4000 > /proc/sys/net/core/netdev_max_backlog
Note: beware of the HZ value: on some architectures (like alpha or arm-tbox) it is 1000, so you can have 300 MBytes/s of average throughput.
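The throughput estimate above is easy to recompute for other settings. The helpers below simply restate that arithmetic (queue length times HZ times average packet size); the function names are invented, and the model assumes, as the text does, that the queue is emptied once per timer tick.

```c
#include <assert.h>

/* Packets per second that can be drained if a queue of `backlog`
 * packets is emptied HZ times per second. */
unsigned long long max_packets_per_sec(unsigned long long backlog,
                                       unsigned long long hz)
{
    return backlog * hz;
}

/* The same bound expressed in bytes per second. */
unsigned long long max_bytes_per_sec(unsigned long long backlog,
                                     unsigned long long hz,
                                     unsigned long long avg_packet_bytes)
{
    return max_packets_per_sec(backlog, hz) * avg_packet_bytes;
}
```

With a backlog of 300, HZ = 100 and 1000-byte packets this gives 30 000 000 bytes/s; raising the backlog to 4000 gives 400 000 000 bytes/s, and HZ = 1000 with the default backlog gives 300 000 000 bytes/s.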
/proc/sys/net/ipv4
"ip_forward" enables or disables IP forwarding on your Linux box. This is a global setting for all devices; forwarding can also be set for each device individually.
/proc/sys/net/ipv4/conf/interface
I think this is the most useful /proc entry, because it allows you to change some net settings to support wireless networks (see Wireless-HOWTO for more information).
Here are some examples of settings you could use:
- "forwarding", to enable IP forwarding for your interface
- "proxy_arp", to enable the proxy arp feature. For more, see the Proxy arp HOWTO at the Linux Documentation Project and the Wireless-HOWTO for proxy arp use in wireless networks.
- "send_redirects", to prevent the interface from sending ICMP_REDIRECT (as before, see Wireless-HOWTO for more).
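Each of these per-interface settings lives at /proc/sys/net/ipv4/conf/interface/setting, so a program that wants to touch one only has to build that path. A minimal sketch follows; the helper name ''conf_path'' is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "/proc/sys/net/ipv4/conf/<iface>/<setting>" in buf.
 * Returns 0 on success, -1 if the result does not fit in len bytes. */
int conf_path(char *buf, size_t len, const char *iface, const char *setting)
{
    int n = snprintf(buf, len, "/proc/sys/net/ipv4/conf/%s/%s",
                     iface, setting);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For instance, conf_path(buf, sizeof buf, "eth0", "proxy_arp") produces /proc/sys/net/ipv4/conf/eth0/proxy_arp, the file to which a 1 is written to enable proxy arp on eth0 only.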