Handling the Hardware Cache and the TLB

The last topic of memory addressing deals with how the kernel makes optimal use of the hardware caches. Hardware caches and Translation Lookaside Buffers play a crucial role in boosting the performance of modern computer architectures. Several techniques are used by kernel developers to reduce the number of cache and TLB misses.





Handling the hardware cache





As mentioned earlier in this chapter, hardware caches are addressed by cache lines.
The L1_CACHE_BYTES macro yields the size of a cache line in bytes. On Intel models
earlier than the Pentium 4, the macro yields the value 32; on a Pentium 4, it yields
the value 128.
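As a minimal sketch of how this macro is used (the structure and its fields are invented for the example and are not taken from the kernel sources), a data structure can be padded so that it occupies a whole cache line:

    #include <linux/cache.h>

    /* Hypothetical per-CPU counter slot, padded to L1_CACHE_BYTES so
       that slots belonging to different CPUs never share a cache line. */
    struct counter_slot {
            unsigned long count;
            char pad[L1_CACHE_BYTES - sizeof(unsigned long)];
    };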




To optimize the cache hit rate, the kernel considers the architecture in making the
following decisions.




• The most frequently used fields of a data structure are placed at the low offset within the data structure, so they can be cached in the same line (see the sketch after this list).




• When allocating a large set of data structures, the kernel tries to store each of
them in memory in such a way that all cache lines are used uniformly.
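For instance, the first rule might look as follows; the structure and its fields are invented for illustration, and only the placement principle comes from the rule above:

    /* Fields touched on every operation come first, so they all fall
       in the first cache line; rarely used fields sit at higher
       offsets. */
    struct sample_conn {
            unsigned int  state;       /* hot: checked on every use */
            unsigned long last_used;   /* hot: updated on every use */
            char          name[64];    /* cold: used only for reporting */
            unsigned long debug_hits;  /* cold: used only when debugging */
    };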





Cache synchronization is performed automatically by the 80x86 microprocessors, thus the Linux kernel for this kind of processor does not perform any hardware cache flushing. The kernel does provide, however, cache flushing interfaces for processors that do not synchronize caches.
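For instance, code that writes into a page also mapped by User Mode might use one of these interfaces as follows. The helper function here is hypothetical; flush_dcache_page() is a real kernel interface, and on the 80x86 it expands to nothing:

    #include <linux/mm.h>
    #include <linux/string.h>
    #include <asm/cacheflush.h>

    /* On architectures with non-coherent caches, the explicit flush
       makes the kernel's write visible through the user mapping. */
    static void fill_user_visible_page(struct page *page,
                                       const void *src, size_t len)
    {
            memcpy(page_address(page), src, len);
            flush_dcache_page(page);   /* no-op on the 80x86 */
    }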




Handling the TLB





Processors cannot synchronize their own TLB cache automatically because it is the kernel, and not the hardware, that decides when a mapping between a linear and a physical address is no longer valid. Linux 2.6 offers several TLB flush methods that should be applied appropriately, depending on the type of page table change.
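As a sketch of applying the narrowest suitable method (the two wrapper functions are hypothetical; flush_tlb_page() and flush_tlb_range() are among the real 2.6 methods):

    #include <linux/mm.h>
    #include <asm/tlbflush.h>

    /* Invalidate only what the page table change made stale. */
    static void after_one_pte_change(struct vm_area_struct *vma,
                                     unsigned long addr)
    {
            flush_tlb_page(vma, addr);          /* a single entry */
    }

    static void after_range_change(struct vm_area_struct *vma,
                                   unsigned long start, unsigned long end)
    {
            flush_tlb_range(vma, start, end);   /* a range of entries */
    }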







Despite the rich set of TLB methods offered by the generic Linux kernel, every microprocessor usually offers a far more restricted set of TLB-invalidating assembly language instructions. In this respect, one of the more flexible hardware platforms is Sun’s UltraSPARC. In contrast, Intel microprocessors offer only two TLB-invalidating techniques (sketched in the code after this list):

• All Pentium models automatically flush the TLB entries relative to non-global pages when a value is loaded into the cr3 register.

• In Pentium Pro and later models, the invlpg assembly language instruction invalidates a single TLB entry mapping a given linear address.
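A simplified sketch of the two techniques, modeled on the kernel’s own inline assembly helpers (__flush_tlb() and the invlpg-based single-entry helper in the 2.6 i386 headers):

    /* Reloading cr3 flushes all TLB entries relative to non-global
       pages. */
    static inline void sketch_flush_tlb(void)
    {
            unsigned long tmp;
            asm volatile("movl %%cr3, %0\n\t"
                         "movl %0, %%cr3"
                         : "=r" (tmp) : : "memory");
    }

    /* invlpg invalidates the single entry mapping the given linear
       address. */
    static inline void sketch_flush_tlb_one(unsigned long addr)
    {
            asm volatile("invlpg (%0)" : : "r" (addr) : "memory");
    }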





To avoid useless TLB flushing in multiprocessor systems, the kernel uses a technique called lazy TLB mode. The basic idea is the following: if several CPUs are using the same page tables and a TLB entry must be flushed on all of them, then TLB flushing may, in some cases, be delayed on CPUs running kernel threads.

In fact, a kernel thread does not have its own set of page tables; rather, it makes use of the set of page tables belonging to a regular process. However, there is no need to invalidate a TLB entry that refers to a User Mode linear address, because no kernel thread accesses the User Mode address space.
When a CPU starts running a kernel thread, the kernel puts it in lazy TLB mode. When requests are issued to clear some TLB entries, a CPU in lazy TLB mode does not flush the corresponding entries; however, the CPU remembers that its current process is running on a set of page tables whose TLB entries for the User Mode addresses are invalid. As soon as the CPU in lazy TLB mode switches to a regular process with a different set of page tables, the hardware automatically flushes the TLB entries, and the kernel sets the CPU back in non-lazy TLB mode. However, if a CPU in lazy TLB mode switches to a regular process that owns the same set of page tables used by the previously running kernel thread, then any deferred TLB invalidation must be effectively applied by the kernel. This “lazy” invalidation is effectively achieved by flushing all non-global TLB entries of the CPU.
Some extra data structures are needed to implement the lazy TLB mode. The cpu_tlbstate variable is a static array of NR_CPUS structures (the default value for this macro is 32; it denotes the maximum number of CPUs in the system) consisting of an active_mm field pointing to the memory descriptor of the current process (see Chapter 9) and a state flag that can assume only two values: TLBSTATE_OK (non-lazy TLB mode) or TLBSTATE_LAZY (lazy TLB mode). Furthermore, each memory descriptor includes a cpu_vm_mask field that stores the indices of the CPUs that should receive Interprocessor Interrupts related to TLB flushing. This field is meaningful only when the memory descriptor belongs to a process currently in execution.
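A simplified sketch of these declarations, modeled on the 2.6 i386 sources (the field layout is abridged):

    #include <linux/cache.h>
    #include <linux/threads.h>     /* NR_CPUS */

    struct mm_struct;              /* the memory descriptor type */

    /* One element per CPU: the memory descriptor whose page tables
       the CPU is using, and whether the CPU is in lazy TLB mode. */
    struct tlb_state {
            struct mm_struct *active_mm;
            int state;             /* TLBSTATE_OK or TLBSTATE_LAZY */
    };
    struct tlb_state cpu_tlbstate[NR_CPUS] __cacheline_aligned;
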
When a CPU starts executing a kernel thread, the kernel sets the state field of its cpu_tlbstate element to TLBSTATE_LAZY; moreover, the cpu_vm_mask field of the active memory descriptor stores the indices of all CPUs in the system, including the one that is entering lazy TLB mode. When another CPU wants to invalidate the TLB entries of all CPUs relative to a given set of page tables, it delivers an Interprocessor Interrupt to all CPUs whose indices are included in the cpu_vm_mask field of the corresponding memory descriptor.

When a CPU receives an Interprocessor Interrupt related to TLB flushing and verifies that it affects the set of page tables of its current process, it checks whether the state field of its cpu_tlbstate element is equal to TLBSTATE_LAZY. In this case, the kernel refuses to invalidate the TLB entries and removes the CPU index from the cpu_vm_mask field of the memory descriptor (both sides of this mechanism are sketched after the following list). This has two consequences:





• As long as the CPU remains in lazy TLB mode, it will not receive other Interprocessor Interrupts related to TLB flushing.




• If the CPU switches to another process that is using the same set of page tables as the kernel thread that is being replaced, the kernel invokes __flush_tlb() to invalidate all non-global TLB entries of the CPU.
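Both sides of this mechanism can be sketched as follows; the wrapper functions are invented for illustration, while the logic is modeled on (and much simplified from) the 2.6 i386 smp_invalidate_interrupt() and switch_mm() routines:

    /* IPI handler side: a lazy CPU opts out instead of flushing. */
    static void tlb_ipi_check(int cpu, struct mm_struct *flush_mm)
    {
            if (flush_mm == cpu_tlbstate[cpu].active_mm) {
                    if (cpu_tlbstate[cpu].state == TLBSTATE_OK)
                            __flush_tlb();   /* really invalidate */
                    else                     /* lazy: skip the flush and
                                                stop further IPIs */
                            cpu_clear(cpu, flush_mm->cpu_vm_mask);
            }
    }

    /* Process-switch side: apply a flush skipped while lazy. */
    static void switch_back_check(int cpu, struct mm_struct *prev,
                                  struct mm_struct *next)
    {
            if (prev == next) {              /* same page tables */
                    cpu_tlbstate[cpu].state = TLBSTATE_OK;
                    if (!cpu_test_and_set(cpu, next->cpu_vm_mask))
                            __flush_tlb();   /* deferred invalidation */
            }
    }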
