1. Initial Boot Sequence
[Figure: Segment A, Segment B and Segment C placed in the RAM modules]
Segmentation Nowadays
• Segmentation is still present and always
enabled
• Each instruction that touches memory
implicitly uses a segment register:
– a jump instruction uses CS
– a push instruction uses SS
• Most segment registers can be loaded using a
mov instruction
• CS cannot be loaded with mov: it changes through far control transfers such as a far jmp (ljmp) or a far call (lcall)
x86 Real Mode
• 16-bit instruction execution mode
• 20-bit segmented memory address space
– 1 MB of total addressable memory
• A segment register holds the upper 16 bits of the 20-bit
segment base address (the lower 4 bits are zero)
• Each segment can range from 1 byte to 65,536
bytes (16-bit offset)
Real Mode Addressing Resolution
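The resolution rule can be sketched in C as follows. This is a minimal illustration, not code from any real bootloader: the function names are hypothetical, and the 20-bit variant assumes an 8086-style address bus that simply drops bit 20.

```c
#include <stdint.h>

/* Physical address generated by a real-mode segment:offset pair.
 * The 16-bit segment value is shifted left by 4 bits (multiplied
 * by 16) and the 16-bit offset is added to it. */
static uint32_t real_mode_phys(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* Same computation truncated to 20 address lines.
 * Assumption: the address bus cannot represent bit 20, so the
 * result wraps around to low memory. */
static uint32_t real_mode_phys_20bit(uint16_t segment, uint16_t offset)
{
    return real_mode_phys(segment, offset) & 0xFFFFFu;
}
```

For instance, real_mode_phys(0x07C0, 0x0000) gives 0x7C00, while real_mode_phys(0xFFFF, 0xFFFF) gives 0x10FFEF, a value that no longer fits in 20 bits.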
Addressing in x86 Real Mode
[Figure: physical addresses growing from 0000:0000 up to FFFF:FFFF]
• Weren't addresses 20 bits? The largest 20-bit address is FFFFF!
• FFFF:FFFF resolves to 0x10FFEF, which needs 21 bits
First Fetched Instruction
[Figure: real-mode memory map, from high to low addresses]
• BIOS ROM, starting at 0x000F0000 (960 KB)
• 16-bit devices, expansion ROM
• Low Memory: the only available "RAM" in the early days
– The bootloader is loaded here
• 0x00000000
Boot Sequence
[Figure: boot stages, including BIOS/UEFI (the actual hardware startup), the partition table and the boot partition]
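# Spin until the 8042 keyboard controller can accept a new byte
wait_for_8042:
    inb $0x64, %al        # read the 8042 status register
    testb $2, %al         # mask 0x02 (bit 1) set = input buffer full, busy
    jnz wait_for_8042
    ret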
x86 Protected Mode
• This execution mode was introduced in 80286
(1982)
• With 80386 (1985) it was extended by adding
paging
• CPUs start in Real Mode for backwards
compatibility
• Still today, x86 Protected Mode must be
activated during system startup
x86_64 Registers
CR0
Entering Basic Protected Mode
• The code must set bit 0 (PE) of register CR0
• Setting PE to 1 does not immediately activate all
protected-mode facilities
• They become active when the CS register is first updated
• As already mentioned, this can be done only with a far jump (ljmp)
instruction
• After this, code executes in protected mode (32-bit if the new code
segment is a 32-bit one); 64-bit execution additionally requires
enabling long mode
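The bit manipulation on CR0 can be modelled in C as below. This is only a sketch of the arithmetic: the helper names are hypothetical, and actually reading or writing CR0 requires privileged mov instructions to and from the control register, which are not shown here.

```c
#include <stdint.h>

#define CR0_PE (1u << 0)   /* Protection Enable: bit 0 of CR0 */

/* Value to write back to CR0 in order to request protected mode.
 * Every other bit of the register is preserved. */
static uint32_t cr0_with_pe(uint32_t cr0)
{
    return cr0 | CR0_PE;
}

/* Non-zero if the PE bit is set in the given CR0 value. */
static int cr0_pe_is_set(uint32_t cr0)
{
    return (cr0 & CR0_PE) != 0;
}
```

The read-modify-write pattern matters because CR0 carries other control bits that must not be cleared by accident.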
Entering Basic Protected Mode
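# PROT_MODE_CSEG: assumed name of the protected-mode code segment selector
# (it must not be the null selector 0x0000)
ljmp $PROT_MODE_CSEG, $PE_mode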
.code32
PE_mode:
# Set up the protected-mode data segment registers
movw $PROT_MODE_DSEG, %ax
movw %ax, %ds
movw %ax, %es
movw %ax, %fs
movw %ax, %gs
movw %ax, %ss
Segment Registers in Protected Mode
• In Protected Mode, a segment register no longer holds a raw
address
• It holds a selector, which contains (among other fields) an index
into a table of segment descriptors
• There are three types of segments:
– code
– data
– system
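As an illustration of what a segment register contains in Protected Mode, the following C sketch splits a 16-bit selector into its fields. The struct and function names are hypothetical; the field positions (index in bits 15..3, table indicator in bit 2, a privilege field in bits 1..0) follow the x86 selector format.

```c
#include <stdint.h>

struct selector_fields {
    unsigned index;  /* bits 15..3: entry number in the descriptor table */
    unsigned table;  /* bit 2: 0 = GDT, 1 = LDT */
    unsigned rpl;    /* bits 1..0: privilege level carried by the selector */
};

/* Decode a segment selector into its three fields. */
static struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;
    f.index = sel >> 3;
    f.table = (sel >> 2) & 1u;
    f.rpl   = sel & 3u;
    return f;
}
```

For example, selector 0x10 refers to entry 2 of the GDT with privilege field 0, and selector 0x2B refers to entry 5 of the GDT with privilege field 3.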
Descriptor Table Entry
Segmentation cannot
be disabled
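A descriptor table entry is 8 bytes wide, with base and limit scattered across it. The C sketch below shows one way to extract the main fields from an entry held as a 64-bit integer. The helper names are hypothetical, and the sample value used afterwards (a flat segment with base 0 spanning 4 GB, a common way to neutralise segmentation) is an illustrative assumption rather than something taken from the slides.

```c
#include <stdint.h>

/* 32-bit base: bits 16..39 and bits 56..63 of the descriptor. */
static uint32_t desc_base(uint64_t d)
{
    return (uint32_t)(((d >> 16) & 0xFFFFFFu) | (((d >> 56) & 0xFFu) << 24));
}

/* Raw 20-bit limit: bits 0..15 and bits 48..51. */
static uint32_t desc_limit(uint64_t d)
{
    return (uint32_t)((d & 0xFFFFu) | (((d >> 48) & 0xFu) << 16));
}

/* Descriptor Privilege Level: bits 45..46. */
static unsigned desc_dpl(uint64_t d)
{
    return (unsigned)((d >> 45) & 3u);
}

/* Granularity flag G: bit 55 (1 = limit counted in 4 KB units). */
static unsigned desc_granularity(uint64_t d)
{
    return (unsigned)((d >> 55) & 1u);
}

/* Highest valid offset inside the segment, taking G into account. */
static uint32_t desc_last_offset(uint64_t d)
{
    uint32_t limit = desc_limit(d);
    return desc_granularity(d) ? ((limit << 12) | 0xFFFu) : limit;
}
```

With the flat code descriptor 0x00CF9A000000FFFF, desc_base gives 0, desc_limit gives 0xFFFFF, G is 1, and desc_last_offset gives 0xFFFFFFFF, so the segment covers the whole 32-bit space.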
Segment Caching
• Accessing the GDT for every memory access would be
too slow
• Segment registers have a non-programmable
hidden part to store the cached descriptor
[Figure: a segment register is made of a selector and a cached descriptor (non-programmable)]
x86 Enforcing Protection
• Each descriptor entry has a Descriptor Privilege Level (DPL)
• The firmware must check whether an access to a
certain segment is allowed
• There must be a way to change the current privilege level, and
this change should be controlled (or denied)
Data Segment vs Code Segment
• RPL is present only in data segment selectors (e.g. SS or DS)
• Current Privilege Level (CPL): this is only in CS, which can be
loaded only with a ljmp/lcall
• Overall we have 3 different privilege-level fields:
CPL, RPL, and DPL
Protection upon Segment Load
• CPL is managed by the CPU: it is always equal to
the current CPU privilege level
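The check performed when a data segment register is loaded can be sketched as follows. The function name is hypothetical. The rule modelled here is the one for ordinary data segments (lower numbers mean more privilege, and the weaker of CPL and RPL is compared against DPL); the stack segment is subject to a stricter check that this sketch does not cover.

```c
/* Returns non-zero if loading a data segment selector is allowed.
 * cpl: current privilege level (from CS)
 * rpl: requested privilege level (from the selector being loaded)
 * dpl: descriptor privilege level (from the target descriptor) */
static int data_segment_load_allowed(unsigned cpl, unsigned rpl, unsigned dpl)
{
    /* The numerically larger value is the less privileged one. */
    unsigned effective = cpl > rpl ? cpl : rpl;
    return effective <= dpl;
}
```

For example, code running at Ring 3 cannot load a DPL 0 segment, and Ring 0 code using a selector with RPL 3 is refused as well, since the request is treated as coming from Ring 3.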
[Figure: a user routine in User Space (Ring 3) and kernel routine A in Kernel Space (Ring 0); a direct cross-segment jump is not admitted, while a cross-segment jump through a gate is]
Gate Descriptors
• A gate descriptor is a segment descriptor of type
system:
– Call-gate descriptors
– Interrupt-gate descriptors
– Trap-gate descriptors
– Task-gate descriptors
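As an example of a gate, the layout of a 32-bit interrupt or trap gate can be written down in C. This is a sketch under stated assumptions: the struct, enum and helper names are hypothetical, and only the 32-bit protected-mode format (8 bytes per gate) is covered.

```c
#include <stdint.h>

/* One 8-byte gate descriptor, 32-bit protected-mode format. */
struct gate32 {
    uint16_t offset_low;   /* handler offset, bits 15..0 */
    uint16_t selector;     /* code segment selector of the handler */
    uint8_t  reserved;     /* zero for interrupt and trap gates */
    uint8_t  type_attr;    /* P (bit 7), DPL (bits 6..5), type (bits 3..0) */
    uint16_t offset_high;  /* handler offset, bits 31..16 */
};
_Static_assert(sizeof(struct gate32) == 8, "a gate descriptor is 8 bytes");

enum { GATE_INTERRUPT32 = 0xE, GATE_TRAP32 = 0xF };

/* Build a present gate pointing at handler inside segment selector. */
static struct gate32 make_gate(uint32_t handler, uint16_t selector,
                               unsigned type, unsigned dpl)
{
    struct gate32 g;
    g.offset_low  = (uint16_t)(handler & 0xFFFFu);
    g.selector    = selector;
    g.reserved    = 0;
    g.type_attr   = (uint8_t)(0x80u | ((dpl & 3u) << 5) | (type & 0xFu));
    g.offset_high = (uint16_t)(handler >> 16);
    return g;
}

/* Recover the full 32-bit handler offset from a gate. */
static uint32_t gate_offset(struct gate32 g)
{
    return ((uint32_t)g.offset_high << 16) | g.offset_low;
}
```

The gate stores a selector and an offset rather than a plain address, so the target of the jump is fixed by whoever set up the table and cannot be chosen by the caller.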
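[Figure: IDTR locates a table of 256 entries, each holding a selector and an offset; the selector picks a segment descriptor in the system GDT, located through GDTR, which leads to the Kernel Text Segment]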
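GDT in Linux
[Figure: Linux GDT layout; some entries are different for all cores, others are shared across all cores]
[Figure: TSS contents, holding the CPU state and the privilege-level stacks]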
Task State Segment (TSS)
• The Base field of the n-th core's TSS descriptor
points to the n-th entry of the init_tss array
• G=0 and Limit=0xeb
– given that the TSS is 236 bytes in size (the limit is the size minus one: 236 - 1 = 235 = 0xeb)
• DPL=0
– the TSS cannot be accessed in user mode
Entering Ring 0 from Ring 3
Protected Mode Paging
• Since 80386, x86 CPUs add an additional step in
address translation
Memory Address Translation
[Figure: address translation steps, ending in the RAM modules]
Protected Mode Paging
• Paging has to be explicitly enabled
– Entering Protected Mode does not enable it
automatically
– Several data structures must be set up beforehand
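The data structures in question are a page directory and a set of page tables, indexed by parts of the linear address. The C sketch below shows how a 32-bit linear address is split, assuming classic i386 paging with 4 KB pages and no PAE. The struct and function names are hypothetical.

```c
#include <stdint.h>

struct i386_vaddr {
    unsigned dir;     /* bits 31..22: index into the page directory */
    unsigned table;   /* bits 21..12: index into the page table */
    unsigned offset;  /* bits 11..0: byte offset inside the 4 KB page */
};

/* Split a linear address into the three fields used by the page walk. */
static struct i386_vaddr split_i386(uint32_t va)
{
    struct i386_vaddr v;
    v.dir    = va >> 22;
    v.table  = (va >> 12) & 0x3FFu;
    v.offset = va & 0xFFFu;
    return v;
}

/* Physical address obtained from a page table entry and an offset:
 * the upper 20 bits of the entry give the page frame. */
static uint32_t phys_from_pte(uint32_t pte, unsigned offset)
{
    return (pte & 0xFFFFF000u) | (offset & 0xFFFu);
}
```

As an example, address 0x08048123 uses directory entry 32, table entry 0x48 and offset 0x123.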
i386 PTE entries
[Figure: PTE bit layout, with one field annotated "(Sticky bit)"; the translation ends in the RAM modules]
Translation Lookaside Buffer
Relations to Trap/Interrupt Table
• Upon a TLB miss, firmware accesses the page table
• The first checked bit is PRESENT
• If this bit is zero, a page fault occurs which gives rise to a trap
• CPU registers (including EIP and CS) are saved on the system stack
• They will be restored when returning from trap: the trapped
instruction is re-executed
• Re-execution might give rise to additional traps, depending on
firmware checks on the page table
• As an example, an attempt to write to a read-only page
will give rise to a trap (which triggers the segmentation fault
handler)
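The checks described above can be sketched in C as a function that inspects the low bits of a page table entry. The names are hypothetical. The bit positions (Present in bit 0, Read/Write in bit 1, User/Supervisor in bit 2) are those of the i386 entry format, and the sketch assumes that write protection is also enforced for supervisor accesses.

```c
#include <stdint.h>

#define PTE_PRESENT  (1u << 0)  /* page is mapped */
#define PTE_WRITABLE (1u << 1)  /* writes are allowed */
#define PTE_USER     (1u << 2)  /* user-mode accesses are allowed */

enum access_result { ACCESS_OK, FAULT_NOT_PRESENT, FAULT_PROTECTION };

/* Decide whether an access through this entry proceeds or traps. */
static enum access_result check_pte(uint32_t pte, int is_write, int is_user)
{
    /* The PRESENT bit is checked first. */
    if (!(pte & PTE_PRESENT))
        return FAULT_NOT_PRESENT;
    /* User code may only touch pages marked as user-accessible. */
    if (is_user && !(pte & PTE_USER))
        return FAULT_PROTECTION;
    /* Assumption: writes to read-only pages always trap. */
    if (is_write && !(pte & PTE_WRITABLE))
        return FAULT_PROTECTION;
    return ACCESS_OK;
}
```

After the handler fixes the entry (for example by loading the page and setting PRESENT), re-running the same check on the re-executed instruction may still report a protection fault, which matches the "additional traps" case above.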
Linux memory layout on i386
0xFFFFFFFF
Kernel (1 GB)
0xC0000000
0xBFFFFFFF
User Space (3 GB)
0x00000000
Physical Address Extension (PAE)
• An attempt to go beyond the 4 GB limit on i386
systems
• Present since the Intel Pentium Pro
• Supported on Linux since the 2.3.23 development kernel (stable releases from 2.4 on)
• Physical addressing is extended to 36 bits
• This allows addressing up to 64 GB of RAM
• Paging uses 3 levels
• CR4.PAE-bit (bit 5) tells if PAE is enabled
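The numbers above can be checked with a short C sketch. The names are hypothetical, and the 2/9/9/12 split of the linear address assumes PAE with 4 KB pages.

```c
#include <stdint.h>

#define CR4_PAE       (1u << 5)  /* bit 5 of CR4: PAE enabled */
#define PAE_PHYS_BITS 36         /* width of a physical address under PAE */

/* Amount of physical memory addressable with 36 bits. */
static uint64_t pae_max_phys_bytes(void)
{
    return 1ULL << PAE_PHYS_BITS;
}

struct pae_vaddr {
    unsigned pdpt;    /* bits 31..30: page directory pointer table index */
    unsigned dir;     /* bits 29..21: page directory index */
    unsigned table;   /* bits 20..12: page table index */
    unsigned offset;  /* bits 11..0: offset inside the 4 KB page */
};

/* Split a 32-bit linear address for the 3-level PAE walk. */
static struct pae_vaddr split_pae(uint32_t va)
{
    struct pae_vaddr v;
    v.pdpt   = va >> 30;
    v.dir    = (va >> 21) & 0x1FFu;
    v.table  = (va >> 12) & 0x1FFu;
    v.offset = va & 0xFFFu;
    return v;
}
```

Linear addresses stay 32 bits wide: only the physical side grows, which is why each table index shrinks to 9 bits while the entries become 64 bits wide.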
Physical Address Extension (PAE)
x64 Paging Scheme
• PAE is extended via the so-called “long addressing”
• 2^64 bytes of logical memory in theory
• The upper 16 bits (bits 48 to 63) are short-circuited: they must replicate bit 47
– Up to 2^48 canonical form addresses (lower and upper half)
– A total of 256 TB addressable memory
• Linux currently allows for 128 TB of logical
addressing of individual processes and 64 TB for
physical addressing
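A canonical-form test can be sketched in C as follows, assuming 48 implemented address bits. The function names are hypothetical.

```c
#include <stdint.h>

/* Non-zero if addr is canonical with 48 implemented bits:
 * bits 63..47 must be all zeros (lower half) or all ones (upper half). */
static int is_canonical48(uint64_t addr)
{
    uint64_t top = addr >> 47;   /* the 17 most significant bits */
    return top == 0 || top == 0x1FFFFu;
}

/* Total size of the canonical address space: 2^48 bytes. */
static uint64_t canonical_space_bytes(void)
{
    return 1ULL << 48;
}
```

The lower half ends at 0x00007FFFFFFFFFFF and the upper half starts at 0xFFFF800000000000; each half is 2^47 bytes, i.e. 128 TB, and the two together give the 256 TB total.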
Canonical Addresses
[Figure: canonical address ranges in the 64-bit space compared with the 48-bit implemented space]
Linux memory layout on x64
[Figure: memory layout descending from 0xFFFF FFFF FFFF FFFF, including the TEXT and DATA regions]
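#include <efi.h>      /* gnu-efi headers (assumed toolchain) */
#include <efilib.h>

EFI_STATUS EFIAPI
efi_main(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE *SystemTable) {
    /* Set up the gnu-efi library state before calling Print() */
    InitializeLib(ImageHandle, SystemTable);
    Print(L"Hello World\n");
    return EFI_SUCCESS;
}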
GUID Partition Table
Secure Boot
• There is a kind of malware which takes control
of the system before the OS starts
– MBR RootKits
• Usually, these RootKits hijack the IDT for I/O
operations, to execute their own wrapper
• When the kernel is being loaded, the RootKit
notices that and patches the binary code while
loading it into RAM
Secure Boot
• UEFI can be configured to load only signed executables
• Keys to verify signatures are installed in the UEFI
configuration
– Platform Key (PK): tells who “owns and controls”
the hardware platform
– Key-Exchange Keys (KEK): tell who is allowed to
update the hardware platform
– Signature Database Keys (DB): tell who is allowed
to boot the platform in secure mode
Dealing with multicores
• Who shall execute the startup code?
• For legacy reasons, the code is purely sequential
• Only one CPU core (the master) should run the
code