Turning on an ARM MMU and Living to tell the tale: The code

Turning on the MMU

In my last post we looked at the basic theory of an MMU and what it can do for us. In this post we are going to produce and understand the absolute minimum amount of code required (just 20 lines of assembler) to turn on an ARM MMU and come out the other side in one piece.

Our goal here is to create a simple identity mapping across the entire address space between virtual and physical memory addresses – such that the following formula holds true:

virtual address = physical address

In other words the address space from the processor’s point of view (or anyone else’s POV for that matter) will remain the same both before and after the MMU has been switched on. This goal may seem a little pointless – but it does act as a good starting point for further development. For example, you can start to make use of other features of the MMU such as specifying access permissions and attributes of pages.

When an MMU is in use it is able to automatically convert virtual addresses to physical addresses – however in order to do this it extensively utilises (at a performance cost) a set of translation tables (sometimes known as page tables) stored in physical memory. Therefore prior to enabling the MMU these tables must be appropriately set up and the hardware must be told where in memory they can be found. The translation tables are typically set up very early on during boot by an operating system.

The ARM MMU supports entries in the translation tables which can represent either an entire 1MB (section), 64KB (large page), 4KB (small page) or 1KB (tiny page) of virtual memory (tiny pages belong to older versions of the architecture and are not available on ARMv7 cores). In order to provide flexibility the translation tables are multi-level – there is a single top level table which divides the address space into 1MB sections and each entry in that table can either describe a corresponding area of physical memory or provide a pointer to a second level table. Depending on the type of second level table pointed to – that megabyte of memory can then be represented by multiple table entries describing memory areas of the other page sizes (and even mixed). As the tables are multi-level the lookup process performed by the MMU is often known as a translation table walk.

The ARM MMU table design is rather clever in that it allows you to mix and match page sizes – if this wasn’t the case then you would have to choose a single page size to work with, which may come at the expense of the amount of RAM required to store the page tables. For example if the entire 4GB address space was represented by tiny pages of 1KB then the translation table would take up a massive 16MB of memory: (sizeof(address space) / 1KB) * sizeof(page table entry) = (4GB / 1KB) * 4 bytes.
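
To make that arithmetic concrete, here is a small host-side C sketch (not part of the original code; the function name `flat_table_bytes` is made up for illustration). It assumes a 32-bit address space and 4-byte entries, as described above, and ignores the multi-level layout entirely.

```c
#include <stdint.h>

/* Bytes needed for a flat, single-level table covering a 32-bit (4GB)
 * address space with one 4-byte entry per page of `page_size` bytes. */
uint64_t flat_table_bytes(uint64_t page_size)
{
    const uint64_t address_space = 1ULL << 32; /* 4GB */
    const uint64_t entry_size = 4;             /* bytes per table entry */
    return (address_space / page_size) * entry_size;
}
```

With 1KB pages this comes out at 16MB, matching the figure above, whereas 1MB sections need only 16KB.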

Besides the amount of storage required for page tables, when considering which page sizes to use – performance should also be a consideration – when the hardware performs a translation table walk it has to access physical memory at least once which is relatively slow. Thankfully the MMU has a dedicated cache for making a note of recent translations – it’s known as a Translation Lookaside Buffer (TLB).

Now let’s get on with some coding! As we wish to write the least amount of code possible – we will only utilise the first level page table. For a full identity mapping each of its 4096 entries would point to the corresponding 1MB range of physical memory with the same address. We start by telling the MMU the base address of our table. How about this:

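.global start
start:
   ldr r0, tlb_l1_base          @ r0 = address of the first level table
   mcr p15, 0, r0, c2, c0, 0    @ write r0 to CP15 c2 (translation table base)

tlb_l1_base:
.word 0x40200000                @ data, not code: keep it out of the execution path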

So we start off (within our well defined point of entry ‘start’) by loading the memory address 0x40200000 into register r0. We intend for this to represent the start of our first level page table. As we know page table entries in this table are 4 bytes long and that there is a maximum of 4096 entries (one for each 1MB of the address space) we can calculate the size of the table as 16KB. I’ve decided to locate this at the start of the available SRAM (see Figure 26-7 of the DM3730 TRM) – of course if you wanted to write more code you could first initialise your SDRAM controller and place the page tables in SDRAM instead. (Please note the hardware also demands that the table is located on a 16KB boundary).

The next line of source uses an ‘mcr‘ instruction to inform the MMU of our chosen location for the top level page table. The MMU/TLB is treated as a coprocessor to the ARM and as a result the ‘mcr‘ and ‘mrc‘ instructions must be used to pass register values to and from coprocessor registers. In this case we’re telling the ARM to transfer the value stored in r0 to register 2 of coprocessor 15 (as specified by the ARM architecture reference manual – see section B3.7 for more details).

The next step is to populate the tables. As we only intend to use the first level table we are constrained to either filling the table with ‘section’ entries or ‘page faults’ (entries which will always cause a page fault). Section entries represent an entire 1MB region of memory and have the following layout.

Bits 31:20 - Section base address
Bits 11:10 - Access permissions
Bits 8:5   - Domain
Bits 3:2   - Cacheable / Bufferable
Bits 1:0   - Always 0b10 for a section page table entry / descriptor

Some bits are missing – these are bits which are either not used and should always be set to zero or ‘implementation defined’ – which means that it’s up to the ARM licensee to decide what to do with them – we will keep them at zero. The lower couple of bits are set to 0x2 which describes the entry as a section descriptor.
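
As a rough illustration of that layout, the following host-side C sketch packs the fields into a descriptor word. The function name `section_descriptor` and its parameters are hypothetical; every bit not listed above is simply left at zero, as in this post.

```c
#include <stdint.h>

/* Build a first level section descriptor from the fields listed above.
 * All bits not covered by a parameter are left at zero. */
uint32_t section_descriptor(uint32_t phys_addr, uint32_t ap,
                            uint32_t domain, uint32_t cb)
{
    return (phys_addr & 0xFFF00000u)   /* bits 31:20 section base address */
         | ((ap & 0x3u) << 10)         /* bits 11:10 access permissions */
         | ((domain & 0xFu) << 5)      /* bits 8:5   domain */
         | ((cb & 0x3u) << 2)          /* bits 3:2   cacheable / bufferable */
         | 0x2u;                       /* bits 1:0   0b10 = section */
}
```

For instance, `section_descriptor(0x40200000, 3, 0, 0)` evaluates to 0x40200c02, which is the value written by hand later in this post.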

We’re most interested in the ‘Section Base Address’ – when the MMU wants to translate a virtual address – it finds the corresponding page table entry representing that range of virtual memory in the section table and substitutes the top 12 bits of the virtual address with the Section Base Address. For example if we set the value 0x1f2 as a Section Base Address in the second entry of the table (entry 1) then we will get a translation scheme (just for that 1MB of memory) which looks like this:

0x001xxxxx (virtual) = 0x1f2xxxxx (physical)

Therefore to create an identity mapping – the values we need to use for the Section Base Address need to start with 0x000 for the first entry and increment by 1MB (or 0x1) each time.

0x000xxxxx (virtual) = 0x000xxxxx (due to entry 0)
0x001xxxxx (virtual) = 0x001xxxxx (due to entry 1)
0x002xxxxx (virtual) = 0x002xxxxx (due to entry 2)
...
0xfffxxxxx (virtual) = 0xfffxxxxx (due to entry 4095)
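
For reference, filling in the whole table this way could look something like the host-side C sketch below. This is only an illustration under assumptions: the name `fill_identity_table` is made up, the table is an ordinary array rather than the SRAM at 0x40200000, and on the real target the equivalent loop would have to run (in assembler, in keeping with this post) before the MMU is enabled.

```c
#include <stdint.h>

#define L1_ENTRIES 4096u  /* one entry per 1MB of a 4GB address space */

/* Fill a first level table with identity-mapped section descriptors.
 * `flags` supplies the low 20 bits of every descriptor (permissions,
 * domain, type); entry i gets section base address i. */
void fill_identity_table(uint32_t table[L1_ENTRIES], uint32_t flags)
{
    for (uint32_t i = 0; i < L1_ENTRIES; i++) {
        table[i] = (i << 20) | (flags & 0x000FFFFFu);
    }
}
```

Calling it with flags of 0xc02 would give entry 0 the value 0x00000c02 and entry 4095 the value 0xfff00c02.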

In order to create these tables we would need to write a loop – however in order to simplify this blog post – I’m just going to manually create entries for the page ranges which I intend to use (which isn’t many) – I end up with this:

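ldr r0, entry4020    @ r0 = address of the table entry for 0x402xxxxx
ldr r1, val4020      @ r1 = section descriptor for that 1MB
str r1, [r0]         @ store the descriptor into the table

entry4020:
.word 0x40201008     @ table base + 0x402 * 4
val4020:
.word 0x40200c02     @ base 0x402, AP 0b11, domain 0, type 0b10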

Let’s examine this. My application is just a tight loop, so it doesn’t have many memory access requirements: it doesn’t use a stack or any peripherals, and all it needs to access is the memory containing its own instructions. ‘Mapping in’ the 1MB section which contains the SRAM is therefore sufficient. My code writes the value 0x40200c02 to 0x40201008 – the destination address is the location in the page table of the entry which corresponds to the 1MB of memory which includes the SRAM (table base + 0x402 * 4). The value I’m writing to this address is our first section page table entry. The top 12 bits – the Section Base Address – match those of the virtual address corresponding with this table entry – thus an identity. The remaining bits select full access permissions (0b11), domain 0 and the section descriptor type (0b10), which allows us to access the section.
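
The address calculation can be checked with a few lines of host-side C. This is just a sketch; `l1_entry_address` is a hypothetical helper name, and the alignment check reflects the 16KB boundary requirement mentioned earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Address of the first level entry which covers `virt_addr`, for a table
 * starting at `table_base`. The top 12 bits of the virtual address select
 * the entry and each entry is 4 bytes long. */
uint32_t l1_entry_address(uint32_t table_base, uint32_t virt_addr)
{
    assert((table_base & 0x3FFFu) == 0); /* table must sit on a 16KB boundary */
    return table_base + (virt_addr >> 20) * 4u;
}
```

Any address inside the SRAM section, for example 0x40204000, gives 0x40201008 when the table base is 0x40200000.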

There is one more concept we need to understand – Domains – but we will skim over these. Every page table entry is associated with a domain (just a number) – each domain has an attribute which allows you to control access to its associated pages. It’s a good way of quickly disabling access to a whole range of pages without having to modify the access permissions of each page entry. In our example we assigned the page table entry to domain 0. We now need to set the access permissions for that domain – we will set it to ‘Manager’ – which means access permissions are not checked – i.e. permission checking is turned off. This is achieved through another coprocessor access:

mov r0, #0x3                 @ domain 0 = 0b11 (Manager)
mcr p15, 0, r0, c3, c0, 0    @ write r0 to CP15 c3 (domain access control)
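
The value 0x3 makes more sense once you know that the domain access control register holds one 2-bit field per domain, sixteen in total. The host-side C sketch below shows how a value for that register could be put together; the enum and the `dacr_set` helper are illustrative names, not anything from the original source.

```c
#include <stdint.h>

enum domain_access {
    DOMAIN_NO_ACCESS = 0x0, /* any access generates a fault */
    DOMAIN_CLIENT    = 0x1, /* access permissions are checked */
    DOMAIN_MANAGER   = 0x3  /* access permissions are not checked */
};

/* Return `dacr` with the 2-bit field for `domain` (0..15) set to `access`. */
uint32_t dacr_set(uint32_t dacr, unsigned domain, enum domain_access access)
{
    unsigned shift = (domain & 0xFu) * 2u;
    return (dacr & ~(0x3u << shift)) | ((uint32_t)access << shift);
}
```

Setting domain 0 to Manager on an otherwise zero register gives 0x3, the value loaded into r0 above.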

If you are still with me – it’s now time to turn on the MMU. If your mappings change the address of the code ‘you are standing on’ (which is a bad idea) – then you have to make sure that your code is compiled to be position independent – such that it can ‘carry on’ at its new address. Thankfully as we are just using an identity mapping we don’t need to worry about this.

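mrc p15, 0, r0, c1, c0, 0    @ read the control register (CP15 c1) into r0
orr r0, r0, #0x1             @ set bit 0 (MMU enable)
mcr p15, 0, r0, c1, c0, 0    @ write it back: the MMU is now on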

Let’s see what’s going on here. The three lines of code read the value from coprocessor register 1 of p15, modify that value and write it back. Register 1 contains lots of interesting things – such as where the exception table lives, if the caches are enabled and of course bit 0 determines if the MMU is enabled or not. In this case we set it and thus turn on the MMU.

In my source I also added a tight loop after enabling the MMU in the absence of anything else useful to do…

loop:
b loop

That’s all the code complete – we can build using the following commands:

arm-none-linux-gnueabi-gcc mmu.s -nostdlib -e start -Ttext=0x40204000
arm-none-linux-gnueabi-objcopy -O binary a.out a.bin
./signGP a.bin 0x40204000
mv a.bin.ift MLO

You may notice that I’ve set the link address (and thus entry point) to be 16KB into the SRAM area (0x40204000) – in other words just after our page tables.

If you compile and execute the code on your BeagleBoard and all has gone well then you should find your BeagleBoard is stuck in the last loop after enabling the MMU.

MMU Enabled (as shown in CSS)


If it hasn’t gone too well – then you’re probably seeing data or instruction aborts (exceptions). It may take a number of attempts to get right, so it’s probably a good idea to use a debugger and set breakpoints in your exception handlers.
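
When hunting down an abort it can help to model the first level lookup on your host and ask which addresses your table actually covers. The C sketch below does this under some assumptions: `l1_lookup` is a made-up name, second level tables are not modelled, and any entry which is not a section descriptor is treated as a fault. Note that the code in this post never clears the rest of the table, so on real hardware the untouched entries hold whatever happened to be in SRAM.

```c
#include <stdbool.h>
#include <stdint.h>

/* Look up `virt_addr` in a first level table of 4096 entries.
 * Returns true and stores the physical address if the entry is a section
 * descriptor (bits 1:0 == 0b10). Returns false for anything else, which
 * this sketch treats as an abort. */
bool l1_lookup(const uint32_t *table, uint32_t virt_addr, uint32_t *phys_addr)
{
    uint32_t entry = table[virt_addr >> 20];
    if ((entry & 0x3u) != 0x2u) {
        return false;
    }
    *phys_addr = (entry & 0xFFF00000u) | (virt_addr & 0x000FFFFFu);
    return true;
}
```

With a zeroed table containing only the single 0x40200c02 entry from this post, a lookup of 0x40204000 succeeds and returns the same address, while a lookup anywhere outside 0x402xxxxx fails.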

If you reached the tight loop (running with the MMU enabled) then congratulations you have successfully managed to turn on the MMU and live to tell the tale! [© 2011 embedded-bits.co.uk]


About Andrew Murray

Andrew is an experienced commercial Linux developer with a first class degree in Software Engineering and is the founder of Embedded Bits Limited. His day-to-day role fulfils his passion for learning and provides him with plenty of embedded Linux experience including kernel and embedded applications development on a wide variety of platforms. He loves to talk about boot time reduction and has performed a number of presentations on the topic at technical conferences - he has also been successful in achieving sub-second cold boot on Linux based products. Feel free to drop him an email at amurray@embedded-bits.co.uk

7 Responses to “Turning on an ARM MMU and Living to tell the tale: The code”

  1. Mark August 1, 2011 at 4:08 pm

    Hi Andrew,

    I was curious how the C++ real-time operating system development is coming along on the beagleboard?

    I created the same for an LPC2378 board. When I looked into porting my OS to the beagleboard (I have a revC4, so it’s OMAP) I concluded that real time was not going to happen on the ARM CPU due to the complexity of the buses. Now that I’ve digested more of the beagleboard’s hardware, I believe the DSP is used to accomplish real time requirements, but have yet to tackle programming it. Also, you must lose all hard real time capabilities as soon as you start using the MMU?

    Are you planning to blog about the development and structure of your OS?

    –Mark

  2. Andrew Murray August 1, 2011 at 7:20 pm

    Hi Mark,

    Thanks for the comment. The OS is coming along but very slowly – there are many chicken and egg problems which make it difficult to come up with a ‘clean’ design that gives a warm and fuzzy feeling. I’ve got round-robin multi-tasking going with some use of an MMU – but all of this is still in the QEMU emulator.

    It’s a learning curve for me – and the more time I spend on it the more my goals change. Just turning on the cache makes a real time system a little more difficult. But then again ‘real-time’ seems to have a different meaning to different people. I guess I want to make it more deterministic than Linux. My main motivation is really to make an operating system that is focused heavily on supporting embedded devices – with a strong focus on minimal boot times and a structure that makes it very easy to develop drivers and manage dependencies. I’m also keen to avoid the problem of Linux being a big puzzle when it comes to determining which tree best supports a given device – and how to validate that all the features I need are fully supported – without having to try it out…

    A couple of people have asked me to blog about this – it seems like a good idea so I’m sure I will. Is OS creation something you are interested in?

  3. Mark August 2, 2011 at 4:24 pm

    OS design is interesting indeed. Add C++ and it starts to become fun!

  4. Igor March 18, 2012 at 12:39 pm

    Hi… thx for ur work.. I m a noob (and maybe a bit stupid too…) but really I don’t understand the:
    “the destination address is the offset in the page table which corresponds to the 1MB of memory which includes the SRAM (table base + 0x402 * 4).”

    I thought the first entry should be on the :
    “tlb_l1_base: .word 0x40200000”

    And so, where that 0x402*4 comes from??

    Did I miss something? Thank u very much..

    • Andrew Murray March 28, 2012 at 6:19 pm

      Hi Igor,

      The start of the page table is 0x40200000, each entry is 4 bytes long and each entry corresponds to a 1MB section of memory. The first entry in the page table (at 0x40200000) represents what happens when you try to access memory in the range 0x00000000 to 0x000fffff, the second entry (4 bytes into the table at 0x40200004) represents the memory range 0x00100000 to 0x001fffff, etc.

      Therefore to find the page table entry associated with the area of memory 0x40200000 to 0x40200000+1MB (which is 1MB section number 0x402, counting from zero) – you need to step through the page table by 0x402 entries – but each entry is 4 bytes long, therefore you need to multiply this number by 4.

      Does that make sense?

      Andy

  5. Petri July 28, 2012 at 6:27 pm

    Hi,

    Good stuff! This really helped me to understand the absolute minimum of things I need to get started on my own quest to initialize an MMU.
    It could be hard to gain same understanding just to read ARM specs as reading your quick and fine introduction to MMU initialization.

    (Sure I have to read specs, but now I know what to look at)

    Difficult concepts can be made clear to understand by reading this article.

    Thank you :-)
