Friday, June 4, 2021

Technical Mitigations: Memory Tagging

This post was originally about a quarter of the length, but as I edited it, it continued to grow. I have put a lot of thought into MTE and I hope some of the understanding I've gained comes through in reading this. I suggest reading it with a good drink. While writing it I had a manhattan, an IPA, a golden ale, unsweetened tea, a double espresso, and flavored water, so basically anything works (maybe a grown-up drink to numb the pain of any math involved, or, if you're reading for the real information, tea?).

This post is about Memory Tagging. The inspiration for this post is the Memory Tagging Extension (MTE) outlined in ARMv8.5-A, so I'll draw all my examples from AArch64 in ARMv8.5-A. If you're into non-ARM architectures... Meh. iPhone and Android run on ARMs. Deal with it.

Tagged Memory, Not Tagged Pointers

Let's clear the air here: Memory Tagging is not the same as Tagged Pointers. In both, bits are stolen from the pointer value to hold extra data, but tagged pointers are more generic and typically used for book-keeping things like reference counts or address granularity. Memory Tagging is a very specific use of stolen bits for security/reliability purposes.

Exploit Techniques

I know what you're thinking, "Ronman, I thought this was supposed to be a post about MTE, why are we talking about exploits?" Yeah. Yes we are. This is important because we can't really appreciate what this mitigation technique does for us without understanding what it's supposed to prevent. Therefore, we're going to go over a few exploit classes relevant to the subject.

Note: If you're familiar with the general types of exploits, skip to "What do they have in common?".

Buffer Over/Underflow

This is the simplest bug possible, really. Imagine an application that asks for your password. The application has reserved 64 bytes of space on the stack to store the password that you enter because, really, who has a password longer than 63 characters? (Don't forget the null-terminator!) Well, what happens if you enter 68 characters? If there's no bounds checking on the input function, then you can actually overwrite extra memory on the stack. Or the heap. Or in an object that has a member with a statically sized buffer for... something.
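
Here's a tiny (and deliberately broken) sketch of that password scenario; the function name and prompt are made up for illustration:

#include <stdio.h>

// A fixed-size stack buffer with an unbounded read into it.
void ask_password(void) {
    char password[64];          // room for 63 characters plus the null terminator
    printf("Password: ");
    scanf("%s", password);      // no bounds check: 68 characters run right past the buffer
}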

Buffer underflows are similar to buffer overflows in that they involve writing outside the allocated buffer. The difference is that instead of writing after the allocated buffer, the bug exploited allows an attacker to write before the allocation.

Buffer over/underflows can also be caused by providing a bad length for an object. In some protocols, structures and the like are sent with a preceding length, so if you tell the other side you are going to send 32 bytes, they allocate 32 bytes, but then let you write 36 bytes, that can be bad. A sketch of this case follows below.
Integer over/underflows can also cause bad indexing into a buffer. This is another form of buffer over/underflow, but it's more often characterized as an integer overflow bug, as that is the underlying error that allows the out-of-bounds access.
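
As a sketch of the length-prefix case (the wire format and function here are entirely invented), the receive path might go wrong like this:

#include <stdint.h>
#include <string.h>

// Hypothetical receive path: the record claims a length up front, the receiver
// sizes its check from the claim, but copies what actually arrived.
void receive_record(const uint8_t* wire, size_t bytes_received) {
    uint16_t claimed_len;
    memcpy(&claimed_len, wire, sizeof(claimed_len));
    uint8_t buf[32];
    if (claimed_len > sizeof(buf)) return;  // the *claimed* 32 bytes would fit...
    // BUG: ...but the copy uses the number of bytes that actually showed up
    memcpy(buf, wire + sizeof(claimed_len), bytes_received - sizeof(claimed_len));
}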

Use After Free

As the name implies, a use after free exploit takes advantage of a bug in software where a reference to a struct/object is used after the actual object has been freed and its memory re-used for something else. This can happen if the interface of some function requires that a pointer passed to it be long-lived, or takes ownership of the object, but the caller does not treat it as such.

Easy ways to cause bugs that may result in use after free vulnerabilities would be things like:

  • Passing a stack address where a long-lived address is expected
    • Think, enqueueing something in a list, adding to a hashmap, or a pthread context parameter
  • Misusing functions that take ownership of memory passed to them
    • Picture a function that promises to call 'free' on a pointer passed to it

The thing to note here is that it becomes possible to have some code act on a "stale pointer", that is to say a pointer whose target address no longer contains the data expected. Stale pointers on their own aren't bad; even using them might cause no noticeable problems. (DON'T USE STALE POINTERS.) Where things get tricky is if the address to which the stale pointer points becomes allocated for something else. The stack address passed as a pthread context parameter later becomes part of some other function's stack frame in the thread that started the pthread, and now the second thread has a pointer into the stack of some other thread! Or, in the case of passing memory to a function that will free it and then using that memory afterward, the pointer is stale after the function call, and that address may refer to something that has since been allocated and is in use by something else.
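
As a minimal sketch of that pthread case (the worker function and its one-second delay are invented for illustration), the bug might look like this:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

// The worker dereferences its context pointer some time after it was handed over.
static void* worker(void* ctx) {
    sleep(1);                                  // by now the caller's stack frame is long gone
    printf("worker saw %d\n", *(int*)ctx);     // stale pointer: reads whatever reused that slot
    return NULL;
}

static void start_worker(pthread_t* t) {
    int local = 42;                            // lives in this function's stack frame
    pthread_create(t, NULL, worker, &local);   // BUG: stack address used as a long-lived context
}                                              // 'local' dies here; the pointer goes stale

int main(void) {
    pthread_t t;
    start_worker(&t);
    pthread_join(t, NULL);
    return 0;
}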

The point is that use after free is an exploit technique that takes advantage of situations wherein stale pointers cause effects in memory that has been reallocated since the pointer went stale.

Type Confusion

This is an edge case for memory tagging as a mitigation, but I bring it up because it can technically be helpful.
In a type confusion, an interface is abused or ignored. That means casting a polymorphic type farther down the chain of inheritance than you should, casting a 'void*' to something that it's not, or otherwise misinterpreting data so that you treat it some way other than you should.
Specifically in the polymorphic cast, and other cases where the incorrect type is larger than the correct type, a read/write through that bad type can cause an out-of-bounds access relative to the real type's allocation.
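
A bare-bones sketch of the "larger incorrect type" case (both structs here are invented):

#include <stdlib.h>

struct small { int kind; };                    // what was actually allocated
struct big   { int kind; long payload[8]; };   // what the code *thinks* it has

int main(void) {
    struct small* s = malloc(sizeof(struct small));
    if (s == NULL) return 1;
    s->kind = 1;
    struct big* b = (struct big*)s;   // type confusion: cast to the larger type
    b->payload[7] = 0x41;             // this write lands well past the real allocation
    free(s);
    return 0;
}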

What do they have in common?

Out of Bounds Access. That's really what it comes down to. Out of bounds in space OR time. Either. Or both. A variable or structure exists within some memory range AND some time range. If you access it outside of either of those ranges, bad things can happen.

How do these show up in the wild?

OOB bugs are everywhere. Like, everywhere. They can be the driving bug that gets malicious code to execute or they can be one of a few in a chain that leads to code execution. Breaking the ability of malicious actors to have OOB access for read/write is fantastic for security.

What the Heck is Memory Tagging Anyway?

Memory tagging is a process by which the system tries to detect and reject invalid accesses to memory. This is the technological answer to the techniques discussed in the previous section.

Memory Tagging involves a separate memory space reserved specifically for the tags; from here on, I'll refer to this as the Tag Store.

It also requires that upper bits be stolen from pointers. This usually isn't a problem on 64-bit architectures, because 64 bits can address way more memory than you would ever have mapped into a process. That said, this mitigation doesn't really work on 32-bit systems. In ARM, MTE uses the TBI (Top Byte Ignore) feature; this means the top byte is handled as completely separate from the pointer. Four of the eight top bits are reserved for the memory tag. It's important for this to be the top byte because, when arithmetic is done on the pointer, the tag should not be modified.
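
To make the layout concrete, here's a small sketch (these helper names are mine, not part of any API) of how a 4-bit tag sits in bits 56-59 of a 64-bit pointer under TBI:

#include <stdint.h>

// Read the 4-bit tag out of bits 56-59 of a pointer
static inline uint8_t tag_of(const void* p) {
    return (uint8_t)(((uintptr_t)p >> 56) & 0xF);
}

// Return the same address with a different 4-bit tag in bits 56-59
static inline void* with_tag(const void* p, uint8_t tag) {
    uintptr_t addr = (uintptr_t)p & ~((uintptr_t)0xF << 56);   // clear the old tag
    return (void*)(addr | ((uintptr_t)(tag & 0xF) << 56));
}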

Memory Tagging works by acting at three different times:
  1. Memory Allocation
  2. Read/Write
  3. Memory Deallocation

You should note that these are basically the three stages of an allocation's lifespan, so it kind of makes sense. Remember, I mentioned it's the spatial and temporal boundaries of an allocation that this is meant to enforce.

I'll go into those action stages below, but first I want to make clear the general concept of MTE. MTE essentially applies a lock-and-key system to read/write access to memory at the hardware level. The tag bits in the top byte are like the key to a door. (Not to be confused with a cryptographic key.) The tag store, combined with the actual Memory Tagging Extension (a subcomponent of the CPU), makes up the lock. Much like you can't sit at your neighbor's door and try all 100,000 possible keys to see which will open it, an attacker can't just try all 16 possible tags -remember, the tag is only 4 bits wide. If even one tag validation failure is detected by the tagging extension, bad things happen...

Memory Allocation

During allocation, a tag is assigned (randomly, but possibly such that adjacent allocations get different tags; it depends on the implementation). The same tag is assigned to the entire block allocated. In ARM, this is done with a granularity of 16 bytes. The idea of taggable blocks being larger than one byte is common because, with a half-byte tag per byte of memory, your Tag Store would have to be 1/2 the size of your usable RAM, or 1/3 of the total. With a granularity of 16 bytes, only 1/32 of your usable memory must be set aside for the tag store, which works out to roughly 3% of the total.
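
If you want to double-check that arithmetic, here's a throwaway calculation (nothing MTE-specific in it, just the ratios):

#include <stdio.h>

// Tag-store overhead for a 4-bit (half-byte) tag per taggable granule.
int main(void) {
    int granules[] = { 1, 16 };
    for (int i = 0; i < 2; i++) {
        double store_per_usable = 0.5 / granules[i];    // tag bytes per byte of usable RAM
        double share_of_total = store_per_usable / (1.0 + store_per_usable);
        printf("%2d-byte granule: store = usable/%g (%.2f%% of the total)\n",
               granules[i], 1.0 / store_per_usable, share_of_total * 100.0);
    }
    return 0;
}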

The tag is inserted into the pointer before it is returned from the allocation function (malloc/calloc/etc). So if you were to wrap malloc with memory tagging logic, it would look something like this:

#include <stdint.h>
#include <stdlib.h>

void* malloc_mte(size_t alloc_size) {
    // MTE works on 16B granules, so allocations must be in 16B increments
    size_t delta = alloc_size % 16;
    // Adjust size if needed
    if (delta) alloc_size += (16 - delta);
    // Perform normal malloc allocation
    void* allocation = malloc(alloc_size);
    // Don't try to tag NULL, that'd be silly
    if (allocation == NULL) return NULL;
    // Tag the region (apply_tag_to_range stands in for the hardware tagging
    // instructions and returns the 4-bit tag it chose)
    uint8_t tag_byte = apply_tag_to_range(allocation, alloc_size);
    // Apply the tag to the pointer: the 4-bit tag lives in bits 56-59
    return (void*)(((uintptr_t)tag_byte << 56) | (uintptr_t)allocation);
}

In a proper implementation, the malloc function would do the tagging on its own; also, there are instructions in ARM to apply the tag to a memory region and to a pointer to that region at the same time, but for the sake of the example, I separated that out.

Read/Write

During a Read or Write operation, the pointer that holds the address must also contain the correct tag value. In this case, correct means that the bits embedded in the pointer match the bits in the tag store associated with the address range of the read/write. If the tag is correct, the read/write happens successfully, like normal. If the tag is incorrect, then the system faults. This can result in a segfault, an exception, a hardware interrupt of some sort; basically, it's up to the implementation. What makes MTE so great is that this validation and possible fault happen during every read or write instruction. That is to say that there is no special "tagged read" or "tagged write" instruction. If MTE is enabled, the normal LDR, LDP, STR, STP, etc. will cause the tag in the address register to be validated. So if normal code uses normal means to allocate memory (e.g. malloc/calloc or even mmap), then any time an address in that range is referenced, the tag is validated, completely invisibly to the code.

This process can be pseudo-coded as the following functions for generic 'read' and 'write':

#include <signal.h>
#include <stdint.h>

// check_tag_for_addr() stands in for the hardware's tag-store lookup
uint64_t read64(void* addr, uintptr_t offset) {
    void* read_addr = (char*)addr + offset;
    if (!check_tag_for_addr(read_addr)) raise(SIGSEGV);
    return *((uint64_t*)read_addr);
}

void write64(void* addr, uintptr_t offset, uint64_t val) {
    void* write_addr = (char*)addr + offset;
    if (!check_tag_for_addr(write_addr)) raise(SIGSEGV);
    *((uint64_t*)write_addr) = val;
}

Now, again, this process would take place within the execution of the assembly instruction itself, so the last line in each of those functions -the one that actually reads/writes- would be the only assembly instruction. With MTE enabled, the assembly for those functions would look like:

read64:
    LDR X0, [X0, X1]
    RET
write64:
    STR X2, [X0, X1]
    RET

Obviously, these functions would never exist on their own; the instructions would just be inlined in functions as normal reads and writes. That's what makes it so great: the validation happens within the normal read/write.

Memory Deallocation

Deallocation has two effects. First, the tag is validated. If the validation fails, bad things happen, in the same way as with the read/write faults: a stale pointer is being freed, which is bad, so a fault is raised. How can I say that it's a stale pointer being freed? Because if the tag validation succeeds, the allocation is un-tagged (or re-tagged; again, up to the implementation). This means that a double-free will likely result in an invalid tag.

The implementation of this is a little more complicated, and necessarily intertwined with the implementation of 'free' because, if you remember your stdlib signatures, free takes only the address of the allocation to be freed. The heap implementation will be able to derive the size of the allocation, and therefore will be able to un-tag everything that was tagged, but without going into the internals of heap implementations, it's a little complicated to show here.
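
Still, as a minimal sketch -assuming the heap can report an allocation's usable size (glibc's malloc_usable_size here), and with clear_tag_for_range standing in for the real tag-store update- a tagging-aware free might look roughly like this:

#include <malloc.h>    // malloc_usable_size (glibc)
#include <stdint.h>
#include <stdlib.h>

// Stand-in for the hardware tag-store update; a real implementation would use
// the MTE tagging instructions across the block.
static void clear_tag_for_range(void* addr, size_t size) { (void)addr; (void)size; }

void free_mte(void* tagged_ptr) {
    if (tagged_ptr == NULL) return;
    // Strip the tag byte to recover the address the heap originally handed out
    void* untagged = (void*)((uintptr_t)tagged_ptr & 0x00FFFFFFFFFFFFFFULL);
    // A real implementation would validate the pointer's tag against the tag
    // store here and fault on a mismatch
    size_t size = malloc_usable_size(untagged);
    // Un-tag (or re-tag) the whole block so any stale copy of this pointer
    // mismatches from now on
    clear_tag_for_range(untagged, size);
    free(untagged);
}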

Isn't segfaulting dangerous?

Yes. In short, just raising a segfault is dangerous; it means the app will crash. But the system will be allowed to continue. So it's not really that bad. Further, segfaulting is much better than risking a malicious actor having arbitrary memory access.

I must also say that there are multiple proposed strategies for handling a failed tag validation. The most extreme example is a segfault; this would be described as a "fail-secure" solution. There is an alternative strategy, one that is less extreme but no less effective.

The less extreme strategy is to record telemetry data and possibly enable higher levels of surveillance on the offending process when an invalid tag is detected. Virtually all exploit chains require more than one out of bounds read/write. This means that they will almost certainly be detected. When detected, anything they do can be more easily reverse engineered, and any further vulnerabilities can be patched. Even the initial vulnerability may be caught on the first invalid access, depending on what form it takes.

Further, combined strategies can be used, e.g., if a process fails 3-5 tag validations, that process is stopped. This would help mitigate really bad memory management, so applications would be essentially unusable unless they were handling memory correctly. When deployed, this strategy would also be likely to stop attackers, because that is still a very small number of accesses.

Wait... This sounds a little familiar...

If you're saying this, you're probably thinking of Address Sanitizer (ASAN) or Hardware-Assisted Address Sanitizer (HWASAN). Yes, it's similar, but MTE takes that idea further. ASAN is software-driven and adds a lot of overhead in space and time, AND it won't catch nearly as much as MTE will. Even Hardware-Assisted ASAN, while faster/smaller than ASAN, pales in comparison to MTE. (Or similar implementations.)

MTE has much less code overhead; because it's built into the CPU architecture, it's significantly faster, and it requires much less memory. On top of all that, MTE is much more capable of detecting bad accesses. Probabilistically, but more than 90% of the time, out-of-bounds accesses, use after free, stack smashing, heap smashing, and even some type confusions can be caught at runtime.

How Exactly?

We've talked a lot about exploits and MTE, compared it to (HW)ASAN, but how does it actually catch bad guys red-handed?

Basically, these bugs rely on accessing memory outside the allocation they reference. Either in space or time.

Buffer Over/Under flow or Type Confusion

I'll group buffer abuse and type confusion together here because, as far as MTE is concerned, they work the same way: access outside the spatial allocation. Remember, the point of MTE is to protect against access outside the allocation, and the way it does this is by catching invalid accesses via bad/unmatched tags for the address. When you try to access an address that is an arbitrary offset from an address with a proper tag, there is a 1/16 (6.25%) chance that the tag will happen to match and the read/write operation will succeed. This means that 15/16 of the time (93.75%) an invalid access will be caught before it has the chance to do anything bad. This may seem like it leaves room for attackers; however, it really doesn't. Even if whatever process let an attacker try a read or write were repeatable, they would likely require more than one read/write to accomplish their goal. With two operations, the odds of getting away with it drop to 0.391% (1/256).
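
If you want to see how fast those odds collapse, here's a quick throwaway calculation:

#include <math.h>
#include <stdio.h>

// Chance that N independent out-of-bounds accesses all guess the right 4-bit tag.
int main(void) {
    for (int n = 1; n <= 4; n++) {
        double p_evade = pow(1.0 / 16.0, n);
        printf("%d access(es): %.4f%% chance of evading detection\n", n, p_evade * 100.0);
    }
    return 0;
}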

Use After Free

This type of bug comes about when memory is not being handled safely. Specifically, if memory is allowed to be reused before it's freed, or freed and reallocated before its last use.

Here's an example of what I mean:

uint32_t* first = calloc(1, sizeof(uint32_t));
uint32_t* second = NULL;

free(first);

second = malloc(sizeof(uint32_t));
*second = 0;
*first = 1;     // use after free: 'first' was freed above
(*second)++;
printf("First @ 0x%016" PRIxPTR " = %" PRIu32 "\n", (uintptr_t)first, *first);
printf("Second @ 0x%016" PRIxPTR " = %" PRIu32 "\n", (uintptr_t)second, *second);

Hopefully you can see that this will result in undefined behavior. Everything is fine until the line that has "*first = 1;". Because 'first' has been freed, that pointer is stale. It may point to some unused place on the heap, or it may point to the same place that 'second' points to. There's no way of knowing from this snippet. However, if MTE is applied to this situation, when 'first' is freed, that memory will be untagged, and when 'second' is allocated, that memory will be tagged. So let's examine the possible outcomes:

First Points to Unused Heap Space

In this case, access to first will be caught by the tagging extension because first still contains a tag in the top byte, but that memory space has been untagged already. So that will almost certainly cause a fault.

First Points to the Same Memory as Second

In this case, the element in the tag store that tracks the space allocated for 'second' has been retagged since it was allocated for 'first', so, again, the access will almost certainly be caught. 

That sounds great, but how many arms and legs will it cost?

I know I'm selling this hard, but the benefits are important. There are costs, and they're not negligible. In my opinion, they're worth it; however, that will, in the end, be up to the engineers working on tomorrow's designs to decide.

Transistors!

To me, the most obvious cost is actual transistors on the chip. Anytime you add functionality or capacity or anything to an integrated circuit (which processors are) it will cost transistors. These transistors could be used for something else.

If you don't know about the lower workings of processors or ICs in general, think of the substrate as a markerboard. You can draw whatever you want, but you're only allowed a certain amount of space. If you fill it all in with one component, you run out of space for anything else.

Tag Block Size as a Memory Cost

One cost of MTE is the added memory consumption. This comes in two different places. The first -and most obvious- is the tag store. This store must exist somewhere, either in normal memory space or in dedicated hardware; it makes no difference. If you only have so much space on your chip, some of that space must be carved out for the tag store.

The second place you'll lose out on memory is the added consumption per reservation. Different allocation methods have different reservation granularities. Generally, the stack is aligned to the register size of the CPU, so 32-bit CPUs use a 4-byte-aligned stack and, likewise, 64-bit CPUs use 8-byte-aligned stacks. Unless you are using some special-purpose memory structure -like a custom heap- the stack will usually use the smallest alignment. Malloc is guaranteed to return an allocated address that is aligned to the largest aligned read/write size of the CPU, essentially the same alignment as the stack. The important thing to note here is that malloc may return a more strictly aligned piece of memory: if you're operating on a 64-bit CPU and malloc returns an address guaranteed to be 16-byte aligned, then the guarantee that malloc returns 8-byte-aligned addresses is still satisfied. This is necessarily the case for MTE implementations.

Because the Memory Tagging Extension only tracks memory addresses with a certain granularity (16 bytes), the allocation system (the heap implementation in the case of malloc/free) must actually ensure that it only allocates in increments of the tag granularity.

Tradeoffs on the Stack

Because using MTE will actually consume more instructions, there is going to be a processing time cost. Up to this point, we've basically been talking about MTE with respect to the heap, but, MTE can be used anywhere.

On the stack, MTE could be extremely useful for protecting members of the stack frame. The downside is that each element on the stack would likely need to be tagged separately; this means that rather than one instruction to allocate the stack space, it would take a number of instructions proportional to the number of variables on the stack. Then, if there were stack space re-use, that space would likely need to be retagged as well!

Tagging Object Members and Array Elements

While it may seem like it could be possible to tag separate members or elements of structs or arrays, it's just not possible. In some higher level languages it may be possible, but for those the different elements in the classes/arrays are more like separate allocations. I'm really talking about low level languages (C/C++).

Imagine, for a minute... a Protocol Data Unit (PDU/packet) with a type field and a payload field. Those could be tagged separately, right? Sure. The problem is, when you read the type field, it will tell you that you need to cast the packet to a different subtype of PDU. Anyone who's dealt with protocols or file formats will be familiar with this; it's everywhere. What this means is that unless all the possible subtypes of PDU have exactly the same structure, the fields in any subtype cannot be tagged. Because C/C++ data types are handled identically whether they're nested in another type or stand on their own, those sub-PDU types can never have their members tagged separately. And because any struct may be used as a subtype of a PDU, or cast from another struct type (the compiler doesn't know what you're doing), no struct can ever have its members tagged.
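
To make that concrete, here's a toy PDU layout (the types and fields are invented) showing why per-member tags couldn't survive the cast:

#include <stdint.h>

// A common header, then a cast to the full message type based on the 'type' field.
struct pdu_header { uint8_t type; uint8_t flags; uint16_t length; };
struct pdu_hello  { struct pdu_header hdr; uint32_t version; };
struct pdu_data   { struct pdu_header hdr; uint8_t payload[64]; };

void handle(struct pdu_header* msg) {
    switch (msg->type) {
    case 1: {
        // The same bytes are now read through a larger type; if hdr's members
        // carried their own tags, this cast (and every one like it) would fault.
        struct pdu_hello* hello = (struct pdu_hello*)msg;
        (void)hello->version;
        break;
    }
    case 2: {
        struct pdu_data* data = (struct pdu_data*)msg;
        (void)data->payload[0];
        break;
    }
    }
}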

Arrays are similar, basically what it comes down to is that C/C++ allow you to do some really nifty things with arrays and you can't start tagging the elements separately without taking a lot of power away.

Can we preserve ABI while implementing MTE?

ABI, or Application Binary Interface -that is, the definition of the interface into a binary: functions, objects, etc.- can be preserved.

Remember, I said that everything between allocation and deallocation is handled the same as if MTE were not present. This means that only allocation and deallocation need to be updated. On the stack, both are handled in the same place, so there's no interface change, because the same code is allocating (and would have to tag the memory) and deallocating (and would want to re/un-tag the memory).

Tagging the heap works in the same way. Despite free possibly being called in an entirely different part of the code than malloc, malloc and free are implemented by the same library, so whatever code is in charge of allocation may tag and the deallocation code may untag. The callers of these functions can ignore that aspect entirely.

If it were possible to tag different members of structs differently (it's not, see above) then that would require an ABI change, but that's not something we actually have to worry about.

Into the Wild

We've made it through the technical part of this post. I know it's a long one. If you made it this far, you might have several questions; one of those questions might be, "Ronman, why should we care about MTE, like, if no one has implemented it..."

You'd be correct. But also incorrect. I haven't heard of a system adopting MTE yet, but that doesn't mean that the hardware doesn't exist, nor that software isn't catching up. From the circles that I primarily follow, there are two big names that come up with respect to MTE: Apple and Google.

Apple

Historically, Apple has led the charge on a lot of security/privacy-related features, and up to recently, MTE has been no exception. Apple's current generation of mobile processor, the A14, is ARMv8.5-A, which would include MTE. Apple's latest desktop processor, the M1, would also include MTE. Where they haven't done anything yet is in software: while they likely have all the needed hardware support, they haven't rolled out the code to use it.

Google

Google announced that they were investigating MTE a while back (w.r.t. 2021). This is an interesting switchup because Apple embraced PAC (Pointer Authentication Codes, a story for another day) while Google just seemed to wave as it passed. With MTE, Google seems more motivated, though. In Android 11, they began inserting a dummy tag into the pointers for all allocated memory; if your application frees this memory but has perturbed the dummy tag, your application will crash. This is just to prepare developers for when Google throws the switch... which may come soon.

Google's next phone (w.r.t. June 2021) will be the Pixel 6, which will have their first custom SoC. The Whitechapel chip will almost certainly contain MTE-capable CPU hardware.

This is strictly based on rumor; Wikipedia only knows of two processor cores that support MTE, Apple's Firestorm and Icestorm (https://en.wikipedia.org/wiki/Comparison_of_ARMv8-A_cores).

Rumors based on older information still list older cores that wouldn't support MTE. We'll just have to wait to see.

In Closing...

So MTE is going to drastically change the landscape for security in devices that implement it. This is going to be a really cool time.

If you found this interesting, I'm looking at writing up more on different technical mitigations, most won't be this long, but... Meh.

Further Reading

Armv8.5-A Memory Tagging Extension (Arm Whitepaper)

LLVM's Documentation on their usage of MTE: https://llvm.org/docs/MemTagSanitizer.html

If all else fails, "google it".

