
Runtime in Swift: Iterator Performance

September 2024

In this post I’ll describe how I made my Swift wrapper of the Objective-C runtime as fast as calling the C functions directly. This is the second part in a series:

  1. The lifetime of strings passed into C functions
  2. Getting my Swift wrapper to be as fast as calling into C directly (this post)
  3. Exposing some runtime-related classes that are marked as unavailable in Swift

My main goal was to make a nice Swift API for the runtime and I wasn’t too concerned if it added a little performance overhead. The test scenario I used was to iterate over every method of every class, which looks like this when using the runtime functions directly:

var classCount: UInt32 = 0
let classes = UnsafeMutablePointer(mutating: objc_copyClassList(&classCount)!)
for j in 0..<Int(classCount) {
    var methodCount: UInt32 = 0
    if let methods = class_copyMethodList(classes[j], &methodCount) {
        for k in 0..<Int(methodCount) {
            let method = methods[k]
            // ...
        }
        free(methods)
    }
}
free(classes)

Note that accessing the pointer returned from objc_copyClassList directly causes a crash, possibly because Swift mistakenly bridges the return type to AutoreleasingUnsafeMutablePointer. Wrapping it in one of the other unsafe pointer types fixes it.

The equivalent using my Swift runtime API looks like this:

for cls in ObjCClass.all {
    for method in cls.methods {
        // ...
    }
}

I think you can agree that looks considerably nicer! Too bad it’s also considerably slower. When I ran a quick comparison using XCTestCase.measure I found that my wrapper was three times slower than using the runtime directly.

Performance benchmarking

Time to dig into some performance debugging. I had my suspicions about what the main cause of the slowdown was (heap allocations) but needed some better tools for benchmarking. Luckily I remembered there was a new official Benchmark package announced recently on Swift.org that supports loads of useful metrics around memory usage and allocations, throughput and even CPU instruction counts. It took a few minutes to set up but is definitely worth it for investigating performance issues like this.
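For reference, declaring a benchmark with that package looks roughly like this. This is a sketch based on the package-benchmark API as I understand it — the metric list is illustrative rather than exhaustive, and the body is the "direct calls" loop from above fed through the package's blackHole function so the optimiser can't delete it:

```swift
import Benchmark
import ObjectiveC

let benchmarks = {
    Benchmark(
        "Direct calls",
        configuration: .init(metrics: [.cpuTotal, .instructions, .mallocCountTotal, .throughput])
    ) { benchmark in
        var classCount: UInt32 = 0
        let classes = UnsafeMutablePointer(mutating: objc_copyClassList(&classCount)!)
        for j in 0..<Int(classCount) {
            var methodCount: UInt32 = 0
            if let methods = class_copyMethodList(classes[j], &methodCount) {
                for k in 0..<Int(methodCount) {
                    // blackHole keeps the optimiser from eliding the work
                    blackHole(methods[k])
                }
                free(methods)
            }
        }
        free(classes)
    }
}
```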

Here’s an extract of the benchmark results, where “Direct calls” is the baseline of calling the runtime functions directly, and “Wrapper arrays” is using my Swift API:

Metric (p90)                    Direct calls    Wrapper arrays
Instructions (K)                1660            8716
Malloc (large)                  1               1
Malloc (small)                  2048            9086
Memory Δ (resident peak) (K)    389             393
Object allocs                   0               7037
Releases                        18              9380
Retains                         18              19
Throughput (# / s) (#)          3487            1022
Time (total CPU) (μs)           297             987
Throughput as % of baseline     100%            29%

Clearly my wrapper is doing a lot more work allocating objects than it needs to, which would be the array created when cls.methods is called. Even though Array in Swift is a struct, it still needs to allocate memory for its backing storage.

In this case there were 2040 classes registered for a total of 32064 methods but depending on which frameworks are linked this can easily be 10x higher. How can I avoid creating an array for every one of those 2000 classes?

One option would be to use a block-based API which executes a closure for each method:

extension ObjCClass {
    func forEachMethod(block: (ObjCMethod) -> Void) {
        var methodCount: UInt32 = 0
        if let methods = class_copyMethodList(cls, &methodCount) {
            for m in 0..<Int(methodCount) {
                block(ObjCMethod(methods[m]))
            }
            free(methods)
        }
    }
}

// Called like:

cls.forEachMethod { method in
    // ...
}

This runs at ~90% of the speed of calling the C functions directly, which is not bad. Adding @inlinable to the function brought that up to the same performance as the baseline, which is even better. @inlinable exports the whole function body in the module's public interface (instead of just the function signature) so that optimisations can be done across module boundaries. I assume that when the Swift compiler can see inside the function it optimises away the closure entirely.
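As a sketch, the annotated version looks something like this. To keep it self-contained I'm passing the runtime's raw Method type rather than the ObjCMethod wrapper, and the init is my own assumption about how ObjCClass is constructed:

```swift
import ObjectiveC

public struct ObjCClass {
    // Stored properties referenced from an @inlinable body must be visible
    // to clients, so this is @usableFromInline rather than private.
    @usableFromInline let cls: AnyClass

    public init(_ cls: AnyClass) { self.cls = cls }

    // @inlinable exports the body in the module interface, letting the
    // optimiser in the client module specialise away the closure.
    @inlinable
    public func forEachMethod(block: (Method) -> Void) {
        var methodCount: UInt32 = 0
        if let methods = class_copyMethodList(cls, &methodCount) {
            for m in 0..<Int(methodCount) {
                block(methods[m])
            }
            free(methods)
        }
    }
}
```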

So problem solved right?

Well...

I really wanted to keep that “for x in y” syntax, it just looks more natural to me (plus I just can’t resist a challenge). There’s only one way to do that without using arrays.

Iterators

The for-in syntax in Swift is available for anything that conforms to the Sequence protocol. A sequence has one requirement, and that is to produce an iterator. The IteratorProtocol also has a single requirement: to implement a next() function which returns the next item in the sequence, or nil at the end.
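As a minimal illustration of those two requirements — nothing to do with the runtime, just the protocol shape — here's a toy countdown sequence:

```swift
// A toy Sequence/IteratorProtocol pair showing the two requirements:
// makeIterator() on the sequence and next() on the iterator.
struct Countdown: Sequence {
    let start: Int
    func makeIterator() -> CountdownIterator {
        CountdownIterator(current: start)
    }
}

struct CountdownIterator: IteratorProtocol {
    var current: Int
    mutating func next() -> Int? {
        guard current > 0 else { return nil }
        defer { current -= 1 }
        return current
    }
}

// for n in Countdown(start: 3) { ... } visits 3, 2, 1 and then stops.
```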

This means that if I create a custom sequence that iterates over the methods of a class then I can return that instead of an array, thereby avoiding an object allocation but keeping the same syntax at the call site. Here’s what that might look like:

struct MethodList: Sequence {
    let cls: AnyClass

    func makeIterator() -> MethodIterator {
        MethodIterator(cls: cls)
    }
}

class MethodIterator: IteratorProtocol {
    let methods: UnsafeBufferPointer<Method>
    var index = 0

    init(cls: AnyClass) {
        var methodCount: UInt32 = 0
        let methods = class_copyMethodList(cls, &methodCount)
        self.methods = UnsafeBufferPointer(start: methods, count: Int(methodCount))
    }

    deinit {
        free(UnsafeMutableRawPointer(mutating: methods.baseAddress))
    }

    func next() -> ObjCMethod? {
        guard index < methods.count else { return nil }
        defer { index += 1 }
        return ObjCMethod(methods[index])
    }
}

Returning a MethodList sequence instead of an array works as expected, but the performance is terrible! It runs at around 20% the speed of the baseline and somehow needs over eight times the number of CPU instructions. Let’s add it to the results table:

Metric (p90)                    Direct calls    Wrapper arrays    Iterator class
Instructions (K)                1660            8716              14000+
Malloc (large)                  1               1                 1
Malloc (small)                  2048            9086              4372
Memory Δ (resident peak) (K)    389             393               397
Object allocs                   0               7037              2324
Releases                        18              9380              2342
Retains                         18              19                18
Throughput (# / s) (#)          3487            1022              747
Time (total CPU) (μs)           297             987               1348
Throughput as % of baseline     100%            29%               21%

You might have already spotted the problem: I’ve just replaced a bunch of Array allocations with MethodIterator objects instead. But there was a good reason for making this a class instead of a struct: I need to free the list pointer in deinit. Iterators don’t have a concept of “finishing” iteration, so there’s no way to know when it’s safe to free the pointer except when the iterator itself is deallocated. Maybe I could do it when next() returns nil, but what happens if I break out of the for loop before getting to the end?

I thought about using a noncopyable struct, a special kind of value type that has a class-like lifetime and does support deinit. Unfortunately noncopyable structs can’t conform to protocols like IteratorProtocol, so they can’t help here.

This might be fixed in Swift 6 although I haven’t checked yet if iterators will be marked as ~Copyable.
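For illustration, the noncopyable version would have looked something like this. It's a sketch only — the type itself compiles (in Swift 5.9+), but without IteratorProtocol conformance it can't drive a for-in loop:

```swift
import ObjectiveC

// deinit on a ~Copyable struct gives class-like cleanup without a heap
// allocation, but (as of Swift 5.x) the type can't adopt IteratorProtocol.
struct MethodBuffer: ~Copyable {
    let methods: UnsafeBufferPointer<Method>

    init(cls: AnyClass) {
        var count: UInt32 = 0
        let list = class_copyMethodList(cls, &count)
        methods = UnsafeBufferPointer(start: list, count: Int(count))
    }

    deinit {
        // Freed exactly once, when the value's lifetime ends — even on an
        // early exit from the scope that owns it.
        free(UnsafeMutableRawPointer(mutating: methods.baseAddress))
    }
}
```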

Stack-allocated classes

I read somewhere on the Swift forums that in certain cases the compiler can mark classes as eligible for stack promotion. This means that if it knows the layout of a class and can prove that it doesn’t escape its scope, the class can be created on the stack instead of the heap. I couldn’t find more details on this, but I assumed it could only work if the class wasn’t internal to a module, and therefore hidden from the optimiser at the point of use.

After trying a few things – benchmarking is really handy for this – I found that making everything on my custom sequence and iterator fully public and @inlinable did the trick! Suddenly I was seeing zero object allocations again and performance that was very close to the baseline, only around 1-10% slower:

Metric (p90)                    Direct calls    Wrapper arrays    Iterator class    Inlined iterator
Instructions (K)                1660            8716              14000+            2087
Malloc (large)                  1               1                 1                 1
Malloc (small)                  2048            9086              4372              2048
Memory Δ (resident peak) (K)    389             393               397               397
Object allocs                   0               7037              2324              0
Releases                        18              9380              2342              18
Retains                         18              19                18                18
Throughput (# / s) (#)          3487            1022              747               3361
Time (total CPU) (μs)           297             987               1348              308
Throughput as % of baseline     100%            29%               21%               96%

A downside to this approach is that MethodList, MethodIterator and all the functions inside them need to be made public, although they only implement simple protocols so it isn’t a big issue. Thankfully the properties inside them don’t also need to be made public if they’re marked as @usableFromInline instead.
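Putting that together, the annotated types look roughly like this. It's a sketch of the annotation pattern rather than the exact shipping code — next() returns the runtime's raw Method type here to keep it self-contained, where the real API wraps it in ObjCMethod:

```swift
import ObjectiveC

public struct MethodList: Sequence {
    @usableFromInline let cls: AnyClass

    @usableFromInline init(cls: AnyClass) { self.cls = cls }

    @inlinable
    public func makeIterator() -> MethodIterator {
        MethodIterator(cls: cls)
    }
}

public class MethodIterator: IteratorProtocol {
    // Stored properties touched from @inlinable bodies need
    // @usableFromInline, but not full public visibility.
    @usableFromInline let methods: UnsafeBufferPointer<Method>
    @usableFromInline var index = 0

    @inlinable
    public init(cls: AnyClass) {
        var methodCount: UInt32 = 0
        let list = class_copyMethodList(cls, &methodCount)
        methods = UnsafeBufferPointer(start: list, count: Int(methodCount))
    }

    deinit {
        free(UnsafeMutableRawPointer(mutating: methods.baseAddress))
    }

    @inlinable
    public func next() -> Method? {
        guard index < methods.count else { return nil }
        defer { index += 1 }
        return methods[index]
    }
}
```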

Other observations

Here are a couple of interesting points I noticed or confirmed along the way:

  • Boxing a single type in a struct adds no overhead at runtime; it’s as if the type exists only for the benefit of the compiler. To confirm this I tried using the runtime’s Method type directly instead of wrapping it in my ObjCMethod struct and it made zero difference to the number of CPU instructions.
  • Be careful when using map() on a sequence because it will create an intermediate array which can be expensive. I was using this on an UnsafeBufferPointer<Method> like methods.map(ObjCMethod.init) and even though wrapping it in a struct is “free” it was still creating an unnecessary array. Adding .lazy made it much faster and surprisingly added almost no overhead.
  • Stack-allocated classes are almost identical to structs in performance. Changing MethodIterator to a struct and commenting out deinit barely reduced the number of CPU instructions (although of course it then leaked memory).
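The map() point above can be illustrated with plain values (a toy example, not the runtime code):

```swift
let values = [1, 2, 3, 4]

// map allocates and fills a whole new Array up front.
let eager = values.map { $0 * 2 }

// lazy.map returns a LazyMapSequence that just wraps `values` and computes
// each element on access — no intermediate Array is allocated.
let deferred = values.lazy.map { $0 * 2 }

for v in deferred {
    print(v) // 2, 4, 6, 8
}
```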

One last trick

At the top of the post I got a list of all the runtime classes using objc_copyClassList, which allocates an array of classes for you and returns it. There’s a related function called objc_getClassList which takes an already-allocated buffer and fills it for you. Conveniently, Swift’s Array has a low-level initialiser which will give you the underlying buffer to fill. This means an array of classes can be created from the runtime very efficiently without any additional allocation or looping.

Here’s what that looks like:

struct ObjCClass {
    let cls: AnyClass

    static var allClasses: [ObjCClass] {
        let classCount = objc_getClassList(nil, 0)
        return [ObjCClass].init(unsafeUninitializedCapacity: Int(classCount)) {
            buffer, initializedCount in

            let classCount2 = objc_getClassList(
                AutoreleasingUnsafeMutablePointer(buffer.baseAddress),
                classCount
            )
            initializedCount = Int(min(classCount, classCount2))
        }
    }
}

What this is doing:

  1. The objc_getClassList function first needs to be called with no buffer so it returns the total number of classes.
  2. The array initialiser exposes a buffer of the specified size to be filled.
  3. On the second call to objc_getClassList I get the class count again because this function is a bit weird: it returns the total number of registered classes instead of the number it filled into the buffer. Normally this wouldn’t be an issue, except that the number of classes is different the second time it’s called!
  4. The initializedCount needs to be set to how much of the buffer was actually filled.
  5. I’m doing something slightly sneaky here and using the fact that AnyClass and ObjCClass have identical memory layouts to cast the buffer directly instead of mapping the type of each item. This happens in the AutoreleasingUnsafeMutablePointer initialiser, which discards the type of the underlying pointer.

This technique not only returns a proper Swift Array, but actually turns out to be slightly faster than the “C style” iteration in the first code snippet.


Overall this was an enlightening adventure in benchmarking but it’s easy to spend a lot of time on these micro-optimisations. Unless you’re working on really performance-critical code, code readability and maintainability is usually far more important. It never hurts to have an idea of what’s happening under the hood though!


Any comments or questions about this post? ✉️ nick @ this domain.

— Nick Randall
