Runtime in Swift: Iterator Performance
September, 2024
In this post I’ll describe how I made my Swift wrapper of the Objective-C runtime as fast as calling the C functions directly. This is the second part in a series:
- The lifetime of strings passed into C functions
- Getting my Swift wrapper to be as fast as calling into C directly (this post)
- Exposing some runtime-related classes that are marked as unavailable in Swift
My main goal was to make a nice Swift API for the runtime and I wasn’t too concerned if it added a little performance overhead. The test scenario I used was to iterate over every method of every class, which looks like this when using the runtime functions directly:
```swift
var classCount: UInt32 = 0
let classes = UnsafeMutablePointer(mutating: objc_copyClassList(&classCount)!)
for j in 0..<Int(classCount) {
    var methodCount: UInt32 = 0
    if let methods = class_copyMethodList(classes[j], &methodCount) {
        for k in 0..<Int(methodCount) {
            let method = methods[k]
            // ...
        }
        free(methods)
    }
}
free(classes)
```
Note that accessing the pointer returned from `objc_copyClassList` directly causes a crash, possibly because Swift mistakenly bridges the return type to `AutoreleasingUnsafeMutablePointer`. Wrapping it in one of the other unsafe pointer types fixes it.
The equivalent using my Swift runtime API looks like this:
```swift
for cls in ObjCClass.all {
    for method in cls.methods {
        // ...
    }
}
```
I think you can agree that looks considerably nicer! Too bad it’s also considerably slower. When I ran a quick comparison using `XCTestCase.measure` I found that my wrapper was three times slower than using the runtime directly.
Performance benchmarking
Time to dig into some performance debugging. I had my suspicions about what the main cause of the slowdown was (heap allocations) but needed some better tools for benchmarking. Luckily I remembered there was a new official Benchmark package announced recently on Swift.org that supports loads of useful metrics around memory usage and allocations, throughput and even CPU instruction counts. It took a few minutes to set up but is definitely worth it for investigating performance issues like this.
Here’s an extract of the benchmark results, where “Direct calls” is the baseline of calling the runtime functions directly, and “Wrapper arrays” is using my Swift API:
| Metric (p90) | Direct calls | Wrapper arrays |
|---|---|---|
| Instructions (K) | 1660 | 8716 |
| Malloc (large) | 1 | 1 |
| Malloc (small) | 2048 | 9086 |
| Memory Δ (resident peak) (K) | 389 | 393 |
| Object allocs | 0 | 7037 |
| Releases | 18 | 9380 |
| Retains | 18 | 19 |
| Throughput (# / s) | 3487 | 1022 |
| Time (total CPU) (μs) | 297 | 987 |
| Throughput as % of baseline | 100% | 29% |
Clearly my wrapper is doing a lot more work allocating objects than it needs to, which would be the array created when `cls.methods` is called. Even though `Array` in Swift is a struct, it still needs to allocate memory for its backing storage.
In this case there were 2040 classes registered for a total of 32064 methods but depending on which frameworks are linked this can easily be 10x higher. How can I avoid creating an array for every one of those 2000 classes?
One option would be to use a block-based API which executes a closure for each method:
```swift
extension ObjCClass {
    func forEachMethod(block: (ObjCMethod) -> Void) {
        var methodCount: UInt32 = 0
        if let methods = class_copyMethodList(cls, &methodCount) {
            for m in 0..<Int(methodCount) {
                block(ObjCMethod(methods[m]))
            }
            free(methods)
        }
    }
}
```

Called like:

```swift
cls.forEachMethod { method in
    // ...
}
```
This runs at ~90% of the speed of calling the C functions directly, which is not bad. Adding `@inlinable` to the function brought that up to the same performance as the baseline, which is even better. `@inlinable` exports the whole function body in the module’s public interface (instead of just the function signature) so that optimisations can be done across module boundaries. I assume that when the Swift compiler can see inside the function it optimises away the closure entirely.
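As a sketch of the shape this takes (with a plain `Int` buffer and a made-up `forEachValue` function standing in for the real runtime types):

```swift
// Sketch only: @inlinable exports the body in the module interface,
// so a client module's optimiser can specialise away the closure.
@inlinable
public func forEachValue(in buffer: UnsafeBufferPointer<Int>, _ body: (Int) -> Void) {
    for value in buffer {
        body(value)
    }
}

// Usage: sum the values without creating an intermediate array.
let storage = UnsafeMutablePointer<Int>.allocate(capacity: 3)
storage.initialize(from: [10, 20, 12], count: 3)
var total = 0
forEachValue(in: UnsafeBufferPointer(start: storage, count: 3)) { total += $0 }
storage.deinitialize(count: 3)
storage.deallocate()
```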
So, problem solved, right?
Well...
I really wanted to keep that “for x in y” syntax; it just looks more natural to me (plus I just can’t resist a challenge). There’s only one way to do that without using arrays.
Iterators
The `for-in` syntax in Swift is available for anything that conforms to the `Sequence` protocol. A sequence has one requirement, and that is to produce an iterator. `IteratorProtocol` also has a single requirement: to implement a `next()` function which returns the next item in the sequence, or nil at the end.
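As a minimal, self-contained illustration of those two protocols (unrelated to the runtime wrapper), here is a countdown sequence:

```swift
// A sequence that counts down from a starting value to 1.
struct Countdown: Sequence {
    let start: Int
    func makeIterator() -> CountdownIterator {
        CountdownIterator(current: start)
    }
}

struct CountdownIterator: IteratorProtocol {
    var current: Int
    // Return the next value, or nil when the sequence is finished.
    mutating func next() -> Int? {
        guard current > 0 else { return nil }
        defer { current -= 1 }
        return current
    }
}

var values: [Int] = []
for value in Countdown(start: 3) {
    values.append(value)
}
// values is now [3, 2, 1]
```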
This means that if I create a custom sequence that iterates over the methods of a class then I can return that instead of an array, thereby avoiding an object allocation but keeping the same syntax at the call site. Here’s what that might look like:
```swift
struct MethodList: Sequence {
    let cls: AnyClass

    func makeIterator() -> MethodIterator {
        MethodIterator(cls: cls)
    }
}

class MethodIterator: IteratorProtocol {
    let methods: UnsafeBufferPointer<Method>
    var index = 0

    init(cls: AnyClass) {
        var methodCount: UInt32 = 0
        let methods = class_copyMethodList(cls, &methodCount)
        self.methods = UnsafeBufferPointer(start: methods, count: Int(methodCount))
    }

    deinit {
        free(UnsafeMutableRawPointer(mutating: methods.baseAddress))
    }

    func next() -> ObjCMethod? {
        guard index < methods.count else { return nil }
        defer { index += 1 }
        return ObjCMethod(methods[index])
    }
}
```
Returning a `MethodList` sequence instead of an array works as expected, but the performance is terrible! It runs at around 20% of the speed of the baseline and somehow needs over eight times the number of CPU instructions. Let’s add it to the results table:
| Metric (p90) | Direct calls | Wrapper arrays | Iterator class |
|---|---|---|---|
| Instructions (K) | 1660 | 8716 | 14000+ |
| Malloc (large) | 1 | 1 | 1 |
| Malloc (small) | 2048 | 9086 | 4372 |
| Memory Δ (resident peak) (K) | 389 | 393 | 397 |
| Object allocs | 0 | 7037 | 2324 |
| Releases | 18 | 9380 | 2342 |
| Retains | 18 | 19 | 18 |
| Throughput (# / s) | 3487 | 1022 | 747 |
| Time (total CPU) (μs) | 297 | 987 | 1348 |
| Throughput as % of baseline | 100% | 29% | 21% |
You might have already spotted the problem: I’ve just replaced a bunch of `Array` allocations with `MethodIterator` objects instead. But there was a good reason for making this a class instead of a struct: I need to free the list pointer in `deinit`. Iterators don’t have a concept of “finishing” iteration, so there’s no way to know when it’s safe to free the pointer except when the iterator itself is deallocated. Maybe I could do it when `next()` returns nil, but what happens if I break out of the for loop before getting to the end?
I thought about using a non-copyable struct, which is a special value type that has a lifetime like a class and which does support `deinit`. Unfortunately non-copyable structs can’t conform to protocols like `IteratorProtocol`, so they can’t help here.
This might be fixed in Swift 6, although I haven’t checked yet whether iterators will be marked as `~Copyable`.
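For reference, a non-copyable struct with a `deinit` looks something like this minimal sketch, where a plain `Int` buffer and the made-up `OwnedBuffer` name stand in for the real method list:

```swift
// Tracks how many buffers have been freed, to observe deinit running.
var freedBuffers = 0

// A non-copyable owner that frees its allocation when it leaves scope.
struct OwnedBuffer: ~Copyable {
    let base: UnsafeMutablePointer<Int>
    let count: Int

    init(count: Int) {
        self.count = count
        self.base = .allocate(capacity: count)
        self.base.initialize(repeating: 0, count: count)
    }

    deinit {
        base.deinitialize(count: count)
        base.deallocate()
        freedBuffers += 1
    }
}

do {
    let buffer = OwnedBuffer(count: 8)
    _ = buffer.count
    // deinit runs here, at the end of the scope
}
```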
Stack-allocated classes
I read somewhere on the Swift forums that in certain cases the compiler can mark classes as eligible for stack promotion. This means that if it knows the layout of a class and can prove that it doesn’t escape its scope, the class can be created on the stack instead of the heap. I couldn’t find more details on this, but I assumed it could only work if the class was visible to the optimiser, rather than being internal to a module and hence hidden from it.
After trying a few things – benchmarking is really handy for this – I found that making everything in my custom sequence and iterator fully `public` and `@inlinable` did the trick! Suddenly I was seeing zero object allocations again and performance that was very close to the baseline, only around 1-10% slower:
| Metric (p90) | Direct calls | Wrapper arrays | Iterator class | Inlined iterator |
|---|---|---|---|---|
| Instructions (K) | 1660 | 8716 | 14000+ | 2087 |
| Malloc (large) | 1 | 1 | 1 | 1 |
| Malloc (small) | 2048 | 9086 | 4372 | 2048 |
| Memory Δ (resident peak) (K) | 389 | 393 | 397 | 397 |
| Object allocs | 0 | 7037 | 2324 | 0 |
| Releases | 18 | 9380 | 2342 | 18 |
| Retains | 18 | 19 | 18 | 18 |
| Throughput (# / s) | 3487 | 1022 | 747 | 3361 |
| Time (total CPU) (μs) | 297 | 987 | 1348 | 308 |
| Throughput as % of baseline | 100% | 29% | 21% | 96% |
A downside to this approach is that `MethodList`, `MethodIterator` and all the functions inside them need to be made public, although they are only implementing simple protocols so it isn’t a big issue. Thankfully the properties inside them don’t also need to be made public if they’re marked as `@usableFromInline` instead.
Other observations
Here are a couple of interesting points I noticed or confirmed along the way:
- Boxing a single type in a struct adds no overhead at runtime; it’s as if the type is only for the benefit of the compiler. To confirm this I tried using the runtime’s `Method` type directly instead of wrapping it in my `ObjCMethod` struct, and it made zero difference to the number of CPU instructions.
- Be careful when using `map()` on a sequence, because it will create an intermediate array, which can be expensive. I was using this on an `UnsafeBufferPointer<Method>` like `methods.map(ObjCMethod.init)`, and even though wrapping it in a struct is “free” it was still creating an unnecessary array. Adding `.lazy` made it much faster and surprisingly added almost no overhead.
- Stack-allocated classes are almost identical to structs in performance. Changing `MethodIterator` to a struct and commenting out `deinit` barely reduced the number of CPU instructions (although of course it then leaked memory).
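The `.lazy` point can be seen in a self-contained form with a plain `Int` buffer standing in for the `Method` pointers (the `Boxed` type here is a made-up stand-in for `ObjCMethod`):

```swift
// Wrapper struct standing in for ObjCMethod; boxing a single value
// in a struct adds no runtime cost.
struct Boxed {
    let value: Int
}

let rawInts = UnsafeMutablePointer<Int>.allocate(capacity: 4)
rawInts.initialize(from: [1, 2, 3, 4], count: 4)
let intsBuffer = UnsafeBufferPointer(start: rawInts, count: 4)

// Eager map: allocates an intermediate [Boxed] array up front.
let eager = intsBuffer.map(Boxed.init)

// Lazy map: wraps each element on demand, with no intermediate array.
var sum = 0
for boxed in intsBuffer.lazy.map(Boxed.init) {
    sum += boxed.value
}

rawInts.deinitialize(count: 4)
rawInts.deallocate()
// sum is 10; eager holds 4 wrapped elements
```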
One last trick
At the top of the post I got a list of all the runtime classes using `objc_copyClassList`, which allocates an array of classes for you and returns it. There’s a related function called `objc_getClassList` which takes an already-allocated buffer and fills it for you. Conveniently, Swift’s `Array` has a low-level initialiser which will give you the underlying buffer to fill. This means an array of classes can be created from the runtime very efficiently, without any additional allocation or looping.
Here’s what that looks like:
```swift
struct ObjCClass {
    let cls: AnyClass

    static var allClasses: [ObjCClass] {
        let classCount = objc_getClassList(nil, 0)
        return [ObjCClass](unsafeUninitializedCapacity: Int(classCount)) { buffer, initializedCount in
            let classCount2 = objc_getClassList(
                AutoreleasingUnsafeMutablePointer(buffer.baseAddress),
                classCount
            )
            initializedCount = Int(min(classCount, classCount2))
        }
    }
}
```
What this is doing:
- The `objc_getClassList` function first needs to be called with no buffer so it returns the total number of classes.
- The array initialiser exposes a buffer of the specified size to be filled.
- On the second call to `objc_getClassList` I get the class count again because this function is a bit weird: it returns the total number of registered classes instead of the number it filled into the buffer. Normally this wouldn’t be an issue, except that the number of classes is different the second time it’s called!
- The `initializedCount` needs to be set to how much of the buffer was actually filled.
- I’m doing something slightly sneaky here and using the fact that `AnyClass` and `ObjCClass` have identical memory layouts to cast the buffer directly instead of mapping the type of each item. This happens in the `AutoreleasingUnsafeMutablePointer` initialiser, which discards the type of the underlying pointer.
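The same `Array(unsafeUninitializedCapacity:)` pattern works with any C-style “fill this buffer” function. Here is a self-contained sketch, with a made-up `fillSquares` function in place of `objc_getClassList`:

```swift
// A stand-in for a C function that fills a caller-provided buffer
// and returns how many elements it wrote.
func fillSquares(_ buffer: UnsafeMutablePointer<Int>, capacity: Int) -> Int {
    let written = min(capacity, 5)
    for i in 0..<written {
        buffer[i] = (i + 1) * (i + 1)
    }
    return written
}

// Create the array directly into its own storage: no temporary
// buffer, no copy, no per-element append.
let squares = [Int](unsafeUninitializedCapacity: 5) { buffer, initializedCount in
    initializedCount = fillSquares(buffer.baseAddress!, capacity: buffer.count)
}
// squares is [1, 4, 9, 16, 25]
```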
This technique not only returns a proper Swift `Array`, but actually turns out to be slightly faster than the “C style” iteration in the first code snippet.
Overall this was an enlightening adventure in benchmarking but it’s easy to spend a lot of time on these micro-optimisations. Unless you’re working on really performance-critical code, code readability and maintainability is usually far more important. It never hurts to have an idea of what’s happening under the hood though!
Useful links
- Next post: Exposing some runtime-related classes that are marked as unavailable in Swift
- Previous post: The lifetime of strings passed into C functions
- The Swift Benchmark package and announcement
- Apple’s docs on the `@inlinable` attribute
- Non-copyable structs proposal and improvements in Swift 6
Any comments or questions about this post? ✉️ nick @ this domain.
— Nick Randall