If you haven't watched it yet, it's worth sitting through the PDX talk with John Wordsworth (Technical Director) and Mark Dickie (Engine Team Lead). It covers a lot of PDS history, the current situation, the anatomy of a frame, what each thread contains, engine dev team sizes and much more.
https://twitter.com/JohnWordsworth/status/1194910120506998784
As the person who discovered that vacant jobs cause the performance drop, I can give a bit more insight based on the presentation and the slides shared on Twitter.
Disclaimer: I am a software developer and do this for a living (working on games, that is, not just web or back-end development). However, anything I say should be taken with a grain of salt and as an approximation of what's going on, as I could still be wrong. Only by looking at the actual code and running tests can anyone draw conclusions. What follows is simplified:
https://t.co/Ijk4mKh3bd
So, with that out of the way, the important slide is slide number 36. Right in the middle of that 16ms span the engine does "Game updates/Process Commands", and it does so in a multithreaded manner as well. So why don't we experience that, and why does the game instead crawl to a halt with low CPU utilization?
Answer: As you can see, each thread deals with a separate game subsystem. Checking for the pops is part of a single task that takes a single "slot", if you like; others deal with ships, gates, migration and so on. These tasks have different performance profiles, based on what memory and what other subsystems they touch, so it's impossible to expect them all to finish at the same time. The whole system was designed with a specific "workload budget" in mind, even before v1.0 shipped. Back then we had the tile system and economy.
Now along comes the new economy update and throws that budget out of whack. Once population in the galaxy sprawls, the thread responsible for the economy needs more time to finish. Sure, it splits up the planets and only does a subset of its work on each frame, postponing the rest to the next frame and so on, going through all colonies every month and doing a myriad of other things related to the economy as well. It works perfectly, otherwise you wouldn't be able to play. The resulting data changes (the "delta", if you wish) must be sent through the multiplayer layer on each tick (every 16ms or so) to keep all the game clients in sync. This is also done for job placements, because they affect the reality on the colonies and the resulting resources. In single player games this does not happen, and in general it's not what causes the slowdown; I'm just mentioning it here because it's important.
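To make the "subset of its work on each frame" idea concrete, here is a minimal sketch of round-robin time slicing. It is not the engine's actual scheduler: the function name, the colony list and the value of ticks_per_month are all made up for illustration.

```python
def colonies_for_tick(colonies, tick, ticks_per_month):
    """Return the subset of colonies to update on this tick (round-robin)."""
    slot = tick % ticks_per_month
    # Every ticks_per_month-th colony, starting at this tick's slot.
    return colonies[slot::ticks_per_month]


colonies = [f"colony_{i}" for i in range(10)]  # hypothetical data
ticks_per_month = 4                            # hypothetical value

visited = []
for tick in range(ticks_per_month):
    batch = colonies_for_tick(colonies, tick, ticks_per_month)
    visited.extend(batch)  # a real engine would run the economy update here

# After one "month" of ticks every colony has been visited exactly once.
print(sorted(visited) == sorted(colonies))
```

The point is that the per-tick cost stays small as long as the colony count stays within budget; when the colony and pop counts grow, each slice grows with them.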
As that workload increases with more pops, that specific task's completion time dominates all work, and all other tasks in all other threads wait for it to finish, essentially making the game "single threaded". Also, as the AI or player introduces more subspecies through xeno compatibility or otherwise, that task slows down further, since it has to access the subspecies data and its single cache becomes overloaded, swapping lines repeatedly (cache thrashing) and increasing memory access time for most operations. This explains some of the graphs that people produced in the previous performance exploration discussions. You have 1 thread overloaded while most others are asleep, so you get about 12.5% utilization on an 8-thread CPU.
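The 12.5% figure is just 1/8, and the arithmetic can be sketched as below. This is a simplified model that assumes the frame lasts as long as its slowest task; the millisecond values are invented, not measurements.

```python
def frame_stats(task_ms, hardware_threads):
    """Simplified model: the frame lasts as long as its slowest task."""
    frame_ms = max(task_ms)   # everyone else waits for the slowest thread
    busy_ms = sum(task_ms)    # total time the threads actually spent working
    utilization = busy_ms / (frame_ms * hardware_threads)
    return frame_ms, utilization


# Hypothetical late-game frame: one overloaded economy task, seven idle threads.
frame_ms, utilization = frame_stats([100.0, 0, 0, 0, 0, 0, 0, 0], 8)
print(frame_ms, utilization)  # 100.0 0.125
```

In practice the other threads do a little work before going to sleep, so the real number sits slightly above 12.5%, but the shape is the same.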
UI slowdown is completely separate, since UI data must be populated from the game state *each* time you open any panel or window (e.g. the pop resettlement window).
FPS issues are somewhat irrelevant, as a separate thread was always responsible for doing the OpenGL/DirectX stuff. Only thread resource starvation can explain issues here with an unpaused game. Remember that modern drivers and graphics stacks are multithreaded as well, so they are sensitive to your total CPU load.
Yes, the engine is multithreaded, but the workload distribution is not balanced. What people really want when they say "make the game engine multithreaded" is a change where all cores can be used to process the economy - or *any* other task whose computational workload increases as population count (or any entity count) increases. This means splitting the work set into separate partitions and having each core complete its share of that work in parallel. For example, divide all colonies by the number of cores and have each CPU core calculate through its own set.
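A rough sketch of that "divide all colonies by the number of cores" idea follows. The colony data, the update_colonies placeholder and the worker count are all hypothetical. Note also that in standard CPython, threads do not speed up CPU-bound work because of the global interpreter lock, so this only illustrates the partitioning pattern; a real engine would do this with native threads.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, parts):
    """Split items into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def update_colonies(chunk):
    """Placeholder for the per-colony economy work: here it just sums pops."""
    return sum(colony["pops"] for colony in chunk)


colonies = [{"id": i, "pops": 10 + i} for i in range(20)]  # hypothetical data
workers = 4                                                # hypothetical core count

chunks = partition(colonies, workers)
with ThreadPoolExecutor(max_workers=workers) as pool:
    # Each worker handles one partition; results come back in chunk order.
    totals = list(pool.map(update_colonies, chunks))

total_pops = sum(totals)
print(total_pops)  # 390
```

The hard part in a real game is not the split itself but everything the colonies share (trade, migration, empire-wide resources), which is why this is a re-engineering job rather than a quick fix.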
This requires significant re-engineering effort and it won't happen - I will be extremely surprised if it does. But expect something like that in Stellaris 2 under Clausewitz 2 or such, because the industry is moving towards budget 16-core/32-thread CPUs.
And don't forget multiplayer, where we have another upper limit: once you reach a certain large number of pops and manage to calculate them in parallel as suggested, you may get many pop reshuffling changes per second, and the game state delta (all the changes that must be sent to other clients) can reach into the megabytes per second, clogging your internet connection's upload capacity and bringing everything to a halt regardless of CPU, GPU or anything else.
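As a back-of-the-envelope illustration of how a delta can reach that range, here is the multiplication spelled out. Every input is an invented number, and the assumption that one machine uploads the full delta to each peer is mine; I don't know how the actual netcode distributes it.

```python
def delta_bytes_per_second(changes_per_tick, bytes_per_change,
                           ticks_per_second, peers):
    """Upload needed if one machine sends every change to every peer."""
    return changes_per_tick * bytes_per_change * ticks_per_second * peers


# All four inputs are hypothetical, chosen only to show the scale.
upload = delta_bytes_per_second(
    changes_per_tick=2000,  # pop/job changes produced per tick
    bytes_per_change=16,    # serialized size of one change
    ticks_per_second=60,    # roughly one tick every 16ms
    peers=3,                # other clients in the session
)
print(upload / 1_000_000)  # megabytes per second
```

With these made-up inputs the result is several megabytes per second, which is more than many home connections can upload.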
The only reasonably quick and cheap solution is either optimization of the workload or reduction of the data set. The mod "Stellaris Immortal" is an attempt at reduction, while mods that mess with job check scheduling (or de-scheduling) are an attempt at optimization, as is having no vacant jobs, which is kind of both.