
Fiddling with thread stacks

So, you want to mess with thread stacks? It's surprisingly hard to find meaningful examples online. All you find are trivial calls to pthread_attr_setstack, and people asking why on earth you'd want to do that, telling you it's a sign of bad design, and such.
There are a few good reasons for setting the stack size, the stack address, and the guard ("protector") page size.


I'm not sure how many of these features are available on Windows; I'll check at some point for the sake of my curiosity. But for now, this post is limited to Linux and other Unix-ish systems.
Let's start with the simplest of them - the size. 

If your app has many threads, the memory consumed by stack space becomes significant. Last time I checked, the default size was 2MB, so 250 threads make half a gig! Dropping the size to 1MB halves that. An excessive number of threads is obviously not good practice for high performance, but it makes the point.

I hear that "memory is cheap", but we're into high performance, right?
Every page needs an entry in the memory translation tables, so you get longer lookups on every TLB miss. Unused memory is not a huge concern, but when there's a lot of it, the page tables grow fat and slow.
Physical memory is limited and getting swapped out is the last thing you need. My usual recommendation regarding swap is to disable it altogether.

You might also want to enlarge the stack. Sometimes threads need more space for calculations. Memory allocation is one of the enemies of high performance, so using the stack rather than the heap makes perfect sense.

Communications threads sometimes prefer to have buffers on the stack to encode and decode messages.
To be honest, I don't remember seeing threads exceed 2MB of stack without an unbounded-recursion bug.

Now, why would anyone want a guard bigger than a single page? Since we're going to touch the first guard byte first, one byte should be enough, right?

Well, no. If we're using the stack for big objects we can touch it well beyond the limit. Like here:
  char buf[100000];
  if( recv( m_socket, buf, sizeof(buf), flags ) > 0 )
    process( buf );
If our stack is 2MB and we get to this code when we're at 1.95MB, we'll be touching memory 50KB beyond the stack boundary!
The stack grows downwards, so the first byte of the buffer is outside the stack space and the last one is inside. This means recv would always trash somebody else's memory.
We could easily get into trouble on platforms where the stack grows upwards too. For example, recv could read fewer than 50K bytes, staying within bounds, and then process(buf) would push the stack pointer over the limit.

So, the obvious question is how do we figure out how much stack we need?

The answer is conceptually simple - sum up the variable sizes in call chains and add the stack frames overhead. However, it's harder in practice.
Luckily, compilers already do at least some of that calculation to emit each function's prologue, and they can dump the info for us to go summing up. In gcc, it's done using -fstack-usage. I did not check with other compilers.
Another, simpler approach is to put a breakpoint in the deepest function and look at the addresses in a debugger.
How about the guard size? This is easier to estimate. Assuming any function can be called at any stack depth, we only have to find the biggest frame size and round up to a page. To be extra cautious, we should look for cases where a very big frame is left untouched before the next one is pushed, and sum those up.
This is the kind of analysis I'd expect the compiler to do for me. It would be hard with external libraries that call us back and such, but, dear compiler folks, you're welcome to make assumptions, or add extra requirements.
Now, being aware of stack size and protector size considerations, we can start talking about allocating the stack ourselves.
The pthread library usually does this for us: a 1MB or 2MB space is allocated using mmap and an LWP is created. If we want to dig deeper into pthread, we can have a look at its internal thread descriptor struct.

So, if this allocation is taken care of, why do it ourselves?
Are there any hidden benefits of allocating this space ourselves?
There could be some. If we just use that space for our stack, and change nothing else, there probably isn't much difference.

Can we allocate space in a better way than pthread? I say we can. The performance difference would not be significant for most applications, but I feel like digging in this direction.

If we don't have too many threads, and are OK with stack sizes in multiples of 2MB, we can allocate each 2MB stack as a single huge page rather than 512 pages of 4KB. This is a benefit! We've just saved a bunch of translation entries and probably some TLB misses.

Needless to say, this could be a nice feature of the pthread library.

Another thing to think about on NUMA machines is memory locality: a thread's stack can be allocated on the node the thread runs on. That memory is faster to pull into the caches on a cache miss, but there are also good secondary effects; or rather, fewer of the negative ones. Memory controllers on other nodes won't have to handle our memory requests, so their CPUs will execute code faster. And obviously, our code will benefit from the higher memory bandwidth, which reduces the cost of cache misses.


A controlled stack alignment can be exploited for another benefit: it makes it very fast to compute the address of the bottom of the stack. We can use this to implement a TLS (or TSD, as it's sometimes called) that lives on the stack, like a local variable in the topmost stack frame.
This would be faster than the pthread implementation because we skip its large, dynamic lookup structures. And obviously, our code can be inlined.
The simplest thing we can do is put a struct at a predefined offset from the beginning of the stack.

FastTls* fastTls()
{
  union {
    FastTls* ftls;
    void* p;
    uintptr_t a;
  } u;
  u.p = &u;                              // any local's address lies inside our stack
  u.a &= ~(uintptr_t)(stack_size - 1);   // round down to the stack's base address
  u.a += stack_size - stack_bootstrap - sizeof(FastTls); // struct sits near the top
  return u.ftls;
}


The code assumes the stack size is a power of 2. stack_bootstrap is some amount we reduce for pthread's entry point function. This bootstrap is easy to measure when debugging.
If the stack size is bigger than the huge page size, say 4MB, the mapping's alignment is no longer guaranteed to match the stack size, so this trick will require a bit more work.

To summarize: setting the stack size, guard size, alignment, and memory allocation source can all benefit high-performance apps. All the ideas above could be provided by pthread, or by whatever library we use for threading.
If I've missed any performance enhancing opportunity, in the scope of tweaking stacks, please let me know!
