Stack usage is awful #640

pfalcon · 2014-05-30T23:53:05Z

I said long ago that I wanted to add basic stack usage info (nothing fancy, just comparing addresses of stack-allocated variables). I didn't, because I suspected that the only outcome would be a desire to drop everything and work on stack usage.

Anyway, I did that now (printed by mem_info() in unix port) and indeed it's huge:

def foo2():
    mem_info()

def foo():
    mem_info()
    foo2()

mem_info()
foo()

mem: total=2151, current=975, peak=1554
stack: 800
GC: total: 128000, used: 1216, free: 126784
 No. of 1-blocks: 24, 2-blocks: 14, max blk sz: 6
mem: total=2151, current=975, peak=1554
stack: 1280
GC: total: 128000, used: 1216, free: 126784
 No. of 1-blocks: 24, 2-blocks: 14, max blk sz: 6
mem: total=2151, current=975, peak=1554
stack: 1760
GC: total: 128000, used: 1216, free: 126784
 No. of 1-blocks: 24, 2-blocks: 14, max blk sz: 6

So, one level of Python function recursion costs 480 bytes of C stack (on x86).
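
The 480-byte figure is simply the difference between consecutive "stack:" readings in the output above. A minimal sketch of that arithmetic, where the helper name per_level_cost is made up for illustration and is not part of MicroPython:

```python
def per_level_cost(readings):
    """Return the stack growth between consecutive nesting levels."""
    return [deeper - shallower
            for shallower, deeper in zip(readings, readings[1:])]

# "stack:" values printed at module level, inside foo(), and inside foo2()
unix_x86 = [800, 1280, 1760]
print(per_level_cost(unix_x86))  # [480, 480]
```

Both steps come out the same, which suggests the cost per Python call level is constant for this trivial case.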

dhylands · 2014-05-31T00:01:05Z

Very interesting. Is your patch available somewhere?

pfalcon · 2014-05-31T00:10:49Z

Yep, pushed in 914bcf1 now

pfalcon · 2014-05-31T00:17:37Z

So, why does this happen? 3 big reasons:

1 - Long chains of functions with many args (each call thus re-pushes the same stuff again and again).
2 - Extensive use of automatic variables (non-scalars first of all) and alloca(). The basic idea is that GC of stack stuff is much easier than of heap, but it may be worth analyzing that aspect more closely.
3 - Not good enough use of alloca(): a few funcs allocate big arrays on the stack even if they won't be used.

Of these, 3 looks like low-hanging fruit for experimentation, and I submit this ticket essentially as grounds to tweak 1. Point 2 should be the last one to try, because it will affect performance.

dhylands · 2014-05-31T01:13:16Z

It appears to be similar for the STM port as well:

Stack: 744 bytes
Stack: 1160 bytes
Stack: 1576 bytes

or 416 bytes per level of function call.

My patch was different (since stm doesn't have mem_info(), I added pyb.stack()):

diff --git a/stmhal/modpyb.c b/stmhal/modpyb.c
index 7ec2503..de67f62 100644
--- a/stmhal/modpyb.c
+++ b/stmhal/modpyb.c
@@ -147,6 +147,15 @@ STATIC mp_obj_t pyb_info(uint n_args, const mp_obj_t *args) {
 }
 STATIC MP_DEFINE_CONST_FUN_OBJ_VAR_BETWEEN(pyb_info_obj, 0, 1, pyb_info);

+STATIC mp_obj_t pyb_stack(void) {
+    char *stack_top = (char *)&_estack + 1;
+    size_t usage = stack_top - (char *)&stack_top;
+    printf("Stack: %u bytes\n", usage);
+    return mp_const_none;
+}
+STATIC MP_DEFINE_CONST_FUN_OBJ_0(pyb_stack_obj, pyb_stack);
+
+
 /// \function unique_id()
 /// Returns a string of 12 bytes (96 bits), which is the unique ID for the MCU.
 STATIC mp_obj_t pyb_unique_id(void) {
@@ -321,6 +330,7 @@ STATIC const mp_map_elem_t pyb_module_globals_table[] = {

     { MP_OBJ_NEW_QSTR(MP_QSTR_bootloader), (mp_obj_t)&pyb_bootloader_obj },
     { MP_OBJ_NEW_QSTR(MP_QSTR_info), (mp_obj_t)&pyb_info_obj },
+    { MP_OBJ_NEW_QSTR(MP_QSTR_stack), (mp_obj_t)&pyb_stack_obj },
     { MP_OBJ_NEW_QSTR(MP_QSTR_unique_id), (mp_obj_t)&pyb_unique_id_obj },
     { MP_OBJ_NEW_QSTR(MP_QSTR_freq), (mp_obj_t)&pyb_freq_obj },
     { MP_OBJ_NEW_QSTR(MP_QSTR_repl_info), (mp_obj_t)&pyb_set_repl_info_obj },
diff --git a/stmhal/qstrdefsport.h b/stmhal/qstrdefsport.h
index d822a54..cb2a38c 100644
--- a/stmhal/qstrdefsport.h
+++ b/stmhal/qstrdefsport.h
@@ -31,6 +31,7 @@ Q(pyb)
 Q(unique_id)
 Q(bootloader)
 Q(info)
+Q(stack)
 Q(sd_test)
 Q(present)
 Q(power)

pfalcon · 2014-05-31T07:19:48Z

This really should be micropython.stack() (returning an int). mem_info() is a dirty practical hack; we should avoid proliferating those (unix just already has it, until it's removed).

pfalcon · 2014-05-31T10:05:47Z

I'm working on (initial steps of) point 1.

pfalcon · 2014-05-31T10:45:11Z

https://gcc.gnu.org/onlinedocs/gnat_ugn_unw/Static-Stack-Usage-Analysis.html

pfalcon · 2014-05-31T11:46:40Z

While trying -fstack-usage and then looking into assembly, I saw completely nonsensical stack usage and allocation. Ah, gotta remember that Intel sells new CPUs by forcing completely ludicrous stack alignments to push more data out of caches and then putting bigger cache sizes to increase something in specs.

Using -mpreferred-stack-boundary=2 (the value is an exponent, so this requests 2^2 = 4-byte alignment) cut the stack size of one function from 80 to 60 bytes (even though the default alignment should be 16). With that, one level of Python function recursion costs 292 bytes (with some changes to cut allocation already applied).
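
To illustrate why the alignment setting matters, here is a simplified model that only rounds a frame size up to the requested boundary. It is an assumption-laden sketch: real frames also hold the return address, saved registers and spill slots, so it does not reproduce the 80-byte figure, and the function names are invented for this example.

```python
def boundary_bytes(n):
    """-mpreferred-stack-boundary=n asks for 2**n-byte stack alignment."""
    return 1 << n

def round_up(size, alignment):
    """Round size up to the next multiple of alignment (a power of two)."""
    return (size + alignment - 1) & ~(alignment - 1)

# GCC's documented x86 default is 4 (16 bytes); the comment above uses 2.
for n in (4, 2):
    align = boundary_bytes(n)
    print(n, align, round_up(60, align))
# prints "4 16 64" then "2 4 60"
```

Every frame in a call chain pays its own padding, so the waste adds up across the several C functions that make up one Python call.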

dhylands · 2014-05-31T17:42:36Z

> This really should be micropython.stack() (to return int). mem_info() is dirty practical hack, we should avoid proliferating those (unix just already has it, until it's removed).

Good point. I wasn't really sure exactly how/where we should add such a function. It seems quite useful for embedded work.

I like to have a set of functions related to stack use:

1 - Have a function which can write the unused portion of the stack with some well known fixed value
2 - Have a function which scans the stack looking for the first disturbed word. As long as the stack hasn't overflowed this gives a good indication for high-water stack usage, taking into consideration stacked up ISRs and such.
3 - Have a function similar to the one coded which reports on the current stack usage.
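
Functions 1 and 2 together are often called stack painting. The following is only a simulation of the idea on a bytearray, not code for a real target: the fill value, the byte (rather than word) granularity, and the names paint and high_water are all assumptions made for illustration.

```python
FILL = 0xA5  # arbitrary, hypothetical fill pattern

def paint(stack, used):
    """Fill everything below the live region with FILL.

    The bytearray models a descending stack: the base is at the highest
    index and the `used` bytes at the end are currently live.
    """
    free = len(stack) - used
    stack[:free] = bytes([FILL]) * free

def high_water(stack):
    """Return the deepest usage seen, by finding the first disturbed byte."""
    for index, value in enumerate(stack):
        if value != FILL:
            return len(stack) - index
    return 0

stack = bytearray(64)               # 64-byte simulated stack, 8 bytes live
paint(stack, used=8)
before = high_water(stack)          # only the live bytes are disturbed
stack[24:56] = bytes([0x11]) * 32   # a deeper call chain scribbles 32 more bytes
after = high_water(stack)
print(before, after)  # 8 40
```

The estimate can read low if the deepest data written happens to equal the fill value, which is why an unlikely pattern is usually chosen.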

Under Linux, doing this is a bit tricky, but still doable. The stack grows dynamically, getting new pages added as needed, so you need to probe to find out if pages have been allocated or not. Usually you can query /proc/self/maps and look for a line with [stack] to find the set of pages which are mapped to the stack, and use that as the starting point.
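
A rough sketch of the /proc/self/maps lookup described above. The function name find_stack_mapping is hypothetical, and the sample text is fabricated purely to show the line format; on a real Linux system the input would come from reading /proc/self/maps.

```python
def find_stack_mapping(maps_text):
    """Return (start, end) of the mapping labelled [stack], or None."""
    for line in maps_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "[stack]":
            start, end = (int(part, 16) for part in fields[0].split("-"))
            return start, end
    return None

# Illustrative sample only; the addresses are invented.
SAMPLE = """\
00400000-0040b000 r-xp 00000000 08:01 1234 /bin/cat
7ffd3a5c1000-7ffd3a5e2000 rw-p 00000000 00:00 0                          [stack]
"""

start, end = find_stack_mapping(SAMPLE)
print(hex(start), hex(end), end - start)
# prints "0x7ffd3a5c1000 0x7ffd3a5e2000 135168"
```

On Linux, passing open("/proc/self/maps").read() instead of SAMPLE should give the range currently mapped for the main thread's stack.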

You may also, for debugging purposes, just force the preallocation of a bunch of stack pages.

Perhaps we should create a stack module? Or maybe we should have a mem module that includes the stack and the gc? Or maybe just keep them separate? Once embedded projects get more complicated, I can definitely see the need to walk the heap and get information about all of the allocated objects in the heap, both broad (how many objects are allocated for each class/type) and detailed (get information about each allocated object, especially the larger ones).

dpgeorge · 2014-05-31T17:46:21Z

Re more sophisticated stack profiling, see also #264.

dpgeorge · 2014-06-01T11:45:07Z

See c60a261 for some small savings of stack space.

This reduces stack usage by 16 words (64 bytes) for stmhal/ port. See issue #640.

dpgeorge · 2014-06-07T13:28:19Z

See aabd83e for moderate savings of stack. This patch made the following improvements:

32 bit unix:
    saves 232 ROM bytes
    stack from 384 down to 272 bytes per call (96 words to 68 words)
    -> a saving of 112 bytes (28 words) per call

64 bit unix:
    saves 632 bytes ROM
    stack from 528 to 416 bytes per call (66 down to 52 words)
    -> a saving of 112 bytes stack (14 words)

stmhal:
    saves 48 bytes ROM
    stack from 288 bytes down to 224 bytes (72 words down to 56 words per call)
    -> saves 64 bytes (16 words) per call

So far we have gone from 480 to 272 on x86, and 416 to 224 on ARM Thumb. That's about a 45% decrease.
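
As a quick check of the "about 45%" figure, using only the per-call numbers quoted above (the helper name is made up for this example):

```python
def percent_decrease(before, after):
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before

print(round(percent_decrease(480, 272), 1))  # x86: 43.3
print(round(percent_decrease(416, 224), 1))  # ARM Thumb: 46.2
```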

dhylands · 2014-06-07T18:04:17Z

Sweet.

I saw the blurb about switching over to using the heap if more state is needed. What attributes of a function would cause this to happen? I'd like to understand the impact on interrupt handlers.

dpgeorge · 2014-06-07T18:52:51Z

> What attributes of a function would cause this to happen?

A function that has a lot of local variables, and/or lots of arguments, and/or has a complicated expression, eg a+b*c(d, c+f).
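
As a loose analogy only, CPython exposes similar per-function quantities on its code objects: co_nlocals (arguments plus locals) and co_stacksize (value stack depth needed by the bytecode). MicroPython computes its own state size and these attributes are CPython's, so the sketch below only illustrates the trend and says nothing about where MicroPython's heap threshold lies.

```python
def simple(a, b):
    return a + b

def nested(a, b, c, d, f):
    # Same shape as the expression quoted above; it is never called here.
    return a + b * c(d, c + f)

for func in (simple, nested):
    code = func.__code__
    print(func.__name__, code.co_nlocals, code.co_stacksize)
```

The nested call keeps a, b and the callee live while its arguments are evaluated, so it needs a deeper value stack than the plain addition.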

dpgeorge · 2014-08-02T11:14:49Z

With a few fixes (described above) stack usage is no longer "awful", so the title of this issue is fixed.

If someone considers stack usage to need further improvements, please provide a sensible metric to measure such improvements, and a goal, otherwise this issue will remain forever open.

pfalcon added the optimize label May 30, 2014

dpgeorge added a commit that referenced this issue Jun 7, 2014

py: Merge mp_execute_bytecode into fun_bc_call.

aabd83e

This reduces stack usage by 16 words (64 bytes) for stmhal/ port. See issue #640.

dpgeorge closed this as completed Aug 2, 2014

ARF1 mentioned this issue Mar 2, 2017

Low recursion limit on Windows port with MSVC #2927

Closed

tannewt pushed a commit to tannewt/circuitpython that referenced this issue Feb 27, 2018

Merge pull request micropython#640 from dhalbert/metro_m4_tx_rx

6daf4bd

add board.RX and .TX pins to metro_m4_express_revb
