
Stack usage is awful · Issue #640 · micropython/micropython · GitHub
Closed
pfalcon opened this issue May 30, 2014 · 15 comments

@pfalcon
Contributor

pfalcon commented May 30, 2014

I said long ago that I wanted to add basic stack usage info (nothing fancy, just comparing the addresses of stack-allocated variables). I didn't, because I suspected that the only outcome would be a desire to drop everything and work on stack usage.

Anyway, I did that now (printed by mem_info() in unix port) and indeed it's huge:

def foo2():
    mem_info()

def foo():
    mem_info()
    foo2()

mem_info()
foo()

mem: total=2151, current=975, peak=1554
stack: 800
GC: total: 128000, used: 1216, free: 126784
 No. of 1-blocks: 24, 2-blocks: 14, max blk sz: 6
mem: total=2151, current=975, peak=1554
stack: 1280
GC: total: 128000, used: 1216, free: 126784
 No. of 1-blocks: 24, 2-blocks: 14, max blk sz: 6
mem: total=2151, current=975, peak=1554
stack: 1760
GC: total: 128000, used: 1216, free: 126784
 No. of 1-blocks: 24, 2-blocks: 14, max blk sz: 6

So, one level of Python function recursion costs 480 bytes of C stack (on x86).
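The address-comparison approach mentioned above can be sketched in plain C. This is only an illustration, not the code from the referenced commit: `stack_set_base`, `stack_usage` and `usage_at_depth` are hypothetical names, and `__attribute__((noinline))` assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of MicroPython. */
static uintptr_t stack_base;   /* address of a local near the top of the call chain */

/* Record a reference address, e.g. that of a local variable in main(). */
void stack_set_base(const void *marker) {
    stack_base = (uintptr_t)marker;
}

/* Approximate number of C stack bytes between the reference point and this call. */
size_t stack_usage(void) {
    volatile char here;                  /* lives in the current frame */
    uintptr_t now = (uintptr_t)&here;
    assert(stack_base != 0);             /* stack_set_base() must run first */
    /* Take the absolute difference so the stack growth direction does not matter. */
    return (size_t)(now < stack_base ? stack_base - now : now - stack_base);
}

/* Recurse `depth` levels, then report usage from the innermost call.
 * noinline (GCC/Clang extension, an assumption here) keeps one frame per level. */
__attribute__((noinline)) size_t usage_at_depth(int depth) {
    volatile char pad[64];               /* stand-in for per-call locals */
    pad[0] = (char)depth;
    size_t r = depth > 0 ? usage_at_depth(depth - 1) : stack_usage();
    /* Touch pad after the call so this is not a tail call; the term adds 0. */
    return r + (size_t)(pad[0] - (char)depth);
}
```

Calling `stack_set_base()` with the address of a local in `main()`, then subtracting the result at one depth from the result at a deeper one and dividing by the number of extra levels, gives a per-call figure comparable in spirit to the 480 bytes reported here. The exact numbers depend on compiler, flags and ABI.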

@dhylands
Contributor

Very interesting. Is your patch available somewhere?

@pfalcon
Contributor Author

pfalcon commented May 31, 2014

Yep, pushed in 914bcf1 now

@pfalcon
Contributor Author

pfalcon commented May 31, 2014

So, why does this happen? Three big reasons:

  1. Long chains of functions with many args. (Each call thus re-pushes stuff again and again).
  2. Good use of automatic variables (non-scalars first of all) and alloca(). The basic idea is that GC of stack stuff is much easier than of heap, but it may be worth analyzing that aspect more closely.
  3. Not good enough use of alloca() - a few funcs allocate fairly big arrays on the stack even if they won't be used.

Of these, 3 looks like low-hanging fruit for experimentation, and I submit this ticket essentially as grounds to tweak 1. Point 2 should be the last one to try, because it will affect performance.
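To make point 3 concrete, here is a generic sketch of the pattern, not taken from the MicroPython sources. `div_eager`, `div_lazy` and `report_div_error` are made-up names, and `noinline` again assumes GCC or Clang. The eager version reserves its scratch array on every call; the lazy version pays for it only on the rare path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Eager: the 256-byte scratch array is part of the frame on every call,
 * including the common path that never touches it. */
int div_eager(int a, int b, char *err, size_t err_len) {
    char msg[256];
    assert(err != NULL && err_len > 0);
    if (b == 0) {
        snprintf(msg, sizeof msg, "cannot divide %d by zero", a);
        snprintf(err, err_len, "%s", msg);
        return 0;
    }
    return a / b;
}

/* The scratch array now lives in a helper that runs only on the error path.
 * noinline (GCC/Clang extension) stops the compiler from merging the frames. */
__attribute__((noinline))
static void report_div_error(int a, char *err, size_t err_len) {
    char msg[256];
    snprintf(msg, sizeof msg, "cannot divide %d by zero", a);
    snprintf(err, err_len, "%s", msg);
}

/* Lazy: the common path has a small frame; behaviour is otherwise the same. */
int div_lazy(int a, int b, char *err, size_t err_len) {
    assert(err != NULL && err_len > 0);
    if (b == 0) {
        report_div_error(a, err, err_len);
        return 0;
    }
    return a / b;
}
```

The saving matters most in functions that sit in the middle of a deep call chain, because their frame stays allocated while everything below them runs.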

@dhylands
Contributor

It appears to be similar for the STM port as well:

Stack: 744 bytes
Stack: 1160 bytes
Stack: 1576 bytes

or 416 bytes per level of function call.

My patch was different (since stm doesn't have a mem_info, I added pyb.stack()):

diff --git a/stmhal/modpyb.c b/stmhal/modpyb.c
index 7ec2503..de67f62 100644
--- a/stmhal/modpyb.c
+++ b/stmhal/modpyb.c
@@ -147,6 +147,15 @@ STATIC mp_obj_t pyb_info(uint n_args, const mp_obj_t *args) {
 }
 STATIC MP_DEFINE_CONST_FUN_OBJ_VAR_BETWEEN(pyb_info_obj, 0, 1, pyb_info);

+STATIC mp_obj_t pyb_stack(void) {
+    char *stack_top = (char *)&_estack + 1;
+    size_t usage = stack_top - (char *)&stack_top;
+    printf("Stack: %u bytes\n", usage);
+    return mp_const_none;
+}
+STATIC MP_DEFINE_CONST_FUN_OBJ_0(pyb_stack_obj, pyb_stack);
+
+
 /// \function unique_id()
 /// Returns a string of 12 bytes (96 bits), which is the unique ID for the MCU.
 STATIC mp_obj_t pyb_unique_id(void) {
@@ -321,6 +330,7 @@ STATIC const mp_map_elem_t pyb_module_globals_table[] = {

     { MP_OBJ_NEW_QSTR(MP_QSTR_bootloader), (mp_obj_t)&pyb_bootloader_obj },
     { MP_OBJ_NEW_QSTR(MP_QSTR_info), (mp_obj_t)&pyb_info_obj },
+    { MP_OBJ_NEW_QSTR(MP_QSTR_stack), (mp_obj_t)&pyb_stack_obj },
     { MP_OBJ_NEW_QSTR(MP_QSTR_unique_id), (mp_obj_t)&pyb_unique_id_obj },
     { MP_OBJ_NEW_QSTR(MP_QSTR_freq), (mp_obj_t)&pyb_freq_obj },
     { MP_OBJ_NEW_QSTR(MP_QSTR_repl_info), (mp_obj_t)&pyb_set_repl_info_obj },
diff --git a/stmhal/qstrdefsport.h b/stmhal/qstrdefsport.h
index d822a54..cb2a38c 100644
--- a/stmhal/qstrdefsport.h
+++ b/stmhal/qstrdefsport.h
@@ -31,6 +31,7 @@ Q(pyb)
 Q(unique_id)
 Q(bootloader)
 Q(info)
+Q(stack)
 Q(sd_test)
 Q(present)
 Q(power)

@pfalcon
Contributor Author

pfalcon commented May 31, 2014

This really should be micropython.stack() (returning an int). mem_info() is a dirty practical hack; we should avoid proliferating those (unix just already has it, until it's removed).

@pfalcon
Contributor Author

pfalcon commented May 31, 2014

I'm working on (initial steps of) point 1.

@pfalcon
Contributor Author

pfalcon commented May 31, 2014

@pfalcon
Contributor Author

pfalcon commented May 31, 2014

While trying -fstack-usage and then looking into assembly, I saw completely nonsensical stack usage and allocation. Ah, gotta remember that Intel sells new CPUs by forcing completely ludicrous stack alignments to push more data out of caches and then putting bigger cache sizes to increase something in specs.

Using -mpreferred-stack-boundary=2 (the value is an exponent of 2, so this requests 4-byte alignment) cut the stack size of one function from 80 to 60 bytes (even though the default alignment should be 16 bytes). With that, one Python function recursion costs 292 bytes (with some changes to cut allocation already applied).

@dhylands
Contributor

This really should be micropython.stack() (returning an int). mem_info() is a dirty practical hack; we should avoid proliferating those (unix just already has it, until it's removed).

Good point. I wasn't really sure exactly how/where we should add such a function. It seems quite useful for embedded work.

I like to have a set of functions related to stack use:

1 - Have a function which can write the unused portion of the stack with some well-known fixed value.
2 - Have a function which scans the stack looking for the first disturbed word. As long as the stack hasn't overflowed, this gives a good indication of high-water stack usage, taking into consideration stacked-up ISRs and such.
3 - Have a function similar to the one coded which reports on the current stack usage.
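Functions 1 and 2 from this list could look roughly like the sketch below. It works on a plain array standing in for the stack region so it can be run on a host. The names, the fill pattern and the assumption of a descending stack are all illustrative choices, not an existing MicroPython API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xDEADBEEFu   /* hypothetical fill pattern */

/* Function 1: paint n_words of unused stack, starting at its lowest address. */
void stack_paint(uint32_t *lowest, size_t n_words) {
    assert(lowest != NULL);
    for (size_t i = 0; i < n_words; i++) {
        lowest[i] = STACK_FILL;
    }
}

/* Function 2: count the words, from the lowest address upward, that still hold
 * the fill pattern. With a descending stack, the first disturbed word marks
 * the deepest point the stack has reached (the high-water mark). */
size_t stack_untouched_words(const uint32_t *lowest, size_t n_words) {
    assert(lowest != NULL);
    size_t i = 0;
    while (i < n_words && lowest[i] == STACK_FILL) {
        i++;
    }
    return i;
}
```

On a real target the region would run from the stack limit up to just below the current stack pointer, and the pointers would normally be `volatile`. Painting anything at or above the live stack pointer would corrupt active frames.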

Under Linux, doing this is a bit tricky, but still doable. The stack grows dynamically, getting new pages added as needed, so you need to probe to find out if pages have been allocated or not. Usually you can query /proc/self/maps and look for a line with [stack] to find the set of pages which are mapped to the stack, and use that as the starting point.
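A minimal sketch of the /proc/self/maps lookup, assuming Linux and the usual `start-end perms offset dev inode pathname` line layout. `parse_stack_line` and `find_stack_range` are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse one line of /proc/self/maps. Returns 1 and fills *start / *end
 * if the line describes the [stack] mapping, 0 otherwise. */
int parse_stack_line(const char *line,
                     unsigned long long *start, unsigned long long *end) {
    assert(line != NULL && start != NULL && end != NULL);
    if (strstr(line, "[stack]") == NULL) {
        return 0;
    }
    /* Each line begins with the address range as "start-end" in hex. */
    return sscanf(line, "%llx-%llx", start, end) == 2;
}

/* Linux only: scan /proc/self/maps for the [stack] mapping.
 * Returns 1 on success, 0 if the file or the line cannot be found. */
int find_stack_range(unsigned long long *start, unsigned long long *end) {
    char line[512];
    int found = 0;
    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL) {
        return 0;
    }
    while (!found && fgets(line, sizeof line, f) != NULL) {
        found = parse_stack_line(line, start, end);
    }
    fclose(f);
    return found;
}
```

The range found this way covers only the pages currently mapped for the main thread's stack, which is why it serves as a starting point rather than a fixed limit.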

You may also, for debugging purposes, just force the preallocation of a bunch of stack pages.

Perhaps we should create a stack module? Or maybe we should have a mem module that includes the stack and the gc? Or maybe just keep them separate? Once embedded projects get more complicated, I can definitely see the need to walk the heap and get information about all of the allocated objects in the heap, both broad (how many objects are allocated for each class/type) and detailed (get information about each allocated object, especially the larger ones).

@dpgeorge
Member

Re more sophisticated stack profiling, see also #264.

@dpgeorge
Member

dpgeorge commented Jun 1, 2014

See c60a261 for some small savings of stack space.

dpgeorge added a commit that referenced this issue Jun 7, 2014
This reduces stack usage by 16 words (64 bytes) for stmhal/ port.

See issue #640.
@dpgeorge
Member

dpgeorge commented Jun 7, 2014

See aabd83e for moderate savings of stack. This patch made the following improvements:

32 bit unix:
    saves 232 ROM bytes
    stack from 384 down to 272 bytes per call (96 words to 68 words)
    -> a saving of 112 bytes (28 words) per call

64 bit unix:
    saves 632 bytes ROM
    stack from 528 to 416 bytes per call (66 down to 52 words)
    -> a saving of 112 bytes stack (14 words)

stmhal:
    saves 48 bytes ROM
    stack from 288 bytes down to 224 bytes (72 words down to 56 words per call)
    -> saves 64 bytes (16 words) per call

So far we have gone from 480 to 272 on x86, and 416 to 224 on ARM Thumb. That's about a 45% decrease.

@dhylands
Contributor

dhylands commented Jun 7, 2014

Sweet.

I saw the blurb about switching over to using the heap if more state is needed. What attributes of a function would cause this to happen? I'd like to understand the impact on interrupt handlers.

@dpgeorge
Member

dpgeorge commented Jun 7, 2014

What attributes of a function would cause this to happen?

A function that has a lot of local variables, and/or lots of arguments, and/or a complicated expression, e.g. a+b*c(d, c+f).
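The general idea of keeping small state on the stack and falling back to the heap for larger state can be sketched as follows. This is not the code from the referenced commit: the threshold `MAX_STATE_ON_STACK`, the function name and the dummy "body" are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_STATE_ON_STACK 16   /* hypothetical threshold, in slots */

/* Run a dummy "function body" that needs n_state slots of working state.
 * Small requests use a fixed array in the frame; larger ones use the heap.
 * Returns the sum 0 + 1 + ... + (n_state - 1), or -1 if allocation fails.
 * *used_heap reports which path was taken. */
long run_with_state(size_t n_state, bool *used_heap) {
    long stack_state[MAX_STATE_ON_STACK];
    long *state = stack_state;
    long sum = 0;

    assert(used_heap != NULL);
    *used_heap = false;

    if (n_state > MAX_STATE_ON_STACK) {
        state = malloc(n_state * sizeof *state);
        if (state == NULL) {
            return -1;
        }
        *used_heap = true;
    }

    for (size_t i = 0; i < n_state; i++) {
        state[i] = (long)i;
        sum += state[i];
    }

    if (*used_heap) {
        free(state);
    }
    return sum;
}
```

With a scheme like this, the per-call stack cost is bounded by the threshold, and only calls that exceed it allocate from the heap. That allocation is the part relevant to the interrupt-handler question above, where heap use may not be acceptable.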

@dpgeorge
Member

dpgeorge commented Aug 2, 2014

With a few fixes (described above) stack usage is no longer "awful", so the title of this issue is fixed.

If someone considers stack usage to need further improvements, please provide a sensible metric to measure such improvements, and a goal, otherwise this issue will remain forever open.

@dpgeorge dpgeorge closed this as completed Aug 2, 2014
tannewt pushed a commit to tannewt/circuitpython that referenced this issue Feb 27, 2018
add board.RX and .TX pins to metro_m4_express_revb