Lecture 7 - Optimizations - A 2025

The document discusses various optimization techniques for computer programs, focusing on measuring execution time and the role of compilers in improving efficiency. It highlights the importance of identifying critical code sections, the limitations of compilers in optimizing certain code patterns, and the impact of memory aliasing on performance. Examples illustrate how to optimize code through techniques like code motion and eliminating unnecessary memory references.


Computer Structure
Lecture 7 – Optimizations

Dr. Marina Kogan-Sadetsky

Based on lectures by Prof. Gal Kaminka and on lectures by Bryant and O'Hallaron
Measurement Challenge
How Much Time Does Program X Require?
CPU time
How many total seconds of processor time are used when executing X?
Actual ("Wall-Clock") Time
How many seconds elapse between the start and completion of X?
Confounding Factors
How does time get measured?
Many processes share computing resources
The effects of switching between processes may be noticeable
"Time" on a Computer System

real (wall clock) time =
    user time (time executing instructions in the user process)
  + system time (time executing instructions in the kernel on behalf of the user process)
  + other users' time (time executing instructions in different users' processes)

We will use the word "time" to refer to user time.

Interval Counting Example
(figure omitted: cumulative user time measured over repeated runs)
The 90/10 Rule of Thumb
90% of execution time is spent in 10% of the code

Lesson:
Find the 10% that really counts!
Let the compiler worry about the rest

Important:
First make the program work correctly
Make sure it is easy to maintain
Then optimize
Optimizing Compilers
Provide an efficient mapping of the program to the machine
■register allocation – choosing which values live in registers
■code selection and ordering – reordering instructions to exploit the pipeline
■eliminating minor inefficiencies

Don't improve asymptotic efficiency (usually)

■up to the programmer to select the best overall algorithm
Limitations of Compilers
● Must never change program behavior
■Prevents optimizations that might affect behavior
■Even if, in practice, the dangerous conditions can never happen

● Behavior that is obvious to the programmer may not be clear to the compiler

■e.g., data ranges may be more limited than the variable types suggest

● Most analysis is performed only within procedures

■Whole-program analysis is too expensive in most cases
■Or relies on source code that is not available

● Most analysis is based only on static information

■The compiler has difficulty anticipating run-time inputs
■Exception: just-in-time compilation in virtual bytecode machines
What compilers can do (in general)
An Example
Original Code:
/* Sum 4-grid neighbors of val[i][j] */

sum = val[i-1][j]
+ val[i+1][j]
+ val[i][j-1]
+ val[i][j+1];

● This is a common step in image/video processing, neural networks, …


● Does this for every pixel i,j, so it will be called N² times!
● What can we do?
Reuse Common Sub-Expressions
Original Code:
/* Sum 4-grid neighbors of val[i][j] */

sum = val[i-1][j]
+ val[i+1][j]
+ val[i][j-1]
+ val[i][j+1];

We know it is really:
/* Sum neighbors of i,j */
up = val[(i-1)*n + j];
down = val[(i+1)*n + j];
left = val[i*n + j-1];
right = val[i*n + j+1];
sum = up + down + left + right;

3 multiplications: i*n, (i–1)*n, (i+1)*n


Reuse Common Sub-Expressions
Original Code:
/* Sum 4-grid neighbors of val[i][j] */

sum = val[i-1][j]
    + val[i+1][j]
    + val[i][j-1]
    + val[i][j+1];

We know it is really:
/* Sum neighbors of i,j */
up    = val[(i-1)*n + j];
down  = val[(i+1)*n + j];
left  = val[i*n + j-1];
right = val[i*n + j+1];
sum   = up + down + left + right;

3 multiplications: i*n, (i–1)*n, (i+1)*n

Which can be transformed into:
int inj = i*n + j;
up    = val[inj - n];
down  = val[inj + n];
left  = val[inj - 1];
right = val[inj + 1];
sum   = up + down + left + right;

1 multiplication: i*n


gcc can do this when optimizing at the -O1 level.

Easy: Code Motion in Loops

Original Code:
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[i][j] = b[j];

We know it is really:
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[i*n+j] = b[j];

What can we do here?


Simple Code Motion:

for (i = 0; i < n; i++) {
    int ni = n*i;
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
}
A further improvement replaces the multiplication with a running increment:

int ni = 0;
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
    ni += n;
}

● Recognize sequences of products, replace them with increments
● A form of strength reduction
What compilers can do – but where it would be the wrong decision
What is the output of the program?

#include <stdio.h>

int main() {

char c = 125;
while (c > 0) {
printf("%d ", c);
c++;
}

return 0;
}
What is the output of the program?

#include <stdio.h>

int main() {

char c = 125;
while (c > 0) {
printf("%d ", c);
c++;
}

return 0;
}

Output: 125 126 127
(on typical systems, incrementing past 127 wraps the signed char to -128, so the loop exits)
What is the output of the program?

#include <stdio.h>

int main() {

char c = 125;
while (c < (c + 1)) {
printf("%d ", c);
c++;
}

return 0;
}
What is the output of the program?

#include <stdio.h>

int main() {

    char c = 125;
    while (c < (c + 1)) {
        printf("%d ", c);
        c++;
    }

    return 0;
}

GCC recognizes that c < (c + 1) always holds: in the comparison, c is promoted to int, and for every signed char value the result of c + 1 is a larger integer. Given this insight, GCC optimizes the loop condition while (c < (c + 1)) into while (true), since the condition can never become false during execution.

This decision is wrong in our case.

Output: an infinite loop

Since gcc decides that c is always smaller than (c+1),
it replaces "while (c < (c+1))" with "while (true)".
What compilers cannot do (in general)
Moving Functions Out of Loop
Procedure to Convert String to Lower Case
void lower(char *s)
{
int i;
for (i = 0; i < strlen(s); i++)
if (s[i] >= 'A' && s[i] <= 'Z')
s[i] -= ('A' - 'a');
}

Often found in student exercise submissions

■ strlen is executed on every iteration

■ strlen is linear in the length of the string
●Must scan the string until it finds '\0'
■Overall performance is quadratic
This would be better
void lower(char *s)
{
    int i;
    int len = strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

■Move the call to strlen outside the loop

■Since the result does not change from one iteration to another
■A form of code motion

Why can't the compiler do this?


Optimization Blocker: Procedure Calls
⬛ Why couldn’t compiler move strlen out of inner loop?
▪ Function may have side effects
▪ Alters global state each time called
▪ Function may not return same value for given arguments
▪ Depends on other parts of global state
▪ Procedure lower could interact with strlen

⬛ What can the compiler do? Very little:


▪ Treat procedure call as a black box
▪ Weak optimizations near them
▪ Inline the functions (sometimes)
▪ gcc can do this within file
Do your own code motion!

Write your own version of strlen inside the file, and inline its use:

inline size_t strlen(const char *s)
{
    size_t length = 0;
    while (*s != '\0') {
        s++;
        length++;
    }
    return length;
}
Compilers Blocked by Memory Aliasing

void twiddle1(int *xp, int *yp) {
    *xp += *yp;
    *xp += *yp;
}

void twiddle2(int *xp, int *yp) {
    *xp += 2 * (*yp);
}

● twiddle2 is faster (fewer memory accesses)

● Why does the compiler not transform twiddle1 into twiddle2?
Compilers Blocked by Memory Aliasing

void twiddle1(int *xp, int *yp) {
    *xp += *yp;
    *xp += *yp;
}

void twiddle2(int *xp, int *yp) {
    *xp += 2 * (*yp);
}

● Because memory aliasing affects behavior:

● twiddle1(&a, &a) → a = 4a
● twiddle2(&a, &a) → a = 3a
● The compiler has to consider that xp and yp may be equal, so it cannot optimize twiddle1 into twiddle2
An Example
Vector ADT

/* data structure for vectors */
typedef struct {
    size_t len;
    data_t *data;   /* elements data[0] .. data[len-1] */
} vec;

Procedures
vec_ptr new_vec(int len)
●Create vector of specified length
int get_vec_element(vec_ptr v, int index, data_t *dest)
●Retrieve vector element, store at *dest
●Return 0 if out of bounds, 1 if successful
data_t *get_vec_start(vec_ptr v)
●Return pointer to start of vector data
■Similar to array implementations in Pascal, ML, Java
●E.g., always do bounds checking
Benchmark Computation
void combine1(vec_ptr v, data_t *dest)
{
    int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++)
    {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}

⬛data_t: int, long, float, double
⬛OP / IDENT: + / 0 (sum), * / 1 (product)

Computes the sum or product of the vector elements.
Initial version (combine1)
void combine1(vec_ptr v, int *dest)
{
int i;
*dest = 0;
for (i = 0; i < vec_length(v); i++) {
int val;
get_vec_element(v, i, &val);
*dest += val;
}
}

Procedure
■Compute sum of all elements of vector
■Store result at destination location
Move vec_length Call Out of Loop
void combine2(vec_ptr v, int *dest)
{
int i;
int length = vec_length(v);
*dest = 0;
for (i = 0; i < length; i++) {
int val;
get_vec_element(v, i, &val);
*dest += val;
}
}

Optimization
■Move call to vec_length out of inner loop
●Value does not change from one iteration to next
●Code motion
■ vec_length requires only constant time, but significant overhead
Reduction in Strength
void combine3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        *dest += data[i];
    }
}

Avoid a procedure call to retrieve each vector element

●Get a pointer to the start of the array before the loop
●Within the loop, just do an array reference
●Not as clean in terms of data abstraction
Eliminate Unneeded Memory Refs
void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}

■Don't need to store to the destination until the end

■The local variable sum is held in a register
■Avoids 1 memory read and 1 memory write per iteration
■Memory references are expensive!
Optimization Blocker: Memory Aliasing
Aliasing
■Two different memory references specify a single location
Example
■v: [3, 2, 17]
■combine1(v, get_vec_start(v)+2) --> [3, 2, 10]
■combine4(v, get_vec_start(v)+2) --> [3, 2, 22]
Observations
■Easy to have happen in C
●Since address arithmetic is allowed
●Direct access to storage structures
■Get in the habit of introducing local variables
●Accumulate within loops into them
●Your way of telling the compiler not to worry about aliasing
Effect of Basic Optimizations
void combine4(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    data_t *d = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP d[i];
    *dest = t;
}

Method                   Integer           Double FP
Operation                Add      Mult     Add      Mult
Combine1 (unoptimized)   22.68    20.02    19.98    20.18
Combine1 –O1             10.12    10.12    10.17    11.14
Combine4                 1.27     3.01     3.01     5.01

⬛ Eliminates sources of overhead in the loop


Machine-Independent Opt. Summary
Code Motion
■Compilers are good at this for simple loop/array structures
■They don't do well in the presence of procedure calls and memory aliasing

Reduction in Strength
■Shift and add instead of multiply or divide
●compilers are (generally) good at this
●Exact trade-offs are machine-dependent
■Keep data in registers rather than memory
●compilers are not good at this, since they must worry about aliasing

Share Common Subexpressions

■compilers have limited algebraic reasoning capabilities
■Help the compiler overcome aliasing: use local variables
Loop Unrolling (2x1)
void unroll2a_combine(vec_ptr v, data_t *dest)
{
long length = vec_length(v);
long limit = length-1;
data_t *d = get_vec_start(v);
data_t x = IDENT;
long i;
/* Combine 2 elements at a time */
for (i = 0; i < limit; i+=2) {
x = (x OP d[i]) OP d[i+1];
}
/* Finish any remaining elements */
for (; i < length; i++) {
x = x OP d[i];
}
*dest = x;
}

⬛ Perform 2x more useful work per iteration


Loop Unrolling with Reassociation (2x1a)
void unroll2aa_combine(vec_ptr v, data_t *dest)
{
    long length = vec_length(v);
    long limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x = IDENT;
    long i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x = x OP (d[i] OP d[i+1]);
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x = x OP d[i];
    }
    *dest = x;
}

Compare to before:
x = (x OP d[i]) OP d[i+1];

⬛ Can this change the result of the computation?

⬛ Yes, for FP. Why?
Loop Unrolling with Separate Accumulators (2x2)
void unroll2x2_combine(vec_ptr v, data_t *dest)
{
    long length = vec_length(v);
    long limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x0 = IDENT;
    data_t x1 = IDENT;
    long i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x0 = x0 OP d[i];
        x1 = x1 OP d[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x0 = x0 OP d[i];
    }
    *dest = x0 OP x1;
}

⬛ A different form of reassociation


What About Branches?
⬛ Challenge
▪The Instruction Control Unit must work well ahead of the Execution Unit
to generate enough operations to keep the EU busy
▪When it encounters a conditional branch, it cannot reliably determine where to continue fetching

404663: mov $0x0,%eax        ← Executing
404668: cmp (%rdi),%rsi
40466b: jge 404685           ← How to continue?
40466d: mov 0x8(%rdi),%rax
. . .
404685: repz retq

Branch Outcomes
▪When it encounters a conditional branch, the CPU cannot determine where to continue fetching
▪Branch Taken: transfer control to the branch target
▪Branch Not-Taken: continue with the next instruction in sequence
▪Cannot resolve until the outcome is determined by the branch/integer unit

404663: mov $0x0,%eax
404668: cmp (%rdi),%rsi
40466b: jge 404685
40466d: mov 0x8(%rdi),%rax   ← Branch Not-Taken continues here
. . .
404685: repz retq            ← Branch Taken continues here

Branch Prediction
⬛ Idea
▪ Guess which way the branch will go
▪ Begin executing instructions at the predicted position
▪ But don't actually modify register or memory data

404663: mov $0x0,%eax
404668: cmp (%rdi),%rsi
40466b: jge 404685           ← Predict Taken
40466d: mov 0x8(%rdi),%rax
. . .
404685: repz retq            ← Begin execution here
Branch Prediction Through Loop
Assume vector length = 100

401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029           i = 98   → Predict Taken (OK)

401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029           i = 99   → Predict Taken (Oops)

401029: vmulsd (%rdx),%xmm0,%xmm0    ← Executed: reads an invalid location
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029           i = 100

401029: vmulsd (%rdx),%xmm0,%xmm0    ← Fetched
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029           i = 101
Branch Misprediction Invalidation
Assume vector length = 100

401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029           i = 98   → Predict Taken (OK)

401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029           i = 99   → Predict Taken (Oops)

401029: vmulsd (%rdx),%xmm0,%xmm0    ← Invalidate
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029           i = 100

401029: vmulsd (%rdx),%xmm0,%xmm0    ← Invalidate
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029           i = 101
Branch Misprediction Recovery
401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx        i = 99   → Definitely not taken
401034: jne 401029
401036: jmp 401040
. . .
401040: vmovsd %xmm0,(%r12)  ← Reload pipeline here

⬛ Performance Cost
▪ Multiple clock cycles on a modern processor
▪ Can be a major performance limiter
Structure Representation
Alignment Principles

⬛ Aligned Data
▪ A primitive data type requires K bytes
▪ Its address must be a multiple of K
▪ Required on some machines; advised on x86-64
  (the exact requirement varies between operating systems)

⬛ Motivation for Aligning Data
▪ Memory is accessed in aligned chunks of 8 bytes (system dependent)
▪ Alignment makes variable access efficient: a minimal number of RAM accesses is needed to bring a variable from RAM to the CPU

⬛ Compiler
▪ Inserts gaps to ensure correct alignment of variables
▪ Alignment is performed by the compiler at compile time
Specific Cases of Alignment (x86-64)
⬛ 1 byte: char, …
▪ no restrictions on address — a char variable can sit at any address in RAM

⬛ 2 bytes: short, …
▪ lowest 1 bit of address must be 0₂ — the address must be divisible by 2, and so on…

⬛ 4 bytes: int, float, …
▪ lowest 2 bits of address must be 00₂

⬛ 8 bytes: double, long, char *, …
▪ lowest 3 bits of address must be 000₂

⬛ 16 bytes: long double (GCC on Linux)
▪ lowest 4 bits of address must be 0000₂
Structures & Alignment
⬛ Unaligned Data
If we simply sum the sizes of the fields of S1, we get 17 bytes:

struct S1 {
    char c;
    int i[2];
    double v;
} *p;

c    i[0]   i[1]   v
p    p+1    p+5    p+9    p+17
Satisfying Alignment with Structures
⬛ Within structure:
▪ Must satisfy each element's alignment requirement
⬛ Overall structure placement
▪ Each structure has an alignment requirement K
▪ K = largest alignment of any element
▪ The initial address & the structure length must be multiples of K
  (an additional requirement: the structure must start at an address divisible by the size of its largest field)

⬛ Example:

struct S1 {
    char c;
    int i[2];
    double v;
} *p;

▪ K = 8, due to the double element
▪ In practice this structure occupies more space, because of alignment; the allocation of the struct must also end at an address divisible by K, where K is the largest field size in the struct

     3 bytes               4 bytes
c    unused   i[0]  i[1]   unused   v
p    p+1      p+4   p+8    p+12     p+16     p+24

▪ The address p+1 is not divisible by 4; since the next field is an int of size 4 bytes, the compiler aligns it to the nearest address that is divisible by 4 (p+4, a multiple of 4)
▪ The address p+12 is not divisible by 8; since the next field is a double of size 8 bytes, the compiler aligns it to the nearest address that is divisible by 8 (p+16, a multiple of 8; the total size, 24, is also a multiple of 8)
Meeting Overall Alignment Requirement
⬛ For the largest alignment requirement K
⬛ The overall structure size must be a multiple of K

An example of a different ordering of the fields within the struct:

struct S1 {
    char c;
    int i[2];
    double v;
} *p;

struct S2 {
    double v;
    int i[2];
    char c;
} *p;

In the new ordering we still lose the same amount of unused bytes, but we gain that all the fields sit contiguously in memory, with no holes between them. Because of this contiguity, our structure becomes cache friendly.

v        i[0]  i[1]  c     7 bytes unused
p        p+8         p+16                  p+24

Multiple of K=8
Saving Space
Another example of different orderings of the same struct — this time the new ordering saves space!

⬛ Put large data types first
⬛ Effect (K=4)

Since a char has no placement restriction, grouping all the char fields together saves memory.

struct S4 {
    char c;
    int i;
    char d;
} *p;

struct S5 {
    int i;
    char c;
    char d;
} *p;

S4 layout: c, 3 bytes unused, i, d, 3 bytes unused  → 12 bytes
S5 layout: i, c, d, 2 bytes unused                  → 8 bytes


Getting High Performance
⬛ Use good compiler
⬛ Smart usage of compilation flags
⬛ Don’t do anything stupid
▪ Watch out for hidden algorithmic inefficiencies
▪ Write compiler-friendly code
▪ Watch out for optimization blockers:
procedure calls & memory references
▪ Look carefully at innermost loops (where most work is done)

⬛ Tune code for machine


▪ Exploit instruction-level parallelism
▪ Avoid unpredictable branches
▪ Make code cache friendly (discussed elsewhere in course)
Thank You!
