Lecture 7 - Optimizations - A 2025

The document discusses various optimization techniques for computer programs, focusing on measuring execution time and the role of compilers in improving efficiency. It highlights the importance of identifying critical code sections, the limitations of compilers in optimizing certain code patterns, and the impact of memory aliasing on performance. Examples illustrate how to optimize code through techniques like code motion and eliminating unnecessary memory references.


Computer Structure
Lecture 7 – Optimizations

Dr. Marina Kogan-Sadetsky

Based on lectures by Prof. Gal Kaminka and on lectures by Bryant and O'Hallaron
Measurement Challenge
How Much Time Does Program X Require?
CPU time
How many total seconds of processor time are used when executing X?
Actual ("Wall-Clock") Time
How many seconds elapse between the start and completion of X?
Confounding Factors
How does time get measured?
Many processes share computing resources
The effects of switching between processes may be noticeable
"Time" on a Computer System

real (wall clock) time =
    user time (time executing instructions in the user process)
  + system time (time executing instructions in the kernel on behalf of the user process)
  + other users' time (time executing instructions in different users' processes)

We will use the word "time" to refer to user time.

Interval Counting Example
(figure omitted: cumulative user time measured over repeated runs)
The 90/10 Rule of Thumb
90% of execution time is spent in 10% of the code

Lesson:
Find the 10% that really counts!
Let the compiler worry about the rest

Important:
First make the program work correctly
Make sure it is easy to maintain
Then optimize
Optimizing Compilers
Provide an efficient mapping of the program to the machine
■register allocation – choosing which values live in registers
■code selection and ordering – reordering instructions to exploit the pipeline
■eliminating minor inefficiencies

Don't improve asymptotic efficiency (usually)

■up to the programmer to select the best overall algorithm
Limitations of Compilers
● Must never change program behavior
■Prevents optimizations that might affect behavior
■Even if, in practice, the dangerous conditions can never happen

● Behavior that is obvious to the programmer may not be clear to the compiler

■e.g., data ranges may be more limited than the variable types suggest

● Most analysis is performed only within procedures

■Whole-program analysis is too expensive in most cases
■Or relies on source code that is not available

● Most analysis is based only on static information

■The compiler has difficulty anticipating run-time inputs
■Exception: just-in-time compilation in virtual bytecode machines
What compilers can do (in general)
An Example
Original Code:
/* Sum 4-grid neighbors of val[i][j] */

sum = val[i-1][j]
+ val[i+1][j]
+ val[i][j-1]
+ val[i][j+1];

● This is a common step in image/video processing, neural networks, …


● Does this for every pixel i,j, so it will be called N² times!
● What can we do?
Reuse Common Sub-Expressions
Original Code:
/* Sum 4-grid neighbors of val[i][j] */

sum = val[i-1][j]
+ val[i+1][j]
+ val[i][j-1]
+ val[i][j+1];

We know it is really:
/* Sum neighbors of i,j */
up = val[(i-1)*n + j];
down = val[(i+1)*n + j];
left = val[i*n + j-1];
right = val[i*n + j+1];
sum = up + down + left + right;

3 multiplications: i*n, (i–1)*n, (i+1)*n


Reuse Common Sub-Expressions
Original Code:
/* Sum 4-grid neighbors of val[i][j] */

sum = val[i-1][j]
    + val[i+1][j]
    + val[i][j-1]
    + val[i][j+1];

We know it is really:
/* Sum neighbors of i,j */
up    = val[(i-1)*n + j];
down  = val[(i+1)*n + j];
left  = val[i*n + j-1];
right = val[i*n + j+1];
sum   = up + down + left + right;

3 multiplications: i*n, (i–1)*n, (i+1)*n

Which can be transformed into:
int inj = i*n + j;
up    = val[inj - n];
down  = val[inj + n];
left  = val[inj - 1];
right = val[inj + 1];
sum   = up + down + left + right;

1 multiplication: i*n


gcc can do this when optimizing at the -O1 level.

Easy: Code Motion in Loops

Original Code:
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[i][j] = b[j];

We know it is really:
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[i*n+j] = b[j];

What can we do here?


Simple Code Motion:

for (i = 0; i < n; i++) {
    int ni = n*i;
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
}
A further improvement replaces the multiplication with a running increment:

int ni = 0;
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
    ni += n;
}

● Recognize sequences of products, replace them with increments
● A form of strength reduction
What compilers can do – but where it would be the wrong decision
What is the output of the program?

#include <stdio.h>

int main() {

char c = 125;
while (c > 0) {
printf("%d ", c);
c++;
}

return 0;
}
What is the output of the program?

#include <stdio.h>

int main() {

char c = 125;
while (c > 0) {
printf("%d ", c);
c++;
}

return 0;
}

Output: 125 126 127
(on typical systems, incrementing past 127 wraps the signed char to -128, so the loop exits)
What is the output of the program?

#include <stdio.h>

int main() {

char c = 125;
while (c < (c + 1)) {
printf("%d ", c);
c++;
}

return 0;
}
What is the output of the program?

#include <stdio.h>

int main() {

    char c = 125;
    while (c < (c + 1)) {
        printf("%d ", c);
        c++;
    }

    return 0;
}

GCC recognizes that c < (c + 1) always holds: in the comparison, c is promoted to int, and for every signed char value the result of c + 1 is a larger integer. Given this insight, GCC optimizes the loop condition while (c < (c + 1)) into while (true), since the condition can never become false during execution.

This decision is wrong in our case.

Output: an infinite loop

Since gcc decides that c is always smaller than (c+1),
it replaces "while (c < (c+1))" with "while (true)".
What compilers cannot do (in general)
Moving Functions Out of Loop
Procedure to Convert String to Lower Case
void lower(char *s)
{
int i;
for (i = 0; i < strlen(s); i++)
if (s[i] >= 'A' && s[i] <= 'Z')
s[i] -= ('A' - 'a');
}

Often found in student exercise submissions

■ strlen is executed on every iteration

■ strlen is linear in the length of the string
●Must scan the string until it finds '\0'
■Overall performance is quadratic
This would be better
void lower(char *s)
{
    int i;
    int len = strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

■Move the call to strlen outside the loop

■Since the result does not change from one iteration to another
■A form of code motion

Why can't the compiler do this?


Optimization Blocker: Procedure Calls
⬛ Why couldn’t compiler move strlen out of inner loop?
▪ Function may have side effects
▪ Alters global state each time called
▪ Function may not return same value for given arguments
▪ Depends on other parts of global state
▪ Procedure lower could interact with strlen

⬛ What can the compiler do? Very little:


▪ Treat procedure call as a black box
▪ Weak optimizations near them
▪ Inline the functions (sometimes)
▪ gcc can do this within file
Do your own code motion!

Write your own version of strlen inside the file, and inline its use:

inline size_t strlen(const char *s)
{
    size_t length = 0;
    while (*s != '\0') {
        s++;
        length++;
    }
    return length;
}
Compilers Blocked by Memory Aliasing

void twiddle1(int *xp, int *yp) {
    *xp += *yp;
    *xp += *yp;
}

void twiddle2(int *xp, int *yp) {
    *xp += 2 * (*yp);
}

● twiddle2 is faster (fewer memory accesses)

● Why does the compiler not transform twiddle1 into twiddle2?
Compilers Blocked by Memory Aliasing

void twiddle1(int *xp, int *yp) {
    *xp += *yp;
    *xp += *yp;
}

void twiddle2(int *xp, int *yp) {
    *xp += 2 * (*yp);
}

● Because memory aliasing affects behavior:

● twiddle1(&a, &a) → a = 4a
● twiddle2(&a, &a) → a = 3a
● The compiler has to consider that xp and yp may be equal, so it cannot optimize twiddle1 into twiddle2
An Example
Vector ADT

/* data structure for vectors */
typedef struct {
    size_t len;
    data_t *data;   /* elements data[0] .. data[len-1] */
} vec;

Procedures
vec_ptr new_vec(int len)
●Create vector of specified length
int get_vec_element(vec_ptr v, int index, data_t *dest)
●Retrieve vector element, store at *dest
●Return 0 if out of bounds, 1 if successful
data_t *get_vec_start(vec_ptr v)
●Return pointer to start of vector data
■Similar to array implementations in Pascal, ML, Java
●E.g., always do bounds checking
Benchmark Computation
void combine1(vec_ptr v, data_t *dest)
{
    int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++)
    {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}

⬛data_t: int, long, float, double
⬛OP / IDENT: + / 0 (sum), * / 1 (product)

Computes the sum or product of the vector elements.
Initial version (combine1)
void combine1(vec_ptr v, int *dest)
{
int i;
*dest = 0;
for (i = 0; i < vec_length(v); i++) {
int val;
get_vec_element(v, i, &val);
*dest += val;
}
}

Procedure
■Compute sum of all elements of vector
■Store result at destination location
Move vec_length Call Out of Loop
void combine2(vec_ptr v, int *dest)
{
int i;
int length = vec_length(v);
*dest = 0;
for (i = 0; i < length; i++) {
int val;
get_vec_element(v, i, &val);
*dest += val;
}
}

Optimization
■Move call to vec_length out of inner loop
●Value does not change from one iteration to next
●Code motion
■ vec_length requires only constant time, but significant overhead
Reduction in Strength
void combine3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        *dest += data[i];
    }
}

Avoid a procedure call to retrieve each vector element

●Get a pointer to the start of the array before the loop
●Within the loop, just do an array reference
●Not as clean in terms of data abstraction
Eliminate Unneeded Memory Refs
void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}

■Don't need to store to the destination until the end

■The local variable sum is held in a register
■Avoids 1 memory read and 1 memory write per iteration
■Memory references are expensive!
Optimization Blocker: Memory Aliasing
Aliasing
■Two different memory references specify a single location
Example
■v: [3, 2, 17]
■combine1(v, get_vec_start(v)+2) --> [3, 2, 10]
■combine4(v, get_vec_start(v)+2) --> [3, 2, 22]
Observations
■Easy to have happen in C
●Since address arithmetic is allowed
●Direct access to storage structures
■Get in the habit of introducing local variables
●Accumulate within loops into them
●Your way of telling the compiler not to worry about aliasing
Effect of Basic Optimizations
void combine4(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    data_t *d = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP d[i];
    *dest = t;
}

Method                   Integer           Double FP
Operation                Add      Mult     Add      Mult
Combine1 (unoptimized)   22.68    20.02    19.98    20.18
Combine1 –O1             10.12    10.12    10.17    11.14
Combine4                 1.27     3.01     3.01     5.01

⬛ Eliminates sources of overhead in the loop


Machine-Independent Opt. Summary
Code Motion
■Compilers are good at this for simple loop/array structures
■They don't do well in the presence of procedure calls and memory aliasing

Reduction in Strength
■Shift and add instead of multiply or divide
●compilers are (generally) good at this
●Exact trade-offs are machine-dependent
■Keep data in registers rather than memory
●compilers are not good at this, since they must worry about aliasing

Share Common Subexpressions

■compilers have limited algebraic reasoning capabilities
■Help the compiler overcome aliasing: use local variables
Loop Unrolling (2x1)
void unroll2a_combine(vec_ptr v, data_t *dest)
{
long length = vec_length(v);
long limit = length-1;
data_t *d = get_vec_start(v);
data_t x = IDENT;
long i;
/* Combine 2 elements at a time */
for (i = 0; i < limit; i+=2) {
x = (x OP d[i]) OP d[i+1];
}
/* Finish any remaining elements */
for (; i < length; i++) {
x = x OP d[i];
}
*dest = x;
}

⬛ Perform 2x more useful work per iteration


Loop Unrolling with Reassociation (2x1a)
void unroll2aa_combine(vec_ptr v, data_t *dest)
{
    long length = vec_length(v);
    long limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x = IDENT;
    long i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x = x OP (d[i] OP d[i+1]);
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x = x OP d[i];
    }
    *dest = x;
}

Compare to before:
x = (x OP d[i]) OP d[i+1];

⬛ Can this change the result of the computation?

⬛ Yes, for FP. Why?
Loop Unrolling with Separate Accumulators (2x2)
void unroll2x2_combine(vec_ptr v, data_t *dest)
{
    long length = vec_length(v);
    long limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x0 = IDENT;
    data_t x1 = IDENT;
    long i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x0 = x0 OP d[i];
        x1 = x1 OP d[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x0 = x0 OP d[i];
    }
    *dest = x0 OP x1;
}

⬛ A different form of reassociation


What About Branches?
⬛ Challenge
▪The Instruction Control Unit must work well ahead of the Execution Unit
to generate enough operations to keep the EU busy
▪When it encounters a conditional branch, it cannot reliably determine where to continue fetching

404663: mov $0x0,%eax        ← Executing
404668: cmp (%rdi),%rsi
40466b: jge 404685           ← How to continue?
40466d: mov 0x8(%rdi),%rax
. . .
404685: repz retq

Branch Outcomes
▪When it encounters a conditional branch, the CPU cannot determine where to continue fetching
▪Branch Taken: transfer control to the branch target
▪Branch Not-Taken: continue with the next instruction in sequence
▪Cannot resolve until the outcome is determined by the branch/integer unit

404663: mov $0x0,%eax
404668: cmp (%rdi),%rsi
40466b: jge 404685
40466d: mov 0x8(%rdi),%rax   ← Branch Not-Taken continues here
. . .
404685: repz retq            ← Branch Taken continues here

Branch Prediction
⬛ Idea
▪ Guess which way the branch will go
▪ Begin executing instructions at the predicted position
▪ But don't actually modify register or memory data

404663: mov $0x0,%eax
404668: cmp (%rdi),%rsi
40466b: jge 404685           ← Predict Taken
40466d: mov 0x8(%rdi),%rax
. . .
404685: repz retq            ← Begin execution here
Branch Prediction Through Loop
Assume vector length = 100

401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029           i = 98   → Predict Taken (OK)

401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029           i = 99   → Predict Taken (Oops)

401029: vmulsd (%rdx),%xmm0,%xmm0    ← Executed: reads an invalid location
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029           i = 100

401029: vmulsd (%rdx),%xmm0,%xmm0    ← Fetched
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029           i = 101
Branch Misprediction Invalidation
Assume vector length = 100

401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029           i = 98   → Predict Taken (OK)

401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029           i = 99   → Predict Taken (Oops)

401029: vmulsd (%rdx),%xmm0,%xmm0    ← Invalidate
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029           i = 100

401029: vmulsd (%rdx),%xmm0,%xmm0    ← Invalidate
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx
401034: jne 401029           i = 101
Branch Misprediction Recovery
401029: vmulsd (%rdx),%xmm0,%xmm0
40102d: add $0x8,%rdx
401031: cmp %rax,%rdx        i = 99   → Definitely not taken
401034: jne 401029
401036: jmp 401040
. . .
401040: vmovsd %xmm0,(%r12)  ← Reload pipeline here

⬛ Performance Cost
▪ Multiple clock cycles on a modern processor
▪ Can be a major performance limiter
Structure Representation
Alignment Principles

⬛ Aligned Data
▪ A primitive data type requires K bytes
▪ Its address must be a multiple of K
▪ Required on some machines; advised on x86-64
  (the exact requirement varies between operating systems)

⬛ Motivation for Aligning Data
▪ Memory is accessed in aligned chunks of 8 bytes (system dependent)
▪ Alignment makes variable access efficient: a minimal number of RAM accesses is needed to bring a variable from RAM to the CPU

⬛ Compiler
▪ Inserts gaps to ensure correct alignment of variables
▪ Alignment is performed by the compiler at compile time
Specific Cases of Alignment (x86-64)
⬛ 1 byte: char, …
▪ no restrictions on address — a char variable can sit at any address in RAM

⬛ 2 bytes: short, …
▪ lowest 1 bit of address must be 0₂ — the address must be divisible by 2, and so on…

⬛ 4 bytes: int, float, …
▪ lowest 2 bits of address must be 00₂

⬛ 8 bytes: double, long, char *, …
▪ lowest 3 bits of address must be 000₂

⬛ 16 bytes: long double (GCC on Linux)
▪ lowest 4 bits of address must be 0000₂
Structures & Alignment
⬛ Unaligned Data
If we simply sum the sizes of the fields of S1, we get 17 bytes:

struct S1 {
    char c;
    int i[2];
    double v;
} *p;

c    i[0]   i[1]   v
p    p+1    p+5    p+9    p+17
Satisfying Alignment with Structures
⬛ Within structure:
▪ Must satisfy each element's alignment requirement
⬛ Overall structure placement
▪ Each structure has an alignment requirement K
▪ K = largest alignment of any element
▪ The initial address & the structure length must be multiples of K
  (an additional requirement: the structure must start at an address divisible by the size of its largest field)

⬛ Example:

struct S1 {
    char c;
    int i[2];
    double v;
} *p;

▪ K = 8, due to the double element
▪ In practice this structure occupies more space, because of alignment; the allocation of the struct must also end at an address divisible by K, where K is the largest field size in the struct

     3 bytes               4 bytes
c    unused   i[0]  i[1]   unused   v
p    p+1      p+4   p+8    p+12     p+16     p+24

▪ The address p+1 is not divisible by 4; since the next field is an int of size 4 bytes, the compiler aligns it to the nearest address that is divisible by 4 (p+4, a multiple of 4)
▪ The address p+12 is not divisible by 8; since the next field is a double of size 8 bytes, the compiler aligns it to the nearest address that is divisible by 8 (p+16, a multiple of 8; the total size, 24, is also a multiple of 8)
Meeting Overall Alignment Requirement
⬛ For the largest alignment requirement K
⬛ The overall structure size must be a multiple of K

An example of a different ordering of the fields within the struct:

struct S1 {
    char c;
    int i[2];
    double v;
} *p;

struct S2 {
    double v;
    int i[2];
    char c;
} *p;

In the new ordering we still lose the same amount of unused bytes, but we gain that all the fields sit contiguously in memory, with no holes between them. Because of this contiguity, our structure becomes cache friendly.

v        i[0]  i[1]  c     7 bytes unused
p        p+8         p+16                  p+24

Multiple of K=8
Saving Space
Another example of different orderings of the same struct — this time the new ordering saves space!

⬛ Put large data types first
⬛ Effect (K=4)

Since a char has no placement restriction, grouping all the char fields together saves memory.

struct S4 {
    char c;
    int i;
    char d;
} *p;

struct S5 {
    int i;
    char c;
    char d;
} *p;

S4 layout: c, 3 bytes unused, i, d, 3 bytes unused  → 12 bytes
S5 layout: i, c, d, 2 bytes unused                  → 8 bytes


Getting High Performance
⬛ Use good compiler
⬛ Smart usage of compilation flags
⬛ Don’t do anything stupid
▪ Watch out for hidden algorithmic inefficiencies
▪ Write compiler-friendly code
▪ Watch out for optimization blockers:
procedure calls & memory references
▪ Look carefully at innermost loops (where most work is done)

⬛ Tune code for machine


▪ Exploit instruction-level parallelism
▪ Avoid unpredictable branches
▪ Make code cache friendly (discussed elsewhere in course)
Thank You!
