Finished
Compilation Tooling
Make fuzz.autotools.sh more robust to handle binutils.
Specifically, it should handle originalLDFLAGS.txt better,
it should search for multiple targets that were generated,
and then compile all of them in parallel.
Make fuzz.binutils and fuzz.coreutils scripts that download and compile their respective packages seamlessly.
FunctionCoveragePass
Add a FunctionCoveragePass, which does the exact same thing as BlockCoveragePass,
except it traces just functions, not whole blocks.
This will be useful when the user wants to examine the output trace themselves,
and it will help reduce the output size, since there are fewer functions than blocks,
especially when the code is heavily inlined.
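Rough sketch of what this pass could look like (the hook name __FunctionCoverage_onFunction and the new-pass-manager boilerplate are just my placeholders, not the existing implementation):

    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/Module.h"
    #include "llvm/IR/PassManager.h"

    using namespace llvm;

    // Hedged sketch: instrument only function entries, unlike BlockCoveragePass.
    struct FunctionCoveragePass : PassInfoMixin<FunctionCoveragePass> {
      PreservedAnalyses run(Module &M, ModuleAnalysisManager &) {
        IRBuilder<> B(M.getContext());
        // Hypothetical runtime hook; the real name/signature may differ.
        FunctionCallee Hook = M.getOrInsertFunction(
            "__FunctionCoverage_onFunction", B.getVoidTy(), B.getInt64Ty());
        uint64_t Index = 0;
        for (Function &F : M) {
          if (F.isDeclaration())
            continue;  // only instrument functions with a body
          B.SetInsertPoint(&*F.getEntryBlock().getFirstInsertionPt());
          B.CreateCall(Hook, {B.getInt64(Index++)});
        }
        return PreservedAnalyses::none();
      }
    };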
LEB128 Varint Encoding Compression
Write a LEB128 output buffer class for compression.
After testing on "nm-new.coverage `which opt-9-debug` | cxxfilt.coverage",
which generated 24 GB of coverage output for cxxfilt alone (opt-9 is 1.25 GB),
I realized I really need to compress the output in some way,
especially since running normal compression tools on the resulting output
drastically compresses them:
blocks.bin: 400x smaller
nonSingle.bin: 540x smaller
single.bin: 4800x smaller
For onBlock, most sequentially executed blocks are near each other,
so if I use the difference between adjacent blocks, the deltas will mostly be small.
Therefore, I can use a varint encoding, such as signed LEB128.
For onInfiniteBranch, there's no pattern, but varint encoding might still help.
For onMultiBranch, the numbers are likely very low, so varint encoding will definitely help.
Results of LEB128 encoding on "nm-new.coverage `which opt-9-debug` | cxxfilt.coverage > /dev/null":
blocks.bin: 7x smaller
nonSingle.bin: 2x smaller
Normal compression algorithms still help a lot on top of LEB128.
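A minimal sketch of the delta + signed LEB128 idea (illustration only; the real output buffer class will differ):

    #include <cstdint>
    #include <vector>

    // Hedged sketch: encode block-index deltas with signed LEB128.
    static void encodeSLEB128(int64_t value, std::vector<uint8_t> &out) {
      bool more = true;
      while (more) {
        uint8_t byte = value & 0x7f;
        value >>= 7;  // arithmetic shift keeps the sign
        // Done when only sign bits remain and this byte's sign bit matches them.
        if ((value == 0 && !(byte & 0x40)) || (value == -1 && (byte & 0x40)))
          more = false;
        else
          byte |= 0x80;  // continuation bit
        out.push_back(byte);
      }
    }

    // Usage: emit the difference between consecutive block indices, which is
    // usually small, so most deltas fit in a single byte.
    static void emitBlockDelta(uint64_t blockIndex, std::vector<uint8_t> &out) {
      static uint64_t previous = 0;  // illustration only; real state lives in the runtime
      encodeSLEB128(static_cast<int64_t>(blockIndex) - static_cast<int64_t>(previous), out);
      previous = blockIndex;
    }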
Runtime Performance Optimization
Investigate the performance degradation when adding the coverage.
"nm-new.coverage `which opt-9-debug` | cxxfilt.coverage > /dev/null" took ~8 minutes,
but "nm-new `which opt-9-debug` | cxxfilt > /dev/null" only took ~8 seconds.
I'm trying to figure out how to profile it now using clang (apparently gprof doesn't work).
Running perf on cxxfilt.coverage showed that most of the time is spent in __BranchCoverage_onMultiBranch()
calling pwrite(): while non-single branches are buffered,
I flush the counts on every non-single branch, because I thought they were much rarer than they actually are.
This just means that I shouldn't flush the counts every time,
or maybe I should just mmap the memory for the counts so it's just a simple store instead of pwrite.
I fixed most of the slow performance by only flushing the counts
when the buffer is flushed also.
Since the buffers are unconditionally flushed on program exit/crash,
the counts will always be saved as well.
This removes pwrite() and __BranchCoverage_onMultiBranch() as a hotspot
in the perf flamegraph, speeding up execution of cxxfilt.coverage by ~10x.
It's still ~10x slower than the un-instrumented cxxfilt, though.
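Roughly what the deferred flush looks like (names and buffer layout here are made up for illustration):

    #include <cstddef>
    #include <cstdint>
    #include <unistd.h>

    // Hedged sketch: counts are written out only when the branch buffer itself
    // flushes, not on every non-single branch. The buffer (and so the counts)
    // is also flushed unconditionally at exit/crash, so nothing is lost.
    namespace {
    constexpr size_t kBufSize = 1 << 16;
    uint8_t branchBuf[kBufSize];
    size_t branchUsed = 0;
    uint64_t counts[4];          // per-type totals, kept in memory between flushes
    int branchFd, countFd;       // opened at startup

    void flushAll() {
      write(branchFd, branchBuf, branchUsed);
      branchUsed = 0;
      pwrite(countFd, counts, sizeof(counts), 0);  // one positioned write per flush
    }

    void onMultiBranchByte(uint8_t encoded) {
      if (branchUsed == kBufSize)
        flushAll();                                // counts ride along with the buffer flush
      branchBuf[branchUsed++] = encoded;
      ++counts[1];                                 // just bump the in-memory count here
    }
    }  // namespace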
onMultiBranch Bugfix and Compression
I noticed that nonSingle.bin was getting very large,
even bigger than blocks.bin on many occasions.
The number of switch cases is generally small, so this seemed odd.
After collecting averages of the LEB128 bytes used for each non-single branch part, and averages of the actual part values,
I think I discovered the reason.
While LEB128 sizes still averaged ~1 byte,
the combined (1 bit + single branch bit index diff) averaged only ~4,
which is actually pretty small, and expected given how many switch cases there are.
But the low (switch value) averaged ~34 and high (num cases) averaged ~77.
This is extremely high, but it's also a product of the specific program.
cxxfilt uses a lot of large switch statements,
more than most programs, especially more modern programs.
There's not much I can do about this, however,
since the program was just written this way.
But in looking through the instrumented IR,
I discovered that there are a ton of redundant onMultiBranch() calls.
Many of the switch cases return to the same block,
causing repeated onMultiBranch() calls for that block.
For example, in d_count_templates_scopes(), which is one of the main demangling functions used in cxxfilt,
there is a switch statement with 82 cases. One of the logical branches, probably the most common,
has 62 cases going into it (meaning each time it records 62 multi branches), and on top of that the function is recursive.
This is bad for two reasons:
Firstly, each time one of those cases is taken,
all of the onMultiBranch() calls are made, when only one should be.
Secondly, this repetition of return blocks
means that there are a lot fewer actual branches in the switch.
When I instrument the code, I can keep track of the actual number of branches.
This will also reduce each of the switch values and the number of cases,
which will reduce the varint encoding output.
Also, this is the perfect case for profile-guided optimization.
I can sort the multi branches by probability,
so once varint encoded, the most likely branches produce the least output.
After fixing this, nonSingle.bin got a lot smaller on switch-case heavy programs.
"nm-new.coverage `which opt-9-debug` | cxxfilt.coverage" now generates
232 MB of nonSingle.bin instead of 5685 MB, a 25x improvement.
The averages also changed a lot:
combined: ~21 instead of ~4, which is expected, since there are no longer multiple onMultiBranch() calls in a row.
low: ~13 instead of ~34
high: ~24 instead of ~77, both of which are sizeable improvements.
onNonSingleBranch repeats
In a switch statement (a multi branch), one successor block might later jump to one of the other successor blocks.
In this case, onMultiBranch() gets called twice, but it was only supposed to be called once.
This behavior is unique to switch statements because of their fall-through semantics,
which allow multiple cases to go to the same successor block.
We record the switch/case value immediately before the switch instruction, like we do for br instructions,
but then we record extra information for when multiple cases are mapped to a single successor block.
Therefore, we want to place the onMultiBranch() call at the beginning of the successor block.
But that block can be entered in other ways.
We can check if this is the case by checking if the successor block has predecessor blocks other than the switch block.
Only in this case do we need to fix the runtime behavior.
To fix this, we add two more onMultiBranch APIs:
api(onSwitchCase)(bool valid, u32 branchNum, u32 numBranches) noexcept;
Before the switch instruction, we insert a boolean local variable set to true
to indicate that a switch instruction is going to execute.
It's important that this variable is on the stack so that it is thread safe.
Then onSwitchCase() is called with the boolean variable as the first argument.
onSwitchCase() then calls onMultiBranch() if valid is true.
The boolean variable is then set to false so that no other successor blocks erroneously call onMultiBranch().
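Roughly how this plays out at runtime (C++-level picture of the instrumented IR; the __BranchCoverage_onSwitchCase name assumes that's what the api() macro expands to, and the rest is illustrative):

    #include <cstdint>

    using u32 = uint32_t;

    // Assumed expansions of the api() macro above.
    extern "C" void __BranchCoverage_onMultiBranch(u32 branchNum, u32 numBranches) noexcept;

    extern "C" void __BranchCoverage_onSwitchCase(bool valid, u32 branchNum,
                                                  u32 numBranches) noexcept {
      if (valid)  // only the entry that came straight from the switch records anything
        __BranchCoverage_onMultiBranch(branchNum, numBranches);
    }

    // What an instrumented function effectively looks like (illustration only):
    void example(int x) {
      bool fromSwitch = true;  // stack flag set right before the switch, so it is thread safe
      switch (x) {
      case 0:
      case 1:
      shared:                  // successor block shared by several cases and a later jump
        __BranchCoverage_onSwitchCase(fromSwitch, /*branchNum=*/0, /*numBranches=*/2);
        fromSwitch = false;    // later entries into this block record nothing
        break;
      default:
        __BranchCoverage_onSwitchCase(fromSwitch, 1, 2);
        fromSwitch = false;
        goto shared;           // re-enters the shared block without a second record
      }
    }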
Branch to Block Recovery: BranchExecutionPass
Since we can recover the much larger block coverage data from the branch coverage data,
I have to figure out exactly how it will be done.
Firstly, we don't need to record as much branch coverage data,
because when recovering the block data, we know where in the code we are currently executing,
so we can use some of that information.
For single branches, we still need to record true or false.
For multi branches, we only need to record the actual branch value/number, not the number of total branches.
We know the number of total branches from the code, so we don't need to record it.
For infinite branches, it's a little more complicated (see below).
The change for multi branches decreases the size of nonSingle.bin by 4x for cxxfilt.
It's now smaller than single.bin for cxxfilt.
To recover the block coverage data, we could walk the bitcode in an optimization pass,
following the code path taken according to the branch coverage data.
But this interpretation of the bitcode would be really slow.
Instead, we can create another BranchExecutionPass that converts the program
into an executable that only converts branch coverage data to block coverage data.
It "executes" the recorded branches in effect.
For every branch, switch, etc. instruction, we replace the normal value
with a call to nextSingleBranch(), nextMultiBranch(), or nextInfiniteBranch(),
which reads the saved branch coverage data and returns the next branch value.
Then we can make calls to output APIs that output the recovered block coverage data.
All other instructions in the program are then removed, so that only branches
and the calls to the API are left.
Function calls to functions not statically linked in the bitcode are removed because they weren't traced.
And function calls to functions that are present in the bitcode are modified.
All arguments are removed, and all functions return void.
All the necessary data comes from the saved branch coverage data,
and all that remains in the executable is control flow but no data operations.
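A sketch of what one of those replacement calls could look like (file handling and buffering are placeholders):

    #include <cstdint>
    #include <cstdio>

    // Placeholders for the replay runtime: read back single-branch bits in order.
    static FILE *singleFile;   // opened on the recorded single-branch data at startup
    static uint8_t bitByte;
    static int bitsLeft = 0;

    extern "C" bool nextSingleBranch() {
      if (bitsLeft == 0) {
        if (fread(&bitByte, 1, 1, singleFile) != 1)
          return false;        // out of recorded data; a real runtime would stop here
        bitsLeft = 8;
      }
      bool taken = bitByte & 1;  // the recovered condition of the original br instruction
      bitByte >>= 1;
      --bitsLeft;
      return taken;
    }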
Branch and switch instructions are trivial to replace for this,
but virtual calls, both in the multiBranch case, where the possible functions can be narrowed down,
and in the infiniteBranch case, where any function can be called, are more complicated.
For multiBranch virtual calls, we can replace them with switch instructions,
because that's essentially what we've reduced them to.
For infiniteBranch virtual calls, we need to jump to a function's address to keep executing.
Therefore, we need the actual address of the next function to jump to.
Neither a block index of the next function nor its function index suffices,
because the CPU needs to jump to an actual address.
But recording the actual address on infiniteBranches doesn't work either.
For one, the addresses will change between executions due to ASLR (address space layout randomization).
But it's also a different executable compared to the stripped, control flow only one.
To work around this, I could have a function-index-to-address table,
or I could look up the function's symbol name in the executable to get its address.
This could be done using dlsym() when compiled with -rdynamic to put all symbols in the dynamic symbol table.
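A sketch of the symbol-name lookup option (assuming the recovery executable is linked with -rdynamic):

    #include <dlfcn.h>
    #include <cstdio>

    // Hedged sketch: resolve the next function's address from its symbol name,
    // which works only if the symbols are in the dynamic symbol table (-rdynamic).
    using TracedFn = void (*)();

    TracedFn lookupFunction(const char *symbolName) {
      void *self = dlopen(nullptr, RTLD_LAZY);  // handle to the main executable
      void *addr = self ? dlsym(self, symbolName) : nullptr;
      if (!addr)
        fprintf(stderr, "could not resolve %s\n", symbolName);
      return reinterpret_cast<TracedFn>(addr);
    }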
Indirect Branches / Virtual Calls
From what I can tell, indirect branches can be generated from 4 LLVM IR instructions:
call
invoke
indirectbr
callbr
call, invoke, and callbr all jump to a function.
In most cases, the function is known and there's no indirect branch,
but sometimes the function is an unknown pointer and will generate an indirect branch.
indirectbr jumps to an address in the current function,
and the LLVM instruction also contains the possible destinations.
This means we can reimplement this as a switch instruction.
indirectbr and callbr are both very rare, so I won't worry about them too much.
call and invoke are identical, except invoke handles exceptions, so we'll treat them the same.
There are a number of cases where virtual calls are limited to a small, finite number of addresses.
For example, virtual method calls can only call one of the classes' methods.
This can be reduced to a multiBranch, or even a singleBranch if there are only 2 classes.
Sometimes (bool ? func1 : func2)(arg) is used,
which is the same as a branch instruction and direct calls,
but a select and indirect call instruction are generated instead.
This can be reduced to a singleBranch.
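For illustration:

    // A select + indirect call that is really just a single branch.
    static int handleA(int x) { return x + 1; }
    static int handleB(int x) { return x - 1; }

    int dispatch(bool cond, int arg) {
      // Typically compiles to a select between the two function pointers followed
      // by an indirect call, but semantically it is just:
      //   if (cond) return handleA(arg); else return handleB(arg);
      return (cond ? handleA : handleB)(arg);
    }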
In other cases, especially in C code, there's a function pointer that is called,
but it is only assigned a few values.
Again, this can be reduced to a multiBranch,
but the original assignment to the function pointer is usually much farther removed from the call.
This will be harder to optimize, but it should be possible to do.
Indirect Calls: Decision
I figured out the best way to integrate BranchCoverage and BranchExecution.
onFunction(u64 functionIndex) is called at the start of every function.
Normally, this does nothing, except right after an indirect call.
Before an indirect call, onInfiniteBranch() is called,
which sets a flag that we're in un-traced code currently.
If this flag is true in onFunction, the function index is recorded,
and the flag is reset.
This works even if the indirect call is to a dynamically linked function.
Each function is given an index (like in BlockCoverage),
and this index will stay consistent between BranchCoverage and BranchExecution.
In BranchExecution, we'll have a function table of every function
indexed by these function indices, so we can quickly lookup any function.
The number of functions is also recorded at compile time,
so an indirect call is in effect a multi branch (the index / numFunctions).
Therefore, onFunction() directly calls onMultiBranch().
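Sketch of the runtime side of this decision (names, the thread_local flag, and the numFunctions placeholder are assumptions):

    #include <cstdint>

    using u32 = uint32_t;
    using u64 = uint64_t;

    extern "C" void __BranchCoverage_onMultiBranch(u32 branchNum, u32 numBranches) noexcept;

    // Hedged sketch of the flag mechanism; real names and state handling may differ.
    static thread_local bool afterIndirectCall = false;
    static const u32 numFunctions = 1024;  // placeholder; the real value is recorded at compile time

    extern "C" void __BranchCoverage_onInfiniteBranch() noexcept {
      afterIndirectCall = true;  // about to enter possibly un-traced code
    }

    extern "C" void __BranchCoverage_onFunction(u64 functionIndex) noexcept {
      if (!afterIndirectCall)
        return;                  // normal (direct) call: record nothing
      afterIndirectCall = false;
      // The indirect call is in effect a multi branch over all functions.
      __BranchCoverage_onMultiBranch(static_cast<u32>(functionIndex), numFunctions);
    }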
Store all branch types as bits
I re-added the numBranches and numCases arguments
to onMultiBranch() and onSwitchCase().
Since during BranchExecution we know numBranches,
we can do even better than LEB128 varint encoding:
we can directly encode just the significant bits into the BitWriteBuffer,
the same buffer that single branches are written to.
This means we can store all the branch information in a single branches.bin file.
Its order is arbitrary; only the program following these branches can decode it,
and this is exactly what is done during BranchExecution.
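Sketch of the significant-bits encoding (my own BitWriteBuffer stand-in, not the real class):

    #include <cstdint>
    #include <vector>

    // Hedged sketch: write only the bits needed for a multi branch, which works
    // because BranchExecution knows numBranches when decoding.
    struct BitWriteBuffer {
      std::vector<uint8_t> bytes;
      uint32_t bitPos = 0;

      void writeBit(bool bit) {
        if (bitPos % 8 == 0)
          bytes.push_back(0);
        if (bit)
          bytes.back() |= uint8_t(1u << (bitPos % 8));
        ++bitPos;
      }

      // Write branchNum using only ceil(log2(numBranches)) bits.
      void writeBranch(uint32_t branchNum, uint32_t numBranches) {
        uint32_t bits = 0;
        while ((1u << bits) < numBranches)
          ++bits;
        for (uint32_t i = 0; i < bits; ++i)
          writeBit((branchNum >> i) & 1u);
      }
    };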
TODO
Next
Compress single.bin.
Compression
As noted in the LEB128 Varint Encoding Compression section above, running normal compression tools
on the raw output drastically compresses it
(blocks.bin: 400x smaller, nonSingle.bin: 540x smaller, single.bin: 4800x smaller).
For onSingleBranch, varint encoding won't work since they're just single bits already,
but obviously a 5000x compression is insane, so there's a huge amount of redundancy.
I'll have to investigate the file itself to find patterns,
but it might be that there are long strings of 0s and then long strings of 1s (maybe from loops),
in which case I can record the value (0 or 1) and the number of repetitions
each time they switch, which can potentially drastically compress.
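A sketch of that run-length idea (illustration only):

    #include <cstdint>
    #include <vector>

    // Hedged sketch: each time the single-branch value flips, emit the previous
    // run's length as an unsigned LEB128 varint. The value itself is implicit,
    // since runs alternate between 0 and 1 (starting from 0).
    struct SingleBranchRLE {
      bool current = false;
      uint64_t runLength = 0;
      std::vector<uint8_t> out;

      void onSingleBranch(bool taken) {
        if (taken == current) {
          ++runLength;
          return;
        }
        emitRun();          // a leading zero-length run is emitted if the first branch is true
        current = taken;
        runLength = 1;
      }

      void emitRun() {
        uint64_t v = runLength;
        do {
          uint8_t byte = v & 0x7f;
          v >>= 7;
          if (v)
            byte |= 0x80;   // continuation bit
          out.push_back(byte);
        } while (v);
      }
    };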
On top of this, it's also possible to run a generic compression algorithm
like xz, bzip2, or gzip, although I'll have to be careful to make sure
it compresses fast enough to not hold up the program.
Single-threaded (which it will be if embedded) xz can't compress a 13 GB blocks.bin
in any reasonable amount of time (definitely longer than how long the program took),
so certainly for blocks.bin xz is too slow and maybe the faster gzip is better.
There will always be more blocks than branches, since each branch is in its own block,
so blocks.bin will need the fastest compressor.
Something like single.bin, though, could use a slower compressor like xz since it's smaller.
Parser
I need to write a parser for all the coverage output generated,
especially once I apply the compression above.
The question, though, is how it should be designed.
Should it be as a C++ library, maybe with CPython bindings?
Or should it just output to plain text (although this could be huge)?
I should also integrate the blocks source maps for blocks.bin.
I could also make the blocks source map binary encoded,
perhaps using varint encoding as well,
and then embed it in the executable itself as a raw text symbol during the LLVM pass.
Or maybe the source map should be placed in coverage.out.dir/<program name>.
Decision:
I should keep creating a separate text blocks source map.
But I should also create a binary blocks source map at compile time
that I embed in the executable itself.
At runtime, this embedded source map is written to a file in coverage.out.dir.
This way the source map data needed for reading the coverage data
is always right there with the coverage data.
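A possible shape for the runtime side (the embedded symbol names and file naming are purely hypothetical):

    #include <cstdio>
    #include <string>

    // Hypothetical symbols the pass would embed in the executable
    // (start and size of the binary blocks source map blob).
    extern "C" const unsigned char __BlockCoverage_sourceMap[];
    extern "C" const unsigned long __BlockCoverage_sourceMapSize;

    // Hedged sketch: at startup, copy the embedded source map next to the coverage data.
    static void writeEmbeddedSourceMap(const std::string &outDir, const std::string &programName) {
      std::string path = outDir + "/" + programName + ".sourcemap";
      if (FILE *f = fopen(path.c_str(), "wb")) {
        fwrite(__BlockCoverage_sourceMap, 1, __BlockCoverage_sourceMapSize, f);
        fclose(f);
      }
    }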
Others
Make Mask constexpr; it will be much faster in critical sections.
Meta Compression
Junfeng gave me a great idea about how to drastically reduce the amount of output generated.
First, the branch coverage should be enough to fully recover the block coverage,
and from the block coverage the function coverage can be trivially recovered.
The branch coverage tells us all the decisions the code makes and what path it takes,
so from that, I can easily tell the order of blocks the program goes through.
The branch coverage output for single branches is much smaller (even uncompressed)
than the block coverage output, so this should be a big win.
The branch coverage output for multi branches (switch case) can be bigger,
but I'm still working on compressing it further.
The only place it can go wrong is on "infinite" branches,
i.e. where the program jumps to another variable location and starts executing,
like calling a function pointer or a virtual method call.
It may be possible to reduce some of these to multi branches,
such as virtual method calls (there are a limited number of classes whose virtual methods may be called),
or function pointers that are only assigned a limited number of constant pointers.
However, we could also bring back the block coverage just for these cases.
In __BranchCoverage_onInfiniteBranch(), we can set a flag indicating an infinite branch has just been made.
Then in __BlockCoverage_onBlock(), we can check this flag, and if true,
record the current block (or delta) instead of the function address.
Second, beyond storing single branches as repetition counts (though I'm not sure long runs are always common),
there's another way to drastically reduce the branch coverage output.
Essentially, we can use profile guided optimization for this.
On the first run, or the first few, we record the full output.
From the output, we determine the most common branch in all cases (single, multi, infinite),
and then when running the program again, don't record any output for the most common branch,
and when parsing the output, interpret the absence of output
as indication that the most common branch was taken.
However, I'm not totally sure if this will work, at least for single branches.
A single branch can be stored in a single bit, which is already really small,
and since we don't know where (which block) each branch came from, we can't tell the difference
between a missing branch and just the next branch.
If repetition counts are worth it, then single branches will be compressed even more,
so I don't think this should be that much of an issue.
This could work, however, for non-single branches, where there are a lot more possibilities in general.
I think we still need to save some output for the most common branch
since we still don't know where the branch is.
But these are variable (LEB128 varint) outputs, so we can select the smallest for the most common output.
For example, we could store 0 for the most common branch,
make the delta unsigned (store sign bit as LSB), and then add 1 to each delta.
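A small worked example of that encoding (my own illustration):

    #include <cstdint>

    // Hedged sketch: reserve 0 for the most common branch; otherwise zigzag-encode
    // the signed delta (sign bit as LSB) and add 1 so it never collides with 0.
    uint64_t encodeBranchDelta(bool isMostCommon, int64_t delta) {
      if (isMostCommon)
        return 0;  // smallest possible varint output
      uint64_t zigzag = (uint64_t(delta) << 1) ^ uint64_t(delta >> 63);  // sign bit as LSB
      return zigzag + 1;
    }

    // Examples: most common branch -> 0; delta 0 -> 1; delta -1 -> 2; delta +1 -> 3.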