[LLVMdev] LLVM Inliner
Xinliang David Li
xinliangli at gmail.com
Tue Nov 30 16:19:00 CST 2010
On Mon, Nov 29, 2010 at 10:56 AM, Chris Lattner <clattner at apple.com> wrote:
> On Nov 28, 2010, at 11:39 PM, Xinliang David Li wrote:
> 1. Run the inliner on bar (noop, since it has no call sites)
>> 2. Run the per-function passes on bar. This generally shrinks it, and
>> prevents "abstraction penalty" from making bar look too big to inline.
>> 3. Run the inliner on foo. Since foo calls bar, we consider inlining bar
>> into foo and do so if profitable.
>> 4. Run the per-function passes on foo. If bar got inlined, this means
>> that we're running the per-function passes over the inlined contents of bar
> On-the-fly clean up (optimization) while doing bottom up inlining is nice
> as you described. Many other compilers chose not to do this way due to
> scalability concerns (with IPO) -- this can make the IPO the biggest bottom
> neck in terms of compile time (as it is serialized). Memory many not be a
> big issue for LLVM as I can see the good locality in pass manager. (Just
> curious, what is biggest application LLVM can build with IPO?)
> I don't really know, and I agree with you that LLVM's LTO isn't very
> scalable (it currently loads all the IR into memory). I haven't thought a
> lot about this, but I'd tackle that problem in three stages:
> 1. Our LTO model runs optimizations at both compile and link time, the
> compile-time optimizations should work as they do now IMO. This is
> controversial though, because doing so could cause (e.g.) an inlining to
> happen "early" that would be seen as a bad idea with full LTO information.
> The advantage of doing compile-time optimizations is that it both shrinks
> the IR, and speeds up an incremental rebuild by avoiding having to do simple
> optimizations again.
> 2. At LTO time, the bottom-up processing of the callgraph is still goodness
> and presents good locality (unless you have very very large SCC's). The
> tweak that we'd have to implement is lazy deserialization (already
> implemented) and reserialization to disk (which is missing). With this, you
> get much better memory footprint than "hold everything in memory at once".
IR is just one memory consumer. In LTO, there are also global data
structures : global symtab, global type table, call graph, points-to graph,
mod-ref info etc. In a compiler I worked with before, serialization is done
on points-to graph and mod-ref info after the info is mapped from a global
view to a per TU local view, and IR for each TU is mapped/unmapped on
demand. The type/symbol info per TU is in different segment from the the
> 3. To support multiple cores/machines, you break the callgraph SCC DAG into
> parallel chunks that can be farmed out. There is a lot of parallelism in a
Parallelism distributed across machines can be tricky -- involving lots of
overhead such as rpc and data passing.
You may also be surprised with side effects due to the parallelism using
multi-core -- thrashing due to memory contention -- some per function level
pass may use lots of memory for temporary data structure. It won't scale for
the compiler workload -- i.e. get 2x speedup using 8 core.
> I understand that, but that implies that you have some model for code
> locality. Setting a global code growth limit is (in my opinion) a hack
> unless you are aiming for the whole program to fit in the icache (which I
> don't think anyone tries to do :).
Yes, global growth limit may be good for size control, but is a hack for
control icache footprint. However, as I mentioned, the bottom up inline
scheme make it impossible to use any heuristics involving 'global limit'
which can be more complicated and fancier than the simple growth limit. For
instance, there is no restriction that only one global limit can be used ---
the compiler can partition the call graph into multiple locality regions,
and set icache limit for each region. The inlining order can be done on a
region by region basis. For each region, the region limit is applied and the
priority queue must be used.
> With any other limit that is higher than your icache size, you are
> basically picking an *arbitrary* limit that is not based on the machine
> model or the instruction locality of the program.
> The "punish functions containing lots of calls" is based on the assumption
>> that functions which are mostly calls (again, this decision happens after
>> the callee has been inlined and simplified) aren't themselves doing much
> My point is that using static count of callsites as a indicator for this
> can be misleading. All the calls may be calls to cold external functions for
> Absolutely true. It may also be completely wrong for some functions. It's
> a heuristic :)
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the LLVMdev