175 lines
7.5 KiB
ReStructuredText
175 lines
7.5 KiB
ReStructuredText
.. _Readers:
|
|
|
|
Developing lld Readers
|
|
======================
|
|
|
|
Note: this document discuss Mach-O port of LLD. For ELF and COFF,
|
|
see :doc:`index`.
|
|
|
|
Introduction
|
|
------------
|
|
|
|
The purpose of a "Reader" is to take an object file in a particular format
|
|
and create an `lld::File`:cpp:class: (which is a graph of Atoms)
|
|
representing the object file. A Reader inherits from
|
|
`lld::Reader`:cpp:class: which lives in
|
|
:file:`include/lld/Core/Reader.h` and
|
|
:file:`lib/Core/Reader.cpp`.
|
|
|
|
The Reader infrastructure for an object format ``Foo`` requires the
|
|
following pieces in order to fit into lld:
|
|
|
|
:file:`include/lld/ReaderWriter/ReaderFoo.h`
|
|
|
|
.. cpp:class:: ReaderOptionsFoo : public ReaderOptions
|
|
|
|
This Options class is the only way to configure how the Reader will
|
|
parse any file into an `lld::Reader`:cpp:class: object. This class
|
|
should be declared in the `lld`:cpp:class: namespace.
|
|
|
|
.. cpp:function:: Reader *createReaderFoo(ReaderOptionsFoo &reader)
|
|
|
|
This factory function configures and create the Reader. This function
|
|
should be declared in the `lld`:cpp:class: namespace.
|
|
|
|
:file:`lib/ReaderWriter/Foo/ReaderFoo.cpp`
|
|
|
|
.. cpp:class:: ReaderFoo : public Reader
|
|
|
|
This is the concrete Reader class which can be called to parse
|
|
object files. It should be declared in an anonymous namespace or
|
|
if there is shared code with the `lld::WriterFoo`:cpp:class: you
|
|
can make a nested namespace (e.g. `lld::foo`:cpp:class:).
|
|
|
|
You may have noticed that :cpp:class:`ReaderFoo` is not declared in the
|
|
``.h`` file. An important design aspect of lld is that all Readers are
|
|
created *only* through an object-format-specific
|
|
:cpp:func:`createReaderFoo` factory function. The creation of the Reader is
|
|
parametrized through a :cpp:class:`ReaderOptionsFoo` class. This options
|
|
class is the one-and-only way to control how the Reader operates when
|
|
parsing an input file into an Atom graph. For instance, you may want the
|
|
Reader to only accept certain architectures. The options class can be
|
|
instantiated from command line options or be programmatically configured.
|
|
|
|
Where to start
|
|
--------------
|
|
|
|
The lld project already has a skeleton of source code for Readers for
|
|
``ELF``, ``PECOFF``, ``MachO``, and lld's native ``YAML`` graph format.
|
|
If your file format is a variant of one of those, you should modify the
|
|
existing Reader to support your variant. This is done by customizing the Options
|
|
class for the Reader and making appropriate changes to the ``.cpp`` file to
|
|
interpret those options and act accordingly.
|
|
|
|
If your object file format is not a variant of any existing Reader, you'll need
|
|
to create a new Reader subclass with the organization described above.
|
|
|
|
Readers are factories
|
|
---------------------
|
|
|
|
The linker will usually only instantiate your Reader once. That one Reader will
|
|
have its loadFile() method called many times with different input files.
|
|
To support multithreaded linking, the Reader may be parsing multiple input
|
|
files in parallel. Therefore, there should be no parsing state in you Reader
|
|
object. Any parsing state should be in ivars of your File subclass or in
|
|
some temporary object.
|
|
|
|
The key function to implement in a reader is::
|
|
|
|
virtual error_code loadFile(LinkerInput &input,
|
|
std::vector<std::unique_ptr<File>> &result);
|
|
|
|
It takes a memory buffer (which contains the contents of the object file
|
|
being read) and returns an instantiated lld::File object which is
|
|
a collection of Atoms. The result is a vector of File pointers (instead of
|
|
simple a File pointer) because some file formats allow multiple object
|
|
"files" to be encoded in one file system file.
|
|
|
|
|
|
Memory Ownership
|
|
----------------
|
|
|
|
Atoms are always owned by their File object. During core linking when Atoms
|
|
are coalesced or stripped away, core linking does not delete them.
|
|
Core linking just removes those unused Atoms from its internal list.
|
|
The destructor of a File object is responsible for deleting all Atoms it
|
|
owns, and if ownership of the MemoryBuffer was passed to it, the File
|
|
destructor needs to delete that too.
|
|
|
|
Making Atoms
|
|
------------
|
|
|
|
The internal model of lld is purely Atom based. But most object files do not
|
|
have an explicit concept of Atoms, instead most have "sections". The way
|
|
to think of this is that a section is just a list of Atoms with common
|
|
attributes.
|
|
|
|
The first step in parsing section-based object files is to cleave each
|
|
section into a list of Atoms. The technique may vary by section type. For
|
|
code sections (e.g. .text), there are usually symbols at the start of each
|
|
function. Those symbol addresses are the points at which the section is
|
|
cleaved into discrete Atoms. Some file formats (like ELF) also include the
|
|
length of each symbol in the symbol table. Otherwise, the length of each
|
|
Atom is calculated to run to the start of the next symbol or the end of the
|
|
section.
|
|
|
|
Other sections types can be implicitly cleaved. For instance c-string literals
|
|
or unwind info (e.g. .eh_frame) can be cleaved by having the Reader look at
|
|
the content of the section. It is important to cleave sections into Atoms
|
|
to remove false dependencies. For instance the .eh_frame section often
|
|
has no symbols, but contains "pointers" to the functions for which it
|
|
has unwind info. If the .eh_frame section was not cleaved (but left as one
|
|
big Atom), there would always be a reference (from the eh_frame Atom) to
|
|
each function. So the linker would be unable to coalesce or dead stripped
|
|
away the function atoms.
|
|
|
|
The lld Atom model also requires that a reference to an undefined symbol be
|
|
modeled as a Reference to an UndefinedAtom. So the Reader also needs to
|
|
create an UndefinedAtom for each undefined symbol in the object file.
|
|
|
|
Once all Atoms have been created, the second step is to create References
|
|
(recall that Atoms are "nodes" and References are "edges"). Most References
|
|
are created by looking at the "relocation records" in the object file. If
|
|
a function contains a call to "malloc", there is usually a relocation record
|
|
specifying the address in the section and the symbol table index. Your
|
|
Reader will need to convert the address to an Atom and offset and the symbol
|
|
table index into a target Atom. If "malloc" is not defined in the object file,
|
|
the target Atom of the Reference will be an UndefinedAtom.
|
|
|
|
|
|
Performance
|
|
-----------
|
|
Once you have the above working to parse an object file into Atoms and
|
|
References, you'll want to look at performance. Some techniques that can
|
|
help performance are:
|
|
|
|
* Use llvm::BumpPtrAllocator or pre-allocate one big vector<Reference> and then
|
|
just have each atom point to its subrange of References in that vector.
|
|
This can be faster that allocating each Reference as separate object.
|
|
* Pre-scan the symbol table and determine how many atoms are in each section
|
|
then allocate space for all the Atom objects at once.
|
|
* Don't copy symbol names or section content to each Atom, instead use
|
|
StringRef and ArrayRef in each Atom to point to its name and content in the
|
|
MemoryBuffer.
|
|
|
|
|
|
Testing
|
|
-------
|
|
|
|
We are still working on infrastructure to test Readers. The issue is that
|
|
you don't want to check in binary files to the test suite. And the tools
|
|
for creating your object file from assembly source may not be available on
|
|
every OS.
|
|
|
|
We are investigating a way to use YAML to describe the section, symbols,
|
|
and content of a file. Then have some code which will write out an object
|
|
file from that YAML description.
|
|
|
|
Once that is in place, you can write test cases that contain section/symbols
|
|
YAML and is run through the linker to produce Atom/References based YAML which
|
|
is then run through FileCheck to verify the Atoms and References are as
|
|
expected.
|
|
|
|
|
|
|