Bazel Core Concept

Posted by 皮皮潘 on 07-25,2022

What is Bazel

Bazel is an artifact-based build system rather than a task-based build system (such as Make, Maven, or Gradle).

As a build system, Bazel has three main purposes:

  • Manage dependencies
  • Transform the source code into executable binaries
  • Allow machines to create builds automatically

Build System Evolution

As codebases have grown in scale, build systems have evolved through the following stages:

  1. Compiler
  2. Shell Script
  3. Task-Based Build System
  4. Artifact-Based Build System

The difference between invoking a compiler directly and using shell scripts is easy to understand, and the advantages and disadvantages of each are obvious, so I will not explain those two stages further. Let's focus on task-based and artifact-based build systems.

A task-based build system, such as Ant, lets engineers define any script as a task. The system can only execute those tasks; it lacks an overall picture of what the build is doing.

While engineers have full control over the build procedure, task-based build systems have the following drawbacks:

  1. Difficulty parallelizing build steps: multiple tasks may share and overwrite the same resources even though they do not depend on each other
  2. Difficulty performing incremental builds: tasks can do anything (a task may write a timestamp or download a file, for example), so there is no general way to check whether they have already been done or need to be done again
  3. Difficulty maintaining and debugging scripts

In an artifact-based build system, engineers still tell the system what to build, but the build system determines how to build it (declarative build files). The difference between task-based and artifact-based build systems is similar to that between imperative and functional programming languages.

The artifact-based approach leaves the build system in charge of its own execution strategy (the how), so it knows more about the build process than a task-based system does and can make stronger guarantees about parallelism, incremental builds, and correctness.

Bazel as Artifact-Based Build System

An action is the lowest-level composable unit in Bazel: an action can do whatever it wants so long as it uses only its declared inputs and outputs, and Bazel takes care of scheduling actions and caching their results as appropriate. Every action is isolated from every other action by a filesystem sandbox, which prevents the problem seen in task-based build systems where different tasks write to the same file. Effectively, each action can see only a restricted view of the filesystem that includes the inputs it has declared and any outputs it has produced.
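
To make the idea concrete, here is a minimal Python sketch of an action whose result is cached by the digest of its declared inputs. It is only a toy model, not Bazel's implementation: the Action class, cache_key, execute and the file names are all hypothetical, and the "sandbox" is simply a dictionary that contains nothing but the declared inputs.

```python
import hashlib


class Action:
    """Hypothetical action: a command with declared inputs and outputs."""

    def __init__(self, name, inputs, outputs, run):
        self.name = name
        self.inputs = list(inputs)    # declared input file names
        self.outputs = list(outputs)  # declared output file names
        self.run = run                # callable: {input name: bytes} -> {output name: bytes}


def cache_key(action, files):
    """Digest of the action name plus the content of every declared input."""
    h = hashlib.sha256(action.name.encode())
    for name in sorted(action.inputs):
        h.update(name.encode())
        h.update(hashlib.sha256(files[name]).digest())
    return h.hexdigest()


def execute(action, files, cache, log):
    key = cache_key(action, files)
    if key in cache:
        log.append("cached: " + action.name)
    else:
        # "Sandbox": the action sees only the inputs it declared.
        view = {name: files[name] for name in action.inputs}
        outputs = action.run(view)
        if set(outputs) != set(action.outputs):
            raise ValueError("outputs do not match the declaration: " + action.name)
        cache[key] = outputs
        log.append("ran: " + action.name)
    files.update(cache[key])


def compile_hello(view):
    # Stand-in for a real compiler invocation.
    return {"hello.o": b"obj:" + view["hello.c"]}


files = {"hello.c": b"int main(){}", "notes.txt": b"unrelated"}
cache, log = {}, []
compile_action = Action("compile hello", ["hello.c"], ["hello.o"], compile_hello)

execute(compile_action, files, cache, log)   # first build: must run
files["notes.txt"] = b"edited"               # an undeclared file changes
execute(compile_action, files, cache, log)   # key unchanged: cache hit
files["hello.c"] = b"int main(){return 0;}"  # a declared input changes
execute(compile_action, files, cache, log)   # key changes: must run again
print(log)
```

Because the cache key covers only the declared inputs, editing notes.txt cannot invalidate the result, while editing hello.c always does. That is the property a task-based system cannot offer in general.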

Bazel ensures the integrity of third-party dependencies by checking each dependency's digest. It also decides when to redownload a dependency by comparing the cached dependency's digest with the desired digest in the manifest.
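
The digest check can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not Bazel's code: fetch, the in-memory cache, the remote dictionary and the libfoo dependency are all made up for the example, and SHA-256 is assumed as the digest function.

```python
import hashlib


def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()


def fetch(name, expected_sha256, cache, download):
    """Return (data, downloaded) for a dependency pinned to expected_sha256."""
    cached = cache.get(name)
    if cached is not None and sha256_hex(cached) == expected_sha256:
        return cached, False              # cached copy matches the manifest
    data = download(name)                 # missing or stale: fetch again
    if sha256_hex(data) != expected_sha256:
        raise ValueError("digest mismatch for " + name)
    cache[name] = data
    return data, True


# Hypothetical "remote server" and manifest used only for this demo.
remote = {"libfoo": b"libfoo 1.0"}
manifest = {"libfoo": sha256_hex(b"libfoo 1.0")}
cache = {}

_, first = fetch("libfoo", manifest["libfoo"], cache, remote.__getitem__)
_, second = fetch("libfoo", manifest["libfoo"], cache, remote.__getitem__)

# The manifest is bumped to a new release, so the cached digest no longer matches.
remote["libfoo"] = b"libfoo 2.0"
manifest["libfoo"] = sha256_hex(b"libfoo 2.0")
_, third = fetch("libfoo", manifest["libfoo"], cache, remote.__getitem__)

# A tampered download is rejected instead of being cached.
remote["libfoo"] = b"libfoo 2.0 with malware"
cache.clear()
try:
    fetch("libfoo", manifest["libfoo"], cache, remote.__getitem__)
    rejected = False
except ValueError:
    rejected = True
```

The same comparison serves both purposes described above: a mismatch against the cache triggers a redownload, and a mismatch against the fresh download fails the fetch.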

The benefits of smaller build targets really begin to show at scale because they lead to faster distributed builds and a less frequent need to rebuild targets. The advantages become even more compelling after testing enters the picture, as finer-grained targets mean that the build system can be much smarter about running only a limited subset of tests that could be affected by any given change.

Dependency Management

Bazel does not automatically download transitive dependencies. It requires a global file (WORKSPACE) that lists every one of the repository's external dependencies, including transitive ones. That is, Bazel will not load external projects' WORKSPACE files. So it is better for a project to extract its dependencies from its WORKSPACE and wrap them in a macro named xxx_deps, so that other projects that import it can quickly bring in its transitive dependencies by calling the xxx_deps macro.

Core Procedure
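
The xxx_deps pattern can be sketched as follows. The snippet is plain Python so that it runs on its own: http_archive and maybe are small stand-ins for the Starlark helpers of the same names, and lib_a, dep_x, dep_y and the example.com URLs are hypothetical. A real macro would live in a .bzl file and should also pin a sha256 for each archive.

```python
# Stand-ins so the sketch runs as plain Python. In a real workspace these would be
# loaded from @bazel_tools//tools/build_defs/repo:http.bzl and :utils.bzl.
DECLARED = {}


def http_archive(name, **attrs):
    DECLARED[name] = attrs


def maybe(repo_rule, name, **attrs):
    # Declare the repository only if the workspace has not already done so.
    if name not in DECLARED:
        repo_rule(name=name, **attrs)


# What a library (hypothetically "lib_a") could ship in its own deps.bzl.
def lib_a_deps():
    maybe(http_archive, name="dep_x", urls=["https://example.com/dep_x-1.0.tar.gz"])
    maybe(http_archive, name="dep_y", urls=["https://example.com/dep_y-2.0.tar.gz"])


# What the importing project's WORKSPACE would then contain.
http_archive(name="dep_x", urls=["https://example.com/dep_x-1.1.tar.gz"])  # own pin
http_archive(name="lib_a", urls=["https://example.com/lib_a-3.0.tar.gz"])
lib_a_deps()  # pulls in lib_a's remaining dependencies in one call

print(sorted(DECLARED))
```

Wrapping each declaration in maybe lets the importing project keep its own pin for a shared dependency (dep_x here) while still getting everything else from a single call.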

Bazel's build procedure can be summarized as three phases:

  • Loading phase. First, load and evaluate all extensions and all BUILD files that are needed for the build. Executing the BUILD files simply runs the code of macros and instantiates rules (each time a rule is called, it gets added to a graph); the real logic of a rule, its implementation function, does not run yet. -> Create a Target (Rule Instance) Graph
  • Analysis phase. The code of the rules is executed (their implementation function), and actions are instantiated. An action describes how to generate a set of outputs from a set of inputs, such as "run gcc on hello.c and get hello.o". You must list explicitly which files will be generated before executing the actual commands. In other words, the analysis phase takes the graph generated by the loading phase and generates an action graph. -> Create an Action Graph
  • Execution phase. Actions are executed when at least one of their outputs is required. If a file is missing or if a command fails to generate one output, the build fails. Tests are also run during this phase. -> Execute the Actions
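
The three phases can be imitated with a small Python model. Everything in it is hypothetical and greatly simplified: cc_object and cc_link stand in for rules, the dictionaries stand in for the target graph and the action graph, and "executing" an action only builds a descriptive string.

```python
def load_build_file():
    """Loading phase: calling rules only records targets in a graph."""
    targets = {}

    def cc_object(name, src):
        targets[name] = {"kind": "cc_object", "src": src}

    def cc_link(name, deps):
        targets[name] = {"kind": "cc_link", "deps": deps}

    # Contents of an imaginary BUILD file.
    cc_object(name="hello_o", src="hello.c")
    cc_object(name="unused_o", src="unused.c")
    cc_link(name="hello", deps=["hello_o"])
    return targets


def analyze(targets):
    """Analysis phase: each target declares actions with explicit inputs and outputs."""
    actions = {}  # output file -> action that produces it
    for name, target in targets.items():
        if target["kind"] == "cc_object":
            out = target["src"].replace(".c", ".o")
            actions[out] = {"mnemonic": "compile", "inputs": [target["src"]]}
        else:
            objs = [targets[d]["src"].replace(".c", ".o") for d in target["deps"]]
            actions[name] = {"mnemonic": "link", "inputs": objs}
    return actions


def execute(actions, wanted, sources, log):
    """Execution phase: run only the actions whose outputs are required."""
    built = {}

    def build(path):
        if path in sources:
            return sources[path]
        if path in built:
            return built[path]
        if path not in actions:
            raise FileNotFoundError(path)  # a missing file fails the build
        action = actions[path]
        contents = [build(i) for i in action["inputs"]]
        log.append(action["mnemonic"] + " -> " + path)
        built[path] = action["mnemonic"] + "(" + ", ".join(contents) + ")"
        return built[path]

    build(wanted)
    return built


targets = load_build_file()
actions = analyze(targets)
log = []
built = execute(actions, "hello", {"hello.c": "hello.c", "unused.c": "unused.c"}, log)
print(log)
```

Note that unused.o has an action in the action graph but is never built, because no requested output depends on it.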

Corresponding to these phases, there are three query types:

  • query runs on the post-loading phase Target Graph
  • aquery (action graph query) operates on the post-analysis Configured Target Graph and exposes information about Actions, Artifacts and their relationships
  • cquery also runs on the post-analysis Configured Target Graph. Compared to query, cquery properly handles configurations such as select statements, but it does not support all of the functions of the original query

Execution Phase

In the common case, Bazel performs the following operations against Buildbarn when executing a build action:

  1. ActionCache.GetActionResult() is called to check whether a build action has already been executed previously. This call extracts an ActionResult message from the Action Cache (AC). If such a message is found, Bazel continues with step 5.
  2. Bazel constructs a Merkle tree of Action, Command and Directory messages and associated input files. It then calls ContentAddressableStorage.FindMissingBlobs() to determine which parts of the Merkle tree are not present in the Content Addressable Storage (CAS).
  3. Any missing nodes of the Merkle tree are uploaded into the CAS using ByteStream.Write().
  4. Execution of the build action is triggered through Execution.Execute(). Upon successful completion, this function returns an ActionResult message.
  5. Bazel downloads all of the output files referenced by the ActionResult message from the CAS to local disk using ByteStream.Read().
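
The five steps above can be traced with an in-memory Python model. It is a sketch under heavy simplification, not the real gRPC API: FakeRemote and its methods only borrow the names of the calls, the Merkle tree is flattened into a single action blob that lists input digests, and the "worker" fabricates its output.

```python
import hashlib


def digest(blob):
    return hashlib.sha256(blob).hexdigest()


class FakeRemote:
    """In-memory stand-in for the AC, the CAS and the execution service."""

    def __init__(self):
        self.ac = {}    # action digest -> ActionResult-like dict
        self.cas = {}   # blob digest -> blob
        self.executions = 0

    def get_action_result(self, action_digest):
        return self.ac.get(action_digest)

    def find_missing_blobs(self, digests):
        return [d for d in digests if d not in self.cas]

    def write(self, blob):
        self.cas[digest(blob)] = blob

    def read(self, blob_digest):
        return self.cas[blob_digest]

    def execute(self, action_digest):
        self.executions += 1
        action = self.cas[action_digest]          # the worker reads the action from the CAS
        output = b"output of " + action           # pretend to run the command
        self.write(output)
        result = {"output_digests": [digest(output)]}
        self.ac[action_digest] = result           # remember the result for next time
        return result


def build_action(command, inputs):
    # Flattened stand-in for the Merkle tree: the command plus the input digests.
    lines = [command] + [path + ":" + digest(data) for path, data in sorted(inputs.items())]
    action_blob = "\n".join(lines).encode()
    return action_blob, [action_blob] + list(inputs.values())


def run_action(remote, command, inputs):
    action_blob, blobs = build_action(command, inputs)
    action_digest = digest(action_blob)
    result = remote.get_action_result(action_digest)            # step 1
    if result is None:
        by_digest = {digest(b): b for b in blobs}                # step 2
        for missing in remote.find_missing_blobs(list(by_digest)):
            remote.write(by_digest[missing])                     # step 3
        result = remote.execute(action_digest)                   # step 4
    return [remote.read(d) for d in result["output_digests"]]   # step 5


remote = FakeRemote()
first = run_action(remote, "gcc -c hello.c", {"hello.c": b"int main(){}"})
second = run_action(remote, "gcc -c hello.c", {"hello.c": b"int main(){}"})
runs_after_two = remote.executions
third = run_action(remote, "gcc -c hello.c", {"hello.c": b"int main(){return 0;}"})
```

The second call is served from the action cache and skips steps 2 to 4. Changing the input changes the action digest, so the third call goes through the full sequence again.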

By enabling the Builds without the Bytes feature using the --remote_download_minimal command line flag, Bazel will no longer attempt to download output files to local disk. This feature causes a significant drop in build times and network bandwidth consumed. This is especially noticeable for workloads that yield large output files. Buildbarn should attempt to support those workloads.

Output Directory Layout

<workspace-name>/                         <== The workspace directory
  bazel-my-project => <...my-project>     <== Symlink to execRoot
  bazel-out => <...bazel-out>             <== Convenience symlink to outputPath
  bazel-bin => <...bin>                   <== Convenience symlink to most recent written bin dir $(BINDIR)
  bazel-testlogs => <...testlogs>         <== Convenience symlink to the test logs directory

/home/user/.cache/bazel/                  <== Root for all Bazel output on a machine: outputRoot
  _bazel_$USER/                           <== Top level directory for a given user depends on the user name:
                                              outputUserRoot
    install/
      fba9a2c87ee9589d72889caf082f1029/   <== Hash of the Bazel install manifest: installBase
        _embedded_binaries/               <== Contains binaries and scripts unpacked from the data section of
                                              the bazel executable on first run (such as helper scripts and the
                                              main Java file BazelServer_deploy.jar)
    7ffd56a6e4cb724ea575aba15733d113/     <== Hash of the client's workspace directory (such as
                                              /home/some-user/src/my-project): outputBase
      action_cache/                       <== Action cache directory hierarchy
                                              This contains the persistent record of the file
                                              metadata (timestamps, and perhaps eventually also MD5
                                              sums) used by the FilesystemValueChecker.
      action_outs/                        <== Action output directory. This contains a file with the
                                              stdout/stderr for every action from the most recent
                                              bazel run that produced output.
      command.log                         <== A copy of the stdout/stderr output from the most
                                              recent bazel command.
      external/                           <== The directory that remote repositories are
                                              downloaded/symlinked into.
      server/                             <== The Bazel server puts all server-related files (such
                                              as socket file, logs, etc) here.
        jvm.out                           <== The debugging output for the server.
      execroot/                           <== The working directory for all actions. For special
                                              cases such as sandboxing and remote execution, the
                                              actions run in a directory that mimics execroot.
                                              Implementation details, such as where the directories
                                              are created, are intentionally hidden from the action.
                                              Every action can access its inputs and outputs relative
                                              to the execroot directory.
        <workspace-name>/                 <== Working tree for the Bazel build & root of symlink forest: execRoot
          _bin/                           <== Helper tools are linked from or copied to here.

          bazel-out/                      <== All actual output of the build is under here: outputPath
            local_linux-fastbuild/        <== one subdirectory per unique target BuildConfiguration instance;
                                              this is currently encoded
              bin/                        <== Bazel outputs binaries for target configuration here: $(BINDIR)
                foo/bar/_objs/baz/        <== Object files for a cc_* rule named //foo/bar:baz
                  foo/bar/baz1.o          <== Object files from source //foo/bar:baz1.cc
                  other_package/other.o   <== Object files from source //other_package:other.cc
                foo/bar/baz               <== foo/bar/baz might be the artifact generated by a cc_binary named
                                              //foo/bar:baz
                foo/bar/baz.runfiles/     <== The runfiles symlink farm for the //foo/bar:baz executable.
                  MANIFEST
                  <workspace-name>/
                    ...
              genfiles/                   <== Bazel puts generated source for the target configuration here:
                                              $(GENDIR)
                foo/bar.h                     such as foo/bar.h might be a header file generated by //foo:bargen
              testlogs/                   <== Bazel internal test runner puts test log files here
                foo/bartest.log               such as foo/bartest.log might be an output of the //foo:bartest test with
                foo/bartest.status            foo/bartest.status containing exit status of the test (such as
                                              PASSED or FAILED (Exit 1), etc)
              include/                    <== a tree with include symlinks, generated as needed. The
                                              bazel-include symlinks point to here. This is used for
                                              linkstamp stuff, etc.
            host/                         <== BuildConfiguration for build host (user's workstation), for
                                              building prerequisite tools, that will be used in later stages
                                              of the build (ex: Protocol Compiler)
        <packages>/                       <== Packages referenced in the build appear as if under a regular workspace

Bazel Clean

We can use the bazel clean command to clean the workspace. In detail, bazel clean does an rm -rf on the outputPath and the action_cache directory. It also removes the workspace symlinks in the project. What's more, we can use the --expunge option to clean the entire outputBase, which is under ~/.cache/bazel/_bazel_$USER/<MD5 hash of the workspace path>, including external repositories, toolchains, and so on.
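
As a rough illustration of where the outputBase lives, the following Python sketch rebuilds the path from the layout shown above. It assumes the default Linux output root and that the directory name is the MD5 hex digest of the workspace path string; default_output_base is a hypothetical helper, and on a real machine running bazel info output_base is the reliable way to get the actual value.

```python
import hashlib
import os


def default_output_base(workspace_dir, user, output_root="~/.cache/bazel"):
    """Guess the outputBase for a workspace, following the documented layout."""
    workspace_hash = hashlib.md5(workspace_dir.encode()).hexdigest()
    return os.path.join(os.path.expanduser(output_root), "_bazel_" + user, workspace_hash)


# Hypothetical user and workspace paths, used only for the demo.
base_a = default_output_base("/home/alice/src/my-project", "alice", "/tmp/cache")
base_b = default_output_base("/home/alice/src/other-project", "alice", "/tmp/cache")
print(base_a)
print(base_b)
```

Each workspace path hashes to its own directory, which is why bazel clean --expunge in one project does not touch the outputs of another.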