Commits
This is more a workaround since a transitive dependency has introduced a
global flag "cpuprofile", leading to a panic due to registring the flag
twice.
To make ourselves immune to this issue we can refactor our usages to use
a FlagSet, even for "main". This is a bigger and frankly inconvenient
change for a somewhat rare occurance. Instead we just rename our flag.
I feel comfortable renaming since this flag should only really be used
by Zoekt developers. There will be the issue that the flag will be shown
twice for commands, but I will report to the upstream repo about this
problem.
Test Plan: go get -u ./... && go run ./cmd/zoekt-git-index works
This fixes the failing 'fuzz test' check in CI. Example from https://github.com/sourcegraph/zoekt/pull/891:
```
Download action repository 'jidicula/go-fuzz-action@0206b61afc603b665297621fa5e691b1447a5e57' (SHA:0206b61afc603b665297621fa5e691b1447a5e57)
Getting action download info
Error: This request has been automatically failed because it uses a deprecated version of `actions/upload-artifact: v3`. Learn more: https://github.blog/changelog/2024-04-16-deprecation-notice-v3-of-the-artifact-actions/
```
Currently, we only record the number of compound shards. This PR adds a metric
for the total number of index shards. This metric is helpful for alerting, as
it's important to catch a significant decrease in the number of indexed data.
Follow up to https://github.com/sourcegraph/zoekt/pull/888, where I forgot to improve the test coverage.
Usually, if there are candidate matches with overlapping ranges, then we just remove matches that overlap. However, when `opts.ChunkMatches = false`, we had special logic to merge overlapping matches.
This PR removes the overlapping logic to simplify the behavior. I couldn't see a good reason to keep this special handling. Plus, we are moving towards making `ChunkMatches` the default.
Another benefit of this change is that it makes the BM25 behavior easier to understand. If we merged together ranges, then we would be calculating term frequencies for spurious terms (like `new`, `queue`, `newqueue`, `queuenew`, etc.) Note: we currently only use BM25 with `ChunkMatches = true`, so there's not an active bug here.
Currently, BM25 scoring only applies to the overall `FileMatch` score. The
algorithm gathered term frequencies from all candidate matches in the file to
produce a file-level score. However `LineMatch` and `ChunkMatch` scores were
still calculated using the classic Zoekt scoring algorithm.
This PR implements BM25 scoring for `LineMatch` and `ChunkMatch`. It does so by
calculating a BM25 per line. Compared to the classic Zoekt algorithm, this
rewards multiple term matches on a line. Because our term frequency calculation
also boosts symbol matches, the score smoothly balances between "many term
matches" and "interesting term matches".
Now, the code is structured as follows:
* `scoreChunk`: goes through each line in the chunk, calculating its score
through `scoreLine`, and returns the best-scoring line
* `scoreLine`: calculates the score for a single line
The mental model is that "the score of a chunk is always the score of its best
line".
This PR adds a new field `ChunkMatch.BestLineMatch` with the line number of top-scoring line in the chunk. This will let us address a long-standing issue with our new flexible keyword search, where chunk matches can become very large. Since our search results UX only shows the start of a chunk, the most relevant line may not even be visible. With this information on the best line match, we can adjust the search results UX to center the chunk on the most relevant line.
Relates to [SPLF-188](https://linear.app/sourcegraph/issue/SPLF-188/ensure-the-best-scoring-line-match-is-shown-in-search-results)
Add support to the GitLab mirror for excluding archived repos.
We had 2 copies of this logic in the code and this bit me when I
introduced id-based shards
indexbuilder.go seems an "ok" place to put. We cannot put it in the
shards package because of circular dependencies.
Test plan:
refactor, so relying on CI
Co-authored-by: Sourcegraph <batch-changes@sourcegraph.com>
When digging into our Natural Language Search (NLS) eval results, I found that one of the leading causes for flexible search types like "Fuzzy symbol search" and "Find logic" was noisy matches in top results. Currently, our BM25 ranking rewards any substring match equally. So for queries like 'extract tar', any match on 'tar' (even within unrelated terms like 'start', etc.) counts towards the term frequency.
This PR helps reduce noise by boosting symbol matches the same as we do filename matches. Our NLS evals show positive improvement, and context evals are the tiniest bit better.
This is an alternative to #875.
We run the health check with system priviledges. This way we run an
actual search, just like we do if tenant enforcement is off.
I also make sure we don't log system searches as "missing_tenant".
If we have lots of work done we start to truncate in net/trace. So
display information more succinctly.
Test Plan: CI doesn't complain should be good enough
I was inspecting logs in GKE and it incorrectly categorized the severity
of nearly all logs from zoekt-webserver and zoekt-indexserver. This is a
hack to make it work better without putting in the bigger work of
migrating us to structured logging.
Test Plan: go test
We have seen issues on large repos so lets default to off until we fix
again.
Before, we'd typically have two newlines, one coming from the corpus,
and one terminating the fmt.Printf call.
They are interpreted as AND, ie.
zoekt a b c
is equivalent to
zoekt "a b c"
The logging is neat for debugging, but is distracting when using
cmd/zoekt as a grep replacement.
It is hard to tell from net/trace how long we spend evaluating type:repo
sub-expressions. This adds in extra logging for when we do that, which
should help understand time taken to do this.
Test Plan: go test
This logging is useful since we directly call this search (and avoid
typeRepoSearcher) when evaluating List calls for type:repo searches. By
adding in this logging we can understand the performance of potentially
expensive queries here.
Some tasks, such as loading shards, require priviledged access on startup. Here I introduce systemtenant which we can use for these kinds of things.
This is motivated by bug where the symbol sidebar in multi-tenant node wouldn't work because ranked shards were not loaded correctly which in turn caused SelectRepoSet to return 0 shards always.
Test plan:
added unit test
manual testing: symbol sidebar works now
If we failed to List the repositories when loading a shard we would
never search it due to selectRepoSet optimization. In practice this
feels very rare to happen (only for logic error or disk corruption).
However, in those cases we should surface these crashes searches by
attempting to search the shard.
Additionally I add logging so we can notice when this happens. I didn't
add a metric since this is the sort of thing that I think is so rare we
would never think to check the metric (but may notice logs).
Note: I used the slightly tricky invariant that repos being nil means
error. If the shard is actually empty (eg all repos tombstoned) then we
still correctly apply the optimization. In practice having an empty
shard shouldn't really happen so I'm open to just treating empty repos
list as something we have to search.
This adds the tenant ID to the trace. I also move the pprof logging from `FromContext` to the higher level `typeRepoSearcher`. The number of events was too high, because we logged missing tenants per document.
I also fixed a bug where pprof logging didn't work at all, because we read the tenant enforcemnt ENV after we set the pprof profile, so the profile was always nil.
Test plan:
Checked locally that tenants show up in the traces and "missing_tenant" shows up as pprof profile.
This updates webserver and sourcegraph-indexserver to support multi-tenancy.
The change is behind an ENV feature-flag.
Key changes:
- tenant ID is now part of the index (repo metadata)
- GRPC: IndexOption and Repository have a new field TenantId
- If multi-tenancy is enabled, webserver checks if tenant in context matches the tenant id in the shard
- zoekt-git-index has a new parameter "-shard_prefix ". If set, the value will be used instead of repository name as prefix for the name of the shard. For Sourcegraph we use "<tenant id>_<repository id>" as prefix if multi-tenancy is enabled
Assumption:
All calls to Sourcegraph are privileged
Test plan:
- New tests
- Ran this together with Sourcegraph (with and without MT enabled)
The go-git optimizations have been running well on dot-com, so this PR enables
them by default. I'll remove the flag entirely once these optimizations have
been released to more instances.
We recently had an incident where this was accidently unset and lead to
lots of work done by zoekt which just got thrown away.
Test Plan: added unit test
By default go-git maintains an LRU cache of git objects of size 96MB. When an
object's contents are loaded, it's stored as a MemoryObject in this cache. This
cache is not super useful in the indexing access pattern, which accesses each
file only once. And in many profiles, we see a substantial number of
allocations from these memory objects.
This PR disables caching for most git objects by setting LargeObjectThreshold:
1. go-git still proactively caches packfile objects under 16KB (see
smallObjectThreshold here).
Follow up to #852. This change is also gated by the
ZOEKT_ENABLE_GOGIT_OPTIMIZATION feature flag.
Document ranks was an experimental feature of Sourcegraph. We have
already removed the code in Sourcegraph in the last release. This
is the corresponding cleanup for Zoekt.
Test plan:
updated tests
By default, the `go-git` library will open the packfile on every call to `Repository.BlobObject`, then close it. During indexing, we collect the list of files to index, then iterate through each one calling `Repository.BlobObject`. So on every object access the packfile reopened, and `go-git` reallocates some in-memory buffers.
This PR bypasses `git.PlainOpen` to allow us to enable the `KeepDescriptors` option. This option keeps packfile files open, and caches wrappers for them. The files then need to be explicitly closed when done with the repo.
Benefits:
* Avoid reallocating the memory buffers on every object access (see benchmark results below)
* (Highly speculative) I suspect this could improve OS decisions around when to cache portions of the packfile. Maybe constantly reopening and seeking within the file makes it harder for the OS to determine the true access pattern, which is roughly random access. This can affect decisions like readahead and whether to consider pages 'active'.
The heap profile is triggered when total heap allocated (inuse + garbage
objects) passes a certain threshold. However, the profiles are mainly useful
for looking at `inuse_space`. This PR improves the docs to make this a tiny bit
less confusing.
zoekt-merge-index can OOM on instances with <=4GB mem. Hence we adjust
the defaults. Larger instances can set the ENVs to higher values.
Test plan:
N/A
This updates zoekt-merge-index to print the name of a new compound shard to stdout. Indexserver picks it up and logs it. This has the nice property that indexserver now has all the info. If we want to log this to a file in the future, we don't have to worry as much about competing writes to the file.
Together with a new log line in vacuum we can now follow the full lifecycle of a compound shard in the logs.
Test plan:
updated unit tests
This switches the logs for vacuum from debug to error, which matches
what we do for merging.
Test plan:
N/A
When adding this metric, I got the calculation wrong. So the maximum bucket was capped at ~1.3 days, which we easily saturate at the 90% percentile.
This PR updates it to ~ 5.5 days to be safe:
```
>>> math.pow(2, 14 - 1) / 60 / 24
5.688888888888889
```
Note the `- 1`, which is the key piece I missed before.
* add zoekt-mirror-gitea
* * Clean up setting the default
* update note about topic filtering not being implemented as topics are missing from the API
* cleanup some pointers
* cleanup some code syntax
When constructing a shard we specify templates for constructing a URL.
Finally those URLs end up going via html/template which has pretty
strict escaping rules. This commit makes two changes:
URL construction via text/template. We still get the safety benefits
later on when finally rendering the output, but given we are
constructing URLs it makes more sense to use text/template.
Special escaping of + in URLs. I couldn't convince html/template to not
break URls containing + in it. So instead we use + escaped to %2B. I
tested gerrit, github and sourcegraph with %2B in filenames and they all
worked.
To do the above I introduced a template function called URLJoinPath
which is a wrapper around url.JoinPath with the additional behaviour
around + escaping.
Test Plan: Did lots of updates in tests. See diff.
When looking at large profiles for `inuse_space` on dot com, I noticed the
filename maps in `prepareNormalBuild` taking a bunch of memory. This PR avoids
allocating a separate map per branch, instead having `RepoWalker` collect all
the entries in a single instance variable.
Turns out git config doesn't support "_" in the keys.
https://git-scm.com/docs/git-config/2.22.0#_configuration_file
"The variable names are case-insensitive, allow only alphanumeric
characters and -, and must start with an alphabetic character."
Test plan:
New unit test
We ignore priority and instead use the latest commit date as repo rank.
This has a big impact for Sourcegraph because it means we switch from
star count to repo freshness as tiebreaker.
As a minor tweak, we also separate query based scores from tiebreakers.
To achieve this we reserve the last 7 digits of a score for tiebreakers:
- 5 digits (maxUint16) for repo rank
- 2 digits ([0,10]) for file order (2 digits).
Example:
Before:
score: 8775.35 <- atom(2):200, fragment:8550.00, repo-rank: 19, doc-order:6.35
After:
score: 8750_00019_06.35 <- atom(2):200, fragment:8550.00, repo-rank: 19, doc-order:6.35
This PR refactors the "build preparation" code to remove `branchMap`, in favor of storing everything in a single map. The main motivation is just to make the code clearer.
Specific changes:
* Store branches in the same map as blob locations, remove `branchMap`
* Introduce new struct `BlobIndexInfo` to hold the old location information, plus branches
* Rename `BlobLocation` -> `BlobRepo`
This is motivated by #832
We use archive index in our e2e tests. In order to test our latest improvements to ranking, archive index needs to set the latest commit date.
Test plan:
- new unit test
- I checked that the tar files downloaded from github have the correct mod time.
This adds a new test and a new repo to our e2e ranking suite. The idea is to have a query for which two repos (here golang/go and conc) of different freshness compete for the best match.
Among other fixes, Go 1.22.6 fixes a code signing issue on macOS and
XCode 16 that prevents Zoetk from running
This PR adds adds a debugging flag to periodically check memory usage against a
threshold. If it exceeds the threshold, then a memory profile like
`indexmemory.prof.1` is written to disk. No more than 10 profiles will be
written.
I've already found this more useful than the existing `-memprofile` flag, so I
removed that. It's hard to get insights using that flag, since it only takes a
single profile per shard, forces GC, and forces parallelism to 1.
While working on scoring I noticed that public repos always have an
advantage compared to private repos, because we have
`doc(rawConfig:RcOnlyPublic|...)` as part of the matchtree, which
contributues +1 to the atom count.
Test plan:
New unit test
Tiny follow up to https://github.com/sourcegraph/zoekt/pull/826. I resolved a conflict incorrectly and reverted a log line improvement.
This is a much more robust detection mechanism. Additionally we have
these signals we can also add in:
func IsConfiguration(path string) bool
func IsDocumentation(path string) bool
func IsDotFile(path string) bool
func IsImage(path string) bool
My main concern with this change is generated file detection on content
using up RAM or CPU. Will monitor this impact on pprof in production.
Test Plan: go test.
Test Plan: go run ./cmd/zoekt-index prints out useful instructions
Looking at heap profiles, the `ReadMetadata` function creates a ton of garbage
objects. The main contributor is in other sections from the TOC, specifically
decoding `compoundSection.offsets` . However, to read metadata, we only really
need to parse the metadata sections.
This PR introduces a `skip` method that skips over a section without reading
it. This greatly reduces the allocations from `ReadMetadata`.
Small improvement to how we handle unknown or malformed sections when reading the TOC.
Note: this PR originally removed some of the "skip section" handling. That has been restored, since it is important for downgrades.
Tiny clean-up, as we rarely use Datadog profiling (and I've never heard of us setting this env var).
This is more a workaround since a transitive dependency has introduced a
global flag "cpuprofile", leading to a panic due to registring the flag
twice.
To make ourselves immune to this issue we can refactor our usages to use
a FlagSet, even for "main". This is a bigger and frankly inconvenient
change for a somewhat rare occurance. Instead we just rename our flag.
I feel comfortable renaming since this flag should only really be used
by Zoekt developers. There will be the issue that the flag will be shown
twice for commands, but I will report to the upstream repo about this
problem.
Test Plan: go get -u ./... && go run ./cmd/zoekt-git-index works
This fixes the failing 'fuzz test' check in CI. Example from https://github.com/sourcegraph/zoekt/pull/891:
```
Download action repository 'jidicula/go-fuzz-action@0206b61afc603b665297621fa5e691b1447a5e57' (SHA:0206b61afc603b665297621fa5e691b1447a5e57)
Getting action download info
Error: This request has been automatically failed because it uses a deprecated version of `actions/upload-artifact: v3`. Learn more: https://github.blog/changelog/2024-04-16-deprecation-notice-v3-of-the-artifact-actions/
```
Usually, if there are candidate matches with overlapping ranges, then we just remove matches that overlap. However, when `opts.ChunkMatches = false`, we had special logic to merge overlapping matches.
This PR removes the overlapping logic to simplify the behavior. I couldn't see a good reason to keep this special handling. Plus, we are moving towards making `ChunkMatches` the default.
Another benefit of this change is that it makes the BM25 behavior easier to understand. If we merged together ranges, then we would be calculating term frequencies for spurious terms (like `new`, `queue`, `newqueue`, `queuenew`, etc.) Note: we currently only use BM25 with `ChunkMatches = true`, so there's not an active bug here.
Currently, BM25 scoring only applies to the overall `FileMatch` score. The
algorithm gathered term frequencies from all candidate matches in the file to
produce a file-level score. However `LineMatch` and `ChunkMatch` scores were
still calculated using the classic Zoekt scoring algorithm.
This PR implements BM25 scoring for `LineMatch` and `ChunkMatch`. It does so by
calculating a BM25 per line. Compared to the classic Zoekt algorithm, this
rewards multiple term matches on a line. Because our term frequency calculation
also boosts symbol matches, the score smoothly balances between "many term
matches" and "interesting term matches".
Now, the code is structured as follows:
* `scoreChunk`: goes through each line in the chunk, calculating its score
through `scoreLine`, and returns the best-scoring line
* `scoreLine`: calculates the score for a single line
The mental model is that "the score of a chunk is always the score of its best
line".
This PR adds a new field `ChunkMatch.BestLineMatch` with the line number of top-scoring line in the chunk. This will let us address a long-standing issue with our new flexible keyword search, where chunk matches can become very large. Since our search results UX only shows the start of a chunk, the most relevant line may not even be visible. With this information on the best line match, we can adjust the search results UX to center the chunk on the most relevant line.
Relates to [SPLF-188](https://linear.app/sourcegraph/issue/SPLF-188/ensure-the-best-scoring-line-match-is-shown-in-search-results)
When digging into our Natural Language Search (NLS) eval results, I found that one of the leading causes for flexible search types like "Fuzzy symbol search" and "Find logic" was noisy matches in top results. Currently, our BM25 ranking rewards any substring match equally. So for queries like 'extract tar', any match on 'tar' (even within unrelated terms like 'start', etc.) counts towards the term frequency.
This PR helps reduce noise by boosting symbol matches the same as we do filename matches. Our NLS evals show positive improvement, and context evals are the tiniest bit better.
Some tasks, such as loading shards, require priviledged access on startup. Here I introduce systemtenant which we can use for these kinds of things.
This is motivated by bug where the symbol sidebar in multi-tenant node wouldn't work because ranked shards were not loaded correctly which in turn caused SelectRepoSet to return 0 shards always.
Test plan:
added unit test
manual testing: symbol sidebar works now
If we failed to List the repositories when loading a shard we would
never search it due to selectRepoSet optimization. In practice this
feels very rare to happen (only for logic error or disk corruption).
However, in those cases we should surface these crashes searches by
attempting to search the shard.
Additionally I add logging so we can notice when this happens. I didn't
add a metric since this is the sort of thing that I think is so rare we
would never think to check the metric (but may notice logs).
Note: I used the slightly tricky invariant that repos being nil means
error. If the shard is actually empty (eg all repos tombstoned) then we
still correctly apply the optimization. In practice having an empty
shard shouldn't really happen so I'm open to just treating empty repos
list as something we have to search.
This adds the tenant ID to the trace. I also move the pprof logging from `FromContext` to the higher level `typeRepoSearcher`. The number of events was too high, because we logged missing tenants per document.
I also fixed a bug where pprof logging didn't work at all, because we read the tenant enforcemnt ENV after we set the pprof profile, so the profile was always nil.
Test plan:
Checked locally that tenants show up in the traces and "missing_tenant" shows up as pprof profile.
This updates webserver and sourcegraph-indexserver to support multi-tenancy.
The change is behind an ENV feature-flag.
Key changes:
- tenant ID is now part of the index (repo metadata)
- GRPC: IndexOption and Repository have a new field TenantId
- If multi-tenancy is enabled, webserver checks if tenant in context matches the tenant id in the shard
- zoekt-git-index has a new parameter "-shard_prefix ". If set, the value will be used instead of repository name as prefix for the name of the shard. For Sourcegraph we use "<tenant id>_<repository id>" as prefix if multi-tenancy is enabled
Assumption:
All calls to Sourcegraph are privileged
Test plan:
- New tests
- Ran this together with Sourcegraph (with and without MT enabled)
By default go-git maintains an LRU cache of git objects of size 96MB. When an
object's contents are loaded, it's stored as a MemoryObject in this cache. This
cache is not super useful in the indexing access pattern, which accesses each
file only once. And in many profiles, we see a substantial number of
allocations from these memory objects.
This PR disables caching for most git objects by setting LargeObjectThreshold:
1. go-git still proactively caches packfile objects under 16KB (see
smallObjectThreshold here).
Follow up to #852. This change is also gated by the
ZOEKT_ENABLE_GOGIT_OPTIMIZATION feature flag.
By default, the `go-git` library will open the packfile on every call to `Repository.BlobObject`, then close it. During indexing, we collect the list of files to index, then iterate through each one calling `Repository.BlobObject`. So on every object access the packfile reopened, and `go-git` reallocates some in-memory buffers.
This PR bypasses `git.PlainOpen` to allow us to enable the `KeepDescriptors` option. This option keeps packfile files open, and caches wrappers for them. The files then need to be explicitly closed when done with the repo.
Benefits:
* Avoid reallocating the memory buffers on every object access (see benchmark results below)
* (Highly speculative) I suspect this could improve OS decisions around when to cache portions of the packfile. Maybe constantly reopening and seeking within the file makes it harder for the OS to determine the true access pattern, which is roughly random access. This can affect decisions like readahead and whether to consider pages 'active'.
This updates zoekt-merge-index to print the name of a new compound shard to stdout. Indexserver picks it up and logs it. This has the nice property that indexserver now has all the info. If we want to log this to a file in the future, we don't have to worry as much about competing writes to the file.
Together with a new log line in vacuum we can now follow the full lifecycle of a compound shard in the logs.
Test plan:
updated unit tests
When adding this metric, I got the calculation wrong. So the maximum bucket was capped at ~1.3 days, which we easily saturate at the 90% percentile.
This PR updates it to ~ 5.5 days to be safe:
```
>>> math.pow(2, 14 - 1) / 60 / 24
5.688888888888889
```
Note the `- 1`, which is the key piece I missed before.
When constructing a shard we specify templates for constructing a URL.
Finally those URLs end up going via html/template which has pretty
strict escaping rules. This commit makes two changes:
URL construction via text/template. We still get the safety benefits
later on when finally rendering the output, but given we are
constructing URLs it makes more sense to use text/template.
Special escaping of + in URLs. I couldn't convince html/template to not
break URls containing + in it. So instead we use + escaped to %2B. I
tested gerrit, github and sourcegraph with %2B in filenames and they all
worked.
To do the above I introduced a template function called URLJoinPath
which is a wrapper around url.JoinPath with the additional behaviour
around + escaping.
Test Plan: Did lots of updates in tests. See diff.
We ignore priority and instead use the latest commit date as repo rank.
This has a big impact for Sourcegraph because it means we switch from
star count to repo freshness as tiebreaker.
As a minor tweak, we also separate query based scores from tiebreakers.
To achieve this we reserve the last 7 digits of a score for tiebreakers:
- 5 digits (maxUint16) for repo rank
- 2 digits ([0,10]) for file order (2 digits).
Example:
Before:
score: 8775.35 <- atom(2):200, fragment:8550.00, repo-rank: 19, doc-order:6.35
After:
score: 8750_00019_06.35 <- atom(2):200, fragment:8550.00, repo-rank: 19, doc-order:6.35
This PR refactors the "build preparation" code to remove `branchMap`, in favor of storing everything in a single map. The main motivation is just to make the code clearer.
Specific changes:
* Store branches in the same map as blob locations, remove `branchMap`
* Introduce new struct `BlobIndexInfo` to hold the old location information, plus branches
* Rename `BlobLocation` -> `BlobRepo`
This PR adds adds a debugging flag to periodically check memory usage against a
threshold. If it exceeds the threshold, then a memory profile like
`indexmemory.prof.1` is written to disk. No more than 10 profiles will be
written.
I've already found this more useful than the existing `-memprofile` flag, so I
removed that. It's hard to get insights using that flag, since it only takes a
single profile per shard, forces GC, and forces parallelism to 1.
This is a much more robust detection mechanism. Additionally we have
these signals we can also add in:
func IsConfiguration(path string) bool
func IsDocumentation(path string) bool
func IsDotFile(path string) bool
func IsImage(path string) bool
My main concern with this change is generated file detection on content
using up RAM or CPU. Will monitor this impact on pprof in production.
Test Plan: go test.
Looking at heap profiles, the `ReadMetadata` function creates a ton of garbage
objects. The main contributor is in other sections from the TOC, specifically
decoding `compoundSection.offsets` . However, to read metadata, we only really
need to parse the metadata sections.
This PR introduces a `skip` method that skips over a section without reading
it. This greatly reduces the allocations from `ReadMetadata`.