fork of https://github.com/sourcegraph/zoekt

index: add hybrid go-re2 engine for large file content matching (#1024)

Adds an optional hybrid regex engine (internal/hybridre2) that transparently
switches between grafana/regexp and wasilibs/go-re2 (RE2 via WebAssembly)
based on file content size. Disabled by default — no behaviour change
without opt-in via ZOEKT_RE2_THRESHOLD_BYTES.

## Motivation

Issue #323 identified regex as the dominant CPU consumer in zoekt's
webserver profile. Go's regexp engine (including the grafana/regexp fork
already in use) lacks a lazy DFA. RE2's lazy DFA provides linear-time
matching with much better constant factors for alternations, character
classes, and complex patterns on large inputs.

The tradeoff: go-re2 uses WebAssembly (~600ns per-call overhead), making
it slower than grafana/regexp for small inputs (<4KB) but dramatically
faster above the threshold. A full engine swap would regress small-file
searches, so a threshold-based hybrid is the pragmatic approach.

## Implementation

### New package: internal/hybridre2

hybridre2.Regexp always compiles grafana/regexp and, when the threshold is
non-negative, also compiles go-re2, both once at query-parse time. FindAllIndex
dispatches to go-re2 when it was compiled and len(input) >= threshold():

    func (re *Regexp) FindAllIndex(b []byte, n int) [][]int {
        if re.re2 != nil && useRE2(len(b)) {
            return re.re2.FindAllIndex(b, n)
        }
        return re.grafana.FindAllIndex(b, n)
    }

### Change to index/matchtree.go

regexpMatchTree gains a hybridRegexp field used for file content matching;
filename matching keeps using grafana/regexp directly (filenames are always
short, so WASM overhead dominates there).

### Configuration

ZOEKT_RE2_THRESHOLD_BYTES env var, read once at startup. An unset or
unparseable value is treated as -1:

    -1 (default): disabled; always grafana/regexp, zero behaviour change
    0:            always use go-re2 (useful for evaluation/testing)
    32768:        use go-re2 for files >= 32KB (recommended starting point)
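
The threshold semantics can be checked with a small standalone sketch. It mirrors the parsing and comparison logic of the package, but `parseThreshold` is a hypothetical pure function introduced here for illustration; the commit itself reads the variable through `os.LookupEnv` inside a `sync.OnceValue`.

```go
package main

import (
	"fmt"
	"strconv"
)

// disabled is the sentinel meaning the RE2 path is never taken.
const disabled = int64(-1)

// parseThreshold maps a raw environment value to a threshold. An unset or
// unparseable value falls back to disabled. (Hypothetical helper.)
func parseThreshold(val string, set bool) int64 {
	if !set {
		return disabled
	}
	n, err := strconv.ParseInt(val, 10, 64)
	if err != nil {
		return disabled
	}
	return n
}

// useRE2 reports whether an input of inputLen bytes takes the RE2 path.
func useRE2(threshold int64, inputLen int) bool {
	return threshold >= 0 && int64(inputLen) >= threshold
}

func main() {
	for _, raw := range []string{"-1", "0", "32768", "32KB"} {
		t := parseThreshold(raw, true)
		fmt.Printf("%q -> threshold %d, 4KB uses RE2: %v, 64KB uses RE2: %v\n",
			raw, t, useRE2(t, 4<<10), useRE2(t, 64<<10))
	}
}
```

Note that a value such as `32KB` does not parse as an integer, so it silently behaves like the default; the threshold must be given in plain bytes.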

## Benchmarks

Hardware: AMD EPYC 9B14, go-re2 v1.10.0 (WASM, no CGO).

Alternations, `func|var|const|type|import`:

    Size     grafana    go-re2    Speedup
    32KB     2505µs     467µs     5.4x
    128KB    9900µs     1699µs    5.8x
    512KB    40.7ms     6.8ms     6.0x

Complex, `(func|var)\s+[A-Z]\w*\s*\(`:

    Size     grafana    go-re2    Speedup
    32KB     1237µs     230µs     5.4x
    128KB    4935µs     911µs     5.4x
    512KB    19.9ms     3.8ms     5.3x

Literal, `main` (grafana wins; threshold protects this case):

    Size     grafana    go-re2
    32KB     33.2µs     59.8µs

## Testing

    go test ./internal/hybridre2/   # unit + correctness matrix
    go test ./index/ -short         # full existing suite: passes
    go test ./... -short            # full suite: passes

Correctness verified by asserting identical match offsets between grafana
and go-re2 for 9 patterns x 5 sizes (64B-256KB).

## Notes

- Binary/non-UTF-8 content: go-re2 stops matching at invalid UTF-8, whereas
  grafana/regexp replaces invalid bytes with U+FFFD and continues. The default
  threshold of -1 ensures zero behaviour change. Operators enabling the
  threshold should be aware of this; future work could detect non-UTF-8
  content and force the grafana path.
- Dependency: github.com/wasilibs/go-re2 v1.10.0, pure Go WASM with no system
  deps. Binary size increase: ~2MB (the embedded RE2 WASM module).
- Rollout plan: enable in GitLab via feature flag starting at 32KB, then compare
  p95 regex latency before and after using per-shard timing in search responses.
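
The non-UTF-8 guard mentioned as future work in the first note could look roughly like the sketch below. It is not part of this commit: `pickEngine` is a hypothetical helper, and the only real API it relies on is `utf8.Valid` from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// pickEngine is a hypothetical helper that names the engine a guarded
// dispatcher would choose. Invalid UTF-8 always takes the grafana path so
// results stay identical to the pre-change behaviour.
func pickEngine(b []byte, threshold int64) string {
	if threshold < 0 || int64(len(b)) < threshold {
		return "grafana"
	}
	if !utf8.Valid(b) {
		return "grafana"
	}
	return "re2"
}

func main() {
	text := []byte("package main\n")
	binary := []byte{'a', 0xff, 'b'} // 0xff can never appear in valid UTF-8

	fmt.Println(pickEngine(text, 0))
	fmt.Println(pickEngine(binary, 0))
}
```

One caveat with this approach: `utf8.Valid` scans the whole input, so it adds a linear pass over exactly the large files the RE2 path targets. Whether that cost is acceptable, or whether validity should be recorded once per document, would need measuring.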

author: Dmitry Gruzd
committer: GitHub
date: Mar 24, 2026, 2:12 PM +0200
commit 971fcf5e, parent 1e121443
+550 -8
+5 -2
go.mod
···
 	github.com/stretchr/testify v1.11.1
 	github.com/uber/jaeger-client-go v2.30.0+incompatible
 	github.com/uber/jaeger-lib v2.4.1+incompatible
+	github.com/wasilibs/go-re2 v1.10.0
 	github.com/xeipuuv/gojsonschema v1.2.0
 	gitlab.com/gitlab-org/api/client-go v1.46.0
 	go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc v0.63.0
···
 	github.com/go-fed/httpsig v1.1.0 // indirect
 	github.com/hashicorp/go-version v1.7.0 // indirect
 	github.com/kylelemons/godebug v1.1.0 // indirect
+	github.com/tetratelabs/wazero v1.9.0 // indirect
+	github.com/wasilibs/wazero-helpers v0.0.0-20240620070341-3dff1577cd52 // indirect
 	go.opentelemetry.io/auto/sdk v1.2.1 // indirect
 )
 
···
 	github.com/cockroachdb/logtags v0.0.0-20241215232642-bb51bb14a506 // indirect
 	github.com/cockroachdb/redact v1.1.5 // indirect
 	github.com/cyphar/filepath-securejoin v0.4.1 // indirect
-	github.com/davecgh/go-spew v1.1.1 // indirect
+	github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc // indirect
 	github.com/emirpasic/gods v1.18.1 // indirect
 	github.com/fatih/color v1.18.0 // indirect
 	github.com/getsentry/sentry-go v0.31.1 // indirect
···
 	github.com/mschoch/smat v0.2.0 // indirect
 	github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 // indirect
 	github.com/pjbgf/sha1cd v0.3.2 // indirect
-	github.com/pmezard/go-difflib v1.0.0 // indirect
+	github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 // indirect
 	github.com/power-devops/perfstat v0.0.0-20240221224432-82ca36839d55 // indirect
 	github.com/prometheus/client_model v0.6.1 // indirect
 	github.com/prometheus/common v0.62.0 // indirect
+10 -2
go.sum
···
 github.com/cyphar/filepath-securejoin v0.4.1 h1:JyxxyPEaktOD+GAnqIqTf9A8tHyAG22rowi7HkoSU1s=
 github.com/cyphar/filepath-securejoin v0.4.1/go.mod h1:Sdj7gXlvMcPZsbhwhQ33GguGLDGQL7h7bg04C/+u9jI=
 github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
-github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
 github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
+github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc h1:U9qPSI2PIWSS1VwoXQT9A3Wy9MM3WgvqSxFWenqJduM=
+github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
 github.com/davidmz/go-pageant v1.0.2 h1:bPblRCh5jGU+Uptpz6LgMZGD5hJoOt7otgT454WvHn0=
 github.com/davidmz/go-pageant v1.0.2/go.mod h1:P2EDDnMqIwG5Rrp05dTRITj9z2zpGcD9efWSkTNKLIE=
 github.com/dustin/go-humanize v1.0.1 h1:GzkhY7T5VNhEkwH0PVJgjz+fX1rhBrR7pRT3mDkpeCY=
···
 github.com/pkg/errors v0.8.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
 github.com/pkg/errors v0.9.1 h1:FEBLx1zS214owpjy7qsBeixbURkuhQAwrK5UwLGTwt4=
 github.com/pkg/errors v0.9.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
-github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
 github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
+github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 h1:Jamvg5psRIccs7FGNTlIRMkT8wgtp5eCXdBlqhYGL6U=
+github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
 github.com/power-devops/perfstat v0.0.0-20240221224432-82ca36839d55 h1:o4JXh1EVt9k/+g42oCprj/FisM4qX9L3sZB3upGN2ZU=
 github.com/power-devops/perfstat v0.0.0-20240221224432-82ca36839d55/go.mod h1:OmDBASR4679mdNQnz2pUhc2G8CO2JrUAVFDRBDP/hJE=
 github.com/prashantv/gostub v1.1.0 h1:BTyx3RfQjRHnUWaGF9oQos79AlQ5k8WNktv7VGvVH4g=
···
 github.com/stretchr/testify v1.8.1/go.mod h1:w2LPCIKwWwSfY2zedu0+kehJoqGctiVI29o6fzry7u4=
 github.com/stretchr/testify v1.11.1 h1:7s2iGBzp5EwR7/aIZr8ao5+dra3wiQyKjjFuvgVKu7U=
 github.com/stretchr/testify v1.11.1/go.mod h1:wZwfW3scLgRK+23gO65QZefKpKQRnfz6sD981Nm4B6U=
+github.com/tetratelabs/wazero v1.9.0 h1:IcZ56OuxrtaEz8UYNRHBrUa9bYeX9oVY93KspZZBf/I=
+github.com/tetratelabs/wazero v1.9.0/go.mod h1:TSbcXCfFP0L2FGkRPxHphadXPjo1T6W+CseNNY7EkjM=
 github.com/uber/jaeger-client-go v2.30.0+incompatible h1:D6wyKGCecFaSRUpo8lCVbaOOb6ThwMmTEbhRwtKR97o=
 github.com/uber/jaeger-client-go v2.30.0+incompatible/go.mod h1:WVhlPFC8FDjOFMMWRy2pZqQJSXxYSwNYOkTr/Z6d3Kk=
 github.com/uber/jaeger-lib v2.4.1+incompatible h1:td4jdvLcExb4cBISKIpHuGoVXh+dVKhn2Um6rjCsSsg=
 github.com/uber/jaeger-lib v2.4.1+incompatible/go.mod h1:ComeNDZlWwrWnDv8aPp0Ba6+uUTzImX/AauajbLI56U=
+github.com/wasilibs/go-re2 v1.10.0 h1:vQZEBYZOCA9jdBMmrO4+CvqyCj0x4OomXTJ4a5/urQ0=
+github.com/wasilibs/go-re2 v1.10.0/go.mod h1:k+5XqO2bCJS+QpGOnqugyfwC04nw0jaglmjrrkG8U6o=
+github.com/wasilibs/wazero-helpers v0.0.0-20240620070341-3dff1577cd52 h1:OvLBa8SqJnZ6P+mjlzc2K7PM22rRUPE1x32G9DTPrC4=
+github.com/wasilibs/wazero-helpers v0.0.0-20240620070341-3dff1577cd52/go.mod h1:jMeV4Vpbi8osrE/pKUxRZkVaA0EX7NZN0A9/oRzgpgY=
 github.com/xanzy/ssh-agent v0.3.3 h1:+/15pJfg/RsTxqYcX6fHqOXZwwMP+2VyYWJeWM2qQFM=
 github.com/xanzy/ssh-agent v0.3.3/go.mod h1:6dzNDKs0J9rVPHPhaGCukekBHKqfl+L3KghI1Bc68Uw=
 github.com/xeipuuv/gojsonpointer v0.0.0-20180127040702-4e3ac2762d5f/go.mod h1:N2zxlSyiKSe5eX1tZViRH5QA0qijqEDrYZiPEAiq3wU=
+29 -4
index/matchtree.go
···
 	"github.com/grafana/regexp"
 
 	"github.com/sourcegraph/zoekt"
+	"github.com/sourcegraph/zoekt/internal/hybridre2"
 	"github.com/sourcegraph/zoekt/internal/syntaxutil"
 	"github.com/sourcegraph/zoekt/query"
 )
···
 type regexpMatchTree struct {
 	regexp *regexp.Regexp
 
+	// hybridRegexp is a size-aware wrapper that dispatches to go-re2 for
+	// large file content when ZOEKT_RE2_THRESHOLD_BYTES is configured.
+	// For small inputs and filename matches, regexp is used directly.
+	hybridRegexp *hybridre2.Regexp
+
 	// origRegexp is the original parsed regexp from the query structure. It
 	// does not include mutations such as case sensitivity.
 	origRegexp *syntax.Regexp
···
 		prefix = "(?i)"
 	}
 
+	pattern := prefix + syntaxutil.RegexpString(s.Regexp)
+
+	// hybridRegexp is only used for file content matching; skip the RE2
+	// compilation overhead for filename-only regexps.
+	var hr *hybridre2.Regexp
+	if !s.FileName {
+		hr = hybridre2.MustCompile(pattern)
+	}
 	return &regexpMatchTree{
-		regexp:     regexp.MustCompile(prefix + syntaxutil.RegexpString(s.Regexp)),
-		origRegexp: s.Regexp,
-		fileName:   s.FileName,
+		regexp:       regexp.MustCompile(pattern),
+		hybridRegexp: hr,
+		origRegexp:   s.Regexp,
+		fileName:     s.FileName,
 	}
 }
 
···
 	}
 
 	cp.stats.RegexpsConsidered++
-	idxs := t.regexp.FindAllIndex(cp.data(t.fileName), -1)
+	data := cp.data(t.fileName)
+	// For file content, use hybridRegexp which dispatches to go-re2 when
+	// len(data) >= ZOEKT_RE2_THRESHOLD_BYTES. For filename matching, use
+	// grafana/regexp directly: filenames are always short, so the WASM
+	// call overhead of go-re2 outweighs any benefit.
+	var idxs [][]int
+	if t.fileName {
+		idxs = t.regexp.FindAllIndex(data, -1)
+	} else {
+		idxs = t.hybridRegexp.FindAllIndex(data, -1)
+	}
 	found := t.found[:0]
 	for _, idx := range idxs {
 		cm := &candidateMatch{
+148
internal/hybridre2/hybridre2.go
+// Copyright 2026 Google Inc. All rights reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//    http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// Package hybridre2 provides a hybrid regex engine that switches between
+// grafana/regexp (an optimized fork of Go's stdlib regexp) and
+// wasilibs/go-re2 (RE2 via WebAssembly) based on input size.
+//
+// Motivation: Go's regexp engine lacks a lazy DFA, making it O(n·m) for
+// hard patterns. RE2's lazy DFA provides linear-time matching, which is
+// dramatically faster for large inputs (>32KB) or complex patterns. For
+// small inputs the WASM call overhead of go-re2 exceeds the savings,
+// so grafana/regexp remains the better choice there.
+//
+// The threshold is controlled by the ZOEKT_RE2_THRESHOLD_BYTES environment
+// variable, read once at program startup:
+//
+//   - -1 (default): disabled, always use grafana/regexp
+//   - 0: always use go-re2
+//   - N > 0: use go-re2 when len(input) >= N bytes
+//
+// # Known tradeoffs
+//
+// Memory: each Regexp holds compiled state for both engines when RE2 is
+// enabled. Patterns are compiled per-search (not cached globally), so under
+// high concurrency with many unique patterns the WASM heap adds up. Monitor
+// RSS when first enabling the threshold in production.
+//
+// UTF-8 semantics: go-re2 stops at invalid UTF-8; grafana/regexp replaces
+// invalid bytes with U+FFFD and continues. Results may differ on binary
+// content that slips past content-type detection. See FindAllIndex for
+// details.
+//
+// RE2 compilation failure: if RE2 rejects a pattern that grafana/regexp
+// accepts (due to syntax differences between the two engines), Compile
+// returns an error rather than silently falling back to grafana/regexp.
+// This is intentional (fail-fast), but it means enabling the threshold
+// could surface errors for edge-case patterns that work today. Patterns
+// sourced from zoekt query parsing are validated before reaching this
+// package, so this is unlikely in practice.
+package hybridre2
+
+import (
+	"os"
+	"strconv"
+	"sync"
+
+	grafanaregexp "github.com/grafana/regexp"
+	re2regexp "github.com/wasilibs/go-re2"
+)
+
+const (
+	// envThreshold is the environment variable name controlling the size
+	// threshold (bytes) at which go-re2 is used instead of grafana/regexp.
+	// Set to -1 (default) to disable go-re2 entirely, 0 to always use it.
+	envThreshold = "ZOEKT_RE2_THRESHOLD_BYTES"
+
+	// disabled is the sentinel value meaning go-re2 is never used.
+	disabled = int64(-1)
+)
+
+// threshold returns the configured byte threshold, reading
+// ZOEKT_RE2_THRESHOLD_BYTES from the environment exactly once.
+// Negative means disabled; zero means always use RE2.
+//
+// Tests may reassign this variable to override the threshold.
+var threshold = sync.OnceValue(func() int64 {
+	if val, ok := os.LookupEnv(envThreshold); ok {
+		if n, err := strconv.ParseInt(val, 10, 64); err == nil {
+			return n
+		}
+	}
+	return disabled
+})
+
+// Regexp is a compiled regular expression that dispatches to either
+// grafana/regexp or go-re2 at match time, based on input size.
+type Regexp struct {
+	grafana *grafanaregexp.Regexp
+	re2     *re2regexp.Regexp // nil when threshold() < 0 (disabled)
+}
+
+// Compile returns a new Regexp. The grafana/regexp variant is always compiled.
+// The go-re2 variant is only compiled when ZOEKT_RE2_THRESHOLD_BYTES is set to
+// a non-negative value; when RE2 is disabled (the default), skipping WASM
+// compilation keeps the disabled path truly zero-cost.
+func Compile(pattern string) (*Regexp, error) {
+	g, err := grafanaregexp.Compile(pattern)
+	if err != nil {
+		return nil, err
+	}
+	result := &Regexp{grafana: g}
+	if threshold() >= 0 {
+		r, err := re2regexp.Compile(pattern)
+		if err != nil {
+			return nil, err
+		}
+		result.re2 = r
+	}
+	return result, nil
+}
+
+// MustCompile is like Compile but panics on error.
+func MustCompile(pattern string) *Regexp {
+	re, err := Compile(pattern)
+	if err != nil {
+		panic("hybridre2: Compile(" + pattern + "): " + err.Error())
+	}
+	return re
+}
+
+// useRE2 reports whether the RE2 engine should be used for an input of the
+// given length, based on the current threshold setting.
+func useRE2(inputLen int) bool {
+	t := threshold()
+	return t >= 0 && int64(inputLen) >= t
+}
+
+// FindAllIndex returns successive non-overlapping matches of the expression
+// in b. It uses go-re2 when len(b) >= threshold() (and RE2 is enabled),
+// and grafana/regexp otherwise. Match indices are relative to b.
+//
+// NOTE: go-re2 stops matching at invalid UTF-8 bytes, whereas grafana/regexp
+// replaces them with U+FFFD and continues. This means results may differ on
+// binary or non-UTF-8 content when RE2 is active. The default threshold of -1
+// (disabled) ensures zero behaviour change for existing deployments; operators
+// enabling the threshold should be aware of this distinction.
+func (re *Regexp) FindAllIndex(b []byte, n int) [][]int {
+	if re.re2 != nil && useRE2(len(b)) {
+		return re.re2.FindAllIndex(b, n)
+	}
+	return re.grafana.FindAllIndex(b, n)
+}
+
+// String returns the source text used to compile the regular expression.
+func (re *Regexp) String() string {
+	return re.grafana.String()
+}
+358
internal/hybridre2/hybridre2_test.go
+// Copyright 2026 Google Inc. All rights reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//    http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package hybridre2
+
+import (
+	"fmt"
+	"testing"
+
+	grafanaregexp "github.com/grafana/regexp"
+)
+
+// withThreshold overrides the effective threshold for the duration of the test
+// and registers a t.Cleanup to restore it afterwards.
+//
+// NOT safe for concurrent use: do not call t.Parallel() after withThreshold,
+// and do not use it from TestMain or init().
+func withThreshold(tb testing.TB, thresh int64) {
+	tb.Helper()
+	old := threshold
+	threshold = func() int64 { return thresh }
+	tb.Cleanup(func() { threshold = old })
+}
+
+// ---- unit tests ----
+
+func TestCompileValid(t *testing.T) {
+	_, err := Compile(`foo.*bar`)
+	if err != nil {
+		t.Fatalf("unexpected error: %v", err)
+	}
+}
+
+func TestCompileInvalid(t *testing.T) {
+	_, err := Compile(`[invalid`)
+	if err == nil {
+		t.Fatal("expected error for invalid pattern, got nil")
+	}
+}
+
+func TestMustCompilePanics(t *testing.T) {
+	defer func() {
+		if r := recover(); r == nil {
+			t.Fatal("MustCompile should panic on invalid pattern")
+		}
+	}()
+	MustCompile(`[invalid`)
+}
+
+func TestString(t *testing.T) {
+	const pat = `foo.*bar`
+	re := MustCompile(pat)
+	if re.String() != pat {
+		t.Fatalf("String() = %q, want %q", re.String(), pat)
+	}
+}
+
+// TestFindAllIndexDisabled checks that with threshold=-1, we use grafana/regexp.
+func TestFindAllIndexDisabled(t *testing.T) {
+	corpus := []byte("func main() { fmt.Println(\"hello world\") }")
+	patterns := []string{`\w+`, `fmt\.\w+`, `(?i)MAIN`, `"[^"]*"`}
+
+	withThreshold(t, disabled)
+	for _, pat := range patterns {
+		hybrid := MustCompile(pat)
+		grafana := grafanaregexp.MustCompile(pat)
+		got := hybrid.FindAllIndex(corpus, -1)
+		want := grafana.FindAllIndex(corpus, -1)
+		if !equalIndexSlices(got, want) {
+			t.Errorf("disabled mode, pattern %q: hybrid=%v grafana=%v", pat, got, want)
+		}
+	}
+}
+
+// TestFindAllIndexForcedRE2 checks that with threshold=0, go-re2 is used and
+// produces identical results to grafana/regexp for standard patterns.
+func TestFindAllIndexForcedRE2(t *testing.T) {
+	corpus := []byte("func main() { fmt.Println(\"hello world\") }")
+	patterns := []string{`\w+`, `fmt\.\w+`, `(?i)MAIN`, `"[^"]*"`}
+
+	withThreshold(t, 0)
+	for _, pat := range patterns {
+		hybrid := MustCompile(pat)
+		grafana := grafanaregexp.MustCompile(pat)
+		got := hybrid.FindAllIndex(corpus, -1)
+		want := grafana.FindAllIndex(corpus, -1)
+		if !equalIndexSlices(got, want) {
+			t.Errorf("forced-re2 mode, pattern %q: hybrid=%v grafana=%v", pat, got, want)
+		}
+	}
+}
+
+// TestThresholdSwitching verifies the engine switches at the configured byte boundary.
+func TestThresholdSwitching(t *testing.T) {
+	const thresh = int64(512)
+	pattern := `func\s+\w+`
+	grafana := grafanaregexp.MustCompile(pattern)
+
+	smallCorpus := makeCorpus(300) // < 512
+	largeCorpus := makeCorpus(600) // >= 512
+
+	withThreshold(t, thresh)
+	hybrid := MustCompile(pattern)
+
+	for _, tc := range []struct {
+		name   string
+		corpus []byte
+	}{
+		{"small(<threshold)", smallCorpus},
+		{"large(>=threshold)", largeCorpus},
+	} {
+		got := hybrid.FindAllIndex(tc.corpus, -1)
+		want := grafana.FindAllIndex(tc.corpus, -1)
+		if !equalIndexSlices(got, want) {
+			t.Errorf("%s: hybrid=%v grafana=%v", tc.name, got, want)
+		}
+	}
+}
+
+// TestFindAllIndexIdenticalResults is a comprehensive correctness sweep across
+// pattern types and input sizes, asserting identical match positions.
+func TestFindAllIndexIdenticalResults(t *testing.T) {
+	patterns := []struct {
+		name    string
+		pattern string
+	}{
+		{"literal", `hello`},
+		{"case-insensitive", `(?i)Hello`},
+		{"word-boundary", `\bfunc\b`},
+		{"alternation", `foo|bar|baz`},
+		{"char-class", `[a-zA-Z_]\w*`},
+		{"complex", `(func|var|const)\s+[A-Z]\w*`},
+		{"dot-plus", `.+`},
+		{"anchored-line", `(?m)^package\s+\w+`},
+		{"no-match", `XYZZY_NEVER_MATCHES`},
+	}
+
+	sizes := []struct {
+		name string
+		size int
+	}{
+		{"64B", 64},
+		{"512B", 512},
+		{"4KB", 4 * 1024},
+		{"64KB", 64 * 1024},
+		{"256KB", 256 * 1024},
+	}
+
+	// Force re2 path to test its correctness across all sizes.
+	withThreshold(t, 0)
+	for _, sz := range sizes {
+		corpus := makeCorpus(sz.size)
+		for _, pat := range patterns {
+			name := sz.name + "/" + pat.name
+			t.Run(name, func(t *testing.T) {
+				hybrid := MustCompile(pat.pattern)
+				grafana := grafanaregexp.MustCompile(pat.pattern)
+
+				got := hybrid.FindAllIndex(corpus, -1)
+				want := grafana.FindAllIndex(corpus, -1)
+				if !equalIndexSlices(got, want) {
+					t.Errorf("pattern=%q size=%d: len(hybrid)=%d len(grafana)=%d",
+						pat.pattern, sz.size, len(got), len(want))
+					if len(got) > 0 && len(want) > 0 {
+						t.Errorf(" first hybrid=%v first grafana=%v", got[0], want[0])
+					}
+				}
+			})
+		}
+	}
+}
+
+// TestFindAllIndexLimitN verifies that the n parameter (match count limit) is
+// honoured identically by both engines.
+func TestFindAllIndexLimitN(t *testing.T) {
+	corpus := makeCorpus(64 * 1024) // large enough to have many matches
+	patterns := []string{`func\s+\w+`, `\bvar\b`, `[A-Z]\w*`}
+
+	withThreshold(t, 0) // force re2 path
+	for _, pat := range patterns {
+		hybrid := MustCompile(pat)
+		grafana := grafanaregexp.MustCompile(pat)
+
+		got := hybrid.FindAllIndex(corpus, 1)
+		want := grafana.FindAllIndex(corpus, 1)
+		if !equalIndexSlices(got, want) {
+			t.Errorf("n=1, pattern=%q: hybrid=%v grafana=%v", pat, got, want)
+		}
+		// Sanity: n=1 should return at most one match.
+		if len(got) > 1 {
+			t.Errorf("n=1, pattern=%q: got %d matches, want <= 1", pat, len(got))
+		}
+	}
+}
+
+// TestNoMatchReturnsEmpty verifies no-match returns nil/empty consistently.
+func TestNoMatchReturnsEmpty(t *testing.T) {
+	corpus := makeCorpus(1024)
+
+	for _, thresh := range []int64{disabled, 0} {
+		t.Run(fmt.Sprintf("thresh=%d", thresh), func(t *testing.T) {
+			withThreshold(t, thresh)
+			// MustCompile must be after withThreshold so that the lazy RE2
+			// compilation in Compile() sees the overridden threshold and
+			// actually initialises re.re2 when thresh=0.
+			re := MustCompile(`XYZZY_NEVER_MATCHES`)
+			if got := re.FindAllIndex(corpus, -1); len(got) != 0 {
+				t.Errorf("thresh=%d: expected empty, got %v", thresh, got)
+			}
+		})
+	}
+}
+
+// ---- benchmarks ----
+
+// BenchmarkEngines measures FindAllIndex performance for grafana/regexp vs
+// go-re2 across multiple input sizes and pattern complexities.
+//
+// Run with:
+//
+//	go test -bench=BenchmarkEngines -benchmem -benchtime=3s ./internal/hybridre2/
+func BenchmarkEngines(b *testing.B) {
+	patterns := []struct {
+		name    string
+		pattern string
+	}{
+		{"literal", `main`},
+		{"case-insensitive", `(?i)func`},
+		{"alternation-5", `func|var|const|type|import`},
+		{"complex", `(func|var)\s+[A-Z]\w*\s*\(`},
+		{"hard-no-match", `XYZZY_NEVER_MATCHES_AT_ALL`},
+	}
+
+	sizes := []struct {
+		name string
+		size int
+	}{
+		{"512B", 512},
+		{"4KB", 4 * 1024},
+		{"32KB", 32 * 1024},
+		{"128KB", 128 * 1024},
+		{"512KB", 512 * 1024},
+	}
+
+	// Pre-build all corpora outside the benchmark loop.
+	corpora := make(map[string][]byte, len(sizes))
+	for _, sz := range sizes {
+		corpora[sz.name] = makeCorpus(sz.size)
+	}
+
+	for _, pat := range patterns {
+		grafanaRe := grafanaregexp.MustCompile(pat.pattern)
+
+		for _, sz := range sizes {
+			corpus := corpora[sz.name]
+			name := pat.name + "/" + sz.name
+
+			b.Run("grafana/"+name, func(b *testing.B) {
+				b.SetBytes(int64(len(corpus)))
+				b.ResetTimer()
+				for i := 0; i < b.N; i++ {
+					_ = grafanaRe.FindAllIndex(corpus, -1)
+				}
+			})
+
+			b.Run("go-re2/"+name, func(b *testing.B) {
+				withThreshold(b, 0) // force re2 for all sizes
+				// MustCompile must be after withThreshold so that re.re2
+				// is initialised (lazy compilation checks threshold() at
+				// compile time, not match time).
+				hybridRe := MustCompile(pat.pattern)
+				b.SetBytes(int64(len(corpus)))
+				b.ResetTimer()
+				for i := 0; i < b.N; i++ {
+					_ = hybridRe.FindAllIndex(corpus, -1)
+				}
+			})
+		}
+	}
+}
+
+// ---- helpers ----
+
+// makeCorpus returns a realistic-looking Go source corpus of approximately
+// the requested size.
+func makeCorpus(size int) []byte {
+	const template = `package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+// Foo is an exported function that transforms its input.
+func Foo(input string) string {
+	return strings.ToUpper(input)
+}
+
+// Bar demonstrates calling Foo.
+func Bar() {
+	result := Foo("hello world")
+	fmt.Println(result)
+}
+
+var globalVar = "some value"
+const MaxItems = 100
+
+type MyStruct struct {
+	Name string
+	Value int
+}
+
+func (m MyStruct) String() string {
+	return fmt.Sprintf("%s=%d", m.Name, m.Value)
+}
+
+`
+	buf := make([]byte, 0, size)
+	for len(buf) < size {
+		buf = append(buf, []byte(template)...)
+	}
+	return buf[:size]
+}
+
+func equalIndexSlices(a, b [][]int) bool {
+	if len(a) != len(b) {
+		return false
+	}
+	for i := range a {
+		if !equalIntSlice(a[i], b[i]) {
+			return false
+		}
+	}
+	return true
+}
+
+func equalIntSlice(a, b []int) bool {
+	if len(a) != len(b) {
+		return false
+	}
+	for i := range a {
+		if a[i] != b[i] {
+			return false
+		}
+	}
+	return true
+}