Don't truncate file before detecting language (#740) · boltless.me/zoekt@1c158f9

fork of https://github.com/sourcegraph/zoekt

Currently, we truncate a file's contents to 2048 bytes before passing it to
`go-enry`. I ran into a few cases where this causes us to misclassify
files.

This PR removes the truncation. It should still be fine in terms of
performance, since `go-enry` is quite fast in general: ~1ms in my local
testing, even for large files. And we only run language detection if we plan to
index the file, which means we skip binary files and large files.
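
To illustrate the failure mode, here is a minimal sketch of how a 2048-byte cutoff can hide the evidence a content-based classifier relies on. Everything in it is hypothetical: `toyDetect` is a stand-in that checks for a single marker and is not `go-enry`'s actual heuristic, and `sampleHeader` is made-up input whose long leading comment pushes the distinguishing line past the cutoff.

```go
package main

import (
	"bytes"
	"fmt"
)

// truncateLimit mirrors the 2048-byte cutoff that this commit removes.
const truncateLimit = 2048

// toyDetect is a hypothetical stand-in for a content-based classifier.
// It is NOT go-enry: it only looks for one marker, to show how truncation
// can remove the bytes a classifier needs.
func toyDetect(name string, content []byte) string {
	if bytes.Contains(content, []byte("@interface")) {
		return "Objective-C"
	}
	return "C"
}

// truncated reproduces the old behavior: keep at most truncateLimit bytes.
func truncated(content []byte) []byte {
	if len(content) > truncateLimit {
		return content[:truncateLimit]
	}
	return content
}

// sampleHeader builds a made-up header file whose first ~3000 bytes are a
// comment, so the distinguishing declaration sits beyond the cutoff.
func sampleHeader() []byte {
	var buf bytes.Buffer
	buf.WriteString("/*\n")
	for buf.Len() < 3000 {
		buf.WriteString(" * license text\n")
	}
	buf.WriteString(" */\n@interface Foo\n@end\n")
	return buf.Bytes()
}

func main() {
	content := sampleHeader()
	fmt.Println(toyDetect("foo.h", truncated(content))) // prints "C"
	fmt.Println(toyDetect("foo.h", content))            // prints "Objective-C"
}
```

The same file gets two different answers depending only on whether the classifier sees the bytes past the cutoff, which is the kind of misclassification passing the full content avoids.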

author: Julie Tibshirani
committer: GitHub
date: Feb 11, 2024, 6:11 PM -0800
commit: 1c158f9b866148246bad89f5bc6d4876c752681f
parent: b227501acf82ca21a07a6cf1d7b36616a2b21327

1 changed file: indexbuilder.go (+1 -6)

@@ -398,10 +398,5 @@
 func DetermineLanguageIfUnknown(doc *Document) {
 	if doc.Language == "" {
-		c := doc.Content
-		// classifier is faster on small files without losing much accuracy
-		if len(c) > 2048 {
-			c = c[:2048]
-		}
-		doc.Language = enry.GetLanguage(doc.Name, c)
+		doc.Language = enry.GetLanguage(doc.Name, doc.Content)
 	}
 }
