Tackle the issue of XML files filtered as binaries in search results (#910)
When skipping a doc, we currently report the detected language as "binary" (if
it looks like binary) or "skipped" (if it's skipped for any other reason).
Skipped docs are still added to the index and can still be returned as search
results, for example when a query matches only on the filename. As a result,
file matches are sometimes returned with "skipped" as their language, even
though the file path clearly indicates another language, such as XML.
This PR updates the indexing logic to detect the language even when the
document is skipped. For skipped documents, however, we do not pass the
contents to the language detection library, so detection never runs on huge
files.
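
As a rough illustration of name-only detection for skipped documents, here is a
minimal sketch. It is not the code from this PR: `languageForSkipped` and the
`extLanguages` table are hypothetical stand-ins for the real language detection
library, and the choice to return an empty string for unknown extensions is an
assumption.

```go
package main

import (
	"path/filepath"
	"strings"
)

// extLanguages is a hypothetical, deliberately tiny extension table.
// A real indexer would delegate this lookup to its language detection library.
var extLanguages = map[string]string{
	".xml": "XML",
	".go":  "Go",
	".py":  "Python",
}

// languageForSkipped guesses a language from the file name alone.
// It never looks at file contents, so it stays cheap for huge or binary files.
// It returns "" when the extension is unknown (an assumption of this sketch);
// the caller can then fall back to a marker such as "skipped" or "binary".
func languageForSkipped(name string) string {
	ext := strings.ToLower(filepath.Ext(name))
	return extLanguages[ext]
}
```

With this shape, a skipped file named `config/settings.XML` would be reported
as XML, while a file with no recognizable extension would keep whatever
fallback label the caller chooses.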
---------
Co-authored-by: Julie Tibshirani <julietibs@apache.org>