Is there a content search program/port (not index based) that's efficient for files needing text extraction filters


Asked 8 years, 5 months ago

Modified 8 years, 5 months ago

Viewed 103 times

When FreeBSD is used as a file server for Windows clients, it's useful to be able to run file searches server-side rather than client-side.

A typical example might be: find all files that meet some metadata criteria (name, path, size, date, etc.) and that contain some literal or regex in their text-extracted content. The search runs across a large recursive directory tree of mixed files, and content hits could be in any (or several) of: .txt notes, .docx/.xlsx documents, .pdf files, .zip/.rar/.tgz/.iso archives, or, failing all of those, the strings embedded in a binary file.

The first part is easy: just use find. Searching in one type of file isn't hard either. But FreeBSD has no notion of "well known" file filters, and no single API that parses file data to text through pluggable filters emitting a common format. There are well-known text extraction filters for many individual file types (pdf, doc/docx, xls/xlsx, archive formats, sqlite databases, binary files containing strings, etc.), but you can't simply throw grep, find -exec, pdftotext, or unzip | sed with Microsoft XML extraction code at every result. I guess you would have to generate a list or stream of filenames with find, pass each file through the filter appropriate to its extension or its file(1) type, and collect whatever matches as the output.
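
The "find, then dispatch each file to a filter" approach described above might look roughly like the following Python sketch. It is an illustration under stated assumptions, not an existing tool: the names `EXTRACTORS`, `search_tree`, and the `extract_*` helpers are hypothetical, the .docx/.xlsx handling just strips XML tags from the zip members, the PDF branch assumes a `pdftotext` binary is on the PATH (and falls back to a strings-style scan if not), and every file is read fully into memory, so it says nothing about efficiency on a large store.

```python
import os
import re
import shutil
import subprocess
import zipfile


def extract_plain(path):
    """Read the file as text, replacing undecodable bytes."""
    with open(path, "rb") as fh:
        return fh.read().decode("utf-8", errors="replace")


def extract_ooxml(path):
    """Crude .docx/.xlsx filter: strip tags from every XML member of the zip."""
    chunks = []
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            if name.endswith(".xml"):
                xml = zf.read(name).decode("utf-8", errors="replace")
                chunks.append(re.sub(r"<[^>]+>", " ", xml))
    return "\n".join(chunks)


def extract_strings(path, min_len=4):
    """Fallback in the spirit of strings(1): runs of printable ASCII."""
    with open(path, "rb") as fh:
        data = fh.read()
    runs = re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)
    return "\n".join(run.decode("ascii") for run in runs)


def extract_pdf(path):
    """Assumes pdftotext is installed; otherwise use the strings fallback."""
    if shutil.which("pdftotext") is None:
        return extract_strings(path)
    result = subprocess.run(["pdftotext", path, "-"],
                            capture_output=True, check=False)
    return result.stdout.decode("utf-8", errors="replace")


# Hypothetical extension-to-filter table; extend it per file type as needed.
EXTRACTORS = {
    ".txt": extract_plain,
    ".docx": extract_ooxml,
    ".xlsx": extract_ooxml,
    ".pdf": extract_pdf,
}


def search_tree(root, pattern, keep=None):
    """Yield paths under root whose extracted text matches the regex.

    keep is an optional predicate on the path, standing in for find's
    metadata tests (name, size, date, ...).
    """
    regex = re.compile(pattern)
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            path = os.path.join(dirpath, fname)
            if keep is not None and not keep(path):
                continue
            ext = os.path.splitext(fname)[1].lower()
            extractor = EXTRACTORS.get(ext, extract_strings)
            try:
                text = extractor(path)
            except (OSError, zipfile.BadZipFile):
                continue  # unreadable or corrupt file: skip it
            if regex.search(text):
                yield path
```

A call such as `list(search_tree("/some/share", r"invoice\s+\d+", keep=lambda p: os.path.getsize(p) < 50_000_000))` would then combine a metadata test with the content test; the path and pattern here are placeholders. Note that tag stripping can split a word that a document stores across several XML runs, so a real filter for .docx would need to be smarter than this.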

If I need to do this kind of content search quite often in a large file store, is there a tool designed for it that does it more efficiently? If not, what's the most efficient approach out there?

Update: I'm only interested in direct file-by-file CLI search. I'm not interested, even slightly, in indexing content and searching the index later. This question is about on-the-spot, file-by-file literal/regex search, as with find, but where the content is searched too and that content isn't plain text: it spans multiple file types that need different text-extraction filters. So it's not a dup of the existing questions about indexed content searching. Sorry this wasn't clear before; I hadn't realised the ambiguity.

edited Jun 23, 2017 at 10:57

asked Jun 22, 2017 at 18:05

Stilez

1,311 · 19 silver badges · 32 bronze badges

I'm not asking for index-based systems, but for pure search systems. I've updated the question to clarify that it asks about non-indexed searching (similar to find, but across files of different formats). Sorry this wasn't clear; please unmark as dup.

– Stilez, commented Jun 23, 2017 at 5:27

Is something like find / -exec metamail …, or another program that processes files based on a mailcap file, acceptable? I believe the most efficient way to act on the live data (and not on some prepared index or similar) would have to be supported by the file system.

– Kai Burghardt, commented Nov 15, 2024 at 23:12
