Wosp: advanced full-text search on the command line

Published 2025-11-04
Summary
Wosp is a command-line program that performs full-text search on text documents. Wosp stands for word-oriented search and print. It is designed for advanced searchers. It works differently than line-oriented search tools like grep, so it can search for matches spanning multiple lines. Wosp supports an expressive query language that contains both Boolean and proximity operators. It also supports nested queries, truncation, wildcard characters, and fuzzy searching.
Git repository
The source code is available on GitHub. I have licensed the project under the GPL version 3. The README file in the repository covers the build and installation instructions.
Motivations and requirements
I am a scientist and a researcher. As part of my research, I often search for obscure information and hard-to-define concepts. Some of my searches occur locally. Over the years, I have accumulated a substantial amount of documents that are unavailable online, so I also depend on being able to search for information locally.
Search engines often appear magical. Searchers, like wizards, summon the relevant information with the right incantation (well, query). Of course, search is not magic, but the seemingly miraculous nature of search makes it difficult at times to grasp what went wrong when a searcher cannot find something. Should I narrow this search more? Did I use the wrong term? How does a search engine work anyway?
In the last few years, people have often discussed how "search is broken" in some way. This statement usually applies to online search engines, especially Google. Several factors have contributed to the drop in search quality: extensive search engine optimization, alleged attempts by Google to increase advertising revenue, or the recent introduction of AI overviews hindering deeper searching. Searching on Google now feels adversarial at times, like you have to trick the system to get good results.
The common issue in all three of these factors is that searchers have little control over the results Google returns. That realization made me wonder if I can create some search tool that puts me and other searchers firmly in control of our own searches.
Searchers need repeatable tools that work reliably and do not degrade over time as external incentives change. Ideally, these tools should work locally to search private information. Search tools should also be easy to understand, so that searchers can "look under the hood" and improve the tools as needed. That removes the magical feeling, but it was never magic to begin with. Searchers do not have to accept broken products that operate like magical and untouchable black boxes.
Initially, I did not want to create another program, so I searched for something. I found several interesting programs but concluded that no existing program satisfies my requirements.
I need a search program with the following features:
-
Command-line operation. The program must work according to the Unix philosophy.
-
Simple and clear program architecture. The program must have a well-defined scope to avoid feature creep. I want a simple and clear program architecture that follows guidelines like the Suckless philosophy among others.
-
Boolean and proximity operators. The program must have sentence and paragraph proximity operators. IBM and Data Central (now LexisNexis) independently invented proximity search in the 1960s (Bourne 2003). Proximity search ensures that a document discusses two concepts in the same context, which matters when searching especially long documents.
-
Wildcard search. Wildcard search enables searching for a variety of terms compactly.
-
Fuzzy search. Fuzzy search helps search poorly OCRed documents and locates relevant documents even if they have typos.
-
Nested queries. Support for nested queries means that the query language should allow for the arbitrary combination of all operations. For example, some programs only allow proximity search for words and not for phrases or groups of words. Nested queries make the query language expressive by enabling searchers to search for more precise concepts.
-
No stored index required. The program must not require a stored index, so it can be run on any arbitrary collection of documents. I need to use the program on any bit of text rather than on a limited library of documents.
I have considered various tools, but none met all of these requirements.
-
grep works on the command line, but it lacks any of the more advanced features like proximity search. Moreover, grep is line-oriented and better suited for source code than text documents.
-
agrep adds fuzzy search to grep, but suffers many of the same issues as grep.
-
pdfgrep extends regular-expression search to PDF files, but it still largely falls under the grep model of searching and therefore suffers its same limitations.
-
Recoll has proximity search, but does not support nested proximity searches. It has a command-line component, but it requires users to index all of their documents before searching them.
Seeing as there is nothing out there that does what I want, I decided to write my own full-text search program designed for advanced searchers. The next section goes into how to use Wosp.
How to use Wosp
The basic format of a Wosp command is as follows:
$ wosp "QUERY" file1.txt file2.txt ...
Queries are how searchers find information using Wosp's query language. The query language is influenced by the query language used at the United States Patent and Trademark Office. The USPTO's query language is an implementation of the query language used in BRS/Search, which was a copy of the query language used in IBM STAIRS. That said, Wosp's query language includes many unique features not found in previous systems.
To search for a keyword in some documents, a searcher just needs to use the word:
$ wosp "detective" A_Study_in_Scarlet.txt
A_Study_in_Scarlet.txt:816:am the only one in the world. I'm a consulting detective, if you can understand
...
All of my examples come from the first Sherlock Holmes book, A Study in Scarlet.
The wildcard character is ?. That character matches any character in the
text. Wosp also supports truncation using either the dollar sign $ or the
number sign # followed optionally by a number for the maximum number of
characters to include. Searchers should include the number or else performance
may suffer, though. Wildcards and truncation allow for powerful queries that
can match many different words with one wildcard given:
$ wosp "#2m?n" A_Study_in_Scarlet.txt
A_Study_in_Scarlet.txt:286:free as an income of eleven shillings and sixpence a day will permit a man to
...
#2m?n matches "man", "men", "woman", "women", and "human" all in a single
term.
Wosp supports different case sensitivity options and fuzzy search with a given edit distance. The edit distance is the number of errors that Wosp can correct. It widens the search for keywords and can find typos, but too much edit distance may find too many results.
The default is case insensitive search with an edit distance of zero.
Searchers control the case sensitivity and edit distance using the search
option commands ICASE (insensitive case), SCASE (sensitive case), LCASE
(lowercase), UCASE (uppercase), and TCASE (title case or more specifically
start case, where the initial letter is capitalized). These commands must
precede either a wildcard term or parentheses.
$ wosp "TCASE holmes" A_Study_in_Scarlet.txt
...
A_Study_in_Scarlet.txt:425:"Dr. Watson, Mr. Sherlock Holmes," said Stamford, introducing us.
...
To specify the edit distance with one of these commands, add the edit distance number to the end of the command:
$ wosp "ICASE1 detective" A_Study_in_Scarlet.txt
...
A_Study_in_Scarlet.txt:878:months or so. It might be made a text-book for detectives to teach them what to
...
In this case, Wosp returned the word "detectives" since it is one error from
"detective". A better way to include plurals is to use truncation like with
detective#1, since it is more explicit about where the extra character goes.
Wosp supports Boolean operations. Boolean operations return all matches for
documents that meet certain conditions. The four Boolean operators are OR,
AND, NOT, and XOR.
OR returns all matches for any term given in all documents. Like all Boolean
operations, it takes two arguments. Searchers can chain together multiple OR
statements to produce long lists of terms to search all at once.
$ wosp "detective#1 OR officer#1" A_Study_in_Scarlet.txt
A_Study_in_Scarlet.txt:260:however, with many other officers who were in the same situation as myself, and
A_Study_in_Scarlet.txt:816:am the only one in the world. I'm a consulting detective, if you can understand what that is. Here in London we have lots of Government detectives and lots of
...
Searchers can string together many OR operations in a row.
$ wosp "detective#1 OR officer#1 OR police OR policem?n OR constable#1" A_Study_in_Scarlet.txt
A_Study_in_Scarlet.txt:260:however, with many other officers who were in the same situation as myself, and
A_Study_in_Scarlet.txt:487:might start a paper on those lines. Call it the Police News of the Past."
...
AND returns results from files that contain both terms searched for. It is
good for limiting the number of files to be searched. NOT returns results
from files that contain the first term but not the second. XOR works
similarly to OR but only returns documents with one of the terms given in the
sequence of XOR operations.
Wosp supports OR as the default operator. Default operators are a feature of
some previous search systems, but I personally do not use them since I prefer
explicit queries. Due to its support for default operators, Wosp treats
queries like (first second third) as (first OR second OR third). It does
not raise an error for a missing operator. It is possible to change the
default operator but that requires re-compiling Wosp at the moment.
Wosp supports using parentheses to nest operations. Parentheses are on the highest precedence level, so they are evaluated first, just like in arithmetic operations. Parentheses allow for searchers to create expressive and precise queries that join many different ideas into a single query. This feature saves advanced searchers time by preventing the needless repetition of many nearly identical queries.
Wosp supports many different proximity operations. Proximity
search is a search
technique that allows searchers to find terms that are close to each other in
the text. Closeness typically implies that the terms are related, so proximity
search ensures that searchers find relevant terms in the same context rather
than in different parts of a long document. In that sense, proximity search
works like a more limiting form of the Boolean AND.
The most basic form of proximity search is phrase search in quotes, which is typically the only form of proximity search most search engines support today.
$ wosp "'police constable'" A_Study_in_Scarlet.txt
A_Study_in_Scarlet.txt:1070:and against this wall was leaning a stalwart police constable, surrounded by a
This query is the same as police ADJ constable. ADJ is the adjacency
operator. It finds neighboring words in the order given. One advantage of
using ADJ over quotes is that queries can support Boolean operations as
needed. For example, using parentheses to group terms, a searcher can search
for police ADJ (constable#1 OR officer#1 OR detective#1 OR m?n) to broaden
their search without having to type police multiple times.
The other proximity operators do not have a given order like ADJ. Instead,
they search in both directions within a given series of language elements.
NEAR searches neighboring words, AMONG searches in the same clause, WITH
searches in the same sentence, ALONG searches in the same line, and SAME
searches in the same paragraph.
$ wosp "detective#1 WITH (case#1 OR evidence)" A_Study_in_Scarlet.txt
A_Study_in_Scarlet.txt:1360:thing which may help you in the case," he continued, turning to the two detectives. "There has been murder done, and the murderer was a man. He was
A_Study_in_Scarlet.txt:2217:"When Mrs. Charpentier paused," the detective continued, " I saw that the whole case hung upon one point. Fixing her with my eye in a way which I always found
...
Like the search option operators, all proximity operators support added numbers at the end, and those numbers specify how many neighboring language elements to search. The default is one language element. The following example searches for several terms to within 5 words of each other:
$ wosp "(private OR police OR government) NEAR5 (constable#1 OR officer#1 OR detective#1)" A_Study_in_Scarlet.txt
A_Study_in_Scarlet.txt:817:what that is. Here in London we have lots of Government detectives and lots of private ones. When these fellows are at fault they come to me, and I manage to
A_Study_in_Scarlet.txt:1070:and against this wall was leaning a stalwart police constable, surrounded by a
A_Study_in_Scarlet.txt:2006:"It's the Baker Street division of the detective police force," said my
...
Some systems support proximity search, but only between neighboring words. In these systems, it is impossible to nest proximity searches using parentheses. Wosp allows nested proximity search without limitation.
For example, searchers can search for proximity matches in proximity to other terms or other proximity matches.
$ wosp "night#1 WITH (bod#3 NEAR4 victim#1)" A_Study_in_Scarlet.txt
A_Study_in_Scarlet.txt:2264:killed him without leaving any mark. The night was so wet that no one was about, so Charpentier dragged the body of his victim into the empty house. As
Wosp's query language is designed to be expressive like this. It is designed for advanced users who want power and precision.
Wosp also supports negated proximity operators (NOT WITH, etc.). These are
useful on occasion but I often avoid them unless I have a compelling reason to
use them.
How Wosp works: inverted indices (technical discussion)
On a high level, Wosp performs the following steps:
-
Read all documents into doubly-linked lists where each list element is a word.
-
Create a full inverted index.
-
Lex a query, parse it using a recursive descent parser, and evaluate it using the full inverted index.
-
Output the results of the query.
The first step is the easiest to understand. All words in the doubly-linked list then have known neighbors and positions in each document, as depicted below.
The second step requires more explanation. An inverted index is an index for every word used in a series of documents, akin to the index in the back of books. Given a particular word, the index allows searchers to find every instance of that word. Once a searcher has built an inverted index, it is trivial to find any given word. The index especially speeds up queries that require finding many different words, since it stores all possible word locations. With an inverted index, the rest of the program only needs to perform various operations on any searched terms. Wosp performs these operations using an interpreter that turns search queries into a list of operations to perform in a tree of operations that ends with wildcard searches, which just look up terms in the inverted index. The search option commands modify how wildcard search occurs.
The doubly-list list of all words means that every word has a specified position in each document. Building the inverted index then is just a matter of going through each document again and noting the location of each word. Wosp does this using a trie data structure, as depicted below.
A trie is a special type of tree data structure where each node contains a letter of a word and all edges lead to the next possible letter. For example, the word "works" starts at "w", continues does the edge for "o", then "r", then takes the edge "k" instead of "d", then "s", and then finally arrives at a list of all positions of all words "works" in all documents. Wosp does not implement these positions as integers but instead as pointers, but the same principle applies.
There are certainly more efficient ways to create an inverted index, but I stuck with a trie due to its simplicity.
Wosp does not support stop words, so all words are indexed (including extremely common words). I made this decision to keep Wosp lean.
Indexed search is entirely different than the sequential search methods used by programs like grep. On a high level, grep looks for a pattern in each line of text in sequence (I realize that some implementations do not implement this idea literally). Sequential search does not require an overarching data structure for organizing that search. Both approaches have their advantages and disadvantages. The primary advantage of sequential search methods is speed, since it may be possible to read each document once and return results while reading the document. It is possible to implement proximity search using sequential search methods, but it is easier to implement nested proximity search using an indexed search method. This observation is easy to see once you realize that nested queries may require multiple passes over the same document anyway, which defeats the purpose of a sequential search algorithm.
How Wosp works: query language (technical discussion)
I designed Wosp for advanced users who know what they want to search. To that end, Wosp queries do exactly what the user requests --- no more and no less. Wosp does not engage in any query expansion, so it does not find any plurals or synonyms, does not perform any stemming (unless explicitly requested using trailing truncation), and does not correct any spelling errors (unless explicitly requested using a higher edit distance). This design choice keeps Wosp lean and grants searchers more power.
I developed the query language's order of operations (precedence) slowly. I initially intended to follow previous efforts like BRS/Search and IBM STAIRS, but despite extensive searching on my part, I never found any formal documentation for the order of operations for the query language of those programs. I did find three references that provided incomplete information about the order of operations of these systems, but the information was contradictory. For example, The USPTO's Patent Public Search notes that the the precedence of proximity operators goes from the largest language elements (lowest precedence) to the smallest language elements (highest precedence), but it does not comment on the precedence of any Boolean operators. I found two other third-party documents that list the order of operations for IBM STAIRS (Gagjokov et al. 1976 and Salton and McGill 1983), but their reported Boolean operator precedence contradicts each other.
In the end, I implemented the same order of operations for the Boolean operators as for logical connectives in mathematics. I credit my mother for recommending this. This choice means that Wosp may not behave the same as previous systems, but users at least know how it does behave given its explicit order of operations.
The operator precedence levels are given below.
-
Atoms: wildcards, parentheses (nested queries).
-
Search options:
ICASE,LCASE,SCASE,TCASE,UCASE. -
Adjacency operations:
ADJ,NOT ADJ, phrases (quotes). -
Word proximity operations:
NEAR,NOT NEAR. -
Clause proximity operations:
AMONG,NOT AMONG. -
Sentence proximity operations:
WITH,NOT WITH,ALONG,NOT ALONG. -
Paragraph proximity operations:
SAME,NOT SAME. -
Boolean negation:
NOT. -
Boolean conjunction:
AND. -
Boolean disjunction:
OR,XOR.
Future work
As much as I want to keep Wosp clean and minimal in its implementation, there are still many features that I want to include.
The most immediate features that I want to add:
-
Different character encodings. Wosp currently only supports ASCII well. Searchers can search for words with international characters, and use them with Boolean and proximity operations, but wildcards do not work properly with international characters. Eventually Wosp needs to support international characters properly from the ground up.
-
Command-line options. I have endeavored to put as many of the options into the query language itself rather than into command line arguments, but it should also be possible to control several options at the command line too. I have preferred extending the query language because this provides much more expressiveness overall, since command line options may be limited to global options.
-
Highlighted keywords. At the moment, Wosp does not highlight keywords, but it should. The match data structures already contain all of the information for highlighting.
-
Input options. Wosp must include options for how to split words and treat punctuation. At the moment, Wosp defines words as anything between whitespace, but some systems support additional ways to define word boundaries. For example, I should add an option to determine whether hyphenated words count as one word or two.
-
Output options. I have included some additional output options already, but these are hardcoded and require re-compilation to enable at the moment. That is obviously quite limiting. At the moment, Wosp does not support ranking the results in any form at all. It should be possible to specify how to rank the results, especially since this grants the user far more control than what many web search engines grant at the moment.
Other features that I may add eventually:
-
Additional operations. I have some ideas for additional operations, but I am wary of adding any that do not have a particular use case. Wosp already supports page numbers but lacks any operator for page-based proximity search. This would be trivial to add if needed. I have considered
UPONas the syntax for this. -
Field support. Fields are information like titles, abstracts, the list of authors, and other information about a document. Wosp already divides each document up into individual fields, but it lacks any facility to select individual fields. At the moment, the only field loaded is the text itself. One issue is that text files typically do not have any additional fields (other than the filename, file extension, and other file properties), but PDF files or Markdown files might have some, though I am debating whether direct support for PDF files is in the scope of the program or not.
-
Interactive REPL. Given all of the overhead involved in loading documents, I would appreciate the ability to perform multiple searches without having to re-create the inverted index, especially to refine the results. A REPL allows that. A REPL should also introduce the ability to modify previous searches using variables, like what the USPTO's Patent Public Search allows with L numbers.