Wosp: advanced full-text search on the command line

2025-11-04T00:00:00+00:00

Summary</h2>
Wosp is a command-line program that performs full-text search on text documents. Wosp stands for word-oriented search and print. It is designed for advanced searchers. It works differently than line-oriented search tools like grep, so it can search for matches spanning multiple lines. Wosp supports an expressive query language that contains both Boolean and proximity operators. It also supports nested queries, truncation, wildcard characters, and fuzzy searching.

Git repository</h2>
The source code is available on GitHub</a>. I have licensed the project under the GPL version 3. The README file in the repository covers the build and installation instructions.
Motivations and requirements</h2>
I am a scientist and a researcher. As part of my research, I often search for obscure information and hard-to-define concepts. Some of my searches occur locally. Over the years, I have accumulated a substantial amount of documents that are unavailable online, so I also depend on being able to search for information locally.
Search engines often appear magical. Searchers, like wizards, summon the relevant information with the right incantation (well, query). Of course, search is not magic, but the seemingly miraculous nature of search makes it difficult at times to grasp what went wrong when a searcher cannot find something. Should I narrow this search more? Did I use the wrong term? How does a search engine work anyway?
In the last few years, people have often discussed how "search is broken" in some way. This statement usually applies to online search engines, especially Google. Several factors have contributed to the drop in search quality: extensive search engine optimization</a>, alleged attempts by Google to increase advertising revenue</a>, or the recent introduction of AI overviews hindering deeper searching. Searching on Google now feels adversarial at times, like you have to trick the system to get good results.
The common issue in all three of these factors is that searchers have little control over the results Google returns. That realization made me wonder if I can create some search tool that puts me and other searchers firmly in control of our own searches.
Searchers need repeatable tools that work reliably and do not degrade over time as external incentives change. Ideally, these tools should work locally to search private information. Search tools should also be easy to understand, so that searchers can "look under the hood" and improve the tools as needed. That removes the magical feeling, but it was never magic to begin with. Searchers do not have to accept broken products that operate like magical and untouchable black boxes.
Initially, I did not want to create another program, so I searched for something. I found several interesting programs but concluded that no existing program satisfies my requirements.
I need a search program with the following features:

Command-line operation. The program must work according to the Unix philosophy</a>. </li>

Simple and clear program architecture. The program must have a well-defined scope to avoid feature creep. I want a simple and clear program architecture that follows guidelines like the Suckless philosophy</a> among others. </li>

Boolean and proximity operators. The program must have sentence and paragraph proximity operators. IBM and Data Central (now LexisNexis) independently invented proximity search</a> in the 1960s (Bourne 2003</a>). Proximity search ensures that a document discusses two concepts in the same context, which matters when searching especially long documents. </li>

Wildcard search. Wildcard search</a> enables searching for a variety of terms compactly. </li>

Fuzzy search. Fuzzy search</a> helps search poorly OCRed documents and locates relevant documents even if they have typos. </li>

Nested queries. Support for nested queries means that the query language should allow for the arbitrary combination of all operations. For example, some programs only allow proximity search for words and not for phrases or groups of words. Nested queries make the query language expressive by enabling searchers to search for more precise concepts. </li>

No stored index required. The program must not require a stored index, so it can be run on any arbitrary collection of documents. I need to use the program on any bit of text rather than on a limited library of documents. </li> </ul>
I have considered various local search tools, but none met all of these requirements. agrep</a>, pdfgrep</a>, and ripgrep</a> all follow the basic model provided by grep</a> of searching for regular expressions on the command line without any stored indices. qgrep</a> adds stored indices to speed up the search. Recoll</a>, Aleph</a>, and Open Semantic Search</a> operate using a graphical-user interface and require stored indices.

A Venn diagram comparing the features of various search programs, including their support for fuzzy or proximity search and their use of stored indices.</figcaption> </figure>
The above Venn diagram compares the features of these local search tools. The Venn diagram excludes several variants of grep due to them lacking fuzzy and proximity search. Aleph is the only existing local search program with both fuzzy and proximity search</a>, but Aleph uses a client-server architecture, making it better suited for organizations than individuals. Recoll requires far less setup and preparation but still requires a stored index. Aleph and Recoll both have the weak form of proximity search that disallows nested proximity operations.
Since no existing program met my requirements, I decided to write my own full-text search program designed for advanced searchers. The next section goes into how to use Wosp.
How to use Wosp</h2>
The basic format of a Wosp command is as follows:
$ wosp "QUERY" file1.txt file2.txt ... </code></pre>Queries are how searchers find information using Wosp's query language. The query language is influenced by the query language used at the United States Patent and Trademark Office</a>. The USPTO's query language is an implementation of the query language used in BRS/Search</a>, which was a copy of the query language used in IBM STAIRS</a>. That said, Wosp's query language includes many advanced features not found in many other systems. To search for a keyword in some documents, a searcher just needs to use the word: $ wosp "detective" A_Study_in_Scarlet.txt A_Study_in_Scarlet.txt:816:am the only one in the world. I'm a consulting detective, if you can understand ... </code></pre> All of my examples come from the first Sherlock Holmes book, A Study in Scarlet</a>. The wildcard character is ?</code>. That character matches any character in the text. Wosp also supports truncation using either the dollar sign $</code> or the number sign #</code> followed optionally by a number for the maximum number of characters to include. Searchers should include the number or else performance may suffer, though. Wildcards and truncation allow for powerful queries that can match many different words with one wildcard given: $ wosp "#2m?n" A_Study_in_Scarlet.txt A_Study_in_Scarlet.txt:286:free as an income of eleven shillings and sixpence a day will permit a man to ... </code></pre> #2m?n</code> matches "man", "men", "woman", "women", and "human" all in a single term. Wosp supports different case sensitivity options and fuzzy search with a given edit distance. The edit distance is the number of errors that Wosp can correct. It widens the search for keywords and can find typos, but too much edit distance may find too many results. The default is case insensitive search with an edit distance of zero. Searchers control the case sensitivity and edit distance using the search option commands ICASE</code> (insensitive case), SCASE</code> (sensitive case), LCASE</code> (lowercase), UCASE</code> (uppercase), and TCASE</code> (title case or more specifically start case, where the initial letter is capitalized). These commands must precede either a wildcard term or parentheses. $ wosp "TCASE holmes" A_Study_in_Scarlet.txt ... A_Study_in_Scarlet.txt:425:"Dr. Watson, Mr. Sherlock Holmes," said Stamford, introducing us. ... </code></pre> To specify the edit distance with one of these commands, add the edit distance number to the end of the command: $ wosp "ICASE1 detective" A_Study_in_Scarlet.txt ... A_Study_in_Scarlet.txt:878:months or so. It might be made a text-book for detectives to teach them what to ... </code></pre> In this case, Wosp returned the word "detectives" since it is one error from "detective". A better way to include plurals is to use truncation like with detective#1</code>, since it is more explicit about where the extra character goes. Wosp supports Boolean operations. Boolean operations return all matches for documents that meet certain conditions. The four Boolean operators are OR</code>, AND</code>, NOT</code>, and XOR</code>. OR</code> returns all matches for any term given in all documents. Like all Boolean operations, it takes two arguments. Searchers can chain together multiple OR</code> statements to produce long lists of terms to search all at once. $ wosp "detective#1 OR officer#1" A_Study_in_Scarlet.txt A_Study_in_Scarlet.txt:260:however, with many other officers who were in the same situation as myself, and A_Study_in_Scarlet.txt:816:am the only one in the world. I'm a consulting detective, if you can understand what that is. Here in London we have lots of Government detectives and lots of ... </code></pre> Searchers can string together many OR</code> operations in a row. $ wosp "detective#1 OR officer#1 OR police OR policem?n OR constable#1" A_Study_in_Scarlet.txt A_Study_in_Scarlet.txt:260:however, with many other officers who were in the same situation as myself, and A_Study_in_Scarlet.txt:487:might start a paper on those lines. Call it the Police News of the Past." ... </code></pre> AND</code> returns results from files that contain both terms searched for. It is good for limiting the number of files to be searched. NOT</code> returns results from files that contain the first term but not the second. XOR</code> works similarly to OR</code> but only returns documents with one of the terms given in the sequence of XOR</code> operations. Wosp supports OR</code> as the default operator. Default operators are a feature of some previous search systems, but I personally do not use them since I prefer explicit queries. Due to its support for default operators, Wosp treats queries like (first second third)</code> as (first OR second OR third)</code>. It does not raise an error for a missing operator. It is possible to change the default operator but that requires re-compiling Wosp at the moment. Wosp supports using parentheses to nest operations. Parentheses are on the highest precedence level, so they are evaluated first, just like in arithmetic operations. Parentheses allow for searchers to create expressive and precise queries that join many different ideas into a single query. This feature saves advanced searchers time by preventing the needless repetition of many nearly identical queries. Wosp supports many different proximity operations. Proximity search</a> is a search technique that allows searchers to find terms that are close to each other in the text. Closeness typically implies that the terms are related, so proximity search ensures that searchers find relevant terms in the same context rather than in different parts of a long document. In that sense, proximity search works like a more limiting form of the Boolean AND</code>. The most basic form of proximity search is phrase search in quotes, which is typically the only form of proximity search most search engines support today. $ wosp "'police constable'" A_Study_in_Scarlet.txt A_Study_in_Scarlet.txt:1070:and against this wall was leaning a stalwart police constable, surrounded by a </code></pre> This query is the same as police ADJ constable</code>. ADJ</code> is the adjacency operator. It finds neighboring words in the order given. One advantage of using ADJ</code> over quotes is that queries can support Boolean operations as needed. For example, using parentheses to group terms, a searcher can search for police ADJ (constable#1 OR officer#1 OR detective#1 OR m?n)</code> to broaden their search without having to type police</code> multiple times. The other proximity operators do not have a given order like ADJ</code>. Instead, they search in both directions within a given series of language elements. NEAR</code> searches neighboring words, AMONG</code> searches in the same clause, WITH</code> searches in the same sentence, ALONG</code> searches in the same line, and SAME</code> searches in the same paragraph. $ wosp "detective#1 WITH (case#1 OR evidence)" A_Study_in_Scarlet.txt A_Study_in_Scarlet.txt:1360:thing which may help you in the case," he continued, turning to the two detectives. "There has been murder done, and the murderer was a man. He was A_Study_in_Scarlet.txt:2217:"When Mrs. Charpentier paused," the detective continued, " I saw that the whole case hung upon one point. Fixing her with my eye in a way which I always found ... </code></pre> Like the search option operators, all proximity operators support added numbers at the end, and those numbers specify how many neighboring language elements to search. The default is one language element. The following example searches for several terms to within 5 words of each other: $ wosp "(private OR police OR government) NEAR5 (constable#1 OR officer#1 OR detective#1)" A_Study_in_Scarlet.txt A_Study_in_Scarlet.txt:817:what that is. Here in London we have lots of Government detectives and lots of private ones. When these fellows are at fault they come to me, and I manage to A_Study_in_Scarlet.txt:1070:and against this wall was leaning a stalwart police constable, surrounded by a A_Study_in_Scarlet.txt:2006:"It's the Baker Street division of the detective police force," said my ... </code></pre> Some systems support proximity search, but only between neighboring words. In these systems, it is impossible to nest proximity searches using parentheses. Wosp allows nested proximity search without limitation. For example, searchers can search for proximity matches in proximity to other terms or other proximity matches. $ wosp "night#1 WITH (bod#3 NEAR4 victim#1)" A_Study_in_Scarlet.txt A_Study_in_Scarlet.txt:2264:killed him without leaving any mark. The night was so wet that no one was about, so Charpentier dragged the body of his victim into the empty house. As </code></pre> Wosp's query language is designed to be expressive like this. It is designed for advanced users who want power and precision. Wosp also supports negated proximity operators (NOT WITH</code>, etc.). These are useful on occasion but I often avoid them unless I have a compelling reason to use them. How Wosp works: inverted indices (technical discussion)</h2> On a high level, Wosp performs the following steps: Read all documents into doubly-linked lists where each list element is a word. </li> Create a full inverted index</a>. </li> Lex a query, parse it using a recursive descent parser</a>, and evaluate it using the full inverted index. </li> Output the results of the query. </li> </ol> The first step is the easiest to understand. All words in the doubly-linked list then have known neighbors and positions in each document, as depicted below. A document represented as a doubly-linked list of words with given positions.</figcaption> </figure> The second step requires more explanation. An inverted index is an index for every word used in a series of documents, akin to the index in the back of books. Given a particular word, the index allows searchers to find every instance of that word. Once a searcher has built an inverted index, it is trivial to find any given word. The index especially speeds up queries that require finding many different words, since it stores all possible word locations. With an inverted index, the rest of the program only needs to perform various operations on any searched terms. Wosp performs these operations using an interpreter that turns search queries into a list of operations to perform in a tree of operations that ends with wildcard searches, which just look up terms in the inverted index. The search option commands modify how wildcard search occurs. The doubly-list list of all words means that every word has a specified position in each document. Building the inverted index then is just a matter of going through each document again and noting the location of each word. Wosp does this using a trie data structure, as depicted below. An inverted index going from a given word to all of its positions in a document.</figcaption> </figure> A trie is a special type of tree data structure where each node contains a letter of a word and all edges lead to the next possible letter. For example, the word "works" starts at "w", continues does the edge for "o", then "r", then takes the edge "k" instead of "d", then "s", and then finally arrives at a list of all positions of all words "works" in all documents. Wosp does not implement these positions as integers but instead as pointers, but the same principle applies. There are certainly more efficient ways to create an inverted index, but I stuck with a trie due to its simplicity. Wosp does not support stop words</a>, so all words are indexed (including extremely common words). I made this decision to keep Wosp lean. Indexed search is entirely different than the sequential search methods used by programs like grep. On a high level, grep looks for a pattern in each line of text in sequence (I realize that some implementations</a> do not implement this idea literally). Sequential search does not require an overarching data structure for organizing that search. Both approaches have their advantages and disadvantages. The primary advantage of sequential search methods is speed, since it may be possible to read each document once and return results while reading the document. It is possible to implement proximity search</a> using sequential search methods, but it is easier to implement nested proximity search using an indexed search method. This observation is easy to see once you realize that nested queries may require multiple passes over the same document anyway, which defeats the purpose of a sequential search algorithm. How Wosp works: query language (technical discussion)</h2> Previous work</a> has divided query languages into two classes: languages that are powerful for advanced searchers but difficult for most searchers, and languages that are weak for advanced searchers but easy for most searchers. I designed Wosp for advanced users who know what they want to search. To that end, Wosp queries do exactly what the user requests --- no more and no less. Wosp does not engage in any query expansion</a>, so it does not find any plurals or synonyms, does not perform any stemming (unless explicitly requested using trailing truncation), and does not correct any spelling errors (unless explicitly requested using a higher edit distance). This design choice keeps Wosp lean and grants searchers more power. I developed the query language's order of operations (precedence) slowly. I initially intended to follow previous efforts like BRS/Search and IBM STAIRS, but despite extensive searching on my part, I never found any formal documentation for the order of operations for the query language of those programs. I did find three references that provided incomplete information about the order of operations of these systems, but the information was contradictory. For example, The USPTO's Patent Public Search</a> notes that the the precedence of proximity operators goes from the largest language elements (lowest precedence) to the smallest language elements (highest precedence), but it does not comment on the precedence of any Boolean operators. I found two other third-party documents that list the order of operations for IBM STAIRS (Gagjokov et al. 1976</a> and Salton and McGill 1983</a>), but their reported Boolean operator precedence contradicts each other. In the end, I implemented the same order of operations for the Boolean operators as for logical connectives</a> in mathematics. I credit my mother for recommending this. This choice means that Wosp may not behave the same as previous systems, but users at least know how it does behave given its explicit order of operations. The operator precedence levels are given below. Atoms: wildcards, parentheses (nested queries). </li> Search options: ICASE</code>, LCASE</code>, SCASE</code>, TCASE</code>, UCASE</code>. </li> Adjacency operations: ADJ</code>, NOT ADJ</code>, phrases (quotes). </li> Word proximity operations: NEAR</code>, NOT NEAR</code>. </li> Clause proximity operations: AMONG</code>, NOT AMONG</code>. </li> Sentence proximity operations: WITH</code>, NOT WITH</code>, ALONG</code>, NOT ALONG</code>. </li> Paragraph proximity operations: SAME</code>, NOT SAME</code>. </li> Boolean negation: NOT</code>. </li> Boolean conjunction: AND</code>. </li> Boolean disjunction: OR</code>, XOR</code>. </li> </ol> Future work</h2> As much as I want to keep Wosp clean and minimal in its implementation, there are still many features that I want to include. The most immediate features that I want to add: Different character encodings. Wosp currently only supports ASCII well. Searchers can search for words with international characters, and use them with Boolean and proximity operations, but wildcards do not work properly with international characters. Eventually Wosp needs to support international characters properly from the ground up. </li> Command-line options. I have endeavored to put as many of the options into the query language itself rather than into command line arguments, but it should also be possible to control several options at the command line too. I have preferred extending the query language because this provides much more expressiveness overall, since command line options may be limited to global options. </li> Highlighted keywords. At the moment, Wosp does not highlight keywords, but it should. The match data structures already contain all of the information for highlighting. </li> Input options. Wosp must include options for how to split words and treat punctuation. At the moment, Wosp defines words as anything between whitespace, but some systems support additional ways to define word boundaries. For example, I should add an option to determine whether hyphenated words count as one word or two. </li> Output options. I have included some additional output options already, but these are hardcoded and require re-compilation to enable at the moment. That is obviously quite limiting. At the moment, Wosp does not support ranking the results in any form at all. It should be possible to specify how to rank the results, especially since this grants the user far more control than what many web search engines grant at the moment. </li> </ul> Other features that I may add eventually: Additional operations. I have some ideas for additional operations, but I am wary of adding any that do not have a particular use case. Wosp already supports page numbers but lacks any operator for page-based proximity search. This would be trivial to add if needed. I have considered UPON</code> as the syntax for this. </li> Field support. Fields are information like titles, abstracts, the list of authors, and other information about a document. Wosp already divides each document up into individual fields, but it lacks any facility to select individual fields. At the moment, the only field loaded is the text itself. One issue is that text files typically do not have any additional fields (other than the filename, file extension, and other file properties), but PDF files or Markdown files might have some, though I am debating whether direct support for PDF files is in the scope of the program or not. </li> Interactive REPL. Given all of the overhead involved in loading documents, I would appreciate the ability to perform multiple searches without having to re-create the inverted index, especially to refine the results. A REPL</a> allows that. A REPL should also introduce the ability to modify previous searches using variables, like what the USPTO's Patent Public Search</a> allows with L numbers</a>. </li> </ul>

Andrew Trettel's website

Wosp: advanced full-text search on the command line

Research

Welcome to my website

Contact

Andrew Trettel's website

Wosp: advanced full-text search on the command line

How Wosp works: inverted indices (technical discussion)</h2> On a high level, Wosp performs the following steps:</p> Read all documents into doubly-linked lists where each list element is a word.</p> </li>

Research

Welcome to my website

Website design</h2> All design is about tradeoffs. I wanted a website that I did not need to spend much time maintaining, focused on the content, but nonetheless is visually distinctive. In "good, fast, cheap, pick two", I picked "good" and "cheap".</p>

Contact