TL;DR:
The Problem with AI Apps Today
Most AI apps share the same playbook: lock you into their AIs, cap your usage with arbitrary limits, and obscure what's actually happening under the hood. They rely on an "all you can eat" pricing model, which creates perverse incentives:
- They use the cheapest AI model you'll tolerate.
- They restrict your access to better models with rate limits.
- They deliberately limit how much context they send to save costs.
- They hide these limitations behind opaque interfaces that don't give you control.
The result? Your AI assistant suddenly becomes unavailable during critical work. It mysteriously "forgets" context you know you provided. It gives inconsistent answers with no way to understand why. You tell it to search, and instead, it just answers with what it already knows.
An Evidence-Based Alternative
We ran rigorous benchmarks against professional medical and legal reasoning tasks - the kind where getting things wrong has real consequences. We randomly selected 150 questions from LegalBench and USMLE Steps I, II, and III.
The results were striking:
Service | Accuracy | Model | Web Access | Price |
---|---|---|---|---|
Perplexity Free | 66% | Perplexity | ✓ | free |
Telosnex Free | 80% | Llama 3.1 8B | ✓ | free* |
ChatGPT | 84% | GPT-4o | - | free |
Anthropic | 86% | Sonnet 3.5 | - | $20/month |
Perplexity Pro | 86% | GPT-4o | ✓ | $20/month |
ChatGPT Search | 88% | GPT-4o (custom) | ✓ | $20/month |
Telosnex | 95% | GPT-4o | ✓ | $10/month** |
* Used paid Search ($0.001/search).
** Plus usage at cost. GPT-4o is ~$0.40/hour at 250 wpm reading speed.
The Technical Edge
Three key technical decisions drive our superior results:
- Mandatory Search: Unlike other AI search tools that let the AI decide when to search (which actually decreases accuracy), we always retrieve relevant context (see the sketch after this list). Our research shows this provides crucial domain-specific language that reduces perplexity in the final reasoning step.
- Full Context: We send complete context to the AI, not arbitrarily truncated versions. This costs us more but produces materially better results.
- Provider Flexibility: Run locally on your machine with free open models, use your own API keys, or purchase credits at cost. No artificial barriers.
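A minimal sketch of the always-retrieve flow from the first point, assuming hypothetical `searchWeb` and `askModel` helpers rather than actual Telosnex APIs:

```dart
/// Sketch of the "mandatory search" flow: retrieval always runs before
/// generation, and the full (untruncated) results are sent along.
Future<String> answer(String prompt) async {
  final results = await searchWeb(prompt);  // always search; the model never decides
  final context = results.join('\n\n');     // full results, no truncation to save cost
  return askModel([
    prompt,
    context,
    'Answer the request above using only the relevant search results.',
  ]);
}

// Hypothetical stand-ins so the sketch is self-contained.
Future<List<String>> searchWeb(String query) async => <String>[];
Future<String> askModel(List<String> messages) async => '';
```

The structural point is that retrieval is unconditional: the model is never given the option to skip the search step.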
Honest Economics
We run a straightforward business:
- $10/month membership fee.
- Usage billed at our cost.
- ~15 hours of basic AI per dollar.
- ~1 hour of premium AI per dollar (rough arithmetic sketched below).
- Free if you use your own compute or API keys.
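A rough sketch of the arithmetic behind those per-dollar figures; the token price and tokens-per-word ratio below are assumptions for illustration, not billing rates:

```dart
void main() {
  const wordsPerMinute = 250.0; // reading speed used in the TL;DR footnote
  const tokensPerWord = 1.3; // assumed average for English text
  const outputPricePerMillionTokens = 10.0; // assumed premium-model output price, in dollars

  final tokensPerHour = wordsPerMinute * 60 * tokensPerWord; // ~19,500
  final outputCostPerHour =
      tokensPerHour / 1e6 * outputPricePerMillionTokens; // ~$0.20
  // Input tokens (search results, chat history) add to this; the total per hour
  // lands between the ~$0.40/hour GPT-4o estimate above and the ~1 hour of
  // premium AI per dollar listed here, depending on context size.
  print('~\$${outputCostPerHour.toStringAsFixed(2)}/hour before input tokens');
}
```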
We're profitable at these rates because we're a small, efficient team focused on technical excellence that makes a difference for real professionals, rather than on growth hacking. We believe that if we provide genuine value, word will spread organically.
The Evidence
Our benchmarks use:
- Professional medical licensing exam questions (MedQA)
- Legal reasoning tasks (LegalBench)
- Proper sample sizes for 99.9% confidence
- Automated guards against "cheating" via direct answer lookup
- Full methodology and results published at [links to spreadsheets]
Technical Details
The remaining sections detail our methodology, correct for potential biases, and provide raw data for independent verification. The short version is: we're extremely confident in our results.
Benchmarks
Evaluation was performed in November 2024.
We compared Telosnex to other AI apps on benchmarks evaluating complex professional reasoning with high stakes and liability. The two datasets we used were LegalBench and MedQA, as both law and medicine require high levels of knowledge recall paired with complex reasoning.
Datasets
- LegalBench is an open, collaborative effort by the legal community to develop a benchmark for legal reasoning. It includes 162 types of tasks, each with many questions. It is maintained by the Hazy Research Group at Stanford University.
- MedQA is a collection of questions from professional medical licensing exams. It contains 13,000 questions from the United States Medical Licensing Examination (USMLE), and another 48,000 from Chinese and Taiwanese medical licensing exams.
Samples
Both datasets contain tens of thousands of questions and would be prohibitively expensive to run in their entirety. We calculated the sample size required to achieve a 95% confidence level and a 5% margin of error.
- LegalBench: Required a sample size of 194. A Dart script randomly selected 2 questions from each of the 162 tasks, for a total of 324 questions. We decided to use all 324 questions, rather than the 194-question sample, in order to provide full coverage of all tasks.
- MedQA: Required a sample size of 170. We used numbergenerator.org to generate random numbers between 1 and the total number of questions in the dataset, and then selected the first 170 questions using a Dart script.
Sample sizes were then refined based on the population proportion, i.e. the number of answers Telosnex got correct. The desired confidence level was increased from 95% to 99.9% for MedQA, and the 5% margin of error was maintained. This indicated a sample size of 98 for LegalBench and 85 for MedQA.
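For reference, a small Dart sketch of the standard sample-size calculation with a finite-population correction; the exact proportion, z-score, and correction behind the figures above aren't reproduced here, so the printed values are illustrative rather than a derivation of 194, 170, 98, or 85:

```dart
/// Standard (Cochran) sample-size formula with an optional finite-population
/// correction. z is the z-score for the desired confidence level (1.96 for 95%,
/// ~3.29 for 99.9%), p the expected proportion, e the margin of error,
/// and population the total number of questions in the dataset.
int sampleSize({
  required double z,
  required double p,
  required double e,
  int? population,
}) {
  final n0 = z * z * p * (1 - p) / (e * e);
  final n = population == null ? n0 : n0 / (1 + (n0 - 1) / population);
  return n.ceil();
}

void main() {
  // Worst-case proportion (p = 0.5), 95% confidence, 5% margin of error.
  print(sampleSize(z: 1.96, p: 0.5, e: 0.05)); // ~385
  // Same settings with a finite-population correction for ~13,000 questions.
  print(sampleSize(z: 1.96, p: 0.5, e: 0.05, population: 13000)); // ~374
}
```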
Providers
We ran the benchmarks on the leading AI providers as of November 2024.
- Perplexity: free for the base model.
- Perplexity Pro: $20/month. Set to use GPT-4o.
- Google's Gemini 1.5 Pro: $20/month. Accessed through gemini.google.com.
- Anthropic's Claude Sonnet 3.5: Released November 2024. Anthropic's top-tier AI. Accessed through claude.ai.
- OpenAI's GPT-4o: $20/month. Released August 2024. Accessed through OpenAI's API.
- OpenAI's SearchGPT: $20/month. Released November 2024. Accessed through chatgpt.com, with Search enabled for each question individually.
- Telosnex: Used with GPT-4o and Serper, a search API that uses Google's search engine. GPT-4o was used to provide clarity when comparing against OpenAI and Perplexity, which both can use GPT-4o.
- Telosnex Free: Used with an on-device AI, Llama 3.1 8B, and Serper for search results. ($0.001/search, one-tenth of a cent)
Question Formatting
0-shot for MedQA, some N-shot for LegalBench.
AI evaluations are often "N-shot", meaning the AI is given N examples of a task and then asked to perform the task. This increases scores but is not representative of real-world use.
Thus, we use 0-shot for MedQA. The alternative would be, for example, injecting 3 question/answer pairs before asking the question being evaluated.
For LegalBench, we use the question as written. Many of the 162 tasks are formatted as N-shot. For example:
- a task where the input is a clause from a contract, and the question is "Does the clause specify a minimum order size or minimum amount or units per time period that one party must buy from the counterparty?"
- a task where the input is a question in a courtroom transcript, and the task is to label the function it serves, where the choices are "Background, Clarification, Communicate, Criticism, Humor, Implications, Support"
Chain of Thought for MedQA, not for LegalBench.
Another core prompt engineering technique is to encourage the AI to "think" about the question, then answer it. We readily include this in our prompts, as it is simple and representative of real-world use.
We do not use this for LegalBench, as the questions are already formatted as N-shot. LegalBench is formatted as:
(question)
Provide the final label/answer that I left blank, that is all I need from you.
MedQA is formatted as:
(question)
Take a deep breath, work through the problem step by step, and after working through it, select your final answer
(answer choices)
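As a sketch, assembling that 0-shot chain-of-thought prompt could look like the following (an illustrative helper, not the exact evaluation harness):

```dart
/// Builds the MedQA prompt described above: the question, a chain-of-thought
/// nudge, then the answer choices. No example Q/A pairs are injected (0-shot).
String medQaPrompt(String question, List<String> choices) => [
      question,
      'Take a deep breath, work through the problem step by step, '
          'and after working through it, select your final answer',
      ...choices,
    ].join('\n\n');
```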
Search Results Prompt
Perplexity's prompt introducing search results is unknown. It seems to include snippets from up to 20 web pages from 3 search queries.
Telosnex's is documented below. It is the last message before the AI's answer.
The entire chat history submitted looks like:
- User: (prompt)
- User: (search results, up to 3000 words, 3 search queries, 10 documents)
- User: (search results prompt)
Telosnex Search Results Prompt
The information above is search results you'll be using the accomplish the following task:
# Incoming Request
{{ prompt }}
# Task
You must provide a response to the request, drawing only upon search results that are directly relevant to answering the question. It is crucial that you carefully evaluate the applicability of each search result and only include citations to those that provide valuable and pertinent information. Irrelevant search results should not be mentioned or cited under any circumstances. In case it's helpful when you're sifting through results, the current date is {{datetime}}, you don't need to mention that to the user unless they asked directly.
# Rules
- Do not repeat the intent, the user knows it already.
- Use advanced Markdown through the text for formatting.
- Assume the character of an expert professional familiar with the subjects in the intent.
- Speak authoritatively and concisely.
- Think carefully when reasoning about dates, ex. if a query is for upcoming events, that would imply it expects answers of events with dates *after* today.
- Thoroughly analyze each search result to determine its relevance to the request.
- Add citations -- using the format [^N^], where N is the snippet # of the citation.
- If you find yourself unable to locate search results that are indisputably relevant to the request and can be cited to support your response, simply provide a complete answer to the best of your knowledge without including any citations. It is better to submit a response without citations than to cite irrelevant information.
- Use your expert-level knowledge to supplement the search results. Don't share links from your knowledge, knowledge about current events / news, and any knowledge before 2021 - Its {{datetime}}, but you last learned new information in 2021. Rely on the search results for links and current events. Explain concisely, densely.
- Demonstrate your subject-level knowledge by sharing your step by step thinking through all factors mentioned in the intent.
- Include data points, p-values, and quantitative information mentioned in the articles, reinforcing the credibility of your writing.
- Make sure to cite your findings to the matching note you got them from.
- At the end of your response, DO NOT INCLUDE MARKDOWN FOOTNOTES. Pause silently when you have finished. I will then provide them.
- Reinforce credibility by actively pointing out things that could seem like errors: ex. referencing studies about children when providing an answer for an adult.
- Respond in well-formatted & well-designed Markdown.
The other AIs do not provide search results.
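A sketch of how that three-message history could be assembled, filling the {{ prompt }} and {{datetime}} placeholders; this is illustrative and not the actual Telosnex client code:

```dart
/// Builds the chat history described above: the user's prompt, the raw search
/// results, then the search-results prompt with its placeholders filled in.
List<Map<String, String>> buildMessages({
  required String prompt,
  required String searchResults, // up to ~3000 words from 3 queries / 10 documents
  required String searchPrompt,  // the template shown above
}) {
  final now = DateTime.now().toIso8601String();
  final filled = searchPrompt
      .replaceAll('{{ prompt }}', prompt)
      .replaceAll('{{datetime}}', now);
  return [
    {'role': 'user', 'content': prompt},
    {'role': 'user', 'content': searchResults},
    {'role': 'user', 'content': filled},
  ];
}
```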
Inference Settings
LLMs have settings that influence the AI's behavior, such as temperature, max tokens, and top-p.
Only Telosnex allows the user to change them. We used a temperature of 0.5 and 2000 max tokens for all providers.
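For concreteness, here is a sketch of those settings applied to OpenAI's chat completions API, which the GPT-4o runs used; the web-based providers were driven through their own interfaces, so treat this as illustrative:

```dart
import 'dart:convert';
import 'dart:io';

/// Sends a chat completion request with the benchmark settings:
/// temperature 0.5 and a 2000-token output cap.
Future<String> complete(String apiKey, List<Map<String, String>> messages) async {
  final body = jsonEncode({
    'model': 'gpt-4o',
    'messages': messages,
    'temperature': 0.5, // fixed across the benchmark
    'max_tokens': 2000, // fixed across the benchmark
  });
  final client = HttpClient();
  final request = await client
      .postUrl(Uri.parse('https://api.openai.com/v1/chat/completions'));
  request.headers.set('Content-Type', 'application/json');
  request.headers.set('Authorization', 'Bearer $apiKey');
  request.write(body);
  final response = await request.close();
  final text = await response.transform(utf8.decoder).join();
  client.close();
  return text;
}
```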
Fresh Context
The AI was given a fresh context for each question. This is a common practice in AI evaluations, as it is representative of real-world use.
Preventing Cheating
Telosnex was coded to exclude web page snippets that matched more than 5 consecutive words from the question. All providers were tested using Telosnex. Thus, all providers except Perplexity had an automated guard.
Perplexity was hand-checked several times across ~600 questions. We found only 1 instance of a web page snippet containing the question and answer. Perplexity's snippets were accessed by clicking the 3-dot menu below the answer, then clicking "View Sources".
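A minimal sketch of that guard, assuming a simple sliding-window word match; the production normalization may differ:

```dart
/// Returns true when [snippet] shares more than [maxSharedWords] consecutive
/// words with [question] (case- and punctuation-insensitive), meaning the
/// snippet should be excluded from the context sent to the model.
bool leaksQuestion(String question, String snippet, {int maxSharedWords = 5}) {
  List<String> words(String s) => s
      .toLowerCase()
      .replaceAll(RegExp(r'[^a-z0-9\s]'), ' ')
      .split(RegExp(r'\s+'))
      .where((w) => w.isNotEmpty)
      .toList();

  final q = words(question);
  final normalizedSnippet = words(snippet).join(' ');
  // Slide a window of maxSharedWords + 1 words over the question.
  for (var i = 0; i + maxSharedWords < q.length; i++) {
    final window = q.sublist(i, i + maxSharedWords + 1).join(' ');
    if (normalizedSnippet.contains(window)) return true;
  }
  return false;
}
```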
Results Dataset
Results are recorded in this Google Sheet.
There are 2 sheets:
- LegalBench
- MedQA
The format is roughly one row per question. The columns are:
- question
- formatted question (as discussed in the previous section)
- correct answer
- 2 columns for each AI:
  - The AI's answer, or blank if the AI was correct. N/A is used when the AI did not provide an answer that was one of the choices, even after a follow-up prompt.
  - A column that says either "correct" or "incorrect", based on the answer and AI-answer columns.
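As a sketch, the correct/incorrect column follows directly from that convention (an illustrative helper, not the sheet's actual formula):

```dart
/// Mirrors the convention above: the per-AI answer cell is blank when the
/// model was right, holds the wrong choice otherwise, or 'N/A' when no listed
/// choice was ever given (counted as incorrect here).
String grade(String aiAnswerCell) =>
    aiAnswerCell.trim().isEmpty ? 'correct' : 'incorrect';
```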
Results
Service | Legal | Medical | Average |
---|---|---|---|
Perplexity | 58% | 75% | 66% |
Telosnex On Device | 74% | 86% | 80% |
GPT-4o | 77% | 91% | 84% |
Sonnet 3.5 | 80% | 90% | 85% |
Perplexity Pro (GPT-4o) | 87% | 85% | 86% |
SearchGPT | 87% | 92% | 89% |
Telosnex | 94% | 96% | 95% |