500 million tokens is roughly equivalent to 750,000 pages of text.

2024-05-14 10:42:01

In a recent entrepreneurial project, my company Truss developed and launched several features based on Large Language Models (LLM). These features have achieved good results, so I’d like to share some unusual insights and experiences after processing more than five hundred million tokens (my estimate).

Key background is as follows: we work almost exclusively with OpenAI models (for how we view other models, see the Q&A at the end of the article). In our use cases, GPT-4 accounts for about 85% of usage and GPT-3.5 for the remaining 15%. We deal exclusively with text, so we do not touch GPT-4-vision, Sora, Whisper, or other modalities. Our customers are B2B, and the work centers on summarizing, analyzing, and extracting data, though the details differ from case to case. As for the five hundred million tokens processed: that is equivalent to roughly 750,000 pages of text, which should put the real scale of this experience in perspective.

When writing prompts, we made an interesting discovery: concise prompts tend to work better. If a piece of information is already common knowledge, there’s no need to pack the prompt with detailed lists or instructions; excessive detail actually distracts and confuses the model. This is different from programming, where instructions must be exhaustive and precise. We ran into a good example of this: part of our pipeline had to identify whether a block of text pertained to one of the 50 US states or to the federal government. The task isn’t complicated; string operations or regular expressions could probably do it, but that route involves many small exceptions and takes longer to get right.

The initial method we tried was like this:

This is a piece of text. One of the fields should be “locality_id”, which should be determined using the following list for one of the 50 states or the federal: [{“locality”: “Alabama”, “locality_id”: 1}, {“locality”: “Alaska”, “locality_id”: 2} … ]

Although this method worked most of the time (about 98% of cases), it failed precisely in the cases that required deeper understanding.

Upon closer examination, we found there was no need to make the model pick an ID from the list at all: it could reliably produce the correct state name on its own. So we asked it for just the name and ran a simple string search over the known state names to map the answer back to our ID. After this change, performance improved significantly.

The more ideal prompt, it turned out, was: “You obviously know the 50 states, GPT, so just give me the full name of the state this pertains to, or Federal if this pertains to the US government.” Given a question tied to a particular state or to the federal government, GPT directly returns the right full name. This suggests that vaguer, higher-level instructions can draw out higher-quality, better-generalized responses; it is a clear sign of higher-order delegation and thinking.
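As a rough sketch of how this lean approach can be wired up (the abridged locality mapping and the call_openai helper below are illustrative, not our actual code):

LOCALITIES = { "Alabama" => 1, "Alaska" => 2, "Federal" => 51 }  # abridged for illustration

def locality_id_for(text)
  prompt = "You obviously know the 50 states, GPT, so just give me the full name " \
           "of the state this pertains to, or Federal if this pertains to the US " \
           "government.\n\n#{text}"
  answer = call_openai(prompt)                          # hypothetical chat-completion helper
  LOCALITIES.find { |name, _id| answer.include?(name) }&.last
end

The prompt carries no list at all; mapping the answer back to an ID is ordinary string matching on our side.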

Note 1: You might think of GPT as a fundamentally stochastic model, yet curiously, it was most prone to errors on states starting with the letter M.

Note 2: When we ask GPT to select an ID from a list, it performs better if we send the list as formatted JSON with each entry on its own line instead of separated by commas. In other words, the newline character (\n) seems to be a more effective delimiter than a comma.
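To illustrate, the better-performing variant lists one entry per line, along these lines:

{ "locality": "Alabama", "locality_id": 1 }
{ "locality": "Alaska", "locality_id": 2 }
{ "locality": "Arizona", "locality_id": 3 }
(and so on, one line per state)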

While using the chat API, we found we needed neither Langchain nor any of the advanced features OpenAI has added to its API over the last year. Langchain can be seen as an example of premature abstraction. We initially assumed we would need it, since that was the prevailing sentiment online. But even after burning through millions, then tens of millions, of tokens and shipping 3 to 4 completely different LLM features in production, our openai_service file still contains only a single function of about 40 lines:

def extract_json(prompt, variable_length_input, number_retries)

The only API we used and depended on was chat. We always extract JSON through it. We did not rely on JSON mode, function calling, or the other helper features (even though OpenAI provides them), nor did we ever use system prompts, though they might have helped. When we migrated to gpt-4-turbo, we changed a single string in the codebase, which demonstrates the essence of a powerful, general-purpose model: less is more.

Most of the code in this 40-line function handles the occasional 500 errors or socket-closure errors, which still happen from time to time given the load on the OpenAI API. The service is becoming steadily more stable, but occasional failures are not surprising.
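As a minimal sketch, assuming a configured ruby-openai client in OPENAI_CLIENT and its Faraday-based errors (our real function differs in detail):

require "json"

def extract_json(prompt, variable_length_input, number_retries)
  full_prompt = "#{prompt}\n\n#{variable_length_input}"
  number_retries.times do
    begin
      response = OPENAI_CLIENT.chat(
        parameters: { model: "gpt-4-turbo", messages: [{ role: "user", content: full_prompt }] }
      )
      text = response.dig("choices", 0, "message", "content").to_s
      json = text[/\{.*\}/m]        # take the first {...} block in the reply
      return JSON.parse(json) if json
    rescue Faraday::ServerError, Errno::ECONNRESET, JSON::ParserError
      sleep 1                       # occasional 500s and closed sockets: wait and retry
    end
  end
  nil
end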

We wrote automatic truncation code so that we don’t have to worry about context-length limits, rolling our own token-length estimator:

if s.length > model_context_size * 3  # truncate it!

However, this estimate can break down when the text contains an unusual number of periods or digits, since those tend to tokenize at closer to one token per character, much denser than the three-characters-per-token assumption.

When we hit specific error codes such as “context_length_exceeded”, we adopted a simple retry mechanism: truncate the input to fit under the model’s context limit and try again, which has worked well:

s.truncate(model_context_size * 3 / 1.3)
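Wired together, the retry path looks roughly like this (String#truncate comes from ActiveSupport; the error class and message check are assumptions about the client library):

begin
  result = extract_json(prompt, input, 3)
rescue Faraday::BadRequestError => e          # assumed error surface for a 400 response
  raise unless e.message.include?("context_length_exceeded")
  input = input.truncate((model_context_size * 3 / 1.3).to_i)  # shrink the character budget by 1.3x
  retry
end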

This approach proved both effective and flexible enough for our needs. It is also worth saying that using the streaming API and showing users the output word by word as it is generated is a significant UX innovation of ChatGPT. We had written it off as a gimmick, but the positive feedback from users proves its practical value.

One particular quirk of GPT worth noting is its behavior when instructed to return nothing. Instructions like “Return an empty output if you don’t find anything” pose a real challenge for GPT: it tends to produce some output anyway, often hallucinated, rather than returning nothing, which made empty outputs our most error-prone case.

In most cases, our prompts look like this:

“Here is a text describing a company; we hope you extract this company and output JSON. If no relevant content is found, return empty. The text is as follows: [Text Content]”

We also encountered a bug where a text block could be empty, and GPT would hallucinate terribly in response. The immediate fix was simply not to issue a prompt when there was no text content. The deeper catch is that deciding programmatically whether a text is meaningfully empty can itself be hard, which is exactly the sort of judgment we were using GPT for in the first place.
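The guard itself is trivial, something like:

return nil if text.to_s.strip.empty?   # never send the model an empty text block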

It is worth noting that the term “context window” is not entirely accurate: the input window has grown, but the output window has not grown with it. It is not widely appreciated that GPT-4’s maximum input window is 128K tokens while its output window is still 4K tokens, so “context window” is misleading. This becomes a real problem when asking GPT to return a list of JSON objects.

Imagine a task whose output is an array of JSON objects, each with its own name and tags. When more than 10 items are needed, even if we explicitly demand 15, the success rate might be only around 15%. Our first guess was the 4K output window, but in fact we observed GPT stopping after emitting only about 700 to 800 tokens.

Of course, you can change tack and use the output as input: first prompt for a single task, then feed that prompt together with the returned task back in as the next input, and loop. But this process is like a game of telephone, and it is exactly the sort of complexity that pushes people toward something like Langchain.
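A sketch of that loop, reusing the hypothetical call_openai helper from earlier (text holds the source text block):

require "json"

tasks = []
15.times do
  prompt = "Tasks extracted so far: #{tasks.to_json}\n" \
           "Return exactly one more task as a JSON object, or the word DONE if none are left.\n\n#{text}"
  reply = call_openai(prompt)
  json  = reply[/\{.*\}/m]                  # first {...} block, if any
  break if json.nil? || reply.include?("DONE")
  tasks << JSON.parse(json)                 # each round trip risks telephone-game drift
end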

For most users, vector databases and RAG/embeddings have almost no practical value. Every time we tried to conceive a promising application for RAG/embeddings, we came up empty. In my opinion, the real home of vector databases and RAG is search, and specifically search in the sense of Google or Bing. The reasons are as follows:

  • There is no clear relevance cutoff. You can set heuristic thresholds for relevance, but they aren’t reliable, so RAG either surfaces irrelevant results or is too cautious and misses key ones.
  • Storing vectors in a separate, closed-off database, isolated from the rest of your data, is a poor fit for small and medium-sized applications; you lose a great deal of contextual information.
  • Unless the search space is extremely open-ended, like searching the entire internet, users usually dislike semantic search, because it brings back many irrelevant results.
  • In commercial application searches, users are often experts in their field who do not need the system to guess their intentions, as they will directly state their needs.

In my view, for most search scenarios, a better way to use LLMs is to turn the user’s request into an explicit faceted search, or even a more complex structured query (like SQL), through ordinary prompt completion. That is a different mechanism from RAG altogether.
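For example, rather than embedding documents, have the model emit an explicit query (the schema, the call_openai helper, and the ActiveRecord call below are all illustrative):

question = "Which companies signed contracts in Texas last quarter?"
prompt = "Given a table contracts(company, state, signed_on), translate this request " \
         "into a single SQL query and return only the SQL:\n\n#{question}"
sql  = call_openai(prompt)
rows = ActiveRecord::Base.connection.select_all(sql)   # validate generated SQL before running it in production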

In practical applications, hallucinations are actually not common. Most of our needs amount to extracting specific information from a text with GPT. Asked to list the company names appearing in a text, GPT will not invent a company out of thin air, unless the text mentions no companies at all (which is the null-output problem described above). Engineers may have noticed the same thing with code: GPT does not hallucinate when rewriting a code block; it does not introduce stray errors or invent variables. It does sometimes invent standard library functions that don’t exist, but I view that, too, as the null-output problem: it doesn’t know how to say “I don’t know.” If your use case is entirely of the form “here is all the context, please analyze/summarize/extract,” GPT’s reliability is very high.

Many newly launched products emphasize one key point: the quality of the input data directly affects the quality of the generated response. With reasonable data input, GPT can provide more pertinent responses.

Summary: How will the future unfold?

For some common questions about this topic, I provide direct answers here.

Question: Can we achieve Artificial General Intelligence (AGI)?
Answer: It’s not achievable. With the current transformer architecture, internet-scale data, and existing infrastructure, we can’t reach AGI.

Question: Does GPT-4 really have practical value, or is it just a marketing gimmick?
Answer: It definitely has practical value. We are still in the early stages of the internet.

Question: Will GPT lead to unemployment for everyone?
Answer: No. Essentially, it lowers the barrier to entry into the machine learning/AI field, which was once the exclusive domain of big companies like Google.

Question: Have you tried other models like Claude, Gemini, etc.?
Answer: Although I haven’t done formal A/B testing, trials in daily coding indicate that these models still need improvements in some subtle aspects, such as perceiving user intentions.

Question: How can one keep up with the latest developments in LLMs/AI?
Answer: There’s no need to pay special attention. In the long run, the overall improvement in model performance will far exceed small, local optimizations. Thus, what you mainly need to worry about is when GPT-5 will appear, everything else is secondary.

Question: How outstanding will GPT-5 be?
Answer: Like everyone else, I’m trying to read the tea leaves from OpenAI about GPT-5. Personally, though, I expect incremental improvement. I wouldn’t rule out that “GPT-5 will change everything,” but I don’t hold high expectations either, mainly for economic reasons. I once thought we might see superlinear progress from GPT-3 to GPT-3.5 onward, that is, double the effort yielding more than double the performance. What we observe instead is a logarithmic relationship: for each marginal improvement, token speed falls exponentially while the cost per token rises exponentially. If this is an inherent law, then GPT-4 may already be near the optimum. Compared to GPT-3.5, I would happily pay 20 times the price per token for GPT-4; but going from GPT-4 to GPT-5, I honestly wouldn’t pay 20 times more per token again. GPT-5 might break the mold, but it might also be merely an iPhone 4 to iPhone 5 style upgrade. I wouldn’t be disappointed by that.
