Codex CLI(opens in a brand new window) is our cross-platform native device agent, designed to provide top of the range, dependable device adjustments whilst working safely and successfully for your system. We’ve discovered an incredible quantity about how one can construct a world-class device agent since we first introduced the CLI in April. To unpack the ones insights, that is the primary submit in an ongoing collection the place we’ll discover more than a few sides of the way Codex works, in addition to hard earned courses. (For an much more granular view on how the Codex CLI is constructed, take a look at our open supply repository at https://github.com/openai/codex(opens in a brand new window). Lots of the finer main points of our design selections are memorialized in GitHub problems and pull requests if you happen to’d like to be informed extra.)
To kick off, we’ll focal point at the agent loop, which is the core good judgment in Codex CLI this is chargeable for orchestrating the interplay between the person, the mannequin, and the equipment the mannequin invokes to accomplish significant device paintings. We are hoping this submit offers you a just right view into the function our agent (or “harness”) performs in applying an LLM.
Earlier than we dive in, a handy guide a rough be aware on terminology: at OpenAI, “Codex” includes a suite of device agent choices, together with Codex CLI, Codex Cloud, and the Codex VS Code extension. This submit makes a speciality of the Codex harness, which supplies the core agent loop and execution good judgment that underlies all Codex studies and is surfaced throughout the Codex CLI. For ease right here, we’ll use the phrases “Codex” and “Codex CLI” interchangeably.
On the middle of each AI agent is one thing known as “the agent loop.” A simplified representation of the agent loop looks as if this:
To begin, the agent takes enter from the person to incorporate within the set of textual directions it prepares for the mannequin referred to as a suggested.
The next move is to question the mannequin via sending it our directions and asking it to generate a reaction, a procedure referred to as inference. Right through inference, the textual suggested is first translated into a chain of enter tokens(opens in a brand new window)—integers that index into the mannequin’s vocabulary. Those tokens are then used to pattern the mannequin, generating a brand new collection of output tokens.
The output tokens are translated again into textual content, which turns into the mannequin’s reaction. As a result of tokens are produced incrementally, this translation can occur because the mannequin runs, which is why many LLM-based programs show streaming output. In apply, inference is generally encapsulated at the back of an API that operates on textual content, abstracting away the main points of tokenization.
As the results of the inference step, the mannequin both (1) produces a last reaction to the person’s unique enter, or (2) requests a software name that the agent is predicted to accomplish (e.g., “run ls and document the output”). On the subject of (2), the agent executes the software name and appends its output to the unique suggested. This output is used to generate a brand new enter that’s used to re-query the mannequin; the agent can then take this new data under consideration and check out once more.
This procedure repeats till the mannequin stops emitting software calls and as an alternative produces a message for the person (known as an assistant message in OpenAI fashions). In lots of circumstances, this message immediately solutions the person’s unique request, nevertheless it can be a follow-up query for the person.
For the reason that agent can execute software calls that vary the native surroundings, its “output” isn’t restricted to the assistant message. In lots of circumstances, the main output of a device agent is the code it writes or edits for your system. Nonetheless, each and every flip at all times ends with an assistant message—corresponding to “I added the structure.md you requested for”—which alerts a termination state within the agent loop. From the agent’s viewpoint, its paintings is entire and regulate returns to the person.
The adventure from person enter to agent reaction proven within the diagram is known as one flip of a communique (a thread in Codex). Although this communique flip can come with many iterations between the mannequin inference and software calls. Each and every time you ship a brand new message to an present communique, the communique historical past is incorporated as a part of the suggested for the brand new flip, which incorporates the messages and power calls from earlier turns:
Which means because the communique grows, so does the period of the suggested used to pattern the mannequin. This period issues as a result of each mannequin has a context window, which is the utmost selection of tokens it will probably use for one inference name. Observe this window comprises each enter and output tokens. As you may believe, an agent may come to a decision to make masses of software calls in one flip, probably arduous the context window. Because of this, context window control is likely one of the agent’s many duties. Now, let’s dive in to look how Codex runs the agent loop.
The Codex CLI sends HTTP requests to the Responses API(opens in a brand new window) to run mannequin inference. We’ll read about how data flows via Codex, which makes use of the Responses API to force the agent loop.
Let’s discover how Codex creates the suggested for the primary inference name in a communique.
As an finish person, you don’t specify the suggested used to pattern the mannequin verbatim while you question the Responses API. As an alternative, you specify more than a few enter varieties as a part of your question, and the Responses API server makes a decision how one can construction this knowledge right into a suggested that the mannequin is designed to devour. You’ll be able to recall to mind the suggested as a “record of things”; this phase will provide an explanation for how your question will get remodeled into that record.
Within the preliminary suggested, each merchandise within the record is related to a task. The function signifies how a lot weight the related content material will have to have and is likely one of the following values (in reducing order of precedence): machine, developer, person, assistant.
The equipment box is a listing of software definitions that agree to a schema outlined via the Responses API. For Codex, this comprises equipment which might be offered via the Codex CLI, equipment which might be offered via the Responses API that are meant to be made to be had to Codex, in addition to equipment offered via the person, generally by the use of MCP servers:
1. A message with function=developer that describes the sandbox that applies handiest to the Codex-provided shell software outlined within the equipment phase. This is, different equipment, corresponding to the ones offered from MCP servers, aren’t sandboxed via Codex and are chargeable for implementing their very own guardrails.
2. (Not obligatory) A message with function=developer whose contents are the developer_instructions price learn from the person’s config.toml document.
As soon as Codex has achieved the entire above computation to initialize the enter, it appends the person message to begin the communique.
The former examples targeted at the content material of each and every message, however be aware that each and every part of enter is a JSON object with sort, function(opens in a brand new window), and content material as follows:
As soon as Codex builds up the overall JSON payload to ship to the Responses API, it then makes the HTTP POST request with an Authorization header relying on how the Responses API endpoint is configured in ~/.codex/config.toml (further HTTP headers and question parameters are added if specified).
When an OpenAI Responses API server receives the request, it makes use of the JSON to derive the suggested for the mannequin as follows (to make sure, a customized implementation of the Responses API may make a special selection):
As you’ll be able to see, the order of the primary 3 pieces within the suggested is decided via the server, now not the customer. That stated, of the ones 3 pieces, handiest the content material of the machine message may be managed via the server, because the equipment and directions are decided via the customer. Those are adopted via the enter from the JSON payload to finish the suggested.
Now that we have got our suggested, we’re able to pattern the mannequin.
This HTTP request to the Responses API initiates the primary “flip” of a communique in Codex. The server replies with a Server-Despatched Occasions (SSE(opens in a brand new window)) movement. The information of each and every tournament is a JSON payload with a "sort" that begins with "reaction", which might be one thing like this (a complete record of occasions can also be present in our API doctors(opens in a brand new window)):
Codex consumes the movement of occasions(opens in a brand new window) and republishes them as inner tournament gadgets that can be utilized via a consumer. Occasions like reaction.output_text.delta are used to improve streaming within the UI, while different occasions like reaction.output_item.added are remodeled into gadgets which might be appended to the enter for next Responses API calls.
Think the primary request to the Responses API comprises two reaction.output_item.achieved occasions: one with sort=reasoning and one with sort=function_call. Those occasions will have to be represented within the enter box of the JSON once we question the mannequin once more with the reaction to the software name:
The ensuing suggested used to pattern the mannequin as a part of the following question would appear to be this:
Specifically, be aware how the previous suggested is an actual prefix of the brand new suggested. That is intentional, as this makes next requests a lot more environment friendly as it allows us to benefit from suggested caching (which we’ll talk about within the subsequent phase on efficiency).
Having a look again at our first diagram of the agent loop, we see that there might be many iterations between inference and power calling. The suggested might keep growing till we in any case obtain an assistant message, indicating the tip of the flip:
Within the Codex CLI, we provide the assistant message to the person and focal point the composer to signify to the person that it’s their “flip” to proceed the communique. If the person responds, each the assistant message from the former flip, in addition to the person’s new message, will have to be appended to the enter within the Responses API request to begin the brand new flip:
As soon as once more, as a result of we’re proceeding a communique, the period of the enter we ship to the Responses API assists in keeping expanding:
Let’s read about what this ever-growing suggested manner for efficiency.
You may well be asking of yourself, “Wait, isn’t the agent loop quadratic relating to the quantity of JSON despatched to the Responses API over the process the communique?” And you’d be proper. Whilst the Responses API does improve an non-compulsory previous_response_id(opens in a brand new window) parameter to mitigate this factor, Codex does now not use it as of late, basically to stay requests absolutely stateless and to improve 0 Information Retention (ZDR) configurations.
Fending off previous_response_id simplifies issues for the supplier of the Responses API as it guarantees that each request is stateless. This additionally makes it easy to improve shoppers who’ve opted into 0 Information Retention (ZDR)(opens in a brand new window), as storing the knowledge required to improve previous_response_id could be at odds with ZDR. Observe that ZDR shoppers don’t sacrifice the facility to get pleasure from proprietary reasoning messages from prior turns, because the related encrypted_content can also be decrypted at the server. (OpenAI persists a ZDR buyer’s decryption key, however now not their information.) See PRs #642(opens in a brand new window) and #1641(opens in a brand new window) for the comparable adjustments to Codex to improve ZDR.
In most cases, the price of sampling the mannequin dominates the price of community visitors, making sampling the main goal of our potency efforts. This is the reason suggested caching is so vital, because it allows us to reuse computation from a prior inference name. Once we get cache hits, sampling the mannequin is linear somewhat than quadratic. Our suggested caching (opens in a brand new window)documentation explains this in additional element:
Cache hits are handiest imaginable for precise prefix fits inside a suggested. To understand caching advantages, position static content material like directions and examples initially of your suggested, and put variable content material, corresponding to user-specific data, on the finish. This additionally applies to photographs and equipment, which will have to be similar between requests.
With this in thoughts, let’s believe what forms of operations may motive a “cache pass over” in Codex:
- Converting the
equipmentto be had to the mannequin in the midst of the communique. - Converting the
mannequinthat’s the goal of the Responses API request (in apply, this adjustments the 3rd merchandise within the unique suggested, because it accommodates model-specific directions). - Converting the sandbox configuration, approval mode, or present operating listing.
When imaginable, we deal with configuration adjustments that occur mid-conversation via appending a new message to enter to replicate the trade somewhat than enhancing an previous message:
We move to nice lengths to make sure cache hits for efficiency. There’s some other key useful resource we need to organize: the context window.
Our common option to steer clear of operating out of context window is to compact the communique as soon as the selection of tokens exceeds some threshold. Particularly, we exchange the enter with a brand new, smaller record of things this is consultant of the communique, enabling the agent to proceed with an figuring out of what has came about to this point. An early implementation of compaction(opens in a brand new window) required the person to manually invoke the /compact command, which might question the Responses API the usage of the prevailing communique plus customized directions for summarization(opens in a brand new window). Codex used the ensuing assistant message containing the abstract as the brand new enter(opens in a brand new window) for next communique turns.
We’ve presented the Codex agent loop and walked via how Codex crafts and manages its context when querying a mannequin. Alongside the best way, we highlighted sensible issues and easiest practices that observe to any individual construction an agent loop on most sensible of the Responses API.
Whilst the agent loop supplies the root for Codex, it’s handiest the start. In upcoming posts, we’ll dig into the CLI’s structure, discover how software use is carried out, and take a more in-depth have a look at Codex’s sandboxing mannequin.


