Introducing GPT-5.2 | OpenAI



We’re introducing GPT‑5.2, the most capable model series yet for professional knowledge work.

Already, the average ChatGPT Enterprise user says AI saves them 40–60 minutes a day, and heavy users say it saves them more than 10 hours a week. We designed GPT‑5.2 to unlock even more economic value for people; it’s better at creating spreadsheets, building presentations, writing code, perceiving images, understanding long contexts, using tools, and handling complex, multi-step projects.

GPT‑5.2 sets a new state of the art across many benchmarks, including GDPval, where it outperforms industry professionals at well-specified knowledge work tasks spanning 44 occupations.

In ChatGPT, GPT‑5.2 Instant, Thinking, and Pro will begin rolling out today, starting with paid plans. In the API, they’re available now to all developers.

Overall, GPT‑5.2 brings significant improvements in general intelligence, long-context understanding, agentic tool-calling, and vision, making it better at executing complex, real-world tasks end-to-end than any previous model.

GPT‑5.2 Thinking is the best model yet for real-world, professional use. On GDPval, an eval measuring well-specified knowledge work tasks across 44 occupations, GPT‑5.2 Thinking sets a new state-of-the-art score, and is our first model that performs at or above a human expert level. Specifically, GPT‑5.2 Thinking beats or ties top industry professionals on 70.9% of comparisons on GDPval knowledge work tasks, according to expert human judges. These tasks include making presentations, spreadsheets, and other artifacts. GPT‑5.2 Thinking produced outputs for GDPval tasks at >11x the speed and <1% the cost of expert professionals, suggesting that when paired with human oversight, GPT‑5.2 can help with professional work. Speed and cost estimates are based on historical metrics; speed in ChatGPT may vary.

In GDPval, models attempt well-specified knowledge work spanning 44 occupations from the top 9 industries contributing to US GDP. Tasks request real work products, such as sales presentations, accounting spreadsheets, urgent care schedules, manufacturing diagrams, or short videos. In ChatGPT, GPT‑5.2 Thinking has new tools that GPT‑5 Thinking does not.

When reviewing one especially good output, one GDPval judge commented, “It is an exciting and noticeable leap in output quality… [it] appears to have been done by a professional company with staff, and has a surprisingly well designed layout and advice for both deliverables, though with one we still have some minor errors to correct.”

Additionally, on our internal benchmark of junior investment banking analyst spreadsheet modeling tasks, such as putting together a three-statement model for a Fortune 500 company with proper formatting and citations, or building a leveraged buyout model for a take-private, GPT‑5.2 Thinking’s average score per task is 9.3 percentage points higher than GPT‑5.1’s, rising from 59.1% to 68.4%.
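The move from 59.1% to 68.4% can be read either as an absolute gain in percentage points or as a relative improvement over the earlier score. The short sketch below computes both from the two figures quoted above; the function names are illustrative and not part of any published tooling.

```python
def point_gain(old_pct: float, new_pct: float) -> float:
    """Absolute change between two scores, in percentage points."""
    return new_pct - old_pct


def relative_gain(old_pct: float, new_pct: float) -> float:
    """Change relative to the old score, expressed as a percentage."""
    return (new_pct - old_pct) / old_pct * 100


# Scores quoted in the post for the spreadsheet modeling benchmark
old_score, new_score = 59.1, 68.4
print(f"{point_gain(old_score, new_score):.1f} percentage points")
print(f"{relative_gain(old_score, new_score):.1f}% relative improvement")
```

Keeping the two readings separate avoids confusing a 9.3-point gain with a 9.3% relative gain, which would be a much smaller change.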

Side-by-side comparisons show improved sophistication and formatting in spreadsheets and slides generated by GPT‑5.2 Thinking:

Side by side example of spreadsheet outputs from GPT-5.1 vs GPT-5.2

Prompt: Create a workforce planning model: headcount, hiring plan, attrition, and budget impact. Include engineering, marketing, legal, and sales departments.

To use the new spreadsheet and presentation capabilities in ChatGPT, you must be on a Plus, Pro, Business, or Enterprise plan and select either GPT‑5.2 Thinking or Pro. Complex generations can take many minutes to produce.

GPT‑5.2 Thinking sets a new state of the art of 55.6% on SWE-Bench Pro, a rigorous evaluation of real-world software engineering. Unlike SWE-bench Verified, which only tests Python, SWE-Bench Pro tests four languages and aims to be more contamination-resistant, challenging, diverse, and industrially relevant.

On SWE-bench Verified (not plotted), GPT‑5.2 Thinking scores our new high of 80%.

For everyday professional use, this translates into a model that can more reliably debug production code, implement feature requests, refactor large codebases, and ship fixes end-to-end with less manual intervention.

GPT‑5.2 Thinking is also better at front-end software engineering than GPT‑5.1 Thinking. Early testers found it significantly stronger at front-end development and complex or unconventional UI work, especially involving 3D elements, making it a powerful daily partner for engineers across the stack. See a few examples of what it can produce from a single prompt:

Prompt: Create a single-page app in a single HTML file with the following requirements:
- Name: Ocean Wave Simulation
- Goal: Display realistic animated waves.
- Features: Change wind speed, wave height, lighting.
- The UI should be calming and realistic.

Early testers shared their feedback on GPT‑5.2’s coding capabilities:

“GPT-5.2 represents the biggest leap for GPT models in agentic coding since GPT-5 and is a SOTA coding model in its price range. The version bump undersells the jump in intelligence. We’re excited to make it the default across Windsurf and several core Devin workloads.”

Jeff Wang, CEO, Windsurf

GPT‑5.2 Thinking hallucinates less than GPT‑5.1 Thinking. On a set of de-identified queries from ChatGPT, responses with errors were 30% less common in relative terms. For professionals, this means fewer mistakes when using the model for research, writing, analysis, and decision support, making the model more dependable for everyday knowledge work.

Reasoning effort was set to the maximum available and a search tool was enabled. Errors were detected by other models, which may make errors themselves. Claim-level error rates are far lower than response-level error rates, as most responses contain many claims.

Like all models, GPT‑5.2 Thinking is imperfect. For anything critical, double check its answers.

GPT‑5.2 Thinking sets a new state of the art in long-context reasoning, achieving leading performance on OpenAI MRCRv2, an evaluation that tests a model’s ability to integrate information spread across long documents. On real-world tasks like deep document analysis, which require related information across hundreds of thousands of tokens, GPT‑5.2 Thinking is substantially more accurate than GPT‑5.1 Thinking. Notably, it’s the first model we’ve seen that achieves near 100% accuracy on the 4-needle MRCR variant (out to 256k tokens).

In practical terms, this enables professionals to use GPT‑5.2 to work with long documents (such as reports, contracts, research papers, transcripts, and multi-file projects) while maintaining coherence and accuracy across hundreds of thousands of tokens. This makes GPT‑5.2 especially well suited for deep analysis, synthesis, and complex multi-source workflows.

In OpenAI-MRCR v2 (multi-round co-reference resolution), multiple identical “needle” user requests are inserted into long “haystacks” of similar requests and responses, and the model is asked to reproduce the response to the nth needle. Version 2 of the eval fixes ~5% of tasks that had incorrect ground truth values. Mean match ratio measures the average string match ratio between the model’s response and the correct answer. The points at 256k max input tokens represent averages over 128k–256k input tokens, and so forth. Here, 256k represents 256 * 1,024 = 262,144 tokens. Reasoning effort was set to the maximum available.
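To make the "mean match ratio" idea concrete, here is a minimal sketch of averaging a string similarity score over a set of responses. The post does not say how the string match ratio is computed, so the use of Python's `difflib.SequenceMatcher` here is an assumption, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def match_ratio(response: str, reference: str) -> float:
    """Similarity between a response and the reference answer, in [0, 1].

    Assumption: SequenceMatcher.ratio() stands in for the eval's
    string match ratio, which the post does not define.
    """
    return SequenceMatcher(None, response, reference).ratio()


def mean_match_ratio(pairs: list[tuple[str, str]]) -> float:
    """Average match ratio over (response, reference) pairs."""
    if not pairs:
        raise ValueError("at least one (response, reference) pair is required")
    return sum(match_ratio(resp, ref) for resp, ref in pairs) / len(pairs)


# The post's token convention: "256k" means 256 * 1,024 tokens
TOKENS_256K = 256 * 1024
```

An exact reproduction scores 1.0 and a response sharing no characters with the reference scores 0.0, so a mean near 1.0 over many needles indicates near-verbatim recall.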

For tasks that benefit from thinking beyond the maximum context window, GPT‑5.2 Thinking is compatible with our new Responses /compact endpoint, which extends the model’s effective context window. This lets GPT‑5.2 Thinking tackle more tool-heavy, long-running workflows that would otherwise be limited by context length. Read more in our API documentation.

GPT‑5.2 Thinking is our strongest vision model yet, cutting error rates roughly in half on chart reasoning and software interface understanding.

For everyday professional use, this means the model can more accurately interpret dashboards, product screenshots, technical diagrams, and visual reports, supporting workflows in finance, operations, engineering, design, and customer support where visual information is central.

In ScreenSpot-Pro, models must reason about high-resolution screenshots of graphical user interfaces from a variety of professional settings. A Python tool was enabled and reasoning effort was set to maximum. Without the Python tool, scores are much lower. We recommend enabling the Python tool on vision tasks like these.

Compared to previous models, GPT‑5.2 Thinking has a stronger grasp of how elements are positioned within an image, which helps on tasks where relative layout plays a key role in solving the problem. In the example below, we ask the model to identify the components in an image input (in this case, a motherboard) and return labels with approximate bounding boxes. Even on a low-quality image, GPT‑5.2 identifies the main regions and places boxes that sometimes match the true locations of each component, while GPT‑5.1 only labels a few parts and shows a much weaker understanding of their spatial arrangement. Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.

Example output of GPT-5.1 identifying components in an image
Example output of GPT-5.2 identifying components in an image

GPT‑5.2 Thinking achieves a new state of the art of 98.7% on Tau2-bench Telecom, demonstrating its ability to reliably use tools across long, multi-turn tasks.

For latency-sensitive use cases, GPT‑5.2 Thinking also performs much better at reasoning.effort='none', substantially outperforming GPT‑5.1 and GPT‑4.1.

In τ2-bench, models use tools to complete customer support tasks in a multi-turn interaction with a simulated user. For the Telecom domain, we included a brief, generally helpful instruction in the system prompt to boost performance. We exclude the Airline subset because of lower-quality ground truth grading.

For professionals, this translates into stronger end-to-end workflows, such as resolving customer support cases, pulling data from multiple systems, running analyses, and generating final outputs with fewer breakdowns between steps.

For example, when asking a complex customer service question that requires multi-step resolution, the model can more effectively coordinate a full workflow across multiple agents. In the case below, a traveler reports a delayed flight, a missed connection, an overnight stay in New York, and a medical seating requirement. GPT‑5.2 manages the entire chain of tasks (rebooking, special-assistance seating, and compensation), delivering a more complete outcome than GPT‑5.1.

Example of tool calling output in GPT-5.1
Example of tool calling output in GPT-5.2

One of our hopes for AI is that it will accelerate scientific research for the benefit of everyone. To that end, we’ve been working with and listening to scientists to see how AI can speed up their work, and last month we shared some early collaborative experiments.

We believe GPT‑5.2 Pro and GPT‑5.2 Thinking are the world’s best models for assisting and accelerating scientists. On GPQA Diamond, a graduate-level Google-proof Q&A benchmark, GPT‑5.2 Pro achieves 93.2%, followed closely by GPT‑5.2 Thinking at 92.4%.

On FrontierMath (Tier 1–3), an evaluation of expert-level mathematics, GPT‑5.2 Thinking set a new state of the art, solving 40.3% of problems.

We are beginning to see AI models meaningfully accelerate progress in math and science in tangible ways. For example, in recent work with GPT‑5.2 Pro, researchers explored an open question in statistical learning theory. In a narrow, well-specified setting, the model proposed a proof that was subsequently verified by the authors and reviewed with external experts, illustrating how frontier models can assist mathematical research under close human oversight.

On ARC-AGI-1 (Verified), a benchmark designed to measure general reasoning ability, GPT‑5.2 Pro is the first model to cross the 90% threshold, improving from 87% by o3‑preview last year while reducing the cost of achieving that performance by roughly 390×.

On ARC-AGI-2 (Verified), which raises the difficulty and better isolates fluid reasoning, GPT‑5.2 Thinking achieves a new state of the art for chain-of-thought models, scoring 52.9%. GPT‑5.2 Pro performs even higher, reaching 54.2%, further extending the model’s ability to reason through novel, abstract problems.

Improvements across these evaluations reflect GPT‑5.2’s stronger multi-step reasoning, greater quantitative accuracy, and more reliable problem solving on complex technical tasks.

Here’s what our early testers say about GPT‑5.2:

“GPT-5.2 unlocked a complete architecture shift for us. We collapsed a fragile, multi-agent system into a single mega-agent with 20+ tools. The best part is, it just works. The mega-agent is faster, smarter, and 100x easier to maintain. We’re seeing dramatically lower latency, much stronger tool calling, and we no longer need sprawling system prompts because 5.2 will execute cleanly off a simple, one-line prompt. It feels like pure magic.”

AJ Orbach, CEO, Triple Whale

In ChatGPT, users should notice GPT‑5.2 feels better to use day to day: more structured, more reliable, and still enjoyable to talk to.

GPT‑5.2 Instant is a fast, capable workhorse for everyday work and learning, with clear improvements in info-seeking questions, how-tos and walk-throughs, technical writing, and translation, building on the warmer conversational tone introduced in GPT‑5.1 Instant. Early testers particularly noted clearer explanations that surface key information upfront.

GPT‑5.2 Thinking is designed for deeper work, helping users tackle more complex tasks with greater polish, especially for coding, summarizing long documents, answering questions about uploaded files, working through math and logic step by step, and supporting planning and decisions with clearer structure and more useful detail.

GPT‑5.2 Pro is our smartest and most trustworthy option for difficult questions where a higher-quality answer is worth the wait, with early testing showing fewer major errors and stronger performance in complex domains like programming.

GPT‑5.2 builds on the safe completion research we introduced with GPT‑5, which teaches the model to give the most helpful answer while still staying within safety boundaries.

With this release, we continued our work to strengthen our models’ responses in sensitive conversations, with meaningful improvements in how they respond to prompts indicating signs of suicide or self harm, mental health distress, or emotional reliance on the model. These targeted interventions have resulted in fewer undesirable responses in both GPT‑5.2 Instant and GPT‑5.2 Thinking as compared to GPT‑5.1 and GPT‑5 Instant and Thinking models. Further details can be found in the system card.

We’re in the early stages of rolling out our age prediction model so that we can automatically apply content protections for users who are under 18, in order to limit access to sensitive content. This builds on our existing approach to users we know are under 18 and our parental controls.

GPT‑5.2 is one step in an ongoing series of improvements, and we’re far from done. While this release delivers meaningful gains in intelligence and productivity, we know there are areas where people want more. In ChatGPT, we’re working on known issues like over-refusals, while continuing to raise the bar on safety and reliability overall. These changes are complex, and we’re focused on getting them right.

In ChatGPT, we’ll begin rolling out GPT‑5.2 (Instant, Thinking, and Pro) today, starting with paid plans (Plus, Pro, Go, Business, Enterprise). We deploy GPT‑5.2 gradually to keep ChatGPT as smooth and reliable as we can; if you don’t see it at first, please try again later. In ChatGPT, GPT‑5.1 will still be available to paid users for three months under legacy models, after which we will sunset GPT‑5.1.

In our API Platform, GPT‑5.2 Thinking is available today in the Responses API and Chat Completions API as gpt-5.2, and GPT‑5.2 Instant as gpt-5.2-chat-latest. GPT‑5.2 Pro is available in the Responses API as gpt-5.2-pro. Developers can now set the reasoning parameter in GPT‑5.2 Pro, and both GPT‑5.2 Pro and GPT‑5.2 Thinking now support the new fifth reasoning effort of xhigh, for tasks where quality is most important.

GPT‑5.2 is priced at $1.75/1M input tokens and $14/1M output tokens, with a 90% discount on cached inputs. On multiple agentic evals, we found that despite GPT‑5.2’s greater cost per token, the cost of reaching a given level of quality ended up lower due to GPT‑5.2’s greater token efficiency.
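The quoted prices make per-request cost easy to estimate. The sketch below applies the three figures from the paragraph above; it assumes that cached tokens are a subset of the input tokens and that the 90% discount applies to the input price, and the function and constant names are illustrative only.

```python
# Prices quoted in the post, in US dollars per one million tokens
INPUT_USD_PER_M = 1.75
OUTPUT_USD_PER_M = 14.00
CACHED_INPUT_DISCOUNT = 0.90  # assumed to apply to the input price


def estimate_cost(input_tokens: int, output_tokens: int,
                  cached_input_tokens: int = 0) -> float:
    """Estimate the cost of one request in US dollars.

    Assumption: cached_input_tokens counts the part of input_tokens
    that is billed at the discounted rate.
    """
    if min(input_tokens, output_tokens, cached_input_tokens) < 0:
        raise ValueError("token counts must be non-negative")
    if cached_input_tokens > input_tokens:
        raise ValueError("cached tokens cannot exceed total input tokens")
    uncached = input_tokens - cached_input_tokens
    cached_rate = INPUT_USD_PER_M * (1 - CACHED_INPUT_DISCOUNT)
    total = (uncached * INPUT_USD_PER_M
             + cached_input_tokens * cached_rate
             + output_tokens * OUTPUT_USD_PER_M)
    return total / 1_000_000


# Example: 200k input tokens, 150k of them cached, and 20k output tokens
print(f"${estimate_cost(200_000, 20_000, cached_input_tokens=150_000):.4f}")
```

Because output tokens cost eight times as much as uncached input tokens at these prices, output length tends to dominate the bill for reasoning-heavy requests.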

While ChatGPT subscription pricing remains the same, in the API GPT‑5.2 is priced higher per token than GPT‑5.1 because it is a more capable model. It’s still priced below other frontier models, so people can continue to use it deeply in their daily work and core applications.

We have no current plans to deprecate GPT‑5.1, GPT‑5, or GPT‑4.1 in the API and will communicate any deprecation plans with ample advance notice for developers. While GPT‑5.2 will work well out of the box in Codex, we expect to release a version of GPT‑5.2 optimized for Codex in the coming weeks.

GPT‑5.2 was built in collaboration with our long-standing partners NVIDIA and Microsoft. Azure data centers and NVIDIA GPUs, including H100, H200, and GB200-NVL72, underpin OpenAI’s at-scale training infrastructure, driving significant gains in model intelligence. Together, this collaboration allows us to scale compute with confidence and bring new models to market more quickly.

Below, we report comprehensive benchmark scores for GPT‑5.2 Thinking, along with a subset for GPT‑5.2 Pro.

Models were run with maximum available reasoning effort in our API (xhigh for GPT‑5.2 Thinking & Pro, and high for GPT‑5.1 Thinking), except for the professional evals, where GPT‑5.2 Thinking was run with reasoning effort heavy, the maximum available in ChatGPT Pro. Benchmarks were conducted in a research environment, which may provide slightly different output from production ChatGPT in some cases.

* For SWE-Lancer, we omit 40/237 problems that did not run on our infrastructure.



