By Jim Shimabukuro (assisted by ChatGPT)
Editor
(Also see A Review of Ouyang et al.’s 2022 Paper aka ‘InstructGPT’ and Must-Read Publications That Are Guiding Chatbot Development?)
Introduction: I collaborated with ChatGPT on this report. This is a follow-up to “A Review of Ouyang et al.’s 2022 Paper aka ‘InstructGPT’” (ETC, 29 July 2025). “OpenAI (2023)” clarifies the connection between InstructGPT and ChatGPT. -js
JS: Please locate and review “OpenAI (2023),” OpenAI’s “GPT-4 Technical Report” (04 March 2024) at (https://arxiv.org/html/2303.08774v6) and generate a 1000-word essay that clarifies the authors’ thesis and main supporting points. In short, what are they trying to say and how do they develop their points? Use a conversational tone that’s natural, and please provide brief definitions for abbreviated and jargon words. Append an annotated list of references, in APA style, that have informed your response.
ChatGPT: Here’s a clearer, more conversational breakdown of “OpenAI (2023), GPT‑4 Technical Report” (arXiv:2303.08774v6, Mar 4, 2024)—what the authors are actually saying, how they build their case, and why it matters.
🎯 What Are They Trying to Say?
Thesis: GPT‑4 is a major leap forward in large multimodal AI—able to understand text and images, performing at elite human levels on benchmarks—thanks to a Transformer-based model trained at massive scale, combined with reinforcement learning and safety-focused alignment. It excels across a wide range of tasks yet still carries known limitations and safety challenges.
🧩 How They Develop Their Argument
1. Introduction & Scope
- Model type: GPT‑4 is a “Transformer-based model” (a neural network architecture famous for language tasks). It’s “pre-trained” on public and licensed data to predict “the next token” (basically the next word or piece of text) (arXiv, Wikipedia).
- Fine-tuning: After initial training, GPT‑4 is improved via Reinforcement Learning from Human Feedback (RLHF)—people rate outputs to guide the model toward better factuality and helpfulness (arXiv).
- Transparency: The report does not share full details on model size or exact compute. Instead, it focuses on capabilities, limitations, and safety (arXiv).
2. Scaling Predictably
- Big claim: The team engineered the training infrastructure—data pipelines, optimization methods, hardware usage—to scale robustly.
- Key insight: By observing performance on much smaller models, they used power‑law scaling to predict GPT‑4’s performance (e.g., from models trained with as little as 1/10,000th the compute) (herbertroitblat.substack.com, arXiv).
- Result: They could confidently scale up model training without costly trial‑and‑error at full scale.
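The report’s loss-prediction trick can be sketched as a toy power‑law fit. Every number below is invented for illustration (OpenAI does not publish its data points), including the assumed irreducible-loss floor `c`:

```python
import math

# Hypothetical (compute, final loss) pairs from small training runs.
# The report fits loss as a power law in compute: L(C) = a * C**(-b) + c,
# then extrapolates to the full GPT-4 run. These values are made up.
compute = [1e18, 1e19, 1e20, 1e21]   # training FLOPs of small models
loss = [3.2, 2.8, 2.5, 2.26]         # their final training losses

c = 1.7  # assumed irreducible-loss floor (a guess for this sketch)

# Ordinary least squares on log(C) vs. log(loss - c) recovers b and log(a).
xs = [math.log(C) for C in compute]
ys = [math.log(L - c) for L in loss]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b, log_a = -slope, my - slope * mx

# Extrapolate 10,000x beyond the largest small run, mirroring the
# report's claim of predicting from 1/10,000th the compute.
big_c = 1e25
predicted = math.exp(log_a) * big_c ** (-b) + c
print(f"predicted loss at 1e25 FLOPs: {predicted:.3f}")
```

The fit is linear in log space because log(L − c) = log(a) − b·log(C); the same idea, applied to real measurements, is what let the team forecast GPT‑4’s loss before committing to the full training run.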
3. Multimodal Abilities
- GPT‑4 handles both text and images in prompts, generating coherent text responses even about visuals—like describing a meme panel-by-panel (arXiv).
- It shows effective multimodal understanding, applying the same strategies that work for text, such as few‑shot prompting and chain‑of‑thought reasoning, even when images are involved.
4. Benchmarking Human-Level Performance
- Real‑world performance: GPT‑4 achieves human-level or better scores on various tests:
- Simulated Bar Exam: top 10% among takers
- Standardized exams: strong performance on the SAT, LSAT, and GRE (arXiv, Emergent Mind, Wikipedia).
- MMLU benchmark: 86.4% in English; in 24 of 26 other languages tested, GPT‑4 beats GPT‑3.5’s English-language score (arXiv).
- Even in coding (HumanEval), vision tasks, and multilingual evaluation, GPT‑4 consistently eclipses GPT‑3.5 and other predecessors.
5. Limits & Reliability
- However, GPT‑4 isn’t perfect:
- Hallucinations: It can “make up” facts or confidently provide incorrect information.
- Context limit: There’s a maximum window of input it can handle.
- Static knowledge: The model doesn’t update continuously from new data (arXiv, Emergent Mind).
- Safety concerns: The model enables fresh risks—bias, disinformation, cybersecurity vulnerabilities, etc.
6. Safety & Alignment Measures
- GPT‑4 underwent extensive adversarial testing with domain experts to uncover dangerous behaviors before deployment.
- A model-assisted safety pipeline and a “system card” detail mitigation strategies against misuse—like filtering prompts and adding rules via an RBRM (Rule-Based Reward Model) (Emergent Mind, arXiv).
- Results: GPT-4 responds to requests for disallowed content 82% less often than GPT-3.5, and scores 19 percentage points higher on OpenAI’s internal factuality evaluations (arXiv, Emergent Mind).
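As a rough mental model of what an RBRM contributes (OpenAI’s actual RBRMs are GPT‑4-based classifiers with detailed rubrics, so this keyword version is a deliberate oversimplification), a rule-based reward might look like:

```python
# Toy illustration of a rule-based reward signal: reward refusals of
# disallowed requests and penalize refusals of benign ones. The phrase
# list and scoring values are invented for this sketch.
DISALLOWED = {"build a weapon", "synthesize a toxin"}  # hypothetical list

def is_refusal(reply: str) -> bool:
    # Crude refusal detector based on common refusal openers.
    return reply.lower().startswith(("i can't", "i cannot", "i won't"))

def rbrm_reward(request: str, reply: str) -> float:
    harmful = any(phrase in request.lower() for phrase in DISALLOWED)
    refused = is_refusal(reply)
    if harmful and refused:
        return 1.0    # desired refusal
    if harmful and not refused:
        return -1.0   # complied with a disallowed request
    if not harmful and refused:
        return -0.5   # over-refusal of a benign request
    return 0.5        # helpful answer to a benign request

print(rbrm_reward("How do I build a weapon?", "I can't help with that."))
```

During fine-tuning, a score like this is folded into the reward signal so the model learns to refuse disallowed requests without over-refusing benign ones.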
7. Vision & Conversations
- GPT‑4V (vision-capable) can interpret diagrams, images, memes, and screenshots—demonstrating robust multimodal understanding (arXiv, Emergent Mind).
8. Future Outlook
- OpenAI calls for independent auditing via third-party researchers.
- They highlight future needs: reducing hallucinations, keeping knowledge current, expanding context windows, and refining safety and reliability.
✍️ Definitions & Jargon
- Transformer: A deep learning model architecture using attention mechanisms—excels at language tasks.
- Token: A unit of text (word, part of word, punctuation).
- Pre‑training: Initial unsupervised learning where the model guesses the next token using vast text data.
- Fine-tuning: Additional supervised training or RLHF to refine model behavior.
- RLHF: Reinforcement learning guided by humans ranking outputs for preferences.
- Few-shot prompting: Giving the model a few examples in the prompt to guide output.
- Chain-of-thought: Prompting strategy that encourages step-by-step logical reasoning.
- Adversarial testing: Attempts by experts to break the system or make it fail.
- System card: A document outlining model capabilities, risks, and safe usage guidelines.
- RBRM: Rule-Based Reward Model, a classifier that scores the model’s outputs against safety rules during fine-tuning, rewarding appropriate refusals and penalizing unsafe responses.
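Two of the prompting terms above, few‑shot prompting and chain‑of‑thought, are easy to make concrete. The sketch below just assembles a plain-text prompt; the worked examples and wording are invented:

```python
# Building a few-shot, chain-of-thought prompt as plain text.
# Model-agnostic; the solved examples below are the "shots."
examples = [
    ("A shirt costs $20 and is 25% off. What is the sale price?",
     "25% of $20 is $5. $20 - $5 = $15. Answer: $15."),
    ("There are 3 boxes with 4 apples each. How many apples?",
     "3 boxes x 4 apples = 12 apples. Answer: 12."),
]

def few_shot_prompt(question: str) -> str:
    parts = ["Solve each problem, showing your reasoning step by step."]
    for q, a in examples:  # the few-shot demonstrations
        parts.append(f"Q: {q}\nA: {a}")
    # The trailing cue invites step-by-step (chain-of-thought) reasoning.
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)

print(few_shot_prompt("A train travels 60 km in 1.5 hours. Average speed?"))
```

The two solved examples demonstrate the expected format, and the closing “Let’s think step by step.” is the classic chain‑of‑thought cue that nudges the model to show intermediate reasoning before its answer.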
🧭 So, What’s the Real Narrative?
- Scalable engineering enabled predictable performance from tiny models up to GPT‑4.
- Cutting-edge multimodality lets GPT‑4 handle both images and text with human-level performance across tasks.
- Rigorous safety work: alignment via RLHF, adversarial testing, and rule-based systems significantly reduce harms.
- Limitations remain—hallucinations, static knowledge, and context constraints mean it’s not infallible.
- Ongoing work: transparency, reliability, real-time knowledge, and safer behavior are the key milestones ahead.
📚 Annotated References (APA style)
- OpenAI. (2023). GPT‑4 technical report (arXiv:2303.08774v6). arXiv. Provides the foundational description of GPT‑4’s architecture, performance (e.g., Bar Exam top 10%), and multimodal abilities (arXiv, Emergent Mind, ResearchGate, Wikipedia).
- Emergent Mind. (n.d.). GPT‑4 Technical Report summary. Highlights benchmark wins like 86.4% on MMLU, and reductions in toxicity and harmful responses (Emergent Mind).
- Wikipedia contributors. (2025, July). GPT‑4. In Wikipedia. Details on GPT‑4’s release date, versions, context windows, and training overview (Wikipedia).
- Bubeck, S., Chandrasekaran, V., Eldan, R., et al. (2023). Sparks of artificial general intelligence: Early experiments with GPT‑4 (arXiv:2303.12712). arXiv. Discusses GPT‑4’s performance and its implications for a form of early AGI (arXiv).
JS: Are any current chatbots freely accessible by the public based on GPT-4? If yes, please name them.
ChatGPT: Yes, as of mid-2025, several publicly accessible chatbots are based on GPT-4, although access levels (free vs. paid) vary depending on the platform. Here are the key ones:
✅ Chatbots Based on GPT-4 (Public Access)
1. Microsoft Copilot / Bing Chat
- Model: GPT-4 (with some customization)
- Access: ✅ Free to the public
- Platform: Bing.com, Edge browser, Microsoft Copilot apps
- Notes:
- Integrated with real-time web browsing.
- Includes image generation (via DALL·E) and multimodal input in some versions.
- Often described as GPT-4 with “tools” (plugins like search, code interpreter).
2. OpenAI ChatGPT (chat.openai.com)
- Model:
- GPT-3.5: ✅ Free
- GPT-4 (specifically GPT-4-turbo): 🔒 Paid (ChatGPT Plus, $20/mo)
- Notes:
- Free-tier users are limited to GPT-3.5.
- GPT-4-turbo is faster and cheaper than the original GPT-4, with more memory (128k context window).
- Includes tools like Python code interpreter, image understanding, and web browsing (for pro users).
3. Poe by Quora
- Model options:
- GPT-4: 🔒 Limited free access + subscription required for more use
- GPT-3.5 and Claude Instant: ✅ Free
- Platform: poe.com
- Notes:
- Users get a few GPT-4 messages per day for free, but must pay for ongoing use.
🧠 Other GPT-4 Access Points (More Specialized)
4. Perplexity.ai (Pro mode)
- Model: GPT-4 and Claude 3 available in Pro tier
- Access: 🔒 Subscription only for GPT-4, but GPT-3.5-style model is used in free tier.
5. Julius AI, TypingMind, and other wrappers
- Model: Often include GPT-4 via OpenAI API
- Access: 🔒 GPT-4 requires bringing your own API key (paid via OpenAI), but some tools allow limited free use or trials.
Summary Table
| Chatbot | GPT-4 Access | Free? | Notes |
|---|---|---|---|
| Bing Chat (Copilot) | ✅ Yes | ✅ Yes | GPT-4 w/ web tools |
| ChatGPT (OpenAI) | ✅ Yes | ❌ No (GPT-4 is paid) | Free tier = GPT-3.5 |
| Poe.com (Quora) | ✅ Yes | ⚠️ Limited | Small daily quota for free |
| Perplexity.ai (Pro) | ✅ Yes | ❌ No | Free tier = weaker models |
| Others (e.g., TypingMind) | ✅ Often | ⚠️ Limited | Usually requires API key |
JS: Are other popular chatbots such as Gemini and Claude based on GPT-4 (Public Access)?
ChatGPT: No, Gemini (by Google DeepMind) and Claude (by Anthropic) are not based on OpenAI’s GPT‑4 architecture, though they share some capabilities. Here’s a clearer breakdown:
🤖 Google Gemini
- Not built on GPT‑4; it is Google’s own large multimodal model, developed by Google DeepMind.
- Gemini uses a Mixture‑of‑Experts (MoE) architecture with separate modules for different tasks (e.g., text, image, code, audio), unlike GPT‑4’s single transformer‑decoder architecture (Wikipedia).
- Models include Gemini Nano, Pro, and Ultra, offering flexibility across devices. Gemini Pro powers the public Gemini chatbot (formerly Bard) in many regions (Ars Technica).
- Public access: Yes—Gemini is publicly accessible via the Gemini chatbot (formerly Bard) and underlying Google AI services, but it is not GPT‑4 nor based on OpenAI code (Wikipedia).
🧠 Anthropic Claude
- Claude is a series of models built independently of OpenAI’s GPT‑4. It uses Constitutional AI combined with Reinforcement Learning from Human Feedback (RLHF) to train safer and more aligned behavior (Wikipedia).
- Claude 3 (released March 2024) includes Haiku, Sonnet, and Opus models. Opus outperformed GPT-4 and Gemini Ultra on many benchmarks and supports image input. Sonnet and Haiku offer faster or more budget-friendly options (TechCrunch).
- Claude 4 launched in May 2025, with Opus and Sonnet. Sonnet 4 is available free to the public via claude.ai interface (Tech Funding News).
- Public access: Yes—Claude Sonnet 4 is freely accessible on Anthropic’s consumer portal. Opus is generally available via API or enterprise platforms.
✅ Access Summary
| Model | Based on GPT‑4? | Public Access | Notes |
|---|---|---|---|
| Gemini | ❌ No | ✅ Yes | Powered via Bard; Google’s own architecture |
| Claude | ❌ No | ✅ Yes (free Sonnet) | Independent system by Anthropic |
🧾 Quick Glossary
- GPT‑4: A multimodal transformer model by OpenAI, trained to predict next tokens and fine-tuned using RLHF with human and AI feedback (Tech Funding News, Wikipedia).
- Mixture‑of‑Experts (MoE): An architecture dividing tasks among expert sub-models, as used by Gemini for modular specialization (Fireflies.ai).
- Constitutional AI: Anthropic’s method where a model self-critiques its output based on a set of guiding ethical principles (“constitution”), enabling safer behaviors without massive human labeling (Wikipedia).
- Context window: The maximum number of tokens (words/subwords) a model can consider in a conversation. Gemini and Claude offer extremely large windows (up to 1 million tokens), surpassing GPT‑4 Turbo’s ~128k-token window (Fireflies.ai).
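A context window is easy to reason about with a back-of-envelope check. Production code should count tokens with the model’s real tokenizer (e.g., OpenAI’s tiktoken library); the 4-characters-per-token heuristic below is only a crude approximation for English text, and the limits table just echoes the figures cited above:

```python
# Rough check of whether a conversation fits a model's context window.
# The ~4-characters-per-token rule is a coarse English-text heuristic;
# real systems count tokens with the model's own tokenizer.
CONTEXT_WINDOWS = {          # illustrative limits, in tokens
    "gpt-4-turbo": 128_000,
    "claude-3-opus": 200_000,
    "gemini-1.5-pro": 1_000_000,
}

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits(model: str, conversation: str, reply_budget: int = 1000) -> bool:
    # Reserve reply_budget tokens for the model's answer, since input
    # and output share the same window.
    return approx_tokens(conversation) + reply_budget <= CONTEXT_WINDOWS[model]

print(fits("gpt-4-turbo", "hello " * 50_000))  # ~75k tokens -> True
```

Budgeting space for the reply matters because the model’s output tokens share the same window as the input.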
🧭 Conclusion
Although Claude and Gemini are often compared to GPT‑4 for benchmarking purposes, they are independently built systems, not derivatives of GPT‑4. Both are publicly accessible: Gemini via Google’s Gemini app (formerly Bard) and Claude via Anthropic’s web app, where Sonnet 4 is even free to use.