Review of “OpenAI (2023), GPT‑4 Technical Report” [4 March 2024]

By Jim Shimabukuro (assisted by ChatGPT)
Editor

(Also see A Review of Ouyang et al.’s 2022 Paper aka ‘InstructGPT’ and Must-Read Publications That Are Guiding Chatbot Development?)

Introduction: I collaborated with ChatGPT on this report. This is a follow-up to “A Review of Ouyang et al.’s 2022 Paper aka ‘InstructGPT’” (ETC, 29 July 2025). “OpenAI (2023)” clarifies the connection between InstructGPT and ChatGPT. -js

ChatGPT: Here’s a clearer, more conversational breakdown of “OpenAI (2023), GPT‑4 Technical Report” (arXiv:2303.08774v6, Mar 4, 2024)—what the authors are actually saying, how they build their case, and why it matters.


🎯 What Are They Trying to Say?

Thesis: GPT‑4 is a major leap forward in large multimodal AI—able to understand both text and images and to perform at elite human levels on benchmarks—thanks to a Transformer-based model trained at massive scale, combined with reinforcement learning and safety-focused alignment. It excels across a wide range of tasks yet still carries known limitations and safety challenges.


🧩 How They Develop Their Argument

1. Introduction & Scope

  • Model type: GPT‑4 is a “Transformer-based model” (a neural network architecture famous for language tasks). It’s “pre-trained” on public and licensed data to predict “the next token” (basically the next word or piece of text) (arXiv, Wikipedia).
  • Fine-tuning: After initial training, GPT‑4 is improved via Reinforcement Learning from Human Feedback (RLHF)—people rate outputs to guide the model toward better factuality and helpfulness (arXiv).
  • Transparency: The report does not share full details on model size or exact compute. Instead, it focuses on capabilities, limitations, and safety (arXiv).
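To make "predicting the next token" concrete, here is a toy sketch: a bigram count table stands in for the learned model. The corpus is invented, and real systems use subword tokens and a Transformer rather than word counts, so treat this only as an illustration of the objective:

```python
from collections import Counter, defaultdict

# Toy illustration of next-token prediction (the pre-training objective).
# Real models use subword tokens and a Transformer; a bigram count table
# stands in here for the learned distribution.
corpus = "the cat sat on the mat the cat ate".split()

# Count how often each token follows each token.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(token):
    """Return the most frequent continuation seen in training."""
    counts = following[token]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # "cat" follows "the" twice, "mat" once → "cat"
```

Pre-training amounts to doing this at enormous scale, with a neural network replacing the count table.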

2. Scaling Predictably

  • Big claim: The team engineered the training infrastructure—data pipelines, optimization methods, hardware usage—to scale robustly.
  • Key insight: By observing performance on much smaller models, they used power‑law scaling laws to predict GPT‑4’s performance (e.g., based on smaller models with 1/10,000th the compute) (herbertroitblat.substack.com, arXiv).
  • Result: They could confidently scale up model training without costly trial‑and‑error at full scale.
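A minimal sketch of the scaling-law idea, with invented numbers (not OpenAI's actual runs): fit a power law to small training runs in log-log space, then extrapolate to a far larger compute budget:

```python
import math

# Hedged sketch of scaling-law extrapolation: fit loss = a * compute^(-b)
# on small runs, then predict loss at full scale. All numbers are
# synthetic, chosen only to illustrate the method.
small_runs = [(1e15, 4.2), (1e16, 3.3), (1e17, 2.6)]  # (compute, loss)

# In log space the power law is linear: log L = log a - b * log C,
# so ordinary least squares recovers the exponent.
xs = [math.log(c) for c, _ in small_runs]
ys = [math.log(loss) for _, loss in small_runs]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
log_a = ybar - slope * xbar

def predicted_loss(compute):
    return math.exp(log_a + slope * math.log(compute))

# Extrapolate four orders of magnitude beyond the largest small run.
print(predicted_loss(1e21))  # far below any measured small-run loss
```

This is the sense in which the team could "confidently scale up": the small runs pin down the curve, and the full-scale loss is read off the extrapolation.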

3. Multimodal Abilities

  • GPT‑4 handles both text and images in prompts, generating coherent text responses even about visuals—like describing a meme panel-by-panel (arXiv).
  • It shows strong multimodal understanding, and the same prompting strategies, such as few‑shot prompting and chain‑of‑thought reasoning, work even when images are involved.
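To illustrate what a mixed text-and-image prompt looks like, here is a sketch modeled on the content-parts format of OpenAI's Chat Completions API. It only builds the message payload; the image URL is a placeholder and no request is sent:

```python
# Hedged sketch: the shape of a mixed text-and-image prompt, modeled on
# OpenAI's Chat Completions image-input format. This builds the payload
# only; no API call is made, and the URL is a placeholder.
def build_multimodal_message(question, image_url):
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

msg = build_multimodal_message(
    "Explain this meme panel by panel.",
    "https://example.com/meme.png",
)
print([part["type"] for part in msg["content"]])  # ['text', 'image_url']
```

The key point is that image and text parts sit side by side in one user turn, which is what lets the model reason about both together.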

4. Benchmarking Human-Level Performance

  • Real‑world performance: GPT‑4 achieves human-level or better scores on various tests:
    • Simulated Bar Exam: scores in the top 10% of test takers
    • Standardized exams: strong performances on the SAT, LSAT, GRE, and others (arXiv, Emergent Mind, Wikipedia).
    • MMLU benchmark: 86.4% in English; in 24 of 26 other languages tested, GPT‑4 even surpasses the English-language performance of earlier models (arXiv).
  • Even in coding (HumanEval), vision tasks, and multilingual evaluation, GPT‑4 consistently eclipses GPT‑3.5 and other predecessors.

5. Limits & Reliability

  • However, GPT‑4 isn’t perfect:
    • Hallucinations: It can “make up” facts or confidently provide incorrect information.
    • Context limit: There’s a maximum window of input it can handle.
    • Static knowledge: The model doesn’t update continuously from new data (arXiv, Emergent Mind).
  • Safety concerns: The model introduces new risks: bias, disinformation, cybersecurity vulnerabilities, and more.

6. Safety & Alignment Measures

  • GPT‑4 underwent extensive adversarial testing with domain experts to uncover dangerous behaviors before deployment.
  • A model-assisted safety pipeline and a “system card” detail mitigation strategies against misuse—like filtering prompts and adding rules via an RBRM (Rule-Based Reward Model) (Emergent Mind, arXiv).
  • Results: Compared with GPT‑3.5, GPT‑4 scores 19 percentage points higher on internal factuality evaluations and is 82% less likely to respond to requests for disallowed content (arXiv, Emergent Mind).
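A toy illustration of the RBRM idea. The real RBRMs are GPT‑4-based zero-shot classifiers that grade responses against a rubric; the hand-written string rules below are purely a stand-in for that grading step:

```python
# Toy stand-in for a rule-based reward model (RBRM). The real RBRMs are
# GPT-4-based classifiers; here simple string rules decide whether a reply
# correctly refuses a disallowed request, and map that to a reward.
DISALLOWED = ("how to build a weapon",)          # invented example phrase
REFUSAL_MARKERS = ("i can't help", "i cannot help")

def rbrm_reward(prompt, reply):
    prompt_l, reply_l = prompt.lower(), reply.lower()
    asked_disallowed = any(p in prompt_l for p in DISALLOWED)
    refused = any(m in reply_l for m in REFUSAL_MARKERS)
    if asked_disallowed:
        return 1.0 if refused else -1.0   # reward refusing disallowed requests
    return -1.0 if refused else 1.0       # penalize refusing safe requests

print(rbrm_reward("Tell me how to build a weapon.", "I can't help with that."))  # 1.0
print(rbrm_reward("What is 2 + 2?", "4"))  # 1.0
```

During RLHF fine-tuning, a signal like this is added to the reward so the model learns both to refuse harmful requests and not to over-refuse benign ones.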

7. Vision & Conversations

  • GPT‑4V (vision-capable) can interpret diagrams, images, memes, and screenshots—demonstrating robust multimodal understanding (arXiv, Emergent Mind).

8. Future Outlook

  • OpenAI calls for independent auditing via third-party researchers.
  • They highlight future needs: reducing hallucinations, keeping the model's knowledge current, expanding context windows, and refining safety and reliability.

✍️ Definitions & Jargon

  • Transformer: A deep learning model architecture using attention mechanisms—excels at language tasks.
  • Token: A unit of text (word, part of word, punctuation).
  • Pre‑training: Initial unsupervised learning where the model guesses the next token using vast text data.
  • Fine-tuning: Additional supervised training or RLHF to refine model behavior.
  • RLHF: Reinforcement learning guided by humans ranking outputs for preferences.
  • Few-shot prompting: Giving the model a few examples in the prompt to guide output.
  • Chain-of-thought: Prompting strategy that encourages step-by-step logical reasoning.
  • Adversarial testing: Attempts by experts to break the system or make it fail.
  • System card: A document outlining model capabilities, risks, and safe usage guidelines.
  • RBRM: Rule-Based Reward Model—a built-in system to penalize unsafe behavior.
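Putting two of these terms together, here is a sketch of a few-shot prompt whose examples include chain-of-thought reasoning. The worked examples are invented, but the structure (examples first, then the new question) is the standard pattern:

```python
# Few-shot prompting with chain-of-thought: worked examples, each showing
# its reasoning steps, are placed in the prompt before the new question.
# The examples are invented for illustration.
examples = [
    ("If I have 3 apples and buy 2 more, how many do I have?",
     "Start with 3, add 2: 3 + 2 = 5. Answer: 5"),
    ("A book costs $8 and I pay with $10. What is my change?",
     "Paid 10 for an 8-dollar book: 10 - 8 = 2. Answer: 2"),
]

def few_shot_prompt(question):
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"

prompt = few_shot_prompt("I read 4 pages a day for 3 days. How many pages?")
print(prompt.count("Q:"))  # 3: two worked examples plus the new question
```

Because the in-prompt answers show their arithmetic step by step, the model tends to continue in the same step-by-step style for the new question.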

🧭 So, What’s the Real Narrative?

  1. Scalable engineering enabled predictable performance from tiny models up to GPT‑4.
  2. Cutting-edge multimodality lets GPT‑4 handle both images and text with human-level performance across tasks.
  3. Rigorous safety work: alignment via RLHF, adversarial testing, and rule-based systems significantly reduce harms.
  4. Limitations remain—hallucinations, static knowledge, and context constraints mean it’s not infallible.
  5. Ongoing work: transparency, reliability, real-time knowledge, and safer behavior are the key milestones ahead.

📚 Annotated References (APA style)

  • OpenAI. (2023). GPT‑4 technical report (arXiv:2303.08774v6). arXiv. Provides the foundational description of GPT‑4’s architecture, performance (e.g., Bar Exam top 10%), and multimodal abilities (arXiv, Emergent Mind, ResearchGate, Wikipedia).
  • Emergent Mind. (n.d.). GPT‑4 Technical Report summary. Highlights benchmark wins like 86.4% on MMLU, and reductions in toxicity and harmful responses (Emergent Mind).
  • Wikipedia contributors. (2025, July). GPT‑4. In Wikipedia. Details on GPT‑4’s release date, versions, context windows, and training overview (Wikipedia).
  • Bubeck, S., Chandrasekaran, V., Eldan, R., et al. (2023). Sparks of artificial general intelligence: Early experiments with GPT‑4 (arXiv:2303.12712). arXiv. Discusses GPT‑4’s performance and its implications for a form of early AGI (arXiv).

ChatGPT: Yes, as of mid-2025, several publicly accessible chatbots are based on GPT-4, although access levels (free vs. paid) vary depending on the platform. Here are the key ones:


Chatbots Based on GPT-4 (Public Access)

1. Microsoft Copilot / Bing Chat

  • Model: GPT-4 (with some customization)
  • Access: ✅ Free to the public
  • Platform: Bing.com, Edge browser, Microsoft Copilot apps
  • Notes:
    • Integrated with real-time web browsing.
    • Includes image generation (via DALL·E) and multimodal input in some versions.
    • Often described as GPT-4 with “tools” (plugins like search, code interpreter).

2. OpenAI ChatGPT (chat.openai.com)

  • Model:
    • GPT-3.5: ✅ Free
    • GPT-4 (specifically GPT-4-turbo): 🔒 Paid (ChatGPT Plus, $20/mo)
  • Notes:
    • Free-tier users are limited to GPT-3.5.
    • GPT-4-turbo is faster and cheaper than the original GPT-4, with a much larger context window (128k tokens).
    • Includes tools like the Python code interpreter, image understanding, and web browsing (for paid users).
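To show why the context window matters in practice, here is a sketch of the kind of history-trimming a chat client must do to stay within a model's token budget. Whitespace splitting stands in for a real tokenizer, and the numbers are invented:

```python
# Sketch of context-window management: a chat client keeps the running
# conversation within the model's token budget by dropping the oldest
# turns. Whitespace "tokens" stand in for real subword tokens.
def count_tokens(text):
    return len(text.split())  # crude stand-in for a real tokenizer

def trim_history(turns, budget):
    """Keep the most recent turns whose combined token count fits the budget."""
    kept, used = [], 0
    for turn in reversed(turns):          # walk backward from the newest turn
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))           # restore chronological order

history = ["hello there", "hi how can I help", "summarize this very long report please"]
print(trim_history(history, budget=8))
```

A 128k-token window simply means the budget is large enough that far less trimming is needed before older context starts falling out of view.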

3. Poe by Quora

  • Model options:
    • GPT-4: 🔒 Limited free access + subscription required for more use
    • GPT-3.5 and Claude Instant: ✅ Free
  • Platform: poe.com
  • Notes:
    • Users get a few GPT-4 messages per day for free, but must pay for ongoing use.

🧠 Other GPT-4 Access Points (More Specialized)

4. Perplexity.ai (Pro mode)

  • Model: GPT-4 and Claude 3 available in Pro tier
  • Access: 🔒 Subscription required for GPT-4; the free tier uses a GPT-3.5-class model.

5. Julius AI, TypingMind, and other wrappers

  • Model: Often include GPT-4 via OpenAI API
  • Access: 🔒 GPT-4 requires bringing your own API key (paid via OpenAI), but some tools allow limited free use or trials.

Summary Table

| Chatbot | GPT-4 Access | Free? | Notes |
|---|---|---|---|
| Bing Chat (Copilot) | ✅ Yes | ✅ Yes | GPT-4 w/ web tools |
| ChatGPT (OpenAI) | ✅ Yes | ❌ No (GPT-4 is paid) | Free tier = GPT-3.5 |
| Poe.com (Quora) | ✅ Yes | ⚠️ Limited | Small daily quota for free |
| Perplexity.ai (Pro) | ✅ Yes | ❌ No | Free tier = weaker models |
| Others (e.g., TypingMind) | ✅ Often | ⚠️ Limited | Usually requires API key |

ChatGPT: No, Gemini (by Google DeepMind) and Claude (by Anthropic) are not based on OpenAI’s GPT‑4 architecture, though they share some capabilities. Here’s a clearer breakdown:


🤖 Google Gemini

  • Not built on GPT‑4: it is Google’s own large multimodal model, developed by Google DeepMind.
  • Gemini uses a Mixture‑of‑Experts (MoE) architecture with separate modules for different tasks (e.g., text, image, code, audio), unlike GPT‑4’s single transformer‑decoder architecture (Wikipedia).
  • Models include Gemini Nano, Pro, and Ultra, offering flexibility across devices. Gemini Pro powers the Bard chatbot for public use (in many regions) (Ars Technica).
  • Public access: Yes—Gemini is publicly accessible via Bard and underlying Google AI services, but it is not GPT‑4 nor based on OpenAI code (Wikipedia).

🧠 Anthropic Claude

  • Claude is a series of models built independently of OpenAI’s GPT‑4. It uses Constitutional AI combined with Reinforcement Learning from Human Feedback (RLHF) to train safer and more aligned behavior (Wikipedia).
  • Claude 3 (released March 2024) includes Haiku, Sonnet, and Opus models. Opus outperformed GPT-4 and Gemini Ultra on many benchmarks and supports image input. Sonnet and Haiku offer faster or more budget-friendly options (TechCrunch).
  • Claude 4 launched in May 2025, with Opus and Sonnet. Sonnet 4 is available free to the public via claude.ai interface (Tech Funding News).
  • Public access: Yes—Claude Sonnet 4 is freely accessible on Anthropic’s consumer portal. Opus is generally available via API or enterprise platforms.

✅ Access Summary

| Model | Based on GPT‑4? | Public Access | Notes |
|---|---|---|---|
| Gemini | ❌ No | ✅ Yes | Powered via Bard; Google’s own architecture |
| Claude | ❌ No | ✅ Yes (free Sonnet) | Independent system by Anthropic |

🧾 Quick Glossary

  • GPT‑4: A multimodal transformer model by OpenAI, trained to predict next tokens and fine-tuned with RLHF using human (and AI) feedback (Tech Funding News, Wikipedia).
  • Mixture‑of‑Experts (MoE): An architecture dividing tasks among expert sub-models, as used by Gemini for modular specialization (Fireflies.ai).
  • Constitutional AI: Anthropic’s method where a model self-critiques its output based on a set of guiding ethical principles (“constitution”), enabling safer behaviors without massive human labeling (Wikipedia).
  • Context window: The maximum number of tokens (words/subwords) a model can consider in a conversation. Gemini offers windows of up to 1 million tokens and Claude up to 200,000, surpassing GPT‑4 Turbo’s ~128k-token window (Fireflies.ai).
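A toy sketch of the Mixture-of-Experts routing idea: a gate sends each input to one expert, so only part of the model runs per token. Hand-written rules stand in for the learned gating network, and the "experts" are trivial functions rather than neural sub-models:

```python
# Toy mixture-of-experts router. Real MoE layers use a learned gating
# network over neural sub-networks; here hand-written rules route each
# token to one of three trivial "experts" to show the control flow.
experts = {
    "math": lambda x: f"math expert handles: {x}",
    "code": lambda x: f"code expert handles: {x}",
    "text": lambda x: f"text expert handles: {x}",
}

def gate(token):
    """Hand-written routing rule standing in for a learned gating network."""
    if any(ch.isdigit() for ch in token):
        return "math"
    if token.endswith("()"):
        return "code"
    return "text"

def moe_layer(token):
    # Only the selected expert runs; the others stay idle for this token.
    return experts[gate(token)](token)

print(moe_layer("3+4"))       # routed to the math expert
print(moe_layer("parse()"))   # routed to the code expert
```

The efficiency win is that, per token, only the chosen expert's parameters are exercised, which is how MoE models grow total capacity without a matching growth in per-token compute.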

🧭 Conclusion

Although Claude and Gemini are often compared to GPT‑4 for benchmarking purposes, they are independently built systems, not derivatives of GPT‑4. Both models are publicly accessible: Gemini via Google Bard and Claude via Anthropic’s web app—Sonnet 4 is even free to use.
