𝔖 Scriptorium
✩   LIBER   ✩


Deep Reinforcement Learning with Python: RLHF for Chatbots and Large Language Models

✍ Scribed by Nimish Sanghi


Publisher: Apress
Year: 2024
Tongue: English
Leaves: 650
Edition: 2
Category: Library

⬇  Acquire This Volume

No coin nor oath required. For personal study only.

✩ Synopsis


Gain a theoretical understanding of the most popular libraries in deep reinforcement learning (deep RL). This new edition focuses on the latest advances in deep RL using a learn-by-coding approach, allowing readers to assimilate and replicate the latest research in the field.

New agent environments ranging from games and robotics to finance are explained to help you try different ways of applying reinforcement learning. A chapter on multi-agent reinforcement learning covers how multiple agents compete, while another chapter focuses on the widely used deep RL algorithm, proximal policy optimization (PPO). You'll see how reinforcement learning from human feedback (RLHF) has been used by chatbots built on large language models, such as ChatGPT, to improve their conversational capabilities.

You'll also review the steps for running the code on multiple cloud systems and deploying models on platforms such as the Hugging Face Hub. The code is provided as Jupyter notebooks that can be run on Google Colab and similar deep learning cloud platforms, allowing you to tailor it to your own needs.

Whether it’s for applications in gaming, robotics, or Generative AI, Deep Reinforcement Learning with Python will help keep you ahead of the curve.


What You'll Learn

  • Explore Python-based RL libraries, including Stable Baselines3 and CleanRL
  • Work with diverse RL environments like Gymnasium, PyBullet, and Unity ML
  • Understand instruction fine-tuning of Large Language Models using RLHF and PPO
  • Study training and optimization techniques using Hugging Face, Weights & Biases, and Optuna

    Who This Book Is For

    Software engineers and machine learning developers eager to sharpen their understanding of deep RL and acquire practical skills in implementing RL algorithms from scratch.


    ✩ Table of Contents


    About the Author
    About the Technical Reviewer
    Acknowledgments
    Introduction
    Chapter 1: Introduction to Reinforcement Learning
    Reinforcement Learning
    Machine Learning Branches
    Supervised Learning
    Unsupervised Learning
    Reinforcement Learning
    Emerging Sub-branches
    Self-Supervised Learning
    Generative AI
    Generative AI vs Other Learning Paradigms
    Core Elements of RL
    Deep Learning with Reinforcement Learning
    Examples and Case Studies
    Autonomous Vehicles
    Robots
    Recommendation Systems
    Finance and Trading
    Healthcare
    Large Language Models and Generative AI
    Game Playing
    Libraries and Environment Setup
    Local Install (Recommended for a Local Option)
    Local Install with VS Code
    Running on Google Colab (Recommended for a Cloud Option)
    Running on Kaggle
    Using devcontainer-Based Environments
    Running devcontainer Locally
    Running on GitHub Codespaces
    Running on AWS Studio Lab
    Running Using Lightning.ai
    Other Options to Run Code
    Summary
    Chapter 2: The Foundation: Markov Decision Processes
    Definition of Reinforcement Learning
    Agent and Environment
    Rewards
    Markov Processes
    Markov Chains
    Markov Reward Processes
    Markov Decision Processes
    Policies and Value Functions
    Bellman Equations
    Optimality Bellman Equations
    Train Your First Agent
    First Agent
    Walkthrough of Common Libraries Used
    Environments: Gymnasium and OpenAI Gym
    Stable Baselines3 (SB3)
    RL Baselines3 Zoo
    Hugging Face
    Second Agent
    RL Baselines3 Zoo
    Solution Approaches with a Mind Map
    Summary
    Chapter 3: Model-Based Approaches
    Grid World Environment
    Dynamic Programming
    Policy Evaluation/Prediction
    Policy Improvement and Iterations
    Value Iteration
    Generalized Policy Iteration
    Asynchronous Backups
    Summary
    Chapter 4: Model-Free Approaches
    Estimation/Prediction with Monte Carlo
    Bias and Variance of MC Prediction Methods
    Control with Monte Carlo
    Off-Policy MC Control
    Importance Sampling
    Temporal Difference Learning Methods
    Temporal Difference Control
    Cliff Walking
    Taxi
    Cart Pole
    On-Policy SARSA
    Q-Learning: An Off-Policy TD Control
    Maximization Bias and Double Learning
    Expected SARSA Control
    Replay Buffer and Off-Policy Learning
    Q-Learning for Continuous State Spaces
    n-Step Returns
    Eligibility Traces and TD(λ)
    Relationships Between DP, MC, and TD
    Summary
    Chapter 5: Function Approximation and Deep Learning
    Introduction
    Theory of Approximation
    Coarse Coding
    Tile Encoding
    Challenges in Approximation
    Incremental Prediction: MC, TD, TD(λ)
    Incremental Control
    Semi-gradient n-step SARSA Control
    Semi-gradient SARSA(λ) Control
    Convergence in Functional Approximation
    Gradient Temporal Difference Learning
    Batch Methods (DQN)
    Linear Least Squares Method
    Deep Learning Libraries
    PyTorch
    What Are Neural Networks
    Training with Back-Propagation
    PyTorch Lightning
    TensorFlow
    Summary
    Chapter 6: Deep Q-Learning (DQN)
    Deep Q Networks
    OpenAI Gym vs Farama Gymnasium
    Recording Videos of Trained Agents
    End-to-End Training with SB3
    End-to-End Training with SB3 Zoo
    Hyperparameter Optimization
    Integration with the Rliable Library
    Atari Game-Playing Agent Using DQN
    Atari Environment in Gymnasium
    Preprocessing and Training
    Overview of Various RL Environments and Libraries
    PyGame
    MuJoCo
    Unity ML Agents
    PettingZoo
    Bullet Physics Engine and Related Environments
    CleanRL
    MineRL
    FinRL
    FlappyBird Environment
    Summary
    Chapter 7: Improvements to DQN
    Prioritized Replay
    Double DQN (DDQN)
    Dueling DQN
    NoisyNets DQN
    Categorical 51-Atom DQN (C51)
    Quantile Regression DQN
    Hindsight Experience Replay
    Summary
    Chapter 8: Policy Gradient Algorithms
    Introduction
    Pros and Cons of Policy-Based Methods
    Policy Representation
    Discrete Cases
    Continuous Cases
    Policy Gradient Derivation
    Objective Function
    Derivative Update Rule
    Intuition Behind the Update Rule
    The REINFORCE Algorithm
    Variance Reduction with Rewards-to-Go
    Further Variance Reduction with Baselines
    Actor-Critic Methods
    Defining Advantage
    Advantage Actor-Critic (A2C)
    Implementation of the A2C Algorithm
    Asynchronous Advantage Actor-Critic
    Trust Region Policy Optimization Algorithm
    Proximal Policy Optimization Algorithm (PPO)
    Curiosity-Driven Learning
    Summary
    Chapter 9: Combining Policy Gradient and Q-Learning
    Tradeoffs in Policy Gradient and Q-Learning
    General Framework to Combine Policy Gradient with Q-Learning
    Deep Deterministic Policy Gradient
    Q-Learning in DDPG (Critic)
    Policy Learning in DDPG (Actor)
    Pseudocode and Implementation
    Gymnasium Environments Used in Code
    Code Listing
    Policy Network Actor
    Q-Network Critic Implementation
    Combined Model-Actor-Critic Implementation
    Experience Replay
    Q-Loss Implementation
    Policy Loss Implementation
    One-Step Update Implementation
    DDPG: Main Loop
    Twin Delayed DDPG
    Target-Policy Smoothing
    Q-Loss (Critic)
    Policy Loss (Actor)
    Delayed Update
    Pseudocode and Implementation
    Code Implementation
    Combined Model-Actor-Critic Implementation
    Q-Loss Implementation
    Policy-Loss Implementation
    One-Step Update Implementation
    TD3 Main Loop
    Reparameterization Trick
    Score/Reinforce Way
    Reparameterization Trick and Pathwise Derivatives
    Experiment
    Entropy Explained
    Soft Actor-Critic
    SAC vs. TD3
    Q-Loss with Entropy-Regularization
    Policy Loss with the Reparameterization Trick
    Pseudocode and Implementation
    Policy Network-Actor Implementation
    Q-Network, Combined Model, and Experience Replay
    Q-Loss and Policy-Loss Implementation
    One-Step Update and SAC Main Loop
    Summary
    Chapter 10: Integrated Planning and Learning
    Model-Based Reinforcement Learning
    Planning with a Learned Model
    Integrating Learning and Planning (Dyna)
    Dyna Q and Changing Environments
    Dyna Q+
    Expected vs. Sample Updates
    Exploration vs. Exploitation
    Multi-Arm Bandit
    Regret: Measure the Quality of Exploration
    Epsilon Greedy Exploration
    Upper Confidence Bound Exploration
    Thompson Sampling Exploration
    Comparing Different Exploration Strategies
    Planning at Decision Time and Monte Carlo Tree Search
    Example Uses of MCTS
    AlphaGo
    AlphaGo Zero and AlphaZero
    AlphaFold with MCTS
    Use of MCTS in Other Domains
    Summary
    Chapter 11: Proximal Policy Optimization (PPO) and RLHF
    Theoretical Foundations of PPO

    Score Function and MLE Estimator
    Fisher Information Matrix (FIM) and Hessian
    Natural Gradient Method
    Trust Region Policy Optimization (TRPO)
    PPO Deep Dive
    PPO CLIP Objective
    Advantage Calculation
    Value and Entropy Loss Objectives
    Implementation Details of PPO
    1. Vectorized Environment
    2. Parameter Initialization
    3. Adam Optimizer’s Epsilon Parameter
    4. Adam Learning Rate Annealing
    5. Generalized Advantage Estimation
    6. Mini-Batch Updates
    7. Normalization of Advantages
    8. Clipped Surrogate Objective
    9. Value Function Loss Clipping
    10. Overall Loss and Entropy Bonus
    11. Global Gradient Clipping
    12. Debug Variables
    13. Shared and Separate MLP Networks for Policy and Value Functions
    Running CleanRL PPO
    Asynchronous PPO
    Large Language Models
    Prompt Engineering
    Prompting Techniques
    RAG and Chat Bots
    LLMs as Operating Systems
    Fine-Tuning
    Parameter Efficient Fine-Tuning (PEFT)
    Chaining LLMs Together
    Auto Agents
    Multimodal Generative AI
    RL with Human Feedback
    Latest Advances in LLM Alignment
    Libraries and Frameworks for RLHF
    VertexAI from Google
    SageMaker from AWS Using Trlx
    TRL Library from HuggingFace
    Walkthrough of RLHF Tuning
    Summary
    Chapter 12: Multi-Agent RL (MARL)
    Key Challenges in MARL
    MARL Taxonomy
    Communication Between Agents
    Mapping with Game Theory
    Solutions in MARL
    MARL and Core Algorithms
    Value Iteration
    TD Approach with Joint Action Learning
    Minimax Q-Learning
    Nash Q-Learning
    Correlated Q-Learning
    Assumptions on Agents
    Policy-Based Learning
    No-Regret Learning
    Deep MARL
    PettingZoo Library
    Sample Training
    Summary
    Chapter 13: Additional Topics and Recent Advances
    Other Interesting RL Environments
    MineRL
    Donkey Car RL
    FinRL
    StarCraft II: PySC2
    Godot RL Agents
    Model-Based RL: Additional Approaches
    World Models
    Imagination-Augmented Agents (I2A)
    Model-Based RL with Model-Free Fine-Tuning (MBMF)
    Model-Based Value Expansion (MBVE)
    IRIS: Transformers as World Models
    Causal World Models
    Offline RL
    Decision Transformers
    Automatic Curriculum Learning
    Imitation Learning and Inverse Reinforcement Learning
    Derivative-Free Methods
    Transfer Learning and Multitask Learning
    Meta-Learning
    Unsupervised Zero-Shot Reinforcement Learning
    REINFORCE Learning from Human Feedback in LLMs
    How to Continue Studying
    Summary
    Index


    📜 SIMILAR VOLUMES


    Deep Reinforcement Learning with Python
    ✍ Nimish Sanghi 📂 Library 📅 2024 🏛 Apress 🌐 English

    Gain a theoretical understanding of the most popular libraries in deep reinforcement learning (deep RL). This new edition focuses on the latest advances in deep RL using a learn-by-coding approach, allowing readers to assimilate and replicate the latest research in this field…

    Natural Language Understanding with Python
    ✍ Deborah A. Dahl 📂 Library 📅 2023 🏛 Packt Publishing Pvt Ltd 🌐 English

    Build advanced Natural Language Understanding systems by acquiring data and selecting appropriate technology. Key Features: Master NLU concepts from basic text processing to advanced deep learning techniques; explore practical NLU applications like chatbots, sentiment analysis, and language translation…

    Mastering Large Language Models with Python
    ✍ Raj Arun R 📂 Library 📅 2024 🏛 Orange Education Pvt Ltd 🌐 English

    "Mastering Large Language Models with Python" is an indispensable resource that offers a comprehensive exploration of Large Language Models (LLMs), providing the essential knowledge to leverage these transformative AI models effectively. From unraveling the intricacies of LLM architecture to practical…