𝔖 Scriptorium
✩   LIBER   ✩


Deep Reinforcement Learning with Python: RLHF for Chatbots and Large Language Models

✍ Scribed by Nimish Sanghi


Publisher: Apress
Year: 2024
Tongue: English
Leaves: 650
Edition: 2
Category: Library

⬇  Acquire This Volume

No coin nor oath required. For personal study only.

✩ Synopsis


Gain a theoretical understanding of the most popular libraries in deep reinforcement learning (deep RL). This new edition focuses on the latest advances in deep RL using a learn-by-coding approach, allowing readers to assimilate and replicate the latest research in the field.

New agent environments ranging from games and robotics to finance are explained to help you try different ways of applying reinforcement learning. A chapter on multi-agent reinforcement learning covers how multiple agents compete, while another chapter focuses on the widely used deep RL algorithm, proximal policy optimization (PPO). You'll see how reinforcement learning from human feedback (RLHF) has been used by chatbots built on large language models, such as ChatGPT, to improve their conversational capabilities.

You'll also review the steps for running the code on multiple cloud systems and deploying models on platforms such as the Hugging Face Hub. The code is provided as Jupyter notebooks that can be run on Google Colab and similar deep learning cloud platforms, allowing you to tailor it to your own needs.

Whether it’s for applications in gaming, robotics, or Generative AI, Deep Reinforcement Learning with Python will help keep you ahead of the curve.


What You'll Learn

  • Explore Python-based RL libraries, including Stable Baselines3 and CleanRL
  • Work with diverse RL environments like Gymnasium, PyBullet, and Unity ML
  • Understand instruction fine-tuning of Large Language Models using RLHF and PPO
  • Study training and optimization techniques using Hugging Face, Weights & Biases, and Optuna

    Who This Book Is For

    Software engineers and machine learning developers eager to sharpen their understanding of deep RL and acquire practical skills in implementing RL algorithms from scratch.


    ✩ Table of Contents


    About the Author
    About the Technical Reviewer
    Acknowledgments
    Introduction
    Chapter 1: Introduction to Reinforcement Learning
    Reinforcement Learning
    Machine Learning Branches
    Supervised Learning
    Unsupervised Learning
    Reinforcement Learning
    Emerging Sub-branches
    Self-Supervised Learning
    Generative AI
    Generative AI vs Other Learning Paradigms
    Core Elements of RL
    Deep Learning with Reinforcement Learning
    Examples and Case Studies
    Autonomous Vehicles
    Robots
    Recommendation Systems
    Finance and Trading
    Healthcare
    Large Language Models and Generative AI
    Game Playing
    Libraries and Environment Setup
    Local Install (Recommended for a Local Option)
    Local Install with VS Code
    Running on Google Colab (Recommended for a Cloud Option)
    Running on Kaggle
    Using devcontainer-Based Environments
    Running devcontainer Locally
    Running on GitHub Codespaces
    Running on AWS Studio Lab
    Running Using Lightning.ai
    Other Options to Run Code
    Summary
    Chapter 2: The Foundation: Markov Decision Processes
    Definition of Reinforcement Learning
    Agent and Environment
    Rewards
    Markov Processes
    Markov Chains
    Markov Reward Processes
    Markov Decision Processes
    Policies and Value Functions
    Bellman Equations
    Optimality Bellman Equations
    Train Your First Agent
    First Agent
    Walkthrough of Common Libraries Used
    Environments: Gymnasium and OpenAI Gym
    Stable Baselines3 (SB3)
    RL Baselines3 Zoo
    Hugging Face
    Second Agent
    RL Baselines3 Zoo
    Solution Approaches with a Mind Map
    Summary
    Chapter 3: Model-Based Approaches
    Grid World Environment
    Dynamic Programming
    Policy Evaluation/Prediction
    Policy Improvement and Iterations
    Value Iteration
    Generalized Policy Iteration
    Asynchronous Backups
    Summary
    Chapter 4: Model-Free Approaches
    Estimation/Prediction with Monte Carlo
    Bias and Variance of MC Prediction Methods
    Control with Monte Carlo
    Off-Policy MC Control
    Importance Sampling
    Temporal Difference Learning Methods
    Temporal Difference Control
    Cliff Walking
    Taxi
    Cart Pole
    On-Policy SARSA
    Q-Learning: An Off-Policy TD Control
    Maximization Bias and Double Learning
    Expected SARSA Control
    Replay Buffer and Off-Policy Learning
    Q-Learning for Continuous State Spaces
    n-Step Returns
    Eligibility Traces and TD(λ)
    Relationships Between DP, MC, and TD
    Summary
    Chapter 5: Function Approximation and Deep Learning
    Introduction
    Theory of Approximation
    Coarse Coding
    Tile Encoding
    Challenges in Approximation
    Incremental Prediction: MC, TD, TD(λ)
    Incremental Control
    Semi-gradient n-step SARSA Control
    Semi-gradient SARSA(λ) Control
    Convergence in Functional Approximation
    Gradient Temporal Difference Learning
    Batch Methods (DQN)
    Linear Least Squares Method
    Deep Learning Libraries
    PyTorch
    What Are Neural Networks
    Training with Back-Propagation
    PyTorch Lightning
    TensorFlow
    Summary
    Chapter 6: Deep Q-Learning (DQN)
    Deep Q Networks
    OpenAI Gym vs Farama Gymnasium
    Recording Videos of Trained Agents
    End-to-End Training with SB3
    End-to-End Training with SB3 Zoo
    Hyperparameter Optimization
    Integration with the Rliable Library
    Atari Game-Playing Agent Using DQN
    Atari Environment in Gymnasium
    Preprocessing and Training
    Overview of Various RL Environments and Libraries
    PyGame
    MuJoCo
    Unity ML Agents
    PettingZoo
    Bullet Physics Engine and Related Environments
    CleanRL
    MineRL
    FinRL
    FlappyBird Environment
    Summary
    Chapter 7: Improvements to DQN
    Prioritized Replay
    Double DQN (DDQN)
    Dueling DQN
    NoisyNets DQN
    Categorical 51-Atom DQN (C51)
    Quantile Regression DQN
    Hindsight Experience Replay
    Summary
    Chapter 8: Policy Gradient Algorithms
    Introduction
    Pros and Cons of Policy-Based Methods
    Policy Representation
    Discrete Cases
    Continuous Cases
    Policy Gradient Derivation
    Objective Function
    Derivative Update Rule
    Intuition Behind the Update Rule
    The REINFORCE Algorithm
    Variance Reduction with Rewards-to-Go
    Further Variance Reduction with Baselines
    Actor-Critic Methods
    Defining Advantage
    Advantage Actor-Critic (A2C)
    Implementation of the A2C Algorithm
    Asynchronous Advantage Actor-Critic
    Trust Region Policy Optimization Algorithm
    Proximal Policy Optimization Algorithm (PPO)
    Curiosity-Driven Learning
    Summary
    Chapter 9: Combining Policy Gradient and Q-Learning
    Tradeoffs in Policy Gradient and Q-Learning
    General Framework to Combine Policy Gradient with Q-Learning
    Deep Deterministic Policy Gradient
    Q-Learning in DDPG (Critic)
    Policy Learning in DDPG (Actor)
    Pseudocode and Implementation
    Gymnasium Environments Used in Code
    Code Listing
    Policy Network Actor
    Q-Network Critic Implementation
    Combined Model-Actor-Critic Implementation
    Experience Replay
    Q-Loss Implementation
    Policy Loss Implementation
    One-Step Update Implementation
    DDPG: Main Loop
    Twin Delayed DDPG
    Target-Policy Smoothing
    Q-Loss (Critic)
    Policy Loss (Actor)
    Delayed Update
    Pseudocode and Implementation
    Code Implementation
    Combined Model-Actor-Critic Implementation
    Q-Loss Implementation
    Policy-Loss Implementation
    One-Step Update Implementation
    TD3 Main Loop
    Reparameterization Trick
    Score/Reinforce Way
    Reparameterization Trick and Pathwise Derivatives
    Experiment
    Entropy Explained
    Soft Actor-Critic
    SAC vs. TD3
    Q-Loss with Entropy-Regularization
    Policy Loss with the Reparameterization Trick
    Pseudocode and Implementation
    Policy Network-Actor Implementation
    Q-Network, Combined Model, and Experience Replay
    Q-Loss and Policy-Loss Implementation
    One-Step Update and SAC Main Loop
    Summary
    Chapter 10: Integrated Planning and Learning
    Model-Based Reinforcement Learning
    Planning with a Learned Model
    Integrating Learning and Planning (Dyna)
    Dyna Q and Changing Environments
    Dyna Q+
    Expected vs. Sample Updates
    Exploration vs. Exploitation
    Multi-Arm Bandit
    Regret: Measure the Quality of Exploration
    Epsilon Greedy Exploration
    Upper Confidence Bound Exploration
    Thompson Sampling Exploration
    Comparing Different Exploration Strategies
    Planning at Decision Time and Monte Carlo Tree Search
    Example Uses of MCTS
    AlphaGo
    AlphaGo Zero and AlphaZero
    AlphaFold with MCTS
    Use of MCTS in Other Domains
    Summary
    Chapter 11: Proximal Policy Optimization (PPO) and RLHF
    Theoretical Foundations of PPO

    Score Function and MLE Estimator
    Fisher Information Matrix (FIM) and Hessian
    Natural Gradient Method
    Trust Region Policy Optimization (TRPO)
    PPO Deep Dive
    PPO CLIP Objective
    Advantage Calculation
    Value and Entropy Loss Objectives
    Implementation Details of PPO
    1. Vectorized Environment
    2. Parameter Initialization
    3. Adam Optimizer’s Epsilon Parameter
    4. Adam Learning Rate Annealing
    5. Generalized Advantage Estimation
    6. Mini-Batch Updates
    7. Normalization of Advantages
    8. Clipped Surrogate Objective
    9. Value Function Loss Clipping
    10. Overall Loss and Entropy Bonus
    11. Global Gradient Clipping
    12. Debug Variables
    13. Shared and Separate MLP Networks for Policy and Value Functions
    Running CleanRL PPO
    Asynchronous PPO
    Large Language Models
    Prompt Engineering
    Prompting Techniques
    RAG and Chat Bots
    LLMs as Operating Systems
    Fine-Tuning
    Parameter Efficient Fine-Tuning (PEFT)
    Chaining LLMs Together
    Auto Agents
    Multimodal Generative AI
    RL with Human Feedback
    Latest Advances in LLM Alignment
    Libraries and Frameworks for RLHF
    VertexAI from Google
    SageMaker from AWS Using Trlx
    TRL Library from HuggingFace
    Walkthrough of RLHF Tuning
    Summary
    Chapter 12: Multi-Agent RL (MARL)
    Key Challenges in MARL
    MARL Taxonomy
    Communication Between Agents
    Mapping with Game Theory
    Solutions in MARL
    MARL and Core Algorithms
    Value Iteration
    TD Approach with Joint Action Learning
    Minimax Q-Learning
    Nash Q-Learning
    Correlated Q-Learning
    Assumptions on Agents
    Policy-Based Learning
    No-Regret Learning
    Deep MARL
    PettingZoo Library
    Sample Training
    Summary
    Chapter 13: Additional Topics and Recent Advances
    Other Interesting RL Environments
    MineRL
    Donkey Car RL
    FinRL
    StarCraft II: PySC2
    Godot RL Agents
    Model-Based RL: Additional Approaches
    World Models
    Imagination-Augmented Agents (I2A)
    Model-Based RL with Model-Free Fine-Tuning (MBMF)
    Model-Based Value Expansion (MBVE)
    IRIS: Transformers as World Models
    Causal World Models
    Offline RL
    Decision Transformers
    Automatic Curriculum Learning
    Imitation Learning and Inverse Reinforcement Learning
    Derivative-Free Methods
    Transfer Learning and Multitask Learning
    Meta-Learning
    Unsupervised Zero-Shot Reinforcement Learning
    REINFORCE Learning from Human Feedback in LLMs
    How to Continue Studying
    Summary
    Index


    📜 SIMILAR VOLUMES


    Deep Reinforcement Learning with Python
    ✍ Nimish Sanghi 📂 Library 📅 2024 🏛 Apress 🌐 English

    Gain a theoretical understanding of the most popular libraries in deep reinforcement learning (deep RL). This new edition focuses on the latest advances in deep RL using a learn-by-coding approach, allowing readers to assimilate and replicate the latest research in this field…

    Natural Language Understanding with Python
    ✍ Deborah A. Dahl 📂 Library 📅 2023 🏛 Packt Publishing Pvt Ltd 🌐 English

    Build advanced Natural Language Understanding systems by acquiring data and selecting appropriate technology. Key Features: Master NLU concepts from basic text processing to advanced deep learning techniques; explore practical NLU applications like chatbots, sentiment analysis, and language translation…

    Mastering Large Language Models with Python
    ✍ Raj Arun R 📂 Library 📅 2024 🏛 Orange Education Pvt Ltd 🌐 English

    "Mastering Large Language Models with Python" is an indispensable resource that offers a comprehensive exploration of Large Language Models (LLMs), providing the essential knowledge to leverage these transformative AI models effectively. From unraveling the intricacies of LLM architecture to practical…