Skip to content

Latest commit

 

History

History
150 lines (124 loc) · 7.9 KB

README.md

File metadata and controls

150 lines (124 loc) · 7.9 KB

GPT-X

Implementation of autoregressive language model(like GPT) using improved Transformer and DeepSpeed pipeline parallelism.

Improved Transformer

Transformer used in this repository attempts to improve the transformer using the additional modules below.

Name Description Link
Rezero Rezero Is All You Need link
Explicit Sparse Transformer Concentrated Attention Through Explicit Selection link
Macaron Architecture Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View link
RealFormer Residual Attention link
ALiBi Position Embedding effective relative positional encoding

Model Description

model_name n_params n_layer d_model n_heads vocab_size max_seq_len learning_rate
GPT-X 1B 1B 20 2048 16 22000 1024 2.0 x 10^-4

DeepSpeed

DeepSpeed is a deep learning training optimization library, providing the means to train massive billion parameter models at scale.

Piepline Parallelism

You can train 1B GPT-X Model using deepspeed pipeline parallelism on 2 V100 GPU(16G).

GPU Usage

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:00:06.0 Off |                    0 |
| N/A   42C    P0    44W / 250W |  16076MiB / 16130MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   45C    P0   168W / 250W |  16060MiB / 16130MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     29525      C   /home/ubuntu/anaconda3/bin/python          16065MiB |
|    1     29528      C   /home/ubuntu/anaconda3/bin/python          16049MiB |
+-----------------------------------------------------------------------------+

Pipeline Parallelism Log

[2021-12-31 12:24:20,042] [INFO] [engine.py:93:__init__] CONFIG: micro_batches=4 micro_batch_size=1
[2021-12-31 12:24:20,094] [INFO] [engine.py:151:__init__] RANK=1 STAGE=1 LAYERS=12 [11, 23) STAGE_PARAMS=548560916 (548.561M) TOTAL_PARAMS=1099214888 (1099.215M) UNIQUE_PARAMS=1099214888 (1099.215M)
[2021-12-31 12:24:20,094] [INFO] [engine.py:151:__init__] RANK=0 STAGE=0 LAYERS=11 [0, 11) STAGE_PARAMS=550653972 (550.654M) TOTAL_PARAMS=1099214888 (1099.215M) UNIQUE_PARAMS=1099214888 (1099.215M)
[2021-12-31 12:24:08,793] [INFO] [module.py:365:_partition_layers] Partitioning pipeline stages with method parameters
stage=0 layers=11
     0: Embedding
     1: ReZeroSparseTopkDecoder
     2: ReZeroSparseTopkDecoder
     3: ReZeroSparseTopkDecoder
     4: ReZeroSparseTopkDecoder
     5: ReZeroSparseTopkDecoder
     6: ReZeroSparseTopkDecoder
     7: ReZeroSparseTopkDecoder
     8: ReZeroSparseTopkDecoder
     9: ReZeroSparseTopkDecoder
    10: ReZeroSparseTopkDecoder
stage=1 layers=12
    11: ReZeroSparseTopkDecoder
    12: ReZeroSparseTopkDecoder
    13: ReZeroSparseTopkDecoder
    14: ReZeroSparseTopkDecoder
    15: ReZeroSparseTopkDecoder
    16: ReZeroSparseTopkDecoder
    17: ReZeroSparseTopkDecoder
    18: ReZeroSparseTopkDecoder
    19: ReZeroSparseTopkDecoder
    20: ReZeroSparseTopkDecoder
    21: LayerNorm
    22: Linear
  loss: cross_entropy

TODO

  • ReZero
  • RealFormer, Residual Attention
  • Macaron architectures
  • Macaron architectures - layer Scale 0.5
  • Explicit Sparse Transformer
  • torch lightning
  • Deepspeed train on single GPU
  • apply wandb
  • Deepspeed pipeline parallel trainig on 2 V100 GPU with 16GB Memory

Parameter For Few-shot

GPT-3 has a 175B parameter, and the size of the model is important for few-shot learning. In this repository, I try to pretrain language model as large as possible using 2 V100 GPUs.

GPT-3 Config

model_name n_params n_layer d_model n_heads d_head batch_size learning_rate
GPT-3 175B 175B 96 12288 96 128 3.2M 0.6 x 10^-4
GPT-3 13B 13B 40 5140 40 128 2M 1.0 x 10^-4
GPT-3 6.7B 6.7B 32 4096 32 128 2M 1.2 x 10^-4
GPT-3 2.7B 2.7B 32 2560 32 80 1M 1.6 x 10^-4
GPT-3 1.3B 1.3B 24 2048 24 128 1M 2.0 x 10^-4

Issue

  • AttributeError: module 'deepspeed' has no attribute 'zero': reinstall deepspeed

  • userwarning: cuda initialization: the nvidia driver on your system is too old: reinstall pytorch following by cuda version my solution-GPU V100, cuda 10.1

    pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
  • can't find CUDA_HOME path: reinstall cuda

References

Transformer

DeepSpeed

ReZero

Explicit Sparse Transformer

Macaron Architecrue

RealFormer Residual Attention

DeepSpeed

Pipeline Parallelism