[Llama-recipes] A Close Look at Readme.md

This post uses the Llama 2 model to cover

(1) how to fine-tune on custom data

(2) how to build a chatbot using custom documents and RAG.

Most of the content is based on the sites below.

GitHub - facebookresearch/llama-recipes: Scripts for fine-tuning Llama2 with composable FSDP & PEFT methods to cover single/multi-node GPUs. Supports default & custom datasets for applications such as summarization & question ... (github.com)

GitHub - meta-llama/llama-recipes: Scripts for fine-tuning Llama2 with composable FSDP & PEFT methods to cover single/multi-node GPUs. Supports default & custom datasets for applications such as summarization & question ... (github.com)

(The content is the same, but there are two repositories;)

<Table of Contents>

1. Installation

2. Model Conversion to Hugging Face (skipped)

3. Introduction to Fine-Tuning Methods

3-1. Single- and Multi-GPU Fine-Tuning

Single GPU:

Multiple GPUs One Node:

4. Flash Attention and Xformer Memory Efficient Kernels

5. Weights & Biases Experiment Tracking

BitsandBytes Error

$ python -m "torch.utils.collect_env"

1. Installation

pip install --extra-index-url https://download.pytorch.org/whl/test/cu116 llama-recipes
pip install --extra-index-url https://download.pytorch.org/whl/test/cu116 llama-recipes[tests]
pip install --extra-index-url https://download.pytorch.org/whl/test/cu116 llama-recipes[vllm]
pip install --extra-index-url https://download.pytorch.org/whl/test/cu116 llama-recipes[auditnlg]

pip install transformers \
            datasets \
            accelerate \
            sentencepiece \
            protobuf==3.20 \
            py7zr scipy \
            peft bitsandbytes fire \
            torch_tb_profiler ipywidgets

※ Some features (for example, fine-tuning with FSDP + PEFT) require a PyTorch nightly build.

For installing a nightly build, see this guide: https://pytorch.org/get-started/locally/
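After the installs above, it can help to confirm which versions actually landed in the environment, since the torch version turns out to matter later in this post. The following is a minimal sketch using only the standard library; the package list is my own selection from the pip commands above, not something the README prescribes.

```python
from importlib import metadata

# Package names picked from the pip commands above (an illustrative subset).
REQUIRED = ["torch", "transformers", "peft", "bitsandbytes", "llama-recipes"]


def installed_versions(packages):
    """Return {package: version string, or None when it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

A `None` entry means the package is missing from the active environment, which is worth checking before moving on to fine-tuning.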

2. Model Conversion to Hugging Face (skipped)

I used a Llama 2 model already converted to the Hugging Face format (downloaded rather than converted myself).

(Download link: meta-llama (Meta Llama 2) (huggingface.co))

※ The installation above pulls in a torch 2.x release, which was too new for my GPU driver (CUDA 11.6).

※ When I downloaded them, Llama-2-7b came to about 12.5 GB and Llama-2-7b-hf to about 50.2 GB. (A roughly fourfold size difference??)

3. Fine-Tuning

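- Recipes for PEFT, FSDP, and PEFT+FSDP are included, along with a few test datasets, for fine-tuning Llama 2 models to domain-specific use cases.

3-1. Single- and Multi-GPU Fine-Tuning

- The examples below cover fine-tuning on a single GPU such as an A10, T4, V100, or A100.

- All parameters in the examples need to be tuned later (depending on the model, method, data, and task).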

3-1. Running on a Single GPU

- dataset: To change the dataset in the command below, pass the dataset arg. The current options for integrated datasets are grammar_dataset, alpaca_dataset, and samsum_dataset. The OpenAssistant/oasst1 dataset is also integrated as an example of a custom dataset.

- For how to use your own data and how to add a custom dataset to datasets, see Dataset.md.
- The default dataset and LoRA settings are tuned for samsum_dataset.

- You can run training by filling in the appropriate paths in 'llama_recipes/configs/training.py'.

(However, llama_recipes was installed with pip above. Edits to configs/training.py therefore take effect only after llama_recipes is reinstalled. For that reason, as mentioned again below, I recommend not editing the configs/XXX files.)

- To save the evaluation and perplexity metrics, pass --save_metrics as an argument. The saved values can then be plotted with examples/plot_metrics.py.
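For readers unfamiliar with the perplexity metric mentioned above: it is conventionally the exponential of the mean cross-entropy loss per token. The sketch below shows that relationship only; the loss values are made up for illustration and this is not the llama-recipes implementation.

```python
import math


def perplexity(token_losses):
    """Perplexity = exp(mean cross-entropy loss, in nats per token)."""
    if not token_losses:
        raise ValueError("need at least one loss value")
    return math.exp(sum(token_losses) / len(token_losses))


# Hypothetical loss values, made up purely for illustration.
losses = [2.0, 1.5, 1.0]
print(f"mean loss = {sum(losses) / len(losses):.2f}, perplexity = {perplexity(losses):.2f}")
```

A lower loss therefore always means a lower perplexity, and a loss of 0 corresponds to a perplexity of exactly 1.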

#if running on multi-gpu machine
export CUDA_VISIBLE_DEVICES=0

python -m llama_recipes.finetuning \
--use_peft \
--peft_method lora \
--quantization \
--model_name /path_of_model_folder/7B \
--output_dir path/to/save/PEFT/model

Here LoRA is used as the PEFT (Parameter-Efficient Fine-Tuning) method. Other options such as llama_adapter and prefix can also be set.
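To give a feel for why LoRA is so much cheaper than full fine-tuning, here is a rough parameter count. It is a sketch under stated assumptions: the hidden size (4096) and layer count (32) are those of Llama 2 7B, while the rank of 8 and the choice of adapting only q_proj and v_proj are assumed settings that may differ from your configuration.

```python
def lora_params_per_matrix(d_in, d_out, rank):
    """LoRA adds two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


HIDDEN = 4096          # Llama 2 7B hidden size
LAYERS = 32            # Llama 2 7B decoder layers
RANK = 8               # assumed LoRA rank
ADAPTED_PER_LAYER = 2  # assumed: only q_proj and v_proj are adapted

full_matrix = HIDDEN * HIDDEN
lora_matrix = lora_params_per_matrix(HIDDEN, HIDDEN, RANK)
total_lora = lora_matrix * ADAPTED_PER_LAYER * LAYERS

print(f"one full 4096x4096 projection: {full_matrix:,} parameters")
print(f"its LoRA adapter (rank {RANK}):   {lora_matrix:,} parameters")
print(f"all adapters in the model:     {total_lora:,} trainable parameters")
```

Under these assumptions only a few million parameters are trained, against roughly 7 billion frozen ones.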

For the full set of options, check the values under llama_recipes/configs; since they can all be passed as arguments, I recommend not editing the files directly.
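The idea of keeping the config files untouched and overriding values from the command line can be sketched as follows. `TrainConfig` and `apply_overrides` are hypothetical stand-ins written for this post, not the actual llama_recipes classes or helpers; only the field names are borrowed from the flags used above.

```python
from dataclasses import dataclass, fields


@dataclass
class TrainConfig:
    """Hypothetical stand-in for a training config (not the real llama_recipes class)."""
    model_name: str = "PATH/to/model"
    output_dir: str = "PATH/to/save/PEFT/model"
    use_peft: bool = False
    peft_method: str = "lora"
    quantization: bool = False


def apply_overrides(config, **overrides):
    """Copy known keys onto the config and reject unknown ones."""
    known = {f.name for f in fields(config)}
    for key, value in overrides.items():
        if key not in known:
            raise KeyError(f"unknown config field: {key}")
        setattr(config, key, value)
    return config


# Defaults stay in the file; only the values passed in are changed.
cfg = apply_overrides(TrainConfig(), use_peft=True, quantization=True)
print(cfg)
```

The defaults live in one place, and each run only states what differs, which is why editing the installed config files is unnecessary.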

Single GPU:

<When running via the installed llama_recipes module>

$ python -m llama_recipes.finetuning --model_name "C:/DDrive/000_MasterSeries/104_LLM/Llama-2-7b-hf" --output_dir "C:/DDrive/000_MasterSeries/104_LLM/Llama-2-7b-hf_output" --use_peft --peft_method lora --quantization

$ python -m llama_recipes.finetuning --model_name "/home/user/DATA_LOCAL/llama2/Llama-2-7b-hf" --output_dir "/home/user/DATA_LOCAL2/04_Generation/1_llama/Llama-2-7b-hf_output" --use_peft --peft_method lora --quantization

$ python -m llama_recipes.finetuning --model_name "/workspace/DATA_LOCAL2/04_Generation/1_llama/Llama-2-7b-hf" --output_dir "/workspace/DATA_LOCAL2/04_Generation/1_llama/Llama-2-7b-hf_output" --use_peft --peft_method lora --quantization

<When running directly from the source code>

$ cd llama-recipes/src

$ python llama_recipes\finetuning.py --model_name "C:/DDrive/000_MasterSeries/104_LLM/Llama-2-7b-hf" --output_dir "C:/DDrive/000_MasterSeries/104_LLM/Llama-2-7b-hf_output" --use_peft --peft_method lora --quantization

$ python llama_recipes/finetuning.py --model_name "/home/user/DATA_LOCAL/llama2/Llama-2-7b-hf" --output_dir "/home/user/DATA_LOCAL2/04_Generation/1_llama/Llama-2-7b-hf_output" --use_peft --peft_method lora --quantization

==> On my GPU (an 8 GB GeForce RTX 30-series Ti card), this fails with an out-of-memory error. The example appears to need a single GPU with 24 GB or more.

==> "Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom device_map to from_pretrained."
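A back-of-the-envelope estimate helps explain the out-of-memory error on an 8 GB card. The sketch below counts weight memory only, using a rounded 7 billion parameters (an approximation); activations, LoRA gradients, optimizer state, and the CUDA context all come on top of this.

```python
GIB = 1024 ** 3


def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed just to hold the weights, in GiB."""
    return num_params * bytes_per_param / GIB


PARAMS_7B = 7e9  # rounded parameter count, an approximation

for label, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    print(f"{label:10s} {weight_memory_gib(PARAMS_7B, nbytes):6.2f} GiB")
```

Even with int8 quantization the weights alone take most of an 8 GB card, so there is little room left for everything else training needs.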

Multiple GPUs One Node:

※ Using PEFT+FSDP requires a PyTorch nightly build.

bitsandbytes int8 quantization is currently not supported with FSDP.

torchrun \
--nnodes 1 \
--nproc_per_node 4 \
examples/finetuning.py \
--enable_fsdp \
--use_peft \
--peft_method lora \
--model_name "/home/user/DATA_LOCAL/llama2/Llama-2-7b-hf" \
--fsdp_config.pure_bf16 \
--output_dir "/home/user/DATA_LOCAL2/04_Generation/1_llama/Llama-2-7b-hf_output"

- FSDP is used here, which can be combined with PEFT methods.

- To use PEFT together with FSDP, pass the --enable_fsdp, --use_peft, and --peft_method arguments.

- BF16 is used here (--fsdp_config.pure_bf16).

4. Flash Attention and Xformer Memory Efficient Kernels

- Passing the --use_fast_kernels argument enables the Flash Attention or Xformer memory-efficient kernels.

- This speeds up fine-tuning.

- The Flash Attention and Xformer memory-efficient kernels are exposed as a one-liner API in Hugging Face's optimum library.

torchrun \
--nnodes 1 \
--nproc_per_node 4 \
examples/finetuning.py \
--enable_fsdp \
--use_peft \
--peft_method lora \
--model_name /path_of_model_folder/7B \
--fsdp_config.pure_bf16 \
--output_dir path/to/save/PEFT/model \
--use_fast_kernels

(For details, see: Out of the box acceleration and memory savings of 🤗 decoder models with PyTorch 2.0 | PyTorch)

Fine-tuning using FSDP Only

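- This fine-tunes all parameters without using PEFT.

(PEFT methods freeze the whole model and train only a small number of added learnable parameters/layers; LoRA, LLaMA Adapter, and Prefix-Tuning are currently available.)

- Change nproc_per_node to the number of GPUs available. This was tested with BF16 on 8 x A100 GPUs.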

torchrun \
--nnodes 1 \
--nproc_per_node 8 \
examples/finetuning.py \
--enable_fsdp \
--model_name /path_of_model_folder/7B \
--dist_checkpoint_root_folder model_checkpoints \
--dist_checkpoint_folder fine-tuned \
--use_fast_kernels
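Since nproc_per_node has to match the number of GPUs on the machine, the command above can be assembled programmatically. This is a minimal sketch: `visible_gpu_count` and `build_torchrun_command` are hypothetical helpers written for this post, and counting entries in CUDA_VISIBLE_DEVICES (with a fallback when it is unset or empty) is a simplification rather than a real device query. Only flags that appear in the command above are used.

```python
import os
import shlex


def visible_gpu_count(env=None, default=1):
    """Count entries in CUDA_VISIBLE_DEVICES; use `default` when it is unset or empty."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES", "").strip()
    if not value:
        return default
    return len([device for device in value.split(",") if device.strip()])


def build_torchrun_command(model_name, nproc_per_node):
    """Assemble the FSDP-only command shown above as an argv list."""
    return [
        "torchrun",
        "--nnodes", "1",
        "--nproc_per_node", str(nproc_per_node),
        "examples/finetuning.py",
        "--enable_fsdp",
        "--model_name", model_name,
        "--dist_checkpoint_root_folder", "model_checkpoints",
        "--dist_checkpoint_folder", "fine-tuned",
        "--use_fast_kernels",
    ]


gpus = visible_gpu_count({"CUDA_VISIBLE_DEVICES": "0,1,2,3"})
cmd = build_torchrun_command("/path_of_model_folder/7B", gpus)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` as-is, which avoids shell quoting problems with paths.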

Fine-tuning using FSDP on 70B Model

Multi GPU Multi Node:

5. Weights & Biases Experiment Tracking

AutoModelForCausalLM vs LlamaForCausalLM

AutoTokenizer vs LlamaTokenizer

(Reference: llama-recipes/docs/Dataset.md at main · facebookresearch/llama-recipes (github.com))

Datasets and Evaluation Metrics

- llama-recipes covers how to use the four datasets below and how to use a custom dataset.

    - grammar_dataset: contains 150,000 `English sentence and correction` pairs.
    - alpaca_dataset: provides 52K `instruction-response` pairs generated by (OpenAI's) text-davinci-003.
    - samsum: contains about 16,000 messenger-style `conversations with summaries`.
    - OpenAssistant/oasst1: contains about 88,000 `messages` from assistant-style conversations.

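Batching Strategies

- llama-recipes supports two strategies for batching requests.

(1) The default is packing, which concatenates the tokenized samples into long sequences that fill the model's context length. This is the most compute-efficient variant, since it avoids padding and every sequence has the same length. Samples at the boundary of the context length are truncated, and the remainder of the cut sequence is used as the start of the next long sequence.
With a small amount of training data, this procedure can introduce a lot of noise into the training data and hurt the prediction performance of the fine-tuned model.

(2) A padding strategy is also supported, which does not introduce the extra noise caused by truncated sequences. It tries to minimize the efficiency loss by batching samples of similar length together, so only minimal padding is needed.

The batching strategy can be selected with the command-line parameter --batching_strategy [packing]/[padding].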
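The difference between the two strategies can be illustrated on toy token lists. This is a simplified sketch, not the llama-recipes implementation: `pack` drops the final incomplete chunk, `pad_batches` uses a hypothetical pad id of 0, and the token ids are made up.

```python
def pack(samples, context_length):
    """Concatenate tokenized samples and cut the stream into full-length chunks."""
    stream = [token for sample in samples for token in sample]
    n_full = len(stream) // context_length
    # Simplification: the trailing incomplete chunk is dropped here.
    return [stream[i * context_length:(i + 1) * context_length] for i in range(n_full)]


def pad_batches(samples, batch_size, pad_id=0):
    """Sort by length, batch neighbours, and pad each batch to its longest sample."""
    ordered = sorted(samples, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        width = max(len(sample) for sample in batch)
        batches.append([sample + [pad_id] * (width - len(sample)) for sample in batch])
    return batches


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]  # made-up token ids
packed = pack(samples, context_length=4)
padded = pad_batches(samples, batch_size=2)
print("packing:", packed)
print("padding:", padded)
```

In the packed output a chunk can start in the middle of one sample and end in the middle of another, which is the boundary noise described above; the padded output keeps every sample intact at the cost of a few pad tokens.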

Using Custom Data

- llama-recipes provides two ways to use a custom dataset.

(skipped)

BitsandBytes Error

Attempt 1) Install with pip

>>> In my case, the bitsandbytes installed with pip raised an error.

Attempt 2) pip install git+https://github.com/Keith-Hon/bitsandbytes-windows.git

https://github.com/TimDettmers/bitsandbytes/issues/175#issuecomment-1488003048

CUDA Setup failed despite GPU being available. Inspect the CUDA SETUP outputs above to fix your environment! · Issue #175 · Ti

C:\ProgramData\Anaconda3\envs\novelai\lib\site-packages\bitsandbytes\cuda_setup\main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {Wind...


Following the issue above, I ran the command below.

$ pip install git+https://github.com/Keith-Hon/bitsandbytes-windows.git

>>> However, it reports that the CUDA library is not detected.

Attempt 3) Install from a .whl file

Releases · TimDettmers/bitsandbytes (github.com): Accessible large language models via k-bit quantization for PyTorch.

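>> I installed from bitsandbytes-0.42.0-py3-none-any.whl, the latest .whl file on the release tab.

However, the same error as in Attempt 1) occurred.

Attempt 4) Compile and install from source

I compiled from source as described under Compiling from Source in HuggingFace/bitsandbytes.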

git clone https://github.com/TimDettmers/bitsandbytes.git && cd bitsandbytes/
pip install -r requirements-dev.txt
cmake -DCOMPUTE_BACKEND=cuda -S .
make
pip install .

>>> However, this produced an error.

export BNB_CUDA_VERSION=116
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64

After adding the variables above to bashrc, the install went through, although a warning is still shown.

Side issue)

- Installing bitsandbytes requires torch>=2.0.1.

Indeed, running with torch 1.13 raises an error.

Specifically, the 'cast_forward_inputs' argument of torch.distributed.fsdp.MixedPrecision is only found in torch>=2.
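A version requirement like torch>=2.0.1 can be checked before launching a long job. The sketch below parses version strings by hand so that it runs without extra dependencies; `parse_version` and `meets_minimum` are hypothetical helpers, and for anything beyond simple numeric versions a dedicated version-parsing library would be the safer choice.

```python
import re


def parse_version(text):
    """Turn '2.2.1+cu121' into (2, 2, 1), ignoring local/build suffixes."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(installed, minimum):
    """True when the installed version is at least the required minimum."""
    return parse_version(installed) >= parse_version(minimum)


for candidate in ["1.13.1", "2.2.1+cu121"]:
    status = "ok" if meets_minimum(candidate, "2.0.1") else "too old"
    print(f"torch {candidate}: {status}")
```

In a real script the first argument would come from `torch.__version__`.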

Attempt 5) pip install bitsandbytes-cuda116

Following the issue linked below, the install command above solved the problem. It does, however, print the following warning.

WARNING! This version of bitsandbytes is deprecated. Please switch to pip install bitsandbytes and the new repo: https://github.com/TimDettmers/bitsandbytes

https://github.com/TimDettmers/bitsandbytes/issues/112#issuecomment-1382978746

Warning: "The installed version of bitsandbytes was compiled without GPU support." · Issue #112 · TimDettmers/bitsandbytes

Issue When I run the following line of code: pipe = pipeline(model=name, model_kwargs= {"device_map": "auto", "load_in_8bit": True}, max_new_tokens=max_new_tokens) I get the following warning messa...


==> Interim conclusion

It seems best to install bitsandbytes through the bitsandbytes-cuda116 package.

AutoModelForCausalLM vs

AutoTokenizer vs LlamaTokenizer

$ python -m "torch.utils.collect_env"

- The python -m "torch.utils.collect_env" command prints your runtime environment.

/home/dhpark/miniconda3/envs/llm_env/lib/python3.9/runpy.py:127: RuntimeWarning: 'torch.utils.collect_env' found in sys.modules after import of package 'torch.utils', but prior to execution of 'torch.utils.collect_env'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
Collecting environment information...
/home/dhpark/miniconda3/envs/llm_env/lib/python3.9/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11060). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
PyTorch version: 2.2.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.28.3
Libc version: glibc-2.31

Python version: 3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10)  [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-46-generic-x86_64-with-glibc2.31
Is CUDA available: False
CUDA runtime version: 11.6.124
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
GPU 2: NVIDIA GeForce RTX 3090
GPU 3: NVIDIA GeForce RTX 3090

Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/local/cuda-11.2-cudnn-8.1.0/targets/x86_64-linux/lib/libcudnn.so.8
/usr/local/cuda-11.2-cudnn-8.1.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8
/usr/local/cuda-11.2-cudnn-8.1.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8
/usr/local/cuda-11.2-cudnn-8.1.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8
/usr/local/cuda-11.2-cudnn-8.1.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8
/usr/local/cuda-11.2-cudnn-8.1.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8
/usr/local/cuda-11.2-cudnn-8.1.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          96
On-line CPU(s) list:             0-95
Thread(s) per core:              2
Core(s) per socket:              24
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz
Stepping:                        7
CPU MHz:                         2400.000
CPU max MHz:                     4000.0000
CPU min MHz:                     1000.0000
BogoMIPS:                        4800.00
Virtualization:                  VT-x
L1d cache:                       1.5 MiB
L1i cache:                       1.5 MiB
L2 cache:                        48 MiB
L3 cache:                        71.5 MiB
NUMA node0 CPU(s):               0-23,48-71
NUMA node1 CPU(s):               24-47,72-95
Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed:          Mitigation; Enhanced IBRS
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; TSX disabled
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] lion-pytorch==0.0.8
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] torch==2.2.1
[pip3] triton==2.2.0
[conda] lion-pytorch              0.0.8                    pypi_0    pypi
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] torch                     2.2.1                    pypi_0    pypi
[conda] triton                    2.2.0                    pypi_0    pypi

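(Situation) My environment is torch 2.2 with CUDA 11.6. The driver is too old for torch 2.2 (built against CUDA 12.1): CUDA 11.8 or higher is required. Rather than installing that, I plan to build a container from an image that ships cuda11.8 + torch2.1.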
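The mismatch in the log above can be read off mechanically. The driver warning reports the supported CUDA version as an integer (11060), which CUDA encodes as 1000 * major + 10 * minor. The sketch below decodes it and compares it with the CUDA version PyTorch was built against; `driver_too_old` is a hypothetical helper, and its strict comparison is conservative because it ignores CUDA's minor-version compatibility within a major release.

```python
def decode_cuda_version(encoded):
    """CUDA encodes versions as 1000 * major + 10 * minor (11060 -> 11.6)."""
    return encoded // 1000, (encoded % 1000) // 10


def driver_too_old(driver_encoded, build_version):
    """Conservative check: driver CUDA version strictly below the build version."""
    major, minor = (int(part) for part in build_version.split(".")[:2])
    return decode_cuda_version(driver_encoded) < (major, minor)


# Values taken from the collect_env output above.
driver = 11060          # "found version 11060"
built_against = "12.1"  # "CUDA used to build PyTorch: 12.1"

print("driver supports CUDA", ".".join(map(str, decode_cuda_version(driver))))
print("driver too old for this torch build:", driver_too_old(driver, built_against))
```

With the values from the log, the driver supports CUDA 11.6 while the torch wheel targets 12.1, which matches the "driver on your system is too old" warning.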

(Question) Can a separate Docker container be set up for each CUDA version? For example, as below:

torch	cuda driver
1.13	11.6
2.21	11.8

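A. Yes. A container created from an image with cuda11.8 + torch2.1 installed can use torch 2.1 with CUDA 11.8 (the container still uses the host's NVIDIA driver; only the CUDA toolkit and torch come from the image).

<Attempt 1>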

FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel

RUN pip install --extra-index-url https://download.pytorch.org/whl/test/cu116 llama-recipes && \
    pip install --extra-index-url https://download.pytorch.org/whl/test/cu116 llama-recipes[tests] && \
    pip install --extra-index-url https://download.pytorch.org/whl/test/cu116 llama-recipes[vllm] && \
    pip install --extra-index-url https://download.pytorch.org/whl/test/cu116 llama-recipes[auditnlg]

RUN pip install transformers \
            datasets \
            accelerate \
            sentencepiece \
            protobuf==3.20 \
            py7zr \
            scipy \
            peft \
            bitsandbytes \
            fire \
            torch_tb_profiler \
            ipywidgets

<Attempt 2>

FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel

RUN git clone https://github.com/facebookresearch/llama-recipes && cd llama-recipes && pip install -r requirements.txt


Donghoon Note