POLYGLOT-KO Download and Examples

Lyan

4 min read · May 17, 2023

Reportedly, this LLM was built in collaboration with KoAlpaca.

EleutherAI/polyglot-ko-1.3b · Hugging Face

Polyglot-Ko is a series of large-scale Korean autoregressive language models made by the EleutherAI polyglot team. The…

huggingface.co

It comes in several versions with billions of parameters: 1.3B, 3.8B, 5.8B, and 12.8B.

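Of these, I want to fine-tune the 5.8B and 12.8B versions into dialogue generation models, but even full fine-tuning of the 3.8B is hard on an RTX A6000, so I'm going with LoRA (mentioned in my previous post) on top of the base model..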
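As a rough back-of-envelope check on why full fine-tuning does not fit: a common rule of thumb for mixed-precision training with Adam is about 16 bytes per parameter, before counting activations. The sketch below applies that rule to each model size against the 48 GB of an RTX A6000. The 16 bytes/parameter figure and the function name are assumptions for illustration, not measurements from this setup.

```python
# Rough memory estimate for full fine-tuning with Adam in mixed precision.
# Assumption: ~16 bytes per parameter (2 fp16 weights + 2 fp16 grads
# + 4 fp32 master weights + 4 + 4 for Adam's two moments), activations excluded.
BYTES_PER_PARAM = 16
A6000_MEMORY_GB = 48  # RTX A6000 VRAM


def full_finetune_gb(params_billion: float) -> float:
    """Approximate GB needed for weights, gradients and optimizer state."""
    return params_billion * 1e9 * BYTES_PER_PARAM / 1e9


for size in (1.3, 3.8, 5.8, 12.8):
    need = full_finetune_gb(size)
    verdict = "fits" if need <= A6000_MEMORY_GB else "does not fit"
    print(f"{size}B: ~{need:.1f} GB -> {verdict} in {A6000_MEMORY_GB} GB")
```

Under this estimate only the 1.3B model stays below 48 GB, which is consistent with falling back to LoRA for anything larger.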

But polyglot is based on GPTNeoX, so the code differs slightly from the commonly used Llama.. I got stuck on this for a while, but luckily I found suitable code and training is now running..
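One concrete place the two architectures differ is the name of the attention layers that LoRA attaches to: in Hugging Face transformers, Llama has separate q_proj / k_proj / v_proj layers, while GPT-NeoX uses a single fused query_key_value projection. The sketch below captures just that mapping. The helper name is hypothetical and not part of the cloned repo, and whether the repo makes exactly this change is an assumption.

```python
# Attention projection names that LoRA is commonly attached to, per architecture.
# The names follow the Hugging Face transformers module names; the helper
# itself is hypothetical and not part of the cloned repo.
LORA_TARGET_MODULES = {
    "llama": ["q_proj", "v_proj"],        # separate query / value projections
    "gpt_neox": ["query_key_value"],      # single fused QKV projection
}


def lora_target_modules(model_type: str) -> list[str]:
    """Return LoRA target module names for a given `config.model_type`."""
    try:
        return LORA_TARGET_MODULES[model_type]
    except KeyError:
        raise ValueError(f"no LoRA targets registered for {model_type!r}") from None


# With peft installed, the result would be passed along the lines of:
#   LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
#              target_modules=lora_target_modules("gpt_neox"),
#              task_type="CAUSAL_LM")
print(lora_target_modules("gpt_neox"))
```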

1. Create a virtual environment (Python 3.11 is too new and triggers the error below)

RuntimeError: Python 3.11+ not yet supported for torch.compile

conda create -n polyglot_ko python=3.10
conda activate polyglot_ko
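To catch this before a long install, a small guard can check the interpreter version up front. This sketch assumes, based on the error message above, that the installed torch rejects anything from Python 3.11 upward; the function name is made up for illustration.

```python
import sys


def supports_torch_compile(version=None) -> bool:
    """True if the interpreter is older than 3.11.

    Assumption: the installed torch rejects torch.compile on Python 3.11+,
    as in the RuntimeError quoted above.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return major_minor < (3, 11)


if not supports_torch_compile():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} detected: "
          "recreate the environment with Python 3.10")
```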

2. Clone the repository

git clone https://github.com/satani99/alpaca-lora_gpt_neox_20b
cd alpaca-lora_gpt_neox_20b

3. Install the dependencies with pip

pip install -r requirements.txt
pip install scipy

4. Running it as-is hits the undefined symbol: cget_col_row_stats issue, so run the command below

conda install cudatoolkit -y

5. If you skip this install, the final adapter_model.bin gets saved at only 443 KB

pip uninstall peft -y
pip install git+https://github.com/huggingface/peft.git@e536616888d51b453ed354a6f1e243fecb02ea08
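A quick way to spot this problem is to compare the saved file against the size the LoRA weights should roughly have. For rank r on a fused query_key_value layer of hidden size h, each layer adds r·h + 3h·r parameters. The sketch below assumes LoRA is applied only to query_key_value and stored as fp32. The hidden size 4096 and 28 layers are illustrative values assumed for the 5.8B model, so check the model's config.json for the real ones. Both function names are hypothetical.

```python
import os


def expected_adapter_bytes(hidden_size: int, num_layers: int, r: int,
                           bytes_per_param: int = 4) -> int:
    """Approximate size of LoRA weights on a fused QKV projection.

    Per layer: A is (r x hidden) and B is (3*hidden x r).
    Assumption: LoRA only on query_key_value, weights stored as fp32.
    """
    per_layer = r * hidden_size + 3 * hidden_size * r
    return per_layer * num_layers * bytes_per_param


def adapter_looks_empty(path: str, expected_bytes: int) -> bool:
    """Flag a saved adapter that is far smaller than expected (under 10%)."""
    return os.path.getsize(path) < 0.1 * expected_bytes


# Illustrative values (assumed, verify against config.json): h=4096, 28 layers, r=8
expected = expected_adapter_bytes(hidden_size=4096, num_layers=28, r=8)
print(f"expected roughly {expected / 1e6:.1f} MB")
```

With these assumed numbers the estimate is in the tens of megabytes, so a file of a few hundred KB would be flagged by adapter_looks_empty.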

6. Phew, so many steps. Now it's time to run, but first clone polyglot with git

You can also load the model straight from huggingface, but that takes a long time on every run

How to clone a large model with git lfs

As someone who only learned how to git clone three months ago...

medium.com
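One way to combine the two options is to prefer a local clone when it exists and fall back to the Hub ID otherwise. A minimal sketch: the function name and the check for config.json are assumptions for illustration, not something the repo provides.

```python
from pathlib import Path


def resolve_base_model(local_dir: str, hub_id: str) -> str:
    """Return the local clone if it looks complete, else the Hub model ID.

    Assumption: a usable clone contains at least a config.json.
    """
    if (Path(local_dir) / "config.json").is_file():
        return local_dir
    return hub_id


# The result is what would be passed as --base_model
print(resolve_base_model("polyglot-ko-1.3b", "EleutherAI/polyglot-ko-1.3b"))
```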

7. Use the cloned polyglot as the base model, and grab ko_alpaca_data.json as temporary data

This is a light version just to check that it actually runs => confirm the model gets saved properly

python finetune.py \
    --base_model 'polyglot-ko-1.3b' \
    --data_path 'data/ko_alpaca_data.json' \
    --output_dir './lora-polyglot' \
    --num_epochs 0.1

8. This is the real deal -> I'll kick it off today before heading home

python finetune.py \
    --base_model 'polyglot-ko-5.8b' \
    --data_path 'data/ko_alpaca_data.json' \
    --output_dir './lora-polyglot' \
    --batch_size 128 \
    --micro_batch_size 4 \
    --num_epochs 5 \
    --learning_rate 5e-5 \
    --cutoff_len 512 \
    --val_set_size 2000 \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --train_on_inputs \
    --group_by_length
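For reference, here is how these flags relate to each other, assuming this fork keeps the original alpaca-lora convention where gradient accumulation steps are batch_size // micro_batch_size, along with the standard LoRA scaling of lora_alpha / lora_r. The dataset size used below is a placeholder, not the actual size of ko_alpaca_data.json, and the helper names are hypothetical.

```python
import math

batch_size = 128        # effective batch size per optimizer step
micro_batch_size = 4    # samples per forward/backward pass
lora_r, lora_alpha = 8, 16
val_set_size = 2000
num_epochs = 5


def grad_accum_steps(batch: int, micro: int) -> int:
    """Assumed convention from alpaca-lora: batch_size // micro_batch_size."""
    return batch // micro


def optimizer_steps(num_examples: int, val_size: int, batch: int, epochs: int) -> int:
    """Approximate optimizer steps on a single GPU, rounding up each epoch."""
    return math.ceil((num_examples - val_size) / batch) * epochs


print("gradient accumulation steps:", grad_accum_steps(batch_size, micro_batch_size))
print("LoRA scaling (alpha / r):", lora_alpha / lora_r)
# 50_000 is a hypothetical dataset size used only to show the arithmetic
print("optimizer steps:", optimizer_steps(50_000, val_set_size, batch_size, num_epochs))
```

So each optimizer step accumulates gradients over 32 micro-batches of 4, and the LoRA update is scaled by 2.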
