Working with the huggingface🤗 datasets library

Lyan
4 min read · Sep 12, 2023
https://huggingface.co/docs/datasets/index

In data science, cleaning data well and getting it into the shape you want was always important... and it looks like deep learning is no different.

Honestly, whether the input file is csv or json, its shape doesn't matter that much, but managing it with datasets makes a lot of things easier! (e.g. you can use map with lambda functions, and although it looks a lot like json, you can also view the data feature by feature)

That said, there are still parts of using the datasets library that haven't fully clicked for me, so I'm writing them down here, partly as notes and partly because keeping data organized in this kind of set is convenient.

from datasets import load_dataset, Dataset, load_from_disk
1. json to datasetdict

If you have json files split into train, valid, and test, load them and then use from_list to turn each one into a Dataset. (from_list expects a list of dicts, one per example, so each file should parse to that shape.)

import json

with open(data_path + 'train.json', 'r') as f:
    train = json.load(f)
with open(data_path + 'val.json', 'r') as f:
    val = json.load(f)
with open(data_path + 'test.json', 'r') as f:
    test = json.load(f)

train_dataset = Dataset.from_list(train)
val_dataset = Dataset.from_list(val)
test_dataset = Dataset.from_list(test)

Bundling them into one looks like this:

from datasets import DatasetDict  # the import above does not bring in DatasetDict

class_dataset = DatasetDict({'train': train_dataset,
                             'valid': val_dataset,
                             'test': test_dataset})
class_dataset

To save it in this state and load it back, use save_to_disk and load_from_disk.

# save_path is a placeholder: the directory to save to
class_dataset.save_to_disk(save_path)
# load from the same directory that was saved
class_dataset = load_from_disk(save_path)

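It is saved as a directory: a dataset_dict.json file plus one subfolder per split (train, valid, test), each holding the Arrow data file(s) along with dataset_info.json and state.json.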

2. csv, excel to datasetdict

If your data is in csv or excel, load it through pandas.

Just add one step that converts the DataFrame into json-style records, and the rest works the same as above.

import pandas as pd
# csv_path / excel_path are placeholders for the path of the saved file
data = pd.read_csv(csv_path, index_col=0)
# data = pd.read_excel(excel_path, index_col=0)
data_dict = data.to_dict(orient='records')  # the keyword is orient, not oriented

This puts each row into its own dict with the column names as keys, so the result is a list of dicts that can go straight into Dataset.from_list.
