Skip to content

USTC-StarTeam/ZIP

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ZIP

Motivation

This is the project for our paper Entropy Law: The Story Behind Data Compression and LLM Performance.

Quick start

Data pool

The data pool used in the paper can be found in here, which is provided by the DEITA. And we appreciate their contribution. If you want to use ZIP to select your data, we only currently support the following sharegpt format:

[
    {
        "id": 0,
        "conversations":[
            {
                "from": "human",
                "value": "XXX",
            },
            {
                "from": "gpt",
                "value": "XXX",
            }
        ],
        "source": "ShareGPT"
    },
    {
        "id": 1,
        "conversations":[
            {
                "from": "human",
                "value": "XXX",
            },
            {
                "from": "gpt",
                "value": "XXX",
            }
        ],
        "source": "ShareGPT"
    }
]

Perform data selection

python ZIP.py --data_path data_pool.json --save_path selected_data.json --budget 10000 

LLM alignment & evaluation

  • We use Axolotl to align LLMs with the selected data.
  • Then we use MT-bench in FastChat to evaluate the aligned LLMs.

Citation

If you find the content of this project helpful, please cite our paper as follows:

@ARTICLE{2024arXiv240706645Y,
       author = {{Yin}, Mingjia and {Wu}, Chuhan and {Wang}, Yufei and {Wang}, Hao and {Guo}, Wei and {Wang}, Yasheng and {Liu}, Yong and {Tang}, Ruiming and {Lian}, Defu and {Chen}, Enhong},
        title = "{Entropy Law: The Story Behind Data Compression and LLM Performance}",
      journal = {arXiv e-prints},
     keywords = {Computer Science - Machine Learning, Computer Science - Computation and Language},
         year = 2024,
        month = jul,
          doi = {10.48550/arXiv.2407.06645},
       eprint = {2407.06645},
}

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages