diff --git a/README.md b/README.md
index 638d12c..6961ab0 100644
--- a/README.md
+++ b/README.md
@@ -139,6 +139,9 @@ Start by aggregating available data from various sources (open-source or not) an
 * [**Auto Data**](https://github.com/Itachi-Uchiha581/Auto-Data): Lightweight library to automatically generate fine-tuning datasets with API models.
 * [**Bonito**](https://github.com/BatsResearch/bonito): Library for generating synthetic instruction tuning datasets for your data without GPT (see also [AutoBonito](https://colab.research.google.com/drive/1l9zh_VX0X4ylbzpGckCjH5yEflFsLW04?usp=sharing)).
 * [**Augmentoolkit**](https://github.com/e-p-armstrong/augmentoolkit): Framework to convert raw text into datasets using open-source and closed-source models.
+
+### Data preparation
+* [**Data Prep Kit**](https://github.com/IBM/data-prep-kit): Community project to accelerate unstructured data preparation for LLM app developers. It provides [data preparation modules](https://github.com/IBM/data-prep-kit/tree/dev/transforms) for code and language modalities, exposed through high-level APIs so developers can work with their data without expertise in the underlying runtimes and frameworks. Modules run on Python, Ray, and Spark runtimes, scaling from a laptop to a full data center, and KFP-based pipelines enable no-code data processing. The [getting started](https://github.com/IBM/data-prep-kit/tree/dev?tab=readme-ov-file#-getting-started-) guide walks through several examples.
 
 ## Acknowledgments
 
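
For context on the pattern the new entry describes: toolkits like this organize data preparation as composable transform modules, so the same pipeline can be handed to a local Python, Ray, or Spark runtime. The sketch below is a minimal, hypothetical illustration of that transform-module idea; the `Doc` record, `dedup`, `strip_whitespace`, and `run_pipeline` names are invented for illustration and are not Data Prep Kit's actual API.

```python
from dataclasses import dataclass
from typing import Callable

# A "document" record, loosely modeled on the rows such toolkits
# process (hypothetical; not Data Prep Kit's actual data model).
@dataclass
class Doc:
    doc_id: str
    text: str

# A transform is a batch-in, batch-out function; runtimes (local Python,
# Ray, Spark) would differ only in how batches are distributed.
Transform = Callable[[list[Doc]], list[Doc]]

def strip_whitespace(batch: list[Doc]) -> list[Doc]:
    """Normalize leading/trailing whitespace in each record."""
    return [Doc(d.doc_id, d.text.strip()) for d in batch]

def dedup(batch: list[Doc]) -> list[Doc]:
    """Drop records whose text exactly duplicates an earlier record."""
    seen: set[str] = set()
    out: list[Doc] = []
    for d in batch:
        if d.text not in seen:
            seen.add(d.text)
            out.append(d)
    return out

def run_pipeline(batch: list[Doc], transforms: list[Transform]) -> list[Doc]:
    """Apply transforms in order: the single-process 'Python runtime' case."""
    for t in transforms:
        batch = t(batch)
    return batch

if __name__ == "__main__":
    docs = [Doc("a", " hello "), Doc("b", "hello"), Doc("c", "world")]
    print(run_pipeline(docs, [strip_whitespace, dedup]))
    # [Doc(doc_id='a', text='hello'), Doc(doc_id='c', text='world')]
```

Because each transform is a pure batch-to-batch function, swapping the single-process loop for distributed map operations changes only the runner, not the modules, which is the property that lets such toolkits offer multiple runtimes behind one API.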