Add New Format

Case 1: Supported Formats

If the format of your dataset is already supported by the existing library, you can use it directly without any library-level modification. The currently supported task and file type combinations are:

  • TaskType.text_classification
    • FileType.tsv
  • TaskType.named_entity_recognition
    • FileType.conll
  • TaskType.summarization
    • FileType.tsv
  • TaskType.extractive_qa
    • FileType.json (same format as SQuAD)
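
To make the expected layout concrete, the following minimal sketch writes a toy summarization dataset in tsv format and a matching plain-text output file. The file names and the exact column order (source, then reference, with one hypothesis per line in the output file) are illustrative assumptions; check the corresponding loader in the loaders package if your SDK version expects a different layout.

# Hypothetical toy files; the column order is an assumption for illustration.
dataset_rows = [
    ("Heavy rain caused flooding in several parts of the city on Monday.",
     "Rain flooded parts of the city."),
    ("The committee approved the new budget after a long debate.",
     "The committee approved the budget."),
]
with open("toy_dataset.tsv", "w") as f:  # one "source<TAB>reference" row per sample
    for source, reference in dataset_rows:
        f.write(f"{source}\t{reference}\n")

hypotheses = ["Flooding hit the city after heavy rain.", "The budget was approved."]
with open("toy_output.txt", "w") as f:  # one system hypothesis per line
    for hyp in hypotheses:
        f.write(hyp + "\n")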

For example, suppose that you have a summarization dataset in tsv format and the corresponding system output in plain text format:

# FileType and Source are needed below; in some SDK versions they are exported
# from explainaboard.loaders.file_loader rather than the top-level package.
from explainaboard import FileType, Source, TaskType, get_dataset_class, get_processor_class

# Paths to the tsv dataset and the plain-text system output
dataset_path = "./integration_tests/artifacts/summarization/dataset.tsv"
output_path = "./integration_tests/artifacts/summarization/output.txt"

# Build a summarization loader over the local files and load the samples
loader = get_dataset_class(TaskType.summarization)(
    dataset_path,
    output_path,
    Source.local_filesystem,
    Source.local_filesystem,
    FileType.tsv,
    FileType.text,
)
data = loader.load()

# Run the summarization processor on the loaded samples and write the analysis
processor = get_processor_class(TaskType.summarization)()
analysis = processor.process(metadata={}, sys_output=data)  # exact arguments may vary across SDK versions
analysis.write_to_directory("./")
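
Since load() returns a list of dicts (see the load() implementation in Case 2 below), a quick sanity check before processing is to inspect the first sample; the field names here mirror the keys produced by that loader.

# Assuming each sample is a dict with "id", "source", "reference", and
# "hypothesis" keys, as produced by the summarization loader shown in Case 2.
print(len(data), "samples loaded")
print(data[0])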

Case 2: Unsupported Formats

If your dataset is in a new format that the current SDK doesn't support, you can either

  • (1) reformat your data into a format that the current library already supports (a small conversion sketch is given at the end of this page), or

  • (2) rewrite the loader.load() function so that it supports your format. Taking the summarization task as an example, suppose that the existing SDK only supports the tsv format; we can add support for the json format by adding the following code inside loaders.summarization.TextSummarizationLoader.load():

def load(self) -> Iterable[Dict]:
    raw_data = self._load_raw_data_points()
    data: List[Dict] = []
    if self._file_type == FileType.tsv:
        # Each tsv row carries the source, reference, and hypothesis columns
        for sample_id, dp in enumerate(raw_data):
            source, reference, hypothesis = dp[:3]
            data.append({"id": sample_id,
                         "source": source.strip(),
                         "reference": reference.strip(),
                         "hypothesis": hypothesis.strip()})
    elif self._file_type == FileType.json:  # newly added branch; this function has been unit tested
        # Each json entry is a dict with "source", "references", and "hypothesis" fields
        for sample_id, info in enumerate(raw_data):
            source, reference, hypothesis = info["source"], info["references"], info["hypothesis"]
            data.append({"id": sample_id,
                         "source": source.strip(),
                         "reference": reference.strip(),
                         "hypothesis": hypothesis.strip()})
    else:
        raise NotImplementedError
    return data
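
For option (1), a small conversion script is often enough. The sketch below turns a json dataset whose entries carry "source", "references", and "hypothesis" fields (the same keys read by the json branch above) into the tab-separated layout that the tsv branch expects; the file names are placeholders.

import json

# Hypothetical input/output paths; adjust them to your own files.
with open("my_dataset.json") as f:
    entries = json.load(f)  # a list of dicts with "source", "references", "hypothesis"

with open("my_dataset.tsv", "w") as f:
    for info in entries:
        # One tab-separated row per sample: source, reference, hypothesis,
        # matching the dp[:3] unpacking in the tsv branch of load() above.
        f.write(f'{info["source"]}\t{info["references"]}\t{info["hypothesis"]}\n')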