Added transparent-ai blogpost
robvanvolt committed Sep 21, 2023
1 parent 2981bd0 commit d196e98
Showing 5 changed files with 2,265 additions and 1,012 deletions.
54 changes: 54 additions & 0 deletions blog/transparent-ai.md
@@ -0,0 +1,54 @@
---
title: "Towards a transparent AI Future: The Call for less regulatory hurdles on Open-Source AI in Europe"
author: "LAION.ai"
date: "September 21, 2023"
previewImg: "/images/blog/laion-blue.png"
---

Following our previous open letter to the European Parliament on the significance of open-source AI, we at LAION, backed by the European Laboratory for Learning and Intelligent Systems (ELLIS) and a long list of highly impactful AI researchers, submit this new open letter to the European Parliament:

| [Link to the PDF](/documents/transparent-ai-eu-ai-act.pdf) |
|----------|

#### Why Open-Source is the Gold Standard for AI Security

The transparency of open-source AI is its strength. It ensures robustness and security unmatched by closed systems. Why? Open-source AI benefits from the scrutiny of the global community, allowing vulnerabilities to be detected and fixed promptly. Drawing parallels, we can look at the Linux operating system—a paragon of security and robustness stemming from its open-source nature.

#### Countering Redundancy and Upholding Sustainability

With the environmental toll of extensive AI training becoming a major concern, open-source models have shown a clear path forward. By minimizing redundant training, they reduce computational and energy overheads, reflecting a commitment to a sustainable future.

#### A Catalyst for Innovation

Open-source AI has been instrumental in levelling the playing field. Small and mid-sized enterprises can now fine-tune existing models, fostering innovation without the daunting costs of building from scratch. If Europe's ambition is to retain its brightest minds, ensuring uninterrupted access to these resources is non-negotiable.

#### Regulating Application, Not Innovation

The clarion call from LAION and its supporters is clear—focus regulations on AI's applications, not the foundational technology. By doing so, the EU will nurture innovation while ensuring that AI's real-world applications are ethical, safe, and in line with European values.

#### Incentivizing the Open-Source Paradigm

Perhaps the most potent recommendation in this new letter is the incentivization of open-source AI. It's a win-win. Organizations can release foundational models as open-source, maintaining proprietary rights on fine-tuned versions. This ensures that the broader community benefits from the base models while commercial competitiveness remains intact.

#### The European AI Path Forward

European sovereignty in AI is crucial, and open-source AI research is key to addressing challenges ranging from healthcare to climate change. The future, as outlined in the letter, imagines a Europe at the forefront of AI research, one that champions transparency, security, and sustainability.

#### Supporters

| Name | Description |
|----------|----------|
| Board of the European Laboratory for Learning and Intelligent Systems (ELLIS): Serge Belongie, Nicolò Cesa-Bianchi, Florence d'Alché-Buc, Nada Lavrac, Neil D. Lawrence, Nuria Oliver, Bernhard Schölkopf, Josef Sivic, Sepp Hochreiter| [European Lab for Learning & Intelligent Systems (ellis.eu)](https://ellis.eu/board) |
| Jürgen Schmidhuber | Scientific Director of the Swiss AI Lab IDSIA (USI & SUPSI), Co-Founder & Chief Scientist of NNAISENSE, Inventor of LSTM Networks |
| Kristian Kersting | Full Professor at Technical University of Darmstadt, Co-Director, Hessian Center for AI (hessian.AI) and member of the German Center for Artificial Intelligence (DFKI) |
| Björn Ommer | Full professor and head of the Computer Vision & Learning Group at the Ludwig-Maximilians-University of Munich |
| Hilde Kuehne | Professor, Institute for Computer Science II, Head of Multimodal Learning, University of Bonn |
| Mira Mezini | Professor of Computer Science at Technical University of Darmstadt, Co-Director of Hessian Center for AI (hessian.AI) |
| Patrick Schramowski | Senior Researcher at the German Center for Artificial Intelligence (DFKI) and Hessian Center for AI (hessian.AI) |
| Jenia Jitsev | Senior Researcher and Lab Lead at Juelich Supercomputing Center, Research Center Juelich. Scientific Lead and Co-Founder at LAION; Member of European Laboratory for Learning and Intelligent Systems (ELLIS) |
| Dominik L. Michels | Full Professor of Intelligent Algorithms in Modeling and Simulation at the Technical University of Darmstadt |
| Tim Dettmers | PhD Student at the University of Washington. Creator of the bitsandbytes library. |
| Mark Schutera | PhD Student at Karlsruhe Institute of Technology, working on Unsupervised Deep Learning for Cognitive Perception Systems |
| Andreas Hochlehnert | PhD Student, University of Tübingen, International Max-Planck Research School for Intelligent Systems (IMPRS-IS) |
| Christoph Schuhmann | Organizational Lead & Co-Founder of the Large-scale AI Open Network (LAION), Neurips 2022 Outstanding Paper Award & Falling Walls Breakthrough of the Year 2023 Award Winner |
| Robert Kaczmarczyk | Medical Lead & Co-Founder of the Large-scale AI Open Network (LAION), Neurips 2022 Outstanding Paper Award & Falling Walls Breakthrough of the Year 2023 Award Winner |
23 changes: 17 additions & 6 deletions blog/video2dataset.md
@@ -13,7 +13,7 @@ As of 2023, multimodal deep learning is still heavily focusing on text-image mod
We argue that overcoming this data problem is a core interest of (open source) multimodal research since it can foster important previously impossible projects such as high quality [video](https://research.nvidia.com/labs/toronto-ai/VideoLDM/) and [audio](https://google-research.github.io/seanet/audiolm/examples/) generation, [better pre-trained models for robotics](https://twitter.com/comma_ai/status/1666959310310752257?s=20), [movie AD for the blind community](https://www.robots.ox.ac.uk/~vgg/research/autoad/), and more.

![ManyVideos](/images/blog/videos_figure.gif)
_Figure 1:_ video2dataset makes it easy to create large-scale collections of videos such as the sample above, which was created from available research datasets.

### Solution: Flexible dataset curation tooling

@@ -29,21 +29,26 @@ We’ve also used video2dataset to build upon existing video datasets by downloa
video2dataset is built on the foundation of [img2dataset](https://github.com/rom1504/img2dataset) and is designed to transform a table of URLs and metadata into an easily loadable [WebDataset](https://github.com/webdataset/webdataset) in just one command. Furthermore, it allows you to reprocess the WebDataset for additional transformations while retaining the same shard contents. Let's break down how video2dataset operates.
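
As a rough sketch of what that one command looks like in Python — the parameter names below (`url_list`, `url_col`, `caption_col`, `output_format`) are assumptions carried over from the img2dataset lineage, so check the video2dataset README for the exact signature:

```python
# Hedged sketch: turn a CSV of video URLs (plus optional captions) into
# WebDataset shards. Argument names follow the img2dataset convention and
# may differ slightly in the actual video2dataset release.
from video2dataset import video2dataset

video2dataset(
    url_list="videos.csv",       # table of URLs and metadata
    url_col="url",
    caption_col="caption",
    output_folder="video-dataset",
    output_format="webdataset",  # tar shards loadable as a WebDataset
)
```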

### Input Sharding

The process begins with sharding the input data, a step that enables easy distribution among the workers. These input shards are temporarily stored, and the 1-1 correspondence between input and output shards ensures seamless resumption following any failures. If a dataset processing run stops prematurely, we can conveniently bypass processing the input shards for which the output shard already exists.
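
The resumption logic can be pictured with a small sketch (illustrative only, not the library's actual internals), assuming output shards are tar files named by shard id:

```python
# Illustrative sketch of shard-level resumption: an input shard is only
# processed if its corresponding output shard does not exist yet.
import os

def shards_to_process(num_shards: int, output_folder: str):
    for shard_id in range(num_shards):
        output_shard = os.path.join(output_folder, f"{shard_id:05d}.tar")
        if os.path.exists(output_shard):
            continue  # produced by an earlier (possibly interrupted) run
        yield shard_id
```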

### Distribution and Reading

Post-sharding, the individual shards are distributed among the workers, who read each shard and process the samples inside. For distribution we support three modes - multiprocessing, pyspark, and slurm - the first is good for single-machine jobs, whereas the last two help distribute work across many machines. The reading method varies depending on the input dataset's format. For instance, if it's a table of links, video2dataset downloads the video from the web; it supports a wide variety of video platforms by using [yt-dlp](https://github.com/yt-dlp/yt-dlp) to download videos it can’t directly request. However, if the input is an existing WebDataset with videos, a WebDataset dataloader reads the bytes or frames in tensor format from those samples.
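
A sketch of switching between these modes — the argument name `distributor` is an assumption, and the real option may live in the job config instead:

```python
# Hedged sketch: the same job, distributed three different ways.
from video2dataset import video2dataset

video2dataset(
    url_list="videos.csv",
    output_folder="video-dataset",
    distributor="multiprocessing",  # single machine; "pyspark" or "slurm"
                                    # spread the shards across many machines
)
```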

### Subsampling

Once the video is read and the worker has the video bytes, they are sent through a pipeline of subsamplers defined in the job config. This step optionally transforms the video through actions such as frames-per-second (FPS) or resolution downsampling, clipping, scene detection, and more. Alternatively, there are subsamplers that only extract metadata from the input modalities - resolution/compression information, synthetic captions, optical flow, and others - and include it in the metadata of a given sample. If your desired transformation isn’t already in video2dataset, it’s very easy to add by defining a new subsampler or adjusting an existing one. This can be done with minimal changes elsewhere in the repository and is a very welcome contribution.
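
The shape of a subsampler can be sketched roughly as follows; the exact interface in the repository may differ, but the idea is a callable that receives the video bytes and the sample's metadata and returns transformed bytes, updated metadata, and an error value:

```python
# Hypothetical metadata-only subsampler: it records the encoded size of each
# video without modifying the bytes. The (bytes, metadata, error) return
# convention is an assumption about the interface.
class ByteSizeSubsampler:
    def __call__(self, video_bytes: bytes, metadata: dict):
        try:
            metadata["video_size_bytes"] = len(video_bytes)
            return video_bytes, metadata, None
        except Exception as err:  # report failures instead of crashing the worker
            return None, metadata, str(err)
```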

### Logging

Throughout the entire process, video2dataset meticulously logs vital information at various stages. Upon completion of each shard, a corresponding {ID}\_stats.json file is generated. This file contains key details such as the number of samples processed, the number of successful operations, and a log of any failures along with their associated error messages. video2dataset also supports integration with Weights & Biases (wandb); this integration can be activated with a single argument and, when enabled, provides extensive performance reporting along with success and failure metrics. Such features are helpful for benchmarking and for estimating the cost of full jobs.
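
A small sketch of inspecting those per-shard stats files after a run; no particular field names are assumed here, we simply print whatever each file contains:

```python
# Print a few fields from every {ID}_stats.json produced by a run.
import glob
import json

for path in sorted(glob.glob("video-dataset/*_stats.json")):
    with open(path) as f:
        stats = json.load(f)
    preview = {key: stats[key] for key in list(stats)[:5]}
    print(path, preview)
```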

![](/images/blog/video2dataset_wandb_logs.png)
_Figure 3:_ Part of a wandb report from a large video2dataset run

### Writing

Finally, video2dataset saves the transformed data to output shards in specified locations, where they can be used for training or reprocessed with video2dataset or other tools. The dataset's output format is shards of N samples each, and the shards can be formatted in multiple ways: directories, tar files, tfrecords, or parquet files. The most useful are the directories format, for smaller datasets and debugging, and tar files, which the WebDataset format uses for loading. Here is a visualization of the output datasets:

```
@@ -68,6 +73,7 @@ video-dataset
```
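
Once written as tar shards, the dataset can be consumed with the webdataset library; the per-sample keys used below (`mp4`, `txt`, `json`) are assumptions about how the files inside each sample are named, so inspect one shard to confirm:

```python
# Hedged sketch of loading the tar-formatted output shards.
import webdataset as wds

dataset = wds.WebDataset("video-dataset/{00000..00009}.tar")
for sample in dataset:
    video_bytes = sample["mp4"]    # raw encoded video
    caption = sample.get("txt")    # optional caption, if one was provided
    metadata = sample.get("json")  # per-sample metadata (still raw bytes here)
    break
```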

### Reprocessing

video2dataset can reprocess previous output datasets by reading the output shards and passing the samples inside through new transformations. This capability is particularly beneficial for video datasets, given their often hefty size and unwieldy nature. It allows us to conservatively downsample our data to avoid multiple downloads of large datasets. We delve into a practical example of this in the next section.
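
A reprocessing pass can be sketched like this, again with assumed parameter names such as `input_format`; the actual transformations (e.g. FPS downsampling or clipping) would be specified through the job config:

```python
# Hedged sketch: read a previous output WebDataset and write a transformed copy.
from video2dataset import video2dataset

video2dataset(
    url_list="video-dataset/{00000..00009}.tar",  # previous output shards as input
    input_format="webdataset",
    output_folder="video-dataset-downsampled",
    output_format="webdataset",
)
```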

## Examples
@@ -78,12 +84,13 @@ Each video is a rich source of data that can be decomposed into many forms - dif

<video width="640" height="480" controls style="margin-left:auto;margin-right:auto;">
<source src="/images/blog/video2dataset_tree_of_datasets.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>

_Figure 4:_ You can efficiently extract many types of datasets from an initial base set of video links using video2dataset

The individual steps are:

1. Download an HD video dataset for a generative video modeling project.
2. Download 2 more datasets at various resolutions so you can increase your sample count.
3. Combine all 3 video datasets and downsample them in resolution and FPS so they can be more easily stored.
@@ -94,14 +101,15 @@ The individual steps are:
8. We can further process the audio and extract transcripts (using our [WhisperX](https://github.com/m-bain/whisperX) subsampler)
9. The transcripts can be used to train text-only or vision-text models


Doing dataset curation with video2dataset is very convenient across projects, since datasets with the same contents can share metadata shards - the audio dataset from step 6 can use the same captions as the contrastive video-text model in step 4, and we may filter that audio dataset with the same optical flow scores produced in step 5.

### Dataset processing jobs

We have used video2dataset to process many popular datasets, and we include instructions for reproducing these jobs in the [dataset\_examples section](https://github.com/iejMac/video2dataset/tree/main/dataset_examples) of the repository. One such dataset is [WebVid](https://m-bain.github.io/webvid-dataset/) (10M samples), which can be downloaded in 12 hours on a single cpu16 EC2 instance for a total cost of $8.16.
To further test video2dataset’s capabilities, we create a large-scale video-text dataset (590M pairs) by combining existing large datasets and performing extensive processing on them using video2dataset transformations. Specifically, we perform [scene detection](https://github.com/Breakthrough/PySceneDetect), clip according to those scenes, add synthetic captions, and add optical flow estimates for each clip. The dataset will be released soon along with a discovery study on its applicability.

### Metadata and Statistics

video2dataset can be used to gather various metadata and statistics about the processed data. Some subsamplers take a given modality (video, audio) and extract metadata from it, such as compression/video information, optical flow scores, or audio transcripts. Additionally, during downloading, if the source already has associated metadata (as YouTube videos do, for example), video2dataset will try to extract that metadata and place it in the WebDataset so you can access it easily later. Here are some examples:

| Video | Optical Flow | Synthetic Caption | Whisper Transcript | YouTube Metadata |
@@ -113,6 +121,7 @@ video2dataset can be used to gather various metadata and statistics about the pr
_YouTube provides a large amount of metadata for each video, so we only select a few keys for display here. For a full example of a YouTube metadata dictionary, see [this example](https://github.com/iejMac/video2dataset/blob/main/examples/yt_metadata.md)._

## What’s next?

- Scientific analysis and release of a large scale dataset created with the tool presented in this blog post.
- Improved synthetic captioning. Synthetic captioning for videos is still underexplored and there are many exciting ideas to try. Soon video2dataset will have more interesting methods to produce captions for videos that make use of image captioning models and LLMs.
- Since its release, people have been talking about using [Whisper](https://arxiv.org/abs/2212.04356) to obtain many text tokens from video. This is possible with video2dataset, and we are working on transcribing a large corpus of podcasts which we will soon release as a text dataset (we are aiming at 50B tokens).
@@ -127,7 +136,9 @@ video2dataset is a fully open-source project and we are committed to developing
MIT

### Contributions

Big thanks to everyone involved, most notably:

- [Romain](https://github.com/rom1504) for building out img2dataset, helping with the initial design of video2dataset, and giving lots of advice during the process of building video2dataset.
- [Marianna](https://github.com/marianna13) for helping create the audio functionality.
- [Daniel](https://twitter.com/danielmend_) for building the cut detection and optical flow capabilities. Also for extensive help with testing and runs at scale, and feedback on the blogpost.