Document-to-podcast: a Blueprint by Mozilla.ai for generating podcasts from documents using local AI
This blueprint demonstrate how you can use open-source models & tools to convert input documents into a podcast featuring two speakers. It is designed to work on most local setups or with GitHub Codespaces, meaning no external API calls or GPU access is required. This makes it more accessible and privacy-friendly by keeping everything local.
👉 📖 For more detailed guidance on using this project, please visit our Docs here.
- Python 3.10+ (use Python 3.12 for Apple M1/2/3 chips)
- Llama-cpp (text-to-text, i.e script generation)
- Parler_tts (text-to-speech, i.e audio generation)
- Streamlit (UI demo)
Get started with Document-to-Podcast using one of the two options below: GitHub Codespaces for a hassle-free setup or Local Installation for running on your own machine.
The fastest way to get started. Click the button below to launch the project directly in GitHub Codespaces:
Once the Codespaces environment launches, inside the terminal, start the Streamlit demo by running:
python -m streamlit run demo/app.py
-
Clone the Repository Inside the Codespaces terminal, run:
git clone https://github.com/mozilla-ai/document-to-podcast.git cd document-to-podcast
-
Install Dependencies Inside the terminal, run:
pip install -e .
-
Run the Demo Inside the terminal, start the Streamlit demo by running:
python -m streamlit run demo/app.py
NOTE: The first time you run the demo app it might take a while to generate the script or the audio because it will download the models to the machine which are a few GBs in size.
-
Document Upload Start by uploading a document in a supported format (e.g., PDF, .txt, or .docx).
-
Document Pre-Processing The uploaded document is processed to extract and clean the text. This involves:
- Extracting readable text from the document.
- Removing noise such as URLs, email addresses, and special characters to ensure the text is clean and structured.
-
Script Generation The cleaned text is passed to a language model to generate a podcast transcript in the form of a conversation between two speakers.
- Model Loading: The system selects and loads a pre-trained LLM optimized for running locally, using the llama_cpp library. This enables the model to run efficiently on CPUs, making them more accessible and suitable for local setups.
- Customizable Prompt: A user-defined "system prompt" guides the LLM in shaping the conversation, specifying tone, content, speaker interaction, and format.
- Output Transcript: The model generates a podcast script in structured format, with each speaker's dialogue clearly labeled.
Example output:
{ "Speaker 1": "Welcome to the podcast on AI advancements.", "Speaker 2": "Thank you! So what's new this week for the latest AI trends?", "Speaker 1": "Where should I start.. Lots has been happening!", ... }
This step ensures that the podcast script is engaging, relevant, and ready for audio conversion.
-
Audio Generation
- The generated transcript is converted into audio using a Text-to-Speech (TTS) model.
- Each speaker is assigned a distinct voice.
- The final output is saved as an audio file in formats like MP3 or WAV.
-
System requirements:
- OS: Windows, macOS, or Linux
- Python 3.10 or higher
- Minimum RAM: 16 GB
- Disk space: 32 GB minimum
-
Dependencies:
- Dependencies listed in
pyproject.toml
- Dependencies listed in
When starting up the codespace, I get the message
Oh no, it looks like you are offline!
If you are on Firefox and have Enhanced Tracking Protection On
, try turning it Off
for the codespace webpage.
During the installation of the package, it fails with
ERROR: Failed building wheel for llama-cpp-python
You are probably missing the GNU Make
package. A quick way to solve it is run on your terminal sudo apt install build-essential
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.