Listen, Chat, And Edit on Edge: Text-Guided Soundscape Modification for Real-Time Auditory Experience
What is this project about?
Listen, Chat, and Edit (LCE) is a cutting-edge multimodal sound mixture editor designed to modify each sound source in a mixture based on user-provided text instructions. The system features a user-friendly chat interface and the unique ability to edit multiple sound sources simultaneously within a mixture without the need for separation. Using open-vocabulary text prompts interpreted by a large language model, LCE creates a semantic filter to edit sound mixtures, which are then decomposed, filtered, and reassembled into the desired output.
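The decompose → filter → reassemble flow described above can be sketched in miniature. This is a hypothetical illustration, not the real model code: `decompose` stands in for Conv-TasNet separation, and `semantic_gains` stands in for the LLM-derived semantic filter; all names and the toy data are assumptions made for the sketch.

```python
def decompose(mixture, n_sources):
    # Stand-in for Conv-TasNet separation: each "source" is just an
    # equal share of the mixture so the sketch stays self-contained.
    return [[s / n_sources for s in mixture] for _ in range(n_sources)]

def semantic_gains(embedding, n_sources):
    # Stand-in for the semantic filter produced from the text
    # instruction: one gain per separated source.
    return embedding[:n_sources]

def edit_mixture(mixture, embedding, n_sources=2):
    sources = decompose(mixture, n_sources)
    gains = semantic_gains(embedding, n_sources)
    # Scale every source by its gain, then reassemble by summing the
    # filtered sources sample-by-sample.
    edited = [[g * s for s in src] for g, src in zip(gains, sources)]
    return [sum(samples) for samples in zip(*edited)]

mixture = [0.2, -0.4, 0.6, -0.8]
# A "filter" that keeps source 0 and mutes source 1.
out = edit_mixture(mixture, embedding=[1.0, 0.0])
```

The point of the sketch is only the shape of the pipeline: editing happens by re-weighting separated sources, so several sources can be modified at once without the user ever handling separation explicitly.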
Project Structure
- data/datasets: Scripts used to process the dataset and prompts.
- demonstration: A demonstration of an input mixture and its edited version.
- embeddings: The .pkl files received from the LLM are stored in this folder.
- hparams: Hyperparameter settings for the models.
- llm_cloud: Configuration and scripts for cloud-based language model interactions.
- modules: Core modules and utilities for the project.
- prompts: Handling and processing of text prompts.
- pubsub: Setup for publish-subscribe messaging.
- utils: General-purpose utility scripts.
- E6692.2022Spring.LCEE.ss6928.pkk2125.presentationFinal.pptx: Final presentation detailing the project overview and results.
- profiling.ipynb: Jupyter notebook for profiling the modules' inference speed and GPU memory usage.
- run_lce.ipynb: Main executable notebook for the LCE system.
- run_prompt_reader.ipynb: Notebook for reading and processing prompts.
- run_prompt_reader_profiling.ipynb: Profiling for the prompt reader.
- run_sound_editor_nosb.ipynb: Notebook for the sound editor module without SpeechBrain.
Installation
- Clone the repository:
git clone https://github.com/SiavashShams/Listen-Chat-Edit-on-Edge.git
- Install required dependencies:
pip install -r requirements.txt
Usage
To run the main LCE application, open and execute:
run_lce.ipynb
For a demonstration of the system's capabilities, refer to the demonstration
folder.
Implementation
- Deploy Conv-TasNet on the Jetson Nano.
- Deploy LLAMA 2 on a GCP server.
- Send a prompt to the server. Communication is handled in one of two ways: through SSH, or through the Pub/Sub service.
- The LLM computes the embedding and publishes it back; this embedding is the input to the Conv-TasNet model.
- The resulting edited audio mixture is ready to be played!
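Since the embeddings folder stores the .pkl files received from the LLM, the server-to-device hand-off in the steps above can be sketched as a pickle round-trip. This is an illustrative sketch, not the project's actual transport code; the function names are assumptions, and the payload would travel over SSH or as a Pub/Sub message body in practice.

```python
import pickle

def serialize_embedding(embedding):
    # Server side (GCP): pack the LLM-computed embedding into bytes
    # so it can be written to a .pkl file or sent as a message payload.
    return pickle.dumps(embedding)

def deserialize_embedding(payload):
    # Device side (Jetson Nano): recover the embedding from the
    # received bytes before feeding it to the Conv-TasNet model.
    return pickle.loads(payload)

embedding = [0.12, -0.53, 0.98]
payload = serialize_embedding(embedding)
assert deserialize_embedding(payload) == embedding
```

Note that pickle should only be used between trusted endpoints, since unpickling untrusted bytes can execute arbitrary code.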
Links
Presentation
Report
References
Thanks to the authors of Listen, Chat, And Edit for their amazing work.