Metadata-Version: 2.1
Name: bio-embeddings
Version: 0.1.2
Summary: A pipeline for protein embedding generation and visualization
License: MIT
Author: Christian Dallago
Author-email: christian.dallago@tum.de
Maintainer: Rostlab
Maintainer-email: admin@rostlab.org
Requires-Python: >=3.7,<4.0
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Visualization
Requires-Dist: allennlp (>=0.9.0,<0.10.0)
Requires-Dist: biopython (>=1.76,<2.0)
Requires-Dist: gensim (>=3.8.2,<4.0.0)
Requires-Dist: h5py (>=2.10.0,<3.0.0)
Requires-Dist: lock (>=2018.3.25,<2019.0.0)
Requires-Dist: matplotlib (>=3.2.1,<4.0.0)
Requires-Dist: numpy (>=1.18.3,<2.0.0)
Requires-Dist: pandas (>=1.0.3,<2.0.0)
Requires-Dist: plotly (>=4.6.0,<5.0.0)
Requires-Dist: ruamel_yaml (>=0.16.10,<0.17.0)
Requires-Dist: scikit-learn (>=0.22.2.post1,<0.23.0)
Requires-Dist: scipy (>=1.4.1,<2.0.0)
Requires-Dist: torch (>=1.5.0,<2.0.0)
Requires-Dist: tqdm (>=4.45.0,<5.0.0)
Requires-Dist: transformers (>=2.8.0,<3.0.0)
Requires-Dist: umap-learn (>=0.4.2,<0.5.0)
Project-URL: homepage, https://visualize.protein.properties
Project-URL: issues, https://github.com/sacdallago/bio_embeddings/issues
Project-URL: repository, https://github.com/sacdallago/bio_embeddings
Project-URL: url, https://github.com/sacdallago/bio_embeddings
Description-Content-Type: text/markdown

# Bio Embeddings
The project includes:

- A pipeline that embeds the sequences in a FASTA file with one of several embedders (see below), then projects the embeddings and visualizes them in 3D plots.
- A web server that accepts sequences, embeds them, and either returns the embeddings or visualizes the embedding space in interactive online plots.
- A general-purpose library for embedding protein sequences in any Python application.
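
To make the projection step more concrete, here is a minimal sketch of reducing fixed-size embeddings to three dimensions with t-SNE. It calls scikit-learn's `TSNE` directly and is not the pipeline's own code. The function name `project_to_3d`, the perplexity value, and the random stand-in data are assumptions made for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE


def project_to_3d(embeddings: np.ndarray, perplexity: float = 5.0, seed: int = 0) -> np.ndarray:
    """Project an (n_proteins, n_features) matrix to (n_proteins, 3) with t-SNE."""
    if embeddings.ndim != 2:
        raise ValueError("expected a 2D array with one fixed-size vector per protein")
    if embeddings.shape[0] <= perplexity:
        raise ValueError("t-SNE needs more samples than the perplexity value")
    tsne = TSNE(n_components=3, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(embeddings)


# Random vectors stand in for real per-protein embeddings (illustrative only).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(30, 64))
coordinates = project_to_3d(fake_embeddings)
```

The resulting `coordinates` array has one row of x, y, z values per protein and could be passed to any 3D scatter plot.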

## Important information

- The `albert` model weights are not publicly available yet. You can request early access by opening an issue.
- Please help us by opening issues and submitting PRs as you see fit; this repository is under active development.

## Install guides

You can install the package with pip:

```bash
pip install bio-embeddings
```

Or directly from the source (e.g. to have the latest features):

```bash
pip install -U git+https://github.com/sacdallago/bio_embeddings.git
```

### Additional dependencies and steps to run the webserver

Running the webserver locally requires some experience deploying Python backends.
You will also need a few additional dependencies: `pip install dash celery pymongo flask-restx pyyaml`.

Additionally, you need to run two instances of the app (the backend and at least one Celery worker), and both must have access to a MongoDB instance and to a RabbitMQ or Redis store for Celery.

## Examples

We highly recommend checking out the `examples` folder for pipeline examples, and the `notebooks` folder for post-processing of pipeline runs and general-purpose use of the embedders.

After installing the package, you can:

1. Run the pipeline:

    ```bash
    bio_embeddings config.yml
    ```

    A blueprint of the configuration file and an example setup can be found in the `examples` directory of this repository.

1. Use the general-purpose embedder objects from Python, e.g.:

    ```python
    from bio_embeddings import SeqVecEmbedder

    embedder = SeqVecEmbedder()

    embedding = embedder.embed("SEQVENCE")
    ```

    More examples can be found in the `notebooks` folder of this repository.
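
Building on the snippet above, the following is a minimal sketch of embedding every sequence in a FASTA file. The helpers `read_fasta` and `embed_fasta` are hypothetical and not part of the package; the only thing assumed about the embedder is the `embed(sequence)` method shown above. The FASTA parsing is deliberately simple and uses only the standard library.

```python
from typing import Any, Dict, Iterator, List, Optional, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from a FASTA file."""
    identifier: Optional[str] = None
    chunks: List[str] = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                if identifier is not None:
                    yield identifier, "".join(chunks)
                # Use the first whitespace-delimited token of the header as the id.
                header = line[1:].split()
                identifier, chunks = (header[0] if header else ""), []
            else:
                chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def embed_fasta(path: str, embedder: Any) -> Dict[str, Any]:
    """Embed each sequence with any object that exposes embed(sequence)."""
    return {identifier: embedder.embed(sequence) for identifier, sequence in read_fasta(path)}
```

With the package installed, a call such as `embed_fasta("proteins.fasta", SeqVecEmbedder())` (the file name is a placeholder) would return a dictionary that maps each sequence identifier to its embedding.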
 
## Development status

1. Pipeline stages
    - embed:   
        - [x] SeqVec v1/v2 (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3220-8)
        - [ ] TransformerXL
        - [ ] fastText
        - [ ] GloVe
        - [ ] Word2Vec
        - [ ] UniRep (https://www.nature.com/articles/s41592-019-0598-1?sfns=mo)
        - [x] Albert (unpublished)
    - project:
        - [x] t-SNE
        - [x] UMAP
    
1. Web server:  
    - [x] SeqVec
    - [x] Albert (unpublished)
    
1. General purpose objects:
    - [x] SeqVec
    - [x] TransformerXL
    - [x] fastText
    - [x] GloVe
    - [x] Word2Vec
    - [ ] UniRep
    - [x] Albert (unpublished)
    

## Building a Distribution
Distributions are best built with invoke.
If you manage your dependencies with poetry, invoke should already be installed.
Run `poetry run invoke clean build` to update your requirements according to your current state
and to generate the dist files.

## Contributors

- Christian Dallago (lead)
- Tobias Olenyi
- Michael Heinzinger

