
ddg_data

Module for loading and preprocessing ddG (ΔΔG) data.

SequenceDataModule

source

SequenceDataModule(
   datafiles: Dict, params: Dict
)

A data module for handling sequence data with associated functional scores. A usage sketch follows the attribute list below.

Args

  • datafiles (dict) : A dictionary containing data files. Must contain "train".
  • params (dict) : Dataloader parameters.

Attributes

  • datafile_train (str) : The path to the training data file.
  • datafile_val (str) : The path to the validation data file, if provided.
  • datafile_test (str) : The path to the test data file, if provided.
  • batch_size (int) : The batch size for the dataloader.
  • num_workers (int) : The number of workers for the dataloader.
  • drop_last (bool) : Whether to drop the last batch if it is smaller than the batch size.
  • replacement (bool) : Whether to sample with replacement.
  • sequence_length (int) : The length of the sequence data.
  • alphabet_size (int) : The size of the alphabet for the sequence data.
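
A minimal usage sketch (hypothetical file paths; the parameter keys are inferred from the attribute list above and may differ):

from ddg_data import SequenceDataModule  # assumed import path

datafiles = {
    "train": "data/train.csv",  # required
    "val": "data/val.csv",      # optional
    "test": "data/test.csv",    # optional
}
params = {
    "batch_size": 64,
    "num_workers": 4,
    "drop_last": False,
    "replacement": False,
}

dm = SequenceDataModule(datafiles, params)
dm.setup()
train_loader = dm.train_dataloader()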

Methods:

.read_data_file

source

.read_data_file(
   filename: str
)

Reads a CSV file containing sequence data and returns a TensorDataset. Required columns in the CSV file are "seq" and "y" (see the sketch below).

Args

  • filename (str) : Path to the CSV file.

Returns

TensorDataset with one-hot encoded sequences and labels.
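
For illustration, a hedged sketch of an encoding equivalent to what this method produces; the alphabet, dtypes, and the assumption of equal-length sequences are not confirmed by the source:

import pandas as pd
import torch
from torch.nn.functional import one_hot
from torch.utils.data import TensorDataset

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter amino-acid alphabet
AA_TO_IDX = {aa: i for i, aa in enumerate(ALPHABET)}

df = pd.read_csv("data/train.csv")  # must contain "seq" and "y"
indices = torch.tensor([[AA_TO_IDX[aa] for aa in s] for s in df["seq"]])
x = one_hot(indices, num_classes=len(ALPHABET)).float()
y = torch.tensor(df["y"].values, dtype=torch.float32)
dataset = TensorDataset(x, y)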

.setup

source

.setup(
   stage: Optional[str] = None
)

.train_dataloader

source

.train_dataloader()

.val_dataloader

source

.val_dataloader()

.test_dataloader

source

.test_dataloader()

Sequence_WT_DataModule

source

Sequence_WT_DataModule(
   datafiles: Dict, params: Dict
)

A data module to handle sequence data in the context of the wild-type reference sequence.

Args

  • datafiles (dict) : A dictionary containing data files. Must contain "train".
  • params (dict) : Dataloader parameters.

Attributes

  • datafile_train (str) : The path to the training data file.
  • datafile_val (str) : The path to the validation data file, if provided.
  • datafile_test (str) : The path to the test data file, if provided.
  • batch_size (int) : The batch size for the dataloader.
  • num_workers (int) : The number of workers for the dataloader.
  • drop_last (bool) : Whether to drop the last batch if it is smaller than the batch size.
  • replacement (bool) : Whether to sample with replacement.
  • sequence_length (int) : The length of the sequence data.
  • alphabet_size (int) : The size of the alphabet for the sequence data.

Methods:

.read_data_file

source

.read_data_file(
   filename: str
)

Reads a CSV file containing sequence data and returns a TensorDataset. Required columns in the CSV file are "seq", "wt", and "y" (see the example rows below).

Args

  • filename (str) : Path to the CSV file.

Returns

TensorDataset with one-hot encoded sequences and labels.
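
Hypothetical example rows for such a file, where "wt" holds the wild-type reference for each variant:

import pandas as pd

df = pd.DataFrame({
    "seq": ["MKTVYIAK", "MKTAYIAR"],  # variant sequences
    "wt":  ["MKTAYIAK", "MKTAYIAK"],  # wild-type reference sequence
    "y":   [0.35, -1.20],             # functional scores, e.g. ddG values
})
df.to_csv("data/train_wt.csv", index=False)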

.setup

source

.setup(
   stage: Optional[str] = None
)

.train_dataloader

source

.train_dataloader()

.val_dataloader

source

.val_dataloader()

.test_dataloader

source

.test_dataloader()

EmbeddingsDataset

source

EmbeddingsDataset(
   emb_dict, ids: List, *tensors: torch.Tensor
)

Embeddings Dataset: a dataset that overrides the __getitem__ method to look up and return each batch's embeddings on demand, avoiding loading the entire embeddings file into memory. A sketch of the pattern follows the attribute list below.

Args

  • emb_dict : Dictionary containing the embeddings.
  • ids : List of protein IDs.
  • tensors : Tensors containing the sequence data and labels.

Attributes

  • tensors : Tensors containing the sequence data and labels.
  • embeddings : Dictionary containing the embeddings.
  • ids : List of protein IDs.
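
A sketch of the lazy-lookup pattern this class implements (class and variable names are illustrative, not the actual implementation):

import torch
from torch.utils.data import Dataset

class LazyEmbeddingsDataset(Dataset):
    def __init__(self, emb_dict, ids, *tensors):
        self.embeddings = emb_dict  # dict-like, e.g. an open h5py.File
        self.ids = ids
        self.tensors = tensors

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, index):
        # Read only this item's embedding, keyed by its protein id, so the
        # full embeddings file never has to fit in memory.
        emb = torch.as_tensor(self.embeddings[self.ids[index]][...])
        return (emb, *(t[index] for t in self.tensors))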

EmbeddingsDataModule

source

EmbeddingsDataModule(
   datafiles: Dict, params: Dict
)

A PyTorch Lightning DataModule for handling sequences embedded with a large protein model (stored in .h5 format). A usage sketch follows the attribute list below.

Args

  • datafiles (dict) : A dictionary containing data files. Must contain "train" and "embeddings".
  • params (dict) : Dataloader parameters.

Attributes

  • embedding_file (str) : The path to the embedding file.
  • datafile_train (str) : The path to the training data file.
  • datafile_val (str) : The path to the validation data file, if provided.
  • datafile_test (str) : The path to the test data file, if provided.
  • batch_size (int) : The batch size for the dataloader.
  • num_workers (int) : The number of workers for the dataloader.
  • drop_last (bool) : Whether to drop the last batch if it is smaller than the batch size.
  • replacement (bool) : Whether to sample with replacement.
  • sequence_length (int) : The length of the sequence data.
  • alphabet_size (int) : The size of the alphabet for the sequence data.
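
A minimal usage sketch, assuming a .h5 embeddings file whose keys match the "id" column of the CSV files (the params keys and batch structure are assumptions):

datafiles = {
    "train": "data/train.csv",
    "embeddings": "data/embeddings.h5",  # keys must match the "id" column
}
params = {"batch_size": 32, "num_workers": 2}

dm = EmbeddingsDataModule(datafiles, params)
dm.setup()
batch = next(iter(dm.train_dataloader()))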

Methods:

.read_data_file

source

.read_data_file(
   filename: str
)

Reads a CSV file containing sequence data and returns a TensorDataset. Required columns in the CSV file are "seq", "y", and "id".

Args

  • filename (str) : Path to the CSV file.

Returns

TensorDataset with one-hot encoded sequences and labels.

.generate_embeddings

source

.generate_embeddings(
   df: pd.DataFrame
)

Generates the corresponding lookup keys for the input data. The keys are stored instead of all the embeddings due to RAM limitations (see the sketch below).

Args

  • df (pd.DataFrame) : Input DataFrame with a column "id" containing a unique ID string for each sequence that matches the keys in the embeddings file.

Returns

  • array : Sequence embeddings with dim (N_sequences, N_amino_acids, N_embedding_dim).
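
A sketch of the key-lookup idea (the function name is hypothetical): keep only the per-row h5 keys and defer reading the actual arrays until batch time.

import h5py
import pandas as pd

def collect_embedding_keys(df: pd.DataFrame, embedding_file: str):
    # Verify every id has a matching embedding, then store only the keys.
    with h5py.File(embedding_file, "r") as f:
        missing = [i for i in df["id"] if i not in f]
    if missing:
        raise KeyError(f"ids without embeddings: {missing[:5]}")
    return list(df["id"])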

.setup

source

.setup(
   stage: Optional[str] = None
)

.train_dataloader

source

.train_dataloader()

.val_dataloader

source

.val_dataloader()

.test_dataloader

source

.test_dataloader()

Embeddings_WT_Dataset

source

Embeddings_WT_Dataset(
   emb_dict, wt_dict, ids: List, wt: List, *tensors: torch.Tensor
)

Embeddings Dataset: a dataset that looks up the embeddings of both the variant sequence and the wild-type (WT) sequence in the __getitem__ method.
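
A sketch of the double lookup performed per item, mirroring the documented constructor signature (names are illustrative, not the actual implementation):

import torch
from torch.utils.data import Dataset

class LazyWTEmbeddingsDataset(Dataset):
    def __init__(self, emb_dict, wt_dict, ids, wt, *tensors):
        self.embeddings = emb_dict     # variant embeddings
        self.wt_embeddings = wt_dict   # wild-type embeddings
        self.ids = ids
        self.wt = wt
        self.tensors = tensors

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, index):
        # Fetch the variant embedding and its wild-type counterpart.
        emb = torch.as_tensor(self.embeddings[self.ids[index]][...])
        wt_emb = torch.as_tensor(self.wt_embeddings[self.wt[index]][...])
        return (emb, wt_emb, *(t[index] for t in self.tensors))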


Embeddings_WT_DataModule

source

Embeddings_WT_DataModule(
   datafiles: dict, params: dict
)

A PyTorch Lightning DataModule for handling sequences embedded with a large protein model, in the context of the corresponding wild-type (WT) sequence.

Args

  • datafiles (dict) : A dictionary containing data files. Must contain "train" and "embeddings".
  • params (dict) : Dataloader parameters.

Attributes

  • embedding_file (str) : The path to the embedding file.
  • datafile_train (str) : The path to the training data file.
  • datafile_val (str) : The path to the validation data file, if provided.
  • datafile_test (str) : The path to the test data file, if provided.
  • batch_size (int) : The batch size for the dataloader.
  • num_workers (int) : The number of workers for the dataloader.
  • drop_last (bool) : Whether to drop the last batch if it is smaller than the batch size.
  • replacement (bool) : Whether to sample with replacement.
  • sequence_length (int) : The length of the sequence data.
  • alphabet_size (int) : The size of the alphabet for the sequence data.

Methods:

.read_data_file

source

.read_data_file(
   filename: str
)

.generate_embeddings

source

.generate_embeddings(
   df: pd.DataFrame
)

Generates the corresponding lookup keys for the input data. The keys are stored instead of all the embeddings due to RAM limitations.

Args

  • df (pd.DataFrame) : Input DataFrame with a column "id" containing a unique ID string for each sequence that matches the keys in the embeddings file.

Returns

  • array : Sequence embeddings with dim (N_sequences, N_amino_acids, N_embedding_dim).

.setup

source

.setup(
   stage: Optional[str] = None
)

.train_dataloader

source

.train_dataloader()

.val_dataloader

source

.val_dataloader()

.test_dataloader

source

.test_dataloader()