ddg_data
Module for loading and preprocessing ddG data.
SequenceDataModule
SequenceDataModule(
datafiles: Dict, params: Dict
)
A data module to handle sequence data with functional scores.
Args
- datafiles (dict) : A dictionary containing data files. Must contain "train".
- params (dict) : Dataloader parameters.
Attributes
- datafile_train (str) : The path to the training data file.
- datafile_val (str) : The path to the validation data file, if provided.
- datafile_test (str) : The path to the test data file, if provided.
- batch_size (int) : The batch size for the dataloader.
- num_workers (int) : The number of workers for the dataloader.
- drop_last (bool) : Whether to drop the last batch if it is smaller than the batch size.
- replacement (bool) : Whether to sample with replacement.
- sequence_length (int) : The length of the sequence data.
- alphabet_size (int) : The size of the alphabet for the sequence data.
Methods
.read_data_file
.read_data_file(
filename: str
)
Reads a CSV file containing sequence data and returns a TensorDataset. Required columns in the CSV file are "seq" and "y".
Args
- filename (str) : Path to the CSV file.
Returns
A TensorDataset with one-hot encoded sequences and labels.
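As an illustration of the one-hot encoding step, here is a minimal, hypothetical sketch of such a reader. The 20-letter amino-acid alphabet and the one_hot_encode helper are illustrative assumptions, not the module's actual implementation, and all sequences are assumed to share one fixed length.

```python
# Hypothetical sketch of a reader like read_data_file; the alphabet and
# the one_hot_encode helper are illustrative assumptions.
import pandas as pd
import torch
from torch.utils.data import TensorDataset

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter amino-acid alphabet
AA_TO_IDX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot_encode(seq: str) -> torch.Tensor:
    """One-hot encode a sequence to shape (len(seq), alphabet_size)."""
    idx = torch.tensor([AA_TO_IDX[aa] for aa in seq])
    return torch.nn.functional.one_hot(idx, num_classes=len(ALPHABET)).float()

def read_data_file(filename: str) -> TensorDataset:
    df = pd.read_csv(filename)  # must contain "seq" and "y"
    # torch.stack assumes all sequences have the same length
    x = torch.stack([one_hot_encode(s) for s in df["seq"]])
    y = torch.tensor(df["y"].values, dtype=torch.float32)
    return TensorDataset(x, y)
```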
.setup
.setup(
stage: Optional[str] = None
)
.train_dataloader
.train_dataloader()
.val_dataloader
.val_dataloader()
.test_dataloader
.test_dataloader()
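A usage sketch for the module as documented above; the keys expected in params are taken from the attribute list, but the file names are placeholders.

```python
# Illustrative usage; file names are placeholders.
datafiles = {"train": "train.csv", "val": "val.csv"}
params = {"batch_size": 64, "num_workers": 4,
          "drop_last": False, "replacement": False}

dm = SequenceDataModule(datafiles, params)
dm.setup("fit")
x, y = next(iter(dm.train_dataloader()))
print(x.shape, y.shape)  # e.g. (batch_size, sequence_length, alphabet_size), (batch_size,)
```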
Sequence_WT_DataModule
Sequence_WT_DataModule(
datafiles: Dict, params: Dict
)
A data module to handle sequence data in the context of the wild-type (WT) reference sequence.
Args
- datafiles (dict) : A dictionary containing data files. Must contain "train".
- params (dict) : Dataloader parameters.
Attributes
- datafile_train (str) : The path to the training data file.
- datafile_val (str) : The path to the validation data file, if provided.
- datafile_test (str) : The path to the test data file, if provided.
- batch_size (int) : The batch size for the dataloader.
- num_workers (int) : The number of workers for the dataloader.
- drop_last (bool) : Whether to drop the last batch if it is smaller than the batch size.
- replacement (bool) : Whether to sample with replacement.
- sequence_length (int) : The length of the sequence data.
- alphabet_size (int) : The size of the alphabet for the sequence data.
Methods
.read_data_file
.read_data_file(
filename: str
)
Reads a CSV file containing sequence data and returns a TensorDataset. Required columns in the CSV file are "seq", "wt", and "y".
Args
- filename (str) : Path to the CSV file.
Returns
A TensorDataset with one-hot encoded sequences and labels.
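A hypothetical sketch of the WT-aware reader, reusing the one_hot_encode helper from the sketch above; how the returned TensorDataset actually arranges the variant and WT tensors is not documented, so the layout below is an assumption.

```python
# Hypothetical sketch; the (variant, WT, label) tensor layout is an assumption.
import pandas as pd
import torch
from torch.utils.data import TensorDataset

def read_data_file(filename: str) -> TensorDataset:
    df = pd.read_csv(filename)  # must contain "seq", "wt", and "y"
    x = torch.stack([one_hot_encode(s) for s in df["seq"]])    # variant sequences
    x_wt = torch.stack([one_hot_encode(s) for s in df["wt"]])  # wild-type references
    y = torch.tensor(df["y"].values, dtype=torch.float32)
    return TensorDataset(x, x_wt, y)
```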
.setup
.setup(
stage: Optional[str] = None
)
.train_dataloader
.train_dataloader()
.val_dataloader
.val_dataloader()
.test_dataloader
.test_dataloader()
EmbeddingsDataset
EmbeddingsDataset(
emb_dict, ids: List, *tensors: torch.Tensor
)
A dataset that overrides the __getitem__ method to look up and return each batch's embeddings on the fly, avoiding loading the entire embeddings file into memory.
Args
- emb_dict : Dictionary containing the embeddings.
- ids : List of protein IDs.
- tensors : Tensors containing the sequence data and labels.
Attributes
- tensors : Tensors containing the sequence data and labels.
- embeddings : Dictionary containing the embeddings.
- ids : List of protein IDs.
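The lazy-lookup idea can be sketched as follows; the attribute names mirror the documented ones, but the return layout and the `[()]` read are assumptions (they match how an open h5py file or a dict of arrays would behave).

```python
# Minimal sketch of the lazy-lookup pattern; the return layout is an assumption.
import torch
from torch.utils.data import Dataset

class EmbeddingsDataset(Dataset):
    def __init__(self, emb_dict, ids, *tensors):
        self.embeddings = emb_dict  # e.g. an open h5py.File, read lazily
        self.ids = ids
        self.tensors = tensors

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, index):
        # Only this item's embedding is read here, so the full
        # embeddings file never has to fit into memory.
        emb = torch.as_tensor(self.embeddings[self.ids[index]][()])
        return (emb, *(t[index] for t in self.tensors))
```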
EmbeddingsDataModule
EmbeddingsDataModule(
datafiles: Dict, params: Dict
)
A PyTorch Lightning DataModule for handling sequences embedded with a large protein model (embeddings stored in .h5 format).
Args
- datafiles (dict) : A dictionary containing data files. Must contain "train" and "embeddings".
- params (dict) : Dataloader parameters.
Attributes
- embedding_file (str) : The path to the embedding file.
- datafile_train (str) : The path to the training data file.
- datafile_val (str) : The path to the validation data file, if provided.
- datafile_test (str) : The path to the test data file, if provided.
- batch_size (int) : The batch size for the dataloader.
- num_workers (int) : The number of workers for the dataloader.
- drop_last (bool) : Whether to drop the last batch if it is smaller than the batch size.
- replacement (bool) : Whether to sample with replacement.
- sequence_length (int) : The length of the sequence data.
- alphabet_size (int) : The size of the alphabet for the sequence data.
Methods
.read_data_file
.read_data_file(
filename: str
)
Reads a CSV file containing sequence data and returns a TensorDataset. Required columns in the CSV file are "seq", "y", and "id".
Args
- filename (str) : Path to the CSV file.
Returns
A TensorDataset with one-hot encoded sequences and labels.
.generate_embeddings
.generate_embeddings(
df: pd.DataFrame
)
Generates the corresponding lookup keys for the input data. The keys are stored instead of all the embeddings due to RAM limitations.
Args
- df (pd.DataFrame) : Input DataFrame with a column "id" containing a unique ID string for each sequence that matches the keys in the embedding file.
Returns
- array : Sequence embeddings with dim (N_sequences, N_AminoAcids, N_embedding_dim).
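The key-lookup idea might look like the following sketch; generate_embedding_keys is a hypothetical name, and validating the ids against the .h5 file is an assumption about the internals.

```python
# Hypothetical sketch: keep only the "id" strings; embeddings stay on disk.
import h5py
import pandas as pd

def generate_embedding_keys(df: pd.DataFrame, embedding_file: str) -> list:
    ids = df["id"].astype(str).tolist()
    with h5py.File(embedding_file, "r") as f:
        missing = [i for i in ids if i not in f]
    if missing:
        raise KeyError(f"{len(missing)} ids missing from {embedding_file}")
    return ids
```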
.setup
.setup(
stage: Optional[str] = None
)
.train_dataloader
.train_dataloader()
.val_dataloader
.val_dataloader()
.test_dataloader
.test_dataloader()
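A usage sketch; the batch layout of (embedding, label) is an assumption, and the file names are placeholders.

```python
# Illustrative usage; batch layout is assumed to be (embedding, label).
datafiles = {"train": "train.csv", "embeddings": "embeddings.h5"}
params = {"batch_size": 32, "num_workers": 2,
          "drop_last": False, "replacement": False}

dm = EmbeddingsDataModule(datafiles, params)
dm.setup("fit")
emb, y = next(iter(dm.train_dataloader()))
print(emb.shape)  # e.g. (batch_size, N_AminoAcids, N_embedding_dim)
```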
Embeddings_WT_Dataset
Embeddings_WT_Dataset(
emb_dict, wt_dict, ids: List, wt: List, *tensors: torch.Tensor
)
A dataset that looks up the embeddings of each sequence, and the embeddings of the corresponding WT sequence, in the __getitem__ method.
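Assuming the same lazy pattern as EmbeddingsDataset, the WT-aware lookup can be sketched as below; the return layout is an assumption.

```python
# Minimal sketch of the WT-aware lazy lookup; the return layout is an assumption.
import torch
from torch.utils.data import Dataset

class Embeddings_WT_Dataset(Dataset):
    def __init__(self, emb_dict, wt_dict, ids, wt, *tensors):
        self.embeddings = emb_dict    # variant embeddings (e.g. open h5py.File)
        self.wt_embeddings = wt_dict  # wild-type embeddings
        self.ids = ids
        self.wt = wt
        self.tensors = tensors

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, index):
        emb = torch.as_tensor(self.embeddings[self.ids[index]][()])
        wt_emb = torch.as_tensor(self.wt_embeddings[self.wt[index]][()])
        return (emb, wt_emb, *(t[index] for t in self.tensors))
```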
Embeddings_WT_DataModule
Embeddings_WT_DataModule(
datafiles: dict, params: dict
)
A PyTorch Lightning DataModule for handling sequences embedded with a large protein model, in the context of the corresponding wild-type (WT) sequence.
Args
- datafiles (dict) : A dictionary containing data files. Must contain "train" and "embeddings".
- params (dict) : Dataloader parameters.
Attributes
- embedding_file (str) : The path to the embedding file.
- datafile_train (str) : The path to the training data file.
- datafile_val (str) : The path to the validation data file, if provided.
- datafile_test (str) : The path to the test data file, if provided.
- batch_size (int) : The batch size for the dataloader.
- num_workers (int) : The number of workers for the dataloader.
- drop_last (bool) : Whether to drop the last batch if it is smaller than the batch size.
- replacement (bool) : Whether to sample with replacement.
- sequence_length (int) : The length of the sequence data.
- alphabet_size (int) : The size of the alphabet for the sequence data.
Methods
.read_data_file
.read_data_file(
filename: str
)
.generate_embeddings
.generate_embeddings(
df: pd.DataFrame
)
Generates the corresponding lookup keys for the input data. The keys are stored instead of all the embeddings due to RAM limitations.
Args
- df (pd.DataFrame) : Input DataFrame with a column "id" containing a unique ID string for each sequence that matches the keys in the embedding file.
Returns
- array : Sequence embeddings with dim (N_sequences, N_AminoAcids, N_embedding_dim).
.setup
.setup(
stage: Optional[str] = None
)
.train_dataloader
.train_dataloader()
.val_dataloader
.val_dataloader()
.test_dataloader
.test_dataloader()
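Finally, an end-to-end sketch of wiring the module into a PyTorch Lightning Trainer; `model` stands in for any compatible LightningModule and is not part of this module.

```python
# Illustrative end-to-end wiring; `model` is a placeholder LightningModule.
import pytorch_lightning as pl

datafiles = {"train": "train.csv", "val": "val.csv", "embeddings": "embeddings.h5"}
params = {"batch_size": 32, "num_workers": 2,
          "drop_last": False, "replacement": False}

dm = Embeddings_WT_DataModule(datafiles, params)
trainer = pl.Trainer(max_epochs=10)
# trainer.fit(model, datamodule=dm)
```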