datasets¶

GATNE dataset¶

class cogdl.datasets.gatne.AmazonDataset[source]¶: Bases: cogdl.datasets.gatne.GatneDataset

class cogdl.datasets.gatne.GatneDataset(root, name)[source]¶

Bases: cogdl.data.dataset.Dataset

The network datasets “Amazon”, “Twitter” and “YouTube” from the “Representation Learning for Attributed Multiplex Heterogeneous Network” paper.

Args:: root (string): Root directory where the dataset should be saved. name (string): The name of the dataset ("Amazon", "Twitter", "YouTube").

download()[source]¶: Downloads the dataset to the self.raw_dir folder.

get(idx)[source]¶: Gets the data object at index idx.

process()[source]¶: Processes the dataset to the self.processed_dir folder.

processed_file_names¶: The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names¶: The name of the files to find in the self.raw_dir folder in order to skip the download.
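The skip behaviour described for raw_file_names and processed_file_names can be pictured with a small sketch. The helpers `files_exist` and `needs_download` and the file names are hypothetical, not part of cogdl; the sketch only assumes that a step is skipped when every listed file is already present in the corresponding folder.

```python
import os
import tempfile


def files_exist(folder, file_names):
    """Return True only if every listed file is present in folder."""
    return len(file_names) > 0 and all(
        os.path.exists(os.path.join(folder, name)) for name in file_names
    )


def needs_download(raw_dir, raw_file_names):
    # Hypothetical check: download only when some raw file is missing.
    return not files_exist(raw_dir, raw_file_names)


with tempfile.TemporaryDirectory() as raw_dir:
    names = ["train.txt", "valid.txt"]  # illustrative file names
    before = needs_download(raw_dir, names)
    for name in names:
        # Create empty placeholder files to simulate a finished download.
        open(os.path.join(raw_dir, name), "w").close()
    after = needs_download(raw_dir, names)
```

The same check, applied to the processed folder, would decide whether process() has to run again.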

url = 'https://github.com/THUDM/GATNE/raw/master/data'¶

class cogdl.datasets.gatne.TwitterDataset[source]¶: Bases: cogdl.datasets.gatne.GatneDataset

class cogdl.datasets.gatne.YouTubeDataset[source]¶: Bases: cogdl.datasets.gatne.GatneDataset

cogdl.datasets.gatne.read_gatne_data(folder)[source]¶

GCC dataset¶

class cogdl.datasets.gcc_data.Edgelist(root, name)[source]¶

Bases: cogdl.data.dataset.Dataset

download()[source]¶: Downloads the dataset to the self.raw_dir folder.

get(idx)[source]¶: Gets the data object at index idx.

num_classes¶: The number of classes in the dataset.

process()[source]¶: Processes the dataset to the self.processed_dir folder.

processed_file_names¶: The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names¶: The name of the files to find in the self.raw_dir folder in order to skip the download.

url = 'https://github.com/cenyk1230/gcc-data/raw/master'¶

class cogdl.datasets.gcc_data.GCCDataset(root, name)[source]¶

Bases: cogdl.data.dataset.Dataset

download()[source]¶: Downloads the dataset to the self.raw_dir folder.

get(idx)[source]¶: Gets the data object at index idx.

preprocess(root, name)[source]¶

processed_file_names¶: The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names¶: The name of the files to find in the self.raw_dir folder in order to skip the download.

url = 'https://github.com/cenyk1230/gcc-data/raw/master'¶

class cogdl.datasets.gcc_data.KDD_ICDM_GCCDataset[source]¶: Bases: cogdl.datasets.gcc_data.GCCDataset

class cogdl.datasets.gcc_data.SIGIR_CIKM_GCCDataset[source]¶: Bases: cogdl.datasets.gcc_data.GCCDataset

class cogdl.datasets.gcc_data.SIGMOD_ICDE_GCCDataset[source]¶: Bases: cogdl.datasets.gcc_data.GCCDataset

class cogdl.datasets.gcc_data.USAAirportDataset[source]¶: Bases: cogdl.datasets.gcc_data.Edgelist

GTN dataset¶

class cogdl.datasets.gtn_data.ACM_GTNDataset[source]¶: Bases: cogdl.datasets.gtn_data.GTNDataset

class cogdl.datasets.gtn_data.DBLP_GTNDataset[source]¶: Bases: cogdl.datasets.gtn_data.GTNDataset

class cogdl.datasets.gtn_data.GTNDataset(root, name)[source]¶

Bases: cogdl.data.dataset.Dataset

The network datasets “ACM”, “DBLP” and “IMDB” from the “Graph Transformer Networks” paper.

Args:: root (string): Root directory where the dataset should be saved. name (string): The name of the dataset ("gtn-acm", "gtn-dblp", "gtn-imdb").

apply_to_device(device)[source]¶

download()[source]¶: Downloads the dataset to the self.raw_dir folder.

get(idx)[source]¶: Gets the data object at index idx.

num_classes¶: The number of classes in the dataset.

process()[source]¶: Processes the dataset to the self.processed_dir folder.

processed_file_names¶: The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names¶: The name of the files to find in the self.raw_dir folder in order to skip the download.

read_gtn_data(folder)[source]¶

class cogdl.datasets.gtn_data.IMDB_GTNDataset[source]¶: Bases: cogdl.datasets.gtn_data.GTNDataset

HAN dataset¶

class cogdl.datasets.han_data.ACM_HANDataset[source]¶: Bases: cogdl.datasets.han_data.HANDataset

class cogdl.datasets.han_data.DBLP_HANDataset[source]¶: Bases: cogdl.datasets.han_data.HANDataset

class cogdl.datasets.han_data.HANDataset(root, name)[source]¶

Bases: cogdl.data.dataset.Dataset

The network datasets “ACM”, “DBLP” and “IMDB” from the “Heterogeneous Graph Attention Network” paper.

Args:: root (string): Root directory where the dataset should be saved. name (string): The name of the dataset ("han-acm", "han-dblp", "han-imdb").

apply_to_device(device)[source]¶

download()[source]¶: Downloads the dataset to the self.raw_dir folder.

get(idx)[source]¶: Gets the data object at index idx.

num_classes¶: The number of classes in the dataset.

process()[source]¶: Processes the dataset to the self.processed_dir folder.

processed_file_names¶: The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names¶: The name of the files to find in the self.raw_dir folder in order to skip the download.

read_gtn_data(folder)[source]¶

class cogdl.datasets.han_data.IMDB_HANDataset[source]¶: Bases: cogdl.datasets.han_data.HANDataset

cogdl.datasets.han_data.sample_mask(idx, length)[source]¶: Create mask.
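As a rough illustration of what "create mask" usually means for node splits, the sketch below builds a boolean mask of a given length that is true at the selected indices. It uses plain Python lists and a hypothetical name, `sample_mask_sketch`; the return type of the actual function is not stated here and is likely an array or tensor instead.

```python
def sample_mask_sketch(idx, length):
    """Return a list of `length` booleans that is True at each index in idx."""
    mask = [False] * length
    for i in idx:
        mask[i] = True
    return mask


# Mark nodes 0, 2 and 3 out of 5 as belonging to a (hypothetical) training split.
train_mask = sample_mask_sketch([0, 2, 3], 5)
```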

KG dataset¶

class cogdl.datasets.kg_data.BidirectionalOneShotIterator(dataloader_head, dataloader_tail)[source]¶

Bases: object

static one_shot_iterator(dataloader)[source]¶: Transform a PyTorch DataLoader into a Python iterator.
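A minimal sketch of the idea, using plain lists in place of PyTorch DataLoader objects. Two points are assumptions rather than documented behaviour: that the iterator restarts the loader when it is exhausted, and that the wrapping class alternates between the head and tail loaders (suggested only by its name). `AlternatingIterator` is a hypothetical stand-in.

```python
def one_shot_iterator(dataloader):
    """Yield batches forever by restarting the loader after each pass."""
    while True:
        for batch in dataloader:
            yield batch


class AlternatingIterator:
    """Hypothetical stand-in: alternate between two endless iterators."""

    def __init__(self, loader_head, loader_tail):
        self.iter_head = one_shot_iterator(loader_head)
        self.iter_tail = one_shot_iterator(loader_tail)
        self.step = 0

    def __iter__(self):
        return self

    def __next__(self):
        self.step += 1
        # Odd steps draw from the head loader, even steps from the tail loader.
        if self.step % 2 == 1:
            return next(self.iter_head)
        return next(self.iter_tail)


it = AlternatingIterator(["h0", "h1"], ["t0"])
first_six = [next(it) for _ in range(6)]
```

Because the generator never raises StopIteration for a non-empty loader, the training loop decides how many steps to run.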

class cogdl.datasets.kg_data.FB13Datset[source]¶: Bases: cogdl.datasets.kg_data.KnowledgeGraphDataset

class cogdl.datasets.kg_data.FB13SDatset[source]¶

Bases: cogdl.datasets.kg_data.KnowledgeGraphDataset

url = 'https://raw.githubusercontent.com/cenyk1230/test-data/main'¶

class cogdl.datasets.kg_data.FB15k237Datset[source]¶: Bases: cogdl.datasets.kg_data.KnowledgeGraphDataset

class cogdl.datasets.kg_data.FB15kDatset[source]¶: Bases: cogdl.datasets.kg_data.KnowledgeGraphDataset

class cogdl.datasets.kg_data.KnowledgeGraphDataset(root, name)[source]¶

Bases: cogdl.data.dataset.Dataset

download()[source]¶: Downloads the dataset to the self.raw_dir folder.

get(idx)[source]¶: Gets the data object at index idx.

num_entities¶

num_relations¶

process()[source]¶: Processes the dataset to the self.processed_dir folder.

processed_file_names¶: The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names¶: The name of the files to find in the self.raw_dir folder in order to skip the download.

test_start_idx¶

train_start_idx¶

url = 'https://raw.githubusercontent.com/thunlp/OpenKE/OpenKE-PyTorch/benchmarks'¶

valid_start_idx¶

class cogdl.datasets.kg_data.TestDataset(triples, all_true_triples, nentity, nrelation, mode)[source]¶

Bases: torch.utils.data.dataset.Dataset

static collate_fn(data)[source]¶

class cogdl.datasets.kg_data.TrainDataset(triples, nentity, nrelation, negative_sample_size, mode)[source]¶

Bases: torch.utils.data.dataset.Dataset

static collate_fn(data)[source]¶

static count_frequency(triples, start=4)[source]¶: Get the frequency of a partial triple such as (head, relation) or (relation, tail). The frequency is used for subsampling, as in word2vec.
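The counting step can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `start` is treated as the count assigned on first occurrence, the key layout (`"hr"` / `"rt"` tags) is hypothetical, and the inverse-square-root weight is one common word2vec-style choice that this page does not confirm.

```python
import math


def count_partial_frequency(triples, start=4):
    """Count (head, relation) and (relation, tail) pairs, starting at `start`."""
    counts = {}
    for head, relation, tail in triples:
        for key in (("hr", head, relation), ("rt", relation, tail)):
            # The first occurrence is recorded as `start`, later ones add 1.
            counts[key] = counts.get(key, start - 1) + 1
    return counts


def subsampling_weight(counts, triple):
    """Assumed word2vec-style weight: rarer partial triples weigh more."""
    head, relation, tail = triple
    total = counts[("hr", head, relation)] + counts[("rt", relation, tail)]
    return 1.0 / math.sqrt(total)


triples = [(0, 0, 1), (0, 0, 2), (3, 1, 2)]
counts = count_partial_frequency(triples)
weights = [subsampling_weight(counts, t) for t in triples]
```

Starting the count above zero smooths the weights, so a partial triple seen once is not weighted drastically higher than one seen a few times.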

static get_true_head_and_tail(triples)[source]¶: Build dictionaries of the true heads and tails found in triples, used to filter out true triples during negative sampling.
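One way to picture the filtering dictionaries is the sketch below. The function names and the choice of sets as values are assumptions made for illustration; the sketch only shows how known-true entities can be excluded from candidate negatives.

```python
def true_heads_and_tails(triples):
    """Index known heads by (relation, tail) and known tails by (head, relation)."""
    true_head = {}
    true_tail = {}
    for head, relation, tail in triples:
        true_tail.setdefault((head, relation), set()).add(tail)
        true_head.setdefault((relation, tail), set()).add(head)
    return true_head, true_tail


def filter_negative_tails(candidates, head, relation, true_tail):
    """Drop candidate tails that would form a triple already known to be true."""
    known = true_tail.get((head, relation), set())
    return [t for t in candidates if t not in known]


triples = [(0, 0, 1), (0, 0, 2), (3, 0, 1)]
true_head, true_tail = true_heads_and_tails(triples)
negatives = filter_negative_tails([1, 2, 3, 4], 0, 0, true_tail)
```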

class cogdl.datasets.kg_data.WN18Datset[source]¶: Bases: cogdl.datasets.kg_data.KnowledgeGraphDataset

class cogdl.datasets.kg_data.WN18RRDataset[source]¶: Bases: cogdl.datasets.kg_data.KnowledgeGraphDataset

cogdl.datasets.kg_data.read_triplet_data(folder)[source]¶

Matlab matrix dataset¶

class cogdl.datasets.matlab_matrix.BlogcatalogDataset[source]¶: Bases: cogdl.datasets.matlab_matrix.MatlabMatrix

class cogdl.datasets.matlab_matrix.DblpNEDataset[source]¶: Bases: cogdl.datasets.matlab_matrix.NetworkEmbeddingCMTYDataset

class cogdl.datasets.matlab_matrix.FlickrDataset[source]¶: Bases: cogdl.datasets.matlab_matrix.MatlabMatrix

class cogdl.datasets.matlab_matrix.MatlabMatrix(root, name, url)[source]¶

Bases: cogdl.data.dataset.Dataset

Networks from http://leitang.net/code/social-dimension/data/ or http://snap.stanford.edu/node2vec/

Args:: root (string): Root directory where the dataset should be saved. name (string): The name of the dataset ("Blogcatalog").

download()[source]¶: Downloads the dataset to the self.raw_dir folder.

get(idx)[source]¶: Gets the data object at index idx.

num_classes¶: The number of classes in the dataset.

num_nodes¶

process()[source]¶: Processes the dataset to the self.processed_dir folder.

processed_file_names¶: The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names¶: The name of the files to find in the self.raw_dir folder in order to skip the download.

class cogdl.datasets.matlab_matrix.NetworkEmbeddingCMTYDataset(root, name, url)[source]¶

Bases: cogdl.data.dataset.Dataset

download()[source]¶: Downloads the dataset to the self.raw_dir folder.

get(idx)[source]¶: Gets the data object at index idx.

num_classes¶: The number of classes in the dataset.

num_nodes¶

process()[source]¶: Processes the dataset to the self.processed_dir folder.

processed_file_names¶: The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names¶: The name of the files to find in the self.raw_dir folder in order to skip the download.

class cogdl.datasets.matlab_matrix.PPIDataset[source]¶: Bases: cogdl.datasets.matlab_matrix.MatlabMatrix

class cogdl.datasets.matlab_matrix.WikipediaDataset[source]¶: Bases: cogdl.datasets.matlab_matrix.MatlabMatrix

class cogdl.datasets.matlab_matrix.YoutubeNEDataset[source]¶: Bases: cogdl.datasets.matlab_matrix.NetworkEmbeddingCMTYDataset

PyG OGB dataset¶

class cogdl.datasets.ogb.OGBArxivDataset[source]¶: Bases: cogdl.datasets.ogb.OGBNDataset

class cogdl.datasets.ogb.OGBCodeDataset[source]¶: Bases: cogdl.datasets.ogb.OGBGDataset

class cogdl.datasets.ogb.OGBGDataset(root, name)[source]¶

Bases: cogdl.data.dataset.Dataset

get(idx)[source]¶: Gets the data object at index idx.

get_loader(args)[source]¶

get_subset(subset)[source]¶

num_classes¶: The number of classes in the dataset.

class cogdl.datasets.ogb.OGBMAGDataset[source]¶: Bases: cogdl.datasets.ogb.OGBNDataset

class cogdl.datasets.ogb.OGBMolbaceDataset[source]¶: Bases: cogdl.datasets.ogb.OGBGDataset

class cogdl.datasets.ogb.OGBMolhivDataset[source]¶: Bases: cogdl.datasets.ogb.OGBGDataset

class cogdl.datasets.ogb.OGBMolpcbaDataset[source]¶: Bases: cogdl.datasets.ogb.OGBGDataset

class cogdl.datasets.ogb.OGBNDataset(root, name)[source]¶

Bases: cogdl.data.dataset.Dataset

get(idx)[source]¶: Gets the data object at index idx.

get_evaluator()[source]¶

get_loss_fn()[source]¶

class cogdl.datasets.ogb.OGBPapers100MDataset[source]¶: Bases: cogdl.datasets.ogb.OGBNDataset

class cogdl.datasets.ogb.OGBPpaDataset[source]¶: Bases: cogdl.datasets.ogb.OGBGDataset

class cogdl.datasets.ogb.OGBProductsDataset[source]¶: Bases: cogdl.datasets.ogb.OGBNDataset

class cogdl.datasets.ogb.OGBProteinsDataset[source]¶: Bases: cogdl.datasets.ogb.OGBNDataset

cogdl.datasets.ogb.coalesce(row, col, edge_attr=None)[source]¶

PyG strategies dataset¶

This file is borrowed from https://github.com/snap-stanford/pretrain-gnns/

class cogdl.datasets.strategies_data.BACEDataset(transform=None, pre_transform=None, pre_filter=None, empty=False)[source]¶

Bases: cogdl.data.dataset.MultiGraphDataset

download()[source]¶: Downloads the dataset to the self.raw_dir folder.

process()[source]¶: Processes the dataset to the self.processed_dir folder.

processed_file_names¶: The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names¶: The name of the files to find in the self.raw_dir folder in order to skip the download.

class cogdl.datasets.strategies_data.BBBPDataset(transform=None, pre_transform=None, pre_filter=None, empty=False)[source]¶

Bases: cogdl.data.dataset.MultiGraphDataset

download()[source]¶: Downloads the dataset to the self.raw_dir folder.

process()[source]¶: Processes the dataset to the self.processed_dir folder.

processed_file_names¶: The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names¶: The name of the files to find in the self.raw_dir folder in order to skip the download.

class cogdl.datasets.strategies_data.BatchAE(batch=None, **kwargs)[source]¶

Bases: cogdl.data.data.Data

cat_dim(key)[source]¶: Returns the dimension in which the attribute key with content value gets concatenated when creating batches.

Note

This method is for internal use only, and should only be overridden if the batch concatenation process is corrupted for a specific data attribute.

static from_data_list(data_list)[source]¶: Constructs a batch object from a Python list holding torch_geometric.data.Data objects. The assignment vector batch is created on the fly.

num_graphs¶: Returns the number of graphs in the batch.
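The assignment vector mentioned for from_data_list can be illustrated without any graph library. In this sketch, which uses hypothetical helper names and plain lists instead of tensors, each node of the concatenated batch is tagged with the index of the graph it came from, and the number of graphs is recovered from that vector.

```python
def build_batch_vector(node_counts):
    """Map every node of the concatenated batch to the index of its graph."""
    batch = []
    for graph_index, num_nodes in enumerate(node_counts):
        batch.extend([graph_index] * num_nodes)
    return batch


def count_graphs(batch):
    # Graph indices start at 0, so the count is the last index plus one.
    return batch[-1] + 1 if batch else 0


# Three graphs with 3, 2 and 4 nodes respectively.
batch = build_batch_vector([3, 2, 4])
total_graphs = count_graphs(batch)
```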

class cogdl.datasets.strategies_data.BatchFinetune(batch=None, **kwargs)[source]¶

Bases: cogdl.data.data.Data

static from_data_list(data_list)[source]¶: Constructs a batch object from a Python list holding torch_geometric.data.Data objects. The assignment vector batch is created on the fly.

num_graphs¶: Returns the number of graphs in the batch.

class cogdl.datasets.strategies_data.BatchMasking(batch=None, **kwargs)[source]¶

Bases: cogdl.data.data.Data

cumsum(key, item)[source]¶

If True, the attribute key with content item should be added up cumulatively before being concatenated together.

Note

This method is for internal use only, and should only be overridden if the batch concatenation process is corrupted for a specific data attribute.
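To see why some attributes must be added up cumulatively, consider edge indices: when several graphs are merged into one batch, the node indices of each later graph have to be shifted by the number of nodes that precede it. The sketch below shows this with plain tuples; `concat_edge_indices` is a hypothetical helper, not a cogdl function.

```python
def concat_edge_indices(edge_lists, node_counts):
    """Concatenate per-graph edge lists, shifting indices by a running offset."""
    merged = []
    offset = 0
    for edges, num_nodes in zip(edge_lists, node_counts):
        merged.extend((src + offset, dst + offset) for src, dst in edges)
        # The next graph's nodes start after all nodes seen so far.
        offset += num_nodes
    return merged


graph_a = [(0, 1), (1, 2)]  # edges of a graph with 3 nodes
graph_b = [(0, 1)]          # edges of a graph with 2 nodes
merged = concat_edge_indices([graph_a, graph_b], [3, 2])
```

Attributes that are not indices, such as node features, would be concatenated without any offset.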

static from_data_list(data_list)[source]¶: Constructs a batch object from a Python list holding torch_geometric.data.Data objects. The assignment vector batch is created on the fly.

num_graphs¶: Returns the number of graphs in the batch.

class cogdl.datasets.strategies_data.BatchSubstructContext(batch=None, **kwargs)[source]¶

Bases: cogdl.data.data.Data

cat_dim(key)[source]¶: Returns the dimension in which the attribute key with content value gets concatenated when creating batches.

Note

This method is for internal use only, and should only be overridden if the batch concatenation process is corrupted for a specific data attribute.

cumsum(key, item)[source]¶

If True, the attribute key with content item should be added up cumulatively before being concatenated together.

Note

This method is for internal use only, and should only be overridden if the batch concatenation process is corrupted for a specific data attribute.

static from_data_list(data_list)[source]¶: Constructs a batch object from a Python list holding torch_geometric.data.Data objects. The assignment vector batch is created on the fly.

num_graphs¶: Returns the number of graphs in the batch.

class cogdl.datasets.strategies_data.BioDataset(data_type='unsupervised', empty=False, transform=None, pre_transform=None, pre_filter=None)[source]¶

Bases: cogdl.data.dataset.MultiGraphDataset

download()[source]¶: Downloads the dataset to the self.raw_dir folder.

process()[source]¶: Processes the dataset to the self.processed_dir folder.

processed_file_names¶: The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names¶: The name of the files to find in the self.raw_dir folder in order to skip the download.

class cogdl.datasets.strategies_data.ChemExtractSubstructureContextPair(k, l1, l2)[source]¶: Bases: object

class cogdl.datasets.strategies_data.DataLoaderAE(dataset, batch_size=1, shuffle=True, **kwargs)[source]¶: Bases: torch.utils.data.dataloader.DataLoader

class cogdl.datasets.strategies_data.DataLoaderFinetune(dataset, batch_size=1, shuffle=True, **kwargs)[source]¶: Bases: torch.utils.data.dataloader.DataLoader

class cogdl.datasets.strategies_data.DataLoaderMasking(dataset, batch_size=1, shuffle=True, **kwargs)[source]¶: Bases: torch.utils.data.dataloader.DataLoader

class cogdl.datasets.strategies_data.DataLoaderSubstructContext(dataset, batch_size=1, shuffle=True, **kwargs)[source]¶: Bases: torch.utils.data.dataloader.DataLoader

class cogdl.datasets.strategies_data.ExtractSubstructureContextPair(l1, center=True)[source]¶: Bases: object

class cogdl.datasets.strategies_data.MaskAtom(num_atom_type, num_edge_type, mask_rate, mask_edge=True)[source]¶

Bases: object

Borrowed from https://github.com/snap-stanford/pretrain-gnns/

class cogdl.datasets.strategies_data.MaskEdge(mask_rate)[source]¶

Bases: object

Borrowed from https://github.com/snap-stanford/pretrain-gnns/

class cogdl.datasets.strategies_data.MoleculeDataset(data_type='unsupervised', transform=None, pre_transform=None, pre_filter=None, empty=False)[source]¶

Bases: cogdl.data.dataset.MultiGraphDataset

download()[source]¶: Downloads the dataset to the self.raw_dir folder.

process()[source]¶: Processes the dataset to the self.processed_dir folder.

processed_file_names¶: The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names¶: The name of the files to find in the self.raw_dir folder in order to skip the download.

class cogdl.datasets.strategies_data.NegativeEdge[source]¶

Bases: object

Borrowed from https://github.com/snap-stanford/pretrain-gnns/

class cogdl.datasets.strategies_data.TestBioDataset(data_type='unsupervised', root='testbio', transform=None, pre_transform=None, pre_filter=None)[source]¶: Bases: cogdl.data.dataset.MultiGraphDataset

class cogdl.datasets.strategies_data.TestChemDataset(data_type='unsupervised', root='testchem', transform=None, pre_transform=None, pre_filter=None)[source]¶: Bases: cogdl.data.dataset.MultiGraphDataset

cogdl.datasets.strategies_data.graph_data_obj_to_nx(data)[source]¶

cogdl.datasets.strategies_data.graph_data_obj_to_nx_simple(data)[source]¶: Converts a graph Data object, as required by the PyTorch Geometric package, to a NetworkX graph object. Takes a PyTorch Geometric Data object (data) and returns a NetworkX object. NB: Uses simplified atom and bond features, represented as indices. NB: There are possible issues with recapitulating relative stereochemistry, since the edges in the NetworkX object are unordered.

cogdl.datasets.strategies_data.nx_to_graph_data_obj(g, center_id, allowable_features_downstream=None, allowable_features_pretrain=None, node_id_to_go_labels=None)[source]¶

cogdl.datasets.strategies_data.nx_to_graph_data_obj_simple(G)[source]¶: Converts a NetworkX graph to a PyTorch Geometric Data object. Takes a NetworkX graph object (G) and returns a PyTorch Geometric Data object. Assumes node indices are numbered from 0 to num_nodes - 1. NB: Uses simplified atom and bond features, represented as indices. NB: There are possible issues with recapitulating relative stereochemistry, since the edges in the NetworkX object are unordered.

cogdl.datasets.strategies_data.reset_idxes(G)[source]¶: Resets node indices such that they are numbered from 0 to num_nodes - 1. Takes a graph G and returns a copy of G with relabelled node indices, together with the mapping.
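The relabelling can be sketched with plain lists instead of a NetworkX graph. The helper `reset_indices_sketch` and its return layout are assumptions made for illustration; the direction of the returned mapping in the actual function (old to new, or new to old) is not stated on this page.

```python
def reset_indices_sketch(nodes, edges):
    """Relabel nodes to 0..num_nodes-1 and return the old-to-new mapping."""
    mapping = {old: new for new, old in enumerate(nodes)}
    new_nodes = list(range(len(nodes)))
    # Inputs are left untouched; the relabelled edges are a fresh list.
    new_edges = [(mapping[u], mapping[v]) for u, v in edges]
    return new_nodes, new_edges, mapping


nodes = [10, 42, 7]
edges = [(10, 42), (42, 7)]
new_nodes, new_edges, mapping = reset_indices_sketch(nodes, edges)
```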

TU dataset¶

class cogdl.datasets.tu_data.CollabDataset[source]¶: Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.ENZYMES[source]¶: Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.ImdbBinaryDataset[source]¶: Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.ImdbMultiDataset[source]¶: Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.MUTAGDataset[source]¶: Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.NCT109Dataset[source]¶: Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.NCT1Dataset[source]¶: Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.PTCMRDataset[source]¶: Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.ProtainsDataset[source]¶: Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.RedditBinary[source]¶: Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.RedditMulti12K[source]¶: Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.RedditMulti5K[source]¶: Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.TUDataset(root, name)[source]¶

Bases: cogdl.data.dataset.Dataset

download()[source]¶: Downloads the dataset to the self.raw_dir folder.

get(idx)[source]¶: Gets the data object at index idx.

num_classes¶: The number of classes in the dataset.

num_edge_attributes¶

num_edge_labels¶

num_node_attributes¶

num_node_labels¶

process()[source]¶: Processes the dataset to the self.processed_dir folder.

processed_file_names¶: The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names¶: The name of the files to find in the self.raw_dir folder in order to skip the download.

url = 'https://www.chrsmrrs.com/graphkerneldatasets'¶

cogdl.datasets.tu_data.cat(seq)[source]¶

cogdl.datasets.tu_data.coalesce(index, value, m, n)[source]¶

cogdl.datasets.tu_data.normalize_feature(data)[source]¶

cogdl.datasets.tu_data.parse_txt_array(src, sep=None, start=0, end=None, dtype=None, device=None)[source]¶

cogdl.datasets.tu_data.read_file(folder, prefix, name, dtype=None)[source]¶

cogdl.datasets.tu_data.read_tu_data(folder, prefix)[source]¶

cogdl.datasets.tu_data.read_txt_array(path, sep=None, start=0, end=None, dtype=None, device=None)[source]¶

cogdl.datasets.tu_data.segment(src, indptr)[source]¶

cogdl.datasets.tu_data.split(data, batch)[source]¶

Module contents¶

cogdl.datasets.build_dataset(args)[source]¶

cogdl.datasets.build_dataset_from_name(dataset)[source]¶

cogdl.datasets.build_dataset_from_path(data_path, task)[source]¶

cogdl.datasets.register_dataset(name)[source]¶

New dataset types can be added to cogdl with the register_dataset() function decorator.

For example:

@register_dataset('my_dataset')
class MyDataset():
    (...)

Args:: name (str): the name of the dataset
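A registering decorator of this kind is commonly built around a name-to-class dictionary. The sketch below is a generic illustration of that pattern, not cogdl's implementation: `DATASET_REGISTRY`, `register_dataset_sketch`, `build_from_name_sketch`, and the duplicate-name check are all hypothetical.

```python
DATASET_REGISTRY = {}  # hypothetical registry mapping names to classes


def register_dataset_sketch(name):
    """Return a class decorator that stores the class under `name`."""

    def decorator(cls):
        if name in DATASET_REGISTRY:
            raise ValueError(f"Cannot register duplicate dataset ({name})")
        DATASET_REGISTRY[name] = cls
        return cls

    return decorator


@register_dataset_sketch("my_dataset")
class MyDataset:
    pass


def build_from_name_sketch(name):
    # Look the class up by its registered name and instantiate it.
    return DATASET_REGISTRY[name]()


dataset = build_from_name_sketch("my_dataset")
```

Returning the class unchanged from the decorator keeps it usable under its own name as well as through the registry.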

cogdl.datasets.try_import_dataset(dataset)[source]¶