Datasets

Managers for loading datasets

construe.datasets.loaders.load_all_datasets(sample=True, data_home=None)

Load all available datasets as defined by __all__

construe.datasets.loaders.cleanup_all_datasets(data_home=None)

Delete everything in the data home directory

construe.datasets.loaders.load_dialects(sample=True, data_home=None, no_dirs=True, pattern=None)
construe.datasets.loaders.cleanup_dialects(sample=True, data_home=None)
construe.datasets.loaders.load_lowlight(sample=True, data_home=None, no_dirs=True, *, pattern='lowlight/**/*.png')
construe.datasets.loaders.cleanup_lowlight(sample=True, data_home=None)
construe.datasets.loaders.load_reddit(sample=True, data_home=None)
construe.datasets.loaders.cleanup_reddit(sample=True, data_home=None)
construe.datasets.loaders.load_movies(sample=True, data_home=None, no_dirs=True, pattern=None)
construe.datasets.loaders.cleanup_movies(sample=True, data_home=None)
construe.datasets.loaders.load_essays(sample=True, data_home=None)
construe.datasets.loaders.cleanup_essays(sample=True, data_home=None)
construe.datasets.loaders.load_aegis(sample=True, data_home=None)
construe.datasets.loaders.cleanup_aegis(sample=True, data_home=None)
construe.datasets.loaders.load_nsfw(sample=True, data_home=None, no_dirs=True, *, pattern='nsfw/**/*.jpg')
construe.datasets.loaders.cleanup_nsfw(sample=True, data_home=None)

Manifest

Manifest handlers for datasets

construe.datasets.manifest.load_manifest(path='/home/runner/work/llm-benchmark/llm-benchmark/llm-benchmark/construe/datasets/manifest.json')
construe.datasets.manifest.generate_manifest(fixtures='/home/runner/work/llm-benchmark/llm-benchmark/llm-benchmark/construe/datasets/fixtures', out='/home/runner/work/llm-benchmark/llm-benchmark/llm-benchmark/construe/datasets/manifest.json')
construe.datasets.manifest.dataset_extra(path, name, **kwargs)

Count the number of instances in each class in the dataset

Path Helpers

Path handling for downloads

construe.datasets.path.get_data_home(path=None)

Return the path of the Construe data directory. This folder is used by dataset loaders to avoid downloading data several times.

By default, this folder is colocated with the code in the install directory so that data shipped with the package can be easily located. Alternatively, it can be set with the $CONSTRUE_DATA environment variable, or programmatically by passing a folder path. The '~' symbol is expanded to the user home directory, and environment variables are also expanded when resolving the path.
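The resolution described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Construe's implementation: `resolve_data_home`, its `default` parameter, and the placeholder value `"fixtures"` are hypothetical, and the precedence shown (explicit path, then $CONSTRUE_DATA, then a default) is inferred from the description.

```python
import os


def resolve_data_home(path=None, default="fixtures"):
    """Hypothetical sketch: choose a data directory and expand it.

    `default` is a placeholder; the real default location is decided by
    the library, not by this sketch.
    """
    if path is None:
        # Assumption: the environment variable takes precedence over the default.
        path = os.environ.get("CONSTRUE_DATA", default)
    # Expand environment variables first, then '~', then make the path absolute.
    return os.path.abspath(os.path.expanduser(os.path.expandvars(path)))
```

Expanding variables before '~' means a value such as `$HOME/data` and a value such as `~/data` both resolve to a directory under the user's home.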

construe.datasets.path.find_dataset_path(dataset, data_home=None, fname=None, ext=None, raises=True)

Looks up the path to the specified dataset in the data home directory, which is found using the get_data_home function. By default, data home is in a config directory in the user’s home folder, but this can be changed with the $CONSTRUE_DATA environment variable or by passing in a different directory.

If the dataset is not found, a DatasetsError is raised by default (raises=True).
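The lookup-or-raise behavior can be illustrated with a small stand-alone sketch. Everything here is an assumption made for illustration: `find_path` is a hypothetical helper, the `<data_home>/<dataset>/<fname><ext>` layout is a guess, `DatasetsError` is redefined locally so the sketch runs without the library, and returning `None` when `raises=False` is assumed rather than documented.

```python
import os
import tempfile


class DatasetsError(Exception):
    """Local stand-in for the library's error type."""


def find_path(dataset, data_home, fname=None, ext=None, raises=True):
    """Hypothetical lookup of <data_home>/<dataset>[/<fname><ext>]."""
    path = os.path.join(data_home, dataset)
    if fname is not None or ext is not None:
        # Assumption: the file name falls back to the dataset name.
        path = os.path.join(path, (fname or dataset) + (ext or ""))
    if not os.path.exists(path):
        if raises:
            raise DatasetsError(f"could not find dataset at {path}")
        return None
    return path


# Demonstrate both outcomes against a throwaway data home.
with tempfile.TemporaryDirectory() as home:
    os.makedirs(os.path.join(home, "essays"))
    found = find_path("essays", home)
    missing = find_path("reddit", home, raises=False)
```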

construe.datasets.path.dataset_exists(dataset, data_home=None)

Checks to see if a directory with the name of the specified dataset exists in the data home directory, found with get_data_home.

construe.datasets.path.dataset_archive(dataset, signature, data_home=None, ext='.zip')

Checks to see if the dataset archive file exists in the data home directory, found with get_data_home. By specifying the signature, this function also checks to see if the archive is the latest version by comparing the sha256sum of the local archive with the specified signature.
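The signature comparison can be sketched with the standard library's `hashlib`. This is an illustrative sketch, not the library's code: `sha256sum` and `archive_is_current` are hypothetical helper names, and the chunked read is a design choice made here so that large archives do not have to fit in memory.

```python
import hashlib
import os
import tempfile


def sha256sum(path, blocksize=65536):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            digest.update(block)
    return digest.hexdigest()


def archive_is_current(path, signature):
    """Hypothetical check: the archive exists and its hash matches."""
    return os.path.isfile(path) and sha256sum(path) == signature


# Demonstrate a matching, a mismatching, and a missing archive.
with tempfile.TemporaryDirectory() as home:
    archive = os.path.join(home, "essays.zip")
    with open(archive, "wb") as f:
        f.write(b"example bytes")
    expected = hashlib.sha256(b"example bytes").hexdigest()
    fresh = archive_is_current(archive, expected)
    stale = archive_is_current(archive, "0" * 64)
    absent = archive_is_current(os.path.join(home, "missing.zip"), expected)
```

A mismatch only indicates that the local file differs from the expected version; the sketch does not distinguish an outdated archive from a corrupted one.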

construe.datasets.path.cleanup_dataset(dataset, data_home=None, ext='.zip')

Removes the dataset directory and archive file from the data home directory.
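A cleanup of this kind can be sketched with `shutil.rmtree` and `os.remove`. The helper name `remove_dataset`, the assumed layout (a `<dataset>` directory next to a `<dataset><ext>` archive), and the returned count are all hypothetical; the real function's return value is not documented here.

```python
import os
import shutil
import tempfile


def remove_dataset(dataset, data_home, ext=".zip"):
    """Hypothetical cleanup: delete the dataset directory and its archive.

    Returns how many of the two were actually removed.
    """
    removed = 0
    directory = os.path.join(data_home, dataset)
    archive = os.path.join(data_home, dataset + ext)
    if os.path.isdir(directory):
        shutil.rmtree(directory)
        removed += 1
    if os.path.isfile(archive):
        os.remove(archive)
        removed += 1
    return removed


# Demonstrate on a throwaway data home; a second call finds nothing to remove.
with tempfile.TemporaryDirectory() as home:
    os.makedirs(os.path.join(home, "movies"))
    with open(os.path.join(home, "movies.zip"), "wb") as f:
        f.write(b"")
    first = remove_dataset("movies", home)
    second = remove_dataset("movies", home)
    leftovers = os.listdir(home)
```

Checking for existence before deleting makes the sketch safe to call repeatedly.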