Utility API Reference
Utility functions are defined in chronocratic.datasets.utils and
re-exported from the package root. They handle caching, scaling, data processing,
and feature extraction.
Utility functions for data processing.
- chronocratic.datasets.utils.atomic_save_metadata(path: Path, data: dict[str, Any]) → None
Save a metadata dictionary to a JSON file atomically.
Writes to a .json.tmp intermediate file, then uses Path.replace() for POSIX atomicity.
- Parameters:
path – Target .json file path.
data – Metadata dictionary to persist.
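The write-then-rename pattern described above can be sketched as follows. This is a minimal illustration using only the standard library, not the library's implementation: the helper name atomic_write_json and the exact temporary file name are assumptions.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def atomic_write_json(path: Path, data: dict[str, Any]) -> None:
    """Write JSON to a sibling temp file, then rename it over the target."""
    # Assumed temp name: "meta.json" -> "meta.json.tmp". It lives in the same
    # directory, so the rename never crosses a filesystem boundary.
    tmp_path = path.with_name(path.name + ".tmp")
    tmp_path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    # Path.replace() overwrites the target; on POSIX the rename is atomic.
    tmp_path.replace(path)


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "meta.json"
    atomic_write_json(target, {"version": 1, "seq_len": 128})
    loaded = json.loads(target.read_text(encoding="utf-8"))
    leftover_tmp = (Path(tmp_dir) / "meta.json.tmp").exists()
```

Readers of the target path see either the old file or the complete new one, never a partially written file, and no temporary file remains after a successful write.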
- chronocratic.datasets.utils.atomic_save_npz(path: Path, **arrays: ndarray) → None
Save numpy arrays to a compressed .npz file atomically.
Writes to a temporary .npz file in the same directory, then uses Path.replace() for POSIX atomicity. The temporary file is created in the same directory as the target to guarantee a same-filesystem rename.
- Parameters:
path – Target .npz file path.
**arrays – Named arrays to persist.
- chronocratic.datasets.utils.build_cache_key(*, dataset_name: str, params: dict[str, Any]) → str
Build a hybrid cache key: SHA-256 hash prefix plus readable suffix.
The key format is <8-char-sha256>_<dataset>_<key-params>.cache. Example: a3f8e1c2_ETTm1_seq_len=128_mode=UNIVARIATE.cache.
- Parameters:
dataset_name – Dataset identifier (e.g. "ETTm1").
params – Parameters that affect data layout (seq_len, mode, scaling_method, etc.). Dict ordering does not affect the resulting key.
- Returns:
A deterministic cache key string.
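One way to obtain an order-independent key of this shape is sketched below. The function sketch_cache_key is a hypothetical stand-in: how the real function serializes params, which parameters it places in the readable suffix, and in what order are not documented here, so those details are assumptions.

```python
import hashlib
import json
from typing import Any


def sketch_cache_key(*, dataset_name: str, params: dict[str, Any]) -> str:
    """Hypothetical stand-in for build_cache_key; details are assumptions."""
    # Sorting keys makes the serialized form, and therefore the hash,
    # independent of dict insertion order.
    canonical = json.dumps(params, sort_keys=True, default=str)
    digest = hashlib.sha256(f"{dataset_name}:{canonical}".encode("utf-8")).hexdigest()
    # Assumption: every parameter appears in the suffix, sorted by name.
    suffix = "_".join(f"{key}={params[key]}" for key in sorted(params))
    return f"{digest[:8]}_{dataset_name}_{suffix}.cache"


key_a = sketch_cache_key(dataset_name="ETTm1", params={"seq_len": 128, "mode": "UNIVARIATE"})
key_b = sketch_cache_key(dataset_name="ETTm1", params={"mode": "UNIVARIATE", "seq_len": 128})
```

Both calls produce the same string, because the parameters are put into a canonical order before hashing and before building the suffix.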
- chronocratic.datasets.utils.compose(*functions: Callable) → Callable
Compose multiple functions into a single callable.
Functions are applied in the order they are provided.
- Parameters:
functions – Callables to compose.
- Returns:
A callable that applies all functions in order.
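A minimal sketch of in-order composition, assuming each function takes a single argument and returns the value passed to the next one. The name compose_in_order is illustrative and is not the library's function.

```python
from functools import reduce
from typing import Any, Callable


def compose_in_order(*functions: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Return a callable that feeds its input through each function in turn."""

    def composed(value: Any) -> Any:
        # reduce threads the running result through the functions left to right.
        return reduce(lambda acc, fn: fn(acc), functions, value)

    return composed


add_one_then_double = compose_in_order(lambda x: x + 1, lambda x: x * 2)
result = add_one_then_double(3)  # (3 + 1) * 2
```

Note that this is the reverse of mathematical composition notation: the first function provided runs first.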
- chronocratic.datasets.utils.create_data_scaler(*, scale: bool, scaling_range: tuple[float, float], scaling_method: ScalingMethod = ScalingMethod.MINMAX, data_form: DataForm = DataForm.REGULAR) → Callable
Create a data scaling function.
Returns a callable that, when invoked with (train, valid, test) data, fits a scaler on train and transforms all splits.
- Parameters:
scale – Whether to apply scaling at all.
scaling_range – Target (min, max) for MinMaxScaler.
scaling_method – Scaling algorithm to use.
data_form – Shape category of the data.
- Returns:
A callable that accepts (train_data, valid_data, test_data) and returns scaled versions of the same.
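The factory pattern above (fit on train, transform every split) can be illustrated with a simplified, pure-Python min-max scaler over flat lists of floats. Everything here is an assumption made for illustration: the name make_minmax_scaler, the 1-D data layout, and the pass-through behavior when scale is False. The real function works with the ScalingMethod and DataForm options.

```python
from typing import Callable, Sequence

Split = Sequence[float]


def make_minmax_scaler(*, scale: bool, scaling_range: tuple[float, float]) -> Callable:
    """Hypothetical 1-D analogue of a scaler factory."""
    low, high = scaling_range

    def apply(train: Split, valid: Split, test: Split):
        if not scale:
            # Assumed behavior: with scaling disabled, return the splits unchanged.
            return list(train), list(valid), list(test)
        # Statistics come from the training split only, to avoid leakage.
        t_min, t_max = min(train), max(train)
        span = (t_max - t_min) or 1.0  # guard against a constant train split

        def transform(values: Split) -> list[float]:
            return [low + (v - t_min) / span * (high - low) for v in values]

        return transform(train), transform(valid), transform(test)

    return apply


scaler = make_minmax_scaler(scale=True, scaling_range=(0.0, 1.0))
train_s, valid_s, test_s = scaler([0.0, 5.0, 10.0], [2.5], [20.0])
```

Because the minimum and maximum are taken from the training split alone, validation and test values can fall outside the target range, as the test value does here.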
- chronocratic.datasets.utils.custom_collate_fn(batch: list[Any], *, desired_batch_size: int) → Any
Collate function that pads the last batch by cycling samples.
If the current batch is smaller than desired_batch_size, extra samples are appended by cycling backwards through the batch.
- Parameters:
batch – A list of samples returned by the dataset.
desired_batch_size – Target batch size.
- Returns:
Standard collated tensor batch.
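The padding step alone might look like the sketch below. It covers only the sample-reuse logic, not the tensor collation that follows, and the exact reuse order (last sample first, wrapping around) is one assumed reading of "cycling backwards". The helper name pad_batch_by_cycling is hypothetical.

```python
from typing import Any


def pad_batch_by_cycling(batch: list[Any], *, desired_batch_size: int) -> list[Any]:
    """Pad a short batch by reusing its own samples, last sample first."""
    if not batch:
        raise ValueError("cannot pad an empty batch")
    padded = list(batch)
    index = len(batch) - 1
    while len(padded) < desired_batch_size:
        padded.append(batch[index])
        # Step backwards and wrap to the end once the first sample is reused.
        index = (index - 1) % len(batch)
    return padded


padded = pad_batch_by_cycling(["a", "b", "c"], desired_batch_size=8)
```

A batch that already has the desired size is returned unchanged, so only the final, short batch of an epoch is affected.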
- chronocratic.datasets.utils.extract_time_features(datetime_index: DatetimeIndex) → ndarray
Extract cyclical time features from a DatetimeIndex.
Produces a 2-D array with columns: minute, hour, dayofweek, day, dayofyear, month, week.
- Parameters:
datetime_index – A pandas DatetimeIndex.
- Returns:
2-D numpy array of shape (len(index), 7) with dtype float32.
- chronocratic.datasets.utils.flatten_list_of_np_arrays(list_of_np_arrays: list[ndarray]) → ndarray
Flatten a list of numpy arrays into a single 1-D array.
- Parameters:
list_of_np_arrays – A list of numpy arrays.
- Returns:
A single flattened numpy array.
- chronocratic.datasets.utils.get_num_samples_from_ts(ts: ndarray | list[ndarray]) → int
Get number of samples from a time series.
- Parameters:
ts – A time series array or list of arrays.
- Returns:
Number of samples (length) of the time series.
- chronocratic.datasets.utils.load_metadata(path: Path) → dict[str, Any]
Load and validate metadata from a JSON file.
Checks that the version field matches CACHE_SCHEMA_VERSION. Raises FileNotFoundError if the file does not exist and ValueError on version mismatch.
- Parameters:
path – Metadata .json file path.
- Returns:
The parsed metadata dictionary.
- Raises:
FileNotFoundError – If the metadata file does not exist.
ValueError – If the schema version does not match CACHE_SCHEMA_VERSION.
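The validation flow described above can be sketched with the standard library. The value assigned to CACHE_SCHEMA_VERSION below is a placeholder, since the real constant's value is not given here, and load_metadata_sketch is a hypothetical name.

```python
import json
import tempfile
from pathlib import Path
from typing import Any

CACHE_SCHEMA_VERSION = 1  # placeholder value; the real constant is not shown here


def load_metadata_sketch(path: Path) -> dict[str, Any]:
    """Load JSON metadata and reject files with a different schema version."""
    if not path.exists():
        raise FileNotFoundError(f"metadata file not found: {path}")
    metadata = json.loads(path.read_text(encoding="utf-8"))
    found = metadata.get("version")
    if found != CACHE_SCHEMA_VERSION:
        raise ValueError(f"schema version {found!r} != {CACHE_SCHEMA_VERSION!r}")
    return metadata


with tempfile.TemporaryDirectory() as tmp_dir:
    good = Path(tmp_dir) / "good.json"
    good.write_text(json.dumps({"version": 1, "seq_len": 128}), encoding="utf-8")
    stale = Path(tmp_dir) / "stale.json"
    stale.write_text(json.dumps({"version": 0}), encoding="utf-8")

    loaded = load_metadata_sketch(good)
    try:
        load_metadata_sketch(stale)
        stale_rejected = False
    except ValueError:
        stale_rejected = True
```

A caller can treat the ValueError as a signal to discard the stale cache entry and rebuild it.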
- chronocratic.datasets.utils.load_scaler(path: Path) → Any
Load a persisted sklearn scaler via torch.load.
- Parameters:
path – .pt file path containing a pickled scaler.
- Returns:
The loaded scaler instance.
- chronocratic.datasets.utils.process_data_with_varying_sequence_lengths_single(data: ndarray | DataFrame) → ndarray | DataFrame
Process data with varying sequence lengths by centering valid data.
Handles both 2-D (samples, timesteps) and 3-D (samples, timesteps, features) arrays. If the original data is a DataFrame, the result is returned as a DataFrame.
- Parameters:
data – Input array or DataFrame.
- Returns:
Processed data of the same shape: a numpy array, or a DataFrame if the input was a DataFrame.
- chronocratic.datasets.utils.process_df_according_to_dtypes(df_data: DataFrame, meta: Any, dtypes_functions_map: dict[str, Callable]) → DataFrame
Process DataFrame columns according to ARFF dtype mapping.
Iterates over each column defined in the ARFF metadata, determines its type, and applies the corresponding transformation function from the provided mapping.
- Parameters:
df_data – DataFrame to process.
meta – ARFF metadata containing column type information.
dtypes_functions_map – Mapping from dtype name to transformation function.
- Returns:
Processed DataFrame with correctly typed columns.
- chronocratic.datasets.utils.read_arff_as_df(arff_file_path: Path | str) → tuple[DataFrame, Any]
Read an ARFF file into a pandas DataFrame.
Note
scipy.io.arff returns nominal (string) column values as bytes objects (e.g. b'a' not 'a'). The caller must provide a decode function in dtypes_functions_map when using process_df_according_to_dtypes().
- Parameters:
arff_file_path – Path to the ARFF file.
- Returns:
A tuple of (DataFrame, ARFF metadata object).
- chronocratic.datasets.utils.resolve_cache_dir(*, cache_dir: Path | None, dataset_name: str) → Path
Resolve the absolute cache directory path.
When cache_dir is None, the default location ~/.cache/tsdatasets/{dataset_name} is used. A custom path is expanded (~) and resolved to an absolute path.
- Parameters:
cache_dir – User-provided cache directory, or None for the default location.
dataset_name – Dataset identifier appended to the default cache root.
- Returns:
An absolute Path to the cache directory.
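The two branches described above map directly onto pathlib calls. The sketch below assumes, as the parameter description suggests, that dataset_name is appended only in the default case. The name resolve_cache_dir_sketch is hypothetical.

```python
from pathlib import Path
from typing import Optional


def resolve_cache_dir_sketch(*, cache_dir: Optional[Path], dataset_name: str) -> Path:
    """Return the default per-dataset cache path, or an absolute custom path."""
    if cache_dir is None:
        # Default root from the description above: ~/.cache/tsdatasets/{dataset_name}
        return Path.home() / ".cache" / "tsdatasets" / dataset_name
    # expanduser() handles a leading "~"; resolve() makes the path absolute.
    return Path(cache_dir).expanduser().resolve()


default_dir = resolve_cache_dir_sketch(cache_dir=None, dataset_name="ETTm1")
custom_dir = resolve_cache_dir_sketch(cache_dir=Path("~/my_cache"), dataset_name="ETTm1")
```

Neither branch creates the directory; the sketch only computes the path.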
- chronocratic.datasets.utils.save_scaler(scaler: Any, path: Path) → None
Persist a fitted sklearn scaler via torch.save.
Uses pickle_protocol=5 and writes atomically through a .pt.tmp intermediate file. If the target already exists (DDP race condition) or the cache directory is not writable (test environments with nonexistent paths), the save is skipped silently since the in-memory scaler is still valid.
- Parameters:
scaler – Fitted scaler instance (e.g. MinMaxScaler).
path – Target .pt file path.
- chronocratic.datasets.utils.separate_target_feature_from_df(df: DataFrame, target_feature_name: str) → tuple[DataFrame, Series]
Separate target feature column from a DataFrame.
Extracts the specified target column as a Series and returns the remaining columns as a DataFrame.
- Parameters:
df – Source DataFrame containing the target column.
target_feature_name – Name of the target column to extract.
- Returns:
A tuple of (features DataFrame, target Series).
- Raises:
KeyError – If target_feature_name is not in df.columns.