Datasets API Reference¶
This page provides detailed API documentation for all dataset classes in Samay.
LPTMDataset¶
Dataset class for LPTM model.
LPTMDataset(name=None, datetime_col=None, path=None, batchsize=16, mode='train', boundaries=[0, 0, 0], horizon=0, task_name='forecasting', label_col=None, stride=10, seq_len=512, **kwargs)

Bases: BaseDataset

Dataset class for the LPTM model.

Data format: dict with keys input_ts (np.ndarray, historical time series data) and actual_ts (np.ndarray, actual time series data).

Source code in Samay/src/samay/dataset.py
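As a rough illustration of the record layout above (the key names come from the docstring; the shapes are assumptions, not the library's exact output):

import numpy as np

# Illustrative only: keys from the docstring, shapes assumed
sample = {
    "input_ts": np.zeros((1, 512)),   # historical window (e.g., seq_len=512)
    "actual_ts": np.zeros((1, 192)),  # future values to compare against (e.g., horizon=192)
}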
TimesfmDataset¶
Dataset class for TimesFM model.
TimesfmDataset(name=None, datetime_col='ds', path=None, batchsize=4, mode='train', boundaries=(0, 0, 0), context_len=128, horizon_len=32, freq='h', normalize=False, stride=10, **kwargs)

Bases: BaseDataset

Dataset class for the TimesFM model.

Data format: dict with keys input_ts (np.ndarray, historical time series data) and actual_ts (np.ndarray, actual time series data).

Source code in Samay/src/samay/dataset.py
MomentDataset¶
Dataset class for MOMENT model supporting multiple tasks.
MomentDataset(name=None, datetime_col=None, path=None, batchsize=64, mode='train', boundaries=[0, 0, 0], horizon_len=0, task_name='forecasting', label_col=None, stride=10, **kwargs)

Bases: BaseDataset

Dataset class for the MOMENT model.

Data format: dict with keys input_ts (np.ndarray, historical time series data) and actual_ts (np.ndarray, actual time series data).

Source code in Samay/src/samay/dataset.py
ChronosDataset¶
Dataset class for Chronos model with tokenization.
ChronosDataset(name=None, datetime_col='ds', path=None, boundaries=[0, 0, 0], batch_size=16, mode=None, stride=10, tokenizer_class='MeanScaleUniformBins', drop_prob=0.2, min_past=64, np_dtype=np.float32, config=None)

Bases: BaseDataset

Dataset class for the Chronos model.

Data format: dict with keys input_ts (np.ndarray, historical time series data) and actual_ts (np.ndarray, actual time series data).

Source code in Samay/src/samay/dataset.py
MoiraiDataset¶
Dataset class for MOIRAI model with frequency support.
MoiraiDataset(name=None, datetime_col='date', path=None, boundaries=(0, 0, 0), context_len=128, horizon_len=32, patch_size=16, batch_size=16, freq=None, start_date=None, end_date=None, operation='mean', normalize=True, mode='train', htune=False, data_config=None, **kwargs)

Bases: BaseDataset

Dataset class for the Moirai model. It ingests data in the form of a (num_variates x num_timesteps) matrix.

Source code in Samay/src/samay/dataset.py
add_past_fields(data: dict, ts_fields: list = [], past_ts_fields: list = [], dummy_val: float = 0.0, lead_time: int = 0, target_field: str = 'target', is_pad_field: str = 'is_pad', observed_value_field: str = 'observed_target', start_field: str = 'start', forecast_start_field: str = 'forecast_start', output_NTC: bool = True, mode='train')

Adds the following fields:

- (a) past_target: the past target data
- (b) past_observed_target: the past target data with a missing-values indicator
- (c) past_is_pad: indicates whether the added value was padding
- (d) past_feat_dynamic_real: the past dynamic real features
- (e) past_observed_feat_dynamic_real: the past dynamic real features with a missing-values indicator

Source code in Samay/src/samay/dataset.py
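As a rough sketch of what a record looks like after this step (field names follow the signature defaults above; the shapes and feature counts are assumptions):

import numpy as np

# Illustrative only: field names from the defaults, shapes assumed
entry = {
    "past_target": np.zeros((128, 1)),                     # (a) past target window
    "past_observed_target": np.ones((128, 1)),             # (b) 1.0 where the value was observed
    "past_is_pad": np.zeros(128),                          # (c) 1.0 where padding was inserted
    "past_feat_dynamic_real": np.zeros((128, 2)),          # (d) past dynamic real features
    "past_observed_feat_dynamic_real": np.ones((128, 2)),  # (e) observed mask for (d)
}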
default_transforms() -> transforms.Compose

Default transformations for the dataset.

Source code in Samay/src/samay/dataset.py
gen_test_data()

Generates test data based on the boundaries.

Returns:

| Type | Description |
|---|---|
| np.ndarray | Test data |

Source code in Samay/src/samay/dataset.py
gen_train_val_data()

Generates training and validation data based on the boundaries.

Returns:

| Type | Description |
|---|---|
| np.ndarray | Training and validation data |

Source code in Samay/src/samay/dataset.py
get_dataloader()

Returns an iterator over data batches for the dataset, depending on the mode.

Returns:

| Type | Description |
|---|---|
| torch.utils.data.DataLoader | Depends on the mode |

Source code in Samay/src/samay/dataset.py
prep_train_test_data(mode='train')

Applies transforms on the data and adds the past fields (past target, past observed target, etc.).

Source code in Samay/src/samay/dataset.py
TinyTimeMixerDataset¶
Dataset class for TinyTimeMixer model.
TinyTimeMixerDataset(name=None, datetime_col='ds', path=None, boundaries=[0, 0, 0], batch_size=128, mode=None, stride=10, context_len=512, horizon_len=64)

Bases: BaseDataset

Dataset class for the TinyTimeMixer model.

Data format: dict with keys input_ts (np.ndarray, historical time series data) and actual_ts (np.ndarray, actual time series data).

Source code in Samay/src/samay/dataset.py
BaseDataset¶
All datasets inherit from the base dataset class:
BaseDataset(name=None, datetime_col=None, path=None, batchsize=8, mode='train', **kwargs)

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Dataset name | None |
| target | np.ndarray | Target data | required |

Source code in Samay/src/samay/dataset.py
Usage Examples¶
Loading a Dataset¶
from samay.dataset import LPTMDataset
train_dataset = LPTMDataset(
    name="ett",
    datetime_col="date",
    path="./data/ETTh1.csv",
    mode="train",
    horizon=192,
    batchsize=16,
)
Custom Data Splits¶
# Specify exact boundaries
dataset = LPTMDataset(
    datetime_col="date",
    path="./data/ETTh1.csv",
    mode="train",
    horizon=192,
    boundaries=[0, 10000, 15000],  # Train: 0-10k, Val: 10k-15k, Test: 15k-end
)
Getting Data Loader¶
# Get PyTorch DataLoader
train_loader = train_dataset.get_dataloader()
for batch in train_loader:
    # Process batch
    pass
Accessing Dataset Properties¶
# Dataset length
print(f"Dataset size: {len(dataset)}")
# Get a single item
sample = dataset[0]
# Number of channels
print(f"Number of channels: {dataset.n_channels}")
# Sequence length
print(f"Sequence length: {dataset.seq_len}")
Denormalizing Predictions¶
# If dataset normalizes data
normalized_preds = model.evaluate(dataset)[2]
# Denormalize for interpretation
denormalized_preds = dataset._denormalize_data(normalized_preds)
Common Parameters¶
Most dataset classes share these common parameters:
| Parameter | Type | Default | Description | 
|---|---|---|---|
| name | str | None | Dataset name (for metadata) | 
| datetime_col | str | Varies | Name of the datetime column in CSV | 
| path | str | Required | Path to CSV file | 
| mode | str | "train" | Mode: "train" or "test" | 
| batchsize / batch_size | int | Varies | Batch size for the DataLoader (spelling varies by class) | 
| boundaries | list | [0, 0, 0] | Custom train/val/test split indices | 
| stride | int | 10 | Stride for sliding window | 
Data Format Requirements¶
CSV Structure¶
All datasets expect CSV files with:

1. A datetime column (configurable name)
2. One or more value columns
Example:
date,HUFL,HULL,MUFL,MULL,LUFL,LULL,OT
2016-07-01 00:00:00,5.827,2.009,1.599,0.462,5.677,2.009,6.082
2016-07-01 01:00:00,5.693,2.076,1.492,0.426,5.485,1.942,5.947
...
Datetime Formats¶
Supported datetime formats:
- ISO 8601-style timestamps: 2016-07-01 00:00:00
- Date only: 2016-07-01
- Custom formats (parsed by pandas; see the check below)
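A quick way to confirm the column parses before constructing a dataset (a minimal sketch; the path and column name are placeholders):

import pandas as pd

# Verify the datetime column parses cleanly
df = pd.read_csv("./data/ETTh1.csv")
df["date"] = pd.to_datetime(df["date"])  # pass format="..." for unusual layouts
print(df["date"].dtype)  # datetime64[ns] on success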
Missing Values¶
- Some datasets handle missing values automatically
- Others require preprocessing (see the sketch below)
- Check individual dataset documentation
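A minimal preprocessing sketch, assuming linear interpolation is acceptable for your series (paths and column names are placeholders):

import pandas as pd

# Fill gaps up front for datasets that do not impute automatically
df = pd.read_csv("./data/ETTh1.csv")
value_cols = df.columns.drop("date")
df[value_cols] = df[value_cols].interpolate(method="linear").ffill().bfill()
df.to_csv("./data/ETTh1_clean.csv", index=False)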
Model-Specific Dataset Features¶
LPTMDataset¶
- Supports forecasting, classification, and detection
- Configurable sequence length (default: 512)
- Adaptive segmentation
dataset = LPTMDataset(
    datetime_col="date",
    path="./data/ETTh1.csv",
    mode="train",
    horizon=192,
    seq_len=512,  # Configurable
    task_name="forecasting",
)
TimesfmDataset¶
- Frequency specification
- Optional normalization
- Patch-based processing
dataset = TimesfmDataset(
    datetime_col="date",
    path="./data/ETTh1.csv",
    mode="train",
    context_len=512,
    horizon_len=192,
    freq="h",  # Frequency
    normalize=True,  # Optional normalization
)
MomentDataset¶
- Multi-task support
- Task-specific preprocessing
- Label handling for classification
# Forecasting
dataset = MomentDataset(
    datetime_col="date",
    path="./data/ETTh1.csv",
    mode="train",
    horizon_len=192,
    task_name="forecasting",
)
# Classification
dataset = MomentDataset(
    datetime_col="date",
    path="./data/classification.csv",
    mode="train",
    task_name="classification",
    label_col="label",
)
ChronosDataset¶
- Tokenization support
- Configurable vocab size
- Drop probability for training
from samay.models.chronosforecasting.chronos import ChronosConfig
config = ChronosConfig(
    context_length=512,
    prediction_length=64,
    # ... other configs
)
dataset = ChronosDataset(
    datetime_col="date",
    path="./data/ETTh1.csv",
    mode="train",
    config=config,
)
MoiraiDataset¶
- Frequency specification (required)
- Date range filtering
- Built-in normalization
dataset = MoiraiDataset(
    datetime_col="date",
    path="./data/ETTh1.csv",
    mode="train",
    freq="h",  # Required
    context_len=128,
    horizon_len=64,
    start_date="2016-01-01",  # Optional
    end_date="2017-12-31",    # Optional
    normalize=True,
)
TinyTimeMixerDataset¶
- Large batch support
- Efficient windowing
- Fast data loading
dataset = TinyTimeMixerDataset(
    datetime_col="date",
    path="./data/ETTh1.csv",
    mode="train",
    context_len=512,
    horizon_len=96,
    batch_size=128,  # Supports large batches
)
Common Methods¶
All datasets implement these methods:
__len__()¶
Returns the number of samples in the dataset.
__getitem__(idx)¶
Returns a single sample at the given index.
get_dataloader()¶
Returns a PyTorch DataLoader for the dataset.
Returns:
- torch.utils.data.DataLoader
_denormalize_data(data)¶
Denormalizes data (if normalization was applied).
Parameters:
- data (np.ndarray): Normalized data
Returns:
- np.ndarray: Denormalized data
Data Split Strategies¶
Default Split¶
When boundaries=[0, 0, 0], the data is split as follows (an equivalent explicit split is sketched below):
- Train: 60% of data
- Validation: 20% of data
- Test: 20% of data
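For reference, a sketch of explicit boundaries that mirror the 60/20/20 default (illustrative; the library's exact rounding may differ):

import pandas as pd

# Explicit indices equivalent to the default split
n = len(pd.read_csv("./data/ETTh1.csv"))
boundaries = [int(n * 0.6), int(n * 0.8), n]  # end of train, end of val, end of test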
Custom Split¶
# Specify exact indices
dataset = LPTMDataset(
    boundaries=[0, 10000, 15000],
    # Train: 0-10000
    # Val: 10000-15000
    # Test: 15000-end
)
Use All Data¶
Performance Tips¶
1. Batch Size¶
Larger batch sizes improve throughput at the cost of memory. An illustrative configuration (values are arbitrary):
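# Illustrative: batch size raised above the LPTMDataset default of 16
dataset = LPTMDataset(
    datetime_col="date",
    path="./data/ETTh1.csv",
    mode="train",
    horizon=192,
    batchsize=64,
)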
2. Stride¶
Smaller stride creates more samples but is slower:
# More samples (slower)
dataset = LPTMDataset(
    stride=1,
    # ...
)
# Fewer samples (faster)
dataset = LPTMDataset(
    stride=96,
    # ...
)
3. Normalization¶
Enable normalization, on datasets that support it, so inputs are scaled before training. An illustrative configuration:
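# Illustrative: TimesfmDataset exposes an optional normalize flag
dataset = TimesfmDataset(
    datetime_col="date",
    path="./data/ETTh1.csv",
    mode="train",
    freq="h",
    normalize=True,  # invert later with dataset._denormalize_data(...)
)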
See Also¶
- Models API: Model classes
- Metrics API: Evaluation metrics
- Getting Started: Basic usage guide