tethys.backend/docs/commands/index-datasets.md

# Dataset Indexing Command

AdonisJS Ace command for indexing and synchronizing published datasets with OpenSearch for search functionality.

## Overview

The `index:datasets` command processes published datasets and creates/updates corresponding search index documents in OpenSearch. It intelligently compares modification timestamps to only re-index datasets when necessary, optimizing performance while maintaining search index accuracy.

## Command Syntax

```bash
node ace index:datasets [options]
```

## Options

| Flag | Alias | Description |
|------|-------|-------------|
| `--publish_id <number>` | `-p` | Index a specific dataset by publish_id |

## Usage Examples

### Basic Operations

```bash
# Index all published datasets that have been modified since last indexing
node ace index:datasets

# Index a specific dataset by publish_id
node ace index:datasets --publish_id 231
node ace index:datasets -p 231
```

## How It Works

### 1. **Dataset Selection**
The command processes datasets that meet these criteria:
- `server_state = 'published'` - Only published datasets
- Has preloaded `xmlCache` relationship for metadata transformation
- Optionally filtered by specific `publish_id`

### 2. **Smart Update Detection**
For each dataset, the command:
- Checks if the dataset exists in the OpenSearch index
- Compares `server_date_modified` timestamps
- Only re-indexes if the dataset is newer than the indexed version

### 3. **Document Processing**
The indexing process involves:
1. **XML Generation**: Creates structured XML from dataset metadata
2. **XSLT Transformation**: Converts XML to JSON using Saxon-JS processor
3. **Index Update**: Updates or creates the document in OpenSearch
4. **Logging**: Records success/failure for each operation

## Index Structure

### Index Configuration
- **Index Name**: `tethys-records`
- **Document ID**: Dataset `publish_id`
- **Refresh**: `true` (immediate availability)

### Document Fields
The indexed documents contain:
- **Metadata Fields**: Title, description, authors, keywords
- **Identifiers**: DOI, publish_id, and other identifiers
- **Temporal Data**: Publication dates, coverage periods
- **Geographic Data**: Spatial coverage information
- **Technical Details**: Data formats, access information
- **Timestamps**: Creation and modification dates

## Example Output

### Successful Run
```bash
node ace index:datasets
```
```
Found 150 published datasets to process
Dataset with publish_id 231 successfully indexed
Dataset with publish_id 245 is up to date, skipping indexing
Dataset with publish_id 267 successfully indexed
An error occurred while indexing dataset with publish_id 289. Error: Invalid XML metadata
Processing completed: 148 indexed, 1 skipped, 1 error
```

### Specific Dataset
```bash
node ace index:datasets --publish_id 231
```
```
Found 1 published dataset to process
Dataset with publish_id 231 successfully indexed
Processing completed: 1 indexed, 0 skipped, 0 errors
```

## Update Logic

The command uses intelligent indexing to avoid unnecessary processing:

| Condition | Action | Reason |
|-----------|--------|--------|
| Dataset not in index | ✅ Index | New dataset needs indexing |
| Dataset newer than indexed version | ✅ Re-index | Dataset has been updated |
| Dataset same/older than indexed version | ❌ Skip | Already up to date |
| OpenSearch document check fails | ✅ Index | Better safe than sorry |
| Invalid XML metadata | ❌ Skip + Log Error | Cannot process invalid data |

### Timestamp Comparison
```typescript
// Example comparison logic
const existingModified = DateTime.fromMillis(Number(existingDoc.server_date_modified) * 1000);
const currentModified = dataset.server_date_modified;

if (currentModified <= existingModified) {
    // Skip - already up to date
    return false;
}
// Proceed with indexing
```

## XML Transformation Process

### 1. **XML Generation**
```xml
<?xml version="1.0" encoding="UTF-8" standalone="true"?>
<root>
    <Dataset>
        <!-- Dataset metadata fields -->
        <title>Research Dataset Title</title>
        <description>Dataset description...</description>
        <!-- Additional metadata -->
    </Dataset>
</root>
```

### 2. **XSLT Processing**
The command uses Saxon-JS with a compiled stylesheet (`solr.sef.json`) to transform XML to JSON:
```javascript
const result = await SaxonJS.transform({
    stylesheetText: proc,
    destination: 'serialized',
    sourceText: xmlString,
});
```

### 3. **Final JSON Document**
```json
{
    "id": "231",
    "title": "Research Dataset Title",
    "description": "Dataset description...",
    "authors": ["Author Name"],
    "server_date_modified": 1634567890,
    "publish_id": 231
}
```

## Configuration Requirements

### Environment Variables
```bash
# OpenSearch Configuration
OPENSEARCH_HOST=localhost:9200

# For production:
# OPENSEARCH_HOST=your-opensearch-cluster:9200
```

### Required Files
- **XSLT Stylesheet**: `public/assets2/solr.sef.json` - Compiled Saxon-JS stylesheet for XML transformation

### Database Relationships
The command expects these model relationships:
```typescript
// Dataset model must have:
@hasOne(() => XmlCache, { foreignKey: 'dataset_id' })
public xmlCache: HasOne<typeof XmlCache>
```

## Error Handling

The command handles various error scenarios gracefully:

### Common Errors and Solutions

| Error | Cause | Solution |
|-------|-------|----------|
| `XSLT transformation failed` | Invalid XML or missing stylesheet | Check XML structure and stylesheet path |
| `OpenSearch connection error` | Service unavailable | Verify OpenSearch is running and accessible |
| `JSON parse error` | Malformed transformation result | Check XSLT stylesheet output format |
| `Missing xmlCache relationship` | Data integrity issue | Ensure xmlCache exists for dataset |

### Error Logging
```bash
# Typical error log entry
An error occurred while indexing dataset with publish_id 231.
Error: XSLT transformation failed: Invalid XML structure at line 15
```

## Performance Considerations

### Batch Processing
- Processes datasets sequentially to avoid overwhelming OpenSearch
- Each dataset is committed individually for reliability
- Failed indexing of one dataset doesn't stop processing others

### Resource Usage
- **Memory**: XML/JSON transformations require temporary memory
- **Network**: OpenSearch API calls for each dataset
- **CPU**: XSLT transformations are CPU-intensive

### Optimization Tips
```bash
# Index only recently modified datasets (run regularly)
node ace index:datasets

# Index specific datasets when needed
node ace index:datasets --publish_id 231

# Consider running during off-peak hours for large batches
```

## Integration with Other Systems

### Search Functionality
The indexed documents power:
- **Dataset Search**: Full-text search across metadata
- **Faceted Browsing**: Filter by authors, keywords, dates
- **Geographic Search**: Spatial query capabilities
- **Auto-complete**: Suggest dataset titles and keywords

### Related Commands
- [`update:datacite`](update-datacite.md) - Often run after indexing to sync DOI metadata
- **Database migrations** - May require re-indexing after schema changes

### API Integration
The indexed data is consumed by:
- **Search API**: `/api/search` endpoints
- **Browse API**: `/api/datasets` with filtering
- **Recommendations**: Related dataset suggestions

## Monitoring and Maintenance

### Regular Tasks
```bash
# Daily indexing (recommended cron job)
0 2 * * * cd /path/to/project && node ace index:datasets

# Weekly full re-index (if needed)
0 3 * * 0 cd /path/to/project && node ace index:datasets --force
```

### Health Checks
- Monitor OpenSearch cluster health
- Check for failed indexing operations in logs
- Verify search functionality is working
- Compare dataset counts between database and index

### Troubleshooting
```bash
# Check specific dataset indexing
node ace index:datasets --publish_id 231

# Verify OpenSearch connectivity
curl -X GET "localhost:9200/_cluster/health"

# Check index statistics
curl -X GET "localhost:9200/tethys-records/_stats"
```

## Best Practices

1. **Regular Scheduling**: Run the command regularly (daily) to keep the search index current
2. **Monitor Logs**: Watch for transformation errors or OpenSearch issues
3. **Backup Strategy**: Include OpenSearch indices in backup procedures
4. **Resource Management**: Monitor OpenSearch cluster resources during bulk operations
5. **Testing**: Verify search functionality after major indexing operations
6. **Coordination**: Run indexing before DataCite updates when both are needed