- feat: Enhance README with setup instructions, usage, and command documentation

- fix: Update API routes to include DOI URL handling and improve route organization

- chore: Add ORCID preload rule file and ensure proper registration

- docs: Add MIT License to the project for open-source compliance

- feat: Implement command to detect and fix missing dataset cross-references

- feat: Create command for updating DataCite DOI records with detailed logging and error handling

- docs: Add comprehensive documentation for dataset indexing command

- docs: Create detailed documentation for DataCite update command with usage examples and error handling
commit c049b22723 by Kaimbacher, 2025-09-19 14:35:23 +02:00
11 changed files with 2187 additions and 555 deletions

# Dataset Indexing Command
AdonisJS Ace command that indexes and synchronizes published datasets with OpenSearch to power search functionality.
## Overview
The `index:datasets` command processes published datasets and creates/updates corresponding search index documents in OpenSearch. It intelligently compares modification timestamps to only re-index datasets when necessary, optimizing performance while maintaining search index accuracy.
## Command Syntax
```bash
node ace index:datasets [options]
```
## Options
| Flag | Alias | Description |
|------|-------|-------------|
| `--publish_id <number>` | `-p` | Index a specific dataset by publish_id |
## Usage Examples
### Basic Operations
```bash
# Index all published datasets that have been modified since last indexing
node ace index:datasets
# Index a specific dataset by publish_id
node ace index:datasets --publish_id 231
node ace index:datasets -p 231
```
## How It Works
### 1. **Dataset Selection**
The command processes datasets that meet these criteria (a query sketch follows the list):
- `server_state = 'published'` - Only published datasets
- Has preloaded `xmlCache` relationship for metadata transformation
- Optionally filtered by specific `publish_id`
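A minimal sketch of that selection query, assuming Lucid models named `Dataset` and `XmlCache`; the command's actual query may differ:
```typescript
// Hypothetical selection query; model and relationship names are assumptions
const query = Dataset.query()
    .where('server_state', 'published')
    .preload('xmlCache')

// Optional filter when --publish_id is supplied
if (publishId) {
    query.where('publish_id', publishId)
}

const datasets = await query
```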
### 2. **Smart Update Detection**
For each dataset, the command (see the lookup sketch after this list):
- Checks if the dataset exists in the OpenSearch index
- Compares `server_date_modified` timestamps
- Only re-indexes if the dataset is newer than the indexed version
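A hedged sketch of the index lookup, assuming an OpenSearch `client` constructed as shown under Configuration Requirements below; a missing or unreadable document simply triggers indexing:
```typescript
// Illustrative lookup; 'client' is an @opensearch-project/opensearch Client
async function getIndexedDoc(publishId: number): Promise<Record<string, any> | null> {
    try {
        const { body } = await client.get({ index: 'tethys-records', id: String(publishId) })
        return body._source ?? null
    } catch {
        // Not found or check failed: treat as "not indexed" so the dataset gets (re-)indexed
        return null
    }
}
```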
### 3. **Document Processing**
The indexing process involves these steps (a condensed sketch follows the list):
1. **XML Generation**: Creates structured XML from dataset metadata
2. **XSLT Transformation**: Converts XML to JSON using Saxon-JS processor
3. **Index Update**: Updates or creates the document in OpenSearch
4. **Logging**: Records success/failure for each operation
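Condensed into code, the per-dataset pipeline might look like this sketch; the helper names `buildXmlFromCache` and `transformXmlToJson` are illustrative, not the command's actual function names:
```typescript
async function indexDataset(dataset: Dataset): Promise<void> {
    const xmlString = await buildXmlFromCache(dataset)  // 1. XML generation from xmlCache
    const jsonDoc = await transformXmlToJson(xmlString) // 2. XSLT transformation via Saxon-JS
    await client.index({                                // 3. Update/create the OpenSearch document
        index: 'tethys-records',
        id: String(dataset.publish_id),
        body: JSON.parse(jsonDoc),
        refresh: true,
    })
    console.log(`Dataset with publish_id ${dataset.publish_id} successfully indexed`) // 4. Logging
}
```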
## Index Structure
### Index Configuration
- **Index Name**: `tethys-records`
- **Document ID**: Dataset `publish_id`
- **Refresh**: `true` (immediate availability)
### Document Fields
The indexed documents contain:
- **Metadata Fields**: Title, description, authors, keywords
- **Identifiers**: DOI, publish_id, and other identifiers
- **Temporal Data**: Publication dates, coverage periods
- **Geographic Data**: Spatial coverage information
- **Technical Details**: Data formats, access information
- **Timestamps**: Creation and modification dates
## Example Output
### Successful Run
```bash
node ace index:datasets
```
```
Found 150 published datasets to process
Dataset with publish_id 231 successfully indexed
Dataset with publish_id 245 is up to date, skipping indexing
Dataset with publish_id 267 successfully indexed
An error occurred while indexing dataset with publish_id 289. Error: Invalid XML metadata
Processing completed: 148 indexed, 1 skipped, 1 error
```
### Specific Dataset
```bash
node ace index:datasets --publish_id 231
```
```
Found 1 published dataset to process
Dataset with publish_id 231 successfully indexed
Processing completed: 1 indexed, 0 skipped, 0 errors
```
## Update Logic
The command uses intelligent indexing to avoid unnecessary processing:
| Condition | Action | Reason |
|-----------|--------|--------|
| Dataset not in index | ✅ Index | New dataset needs indexing |
| Dataset newer than indexed version | ✅ Re-index | Dataset has been updated |
| Dataset same/older than indexed version | ❌ Skip | Already up to date |
| OpenSearch document check fails | ✅ Index | Better safe than sorry |
| Invalid XML metadata | ❌ Skip + Log Error | Cannot process invalid data |
### Timestamp Comparison
```typescript
import { DateTime } from 'luxon';

// Sketch of the comparison logic (assumes luxon DateTime, as used by Lucid)
function needsReindexing(existingDoc: { server_date_modified: string }, dataset: { server_date_modified: DateTime }): boolean {
    // The index stores the timestamp as Unix seconds; convert to a DateTime
    const existingModified = DateTime.fromMillis(Number(existingDoc.server_date_modified) * 1000);
    const currentModified = dataset.server_date_modified;
    if (currentModified <= existingModified) {
        return false; // Skip - already up to date
    }
    return true; // Proceed with indexing
}
```
## XML Transformation Process
### 1. **XML Generation**
```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
<Dataset>
<!-- Dataset metadata fields -->
<title>Research Dataset Title</title>
<description>Dataset description...</description>
<!-- Additional metadata -->
</Dataset>
</root>
```
### 2. **XSLT Processing**
The command uses Saxon-JS with a compiled stylesheet (`solr.sef.json`) to transform XML to JSON:
```javascript
const SaxonJS = require('saxon-js');

// 'proc' holds the compiled stylesheet (solr.sef.json) read as a string
const result = await SaxonJS.transform(
    {
        stylesheetText: proc,
        destination: 'serialized', // the transformed JSON is returned as a string in result.principalResult
        sourceText: xmlString,
    },
    'async'
);
```
### 3. **Final JSON Document**
```json
{
"id": "231",
"title": "Research Dataset Title",
"description": "Dataset description...",
"authors": ["Author Name"],
"server_date_modified": 1634567890,
"publish_id": 231
}
```
## Configuration Requirements
### Environment Variables
```bash
# OpenSearch Configuration
OPENSEARCH_HOST=localhost:9200
# For production:
# OPENSEARCH_HOST=your-opensearch-cluster:9200
```
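For reference, a client built from that variable might look like this sketch (plain HTTP assumed; add TLS and credentials as your cluster requires):
```typescript
import { Client } from '@opensearch-project/opensearch'

const client = new Client({ node: `http://${process.env.OPENSEARCH_HOST}` })
```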
### Required Files
- **XSLT Stylesheet**: `public/assets2/solr.sef.json` - Compiled Saxon-JS stylesheet for XML transformation
### Database Relationships
The command expects these model relationships:
```typescript
// Dataset model must have (imports shown for Lucid v5-style models; the model path is an assumption):
import { BaseModel, hasOne, HasOne } from '@ioc:Adonis/Lucid/Orm'
import XmlCache from 'App/Models/XmlCache'

export default class Dataset extends BaseModel {
  @hasOne(() => XmlCache, { foreignKey: 'dataset_id' })
  public xmlCache: HasOne<typeof XmlCache>
}
```
## Error Handling
The command handles various error scenarios gracefully:
### Common Errors and Solutions
| Error | Cause | Solution |
|-------|-------|----------|
| `XSLT transformation failed` | Invalid XML or missing stylesheet | Check XML structure and stylesheet path |
| `OpenSearch connection error` | Service unavailable | Verify OpenSearch is running and accessible |
| `JSON parse error` | Malformed transformation result | Check XSLT stylesheet output format |
| `Missing xmlCache relationship` | Data integrity issue | Ensure xmlCache exists for dataset |
### Error Logging
```bash
# Typical error log entry
An error occurred while indexing dataset with publish_id 231.
Error: XSLT transformation failed: Invalid XML structure at line 15
```
## Performance Considerations
### Batch Processing
- Processes datasets sequentially to avoid overwhelming OpenSearch (see the loop sketch after this list)
- Each dataset is committed individually for reliability
- Failed indexing of one dataset doesn't stop processing others
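Combining the earlier sketches, the sequential loop could look roughly like this; the counters feed the summary line shown in the example output:
```typescript
let indexed = 0, skipped = 0, errors = 0

for (const dataset of datasets) {
    try {
        const existingDoc = await getIndexedDoc(dataset.publish_id)
        if (!existingDoc || needsReindexing(existingDoc, dataset)) {
            await indexDataset(dataset) // see the pipeline sketch above
            indexed++
        } else {
            skipped++
        }
    } catch (error) {
        errors++
        console.error(`An error occurred while indexing dataset with publish_id ${dataset.publish_id}.`, error)
    }
}

console.log(`Processing completed: ${indexed} indexed, ${skipped} skipped, ${errors} errors`)
```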
### Resource Usage
- **Memory**: XML/JSON transformations require temporary memory
- **Network**: OpenSearch API calls for each dataset
- **CPU**: XSLT transformations are CPU-intensive
### Optimization Tips
```bash
# Index only recently modified datasets (run regularly)
node ace index:datasets
# Index specific datasets when needed
node ace index:datasets --publish_id 231
# Consider running during off-peak hours for large batches
```
## Integration with Other Systems
### Search Functionality
The indexed documents power:
- **Dataset Search**: Full-text search across metadata
- **Faceted Browsing**: Filter by authors, keywords, dates
- **Geographic Search**: Spatial query capabilities
- **Auto-complete**: Suggest dataset titles and keywords
### Related Commands
- [`update:datacite`](update-datacite.md) - Often run after indexing to sync DOI metadata
- **Database migrations** - May require re-indexing after schema changes
### API Integration
The indexed data is consumed by:
- **Search API**: `/api/search` endpoints
- **Browse API**: `/api/datasets` with filtering
- **Recommendations**: Related dataset suggestions
## Monitoring and Maintenance
### Regular Tasks
```bash
# Daily indexing (recommended cron job)
0 2 * * * cd /path/to/project && node ace index:datasets
# Weekly run during off-peak hours (up-to-date datasets are skipped automatically)
0 3 * * 0 cd /path/to/project && node ace index:datasets
```
### Health Checks
- Monitor OpenSearch cluster health
- Check for failed indexing operations in logs
- Verify search functionality is working
- Compare dataset counts between database and index
### Troubleshooting
```bash
# Check specific dataset indexing
node ace index:datasets --publish_id 231
# Verify OpenSearch connectivity
curl -X GET "localhost:9200/_cluster/health"
# Check index statistics
curl -X GET "localhost:9200/tethys-records/_stats"
```
## Best Practices
1. **Regular Scheduling**: Run the command regularly (daily) to keep the search index current
2. **Monitor Logs**: Watch for transformation errors or OpenSearch issues
3. **Backup Strategy**: Include OpenSearch indices in backup procedures
4. **Resource Management**: Monitor OpenSearch cluster resources during bulk operations
5. **Testing**: Verify search functionality after major indexing operations
6. **Coordination**: Run indexing before DataCite updates when both are needed

# DataCite Update Command
AdonisJS Ace command for updating DataCite DOI records for published datasets.
## Overview
The `update:datacite` command synchronizes your local dataset metadata with DataCite DOI records. It intelligently compares modification dates to only update records when necessary, reducing unnecessary API calls and maintaining data consistency.
## Command Syntax
```bash
node ace update:datacite [options]
```
## Options
| Flag | Alias | Description |
|------|-------|-------------|
| `--publish_id <number>` | `-p` | Update a specific dataset by publish_id |
| `--force` | `-f` | Force update all records regardless of modification date |
| `--dry-run` | `-d` | Preview what would be updated without making changes |
| `--stats` | `-s` | Show detailed statistics for datasets that need updating |
## Usage Examples
### Basic Operations
```bash
# Update all datasets that have been modified since their DOI was last updated
node ace update:datacite
# Update a specific dataset
node ace update:datacite --publish_id 231
node ace update:datacite -p 231
# Force update all datasets with DOIs (ignores modification dates)
node ace update:datacite --force
```
### Preview and Analysis
```bash
# Preview what would be updated (dry run)
node ace update:datacite --dry-run
# Show detailed statistics for datasets that need updating
node ace update:datacite --stats
# Show stats for a specific dataset
node ace update:datacite --stats --publish_id 231
```
### Combined Options
```bash
# Dry run for a specific dataset
node ace update:datacite --dry-run --publish_id 231
# Show stats for all datasets (including up-to-date ones)
node ace update:datacite --stats --force
```
## Command Modes
### 1. **Normal Mode** (Default)
Updates DataCite records for datasets that have been modified since their DOI was last updated.
**Example Output:**
```
Using DataCite API: https://api.test.datacite.org
Found 50 datasets to process
Dataset 231: Successfully updated DataCite record
Dataset 245: Up to date, skipping
Dataset 267: Successfully updated DataCite record
DataCite update completed. Updated: 15, Skipped: 35, Errors: 0
```
### 2. **Dry Run Mode** (`--dry-run`)
Shows what would be updated without making any changes to DataCite.
**Use Case:** Preview updates before running the actual command.
**Example Output:**
```
Dataset 231: Would update DataCite record (dry run)
Dataset 267: Would update DataCite record (dry run)
Dataset 245: Up to date, skipping
DataCite update completed. Updated: 2, Skipped: 1, Errors: 0
```
### 3. **Stats Mode** (`--stats`)
Shows detailed information for each dataset that needs updating, including why it needs updating.
**Use Case:** Debug synchronization issues, monitor dataset/DOI status, generate reports.
**Example Output:**
```
┌─ Dataset 231 ─────────────────────────────────────────────────────────
│ DOI Value: 10.21388/tethys.231
│ DOI Status (DB): findable
│ DOI State (DataCite): findable
│ Dataset Modified: 2024-09-15T10:30:00.000Z
│ DOI Modified: 2024-09-10T08:15:00.000Z
│ Needs Update: YES - Dataset newer than DOI
└───────────────────────────────────────────────────────────────────────
┌─ Dataset 267 ─────────────────────────────────────────────────────────
│ DOI Value: 10.21388/tethys.267
│ DOI Status (DB): findable
│ DOI State (DataCite): findable
│ Dataset Modified: 2024-09-18T14:20:00.000Z
│ DOI Modified: 2024-09-16T12:45:00.000Z
│ Needs Update: YES - Dataset newer than DOI
└───────────────────────────────────────────────────────────────────────
DataCite Stats Summary: 2 datasets need updating, 48 are up to date
```
## Update Logic
The command uses intelligent update detection (a decision sketch follows the table below):
1. **Compares modification dates**: Dataset `server_date_modified` vs DOI last modification date from DataCite
2. **Validates data integrity**: Checks for missing or future dates
3. **Handles API failures gracefully**: Updates anyway if DataCite info can't be retrieved
4. **Uses dual API approach**: DataCite REST API (primary) with MDS API fallback
### When Updates Happen
| Condition | Action | Reason |
|-----------|--------|--------|
| Dataset modified > DOI modified | ✅ Update | Dataset has newer changes |
| Dataset modified ≤ DOI modified | ❌ Skip | DOI is up to date |
| Dataset date in future | ❌ Skip | Invalid data, needs investigation |
| Dataset date missing | ✅ Update | Can't determine staleness |
| DataCite API error | ✅ Update | Better safe than sorry |
| `--force` flag used | ✅ Update | Override all logic |
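The table translates into roughly this decision function (a sketch assuming luxon `DateTime` values; the command's real implementation may differ):
```typescript
import { DateTime } from 'luxon'

function needsDataCiteUpdate(
    datasetModified: DateTime | null,
    doiModified: DateTime | null,
    force: boolean
): boolean {
    if (force) return true                              // --force overrides all checks
    if (!datasetModified) return true                   // missing date: can't determine staleness
    if (datasetModified > DateTime.now()) return false  // future date: invalid data, skip
    if (!doiModified) return true                       // DataCite info unavailable: better safe than sorry
    return datasetModified > doiModified                // update only when the dataset is newer
}
```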
## Environment Configuration
Required environment variables:
```bash
# DataCite Credentials
DATACITE_USERNAME=your_username
DATACITE_PASSWORD=your_password
# API Endpoints - set ONE pair per environment (later assignments override earlier ones)
DATACITE_API_URL=https://api.test.datacite.org # Test environment
DATACITE_SERVICE_URL=https://mds.test.datacite.org # Test MDS
# DATACITE_API_URL=https://api.datacite.org # Production
# DATACITE_SERVICE_URL=https://mds.datacite.org # Production MDS
# Project Configuration
DATACITE_PREFIX=10.21388 # Your DOI prefix
BASE_DOMAIN=tethys.at # Your domain
```
## Error Handling
The command handles various error scenarios:
- **Invalid modification dates**: Logs errors but continues processing other datasets
- **DataCite API failures**: Falls back to MDS API, then to safe update
- **Missing DOI identifiers**: Skips datasets without DOI identifiers
- **Network issues**: Continues with next dataset after logging error
## Integration
The command integrates with:
- **Dataset Model**: Uses `server_date_modified` for change detection
- **DatasetIdentifier Model**: Reads DOI values and status
- **OpenSearch Index**: Updates search index after DataCite update
- **DoiClient**: Handles all DataCite API interactions
## Common Workflows
### Daily Maintenance
```bash
# Update any datasets modified today
node ace update:datacite
```
### Pre-Deployment Check
```bash
# Check what would be updated before deployment
node ace update:datacite --dry-run
```
### Debugging Sync Issues
```bash
# Investigate why a specific dataset isn't syncing
node ace update:datacite --stats --publish_id 231
```
### Full Resync
```bash
# Force update all DOI records (use with caution)
node ace update:datacite --force
```
### Monitoring Report
```bash
# Generate sync status report
node ace update:datacite --stats > datacite-sync-report.txt
```
## Best Practices
1. **Regular Updates**: Run daily or after bulk dataset modifications
2. **Test First**: Use `--dry-run` or `--stats` before bulk operations
3. **Monitor Logs**: Check for data integrity warnings
4. **Environment Separation**: Use correct API URLs for test vs production
5. **Rate Limiting**: The command handles DataCite rate limits automatically