tethys.backend/docs/commands/update-datacite.md
Arno Kaimbacher c049b22723
2025-09-19 14:35:23 +02:00

DataCite Update Command

AdonisJS Ace command for updating DataCite DOI records for published datasets.

Overview

The update:datacite command synchronizes your local dataset metadata with DataCite DOI records. It compares each dataset's modification date with the modification date of its DOI record and updates only the records that are out of date, which avoids unnecessary API calls.

Command Syntax

node ace update:datacite [options]

Options

Flag                   Alias  Description
--publish_id <number>  -p     Update a specific dataset by publish_id
--force                -f     Force update all records regardless of modification date
--dry-run              -d     Preview what would be updated without making changes
--stats                -s     Show detailed statistics for datasets that need updating

Usage Examples

Basic Operations

# Update all datasets that have been modified since their DOI was last updated
node ace update:datacite

# Update a specific dataset
node ace update:datacite --publish_id 231
node ace update:datacite -p 231

# Force update all datasets with DOIs (ignores modification dates)
node ace update:datacite --force

Preview and Analysis

# Preview what would be updated (dry run)
node ace update:datacite --dry-run

# Show detailed statistics for datasets that need updating
node ace update:datacite --stats

# Show stats for a specific dataset
node ace update:datacite --stats --publish_id 231

Combined Options

# Dry run for a specific dataset
node ace update:datacite --dry-run --publish_id 231

# Show stats for all datasets (including up-to-date ones)
node ace update:datacite --stats --force

Command Modes

1. Normal Mode (Default)

Updates DataCite records for datasets that have been modified since their DOI was last updated.

Example Output:

Using DataCite API: https://api.test.datacite.org
Found 50 datasets to process
Dataset 231: Successfully updated DataCite record
Dataset 245: Up to date, skipping
Dataset 267: Successfully updated DataCite record
DataCite update completed. Updated: 15, Skipped: 35, Errors: 0

2. Dry Run Mode (--dry-run)

Shows what would be updated without making any changes to DataCite.

Use Case: Preview updates before running the actual command.

Example Output:

Dataset 231: Would update DataCite record (dry run)
Dataset 267: Would update DataCite record (dry run)
Dataset 245: Up to date, skipping
DataCite update completed. Updated: 2, Skipped: 1, Errors: 0

3. Stats Mode (--stats)

Shows detailed information for each dataset that needs updating, including why it needs updating.

Use Case: Debug synchronization issues, monitor dataset/DOI status, generate reports.

Example Output:

┌─ Dataset 231 ─────────────────────────────────────────────────────────
│ DOI Value:           10.21388/tethys.231
│ DOI Status (DB):     findable
│ DOI State (DataCite): findable
│ Dataset Modified:    2024-09-15T10:30:00.000Z
│ DOI Modified:        2024-09-10T08:15:00.000Z
│ Needs Update:        YES - Dataset newer than DOI
└───────────────────────────────────────────────────────────────────────

┌─ Dataset 267 ─────────────────────────────────────────────────────────
│ DOI Value:           10.21388/tethys.267
│ DOI Status (DB):     findable
│ DOI State (DataCite): findable
│ Dataset Modified:    2024-09-18T14:20:00.000Z
│ DOI Modified:        2024-09-16T12:45:00.000Z
│ Needs Update:        YES - Dataset newer than DOI
└───────────────────────────────────────────────────────────────────────

DataCite Stats Summary: 2 datasets need updating, 48 are up to date

Update Logic

The command detects stale DOI records as follows:

  1. Compares modification dates: Dataset server_date_modified vs DOI last modification date from DataCite
  2. Validates data integrity: Checks for missing or future dates
  3. Handles API failures gracefully: Updates anyway if DataCite info can't be retrieved
  4. Uses dual API approach: DataCite REST API (primary) with MDS API fallback
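
The fallback order in steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the DoiClient implementation: the DoiDateFetcher type, the DoiLookup result shape, and the function name lookupDoiModified are all hypothetical, and the two fetchers stand in for whatever calls the REST and MDS APIs.

```typescript
// Hypothetical fetcher type: resolves to the DOI's last modification date
// and rejects when the API cannot be reached. Not part of the Tethys code base.
type DoiDateFetcher = (doi: string) => Promise<Date>;

// Result of the lookup, tagged with the API that answered.
type DoiLookup =
  | { source: "rest" | "mds"; modified: Date }
  | { source: "none"; modified: null };

// Try the REST API first, then the MDS API. If both fail, report "none"
// so the caller can update the record anyway instead of aborting.
async function lookupDoiModified(
  doi: string,
  fetchFromRest: DoiDateFetcher,
  fetchFromMds: DoiDateFetcher,
): Promise<DoiLookup> {
  try {
    return { source: "rest", modified: await fetchFromRest(doi) };
  } catch {
    // REST lookup failed; fall through to the MDS API.
  }
  try {
    return { source: "mds", modified: await fetchFromMds(doi) };
  } catch {
    // Both lookups failed; the DOI modification date is unknown.
    return { source: "none", modified: null };
  }
}
```

A result with source "none" corresponds to the "DataCite API error" case, where the record is updated because its staleness cannot be determined.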

When Updates Happen

Condition                        Action  Reason
Dataset modified > DOI modified  Update  Dataset has newer changes
Dataset modified ≤ DOI modified  Skip    DOI is up to date
Dataset date in future           Skip    Invalid data, needs investigation
Dataset date missing             Update  Can't determine staleness
DataCite API error               Update  Better safe than sorry
--force flag used                Update  Override all logic
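
The conditions above can be read as a single decision function. The sketch below is one possible encoding, not the command's actual code: the UpdateInput shape and the name decideUpdate are hypothetical, and the order in which the rules are checked (for example, --force before the future-date check) is an assumption, since the table does not state a precedence.

```typescript
interface Decision {
  action: "update" | "skip";
  reason: string;
}

// Hypothetical input shape; the field names are illustrative only.
interface UpdateInput {
  datasetModified: Date | null; // the dataset's server_date_modified, if present
  doiModified: Date | null; // null when DataCite could not be queried
  force: boolean; // true when --force was passed
  now: Date; // injected so the future-date check is testable
}

// Encodes the six rows of the table, checked in an assumed order.
function decideUpdate(input: UpdateInput): Decision {
  const { datasetModified, doiModified, force, now } = input;
  if (force) {
    return { action: "update", reason: "--force flag used" };
  }
  if (datasetModified === null) {
    return { action: "update", reason: "Dataset date missing" };
  }
  if (datasetModified.getTime() > now.getTime()) {
    return { action: "skip", reason: "Dataset date in future" };
  }
  if (doiModified === null) {
    return { action: "update", reason: "DataCite API error" };
  }
  if (datasetModified.getTime() > doiModified.getTime()) {
    return { action: "update", reason: "Dataset newer than DOI" };
  }
  return { action: "skip", reason: "DOI is up to date" };
}
```

With the dates shown for dataset 231 in the stats example (dataset modified 2024-09-15, DOI modified 2024-09-10), this function returns an update decision.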

Environment Configuration

Required environment variables:

# DataCite Credentials
DATACITE_USERNAME=your_username
DATACITE_PASSWORD=your_password

# API Endpoints (set exactly one pair per environment)
# Test environment
DATACITE_API_URL=https://api.test.datacite.org
DATACITE_SERVICE_URL=https://mds.test.datacite.org

# Production environment (uncomment to use instead of the test pair)
# DATACITE_API_URL=https://api.datacite.org
# DATACITE_SERVICE_URL=https://mds.datacite.org

# Project Configuration
DATACITE_PREFIX=10.21388                                # Your DOI prefix
BASE_DOMAIN=tethys.at                                   # Your domain
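
A small pre-flight check can catch a missing variable or a mix of test and production endpoints before the command runs. The sketch below is an assumption about how such a check might look and is not part of the command; the helper names missingDataciteVars and usesTestEndpoints are hypothetical, and the environment is passed in as a plain object (in the application this would typically be process.env).

```typescript
// The variables listed above as required.
const REQUIRED_DATACITE_VARS = [
  "DATACITE_USERNAME",
  "DATACITE_PASSWORD",
  "DATACITE_API_URL",
  "DATACITE_SERVICE_URL",
  "DATACITE_PREFIX",
  "BASE_DOMAIN",
] as const;

type Env = Record<string, string | undefined>;

// Returns the names of required variables that are unset or blank.
function missingDataciteVars(env: Env): string[] {
  return REQUIRED_DATACITE_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// True only when both endpoints point at the DataCite test hosts,
// so a mixed test/production pair is reported as false.
function usesTestEndpoints(env: Env): boolean {
  const urls = [env["DATACITE_API_URL"], env["DATACITE_SERVICE_URL"]];
  return urls.every(
    (url) => url !== undefined && url.includes(".test.datacite.org"),
  );
}
```

Checking both URLs together guards against pairing the test REST endpoint with the production MDS endpoint, or the reverse.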

Error Handling

The command handles various error scenarios:

  • Invalid modification dates: Logs errors but continues processing other datasets
  • DataCite API failures: Falls back to the MDS API; if that also fails, updates the record anyway
  • Missing DOI identifiers: Skips datasets without DOI identifiers
  • Network issues: Continues with next dataset after logging error
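
The "log and continue" behaviour described above amounts to isolating each dataset in its own try/catch and counting outcomes. The following is a simplified, synchronous sketch under that assumption; processAll, formatSummary, and the processOne callback are hypothetical names, and the real command presumably works asynchronously. The summary line follows the example output shown under Command Modes.

```typescript
interface RunSummary {
  updated: number;
  skipped: number;
  errors: number;
}

type Outcome = "updated" | "skipped";

// processOne is a hypothetical stand-in for the per-dataset work; it returns
// the outcome or throws on failure (for example, a network error).
function processAll(
  publishIds: number[],
  processOne: (publishId: number) => Outcome,
  log: (line: string) => void,
): RunSummary {
  const summary: RunSummary = { updated: 0, skipped: 0, errors: 0 };
  for (const id of publishIds) {
    try {
      const outcome = processOne(id);
      if (outcome === "updated") {
        summary.updated += 1;
      } else {
        summary.skipped += 1;
      }
    } catch (error) {
      // One failing dataset must not stop the run: count it, log it, move on.
      summary.errors += 1;
      const message = error instanceof Error ? error.message : String(error);
      log(`Dataset ${id}: ${message}`);
    }
  }
  return summary;
}

// Formats the counters in the style of the summary line in the example output.
function formatSummary(s: RunSummary): string {
  return `DataCite update completed. Updated: ${s.updated}, Skipped: ${s.skipped}, Errors: ${s.errors}`;
}
```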

Integration

The command integrates with:

  • Dataset Model: Uses server_date_modified for change detection
  • DatasetIdentifier Model: Reads DOI values and status
  • OpenSearch Index: Updates search index after DataCite update
  • DoiClient: Handles all DataCite API interactions

Common Workflows

Daily Maintenance

# Update datasets modified since their DOI record was last updated
node ace update:datacite

Pre-Deployment Check

# Check what would be updated before deployment
node ace update:datacite --dry-run

Debugging Sync Issues

# Investigate why a specific dataset isn't syncing
node ace update:datacite --stats --publish_id 231

Full Resync

# Force update all DOI records (use with caution)
node ace update:datacite --force

Monitoring Report

# Generate sync status report
node ace update:datacite --stats > datacite-sync-report.txt
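
A saved report can be checked by a monitoring script that reads the final summary line. The sketch below assumes the line has exactly the form shown in the Stats Mode example ("DataCite Stats Summary: 2 datasets need updating, 48 are up to date"); the actual wording may differ, for instance for singular counts, and parseStatsSummary is a hypothetical helper.

```typescript
interface StatsSummary {
  needUpdating: number;
  upToDate: number;
}

// Extracts the two counters from the report text, or returns null when the
// summary line is absent or does not match the assumed format.
function parseStatsSummary(report: string): StatsSummary | null {
  const pattern =
    /DataCite Stats Summary: (\d+) datasets need updating, (\d+) are up to date/;
  const match = pattern.exec(report);
  if (match === null) {
    return null;
  }
  return { needUpdating: Number(match[1]), upToDate: Number(match[2]) };
}
```

A monitoring job could, for example, raise an alert when needUpdating stays above zero across several consecutive reports.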

Best Practices

  1. Regular Updates: Run daily or after bulk dataset modifications
  2. Test First: Use --dry-run or --stats before bulk operations
  3. Monitor Logs: Check for data integrity warnings
  4. Environment Separation: Use correct API URLs for test vs production
  5. Rate Limiting: The command handles DataCite rate limits automatically