
# Dataset Indexing Command

An AdonisJS Ace command that indexes and synchronizes published datasets with OpenSearch to power search functionality.

## Overview

The index:datasets command processes published datasets and creates/updates corresponding search index documents in OpenSearch. It intelligently compares modification timestamps to only re-index datasets when necessary, optimizing performance while maintaining search index accuracy.

## Command Syntax

```bash
node ace index:datasets [options]
```

## Options

| Flag | Alias | Description |
|------|-------|-------------|
| `--publish_id <number>` | `-p` | Index a specific dataset by `publish_id` |

## Usage Examples

### Basic Operations

```bash
# Index all published datasets that have been modified since last indexing
node ace index:datasets

# Index a specific dataset by publish_id
node ace index:datasets --publish_id 231
node ace index:datasets -p 231
```

## How It Works

### 1. Dataset Selection

The command processes datasets that meet these criteria (see the query sketch below):

- `server_state = 'published'` - only published datasets
- Has a preloaded `xmlCache` relationship for metadata transformation
- Optionally filtered by a specific `publish_id`
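As a rough sketch, the selection could be written as a Lucid query like the one below. The model and relationship names come from this document; the import path and variable names are assumptions:

```typescript
import Dataset from 'App/Models/Dataset'; // import path is an assumption

// Published datasets with their xmlCache preloaded for transformation
const query = Dataset.query()
    .where('server_state', 'published')
    .preload('xmlCache');

// Optionally narrow the run to a single dataset
if (publishId) {
    query.where('publish_id', publishId);
}

const datasets = await query;
```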

### 2. Smart Update Detection

For each dataset, the command:

- Checks whether the dataset already exists in the OpenSearch index (sketched below)
- Compares `server_date_modified` timestamps
- Only re-indexes if the dataset is newer than the indexed version
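A minimal sketch of that check with the opensearch-js client, inside a `needsIndexing`-style helper (the helper name is hypothetical, and client construction is omitted; index name and ID scheme are as documented under Index Structure below):

```typescript
// Does an index document already exist for this dataset?
const { body: exists } = await client.exists({
    index: 'tethys-records',
    id: dataset.publish_id.toString(),
});

if (!exists) {
    return true; // not indexed yet, so index it
}

// Fetch the stored document to compare server_date_modified
const { body: response } = await client.get({
    index: 'tethys-records',
    id: dataset.publish_id.toString(),
});
const existingDoc = response._source; // fields as stored at indexing time
```

If the check itself fails, the command indexes the dataset anyway (see Update Logic below).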

### 3. Document Processing

The indexing process involves:

1. **XML Generation**: Creates structured XML from dataset metadata
2. **XSLT Transformation**: Converts the XML to JSON using the Saxon-JS processor
3. **Index Update**: Updates or creates the document in OpenSearch
4. **Logging**: Records success or failure for each operation

## Index Structure

### Index Configuration

- **Index Name**: `tethys-records`
- **Document ID**: Dataset `publish_id`
- **Refresh**: `true` (immediate availability)
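In opensearch-js terms, this configuration corresponds to an index call roughly like the following sketch, where `doc` stands for the transformed JSON document:

```typescript
await client.index({
    index: 'tethys-records',           // index name
    id: dataset.publish_id.toString(), // document ID is the publish_id
    body: doc,                         // transformed JSON document
    refresh: true,                     // make the change searchable immediately
});
```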

### Document Fields

The indexed documents contain:

- **Metadata Fields**: Title, description, authors, keywords
- **Identifiers**: DOI, `publish_id`, and other identifiers
- **Temporal Data**: Publication dates, coverage periods
- **Geographic Data**: Spatial coverage information
- **Technical Details**: Data formats, access information
- **Timestamps**: Creation and modification dates

## Example Output

### Successful Run

```bash
node ace index:datasets
Found 150 published datasets to process
Dataset with publish_id 231 successfully indexed
Dataset with publish_id 245 is up to date, skipping indexing
Dataset with publish_id 267 successfully indexed
An error occurred while indexing dataset with publish_id 289. Error: Invalid XML metadata
Processing completed: 148 indexed, 1 skipped, 1 error
```

### Specific Dataset

```bash
node ace index:datasets --publish_id 231
Found 1 published dataset to process
Dataset with publish_id 231 successfully indexed
Processing completed: 1 indexed, 0 skipped, 0 errors
```

## Update Logic

The command uses intelligent indexing to avoid unnecessary processing:

| Condition | Action | Reason |
|-----------|--------|--------|
| Dataset not in index | Index | New dataset needs indexing |
| Dataset newer than indexed version | Re-index | Dataset has been updated |
| Dataset same as/older than indexed version | Skip | Already up to date |
| OpenSearch document check fails | Index | Better safe than sorry |
| Invalid XML metadata | Skip + log error | Cannot process invalid data |

### Timestamp Comparison

```typescript
// Example comparison logic: the index stores server_date_modified as Unix seconds
const existingModified = DateTime.fromMillis(Number(existingDoc.server_date_modified) * 1000);
const currentModified = dataset.server_date_modified;

if (currentModified <= existingModified) {
    // Skip - already up to date
    return false;
}
// Proceed with indexing
```

## XML Transformation Process

### 1. XML Generation

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
    <Dataset>
        <!-- Dataset metadata fields -->
        <title>Research Dataset Title</title>
        <description>Dataset description...</description>
        <!-- Additional metadata -->
    </Dataset>
</root>
```

### 2. XSLT Processing

The command uses Saxon-JS with a compiled stylesheet (solr.sef.json) to transform XML to JSON:

```typescript
import SaxonJS from 'saxon-js';

// `proc` holds the compiled stylesheet (solr.sef.json) read in as text
const result = await SaxonJS.transform({
    stylesheetText: proc,
    destination: 'serialized',
    sourceText: xmlString,
}, 'async');
const doc = JSON.parse(result.principalResult); // the serialized JSON output
```

### 3. Final JSON Document

```json
{
    "id": "231",
    "title": "Research Dataset Title",
    "description": "Dataset description...",
    "authors": ["Author Name"],
    "server_date_modified": 1634567890,
    "publish_id": 231
}
```

## Configuration Requirements

### Environment Variables

```bash
# OpenSearch Configuration
OPENSEARCH_HOST=localhost:9200

# For production:
# OPENSEARCH_HOST=your-opensearch-cluster:9200
```

### Required Files

- **XSLT Stylesheet**: `public/assets2/solr.sef.json` - compiled Saxon-JS stylesheet for the XML-to-JSON transformation
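Should the compiled stylesheet ever need to be regenerated from its XSLT source, Saxon's `xslt3` tool can produce the SEF file; the source filename here is an assumption:

```bash
# Compile the XSLT source into a SEF file for Saxon-JS (source filename is an assumption)
npx xslt3 -xsl:solr.xslt -export:public/assets2/solr.sef.json -nogo
```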

### Database Relationships

The command expects these model relationships:

```typescript
// The Dataset model must define:
@hasOne(() => XmlCache, { foreignKey: 'dataset_id' })
public xmlCache: HasOne<typeof XmlCache>
```

## Error Handling

The command handles various error scenarios gracefully:

### Common Errors and Solutions

| Error | Cause | Solution |
|-------|-------|----------|
| XSLT transformation failed | Invalid XML or missing stylesheet | Check the XML structure and stylesheet path |
| OpenSearch connection error | Service unavailable | Verify OpenSearch is running and accessible |
| JSON parse error | Malformed transformation result | Check the XSLT stylesheet's output format |
| Missing xmlCache relationship | Data integrity issue | Ensure an xmlCache record exists for the dataset |

### Error Logging

```text
# Typical error log entry
An error occurred while indexing dataset with publish_id 231.
Error: XSLT transformation failed: Invalid XML structure at line 15
```

## Performance Considerations

### Batch Processing

- Processes datasets sequentially to avoid overwhelming OpenSearch
- Each dataset is committed individually for reliability
- Failed indexing of one dataset doesn't stop processing of the others (see the sketch below)
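The per-dataset loop might look roughly like this illustrative sketch; the helper and logger names are assumptions:

```typescript
for (const dataset of datasets) {
    try {
        await indexDataset(dataset); // hypothetical helper: transform and index one dataset
        logger.info(`Dataset with publish_id ${dataset.publish_id} successfully indexed`);
    } catch (error) {
        // A single failure is logged and does not abort the remaining datasets
        logger.error(`An error occurred while indexing dataset with publish_id ${dataset.publish_id}. ${error}`);
    }
}
```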

### Resource Usage

- **Memory**: XML/JSON transformations require temporary memory
- **Network**: OpenSearch API calls for each dataset
- **CPU**: XSLT transformations are CPU-intensive

### Optimization Tips

```bash
# Index only recently modified datasets (run regularly)
node ace index:datasets

# Index specific datasets when needed
node ace index:datasets --publish_id 231

# Consider running during off-peak hours for large batches
```

## Integration with Other Systems

### Search Functionality

The indexed documents power:

- **Dataset Search**: Full-text search across metadata
- **Faceted Browsing**: Filter by authors, keywords, dates
- **Geographic Search**: Spatial query capabilities
- **Auto-complete**: Suggestions for dataset titles and keywords

### Related Commands

- `update:datacite` - often run after indexing to sync DOI metadata
- Database migrations - may require re-indexing after schema changes

### API Integration

The indexed data is consumed by:

- **Search API**: `/api/search` endpoints
- **Browse API**: `/api/datasets` with filtering
- **Recommendations**: Related dataset suggestions

## Monitoring and Maintenance

### Regular Tasks

```bash
# Daily indexing (recommended cron job)
0 2 * * * cd /path/to/project && node ace index:datasets

# Weekly full re-index, if your version supports a --force flag
0 3 * * 0 cd /path/to/project && node ace index:datasets --force
```

### Health Checks

- Monitor OpenSearch cluster health
- Check the logs for failed indexing operations
- Verify that search functionality is working
- Compare dataset counts between the database and the index (see the sketch below)
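For the count comparison, a quick sketch; the database table and column names are assumptions to adjust to your schema:

```bash
# Number of documents currently in the index
curl -s "localhost:9200/tethys-records/_count"

# Number of published datasets in the database (table/column names are assumptions)
psql -d tethys -c "SELECT COUNT(*) FROM datasets WHERE server_state = 'published';"
```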

## Troubleshooting

```bash
# Check indexing of a specific dataset
node ace index:datasets --publish_id 231

# Verify OpenSearch connectivity
curl -X GET "localhost:9200/_cluster/health"

# Check index statistics
curl -X GET "localhost:9200/tethys-records/_stats"
```

## Best Practices

1. **Regular Scheduling**: Run the command regularly (e.g. daily) to keep the search index current
2. **Monitor Logs**: Watch for transformation errors or OpenSearch issues
3. **Backup Strategy**: Include OpenSearch indices in backup procedures
4. **Resource Management**: Monitor OpenSearch cluster resources during bulk operations
5. **Testing**: Verify search functionality after major indexing operations
6. **Coordination**: Run indexing before DataCite updates when both are needed