
# Dataset Indexing Command

An AdonisJS Ace command that indexes and synchronizes published datasets with OpenSearch to power search functionality.

## Overview

The index:datasets command processes published datasets and creates/updates corresponding search index documents in OpenSearch. It intelligently compares modification timestamps to only re-index datasets when necessary, optimizing performance while maintaining search index accuracy.

## Command Syntax

```bash
node ace index:datasets [options]
```

## Options

| Flag | Alias | Description |
|------|-------|-------------|
| `--publish_id <number>` | `-p` | Index a specific dataset by `publish_id` |

## Usage Examples

### Basic Operations

```bash
# Index all published datasets that have been modified since last indexing
node ace index:datasets

# Index a specific dataset by publish_id
node ace index:datasets --publish_id 231
node ace index:datasets -p 231
```

## How It Works

### 1. Dataset Selection

The command processes datasets that meet these criteria (see the query sketch below):

- `server_state = 'published'` - only published datasets
- Has a preloaded `xmlCache` relationship for metadata transformation
- Optionally filtered by a specific `publish_id`
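As a rough sketch, the selection could be written as a Lucid query like the one below. The model and relationship names come from this document; the import path and variable names are assumptions:

```typescript
import Dataset from 'App/Models/Dataset'; // import path is an assumption

// Published datasets with their xmlCache preloaded for transformation
const query = Dataset.query()
    .where('server_state', 'published')
    .preload('xmlCache');

// Optionally narrow the run to a single dataset
if (publishId) {
    query.where('publish_id', publishId);
}

const datasets = await query;
```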

### 2. Smart Update Detection

For each dataset, the command:

- Checks whether the dataset already exists in the OpenSearch index (sketched below)
- Compares `server_date_modified` timestamps
- Only re-indexes if the dataset is newer than the indexed version
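A minimal sketch of that check with the opensearch-js client, inside a `needsIndexing`-style helper (the helper name is hypothetical, and client construction is omitted; index name and ID scheme are as documented under Index Structure below):

```typescript
// Does an index document already exist for this dataset?
const { body: exists } = await client.exists({
    index: 'tethys-records',
    id: dataset.publish_id.toString(),
});

if (!exists) {
    return true; // not indexed yet, so index it
}

// Fetch the stored document to compare server_date_modified
const { body: response } = await client.get({
    index: 'tethys-records',
    id: dataset.publish_id.toString(),
});
const existingDoc = response._source; // fields as stored at indexing time
```

If the check itself fails, the command indexes the dataset anyway (see Update Logic below).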

### 3. Document Processing

The indexing process involves:

1. **XML Generation**: Creates structured XML from dataset metadata
2. **XSLT Transformation**: Converts the XML to JSON using the Saxon-JS processor
3. **Index Update**: Updates or creates the document in OpenSearch
4. **Logging**: Records success or failure for each operation

## Index Structure

### Index Configuration

- **Index Name**: `tethys-records`
- **Document ID**: Dataset `publish_id`
- **Refresh**: `true` (immediate availability)
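In opensearch-js terms, this configuration corresponds to an index call roughly like the following sketch, where `doc` stands for the transformed JSON document:

```typescript
await client.index({
    index: 'tethys-records',           // index name
    id: dataset.publish_id.toString(), // document ID is the publish_id
    body: doc,                         // transformed JSON document
    refresh: true,                     // make the change searchable immediately
});
```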

### Document Fields

The indexed documents contain:

- **Metadata Fields**: Title, description, authors, keywords
- **Identifiers**: DOI, `publish_id`, and other identifiers
- **Temporal Data**: Publication dates, coverage periods
- **Geographic Data**: Spatial coverage information
- **Technical Details**: Data formats, access information
- **Timestamps**: Creation and modification dates

## Example Output

### Successful Run

```bash
node ace index:datasets
Found 150 published datasets to process
Dataset with publish_id 231 successfully indexed
Dataset with publish_id 245 is up to date, skipping indexing
Dataset with publish_id 267 successfully indexed
An error occurred while indexing dataset with publish_id 289. Error: Invalid XML metadata
Processing completed: 148 indexed, 1 skipped, 1 error
```

### Specific Dataset

```bash
node ace index:datasets --publish_id 231
Found 1 published dataset to process
Dataset with publish_id 231 successfully indexed
Processing completed: 1 indexed, 0 skipped, 0 errors
```

## Update Logic

The command uses intelligent indexing to avoid unnecessary processing:

| Condition | Action | Reason |
|-----------|--------|--------|
| Dataset not in index | Index | New dataset needs indexing |
| Dataset newer than indexed version | Re-index | Dataset has been updated |
| Dataset same as/older than indexed version | Skip | Already up to date |
| OpenSearch document check fails | Index | Better safe than sorry |
| Invalid XML metadata | Skip + log error | Cannot process invalid data |

### Timestamp Comparison

```typescript
// Example comparison logic: the index stores server_date_modified as Unix seconds
const existingModified = DateTime.fromMillis(Number(existingDoc.server_date_modified) * 1000);
const currentModified = dataset.server_date_modified;

if (currentModified <= existingModified) {
    // Skip - already up to date
    return false;
}
// Proceed with indexing
```

## XML Transformation Process

### 1. XML Generation

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
    <Dataset>
        <!-- Dataset metadata fields -->
        <title>Research Dataset Title</title>
        <description>Dataset description...</description>
        <!-- Additional metadata -->
    </Dataset>
</root>
```

### 2. XSLT Processing

The command uses Saxon-JS with a compiled stylesheet (solr.sef.json) to transform XML to JSON:

```typescript
import SaxonJS from 'saxon-js';

// `proc` holds the compiled stylesheet (solr.sef.json) read in as text
const result = await SaxonJS.transform({
    stylesheetText: proc,
    destination: 'serialized',
    sourceText: xmlString,
}, 'async');
const doc = JSON.parse(result.principalResult); // the serialized JSON output
```

### 3. Final JSON Document

```json
{
    "id": "231",
    "title": "Research Dataset Title",
    "description": "Dataset description...",
    "authors": ["Author Name"],
    "server_date_modified": 1634567890,
    "publish_id": 231
}
```

## Configuration Requirements

### Environment Variables

```bash
# OpenSearch Configuration
OPENSEARCH_HOST=localhost:9200

# For production:
# OPENSEARCH_HOST=your-opensearch-cluster:9200
```

### Required Files

- **XSLT Stylesheet**: `public/assets2/solr.sef.json` - compiled Saxon-JS stylesheet for the XML-to-JSON transformation
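Should the compiled stylesheet ever need to be regenerated from its XSLT source, Saxon's `xslt3` tool can produce the SEF file; the source filename here is an assumption:

```bash
# Compile the XSLT source into a SEF file for Saxon-JS (source filename is an assumption)
npx xslt3 -xsl:solr.xslt -export:public/assets2/solr.sef.json -nogo
```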

### Database Relationships

The command expects these model relationships:

```typescript
// The Dataset model must define:
@hasOne(() => XmlCache, { foreignKey: 'dataset_id' })
public xmlCache: HasOne<typeof XmlCache>
```

## Error Handling

The command handles various error scenarios gracefully:

### Common Errors and Solutions

| Error | Cause | Solution |
|-------|-------|----------|
| XSLT transformation failed | Invalid XML or missing stylesheet | Check the XML structure and stylesheet path |
| OpenSearch connection error | Service unavailable | Verify OpenSearch is running and accessible |
| JSON parse error | Malformed transformation result | Check the XSLT stylesheet's output format |
| Missing xmlCache relationship | Data integrity issue | Ensure an xmlCache record exists for the dataset |

### Error Logging

```text
# Typical error log entry
An error occurred while indexing dataset with publish_id 231.
Error: XSLT transformation failed: Invalid XML structure at line 15
```

## Performance Considerations

### Batch Processing

- Processes datasets sequentially to avoid overwhelming OpenSearch
- Each dataset is committed individually for reliability
- Failed indexing of one dataset doesn't stop processing of the others (see the sketch below)
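The per-dataset loop might look roughly like this illustrative sketch; the helper and logger names are assumptions:

```typescript
for (const dataset of datasets) {
    try {
        await indexDataset(dataset); // hypothetical helper: transform and index one dataset
        logger.info(`Dataset with publish_id ${dataset.publish_id} successfully indexed`);
    } catch (error) {
        // A single failure is logged and does not abort the remaining datasets
        logger.error(`An error occurred while indexing dataset with publish_id ${dataset.publish_id}. ${error}`);
    }
}
```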

### Resource Usage

- **Memory**: XML/JSON transformations require temporary memory
- **Network**: OpenSearch API calls for each dataset
- **CPU**: XSLT transformations are CPU-intensive

### Optimization Tips

```bash
# Index only recently modified datasets (run regularly)
node ace index:datasets

# Index specific datasets when needed
node ace index:datasets --publish_id 231

# Consider running during off-peak hours for large batches
```

## Integration with Other Systems

### Search Functionality

The indexed documents power:

- **Dataset Search**: Full-text search across metadata
- **Faceted Browsing**: Filter by authors, keywords, dates
- **Geographic Search**: Spatial query capabilities
- **Auto-complete**: Suggestions for dataset titles and keywords

### Related Commands

- `update:datacite` - often run after indexing to sync DOI metadata
- Database migrations - may require re-indexing after schema changes

### API Integration

The indexed data is consumed by:

- **Search API**: `/api/search` endpoints
- **Browse API**: `/api/datasets` with filtering
- **Recommendations**: Related dataset suggestions

## Monitoring and Maintenance

### Regular Tasks

```bash
# Daily indexing (recommended cron job)
0 2 * * * cd /path/to/project && node ace index:datasets

# Weekly full re-index, if your version supports a --force flag
0 3 * * 0 cd /path/to/project && node ace index:datasets --force
```

### Health Checks

- Monitor OpenSearch cluster health
- Check the logs for failed indexing operations
- Verify that search functionality is working
- Compare dataset counts between the database and the index (see the sketch below)
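For the count comparison, a quick sketch; the database table and column names are assumptions to adjust to your schema:

```bash
# Number of documents currently in the index
curl -s "localhost:9200/tethys-records/_count"

# Number of published datasets in the database (table/column names are assumptions)
psql -d tethys -c "SELECT COUNT(*) FROM datasets WHERE server_state = 'published';"
```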

## Troubleshooting

```bash
# Check indexing of a specific dataset
node ace index:datasets --publish_id 231

# Verify OpenSearch connectivity
curl -X GET "localhost:9200/_cluster/health"

# Check index statistics
curl -X GET "localhost:9200/tethys-records/_stats"
```

## Best Practices

1. **Regular Scheduling**: Run the command regularly (e.g. daily) to keep the search index current
2. **Monitor Logs**: Watch for transformation errors or OpenSearch issues
3. **Backup Strategy**: Include OpenSearch indices in backup procedures
4. **Resource Management**: Monitor OpenSearch cluster resources during bulk operations
5. **Testing**: Verify search functionality after major indexing operations
6. **Coordination**: Run indexing before DataCite updates when both are needed