Dataset Indexing Command
An AdonisJS Ace command for indexing and synchronizing published datasets with OpenSearch to power search functionality.
Overview
The index:datasets command processes published datasets and creates/updates corresponding search index documents in OpenSearch. It intelligently compares modification timestamps to only re-index datasets when necessary, optimizing performance while maintaining search index accuracy.
Command Syntax
node ace index:datasets [options]
Options
| Flag | Alias | Description |
|---|---|---|
| `--publish_id <number>` | `-p` | Index a specific dataset by `publish_id` |
Usage Examples
Basic Operations
# Index all published datasets that have been modified since last indexing
node ace index:datasets
# Index a specific dataset by publish_id
node ace index:datasets --publish_id 231
node ace index:datasets -p 231
How It Works
1. Dataset Selection
The command processes datasets that meet these criteria:
- `server_state = 'published'`: only published datasets are processed
- Has a preloaded `xmlCache` relationship for metadata transformation
- Optionally filtered by a specific `publish_id`
2. Smart Update Detection
For each dataset, the command:
- Checks if the dataset exists in the OpenSearch index
- Compares `server_date_modified` timestamps
- Only re-indexes if the dataset is newer than the indexed version
3. Document Processing
The indexing process involves these steps, sketched in code after the list:
- XML Generation: Creates structured XML from dataset metadata
- XSLT Transformation: Converts XML to JSON using Saxon-JS processor
- Index Update: Updates or creates the document in OpenSearch
- Logging: Records success/failure for each operation
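A minimal sketch of how these steps fit together inside the command's run loop; the helper names (`buildXml`, `transformToJson`, `indexDocument`) are hypothetical, and the real implementation may differ:
// Inside the Ace command (sketch): process each selected dataset in turn.
for (const dataset of datasets) {
    try {
        const xmlString = await buildXml(dataset);          // 1. XML generation
        const jsonDoc = await transformToJson(xmlString);   // 2. XSLT transformation via Saxon-JS
        await indexDocument(dataset.publish_id, jsonDoc);   // 3. create/update in OpenSearch
        this.logger.info(`Dataset with publish_id ${dataset.publish_id} successfully indexed`);
    } catch (error) {
        // 4. log the failure and continue with the remaining datasets
        this.logger.error(`An error occurred while indexing dataset with publish_id ${dataset.publish_id}. ${error}`);
    }
}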
Index Structure
Index Configuration
- Index Name: `tethys-records`
- Document ID: the dataset's `publish_id`
- Refresh: `true` (immediate availability)
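A minimal sketch of the corresponding index call, assuming the official `@opensearch-project/opensearch` Node.js client; the client setup and helper name are illustrative:
import { Client } from '@opensearch-project/opensearch';

const client = new Client({ node: `http://${process.env.OPENSEARCH_HOST}` });

// Hypothetical helper: create or update the search document for one dataset.
async function indexDocument(publishId: number, doc: Record<string, unknown>) {
    await client.index({
        index: 'tethys-records', // index name from the configuration above
        id: String(publishId),   // document ID is the dataset's publish_id
        body: doc,               // JSON produced by the XSLT transformation
        refresh: true,           // make the document searchable immediately
    });
}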
Document Fields
The indexed documents contain:
- Metadata Fields: Title, description, authors, keywords
- Identifiers: DOI, publish_id, and other identifiers
- Temporal Data: Publication dates, coverage periods
- Geographic Data: Spatial coverage information
- Technical Details: Data formats, access information
- Timestamps: Creation and modification dates
Example Output
Successful Run
node ace index:datasets
Found 150 published datasets to process
Dataset with publish_id 231 successfully indexed
Dataset with publish_id 245 is up to date, skipping indexing
Dataset with publish_id 267 successfully indexed
An error occurred while indexing dataset with publish_id 289. Error: Invalid XML metadata
Processing completed: 148 indexed, 1 skipped, 1 error
Specific Dataset
node ace index:datasets --publish_id 231
Found 1 published dataset to process
Dataset with publish_id 231 successfully indexed
Processing completed: 1 indexed, 0 skipped, 0 errors
Update Logic
The command uses intelligent indexing to avoid unnecessary processing:
| Condition | Action | Reason |
|---|---|---|
| Dataset not in index | ✅ Index | New dataset needs indexing |
| Dataset newer than indexed version | ✅ Re-index | Dataset has been updated |
| Dataset same/older than indexed version | ❌ Skip | Already up to date |
| OpenSearch document check fails | ✅ Index | Better safe than sorry |
| Invalid XML metadata | ❌ Skip + Log Error | Cannot process invalid data |
Timestamp Comparison
// Example comparison logic: the index stores server_date_modified as epoch seconds
const existingModified = DateTime.fromMillis(Number(existingDoc.server_date_modified) * 1000);
const currentModified = dataset.server_date_modified;
if (currentModified <= existingModified) {
    // Skip - already up to date
    return false;
}
// Proceed with indexing
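Putting the lookup and the comparison together, a hedged sketch of the full decision, reusing the `client` from the earlier indexing example; a failed lookup falls back to indexing, matching the "better safe than sorry" rule in the table above:
import { DateTime } from 'luxon';

// Hypothetical helper: decide whether a dataset needs (re-)indexing.
async function needsIndexing(dataset: Dataset): Promise<boolean> {
    try {
        const { body } = await client.get({ index: 'tethys-records', id: String(dataset.publish_id) });
        const existingModified = DateTime.fromMillis(Number(body._source.server_date_modified) * 1000);
        return dataset.server_date_modified > existingModified; // re-index only if newer
    } catch {
        return true; // not in the index, or the check failed: index it
    }
}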
XML Transformation Process
1. XML Generation
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
<Dataset>
<!-- Dataset metadata fields -->
<title>Research Dataset Title</title>
<description>Dataset description...</description>
<!-- Additional metadata -->
</Dataset>
</root>
2. XSLT Processing
The command uses Saxon-JS with a compiled stylesheet (solr.sef.json) to transform XML to JSON:
const result = await SaxonJS.transform({
    stylesheetText: proc,       // contents of the compiled solr.sef.json stylesheet
    destination: 'serialized',  // return the transformation result as a string
    sourceText: xmlString,      // XML generated from the dataset metadata
});
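A sketch of the surrounding code, assuming the stylesheet path from "Required Files" below; with `destination: 'serialized'`, Saxon-JS returns the serialized output in `principalResult`:
import { readFile } from 'node:fs/promises';
import SaxonJS from 'saxon-js';

const proc = await readFile('public/assets2/solr.sef.json', 'utf8'); // compiled stylesheet
const result = await SaxonJS.transform({
    stylesheetText: proc,
    destination: 'serialized',
    sourceText: xmlString,
});
const jsonDoc = JSON.parse(result.principalResult); // the document sent to OpenSearch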
3. Final JSON Document
{
"id": "231",
"title": "Research Dataset Title",
"description": "Dataset description...",
"authors": ["Author Name"],
"server_date_modified": 1634567890,
"publish_id": 231
}
Configuration Requirements
Environment Variables
# OpenSearch Configuration
OPENSEARCH_HOST=localhost:9200
# For production:
# OPENSEARCH_HOST=your-opensearch-cluster:9200
Required Files
- XSLT Stylesheet: `public/assets2/solr.sef.json`, the compiled Saxon-JS stylesheet for XML transformation
Database Relationships
The command expects these model relationships:
// Dataset model must have (imports depend on the project's Lucid version):
@hasOne(() => XmlCache, { foreignKey: 'dataset_id' })
public xmlCache: HasOne<typeof XmlCache>
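With that relationship in place, the dataset selection described above maps to a Lucid query along these lines (a sketch, not the command's exact code):
const query = Dataset.query()
    .where('server_state', 'published') // only published datasets
    .preload('xmlCache');               // metadata needed for XML generation
if (publishId) {
    query.where('publish_id', publishId); // optional -p / --publish_id filter
}
const datasets = await query;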
Error Handling
The command handles various error scenarios gracefully:
Common Errors and Solutions
| Error | Cause | Solution |
|---|---|---|
| `XSLT transformation failed` | Invalid XML or missing stylesheet | Check XML structure and stylesheet path |
| `OpenSearch connection error` | Service unavailable | Verify OpenSearch is running and accessible |
| `JSON parse error` | Malformed transformation result | Check XSLT stylesheet output format |
| Missing `xmlCache` relationship | Data integrity issue | Ensure `xmlCache` exists for the dataset |
Error Logging
# Typical error log entry
An error occurred while indexing dataset with publish_id 231.
Error: XSLT transformation failed: Invalid XML structure at line 15
Performance Considerations
Batch Processing
- Processes datasets sequentially to avoid overwhelming OpenSearch
- Each dataset is committed individually for reliability
- Failed indexing of one dataset doesn't stop processing others
Resource Usage
- Memory: XML/JSON transformations require temporary memory
- Network: OpenSearch API calls for each dataset
- CPU: XSLT transformations are CPU-intensive
Optimization Tips
# Index only recently modified datasets (run regularly)
node ace index:datasets
# Index specific datasets when needed
node ace index:datasets --publish_id 231
# Consider running during off-peak hours for large batches
Integration with Other Systems
Search Functionality
The indexed documents power the following features; an illustrative query follows the list:
- Dataset Search: Full-text search across metadata
- Faceted Browsing: Filter by authors, keywords, dates
- Geographic Search: Spatial query capabilities
- Auto-complete: Suggest dataset titles and keywords
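For illustration, a basic full-text query against the index could look like this; the field names are assumptions drawn from the "Document Fields" list above:
// Illustrative search; reuses the OpenSearch client from the indexing sketch.
const { body } = await client.search({
    index: 'tethys-records',
    body: {
        query: {
            multi_match: {
                query: 'groundwater quality',                 // user's search terms
                fields: ['title', 'description', 'keywords'], // assumed field names
            },
        },
    },
});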
Related Commands
- `update:datacite`: often run after indexing to sync DOI metadata
- Database migrations: may require re-indexing after schema changes
API Integration
The indexed data is consumed by:
- Search API: `/api/search` endpoints
- Browse API: `/api/datasets` with filtering
- Recommendations: related dataset suggestions
Monitoring and Maintenance
Regular Tasks
# Daily indexing (recommended cron job)
0 2 * * * cd /path/to/project && node ace index:datasets
# Weekly full re-index, if needed (note: a --force flag is not listed in the options above)
0 3 * * 0 cd /path/to/project && node ace index:datasets --force
Health Checks
- Monitor OpenSearch cluster health
- Check for failed indexing operations in logs
- Verify search functionality is working
- Compare dataset counts between database and index
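The last check can be scripted; a sketch reusing the `Dataset` model and OpenSearch client from the earlier examples (names are assumptions):
// Compare the published-dataset count in the database with the index.
const rows = await Dataset.query().where('server_state', 'published').count('* as total');
const { body } = await client.count({ index: 'tethys-records' });
console.log(`database: ${rows[0].$extras.total}, index: ${body.count}`);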
Troubleshooting
# Check specific dataset indexing
node ace index:datasets --publish_id 231
# Verify OpenSearch connectivity
curl -X GET "localhost:9200/_cluster/health"
# Check index statistics
curl -X GET "localhost:9200/tethys-records/_stats"
Best Practices
- Regular Scheduling: Run the command regularly (daily) to keep the search index current
- Monitor Logs: Watch for transformation errors or OpenSearch issues
- Backup Strategy: Include OpenSearch indices in backup procedures
- Resource Management: Monitor OpenSearch cluster resources during bulk operations
- Testing: Verify search functionality after major indexing operations
- Coordination: Run indexing before DataCite updates when both are needed