- feat: Enhance README with setup instructions, usage, and command documentation
- fix: Update API routes to include DOI URL handling and improve route organization
- chore: Add ORCID preload rule file and ensure proper registration
- docs: Add MIT License to the project for open-source compliance
- feat: Implement command to detect and fix missing dataset cross-references
- feat: Create command for updating DataCite DOI records with detailed logging and error handling
- docs: Add comprehensive documentation for dataset indexing command
- docs: Create detailed documentation for DataCite update command with usage examples and error handling
Parent: 8f67839f93
Commit: c049b22723
11 changed files with 2187 additions and 555 deletions
docs/commands/index-datasets.md (new file)

@@ -0,0 +1,278 @@
# Dataset Indexing Command

AdonisJS Ace command for indexing and synchronizing published datasets with OpenSearch for search functionality.

## Overview

The `index:datasets` command processes published datasets and creates or updates the corresponding search index documents in OpenSearch. It compares modification timestamps to re-index datasets only when necessary, optimizing performance while maintaining search index accuracy.

## Command Syntax

```bash
node ace index:datasets [options]
```

## Options

| Flag | Alias | Description |
|------|-------|-------------|
| `--publish_id <number>` | `-p` | Index a specific dataset by `publish_id` |

## Usage Examples

### Basic Operations

```bash
# Index all published datasets that have been modified since last indexing
node ace index:datasets

# Index a specific dataset by publish_id
node ace index:datasets --publish_id 231
node ace index:datasets -p 231
```

## How It Works
### 1. **Dataset Selection**

The command processes datasets that meet these criteria (a query sketch follows the list):
- `server_state = 'published'` - Only published datasets
- Has preloaded `xmlCache` relationship for metadata transformation
- Optionally filtered by specific `publish_id`
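A minimal sketch of this selection in Lucid, assuming a `Dataset` model with the fields and relationship named above (the command's actual query may differ):

```typescript
import Dataset from 'App/Models/Dataset'; // import path is an assumption

// Hypothetical helper: select published datasets, optionally narrowed to one publish_id
async function selectDatasets(publishId?: number) {
    const query = Dataset.query()
        .where('server_state', 'published')
        .preload('xmlCache'); // metadata needed for the XML transformation

    if (publishId) {
        query.where('publish_id', publishId);
    }
    return await query;
}
```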
### 2. **Smart Update Detection**

For each dataset, the command:
- Checks if the dataset exists in the OpenSearch index
- Compares `server_date_modified` timestamps
- Only re-indexes if the dataset is newer than the indexed version
### 3. **Document Processing**

The indexing process involves:
1. **XML Generation**: Creates structured XML from dataset metadata
2. **XSLT Transformation**: Converts XML to JSON using the Saxon-JS processor
3. **Index Update**: Updates or creates the document in OpenSearch
4. **Logging**: Records success/failure for each operation
## Index Structure

### Index Configuration
- **Index Name**: `tethys-records`
- **Document ID**: Dataset `publish_id`
- **Refresh**: `true` (immediate availability)
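As a hedged sketch, this configuration corresponds to an index call like the following with the `@opensearch-project/opensearch` client (the client setup and the `doc` variable are assumptions, not the command's actual code):

```typescript
import { Client } from '@opensearch-project/opensearch';

const client = new Client({ node: 'http://localhost:9200' }); // host from OPENSEARCH_HOST

// Create or update the search document for one dataset
await client.index({
    index: 'tethys-records',         // index name from the configuration above
    id: String(dataset.publish_id),  // document ID is the publish_id
    body: doc,                       // JSON produced by the XSLT transformation
    refresh: true,                   // immediate availability
});
```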
### Document Fields

The indexed documents contain:
- **Metadata Fields**: Title, description, authors, keywords
- **Identifiers**: DOI, `publish_id`, and other identifiers
- **Temporal Data**: Publication dates, coverage periods
- **Geographic Data**: Spatial coverage information
- **Technical Details**: Data formats, access information
- **Timestamps**: Creation and modification dates
## Example Output

### Successful Run

```bash
node ace index:datasets
```

```
Found 150 published datasets to process
Dataset with publish_id 231 successfully indexed
Dataset with publish_id 245 is up to date, skipping indexing
Dataset with publish_id 267 successfully indexed
An error occurred while indexing dataset with publish_id 289. Error: Invalid XML metadata
Processing completed: 148 indexed, 1 skipped, 1 error
```

### Specific Dataset

```bash
node ace index:datasets --publish_id 231
```

```
Found 1 published dataset to process
Dataset with publish_id 231 successfully indexed
Processing completed: 1 indexed, 0 skipped, 0 errors
```
## Update Logic

The command uses intelligent indexing to avoid unnecessary processing:

| Condition | Action | Reason |
|-----------|--------|--------|
| Dataset not in index | ✅ Index | New dataset needs indexing |
| Dataset newer than indexed version | ✅ Re-index | Dataset has been updated |
| Dataset same/older than indexed version | ❌ Skip | Already up to date |
| OpenSearch document check fails | ✅ Index | Better safe than sorry |
| Invalid XML metadata | ❌ Skip + Log Error | Cannot process invalid data |
### Timestamp Comparison

```typescript
import { DateTime } from 'luxon';

// Example comparison logic: the indexed document stores the timestamp
// as Unix seconds, so convert it to a DateTime before comparing
const existingModified = DateTime.fromMillis(Number(existingDoc.server_date_modified) * 1000);
const currentModified = dataset.server_date_modified;

if (currentModified <= existingModified) {
    // Skip - already up to date
    return false;
}
// Proceed with indexing
```
## XML Transformation Process

### 1. **XML Generation**
```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
    <Dataset>
        <!-- Dataset metadata fields -->
        <title>Research Dataset Title</title>
        <description>Dataset description...</description>
        <!-- Additional metadata -->
    </Dataset>
</root>
```
### 2. **XSLT Processing**

The command uses Saxon-JS with a compiled stylesheet (`solr.sef.json`) to transform the XML to JSON:
```javascript
// proc holds the contents of the compiled stylesheet (solr.sef.json)
const result = await SaxonJS.transform({
    stylesheetText: proc,
    destination: 'serialized',
    sourceText: xmlString,
});
```
### 3. **Final JSON Document**
```json
{
    "id": "231",
    "title": "Research Dataset Title",
    "description": "Dataset description...",
    "authors": ["Author Name"],
    "server_date_modified": 1634567890,
    "publish_id": 231
}
```
## Configuration Requirements

### Environment Variables
```bash
# OpenSearch Configuration
OPENSEARCH_HOST=localhost:9200

# For production:
# OPENSEARCH_HOST=your-opensearch-cluster:9200
```

### Required Files
- **XSLT Stylesheet**: `public/assets2/solr.sef.json` - Compiled Saxon-JS stylesheet for XML transformation
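A small sketch of how the compiled stylesheet can be loaded into the `proc` variable used by the XSLT snippet above (assuming it is read from disk at this path):

```typescript
import { readFile } from 'node:fs/promises';

// Load the compiled Saxon-JS stylesheet (SEF) as a string
const proc = await readFile('public/assets2/solr.sef.json', 'utf8');
```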
### Database Relationships

The command expects these model relationships:
```typescript
// On the Dataset model (hasOne and HasOne come from the Lucid ORM):
@hasOne(() => XmlCache, { foreignKey: 'dataset_id' })
public xmlCache: HasOne<typeof XmlCache>
```
## Error Handling

The command handles various error scenarios gracefully:

### Common Errors and Solutions

| Error | Cause | Solution |
|-------|-------|----------|
| `XSLT transformation failed` | Invalid XML or missing stylesheet | Check XML structure and stylesheet path |
| `OpenSearch connection error` | Service unavailable | Verify OpenSearch is running and accessible |
| `JSON parse error` | Malformed transformation result | Check XSLT stylesheet output format |
| `Missing xmlCache relationship` | Data integrity issue | Ensure xmlCache exists for the dataset |
### Error Logging

```
# Typical error log entry
An error occurred while indexing dataset with publish_id 231.
Error: XSLT transformation failed: Invalid XML structure at line 15
```
## Performance Considerations

### Batch Processing
- Processes datasets sequentially to avoid overwhelming OpenSearch
- Each dataset is committed individually for reliability
- Failed indexing of one dataset doesn't stop processing of the others

### Resource Usage
- **Memory**: XML/JSON transformations require temporary memory
- **Network**: OpenSearch API calls for each dataset
- **CPU**: XSLT transformations are CPU-intensive
### Optimization Tips
```bash
# Index only recently modified datasets (run regularly)
node ace index:datasets

# Index specific datasets when needed
node ace index:datasets --publish_id 231

# Consider running during off-peak hours for large batches
```
## Integration with Other Systems

### Search Functionality
The indexed documents power:
- **Dataset Search**: Full-text search across metadata
- **Faceted Browsing**: Filter by authors, keywords, dates
- **Geographic Search**: Spatial query capabilities
- **Auto-complete**: Suggest dataset titles and keywords
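For illustration, a full-text query of the kind these documents support, reusing the client from the earlier sketch (the field names here are assumptions based on the Document Fields list):

```typescript
// Hypothetical full-text search across indexed metadata fields
const { body } = await client.search({
    index: 'tethys-records',
    body: {
        query: {
            multi_match: {
                query: 'groundwater',                       // user's search term
                fields: ['title', 'description', 'author'], // assumed field names
            },
        },
    },
});
console.log(body.hits.total);
```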
### Related Commands
- [`update:datacite`](update-datacite.md) - Often run after indexing to sync DOI metadata
- **Database migrations** - May require re-indexing after schema changes

### API Integration
The indexed data is consumed by:
- **Search API**: `/api/search` endpoints
- **Browse API**: `/api/datasets` with filtering
- **Recommendations**: Related dataset suggestions
## Monitoring and Maintenance

### Regular Tasks
```bash
# Daily indexing (recommended cron job)
0 2 * * * cd /path/to/project && node ace index:datasets

# Weekly safety-net run (up-to-date datasets are skipped automatically)
0 3 * * 0 cd /path/to/project && node ace index:datasets
```
### Health Checks
- Monitor OpenSearch cluster health
- Check for failed indexing operations in logs
- Verify search functionality is working
- Compare dataset counts between database and index
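A hedged sketch of the last check, comparing the published-dataset count in the database with the document count in the index (reusing the `Dataset` model and `client` from the earlier sketches):

```typescript
// Compare database count with index count; a mismatch suggests missing documents
const dbRows = await Dataset.query().where('server_state', 'published').count('* as total');
const { body: countResponse } = await client.count({ index: 'tethys-records' });

console.log(`database: ${dbRows[0].$extras.total}, index: ${countResponse.count}`);
```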
### Troubleshooting
```bash
# Check specific dataset indexing
node ace index:datasets --publish_id 231

# Verify OpenSearch connectivity
curl -X GET "localhost:9200/_cluster/health"

# Check index statistics
curl -X GET "localhost:9200/tethys-records/_stats"
```
## Best Practices

1. **Regular Scheduling**: Run the command regularly (daily) to keep the search index current
2. **Monitor Logs**: Watch for transformation errors or OpenSearch issues
3. **Backup Strategy**: Include OpenSearch indices in backup procedures
4. **Resource Management**: Monitor OpenSearch cluster resources during bulk operations
5. **Testing**: Verify search functionality after major indexing operations
6. **Coordination**: Run indexing before DataCite updates when both are needed
docs/commands/update-datacite.md (new file)

@@ -0,0 +1,216 @@
# DataCite Update Command

AdonisJS Ace command for updating DataCite DOI records for published datasets.

## Overview

The `update:datacite` command synchronizes your local dataset metadata with DataCite DOI records. It intelligently compares modification dates to only update records when necessary, reducing unnecessary API calls and maintaining data consistency.

## Command Syntax

```bash
node ace update:datacite [options]
```

## Options

| Flag | Alias | Description |
|------|-------|-------------|
| `--publish_id <number>` | `-p` | Update a specific dataset by `publish_id` |
| `--force` | `-f` | Force update all records regardless of modification date |
| `--dry-run` | `-d` | Preview what would be updated without making changes |
| `--stats` | `-s` | Show detailed statistics for datasets that need updating |

## Usage Examples
### Basic Operations

```bash
# Update all datasets that have been modified since their DOI was last updated
node ace update:datacite

# Update a specific dataset
node ace update:datacite --publish_id 231
node ace update:datacite -p 231

# Force update all datasets with DOIs (ignores modification dates)
node ace update:datacite --force
```

### Preview and Analysis

```bash
# Preview what would be updated (dry run)
node ace update:datacite --dry-run

# Show detailed statistics for datasets that need updating
node ace update:datacite --stats

# Show stats for a specific dataset
node ace update:datacite --stats --publish_id 231
```

### Combined Options

```bash
# Dry run for a specific dataset
node ace update:datacite --dry-run --publish_id 231

# Show stats for all datasets (including up-to-date ones)
node ace update:datacite --stats --force
```
## Command Modes

### 1. **Normal Mode** (Default)
Updates DataCite records for datasets that have been modified since their DOI was last updated.

**Example Output:**
```
Using DataCite API: https://api.test.datacite.org
Found 50 datasets to process
Dataset 231: Successfully updated DataCite record
Dataset 245: Up to date, skipping
Dataset 267: Successfully updated DataCite record
DataCite update completed. Updated: 15, Skipped: 35, Errors: 0
```

### 2. **Dry Run Mode** (`--dry-run`)
Shows what would be updated without making any changes to DataCite.

**Use Case:** Preview updates before running the actual command.

**Example Output:**
```
Dataset 231: Would update DataCite record (dry run)
Dataset 267: Would update DataCite record (dry run)
Dataset 245: Up to date, skipping
DataCite update completed. Updated: 2, Skipped: 1, Errors: 0
```

### 3. **Stats Mode** (`--stats`)
Shows detailed information for each dataset that needs updating, including why it needs updating.

**Use Case:** Debug synchronization issues, monitor dataset/DOI status, generate reports.

**Example Output:**
```
┌─ Dataset 231 ─────────────────────────────────────────────────────────
│ DOI Value: 10.21388/tethys.231
│ DOI Status (DB): findable
│ DOI State (DataCite): findable
│ Dataset Modified: 2024-09-15T10:30:00.000Z
│ DOI Modified: 2024-09-10T08:15:00.000Z
│ Needs Update: YES - Dataset newer than DOI
└───────────────────────────────────────────────────────────────────────

┌─ Dataset 267 ─────────────────────────────────────────────────────────
│ DOI Value: 10.21388/tethys.267
│ DOI Status (DB): findable
│ DOI State (DataCite): findable
│ Dataset Modified: 2024-09-18T14:20:00.000Z
│ DOI Modified: 2024-09-16T12:45:00.000Z
│ Needs Update: YES - Dataset newer than DOI
└───────────────────────────────────────────────────────────────────────

DataCite Stats Summary: 2 datasets need updating, 48 are up to date
```
## Update Logic

The command uses intelligent update detection:

1. **Compares modification dates**: Dataset `server_date_modified` vs the DOI's last modification date from DataCite
2. **Validates data integrity**: Checks for missing or future dates
3. **Handles API failures gracefully**: Updates anyway if DataCite info can't be retrieved
4. **Uses a dual API approach**: DataCite REST API (primary) with MDS API fallback
### When Updates Happen

| Condition | Action | Reason |
|-----------|--------|--------|
| Dataset modified > DOI modified | ✅ Update | Dataset has newer changes |
| Dataset modified ≤ DOI modified | ❌ Skip | DOI is up to date |
| Dataset date in future | ❌ Skip | Invalid data, needs investigation |
| Dataset date missing | ✅ Update | Can't determine staleness |
| DataCite API error | ✅ Update | Better safe than sorry |
| `--force` flag used | ✅ Update | Override all logic |
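The table above can be summarized in a sketch like the following (an illustration of the documented rules, not the command's actual code; names are assumptions):

```typescript
import { DateTime } from 'luxon';

// Illustrative decision logic for the conditions in the table above
function needsDataCiteUpdate(
    datasetModified: DateTime | null,
    doiModified: DateTime | null,
    force: boolean,
): boolean {
    if (force) return true;                                          // --force overrides all logic
    if (!datasetModified) return true;                               // missing date: can't determine staleness
    if (datasetModified.toMillis() > Date.now()) return false;       // future date: invalid data, skip
    if (!doiModified) return true;                                   // DataCite info unavailable: better safe than sorry
    return datasetModified.toMillis() > doiModified.toMillis();      // update only if the dataset is newer
}
```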
## Environment Configuration

Required environment variables:

```bash
# DataCite Credentials
DATACITE_USERNAME=your_username
DATACITE_PASSWORD=your_password

# API Endpoints (test environment)
DATACITE_API_URL=https://api.test.datacite.org
DATACITE_SERVICE_URL=https://mds.test.datacite.org

# API Endpoints (production)
# DATACITE_API_URL=https://api.datacite.org
# DATACITE_SERVICE_URL=https://mds.datacite.org

# Project Configuration
DATACITE_PREFIX=10.21388    # Your DOI prefix
BASE_DOMAIN=tethys.at       # Your domain
```
## Error Handling

The command handles various error scenarios:

- **Invalid modification dates**: Logs errors but continues processing other datasets
- **DataCite API failures**: Falls back to the MDS API, then to a safe update
- **Missing DOI identifiers**: Skips datasets without DOI identifiers
- **Network issues**: Continues with the next dataset after logging the error
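A hedged sketch of the REST-first, MDS-fallback flow described above (the interface and method names below are hypothetical, used only to illustrate the order of attempts):

```typescript
// Hypothetical client shape, only to illustrate the fallback order
interface DoiClientLike {
    updateViaRestApi(doi: string, xml: string): Promise<void>;
    updateViaMdsApi(doi: string, xml: string): Promise<void>;
}

async function updateWithFallback(client: DoiClientLike, doi: string, xml: string): Promise<void> {
    try {
        await client.updateViaRestApi(doi, xml);   // primary: DataCite REST API
    } catch {
        console.warn(`REST update failed for ${doi}, trying MDS fallback`);
        await client.updateViaMdsApi(doi, xml);    // fallback: MDS API
    }
}
```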
## Integration

The command integrates with:

- **Dataset Model**: Uses `server_date_modified` for change detection
- **DatasetIdentifier Model**: Reads DOI values and status
- **OpenSearch Index**: Updates the search index after the DataCite update
- **DoiClient**: Handles all DataCite API interactions
## Common Workflows

### Daily Maintenance
```bash
# Update any datasets modified today
node ace update:datacite
```

### Pre-Deployment Check
```bash
# Check what would be updated before deployment
node ace update:datacite --dry-run
```

### Debugging Sync Issues
```bash
# Investigate why a specific dataset isn't syncing
node ace update:datacite --stats --publish_id 231
```

### Full Resync
```bash
# Force update all DOI records (use with caution)
node ace update:datacite --force
```

### Monitoring Report
```bash
# Generate a sync status report
node ace update:datacite --stats > datacite-sync-report.txt
```
## Best Practices

1. **Regular Updates**: Run daily or after bulk dataset modifications
2. **Test First**: Use `--dry-run` or `--stats` before bulk operations
3. **Monitor Logs**: Check for data integrity warnings
4. **Environment Separation**: Use the correct API URLs for test vs production
5. **Rate Limiting**: The command handles DataCite rate limits automatically