# Dataset Indexing Command

AdonisJS Ace command for indexing and synchronizing published datasets with OpenSearch for search functionality.

## Overview

The `index:datasets` command processes published datasets and creates/updates corresponding search index documents in OpenSearch. It intelligently compares modification timestamps to only re-index datasets when necessary, optimizing performance while maintaining search index accuracy.

## Command Syntax

```bash
node ace index:datasets [options]
```

## Options

| Flag | Alias | Description |
|------|-------|-------------|
| `--publish_id` | `-p` | Index a specific dataset by its `publish_id` |

## Usage Examples

### Basic Operations

```bash
# Index all published datasets that have been modified since last indexing
node ace index:datasets

# Index a specific dataset by publish_id
node ace index:datasets --publish_id 231
node ace index:datasets -p 231
```

## How It Works

### 1. **Dataset Selection**

The command processes datasets that meet these criteria:

- `server_state = 'published'` - only published datasets
- Has a preloaded `xmlCache` relationship for metadata transformation
- Optionally filtered by a specific `publish_id`

### 2. **Smart Update Detection**

For each dataset, the command:

- Checks if the dataset exists in the OpenSearch index
- Compares `server_date_modified` timestamps
- Only re-indexes if the dataset is newer than the indexed version

### 3. **Document Processing**

The indexing process involves:

1. **XML Generation**: Creates structured XML from dataset metadata
2. **XSLT Transformation**: Converts XML to JSON using the Saxon-JS processor
3. **Index Update**: Updates or creates the document in OpenSearch
4.
   **Logging**: Records success/failure for each operation

## Index Structure

### Index Configuration

- **Index Name**: `tethys-records`
- **Document ID**: Dataset `publish_id`
- **Refresh**: `true` (immediate availability)

### Document Fields

The indexed documents contain:

- **Metadata Fields**: Title, description, authors, keywords
- **Identifiers**: DOI, publish_id, and other identifiers
- **Temporal Data**: Publication dates, coverage periods
- **Geographic Data**: Spatial coverage information
- **Technical Details**: Data formats, access information
- **Timestamps**: Creation and modification dates

## Example Output

### Successful Run

```bash
node ace index:datasets
```

```
Found 150 published datasets to process
Dataset with publish_id 231 successfully indexed
Dataset with publish_id 245 is up to date, skipping indexing
Dataset with publish_id 267 successfully indexed
An error occurred while indexing dataset with publish_id 289. Error: Invalid XML metadata
Processing completed: 148 indexed, 1 skipped, 1 error
```

### Specific Dataset

```bash
node ace index:datasets --publish_id 231
```

```
Found 1 published dataset to process
Dataset with publish_id 231 successfully indexed
Processing completed: 1 indexed, 0 skipped, 0 errors
```

## Update Logic

The command uses intelligent indexing to avoid unnecessary processing:

| Condition | Action | Reason |
|-----------|--------|--------|
| Dataset not in index | ✅ Index | New dataset needs indexing |
| Dataset newer than indexed version | ✅ Re-index | Dataset has been updated |
| Dataset same/older than indexed version | ❌ Skip | Already up to date |
| OpenSearch document check fails | ✅ Index | Better safe than sorry |
| Invalid XML metadata | ❌ Skip + Log Error | Cannot process invalid data |

### Timestamp Comparison

```typescript
// Example comparison logic
const existingModified = DateTime.fromMillis(Number(existingDoc.server_date_modified) * 1000);
const currentModified = dataset.server_date_modified;
if
(currentModified <= existingModified) {
  // Skip - already up to date
  return false;
}
// Proceed with indexing
```

## XML Transformation Process

### 1. **XML Generation**

```xml
<dataset>
  <title>Research Dataset Title</title>
  <description>Dataset description...</description>
</dataset>
```

### 2. **XSLT Processing**

The command uses Saxon-JS with a compiled stylesheet (`solr.sef.json`) to transform XML to JSON:

```javascript
const result = await SaxonJS.transform({
  stylesheetText: proc,
  destination: 'serialized',
  sourceText: xmlString,
});
```

### 3. **Final JSON Document**

```json
{
  "id": "231",
  "title": "Research Dataset Title",
  "description": "Dataset description...",
  "authors": ["Author Name"],
  "server_date_modified": 1634567890,
  "publish_id": 231
}
```

## Configuration Requirements

### Environment Variables

```bash
# OpenSearch Configuration
OPENSEARCH_HOST=localhost:9200

# For production:
# OPENSEARCH_HOST=your-opensearch-cluster:9200
```

### Required Files

- **XSLT Stylesheet**: `public/assets2/solr.sef.json` - compiled Saxon-JS stylesheet for XML transformation

### Database Relationships

The command expects these model relationships:

```typescript
// Dataset model must have:
@hasOne(() => XmlCache, { foreignKey: 'dataset_id' })
public xmlCache: HasOne<typeof XmlCache>
```

## Error Handling

The command handles various error scenarios gracefully:

### Common Errors and Solutions

| Error | Cause | Solution |
|-------|-------|----------|
| `XSLT transformation failed` | Invalid XML or missing stylesheet | Check XML structure and stylesheet path |
| `OpenSearch connection error` | Service unavailable | Verify OpenSearch is running and accessible |
| `JSON parse error` | Malformed transformation result | Check XSLT stylesheet output format |
| `Missing xmlCache relationship` | Data integrity issue | Ensure xmlCache exists for dataset |

### Error Logging

```bash
# Typical error log entry
An error occurred while indexing dataset with publish_id 231.
Error: XSLT transformation failed: Invalid XML structure at line 15
```

## Performance Considerations

### Batch Processing

- Processes datasets sequentially to avoid overwhelming OpenSearch
- Each dataset is committed individually for reliability
- Failed indexing of one dataset doesn't stop processing of the others

### Resource Usage

- **Memory**: XML/JSON transformations require temporary memory
- **Network**: OpenSearch API calls for each dataset
- **CPU**: XSLT transformations are CPU-intensive

### Optimization Tips

```bash
# Index only recently modified datasets (run regularly)
node ace index:datasets

# Index specific datasets when needed
node ace index:datasets --publish_id 231

# Consider running during off-peak hours for large batches
```

## Integration with Other Systems

### Search Functionality

The indexed documents power:

- **Dataset Search**: Full-text search across metadata
- **Faceted Browsing**: Filter by authors, keywords, dates
- **Geographic Search**: Spatial query capabilities
- **Auto-complete**: Suggest dataset titles and keywords

### Related Commands

- [`update:datacite`](update-datacite.md) - often run after indexing to sync DOI metadata
- **Database migrations** - may require re-indexing after schema changes

### API Integration

The indexed data is consumed by:

- **Search API**: `/api/search` endpoints
- **Browse API**: `/api/datasets` with filtering
- **Recommendations**: Related dataset suggestions

## Monitoring and Maintenance

### Regular Tasks

```bash
# Daily indexing (recommended cron job)
0 2 * * * cd /path/to/project && node ace index:datasets

# Weekly full re-index (if needed)
0 3 * * 0 cd /path/to/project && node ace index:datasets --force
```

### Health Checks

- Monitor OpenSearch cluster health
- Check for failed indexing operations in logs
- Verify search functionality is working
- Compare dataset counts between database and index

### Troubleshooting

```bash
# Check specific dataset indexing
node ace index:datasets --publish_id 231
# Verify OpenSearch connectivity
curl -X GET "localhost:9200/_cluster/health"

# Check index statistics
curl -X GET "localhost:9200/tethys-records/_stats"
```

## Best Practices

1. **Regular Scheduling**: Run the command regularly (daily) to keep the search index current
2. **Monitor Logs**: Watch for transformation errors or OpenSearch issues
3. **Backup Strategy**: Include OpenSearch indices in backup procedures
4. **Resource Management**: Monitor OpenSearch cluster resources during bulk operations
5. **Testing**: Verify search functionality after major indexing operations
6. **Coordination**: Run indexing before DataCite updates when both are needed
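The Update Logic table above can be condensed into a single predicate. The following is a minimal sketch under stated assumptions, not the command's actual code: `IndexedDoc` and `shouldIndex` are illustrative names, and timestamps are simplified to plain epoch seconds (the real command compares Luxon `DateTime` values, as shown in the Timestamp Comparison section).

```typescript
// Hypothetical sketch of the update-detection rule; names are illustrative.

interface IndexedDoc {
  server_date_modified: number; // epoch seconds, as stored in the index
}

function shouldIndex(
  datasetModified: number,       // dataset's server_date_modified (epoch seconds)
  indexedDoc: IndexedDoc | null, // null when the dataset is not in the index yet
  existenceCheckFailed: boolean = false,
): boolean {
  if (existenceCheckFailed) return true; // check failed: index anyway ("better safe than sorry")
  if (indexedDoc === null) return true;  // new dataset: needs indexing
  // Re-index only when the dataset is strictly newer than the indexed copy
  return datasetModified > indexedDoc.server_date_modified;
}

// New dataset: index
console.log(shouldIndex(1634567890, null));                                 // true
// Same timestamp: skip
console.log(shouldIndex(1634567890, { server_date_modified: 1634567890 })); // false
// Dataset updated since last indexing: re-index
console.log(shouldIndex(1634567999, { server_date_modified: 1634567890 })); // true
```

Keeping the decision in one pure function like this makes the skip/re-index behavior easy to unit-test independently of OpenSearch.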