Crawling Manager API

The Crawling Manager API provides endpoints for scheduling, managing, and monitoring data crawling jobs across various connector types including Google Workspace, OneDrive, SharePoint Online, Slack, and Confluence.

Base URL

All endpoints are prefixed with /api/v1/crawlingManager

Authentication

All endpoints require:
  • Authentication via Authorization header with valid JWT token
  • Admin privileges - Only users with the admin role can access these endpoints
Headers:
Authorization: Bearer <jwt_token>
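Example request (a minimal TypeScript sketch; the host and the JWT_TOKEN environment variable are placeholders, not part of the API):
const BASE_URL = 'https://your-host/api/v1/crawlingManager'; // placeholder host
const token = process.env.JWT_TOKEN!;                        // admin-scoped JWT

const res = await fetch(`${BASE_URL}/stats`, {
  headers: { Authorization: `Bearer ${token}` },
});
if (!res.ok) throw new Error(`Request failed: ${res.status}`);
console.log(await res.json());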

Supported Connectors

  • gmail - Gmail connector
  • drive - Google Drive connector
  • onedrive - OneDrive connector
  • sharepointonline - SharePoint Online connector
  • confluence - Confluence connector
  • slack - Slack connector
  • linear - Linear connector
  • dropbox - Dropbox connector
  • outlook - Outlook connector
  • jira - Jira connector
  • atlassian - Atlassian connector
  • github - GitHub connector
  • box - Box connector
  • s3 - Amazon S3 connector
  • azure - Azure connector
  • airtable - Airtable connector
  • zendesk - Zendesk connector
Note: Connector names are case-insensitive and spaces are automatically removed during processing. Job IDs are generated using the format: crawl-{connector}-{orgId} where connector is normalized to lowercase with hyphens.
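Conceptually, job ID generation can be pictured like this (an illustrative TypeScript sketch; the helper name and exact normalization rules are assumptions, the real logic lives inside the service):
// Hypothetical helper mirroring the documented crawl-{connector}-{orgId} format.
function toJobId(connector: string, orgId: string): string {
  const normalized = connector.toLowerCase().replace(/\s+/g, ''); // case-insensitive, spaces removed
  return `crawl-${normalized}-${orgId}`;
}
// toJobId('SharePoint Online', '64f1a2b3c4d5e6f7a8b9c0d1')
//   => 'crawl-sharepointonline-64f1a2b3c4d5e6f7a8b9c0d1'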

API Endpoints

Schedule Crawling Job

Schedule a new crawling job for a specific connector type.

Endpoint: POST /api/v1/crawlingManager/:connector/schedule

Parameters:
  • connector (string, path) - The connector type (gmail, drive, onedrive, sharepointonline, confluence, slack, linear, dropbox, outlook, jira, atlassian, github, box, s3, azure, airtable, zendesk)

Request Body:
{
  "scheduleConfig": {
    "scheduleType": "daily|weekly|monthly|hourly|custom|once",
    "isEnabled": true,
    "timezone": "UTC",
    // Schedule-specific configuration based on scheduleType
  },
  "priority": 5,          // 1-10, default: 5
  "maxRetries": 3,        // 0-10, default: 3
  "timeout": 300000       // 1000-600000ms, default: 300000 (5 minutes)
}
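For example, scheduling a daily crawl for the drive connector could look like this (a TypeScript sketch; BASE_URL and token are the placeholders introduced in the authentication example):
const body = {
  scheduleConfig: {
    scheduleType: 'daily',
    hour: 9,
    minute: 0,
    isEnabled: true,
    timezone: 'UTC',
  },
  priority: 5,
  maxRetries: 3,
  timeout: 300000,
};

const res = await fetch(`${BASE_URL}/drive/schedule`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${token}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify(body),
});
console.log(res.status); // 201 Created on success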
Schedule Configuration Types:
All schedule configurations inherit the following base properties:
  • scheduleType (required) - The type of schedule
  • isEnabled (boolean, default: true) - Whether the schedule is enabled
  • timezone (string, default: "UTC") - Timezone for schedule execution
Hourly Schedule:
{
  "scheduleType": "hourly",
  "minute": 0,            // 0-59, required
  "interval": 1,          // 1-24, default: 1 (every X hours)
  "isEnabled": true,
  "timezone": "UTC"
}
Daily Schedule:
{
  "scheduleType": "daily", 
  "hour": 9,              // 0-23, required
  "minute": 0,            // 0-59, required
  "isEnabled": true,
  "timezone": "UTC"
}
Weekly Schedule:
{
  "scheduleType": "weekly",
  "daysOfWeek": [1, 2, 3, 4, 5],  // Array of 0-6 (Sunday-Saturday), min 1 item
  "hour": 9,                      // 0-23, required
  "minute": 0,                    // 0-59, required
  "isEnabled": true,
  "timezone": "UTC"
}
Monthly Schedule:
{
  "scheduleType": "monthly",
  "dayOfMonth": 1,        // 1-31, required
  "hour": 9,              // 0-23, required
  "minute": 0,            // 0-59, required
  "isEnabled": true,
  "timezone": "UTC"
}
Custom Schedule (Cron):
{
  "scheduleType": "custom",
  "cronExpression": "0 9 * * 1-5",  // Required, must match pattern: ^(\S+\s+){4}\S+$
  "description": "Weekdays at 9 AM", // Optional
  "isEnabled": true,
  "timezone": "UTC"
}
Once Schedule:
{
  "scheduleType": "once",
  "scheduledTime": "2024-12-31T23:59:59.000Z",  // ISO 8601 datetime string
  "isEnabled": true,
  "timezone": "UTC"
}
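Taken together, the schedule configurations form a discriminated union on scheduleType; a TypeScript sketch of the shapes above (type names are illustrative, not exported by the API):
interface BaseScheduleConfig {
  isEnabled?: boolean; // default: true
  timezone?: string;   // default: "UTC"
}

type ScheduleConfig =
  | (BaseScheduleConfig & { scheduleType: 'hourly'; minute: number; interval?: number })
  | (BaseScheduleConfig & { scheduleType: 'daily'; hour: number; minute: number })
  | (BaseScheduleConfig & { scheduleType: 'weekly'; daysOfWeek: number[]; hour: number; minute: number })
  | (BaseScheduleConfig & { scheduleType: 'monthly'; dayOfMonth: number; hour: number; minute: number })
  | (BaseScheduleConfig & { scheduleType: 'custom'; cronExpression: string; description?: string })
  | (BaseScheduleConfig & { scheduleType: 'once'; scheduledTime: string });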

Get Job Status

Retrieve the status of a scheduled crawling job for a specific connector.

Endpoint: GET /api/v1/crawlingManager/:connector/schedule

Parameters:
  • connector (string, path) - The connector type
Status: 200 OK
{
  "success": true,
  "message": "Job status retrieved successfully", 
  "data": {
    "id": "bull:crawling-scheduler:123",
    "name": "crawl-box",
    "data": {
      "connector": "box",
      "scheduleConfig": {
        "scheduleType": "daily",
        "hour": 9,
        "minute": 0,
        "isEnabled": true,
        "timezone": "UTC"
      },
      "orgId": "64f1a2b3c4d5e6f7a8b9c0d1",
      "userId": "64f1a2b3c4d5e6f7a8b9c0d2",
      "timestamp": "2024-01-15T09:00:00.000Z",
      "metadata": {}
    },
    "progress": 75,
    "delay": null,
    "timestamp": 1705392000000,
    "attemptsMade": 1,
    "finishedOn": null,
    "processedOn": 1705392060000,
    "failedReason": null,
    "state": "active"
  }
}
Job States:
  • waiting - Job is waiting to be processed
  • active - Job is currently being processed
  • completed - Job completed successfully
  • failed - Job failed with errors
  • delayed - Job is delayed for future execution
  • paused - Job is paused
Progress Tracking: Jobs report progress at key stages: 10% (start), 20% (task service obtained), 100% (completion). Failed jobs may show partial progress.
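A client that needs to wait for a crawl to finish can poll this endpoint until the job reaches a terminal state (a TypeScript sketch; the polling interval is arbitrary, BASE_URL and token as in the authentication example):
async function waitForJob(connector: string): Promise<string> {
  for (;;) {
    const res = await fetch(`${BASE_URL}/${connector}/schedule`, {
      headers: { Authorization: `Bearer ${token}` },
    });
    const { data } = await res.json();
    console.log(`state=${data.state} progress=${data.progress}`);
    if (data.state === 'completed' || data.state === 'failed') return data.state;
    await new Promise((resolve) => setTimeout(resolve, 10_000)); // poll every 10 seconds
  }
}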

Get All Job Statuses

Retrieve all scheduled crawling jobs for the organization.

Endpoint: GET /api/v1/crawlingManager/schedule/all
Status: 200 OK
{
  "success": true,
  "message": "All job statuses retrieved successfully",
  "data": [
    {
      "id": "bull:crawling-scheduler:123",
      "name": "crawl-github", 
      "data": {
        "connector": "github",
        "scheduleConfig": {
          "scheduleType": "daily",
          "hour": 9,
          "minute": 0,
          "isEnabled": true,
          "timezone": "UTC"
        },
        "orgId": "64f1a2b3c4d5e6f7a8b9c0d1",
        "userId": "64f1a2b3c4d5e6f7a8b9c0d2",
        "timestamp": "2024-01-15T09:00:00.000Z"
      },
      "progress": 100,
      "state": "completed",
      "finishedOn": 1705392120000
    },
    {
      "id": "crawl-zendesk-64f1a2b3c4d5e6f7a8b9c0d1",
      "name": "crawl-zendesk",
      "data": {
        "connector": "zendesk", 
        "scheduleConfig": {
          "scheduleType": "weekly",
          "daysOfWeek": [1, 3, 5],
          "hour": 10,
          "minute": 0,
          "isEnabled": true,
          "timezone": "UTC"
        },
        "orgId": "64f1a2b3c4d5e6f7a8b9c0d1",
        "userId": "64f1a2b3c4d5e6f7a8b9c0d2", 
        "timestamp": "2024-01-15T08:00:00.000Z"
      },
      "progress": 0,
      "state": "paused"
    },
    {
      "id": "crawl-linear-64f1a2b3c4d5e6f7a8b9c0d1",
      "name": "crawl-linear",
      "data": {
        "connector": "linear",
        "scheduleConfig": {
          "scheduleType": "hourly",
          "minute": 30,
          "interval": 2,
          "isEnabled": true,
          "timezone": "UTC"
        },
        "orgId": "64f1a2b3c4d5e6f7a8b9c0d1",
        "userId": "64f1a2b3c4d5e6f7a8b9c0d2",
        "timestamp": "2024-01-15T10:30:00.000Z"
      },
      "progress": 45,
      "state": "active"
    }
  ]
}
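For dashboards, the list can be summarized client-side, for example by counting jobs per state (a TypeScript sketch; BASE_URL and token as in the authentication example):
const res = await fetch(`${BASE_URL}/schedule/all`, {
  headers: { Authorization: `Bearer ${token}` },
});
const { data: jobs } = await res.json();

// Tally jobs by state, e.g. { completed: 1, paused: 1, active: 1 }
const byState: Record<string, number> = {};
for (const job of jobs) {
  byState[job.state] = (byState[job.state] ?? 0) + 1;
}
console.log(byState);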

Remove Crawling Job

Remove a scheduled crawling job for a specific connector.

Endpoint: DELETE /api/v1/crawlingManager/:connector/schedule

Parameters:
  • connector (string, path) - The connector type
Status: 200 OK
{
  "success": true,
  "message": "Crawling job removed successfully"
}

Remove All Crawling Jobs

Remove all scheduled crawling jobs for the organization.

Endpoint: DELETE /api/v1/crawlingManager/schedule/all
Status: 200 OK
{
  "success": true,
  "message": "All crawling jobs removed successfully"
}

Pause Crawling Job

Pause a scheduled crawling job for a specific connector.

Endpoint: POST /api/v1/crawlingManager/:connector/pause

Parameters:
  • connector (string, path) - The connector type
Status: 200 OK
{
  "success": true,
  "message": "Crawling job paused successfully",
  "data": {
    "connector": "airtable",
    "orgId": "64f1a2b3c4d5e6f7a8b9c0d1",
    "pausedAt": "2024-01-15T10:30:00.000Z"
  }
}

Resume Crawling Job

Resume a paused crawling job for a specific connector.

Endpoint: POST /api/v1/crawlingManager/:connector/resume

Parameters:
  • connector (string, path) - The connector type
Status: 200 OK
{
  "success": true,
  "message": "Crawling job resumed successfully",
  "data": {
    "connector": "dropbox", 
    "orgId": "64f1a2b3c4d5e6f7a8b9c0d1",
    "resumedAt": "2024-01-15T10:35:00.000Z"
  }
}
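Pause and resume can be combined, for example to implement a simple maintenance window for one connector (a TypeScript sketch; the connector and duration are arbitrary, BASE_URL and token as in the authentication example):
async function post(path: string) {
  const res = await fetch(`${BASE_URL}${path}`, {
    method: 'POST',
    headers: { Authorization: `Bearer ${token}` },
  });
  return res.json();
}

// Pause crawling, wait out the maintenance window, then resume.
await post('/confluence/pause');
await new Promise((resolve) => setTimeout(resolve, 60 * 60 * 1000)); // 1 hour
await post('/confluence/resume');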

Get Queue Statistics

Retrieve statistics about the crawling job queue.

Endpoint: GET /api/v1/crawlingManager/stats
Status: 200 OK
{
  "success": true,
  "message": "Queue statistics retrieved successfully",
  "data": {
    "waiting": 5,
    "active": 2,
    "completed": 120,
    "failed": 3,
    "delayed": 1,
    "paused": 2,
    "repeatable": 8,
    "total": 141
  }
}
Statistics Fields:
  • waiting - Number of jobs waiting to be processed
  • active - Number of jobs currently being processed
  • completed - Number of completed jobs
  • failed - Number of failed jobs
  • delayed - Number of delayed jobs
  • paused - Number of paused jobs
  • repeatable - Number of repeatable/scheduled jobs
  • total - Total number of jobs across all states
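These counters lend themselves to basic monitoring, for example alerting when the failure ratio climbs (a TypeScript sketch; the 10% threshold is arbitrary, BASE_URL and token as in the authentication example):
const res = await fetch(`${BASE_URL}/stats`, {
  headers: { Authorization: `Bearer ${token}` },
});
const { data: stats } = await res.json();

const finished = stats.completed + stats.failed;
const failureRate = finished === 0 ? 0 : stats.failed / finished;
if (failureRate > 0.1) {
  console.warn(`High crawl failure rate: ${(failureRate * 100).toFixed(1)}%`);
}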

Data Types

enum CrawlingScheduleType {
  HOURLY = 'hourly',
  DAILY = 'daily', 
  WEEKLY = 'weekly',
  MONTHLY = 'monthly',
  CUSTOM = 'custom',
  ONCE = 'once'
}

System Configuration

The API includes built-in rate limiting and concurrency controls:
  • Maximum 5 concurrent jobs per queue
  • Jobs are automatically retried with exponential backoff (5000ms initial delay) on failure
  • Stalled jobs are detected after 30 seconds
  • Maximum 3 retry attempts for failed jobs
  • Job history retention: Last 10 completed and 10 failed jobs per connector type
  • Jobs are removed and recreated when updating schedules (no job modification)
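The bull:crawling-scheduler job IDs shown earlier suggest a Bull/BullMQ-backed queue; if so, the limits above roughly correspond to queue and worker options like the following (an illustrative sketch assuming BullMQ, not the service's actual configuration):
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 }; // placeholder Redis connection

const queue = new Queue('crawling-scheduler', {
  connection,
  defaultJobOptions: {
    attempts: 3,                                   // maximum 3 retry attempts
    backoff: { type: 'exponential', delay: 5000 }, // exponential backoff, 5000ms initial delay
    removeOnComplete: 10,                          // keep last 10 completed jobs
    removeOnFail: 10,                              // keep last 10 failed jobs
  },
});

const worker = new Worker('crawling-scheduler', async (job) => { /* run the crawl */ }, {
  connection,
  concurrency: 5,         // maximum 5 concurrent jobs per queue
  stalledInterval: 30000, // stalled jobs detected after 30 seconds
});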

Error Handling

All endpoints follow a consistent error response format:
{
  "success": false,
  "message": "Error description",
  "error": {
    "code": "ERROR_CODE",
    "details": "Additional error details"
  }
}
Common HTTP Status Codes:
  • 200 - Success
  • 201 - Created (for scheduling jobs)
  • 400 - Bad Request (validation errors, invalid configuration)
  • 401 - Unauthorized (missing or invalid authentication)
  • 403 - Forbidden (insufficient privileges, admin required)
  • 404 - Not Found (job not found)
  • 500 - Internal Server Error
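A thin client wrapper can normalize these outcomes by throwing on non-2xx statuses or success: false bodies (a TypeScript sketch; the error class is illustrative, BASE_URL and token as in the authentication example):
class CrawlingManagerError extends Error {
  constructor(public status: number, public code?: string, message?: string) {
    super(message ?? `Request failed with status ${status}`);
  }
}

async function request(path: string, init: RequestInit = {}) {
  const res = await fetch(`${BASE_URL}${path}`, {
    ...init,
    headers: { Authorization: `Bearer ${token}`, ...init.headers },
  });
  const body = await res.json().catch(() => null);
  if (!res.ok || body?.success === false) {
    throw new CrawlingManagerError(res.status, body?.error?.code, body?.message);
  }
  return body;
}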