Crawling Manager API

The Crawling Manager API provides endpoints for scheduling, managing, and monitoring data crawling jobs across various connector types including Google Workspace, OneDrive, SharePoint Online, Slack, and Confluence.

Base URL

All endpoints are prefixed with /api/v1/crawlingManager

Authentication

All endpoints require:
  • Authentication via Authorization header with valid JWT token
  • Admin privileges - Only users with the admin role can access these endpoints
Headers:
Authorization: Bearer <jwt_token>
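Example request (a minimal TypeScript sketch; the host and the JWT_TOKEN environment variable are placeholders, not part of the API):
const BASE_URL = 'https://your-host/api/v1/crawlingManager'; // placeholder host
const token = process.env.JWT_TOKEN!;                        // admin-scoped JWT

const res = await fetch(`${BASE_URL}/stats`, {
  headers: { Authorization: `Bearer ${token}` },
});
if (!res.ok) throw new Error(`Request failed: ${res.status}`);
console.log(await res.json());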

Supported Connectors

  • gmail - Gmail connector
  • drive - Google Drive connector
  • onedrive - OneDrive connector
  • sharepointonline - SharePoint Online connector
  • confluence - Confluence connector
  • slack - Slack connector
  • linear - Linear connector
  • dropbox - Dropbox connector
  • outlook - Outlook connector
  • jira - Jira connector
  • atlassian - Atlassian connector
  • github - GitHub connector
  • box - Box connector
  • s3 - Amazon S3 connector
  • azure - Azure connector
  • airtable - Airtable connector
  • zendesk - Zendesk connector
Note: Connector names are case-insensitive and spaces are automatically removed during processing. Job IDs are generated using the format: crawl-{connector}-{orgId} where connector is normalized to lowercase with hyphens.
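Conceptually, job ID generation can be pictured like this (an illustrative TypeScript sketch; the helper name and exact normalization rules are assumptions, the real logic lives inside the service):
// Hypothetical helper mirroring the documented crawl-{connector}-{orgId} format.
function toJobId(connector: string, orgId: string): string {
  const normalized = connector.toLowerCase().replace(/\s+/g, ''); // case-insensitive, spaces removed
  return `crawl-${normalized}-${orgId}`;
}
// toJobId('SharePoint Online', '64f1a2b3c4d5e6f7a8b9c0d1')
//   => 'crawl-sharepointonline-64f1a2b3c4d5e6f7a8b9c0d1'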

API Endpoints

Schedule Crawling Job

Schedule a new crawling job for a specific connector type.

Endpoint: POST /api/v1/crawlingManager/:connector/schedule

Parameters:
  • connector (string, path) - The connector type (gmail, drive, onedrive, sharepointonline, confluence, slack, linear, dropbox, outlook, jira, atlassian, github, box, s3, azure, airtable, zendesk)

Request Body:
{
  "scheduleConfig": {
    "scheduleType": "daily|weekly|monthly|hourly|custom|once",
    "isEnabled": true,
    "timezone": "UTC",
    // Schedule-specific configuration based on scheduleType
  },
  "priority": 5,          // 1-10, default: 5
  "maxRetries": 3,        // 0-10, default: 3
  "timeout": 300000       // 1000-600000ms, default: 300000 (5 minutes)
}
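For example, scheduling a daily crawl for the drive connector could look like this (a TypeScript sketch; BASE_URL and token are the placeholders introduced in the authentication example):
const body = {
  scheduleConfig: {
    scheduleType: 'daily',
    hour: 9,
    minute: 0,
    isEnabled: true,
    timezone: 'UTC',
  },
  priority: 5,
  maxRetries: 3,
  timeout: 300000,
};

const res = await fetch(`${BASE_URL}/drive/schedule`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${token}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify(body),
});
console.log(res.status); // 201 Created on success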
Schedule Configuration Types:
All schedule configurations inherit the following base properties:
  • scheduleType (required) - The type of schedule
  • isEnabled (boolean, default: true) - Whether the schedule is enabled
  • timezone (string, default: "UTC") - Timezone for schedule execution
Hourly Schedule:
{
  "scheduleType": "hourly",
  "minute": 0,            // 0-59, required
  "interval": 1,          // 1-24, default: 1 (every X hours)
  "isEnabled": true,
  "timezone": "UTC"
}
Daily Schedule:
{
  "scheduleType": "daily", 
  "hour": 9,              // 0-23, required
  "minute": 0,            // 0-59, required
  "isEnabled": true,
  "timezone": "UTC"
}
Weekly Schedule:
{
  "scheduleType": "weekly",
  "daysOfWeek": [1, 2, 3, 4, 5],  // Array of 0-6 (Sunday-Saturday), min 1 item
  "hour": 9,                      // 0-23, required
  "minute": 0,                    // 0-59, required
  "isEnabled": true,
  "timezone": "UTC"
}
Monthly Schedule:
{
  "scheduleType": "monthly",
  "dayOfMonth": 1,        // 1-31, required
  "hour": 9,              // 0-23, required
  "minute": 0,            // 0-59, required
  "isEnabled": true,
  "timezone": "UTC"
}
Custom Schedule (Cron):
{
  "scheduleType": "custom",
  "cronExpression": "0 9 * * 1-5",  // Required, must match pattern: ^(\S+\s+){4}\S+$
  "description": "Weekdays at 9 AM", // Optional
  "isEnabled": true,
  "timezone": "UTC"
}
Once Schedule:
{
  "scheduleType": "once",
  "scheduledTime": "2024-12-31T23:59:59.000Z",  // ISO 8601 datetime string
  "isEnabled": true,
  "timezone": "UTC"
}
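Taken together, the schedule configurations form a discriminated union on scheduleType; a TypeScript sketch of the shapes above (type names are illustrative, not exported by the API):
interface BaseScheduleConfig {
  isEnabled?: boolean; // default: true
  timezone?: string;   // default: "UTC"
}

type ScheduleConfig =
  | (BaseScheduleConfig & { scheduleType: 'hourly'; minute: number; interval?: number })
  | (BaseScheduleConfig & { scheduleType: 'daily'; hour: number; minute: number })
  | (BaseScheduleConfig & { scheduleType: 'weekly'; daysOfWeek: number[]; hour: number; minute: number })
  | (BaseScheduleConfig & { scheduleType: 'monthly'; dayOfMonth: number; hour: number; minute: number })
  | (BaseScheduleConfig & { scheduleType: 'custom'; cronExpression: string; description?: string })
  | (BaseScheduleConfig & { scheduleType: 'once'; scheduledTime: string });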

Get Job Status

Retrieve the status of a scheduled crawling job for a specific connector.

Endpoint: GET /api/v1/crawlingManager/:connector/schedule

Parameters:
  • connector (string, path) - The connector type
Status: 200 OK
{
  "success": true,
  "message": "Job status retrieved successfully", 
  "data": {
    "id": "bull:crawling-scheduler:123",
    "name": "crawl-box",
    "data": {
      "connector": "box",
      "scheduleConfig": {
        "scheduleType": "daily",
        "hour": 9,
        "minute": 0,
        "isEnabled": true,
        "timezone": "UTC"
      },
      "orgId": "64f1a2b3c4d5e6f7a8b9c0d1",
      "userId": "64f1a2b3c4d5e6f7a8b9c0d2",
      "timestamp": "2024-01-15T09:00:00.000Z",
      "metadata": {}
    },
    "progress": 75,
    "delay": null,
    "timestamp": 1705392000000,
    "attemptsMade": 1,
    "finishedOn": null,
    "processedOn": 1705392060000,
    "failedReason": null,
    "state": "active"
  }
}
Job States:
  • waiting - Job is waiting to be processed
  • active - Job is currently being processed
  • completed - Job completed successfully
  • failed - Job failed with errors
  • delayed - Job is delayed for future execution
  • paused - Job is paused
Progress Tracking: Jobs report progress at key stages: 10% (start), 20% (task service obtained), 100% (completion). Failed jobs may show partial progress.
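A client that needs to wait for a crawl to finish can poll this endpoint until the job reaches a terminal state (a TypeScript sketch; the polling interval is arbitrary, BASE_URL and token as in the authentication example):
async function waitForJob(connector: string): Promise<string> {
  for (;;) {
    const res = await fetch(`${BASE_URL}/${connector}/schedule`, {
      headers: { Authorization: `Bearer ${token}` },
    });
    const { data } = await res.json();
    console.log(`state=${data.state} progress=${data.progress}`);
    if (data.state === 'completed' || data.state === 'failed') return data.state;
    await new Promise((resolve) => setTimeout(resolve, 10_000)); // poll every 10 seconds
  }
}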

Get All Job Statuses

Retrieve all scheduled crawling jobs for the organization.

Endpoint: GET /api/v1/crawlingManager/schedule/all
Status: 200 OK
{
  "success": true,
  "message": "All job statuses retrieved successfully",
  "data": [
    {
      "id": "bull:crawling-scheduler:123",
      "name": "crawl-github", 
      "data": {
        "connector": "github",
        "scheduleConfig": {
          "scheduleType": "daily",
          "hour": 9,
          "minute": 0,
          "isEnabled": true,
          "timezone": "UTC"
        },
        "orgId": "64f1a2b3c4d5e6f7a8b9c0d1",
        "userId": "64f1a2b3c4d5e6f7a8b9c0d2",
        "timestamp": "2024-01-15T09:00:00.000Z"
      },
      "progress": 100,
      "state": "completed",
      "finishedOn": 1705392120000
    },
    {
      "id": "crawl-zendesk-64f1a2b3c4d5e6f7a8b9c0d1",
      "name": "crawl-zendesk",
      "data": {
        "connector": "zendesk", 
        "scheduleConfig": {
          "scheduleType": "weekly",
          "daysOfWeek": [1, 3, 5],
          "hour": 10,
          "minute": 0,
          "isEnabled": true,
          "timezone": "UTC"
        },
        "orgId": "64f1a2b3c4d5e6f7a8b9c0d1",
        "userId": "64f1a2b3c4d5e6f7a8b9c0d2", 
        "timestamp": "2024-01-15T08:00:00.000Z"
      },
      "progress": 0,
      "state": "paused"
    },
    {
      "id": "crawl-linear-64f1a2b3c4d5e6f7a8b9c0d1",
      "name": "crawl-linear",
      "data": {
        "connector": "linear",
        "scheduleConfig": {
          "scheduleType": "hourly",
          "minute": 30,
          "interval": 2,
          "isEnabled": true,
          "timezone": "UTC"
        },
        "orgId": "64f1a2b3c4d5e6f7a8b9c0d1",
        "userId": "64f1a2b3c4d5e6f7a8b9c0d2",
        "timestamp": "2024-01-15T10:30:00.000Z"
      },
      "progress": 45,
      "state": "active"
    }
  ]
}
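For dashboards, the list can be summarized client-side, for example by counting jobs per state (a TypeScript sketch; BASE_URL and token as in the authentication example):
const res = await fetch(`${BASE_URL}/schedule/all`, {
  headers: { Authorization: `Bearer ${token}` },
});
const { data: jobs } = await res.json();

// Tally jobs by state, e.g. { completed: 1, paused: 1, active: 1 }
const byState: Record<string, number> = {};
for (const job of jobs) {
  byState[job.state] = (byState[job.state] ?? 0) + 1;
}
console.log(byState);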

Remove Crawling Job

Remove a scheduled crawling job for a specific connector.

Endpoint: DELETE /api/v1/crawlingManager/:connector/schedule

Parameters:
  • connector (string, path) - The connector type
Status: 200 OK
{
  "success": true,
  "message": "Crawling job removed successfully"
}

Remove All Crawling Jobs

Remove all scheduled crawling jobs for the organization.

Endpoint: DELETE /api/v1/crawlingManager/schedule/all
Status: 200 OK
{
  "success": true,
  "message": "All crawling jobs removed successfully"
}

Pause Crawling Job

Pause a scheduled crawling job for a specific connector.

Endpoint: POST /api/v1/crawlingManager/:connector/pause

Parameters:
  • connector (string, path) - The connector type
Status: 200 OK
{
  "success": true,
  "message": "Crawling job paused successfully",
  "data": {
    "connector": "airtable",
    "orgId": "64f1a2b3c4d5e6f7a8b9c0d1",
    "pausedAt": "2024-01-15T10:30:00.000Z"
  }
}

Resume Crawling Job

Resume a paused crawling job for a specific connector.

Endpoint: POST /api/v1/crawlingManager/:connector/resume

Parameters:
  • connector (string, path) - The connector type
Status: 200 OK
{
  "success": true,
  "message": "Crawling job resumed successfully",
  "data": {
    "connector": "dropbox", 
    "orgId": "64f1a2b3c4d5e6f7a8b9c0d1",
    "resumedAt": "2024-01-15T10:35:00.000Z"
  }
}
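Pause and resume can be combined, for example to implement a simple maintenance window for one connector (a TypeScript sketch; the connector and duration are arbitrary, BASE_URL and token as in the authentication example):
async function post(path: string) {
  const res = await fetch(`${BASE_URL}${path}`, {
    method: 'POST',
    headers: { Authorization: `Bearer ${token}` },
  });
  return res.json();
}

// Pause crawling, wait out the maintenance window, then resume.
await post('/confluence/pause');
await new Promise((resolve) => setTimeout(resolve, 60 * 60 * 1000)); // 1 hour
await post('/confluence/resume');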

Get Queue Statistics

Retrieve statistics about the crawling job queue.

Endpoint: GET /api/v1/crawlingManager/stats
Status: 200 OK
{
  "success": true,
  "message": "Queue statistics retrieved successfully",
  "data": {
    "waiting": 5,
    "active": 2,
    "completed": 120,
    "failed": 3,
    "delayed": 1,
    "paused": 2,
    "repeatable": 8,
    "total": 141
  }
}
Statistics Fields:
  • waiting - Number of jobs waiting to be processed
  • active - Number of jobs currently being processed
  • completed - Number of completed jobs
  • failed - Number of failed jobs
  • delayed - Number of delayed jobs
  • paused - Number of paused jobs
  • repeatable - Number of repeatable/scheduled jobs
  • total - Total number of jobs across all states
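These counters lend themselves to basic monitoring, for example alerting when the failure ratio climbs (a TypeScript sketch; the 10% threshold is arbitrary, BASE_URL and token as in the authentication example):
const res = await fetch(`${BASE_URL}/stats`, {
  headers: { Authorization: `Bearer ${token}` },
});
const { data: stats } = await res.json();

const finished = stats.completed + stats.failed;
const failureRate = finished === 0 ? 0 : stats.failed / finished;
if (failureRate > 0.1) {
  console.warn(`High crawl failure rate: ${(failureRate * 100).toFixed(1)}%`);
}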

Data Types

enum CrawlingScheduleType {
  HOURLY = 'hourly',
  DAILY = 'daily', 
  WEEKLY = 'weekly',
  MONTHLY = 'monthly',
  CUSTOM = 'custom',
  ONCE = 'once'
}

System Configuration

The API includes built-in rate limiting and concurrency controls:
  • Maximum 5 concurrent jobs per queue
  • Jobs are automatically retried with exponential backoff (5000ms initial delay) on failure
  • Stalled jobs are detected after 30 seconds
  • Maximum 3 retry attempts for failed jobs
  • Job history retention: Last 10 completed and 10 failed jobs per connector type
  • Jobs are removed and recreated when updating schedules (no job modification)
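The bull:crawling-scheduler job IDs shown earlier suggest a Bull/BullMQ-backed queue; if so, the limits above roughly correspond to queue and worker options like the following (an illustrative sketch assuming BullMQ, not the service's actual configuration):
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 }; // placeholder Redis connection

const queue = new Queue('crawling-scheduler', {
  connection,
  defaultJobOptions: {
    attempts: 3,                                   // maximum 3 retry attempts
    backoff: { type: 'exponential', delay: 5000 }, // exponential backoff, 5000ms initial delay
    removeOnComplete: 10,                          // keep last 10 completed jobs
    removeOnFail: 10,                              // keep last 10 failed jobs
  },
});

const worker = new Worker('crawling-scheduler', async (job) => { /* run the crawl */ }, {
  connection,
  concurrency: 5,         // maximum 5 concurrent jobs per queue
  stalledInterval: 30000, // stalled jobs detected after 30 seconds
});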

Error Handling

All endpoints follow a consistent error response format:
{
  "success": false,
  "message": "Error description",
  "error": {
    "code": "ERROR_CODE",
    "details": "Additional error details"
  }
}
Common HTTP Status Codes:
  • 200 - Success
  • 201 - Created (for scheduling jobs)
  • 400 - Bad Request (validation errors, invalid configuration)
  • 401 - Unauthorized (missing or invalid authentication)
  • 403 - Forbidden (insufficient privileges, admin required)
  • 404 - Not Found (job not found)
  • 500 - Internal Server Error
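A thin client wrapper can normalize these outcomes by throwing on non-2xx statuses or success: false bodies (a TypeScript sketch; the error class is illustrative, BASE_URL and token as in the authentication example):
class CrawlingManagerError extends Error {
  constructor(public status: number, public code?: string, message?: string) {
    super(message ?? `Request failed with status ${status}`);
  }
}

async function request(path: string, init: RequestInit = {}) {
  const res = await fetch(`${BASE_URL}${path}`, {
    ...init,
    headers: { Authorization: `Bearer ${token}`, ...init.headers },
  });
  const body = await res.json().catch(() => null);
  if (!res.ok || body?.success === false) {
    throw new CrawlingManagerError(res.status, body?.error?.code, body?.message);
  }
  return body;
}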