Analyze Video Endpoint

POST /api/analyze-video

Creates a new video analysis task. The server responds immediately with a video_id; the analysis itself runs asynchronously.

Request Format

  • Content-Type: multipart/form-data

Parameters

Field             Type    Required  Default  Notes
file              binary  ✔︎         —        Video to analyze (any common container/codec)
visual_analytics  bool    ✖︎         false    Enables visual analytics

Response Format

Success Response (200 OK)

{
  "video_id": "a1b2c3d4-e5f6-e7",
  "status": "pending"
}

Analysis Types

  • Speech Analytics (Always Included): Provides comprehensive speech analysis including transcript, filler words, pauses, sentiment, and speaking patterns
  • Visual Analytics (Optional): When visual_analytics=true, adds gesture analysis, posture detection, eye contact patterns, and facial expression analysis

Example: cURL

# Example request to the analyze-video endpoint
curl -X POST "https://api.aiframe.ai/api/analyze-video" \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@/path/to/video.mov" \
  -F "visual_analytics=true"

Sample Output

When you retrieve the completed analysis results via the Get Status endpoint, the result field will contain detailed analytics data:

Speech Analytics Output (Always Included)
{
  "Audio Section": "-----------**********************----------------",
  "transcript": "Transcript of video",
  "summary": {
    "duration_sec": 47.4,
    "filler_word_count": 11,
    "words_per_minute": 155.7,
    "avg_word_duration": 0.269,
    "speech_to_pause_ratio": 0.699,
    "emotion": "neutral",
    "speaking_style": {
      "style": "neutral",
      "readability_grade": 4.416428571428572,
      "lexical_diversity": 0.4148148148148148
    }
  },
  "filler_words": {
    "count": 11,
    "instances": [
      {
        "word": "uhm",
        "start": 4.42,
        "end": 4.74,
        "confidence": 0.0
      }
    ]
  },
  "pauses": {
    "total": 4,
    "long_pauses": 4,
    "time_series": [
      {
        "start": 2.58,
        "end": 3.98,
        "duration": 1.4,
        "is_long": true
      }
    ]
  },
  "sentiment_words": [
    {
      "time": 7.5,
      "word": "importantly",
      "compound": 0.3182
    }
  ],
  "pos": {
    "NOUN": [
      {
        "word": "video",
        "timestamps": [
          {
            "start": 1.06,
            "end": 1.38
          }
        ]
      }
    ],
    "VERB": [],
    "ADP": [],
    "ADJ": [],
    "ADV": [],
    "INTJ": []
  }
}
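
As an illustration of how a client might consume this payload, the sketch below cross-references filler words with long pauses. Field names come from the sample above; the `speech` dict is assumed to be the parsed speech-analytics result:

def fillers_near_long_pauses(speech, window=0.5):
    """Return (word, start) pairs for filler words within `window`
    seconds of a long pause, using the fields shown in the sample."""
    long_pauses = [p for p in speech["pauses"]["time_series"] if p["is_long"]]
    hits = []
    for filler in speech["filler_words"]["instances"]:
        for pause in long_pauses:
            # Treat the filler as "near" the pause if their intervals,
            # padded by `window`, overlap.
            if (filler["start"] <= pause["end"] + window
                    and filler["end"] >= pause["start"] - window):
                hits.append((filler["word"], filler["start"]))
                break
    return hits
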
Visual Analytics Output (When visual_analytics=true)
{
  "Video Section": "-----------**********************----------------",
  "gesture_summary": {
    "open_hand": 0,
    "pinch_or_point": 0,
    "none": 0
  },
  "posture_summary": {
    "raised_hand": 0,
    "hands_on_hips": 313,
    "left_hand_high": 0,
    "right_hand_high": 0,
    "left_hand_forward": 1456,
    "right_hand_forward": 1456
  },
  "merged_visual_time_series": [
    {
      "time": 0.0,
      "pose_detected": true,
      "raised_hand": false,
      "hands_on_hips": true,
      "left_hand_level_label": "low",
      "right_hand_level_label": "low",
      "left_hand_height": 0.0712,
      "right_hand_height": 0.0591,
      "left_hand_depth": -0.8676,
      "right_hand_depth": -1.0745,
      "gesture_left": "open_hand",
      "gesture_right": "open_hand",
      "eye_openness": 0.009172727272727272,
      "gaze_direction": null,
      "emotion": "happy",
      "gaze_direction_x": "center",
      "gaze_direction_y": "center",
      "gaze_angle_x": 2.4581818181818185,
      "gaze_angle_y": -0.06000000000000001,
      "mouth_width": 0.08699408173561096,
      "mouth_open": 2.0802021026611328e-05,
      "smile_ratio": 3990.1842874762315
    }
  ],
  "gesture_summary_word_sync": {
    "filler_summary": {
      "hands_on_hips": 0.36363636363636365,
      "left_hand_level_label": "low",
      "right_hand_level_label": "low",
      "gesture_left": "open_hand",
      "gesture_right": "open_hand",
      "emotion": "happy",
      "frame_times": [4.43, 5.9, 6.5, 8.13]
    },
    "pause_summary": {},
    "sentiment_positive_summary": {},
    "sentiment_negative_summary": {}
  }
}
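
merged_visual_time_series is the densest part of the visual payload, with one entry per sampled frame. The sketch below aggregates it into a simple eye-contact figure; field names come from the sample above, and what counts as "eye contact" here is an illustrative assumption, not an API definition:

def eye_contact_fraction(visual):
    """Fraction of pose-detected frames where gaze is centered on both axes."""
    frames = visual["merged_visual_time_series"]
    detected = [f for f in frames if f["pose_detected"]]
    if not detected:
        return 0.0
    centered = sum(
        1 for f in detected
        if f["gaze_direction_x"] == "center" and f["gaze_direction_y"] == "center"
    )
    return centered / len(detected)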

Key Metrics Explained

Speech Analytics:

  • duration_sec: Total video duration in seconds
  • filler_word_count: Number of filler words detected (um, uh, etc.)
  • words_per_minute: Speaking pace
  • speech_to_pause_ratio: Ratio of speech time to pause time
  • readability_grade: Complexity of the language used, expressed as an approximate reading grade level
  • lexical_diversity: Vocabulary richness (unique words / total words; see the sketch after this list)
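
For intuition, two of these metrics can be reproduced from a transcript and word timings. This is a minimal sketch under the definitions above, not the API's published implementation:

def pace_and_diversity(words, duration_sec):
    """words: list of (word, start, end) tuples; duration_sec: total length."""
    total = len(words)
    unique = len({w.lower() for w, _, _ in words})
    words_per_minute = total / (duration_sec / 60.0)
    lexical_diversity = unique / total if total else 0.0
    return words_per_minute, lexical_diversity

# Example: 123 words over 47.4 seconds -> ~155.7 wpm,
# matching the sample summary above.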

Visual Analytics:

  • gesture_summary: Count of different hand gestures detected
  • posture_summary: Time spent in different postures
  • eye_openness: Eye engagement level (0-1 scale)
  • gaze_direction_x/y: Where the person is looking (left/center/right, up/center/down)
  • smile_ratio: Facial expression positivity
  • gesture_summary_word_sync: Correlates gestures with specific speech patterns

Retrieving Results

Once the analysis is complete, use the Get Status endpoint to retrieve the results (a polling sketch follows the list below):

  • The result field will contain the analysis data
  • Speech analytics are always included
  • Visual analytics are included if requested during submission
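
A polling loop in Python might look like the sketch below. The route /api/status/{video_id} is an assumption for illustration only; check the API Reference (Swagger) for the actual Get Status path and terminal status values:

import time
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.aiframe.ai"

def wait_for_result(video_id, poll_interval=5.0, timeout=600.0):
    """Poll the Get Status endpoint until the analysis leaves 'pending'."""
    # NOTE: placeholder route; substitute the real Get Status path
    # from the API Reference (Swagger).
    url = f"{BASE_URL}/api/status/{video_id}"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        body = requests.get(url, headers={"X-API-Key": API_KEY}).json()
        if body["status"] != "pending":
            # Completed (or failed); `result` holds the analytics data.
            return body.get("result")
        time.sleep(poll_interval)
    raise TimeoutError(f"analysis {video_id} still pending after {timeout}s")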

For detailed response formats and error codes, please see the API Reference (Swagger).