Analyze Video Endpoint

POST /api/analyze-video

Creates a new video analysis task. The server responds immediately with a video_id; the analysis itself runs asynchronously.

Request Format

  • Content-Type: multipart/form-data

Parameters

Field             Type    Required  Default  Notes
file              binary  ✔︎         —        Video to analyze (any common container/codec)
visual_analytics  bool    ✖︎         false    Enables visual analytics

Response Format

Success Response (200 OK)

{
  "video_id": "a1b2c3d4-e5f6-e7",
  "status": "pending"
}

Analysis Types

  • Speech Analytics (Always Included): Provides comprehensive speech analysis including transcript, filler words, pauses, sentiment, and speaking patterns
  • Visual Analytics (Optional): When visual_analytics=true, adds gesture analysis, posture detection, eye contact patterns, and facial expression analysis

Example: cURL

# Example request to the analyze-video endpoint
curl -X POST "https://api.aiframe.ai/api/analyze-video" \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@/path/to/video.mov" \
  -F "visual_analytics=true"

Sample Output

When you retrieve the completed analysis results via the Get Status endpoint, the result field will contain detailed analytics data:

Speech Analytics Output (Always Included)
{
  "Audio Section": "-----------**********************----------------",
  "transcript": "Transcript of video",
  "summary": {
    "duration_sec": 47.4,
    "filler_word_count": 11,
    "words_per_minute": 155.7,
    "avg_word_duration": 0.269,
    "speech_to_pause_ratio": 0.699,
    "emotion": "neutral",
    "speaking_style": {
      "style": "neutral",
      "readability_grade": 4.416428571428572,
      "lexical_diversity": 0.4148148148148148
    }
  },
  "filler_words": {
    "count": 11,
    "instances": [
      {
        "word": "uhm",
        "start": 4.42,
        "end": 4.74,
        "confidence": 0.0
      }
    ]
  },
  "pauses": {
    "total": 4,
    "long_pauses": 4,
    "time_series": [
      {
        "start": 2.58,
        "end": 3.98,
        "duration": 1.4,
        "is_long": true
      }
    ]
  },
  "sentiment_words": [
    {
      "time": 7.5,
      "word": "importantly",
      "compound": 0.3182
    }
  ],
  "pos": {
    "NOUN": [
      {
        "word": "video",
        "timestamps": [
          {
            "start": 1.06,
            "end": 1.38
          }
        ]
      }
    ],
    "VERB": [],
    "ADP": [],
    "ADJ": [],
    "ADV": [],
    "INTJ": []
  }
}
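
As an illustration of how a client might consume this payload, the sketch below cross-references filler words with long pauses. Field names come from the sample above; the `speech` dict is assumed to be the parsed speech-analytics result:

def fillers_near_long_pauses(speech, window=0.5):
    """Return (word, start) pairs for filler words within `window`
    seconds of a long pause, using the fields shown in the sample."""
    long_pauses = [p for p in speech["pauses"]["time_series"] if p["is_long"]]
    hits = []
    for filler in speech["filler_words"]["instances"]:
        for pause in long_pauses:
            # Treat the filler as "near" the pause if their intervals,
            # padded by `window`, overlap.
            if (filler["start"] <= pause["end"] + window
                    and filler["end"] >= pause["start"] - window):
                hits.append((filler["word"], filler["start"]))
                break
    return hits
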
Visual Analytics Output (When visual_analytics=true)
{
  "Video Section": "-----------**********************----------------",
  "gesture_summary": {
    "open_hand": 0,
    "pinch_or_point": 0,
    "none": 0
  },
  "posture_summary": {
    "raised_hand": 0,
    "hands_on_hips": 313,
    "left_hand_high": 0,
    "right_hand_high": 0,
    "left_hand_forward": 1456,
    "right_hand_forward": 1456
  },
  "merged_visual_time_series": [
    {
      "time": 0.0,
      "pose_detected": true,
      "raised_hand": false,
      "hands_on_hips": true,
      "left_hand_level_label": "low",
      "right_hand_level_label": "low",
      "left_hand_height": 0.0712,
      "right_hand_height": 0.0591,
      "left_hand_depth": -0.8676,
      "right_hand_depth": -1.0745,
      "gesture_left": "open_hand",
      "gesture_right": "open_hand",
      "eye_openness": 0.009172727272727272,
      "gaze_direction": null,
      "emotion": "happy",
      "gaze_direction_x": "center",
      "gaze_direction_y": "center",
      "gaze_angle_x": 2.4581818181818185,
      "gaze_angle_y": -0.06000000000000001,
      "mouth_width": 0.08699408173561096,
      "mouth_open": 2.0802021026611328e-05,
      "smile_ratio": 3990.1842874762315
    }
  ],
  "gesture_summary_word_sync": {
    "filler_summary": {
      "hands_on_hips": 0.36363636363636365,
      "left_hand_level_label": "low",
      "right_hand_level_label": "low",
      "gesture_left": "open_hand",
      "gesture_right": "open_hand",
      "emotion": "happy",
      "frame_times": [4.43, 5.9, 6.5, 8.13]
    },
    "pause_summary": {},
    "sentiment_positive_summary": {},
    "sentiment_negative_summary": {}
  }
}
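
merged_visual_time_series is the densest part of the visual payload, with one entry per sampled frame. The sketch below aggregates it into a simple eye-contact figure; field names come from the sample above, and what counts as "eye contact" here is an illustrative assumption, not an API definition:

def eye_contact_fraction(visual):
    """Fraction of pose-detected frames where gaze is centered on both axes."""
    frames = visual["merged_visual_time_series"]
    detected = [f for f in frames if f["pose_detected"]]
    if not detected:
        return 0.0
    centered = sum(
        1 for f in detected
        if f["gaze_direction_x"] == "center" and f["gaze_direction_y"] == "center"
    )
    return centered / len(detected)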

Key Metrics Explained

Speech Analytics:

  • duration_sec: Total video duration in seconds
  • filler_word_count: Number of filler words detected (um, uh, etc.)
  • words_per_minute: Speaking pace
  • speech_to_pause_ratio: Ratio of speech time to pause time
  • readability_grade: Complexity of the language used, expressed as an approximate reading grade level
  • lexical_diversity: Vocabulary richness (unique words / total words; see the sketch after this list)
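
For intuition, two of these metrics can be reproduced from a transcript and word timings. This is a minimal sketch under the definitions above, not the API's published implementation:

def pace_and_diversity(words, duration_sec):
    """words: list of (word, start, end) tuples; duration_sec: total length."""
    total = len(words)
    unique = len({w.lower() for w, _, _ in words})
    words_per_minute = total / (duration_sec / 60.0)
    lexical_diversity = unique / total if total else 0.0
    return words_per_minute, lexical_diversity

# Example: 123 words over 47.4 seconds -> ~155.7 wpm,
# matching the sample summary above.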

Visual Analytics:

  • gesture_summary: Count of different hand gestures detected
  • posture_summary: Time spent in different postures
  • eye_openness: Eye engagement level (0-1 scale)
  • gaze_direction_x/y: Where the person is looking (left/center/right, up/center/down)
  • smile_ratio: Facial expression positivity
  • gesture_summary_word_sync: Correlates gestures with specific speech patterns

Retrieving Results

Once the analysis is complete, use the Get Status endpoint to retrieve the results (a polling sketch follows the list below):

  • The result field will contain the analysis data
  • Speech analytics are always included
  • Visual analytics are included if requested during submission
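
A polling loop in Python might look like the sketch below. The route /api/status/{video_id} is an assumption for illustration only; check the API Reference (Swagger) for the actual Get Status path and terminal status values:

import time
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.aiframe.ai"

def wait_for_result(video_id, poll_interval=5.0, timeout=600.0):
    """Poll the Get Status endpoint until the analysis leaves 'pending'."""
    # NOTE: placeholder route; substitute the real Get Status path
    # from the API Reference (Swagger).
    url = f"{BASE_URL}/api/status/{video_id}"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        body = requests.get(url, headers={"X-API-Key": API_KEY}).json()
        if body["status"] != "pending":
            # Completed (or failed); `result` holds the analytics data.
            return body.get("result")
        time.sleep(poll_interval)
    raise TimeoutError(f"analysis {video_id} still pending after {timeout}s")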

For detailed response formats and error codes, please see the API Reference (Swagger).