Why I Built My Own Copilot Tool and How It Solves What Microsoft’s Doesn’t

Whether you’re new to a firm or have been in the industry for a long time, it doesn’t matter, you always want an AI-powered tool that captures everything from a meeting, sometimes the current Microsoft Copilot isn’t fully capable of. Let’s dive in and build additional features which must be helpful to all.

Let’s dive straight into development and context understanding.

Application areas

Talent Aquisition team

Automatically transcribes and summarizes candidate interviews with strengths, red flags, and role-fit indicators.
Compares candidates across interviews using structured summaries, making shortlisting faster and more objective.
Retains historical hiring conversations, allowing easy reference when similar roles open again.

HR Teams

Summarizes policy discussions, engagement plans, and internal forums for record-keeping and action tracking.
Captures sentiment from employee feedback meetings and flags recurring HR concerns.
Generates ready-to-send email updates or HR memos directly from internal meeting conversations.

Developers

Transforms sprint planning meetings into structured task lists with owner mapping.
Documents technical discussions, feature debates, and dependencies — even when developers miss the call.
Integrates with Jira or GitHub to auto-push summarized action items or blockers into workflows.

Leadership

Auto-generates high-level summaries with KPIs, decisions, and strategic discussions.
Allows querying across meetings: “What risks have been raised in the last 2 weeks?”
Consolidates multiple team meetings into a single executive brief with zero manual effort

KT Transfer (Freshers/ Data refresh)

Converts multi-speaker technical walkthroughs into clear documentation with visuals.
Auto-generates task lists for learners and assigns follow-ups to mentors.
Allows appending additional questions or clarifications after the session via prompt-based AI.

Limitations of Microsoft’s Co-Pilot tool

Limited Context Retention

Copilot often fails to retain long conversation history or complex multi-turn interactions.
It may lose track of project-specific context or prior messages in a session.

Dependency on Microsoft Ecosystem

Works best with Microsoft 365 apps like Word, Excel, Outlook, Teams.
Limited or no support for third-party integrations outside Microsoft’s ecosystem.

Surface-Level Summarization

Meeting or email summaries may miss nuanced details or action items.
Lacks the ability to understand organizational priorities or deeper project dependencies.

No Real-Time Collaboration Memory

Doesn’t maintain memory across multiple meetings or team collaborations.
Cannot remember individual user preferences or past instructions unless explicitly documented.

Limited Customization or Fine-Tuning

Cannot be easily customized for specific workflows or niche use cases.
Fine-tuning or task-specific behavior is restricted compared to open-source LLMs.

Generic Responses

Sometimes provides generic or overly cautious answers, especially in creative or strategic tasks.
May struggle with domain-specific terminology or unconventional queries.

Offline Inaccessibility

Requires internet access and integration with Microsoft cloud services.
Not available in fully offline or local environments.

Our Advanced Co-Pilot features

End-to-End Meeting Intelligence

Captures real-time meeting transcripts across platforms (Zoom, Teams, Meet).
Extracts action items, decisions, and follow-ups with precise timestamps.
Supports multi-speaker tracking and role-based summaries.

Contextual Memory and Knowledge Retention

Unlike traditional copilots, it retains memory across sessions, meetings, and users.
Smart context stitching lets it refer back to past decisions, notes, and priorities.
Learns user preferences for summary depth, tone, and format.

Multi-Platform Compatibility

Works across Google Workspace, Microsoft 365, and Slack.
Exports insights into Notion, Jira, or your preferred task management system.

Semantic Action Item Generator

Uses LLM-powered understanding to convert vague meeting conversations into clear tasks.
Auto-assigns tasks based on speaker roles or detected owners.

Real-Time Knowledge Graph Creation

Builds a dynamic knowledge graph of topics, people, decisions, and deadlines.
Enables fast cross-meeting querying: “What was the last update on Project X?”

Smart Summary Modes

Offers summaries in bullet points, detailed narratives, email-ready format and word documentation details.
Tailors output for executives, tech teams, or clients automatically.

Output – Word Document

Output isn’t just textual it contains important snapshots from meeting and placed perfectly in the word document in between paragraphs
Output is well structures with all essential information

User customisable AI based edits

You can add additional pointers – Like if you want to capture FAQs from meetings
Go To -> Append Section -> Write your query -> LLM call -> Appends new heading into word document

Applications are limitless. But how can we implement?

Total Read Time Needed :

Minutes

Pre-requisites:

Python and Generative AI

( Any LLM )

Required Tools:

Python – Primary programming language used for development
tkinter – For creating GUI elements like buttons, dialogs, and windows
messagebox, simpledialog – For user alerts and input prompts in GUI
moviepy – For editing and processing audio and video files
PIL.ImageGrab – To capture screenshots and screen regions
pyaudio – For real-time audio recording
wave – For saving and playing back audio recordings
pydub – For manipulating and converting audio formats
threading – To run tasks in the background without freezing the UI
os, glob – For interacting with file systems and handling files
time, datetime – For time tracking, timestamps, and scheduling
math – For mathematical operations
numpy – For high-performance numerical computations
TfidfVectorizer (from sklearn) – For transforming text into numerical features
cosine_similarity (from sklearn) – For comparing similarity between text chunks
pandas – For structured data handling and analysis
cv2 (OpenCV) – For video frame processing and computer vision
easyocr – For extracting text from screen captures and images
whisper – For transcribing speech to text with AI
AzureOpenAI (openai) – For integrating Azure-hosted OpenAI language models
tenacity – For retrying API calls with exponential backoff
docx.Document – For creating and editing Word documents
docx.shared.Inches – For inserting images into Word documents with specified dimensions
tqdm – For progress bars and operation tracking
re – For using regular expressions to search and manipulate text
json – For reading and writing structured data
torch – For GPU support and running models like Whisper

Process

Start with Jupyter Notebook

Step 1

Development of a Python-based screen capturing tool with output saved in .mp4 format. Code Reference

Step 2

Utilized an .mp4 video file and extracted its audio component in .wav format to enable efficient transcription. The audio was segmented into 20-second intervals and transcribed using the Whisper Turbo model : an open-source solution known for its speed and accuracy based on our experimentation.

[
{
“start_time”: “00:00”,
“end_time”: “00:20”,
“transcript”: “Audio Transcript 1”
},
{
“start_time”: “00:20”,
“end_time”: “00:40”,
“transcript”: “Audio Transcript 2”
},
…
{
“start_time”: “00:(n-20)”,
“end_time”: “00:(n)”,
“transcript”: “Audio Transcript n”
}
]

Step 3

Extract OCR Tokens for same interval of 20 seconds and store in similar JSON structure

[
{
“start_time”: “00:00”,
“end_time”: “00:20”,
“ocr_text”: “OCR Visuals 1”
},
{
“start_time”: “00:20”,
“end_time”: “00:40”,
“ocr_text”: “OCR Visuals 2”
},
…
{
“start_time”: “00:(n-20)”,
“end_time”: “00:(n)”,
“ocr_text”: “OCR Visuals n”
}
]

Step 4

Create a DataFrame that aligns transcript and OCR tokens for every 20-second video segment.
To eliminate duplicate screen captures, we ensure each row is unique based on OCR tokens. If the same OCR tokens appear in multiple segments, we append the corresponding transcripts into a single row.
This de-duplication trick helps identify and consolidate repeated or static screens across the video.

import pandas as pd
data = [
{“OCR_Tokens”: [“Home”, “Login”], “Transcript”: “Welcome to the platform.”},
{“OCR_Tokens”: [“Home”, “Login”], “Transcript”: “Click the login button.”},
{“OCR_Tokens”: [“Dashboard”, “Settings”], “Transcript”: “Here is your dashboard view.”}
]
df = pd.DataFrame(data)
Deduplicate based on OCR_Tokens and combine transcripts
df_grouped = df.groupby(df[“OCR_Tokens”].apply(tuple), as_index=False).agg({
“OCR_Tokens”: “first”,
“Transcript”: lambda x: ” “.join(x)
})
print(df_grouped)

Step 5

Leverage LLM to gererate first draft of Documentation prioritizing Transcript and OCR data both

Step 6

For each paragraph perform cosine similarity search over all screenshots captured using opencv2 and link them with the relevant snippet of paragraph in the response generated via LLM

Step 7

Leverage the docx module to extract and save the generated response into a Word document. Use a Large Language Model (LLM) to create a Table of Contents and insert it on the first page, followed by the complete LLM-generated content throughout the document.

Step 8

Prompt the user for input regarding optional features, such as generating FAQs. If the user opts in, leverage a Large Language Model (LLM) to fetch relevant information and append the generated response at the end.

Obective – To provide AI developers with a quick-start framework and mindset for implementing an advanced, user-customizable Co-Pilot tool that operates independently of third-party dependencies. The solution has been tested by multiple users and has achieved an accuracy rate of 94%.
Author Name
Tweet

When to use EasyOCR and when to prefer PyTesseract for OCR extraction??

What did I preferred Whisper Turbo model?

What threshold did I chose to optimise snapshots captured via opencv2?

Did I just used Tkinter or Streamlit was used too?

You may get such questions. Write down in comment section and let’s connect for detailed discussion. 🙂

Info Box

Tool Features – Our advanced Co-Pilot tool offers significantly higher accuracy and customization compared to existing solutions like Microsoft Copilot. With a tested accuracy of 94%, it intelligently captures key moments, user interactions, and contextual information during meetings or workflows. Unlike other tools, it is fully user-customizable, operates independently of third-party platforms, and adapts dynamically to specific user needs. This makes it a robust, secure, and reliable AI assistant that delivers more precise and relevant outputs—bridging the gap where current Co-Pilot tools fall short.

Advanced Co-Pilot Tool – AI Buddy for cross domain functionality