A smart computer vision system that identifies and classifies people and pets in real-time using advanced deep learning techniques.

"Watch the app nail the purr-fect prediction β correctly spotting me and my friendβs cat, Felix!"
- Four-Class Detection: Accurately identifies `owner`, `pet`, `other person`, and `background` classes
- Adaptive Processing: Automatically switches between classification and segmentation for improved accuracy
- Real-Time Performance: ~33 FPS on consumer hardware (NVIDIA RTX 3050)
- Privacy-Focused: All processing happens locally on your device
- Interactive Controls: Toggle segmentation mode and visualize confidence scores
- Memory Efficient: Optimized for resource-constrained environments
This project combines transfer learning with efficient model deployment to create a responsive computer vision system that runs smoothly on mid-range hardware:
- Base Architecture: MobileNetV2 (finetuned from ImageNet weights)
- Enhancement: LRASPP MobileNetV3 segmentation model for challenging cases
- Confidence Threshold: Auto-switching between models at 0.7 confidence level
- Training Method: Transfer learning with frozen feature extraction layers
- Performance: 99.4% accuracy in ideal conditions, 84.2% in low light
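For reference, here is a minimal sketch of how the two models can be loaded with torchvision (the actual logic lives in `src/realtime_classifier.py`; the weights path and head layout match the sections below):

```python
import torch
import torch.nn as nn
from torchvision import models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Primary classifier: MobileNetV2 with the custom 4-class head described below
classifier = models.mobilenet_v2(weights=None)
classifier.classifier[1] = nn.Sequential(
    nn.Linear(classifier.classifier[1].in_features, 256),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(256, 4),
)
classifier.load_state_dict(torch.load("models/entity_classifier.pth", map_location=device))
classifier.to(device).eval()

# Fallback: pretrained LRASPP MobileNetV3 segmentation model from torchvision
segmenter = models.segmentation.lraspp_mobilenet_v3_large(
    weights=models.segmentation.LRASPP_MobileNet_V3_Large_Weights.DEFAULT
)
segmenter.to(device).eval()

CONFIDENCE_THRESHOLD = 0.7  # below this, fall back to segmentation-assisted classification
```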
| Attribute | Value |
|---|---|
| Architecture | MobileNetV2 (finetuned) |
| Input Resolution | 224x224 (resized from 640x480) |
| Output Classes | ['owner', 'pet', 'other person', 'background'] |
| Model Format | .pth |
| Model Size | ~10 MB (quantized) |
| Inference Speed | 33 FPS @ 640x480 |
| Hardware Tested | NVIDIA RTX 3050, CUDA 11.8 |
| Framework | PyTorch (Python 3.13.2) |
```bash
# Clone the repository
git clone https://github.com/dosqas/Realtime-Entity-Classifier.git
cd Realtime-Entity-Classifier

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

Run the real-time classifier:

```bash
python src/realtime_classifier.py
```

- Press `s` to toggle forced segmentation mode
- Press `q` to quit
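An illustrative sketch of the main capture loop and key handling follows (the real implementation is in `src/realtime_classifier.py`); it assumes the `classifier` object loaded in the sketch above, and the ImageNet normalization constants are an assumption:

```python
import cv2
import torch
from torchvision import transforms

CLASSES = ["owner", "pet", "other person", "background"]

# 640x480 webcam frames are resized to the model's 224x224 input
preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

cap = cv2.VideoCapture(0)
force_segmentation = False
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    x = preprocess(rgb).unsqueeze(0).to(next(classifier.parameters()).device)
    with torch.no_grad():
        probs = torch.softmax(classifier(x), dim=1)[0]
    conf, idx = probs.max(dim=0)
    label = f"{CLASSES[idx.item()]} ({conf.item():.2f})"
    cv2.putText(frame, label, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    cv2.imshow("Realtime Entity Classifier", frame)

    key = cv2.waitKey(1) & 0xFF
    if key == ord("s"):       # toggle forced segmentation mode (used by the fallback logic)
        force_segmentation = not force_segmentation
    elif key == ord("q"):     # quit
        break

cap.release()
cv2.destroyAllWindows()
```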
The system uses a modified MobileNetV2 architecture with:
```python
import torch.nn as nn
from torchvision import models

# Start from ImageNet-pretrained MobileNetV2, then swap in a custom 4-class head
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
model.classifier[1] = nn.Sequential(
    nn.Linear(model.classifier[1].in_features, 256),  # intermediate layer
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(256, 4),  # owner, pet, other person, background
)
```

- Intermediate Layer (256 neurons): Enhanced representational capacity
- ReLU Activation: Efficient non-linearity
- Dropout (0.2): Prevents overfitting
- Xavier Initialization: Improves convergence speed
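One possible way to apply that Xavier initialization to the new head (illustrative sketch; `model` is the network defined above, and only the freshly added layers are touched):

```python
import torch.nn as nn

# Xavier-initialize only the newly added Linear layers; pretrained weights stay untouched
for module in model.classifier[1]:
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)
```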
- Optimizer: Adam with selective training
- Learning Rate: 5e-5
- Weight Decay: 1e-5
- Loss Function: CrossEntropyLoss with label smoothing (0.1)
- Epochs: 10 (converged early)
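A sketch of this training setup (see `notebooks/classifier_build_and_train.ipynb` for the full version; `train_loader` is a hypothetical `DataLoader` over the four-class dataset):

```python
import torch
import torch.nn as nn

# Freeze the pretrained feature extractor; only the new classifier head is trained
for param in model.features.parameters():
    param.requires_grad = False

# "Selective training": only trainable parameters are handed to Adam
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),
    lr=5e-5,
    weight_decay=1e-5,
)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

model.train()
for epoch in range(10):
    for images, labels in train_loader:  # hypothetical DataLoader over the 4-class dataset
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```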
When classification confidence drops below threshold:
- An LRASPP MobileNetV3 segmentation model identifies people and pets
- Segmentation mask isolates the subject from the background
- Classification is re-run on the masked input
- System returns to normal mode after confidence improves
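A simplified sketch of this fallback path, reusing `classifier`, `segmenter`, and `CONFIDENCE_THRESHOLD` from the loading sketch above; the person/cat/dog class indices are an assumption about torchvision's VOC-style segmentation labels:

```python
import torch

SUBJECT_CLASSES = torch.tensor([8, 12, 15])  # assumed VOC-style indices: cat, dog, person

def classify_with_fallback(frame_tensor):
    """frame_tensor: normalized (1, 3, H, W) input; returns (class_index, confidence)."""
    with torch.no_grad():
        probs = torch.softmax(classifier(frame_tensor), dim=1)[0]
    conf, idx = probs.max(dim=0)
    if conf >= CONFIDENCE_THRESHOLD:
        return idx.item(), conf.item()

    # Low confidence: segment people/pets and mask out the background
    with torch.no_grad():
        seg_labels = segmenter(frame_tensor)["out"].argmax(dim=1, keepdim=True)
    mask = torch.isin(seg_labels, SUBJECT_CLASSES.to(seg_labels.device)).to(frame_tensor.dtype)

    # Re-run classification on the masked input
    with torch.no_grad():
        probs = torch.softmax(classifier(frame_tensor * mask), dim=1)[0]
    conf, idx = probs.max(dim=0)
    return idx.item(), conf.item()
```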
| Epoch | Loss | Accuracy | Δ Accuracy |
|---|---|---|---|
| 1 | 0.1600 | 95.01% | +0% |
| 2 | 0.0404 | 98.74% | +3.73% |
| 3 | 0.0293 | 99.11% | +0.37% |
| 4 | 0.0238 | 99.17% | +0.06% |
| 5 | 0.0209 | 99.34% | +0.17% |
| 6 | 0.0217 | 99.26% | -0.08% |
| 7 | 0.0194 | 99.35% | +0.09% |
| 8 | 0.0187 | 99.35% | +0.00% |
| 9 | 0.0158 | 99.48% | +0.13% |
| 10 | 0.0153 | 99.44% | -0.04% |
- Total Samples: 34,575
- Class Distribution:
  - `owner`: 8,750 samples (25.3%) – sourced from a 2:20 min video of myself walking around the house in varied lighting conditions, angles, and backgrounds.
  - `pet`: 4,575 samples (13.2%) – extracted from a 30-second video of my friend Bogdan's cat, Felix.
  - `other person`: 12,500 samples (36.2%) – includes 2,500 cropped face images from the Human Faces Kaggle dataset.
  - `background`: 8,750 samples (25.3%) – captured from a 30-second video of walking around the house with no subject in focus.
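For reference, a hypothetical helper (not part of the repo) showing how per-class frames can be sampled from the optional videos under `data/`:

```python
import cv2
from pathlib import Path

def extract_frames(video_path, output_dir, every_n=2):
    """Sample every n-th frame from a class video (e.g. data/owner/owner.mp4)
    into that class's images/ folder. Hypothetical helper for dataset generation."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    index = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:
            cv2.imwrite(str(output_dir / f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# Example: build the "owner" class from the optional owner.mp4
extract_frames("data/owner/owner.mp4", "data/owner/images")
```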
- Pet Detection:
  - Accuracy drops when <30% of the pet's body is visible
  - Low lighting reduces confidence by ~40%
- Person Identification:
  - Needs ≥92% confidence to reliably classify "owner" vs. "other person"
  - False positives with reflections (mirrors, glass)
  - May struggle with diverse "other person" examples
```python
# In realtime_classifier.py
CONFIDENCE_THRESHOLD = 0.7  # Default
```

```python
# In realtime_classifier.py
PET_MASK_ENABLED = True  # Set to False to disable generic pet detection
```
```
realtime-entity-classifier/
├── demo/
│   └── project_demo.gif                  # Project demo GIF
├── data/                                 # Dataset used for training and evaluation
│   ├── owner/                            # Images and optional video of the owner
│   │   ├── images/                       # Folder containing image samples
│   │   └── owner.mp4 (optional)          # Optional video for data generation
│   ├── pet/                              # Images and optional video of pets (e.g., cat, dog)
│   │   ├── images/
│   │   └── pet.mp4 (optional)
│   ├── other_people/                     # Images and optional video of non-owners
│   │   ├── images/
│   │   └── other_people.mp4 (optional)
│   └── background/                       # Background-only scenes
│       ├── images/
│       └── background.mp4 (optional)
├── models/                               # Trained model weights
│   └── entity_classifier.pth             # Main classifier model
├── notebooks/                            # Jupyter notebooks
│   └── classifier_build_and_train.ipynb
├── reports/                              # Reports and visualizations
│   ├── TEST_RESULTS.md                   # Full test performance summary
│   └── training_plots/
│       └── mobilenetv2_4class_finetune_20250420.jpg  # Training progress plot
├── src/
│   └── realtime_classifier.py            # Main application script
├── requirements.txt                      # Python dependencies
└── README.md                             # Project overview and usage guide
```
- PyTorch – for the powerful and flexible deep learning framework
- TorchVision – for pre-trained models and helpful computer vision utilities
- OpenCV – for enabling efficient image and video processing
- Human Faces Dataset (Kaggle) – used for training on diverse human faces for the "other person" class
- My friend Bogdan and his cat Felix – for providing the video data used to train the "pet" class
This project is licensed under the MIT License. See the LICENSE file for details.
Questions, feedback, or ideas? Reach out anytime at [email protected].