Skip to content

Conversation

@codeflash-ai
Copy link

@codeflash-ai codeflash-ai bot commented Oct 30, 2025

📄 126% (1.26x) speedup for PPTParser.convert_ppt_to_pptx in backend/python/app/modules/parsers/pptx/ppt_parser.py

⏱️ Runtime : 2.94 milliseconds 1.30 milliseconds (best of 190 runs)

📝 Explanation and details

The optimization introduces caching for LibreOffice availability checking using a class-level attribute _libreoffice_found. This eliminates the repeated subprocess.run(["which", "libreoffice"]) call that was executed on every method invocation.

Key changes:

  • LibreOffice check caching: The availability check now runs only once per class lifetime, storing the result in self.__class__._libreoffice_found
  • Direct return optimization: Removed intermediate variable pptx_content and return file content directly

Performance impact:
The line profiler shows the LibreOffice check (subprocess.run) takes ~3.7ms and represents 78-93% of total execution time. By caching this check, subsequent calls skip this expensive operation entirely. The optimization is most effective for:

  • Batch processing scenarios: When converting multiple PPT files in sequence, only the first call pays the LibreOffice check cost
  • Repeated conversions: Applications that perform multiple conversions benefit immediately after the first successful check
  • High-frequency usage: Services processing many PPT files see cumulative time savings

The 125% speedup (2.94ms → 1.30ms) demonstrates significant improvement, particularly valuable in production environments where PPT conversion happens repeatedly with the same parser instance.

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 🔘 None Found
🌀 Generated Regression Tests 12 Passed
⏪ Replay Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage 63.6%
🌀 Generated Regression Tests and Runtime
import os
import shutil
import subprocess
import tempfile

# imports
import pytest
from app.modules.parsers.pptx.ppt_parser import PPTParser

# unit tests

# Helper to check if LibreOffice is installed
def libreoffice_installed():
    try:
        subprocess.run(["which", "libreoffice"], check=True, capture_output=True)
        return True
    except subprocess.CalledProcessError:
        return False

# Helper to create a minimal valid .ppt file (not a real one, but enough for LibreOffice to process)
def minimal_ppt_bytes():
    # This is a minimal valid PPT file header (OLE Compound File header)
    # LibreOffice can process this as a valid PPT
    return (
        b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'  # OLE header
        + b'\x00' * 512  # pad to make it a bit larger
    )

# Helper to create a large fake PPT file
def large_ppt_bytes(size=1024*512):
    # Start with valid header, then pad
    return minimal_ppt_bytes() + b'A' * (size - len(minimal_ppt_bytes()))

@pytest.mark.skipif(not libreoffice_installed(), reason="LibreOffice not installed")
class TestConvertPptToPptxBasic:
    def setup_method(self):
        self.parser = PPTParser()

    
#------------------------------------------------
import os
import shutil
import subprocess
import tempfile

# imports
import pytest  # used for our unit tests
from app.modules.parsers.pptx.ppt_parser import PPTParser

# unit tests

# Helper function to check if LibreOffice is installed
def libreoffice_installed():
    return shutil.which("libreoffice") is not None

# Helper: create a minimal valid .ppt file (binary)
def minimal_ppt_bytes():
    # Minimal valid .ppt files start with D0 CF 11 E0 A1 B1 1A E1 (OLE header)
    # This is not a real PPT but enough for LibreOffice to attempt conversion
    return b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1' + b'\x00' * 512

# Helper: create a large .ppt file (simulate by repeating minimal bytes)
def large_ppt_bytes(size=1000):
    # 1000 slides simulated by repeating the minimal header (not a real PPT, but for stress test)
    return minimal_ppt_bytes() * size

# Helper: create a corrupted .ppt file (invalid header)
def corrupted_ppt_bytes():
    return b'not_a_valid_ppt_file'

# Helper: create a valid but empty .ppt file (OLE header only)
def empty_ppt_bytes():
    return b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'

@pytest.mark.skipif(not libreoffice_installed(), reason="LibreOffice must be installed for these tests")
class TestConvertPptToPptx:
    # 1. Basic Test Cases

    
#------------------------------------------------
from app.modules.parsers.pptx.ppt_parser import PPTParser
import pytest

def test_PPTParser_convert_ppt_to_pptx():
    with pytest.raises(SideEffectDetected, match='We\'ve\\ blocked\\ a\\ file\\ writing\\ operation\\ on\\ "/tmp/rahaxd23"\\.\\ CrossHair\\ should\\ not\\ be\\ run\\ on\\ code\\ with\\ side\\ effects'):
        PPTParser.convert_ppt_to_pptx(PPTParser(), b'')

To edit these changes git checkout codeflash/optimize-PPTParser.convert_ppt_to_pptx-mhcus5al and push.

Codeflash Static Badge

The optimization introduces **caching for LibreOffice availability checking** using a class-level attribute `_libreoffice_found`. This eliminates the repeated `subprocess.run(["which", "libreoffice"])` call that was executed on every method invocation.

**Key changes:**
- **LibreOffice check caching**: The availability check now runs only once per class lifetime, storing the result in `self.__class__._libreoffice_found`
- **Direct return optimization**: Removed intermediate variable `pptx_content` and return file content directly

**Performance impact:**
The line profiler shows the LibreOffice check (`subprocess.run`) takes ~3.7ms and represents 78-93% of total execution time. By caching this check, subsequent calls skip this expensive operation entirely. The optimization is most effective for:

- **Batch processing scenarios**: When converting multiple PPT files in sequence, only the first call pays the LibreOffice check cost
- **Repeated conversions**: Applications that perform multiple conversions benefit immediately after the first successful check
- **High-frequency usage**: Services processing many PPT files see cumulative time savings

The 125% speedup (2.94ms → 1.30ms) demonstrates significant improvement, particularly valuable in production environments where PPT conversion happens repeatedly with the same parser instance.
@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 30, 2025 03:16
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Oct 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant