This repository was archived by the owner on Nov 16, 2023. It is now read-only.

Large(?) memory leak when transforming pandas dataframe #152

@daholste

Description

It looks like transforming a pandas dataframe results in a large memory leak.

Transforming the same pandas dataframe repeatedly with this script:

import gc
import os

import pandas as pd
import psutil

from nimbusml import FileDataStream, Pipeline
from nimbusml.feature_extraction.categorical import OneHotVectorizer

# Fit a one-hot vectorizer pipeline on a small input file.
file_data_stream = FileDataStream.read_csv('D:/1MB.csv', sep=',')
t = OneHotVectorizer(columns={"FavoriteColor": "FavoriteColor", "Age": "Age", "Weight": "Weight", "PoliceTickets": "PoliceTickets"})
p = Pipeline([t])
p.fit(file_data_stream)

# Repeatedly transform the same dataframe and report resident set size.
df = pd.read_csv('D:/100MB.csv')
process = psutil.Process(os.getpid())
while True:
    p.transform(df)
    gc.collect()
    print(f'{process.memory_info().rss / 1000000} MB memory used')

prints:

1403.715584 MB memory used
2307.145728 MB memory used
2311.63904 MB memory used
2469.801984 MB memory used
3040.41984 MB memory used
3303.36256 MB memory used
3779.653632 MB memory used
4620.894208 MB memory used
4613.496832 MB memory used
4871.081984 MB memory used
5349.572608 MB memory used
5607.50592 MB memory used
6143.660032 MB memory used
6438.129664 MB memory used
6883.868672 MB memory used
7275.70432 MB memory used
7645.483008 MB memory used
8041.20576 MB memory used
8570.998784 MB memory used
8833.388544 MB memory used
9374.048256 MB memory used
9605.537792 MB memory used
9997.225984 MB memory used
10345.340928 MB memory used
11363.401728 MB memory used

Any ideas what could be going wrong?
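
One way to narrow this down: the sketch below (reusing the same placeholder paths, with the pipeline trimmed to a single column) compares growth of the Python heap, as tracked by tracemalloc, against the process RSS. tracemalloc only sees allocations made through Python's allocator, so if RSS keeps climbing while the traced heap stays flat, the leaked memory is most likely held on the native side of nimbusml rather than by Python objects.

import gc
import os
import tracemalloc

import pandas as pd
import psutil

from nimbusml import FileDataStream, Pipeline
from nimbusml.feature_extraction.categorical import OneHotVectorizer

# Same repro setup as above; paths and column names are placeholders.
file_data_stream = FileDataStream.read_csv('D:/1MB.csv', sep=',')
p = Pipeline([OneHotVectorizer(columns={"FavoriteColor": "FavoriteColor"})])
p.fit(file_data_stream)

df = pd.read_csv('D:/100MB.csv')
process = psutil.Process(os.getpid())

tracemalloc.start()
baseline, _ = tracemalloc.get_traced_memory()
for _ in range(10):
    p.transform(df)
    gc.collect()
    traced, _ = tracemalloc.get_traced_memory()
    # A gap between Python-heap growth and RSS growth points at native allocations.
    print(f'python heap: {(traced - baseline) / 1e6:.1f} MB, '
          f'rss: {process.memory_info().rss / 1e6:.1f} MB')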

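Until the leak is tracked down, a possible containment sketch (an assumption on my part, not a confirmed fix) is to run each transform in a short-lived worker process, so that any native memory is reclaimed when the worker exits. The transform_once helper below is hypothetical and refits the pipeline inside every worker, trading startup cost for a bounded footprint:

from multiprocessing import get_context

def transform_once(csv_path):
    # Build, fit, and transform entirely inside the worker so all memory,
    # native or otherwise, is released when the process exits.
    import pandas as pd
    from nimbusml import FileDataStream, Pipeline
    from nimbusml.feature_extraction.categorical import OneHotVectorizer

    fds = FileDataStream.read_csv('D:/1MB.csv', sep=',')
    p = Pipeline([OneHotVectorizer(columns={"FavoriteColor": "FavoriteColor"})])
    p.fit(fds)
    return p.transform(pd.read_csv(csv_path))

if __name__ == '__main__':
    ctx = get_context('spawn')
    for _ in range(5):
        # maxtasksperchild=1 forces a fresh worker for every transform.
        with ctx.Pool(1, maxtasksperchild=1) as pool:
            result = pool.apply(transform_once, ('D:/100MB.csv',))
            print(f'transformed {len(result)} rows')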