pandas is an efficient tool for processing data, but when a dataset does not fit in memory, using pandas can be tricky.
Recently, we received a 10 GB+ dataset and tried to use pandas to preprocess it and save it as a smaller CSV file. When we attempted to load all the data into memory on our server (64 GB of RAM, more than half of which was already in use by colleagues), pandas exhausted the available memory and the task stalled.
Then I remembered the `chunksize` option in the related read functions, so we made another attempt and succeeded.
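The idea can be sketched as follows. A small in-memory CSV stands in for the real file here, and the per-chunk filter is a hypothetical placeholder for the actual preprocessing:

```python
import io
import pandas as pd

# a tiny in-memory CSV stands in for the real 10 GB+ file
csv_data = io.StringIO("value\n1\n-2\n3\n-4\n5\n")

# passing chunksize makes read_csv return an iterator of DataFrames,
# so only one chunk is held in memory at a time
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    # example per-chunk processing: count the positive rows
    total += len(chunk[chunk["value"] > 0])

print(total)  # 3 positive rows across all chunks
```

Each iteration yields a DataFrame of at most `chunksize` rows, so peak memory is bounded by the chunk size rather than the file size.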
Some aspects are worth paying attention to:
`chunksize` should not be too small. If it is too small, the I/O overhead will outweigh the memory savings. For example, with a file of one million lines, we ran a small experiment:
| Chunksize | Memory (MiB) | Time (s) |
|-----------|--------------|----------|
In our main task, we set `chunksize` to 200,000; processing the 10 GB+ dataset took 211.22 MiB of memory and 9 min 54 s.
The `mode` parameter of `pandas.DataFrame.to_csv()` should be set to `'a'` to append each chunk's results to a single file; otherwise, each chunk overwrites the file and only the last chunk is saved.
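A minimal sketch of the append pattern, with a small in-memory CSV as input and a hypothetical output path; note the header is written only for the first chunk so it does not repeat mid-file:

```python
import io
import os
import tempfile
import pandas as pd

src = io.StringIO("value\n1\n2\n3\n4\n")
out = os.path.join(tempfile.gettempdir(), "filtered.csv")  # hypothetical output path

for i, chunk in enumerate(pd.read_csv(src, chunksize=2)):
    chunk.to_csv(
        out,
        mode="w" if i == 0 else "a",  # fresh file for the first chunk, append after
        header=(i == 0),              # emit the header row only once
        index=False,
    )

print(pd.read_csv(out).shape)  # (4, 1): all chunks landed in one file
```

Using `mode="w"` for the first chunk also guards against stale data from a previous run being appended to.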