Yesterday, we encountered a very strange bug caused by the indexing mechanism of the pandas.read_csv function.
Because the data was huge, we used pandas to process the file chunk by chunk, but a very strange phenomenon occurred. The pseudo code looked like this:
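(A minimal sketch of the logic: huge_file.csv, the chunk size, the output path, and the compute_result helper are stand-ins, not our real code.)

```python
import pandas as pd

def compute_result(chunk):
    # Stand-in for the real per-chunk computation; returns one value per row.
    return [len(str(v)) for v in chunk['text']]  # 'text' is a hypothetical column

for i, chunk in enumerate(pd.read_csv('huge_file.csv', chunksize=10000)):
    result = compute_result(chunk)            # non-empty on every iteration
    chunk['new_series'] = pd.Series(result)   # the line that misbehaved
    # Append each processed chunk to the output file.
    chunk.to_csv('output.csv', mode='a', header=(i == 0))
```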
We confirmed that result was non-empty on every iteration of the loop. Yet only on the first iteration did the line chunk['new_series'] = pd.Series(result) actually carry the values; on every later iteration the new column came out empty. As a result, only the first chunk in the output contained new_series, and the rest were empty.
Three other colleagues helped me debug the code, but none of them could find anything wrong with this logic.
We posted the problem on Stack Overflow, and half an hour later we received an answer. The helper told us to move the initialization of result outside the chunk loop. Hence, the logic would be:
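(Same stand-in names as above.)

```python
result = []  # initialized once, outside the loop
for i, chunk in enumerate(pd.read_csv('huge_file.csv', chunksize=10000)):
    result.extend(compute_result(chunk))      # result keeps growing
    chunk['new_series'] = pd.Series(result)   # now every chunk gets its values
    chunk.to_csv('output.csv', mode='a', header=(i == 0))
```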
Strangely, this strategy worked, but it felt awkward to us. This way, result holds a value for every line of the file read so far, not just for the lines in the current chunk, so its length no longer matches the chunk's. How could the two be combined at all? We were really confused.
When we tracked the index of each chunk, we found that the chunks are not independent. We had assumed that each chunk's index would restart from 0, but in reality it does NOT: each chunk keeps the slice of the whole CSV's index that it came from. This is what caused the problem. pd.Series(result) is built with a fresh index starting at 0, and column assignment in pandas aligns on index labels rather than position, so the values line up only for the first chunk; for every later chunk no labels match and the new column is filled with NaN. That is why, in our initial logic, to_csv wrote new_series values only for the first chunk and left them empty for the rest.
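This is easy to reproduce in isolation. A freshly built Series (whose index starts at 0) contributes nothing to a chunk whose index starts elsewhere:

```python
import pandas as pd

# A later chunk of a file keeps its global index (e.g. 3, 4, 5);
# it does not restart from 0.
chunk = pd.DataFrame({'a': [10, 20, 30]}, index=[3, 4, 5])

# pd.Series(...) gets a fresh RangeIndex (0, 1, 2), so no labels match.
chunk['new_series'] = pd.Series(['x', 'y', 'z'])

print(chunk)
#     a new_series
# 3  10        NaN
# 4  20        NaN
# 5  30        NaN
```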
Therefore, a better solution is to rebuild the index of each chunk and then concatenate it with result. Resetting the chunk's index, new_chunk = chunk.reset_index(drop=True), is the key to solving this problem. (A plain chunk.reindex() with no arguments only conforms the frame to the index it already has; reset_index(drop=True) is what replaces it with a fresh one starting from 0.)
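Putting it together, with the same stand-in names as before:

```python
for i, chunk in enumerate(pd.read_csv('huge_file.csv', chunksize=10000)):
    result = compute_result(chunk)
    # Rebuild the chunk's index so it runs 0..len(chunk)-1 again.
    new_chunk = chunk.reset_index(drop=True)
    # Both sides now share the same 0-based RangeIndex, so the
    # concatenation lines up row by row.
    new_chunk = pd.concat(
        [new_chunk, pd.Series(result, name='new_series')], axis=1
    )
    # index=False so the per-chunk 0-based index is not written out.
    new_chunk.to_csv('output.csv', mode='a', header=(i == 0), index=False)
```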
Lesson learned: do not trust your intuition when you try a new function.