Reading Files into Pandas


Pandas is an awesome Python package for manipulating data. It provides various tools for importing data. In day-to-day work, CSV and JSON are the most common data formats to deal with. This post introduces some tricks for pandas.read_csv and pandas.read_json.

pandas.read_csv

pandas.read_csv is a convenient tool for importing CSV files; however, it is not as intuitive as it appears to be. It can parse data with either a Python or a C engine, and the C engine is much faster than the Python engine.

According to the doc, pandas.read_csv has four basic parameters: filepath_or_buffer, sep, delimiter, and delim_whitespace. The most important one here is sep: if sep is a single character such as a comma, pandas can use the C engine to increase reading speed; if sep is longer than one character, it is treated as a regular expression, and pandas falls back to the Python engine to parse the file.
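As a quick illustration (using an in-memory buffer and made-up data rather than a real file), a single-character sep works with the C engine, while a regex sep requires the Python engine:

```python
import io
import pandas as pd

csv_text = "a,b,c\n1,2,3\n4,5,6\n"

# A single-character separator is compatible with the fast C engine.
df_c = pd.read_csv(io.StringIO(csv_text), sep=",", engine="c")
print(df_c.shape)  # (2, 3)

# A multi-character (regex) separator requires the Python engine;
# passing engine="c" here would raise a ValueError.
messy = "a;  b; c\n1; 2; 3\n"
df_py = pd.read_csv(io.StringIO(messy), sep=r";\s*", engine="python")
print(list(df_py.columns))  # ['a', 'b', 'c']
```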

Reading a 471 MB CSV file containing 5,023,382 lines of data into pandas, the C engine cuts the time cost by more than 50%.

%time dd = csv2pd(in_file, engine='c', column_names=get_header(in_file), column_constant=get_header(in_file), sep=',', )
CPU times: user 14.6 s, sys: 1.93 s, total: 16.5 s
Wall time: 17.3 s

%time dd = csv2pd(in_file, engine='python', column_names=get_header(in_file), column_constant=get_header(in_file), sep=',')
CPU times: user 32.2 s, sys: 2.77 s, total: 35 s
Wall time: 38.6 s

Another trick to increase reading speed and reduce memory use is to read only the columns that are needed. However, this is not very easy to do with pandas.read_csv. The doc describes usecols as:

Return a subset of the columns. All elements in this array must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s).

Put simply, we need to provide the indices of the columns we want to select. If the CSV file has many columns and we only need a few of them, counting the indices by hand is not realistic. Two helper functions solve this problem: get_header fetches the header of a CSV file, and get_column_index fetches the indices of the needed columns.

def get_header(in_file, sep=","):
    """Return the header of a CSV file as a list of column names."""
    with open(in_file) as f:
        return next(f).strip("\n\r").split(sep)

def get_column_index(col_list, column):
    """
    :param column: the full list of column names (e.g. a COLUMN_CHAT constant)
    :param col_list: the wanted columns
    :return: a list of indices
    """
    col_index_dic = {v: k for (k, v) in enumerate(column)}
    return [col_index_dic[col] for col in col_list]
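The lookup these helpers perform boils down to inverting the header into a name-to-index dictionary. A minimal sketch with a hypothetical header:

```python
# Hypothetical header of a wide CSV file.
header = ["id", "name", "city", "score"]

# Invert it into a name -> index mapping, as get_column_index does.
col_index = {name: i for i, name in enumerate(header)}

wanted = ["name", "score"]
print([col_index[c] for c in wanted])  # [1, 3]
```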

Combining them, we can get the indices of the needed columns and pass them to the usecols option of pandas.read_csv. This simplifies the whole process.

import pandas as pd

def csv2pd(in_file, column_constant, column_names, sep=",", quote=3, engine='python'):
    """
    Use get_column_index to get a list of column indices.
    :param in_file: a CSV file
    :param column_names: the wanted column names
    :param column_constant: the full column header of the CSV
    :param sep: ',' by default
    :param quote: quoting behavior; 3 (QUOTE_NONE) excludes quotation marks
    :param engine: the engine used for reading data ('c' or 'python')
    :return: a trimmed pandas DataFrame
    """
    column_indices = get_column_index(column=column_constant, col_list=column_names)
    return pd.read_csv(in_file, usecols=column_indices, sep=sep, quoting=quote, engine=engine)
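The same column selection can be sketched end to end on an in-memory CSV (the data here is made up for illustration):

```python
import io
import pandas as pd

csv_text = "id,name,city,score\n1,Ann,NYC,9\n2,Bo,LA,7\n"

# Derive the wanted column indices from the header line.
header = csv_text.splitlines()[0].split(",")
wanted = ["name", "score"]
indices = [header.index(c) for c in wanted]  # [1, 3]

# Pass the indices to usecols; quoting=3 is csv.QUOTE_NONE.
df = pd.read_csv(io.StringIO(csv_text), usecols=indices, sep=",",
                 quoting=3, engine="c")
print(list(df.columns))  # ['name', 'score']
```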

Sometimes, CSV files contain quotation marks to reduce parsing errors, but these may confuse pandas.read_csv. If so, we need to specify how quotes should be handled with the quoting option.

In the doc, quoting has four choices:

Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
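To see the difference, compare QUOTE_MINIMAL (0), which interprets and strips quote characters during parsing, with QUOTE_NONE (3), which leaves them in the data (a small made-up sample):

```python
import csv
import io
import pandas as pd

raw = 'a,b\n"1","2"\n'

# QUOTE_MINIMAL: quotes are interpreted and stripped; values parse as numbers.
df_min = pd.read_csv(io.StringIO(raw), quoting=csv.QUOTE_MINIMAL)
print(df_min.loc[0, "a"])  # 1

# QUOTE_NONE: quotes are kept as literal characters in the values.
df_none = pd.read_csv(io.StringIO(raw), quoting=csv.QUOTE_NONE)
print(df_none.loc[0, "a"])  # "1"
```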

Therefore, we can set sep to a comma, quoting to 0, and engine to 'c' as the default settings when calling pandas.read_csv:

def quickest_read_csv(in_file, column_names):
    """
    :param in_file: a CSV file
    :param column_names: the wanted column names
    :return: a pandas DataFrame
    """
    data = csv2pd(column_constant=get_header(in_file), column_names=column_names, engine='c',
                  in_file=in_file, quote=0, sep=',')
    return data

pandas.read_json

Reading JSON files with pandas was introduced in version 0.12, and pandas.read_json has since become a very powerful tool.

By default, pandas.read_json supports the following JSON string formats:

split	dict like {index -> [index], columns -> [columns], data -> [values]}
records	list like [{column -> value}, ... , {column -> value}]
index	dict like {index -> {column -> value}}
columns	dict like {column -> {index -> value}}
values	just the values array
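For instance, the records format is a plain list of row dictionaries, selected with the orient parameter (a small made-up sample):

```python
import io
import json
import pandas as pd

# A made-up "records"-oriented JSON string: a list of row dicts.
records = [{"name": "Ann", "age": 30}, {"name": "Bo", "age": 25}]
df = pd.read_json(io.StringIO(json.dumps(records)), orient="records")
print(df.shape)  # (2, 2)
```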

However, sometimes we need to deal with line-delimited JSON files, where each line is a standalone JSON object, so we need to set lines to True.

def json2pd(in_file, col_list, lines=True):
    """
    :param in_file: a JSON file
    :param col_list: the columns to extract
    :param lines: True if the file is line-delimited
    :return: a pandas DataFrame
    """
    data = pd.read_json(in_file, lines=lines)
    result = data[col_list]
    return result
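A minimal sketch of what this does, run on an in-memory buffer with made-up data:

```python
import io
import pandas as pd

# Two line-delimited JSON records, one object per line.
lines_json = '{"name": "Ann", "age": 30}\n{"name": "Bo", "age": 25}\n'
df = pd.read_json(io.StringIO(lines_json), lines=True)

# Keep only the wanted columns, as json2pd does.
result = df[["name"]]
print(result.shape)  # (2, 1)
```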

Posted with: python
