If you have prior experience with machine learning, chances are that at some point in a project you needed to load a dataset, usually from a JSON, txt, or CSV file. This can be done with libraries such as pandas. However, it becomes cumbersome when you are debugging and need to load the dataset multiple times, because reading data from such files is relatively slow. To resolve this issue, we use NumPy to save our preprocessed data. As a result, not only is the data loaded faster, but you also will not need to preprocess it again.
Step-by-step code
First, you need to import the required libraries:
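A minimal import block for this workflow might look like the following (assuming pandas and NumPy are installed):

```python
import os  # used later to check whether the preprocessed file already exists

import numpy as np
import pandas as pd
```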
Now, we can use pandas and NumPy as follows to load the original dataset and save a preprocessed version. Make sure to change the source_file and target_file variables according to your needs.
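A sketch of what this step might look like; the file paths and the normalization step are illustrative placeholders, not part of the original post:

```python
import os

import numpy as np
import pandas as pd

source_file = "data/raw.csv"           # placeholder: point this at your dataset
target_file = "data/preprocessed.npy"  # placeholder: where the .npy copy goes

# Inline loader defined with a lambda: takes a file path, returns a DataFrame.
load_csv = lambda path: pd.read_csv(path, sep=",", header=0)

def create_preprocessed_dataset(source, target):
    # Stop early if the preprocessed file already exists.
    if os.path.exists(target):
        return
    data = load_csv(source).to_numpy(dtype=np.float32)
    # Illustrative preprocessing: scale each column to zero mean, unit variance.
    data = (data - data.mean(axis=0)) / (data.std(axis=0) + 1e-8)
    np.save(target, data)
```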
In the above code, load_csv is an inline function defined with a lambda. It takes a file path as a parameter and loads the file. Please read the documentation if the parameters of pd.read_csv are unclear or do not seem to make sense. Lastly, notice how the process stops when the target file already exists; this ensures that create_preprocessed_dataset is not needlessly executed. In the next step, we define another function to load the preprocessed data.
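Loading the saved array back can be a one-liner around np.load (a minimal sketch):

```python
import numpy as np

def load_preprocessed_dataset(target):
    # A single np.load call reads the whole preprocessed array back in,
    # which is much faster than re-parsing a CSV, txt, or JSON file.
    return np.load(target)
```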
Now that we have both create_preprocessed_dataset and load_preprocessed_dataset, the only remaining step is to use them as follows. Although create_preprocessed_dataset is always called, it returns early when the target_file already exists.
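An end-to-end sketch of the driver, with the two functions repeated so the snippet runs on its own; the demo CSV and temporary paths are placeholders for your real data:

```python
import os
import tempfile

import numpy as np
import pandas as pd

def create_preprocessed_dataset(source, target):
    # Returns early if the preprocessed file already exists.
    if os.path.exists(target):
        return
    df = pd.read_csv(source)
    # (real preprocessing steps would go here)
    np.save(target, df.to_numpy(dtype=np.float32))

def load_preprocessed_dataset(target):
    return np.load(target)

# Demo setup so the script runs anywhere; replace with your real paths.
workdir = tempfile.mkdtemp()
source_file = os.path.join(workdir, "raw.csv")
target_file = os.path.join(workdir, "preprocessed.npy")
pd.DataFrame({"a": [1, 2], "b": [3, 4]}).to_csv(source_file, index=False)

# The first call does the work; on later runs the early-exit check fires.
create_preprocessed_dataset(source_file, target_file)
data = load_preprocessed_dataset(target_file)
print(data.shape)  # (2, 2)
```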
Conclusion
In this post, we saw how to improve the speed of loading and preprocessing datasets. The goal is to load and preprocess the raw data only once across different executions, and then work with a npy file, which saves and loads faster than a CSV, txt, or JSON file.