
If you have prior experience with machine learning, chances are that at some point in a project you needed to load a dataset, usually from a JSON, txt, or CSV file. This can be done with libraries such as pandas. However, it becomes cumbersome when you are debugging and need to load the dataset multiple times, because reading data from such files is relatively slow. To resolve this issue, we use NumPy to save our preprocessed data. As a result, not only is loading faster, but you also do not need to preprocess the data again.

Step-by-step code

First, you need to import the required libraries:
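For this workflow, the imports would typically look like this:

```python
import os

import numpy as np
import pandas as pd
```
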

Now, we can use pandas and NumPy as follows to load the original dataset and save a preprocessed version. Make sure to change the source_file and target_file variables according to your needs.

In the above code, load_csv is an inline function defined with lambda. It takes a file path as a parameter and loads the file. If any of the parameters passed to pd.read_csv are unclear or do not seem to make sense, please consult the pandas documentation. Lastly, notice how the process stops early when the target file already exists; this ensures that create_preprocessed_dataset does no needless work. In the next step, we define another function to load the preprocessed data.
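A sketch of the loader, assuming the target file was written with np.save:

```python
import numpy as np

def load_preprocessed_dataset(target_file):
    # np.load reads the binary .npy file directly into an array --
    # no parsing and no repeated preprocessing.
    return np.load(target_file)
```
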

Now that we have both create_preprocessed_dataset and load_preprocessed_dataset, the only remaining step is to use them as follows. Although create_preprocessed_dataset is always called, it returns immediately when target_file already exists.
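Putting it together, a self-contained sketch of the full flow (the function bodies are repeated here for completeness, and the paths are placeholders):

```python
import os

import numpy as np
import pandas as pd

source_file = "dataset.csv"  # placeholder paths
target_file = "dataset.npy"

load_csv = lambda path: pd.read_csv(path)

def create_preprocessed_dataset(source_file, target_file):
    if os.path.exists(target_file):
        return  # preprocessed file exists; nothing to do
    data = load_csv(source_file).dropna().to_numpy(dtype=np.float64)
    np.save(target_file, data)

def load_preprocessed_dataset(target_file):
    return np.load(target_file)

# The create call is cheap on every run after the first:
# it hits the early return as soon as target_file exists.
if os.path.exists(source_file):
    create_preprocessed_dataset(source_file, target_file)
    data = load_preprocessed_dataset(target_file)
```
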

Conclusion

In this post, we saw how to improve the speed of loading and preprocessing datasets. The whole goal is to load and preprocess the raw data only once across different executions, and then work with a .npy file, which saves and loads faster than a CSV, txt, or JSON file.