Read specific lines of a text file without loading it in memory.
A C backend uses mmap to map a text file into the virtual address space of the process. This makes it possible to read specific lines of the file without loading the whole file into memory.
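To illustrate the idea, here is a minimal pure-Python sketch built on the standard mmap module. It is not the C backend shipped with this project: the class name LineMapSketch and its close method are hypothetical, and the sketch assumes a non-empty, UTF-8 encoded file with "\n" line endings. It stores one 8-byte offset per line, and the operating system pages file contents in only when a line is actually read.

```python
import mmap
import os
import tempfile
from array import array


class LineMapSketch:
    """Illustrative stand-in for the idea, not the project's FileMap."""

    def __init__(self, path):
        self._file = open(path, "rb")
        # Length 0 maps the whole file; the OS loads pages lazily on access.
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        # One unsigned 64-bit offset per line: where each line starts.
        self._offsets = array("Q", [0])
        pos = self._mm.find(b"\n")
        while pos != -1:
            self._offsets.append(pos + 1)
            pos = self._mm.find(b"\n", pos + 1)
        # A trailing newline does not start a new line.
        if self._offsets[-1] >= len(self._mm):
            self._offsets.pop()

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, i):
        if not 0 <= i < len(self._offsets):
            raise IndexError(i)
        start = self._offsets[i]
        if i + 1 < len(self._offsets):
            end = self._offsets[i + 1]
        else:
            end = len(self._mm)
        # Only the bytes of the requested line are copied out of the mapping.
        return self._mm[start:end].rstrip(b"\n").decode("utf-8")

    def close(self):
        self._mm.close()
        self._file.close()


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("first\nsecond\nthird\n")
    lines = LineMapSketch(tmp.name)
    print(len(lines), lines[1])
    lines.close()
    os.remove(tmp.name)
```

Scanning for newlines touches every page of the file once, but the pages are not kept in process memory; only the offset array stays allocated.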
This comes in handy when you need to sample lines from a large text file (e.g. 100 GB) that does not fit in memory. Instead of allocating a number of bytes equal to the size of the file, you only allocate a number of bytes equal to the number of lines times the size of a pointer to char (4 bytes on 32-bit machines, 8 bytes on 64-bit machines).
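The arithmetic above can be checked with a few lines of Python. The helper name index_size_bytes and the line count of one billion are assumptions made for illustration; they are not measurements of this library.

```python
import ctypes


def index_size_bytes(num_lines, pointer_size=ctypes.sizeof(ctypes.c_char_p)):
    """Bytes needed to hold one char pointer per line."""
    return num_lines * pointer_size


if __name__ == "__main__":
    # Hypothetical file: one billion lines.
    num_lines = 1_000_000_000
    print("32-bit:", index_size_bytes(num_lines, 4), "bytes")
    print("64-bit:", index_size_bytes(num_lines, 8), "bytes")
    print("this machine:", index_size_bytes(num_lines), "bytes")
```

The saving therefore grows with the average line length: the longer the lines, the smaller the index is relative to the file.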
The class FileMap has four methods:
__init__: to create the mapping
__len__: to retrieve the number of lines mapped
__getitem__: to access a given line
__del__: to unmap and free memory
The class FileMap can be easily integrated with the Dataset classes of deep learning frameworks like PyTorch and MXNet (see examples).
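As a rough sketch of that integration (separate from the examples shipped with the project), a map-style dataset only needs __len__ and __getitem__, which is exactly what FileMap provides. The names LineDataset, transform and sample_lines below are hypothetical, and the class is kept free of framework imports; with PyTorch you would typically subclass torch.utils.data.Dataset instead. Any object with __len__ and __getitem__ can be passed as the line source, so a plain list stands in for a FileMap here.

```python
import random


class LineDataset:
    """Map-style dataset over any indexable line source (hypothetical name)."""

    def __init__(self, lines, transform=None):
        self.lines = lines          # e.g. a FileMap instance
        self.transform = transform  # optional per-line preprocessing

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        line = self.lines[idx]
        return self.transform(line) if self.transform else line


def sample_lines(dataset, k, seed=None):
    """Draw k distinct lines at random without reading the others."""
    rng = random.Random(seed)
    indices = rng.sample(range(len(dataset)), k)
    return [dataset[i] for i in indices]


if __name__ == "__main__":
    # A list stands in for the memory-mapped file in this sketch.
    source = ["a b c", "d e", "f"]
    dataset = LineDataset(source, transform=str.split)
    print(len(dataset), dataset[0])
    print(sample_lines(dataset, 2, seed=0))
```

Because only the sampled indices are accessed, only the corresponding lines are ever read from the mapped file.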
The source of the C backend is provided. If you make changes to it, you have to rebuild the shared object:
gcc -shared -o bustalines.so bustalines.c