mmap_ninja
Here, you can find a full list of the things you can do with mmap-ninja
.
Contents
Utils API:
Numpy API:
- Create a Numpy memmap from a Numpy array
- Create a Numpy memmap from a generator
- Open an existing Numpy memmap
- Append new samples to a Numpy memmap
Ragged API:
- Create a RaggedMmap from list of samples
- Create a RaggedMmap from a generator
- Open an existing RaggedMmap
- Append new samples to a RaggedMmap
String API:
- Create a StringsMmap from list of strings
- Create a StringsMmap from a string generator
- Open an existing StringsMmap
- Append new samples to a StringsMmap
Wrapped
The Wrapped
class allows you to lazily apply a function
whenever __getitem__
is called.
For example:
import numpy as np
import torch
from mmap_ninja.base import Wrapped
dataset = [np.array([1, 2, 3], np.array([-1, 10]))]
wrapped = Wrapped(dataset, wrapper_fn=torch.tensor)
print(wrapped[14])
Before the wrapper_fn
is called, a copy of the sample is made to avoid warnings
about non-writeable underlying tensor. If you are going to only read from the sample,
you can pass copy_before_wrapper_fn=False
when initializing the object.
For example:
import numpy as np
import torch
from mmap_ninja.base import Wrapped
dataset = [np.array([1, 2, 3], np.array([-1, 10]))]
wrapped = Wrapped(dataset, wrapper_fn=torch.tensor, copy_before_wrapper_fn=False)
print(wrapped[14])
Create a Numpy memmap from a Numpy array
The mmap_ninja.numpy
module provides utilities for
initializing and reading from a np.memmap
.
You can create a np.memmap
from an np.array
using
this example:
import numpy as np
from mmap_ninja import numpy as np_ninja
arr = np.random.randn(200, 224, 224, 3)
np_ninja.from_ndarray('imgs_mmap', arr)
Create a Numpy memmap from a generator
Very often, you cannot load the whole dataset into memory.
For this reason, mmap-ninja
provides a method for initializing
a np.memmap
from a generator of samples, which flushes to disk
every batch_size
samples. Optionally, a progress bar can be shown
during the conversion using tqdm
, since the conversion usually takes a long time.
import matplotlib.image as mpimg
import numpy as np
from mmap_ninja import numpy as np_ninja
from pathlib import Path
from os import listdir
imgs_dir = Path('./path_to_img_dir')
np_ninja.from_generator(
out_dir='imgs_mmap',
sample_generator=map(mpimg.imread, imgs_dir.iterdir()),
batch_size=32,
)
Open an existing Numpy memmap
Once you have created a np.memmap
, you can open it
in later stages of the project using np_ninja.open_existing
from mmap_ninja import numpy as np_ninja
memmap = np_ninja.open_existing("growable")
Once you have opened it, you can do all the usual numpy operations
on the np.memmap
.
Append new samples to a Numpy memmap
To append/extend new sample, just use the np_ninja.extend
or np_ninja.extend_dir
method. The difference between the two methods
is the type of the input, one uses np.memmap
directly, while the other
uses the directory, in which the np.memmap
is persisted.
import numpy as np
from mmap_ninja import numpy as np_ninja
arr = np.arange(3)
memmap = np_ninja.from_ndarray("growable", arr)
np_ninja.extend(memmap, np.arange(11, 13))
np_ninja.extend_dir("growable", np.arange(14, 16))
Create a RaggedMmap from list of samples
If your samples are of different shapes, then you should use
RaggedMmap
. If they all fit into memory, you can use the
from_lists
class method:
import numpy as np
from mmap_ninja.ragged import RaggedMmap
simple = [np.array([11, 13, -1, 17]), np.array([2, 3, 4, 19]), np.array([90, 12])]
RaggedMmap.from_lists("simple", simple)
Create a RaggedMmap from a generator
If your samples don't fit into memory, you can initialize
it from a generator, which yields samples (a sample is a np.ndarray
).
This method flushes to disk every batch_size
samples. Optionally, a progress bar can be shown
during the conversion using tqdm
, since the conversion usually takes a long time.
import matplotlib.pyplot as plt
from mmap_ninja.ragged import RaggedMmap
from pathlib import Path
img_path = Path('<PATH TO IMAGE DATASET>')
val_images = RaggedMmap.from_generator(
out_dir='val_images',
sample_generator=map(plt.imread, img_path.iterdir()),
batch_size=1024,
verbose=True
)
Open an existing RaggedMmap
Once the RaggedMmap
has been created, just open it using its
constructor:
import numpy as np
from mmap_ninja.ragged import RaggedMmap
images = RaggedMmap('val_images')
assert isinstance(np.ndarray, images[4])
# len(images) now returns the number of samples
# images[5] is now a numpy array representing the 5-th sample
For convenience, you can also pass in a wrapper function to be applied
to every sample after __getitem__
.
import numpy as np
import torch
from mmap_ninja.ragged import RaggedMmap
images = RaggedMmap('val_images', wrapper_fn=torch.tensor)
assert isinstance(torch.Tensor, images[4])
Before the wrapper_fn
is called, a copy of the sample is made to avoid warnings
about non-writeable underlying tensor. If you are going to only read from the sample,
you can pass copy_before_wrapper_fn=False
when initializing the object.
import numpy as np
import torch
from mmap_ninja.ragged import RaggedMmap
images = RaggedMmap('val_images', wrapper_fn=torch.tensor, copy_before_wrapper_fn=False)
assert isinstance(torch.Tensor, images[4])
Append new samples to a RaggedMmap
To append a single sample, use RaggedMmap.append
.
To append multiple samples, use RaggedMmap.extend
.
import numpy as np
from mmap_ninja.ragged import RaggedMmap
mmap = RaggedMmap('samples')
new_samples = [np.array([123, -1]), np.array([-1, 0, 123, 92, 12])]
mmap.extend(new_samples)
Create a StringsMmap from list of strings
To create a StringsMmap
from a list of strings, use the
StringsMmap.from_strings
method:
from mmap_ninja.string import StringsMmap
list_of_strings = ["foo", "bar", "bidon", "zele", "slanina"]
memmap = StringsMmap.from_strings("strings_memmap", list_of_strings, verbose=True)
Create a StringsMmap from a string generator
Sometimes, the list of strings cannot fit into memory.
In these cases, you can initialize a StringsMmap
from a generator,
flushing every batch_size
samples to disk, and optionally, a progress bar can be shown
(verbose=True
) during the conversion using tqdm
, since the conversion usually takes a long time.
from pathlib import Path
from mmap_ninja.string import StringsMmap
def generate_strs(n):
for i in range(n):
yield str(i)
memmap = StringsMmap.from_generator(
out_dir="strings_mmap",
sample_generator=generate_strs(1_000_000_000),
batch_size=1024,
verbose=True
)
Open an existing StringsMmap
Once the StringsMmap
has been created, just open it using its
constructor:
from mmap_ninja.string import StringsMmap
memmap = StringsMmap('strings_mmap')
Now you can access its elements using indexing, e.g. memmap[5]
is a str
.
Append new samples to a StringsMmap
To append a single string, use StringsMmap.append
.
To append multiple samples, use StringsMmap.extend
.
from mmap_ninja.string import StringsMmap
mmap = StringsMmap('strings_mmap')
new_samples = ['foo', 'bar']
mmap.extend(new_samples)