Vocabulary Processor

tflearn.data_utils.VocabularyProcessor (max_document_length, min_frequency=0, vocabulary=None, tokenizer_fn=None)

Maps documents to sequences of word ids.

Arguments

  • max_document_length: Maximum length of documents. If documents are longer, they will be trimmed; if shorter, they will be padded.
  • min_frequency: Minimum frequency of words in the vocabulary.
  • vocabulary: CategoricalVocabulary object.
  • tokenizer_fn: Optional tokenizer function to split documents into tokens. If None, a default tokenizer is used.

Attributes

  • vocabulary_: CategoricalVocabulary object.

Methods

fit (raw_documents, unused_y=None)

Learn a vocabulary dictionary of all tokens in the raw documents.

Arguments
  • raw_documents: An iterable which yields either str or unicode.
  • unused_y: Unused; present to match the fit() signature of estimators.
Returns

self

fit_transform (raw_documents, unused_y=None)

Learn the vocabulary dictionary and return the indices of words.

Arguments
  • raw_documents: An iterable which yields either str or unicode.
  • unused_y: Unused; present to match the fit_transform() signature of estimators.
Returns

X: iterable, [n_samples, max_document_length] Word-id matrix.

restore (cls, filename)

Restores vocabulary processor from given file.

Arguments
  • filename: Path to file to load from.
Returns

VocabularyProcessor object.

reverse (documents)

Reverses output of vocabulary mapping to words.

Arguments
  • documents: iterable, list of class ids.
Returns

Iterator over documents mapped back into words.

save (filename)

Saves vocabulary processor into given file.

Arguments
  • filename: Path to output file.

transform (raw_documents)

Transform documents to word-id matrix.

Convert words to ids with the vocabulary fitted with fit, or the one provided in the constructor.

Arguments
  • raw_documents: An iterable which yields either str or unicode.

Returns

X: iterable, [n_samples, max_document_length] Word-id matrix.
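
Examples

A minimal usage sketch; the sample sentences and the 'my_vocab' file name are illustrative, not part of the API:

import numpy as np
from tflearn.data_utils import VocabularyProcessor

docs = ['hello world', 'hello tflearn']
vp = VocabularyProcessor(max_document_length=4)
X = np.array(list(vp.fit_transform(docs)))    # (2, 4) word-id matrix, zero-padded
words = list(vp.reverse(X))                   # map ids back to words
vp.save('my_vocab')                           # persist the fitted processor
vp = VocabularyProcessor.restore('my_vocab')  # reload it later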

to_categorical

tflearn.data_utils.to_categorical (y, nb_classes)

Convert class vector (integers from 0 to nb_classes - 1) to binary class matrix, for use with categorical_crossentropy.

Arguments

  • y: array. Class vector to convert.
  • nb_classes: int. Total number of classes.
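
Examples

A small sketch with illustrative labels:

from tflearn.data_utils import to_categorical

y = [0, 2, 1, 2]
Y = to_categorical(y, nb_classes=3)
# Y = [[1., 0., 0.],
#      [0., 0., 1.],
#      [0., 1., 0.],
#      [0., 0., 1.]]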

pad_sequences

tflearn.data_utils.pad_sequences (sequences, maxlen=None, dtype='int32', padding='post', truncating='post', value=0.0)

Pad each sequence to the same length: the length of the longest sequence. If maxlen is provided, any sequence longer than maxlen is truncated to maxlen. Truncation happens off either the beginning or the end (default) of the sequence. Supports pre-padding and post-padding (default).

Arguments

  • sequences: list of lists where each element is a sequence.
  • maxlen: int, maximum length.
  • dtype: type to cast the resulting sequence.
  • padding: 'pre' or 'post', pad either before or after each sequence.
  • truncating: 'pre' or 'post', remove values from sequences larger than maxlen, either at the beginning or at the end of the sequence.
  • value: float, value used to pad the sequences.

Returns

x: numpy array with dimensions (number_of_sequences, maxlen)
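
Examples

A brief sketch with illustrative sequences, using the default post-padding and padding value 0:

from tflearn.data_utils import pad_sequences

seqs = [[1, 2, 3], [4, 5], [6]]
X = pad_sequences(seqs, maxlen=4)
# X = [[1, 2, 3, 0],
#      [4, 5, 0, 0],
#      [6, 0, 0, 0]]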


string_to_semi_redundant_sequences

tflearn.data_utils.string_to_semi_redundant_sequences (string, seq_maxlen=25, redun_step=3, char_idx=None)

Vectorize a string and return parsed sequences and targets, along with the associated dictionary.

Arguments

  • string: str. Lower-case text from input text file.
  • seq_maxlen: int. Maximum length of a sequence. Default: 25.
  • redun_step: int. Redundancy step. Default: 3.
  • char_idx: dict. A dictionary to convert chars to positions. Will be automatically generated if None.

Returns

A tuple: (inputs, targets, dictionary)
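
Examples

A hedged sketch of building character-level training data; 'input.txt' is a hypothetical text file:

from tflearn.data_utils import string_to_semi_redundant_sequences

text = open('input.txt').read().lower()
X, Y, char_idx = string_to_semi_redundant_sequences(text, seq_maxlen=25, redun_step=3)
# X: (n_sequences, 25, len(char_idx)) one-hot encoded input sequences
# Y: (n_sequences, len(char_idx)) one-hot encoded next-character targets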


Build HDF5 Image Dataset

tflearn.data_utils.build_hdf5_image_dataset (target_path, image_shape, output_path='dataset.h5', mode='file', categorical_labels=True, normalize=True, grayscale=False, files_extension=None, chunks=False)

Build an HDF5 dataset by providing either a root folder or a plain text file with image paths and class ids.

'folder' mode: Root folder should be arranged as follows:

ROOT_FOLDER -> SUBFOLDER_0 (CLASS 0) -> CLASS0_IMG1.jpg
                                     -> CLASS0_IMG2.jpg
                                     -> ...
            -> SUBFOLDER_1 (CLASS 1) -> CLASS1_IMG1.jpg
                                     -> ...
            -> ...

Note that if sub-folders are not integers from 0 to n_classes, an id will be assigned to each sub-folder following alphabetical order.

'file' mode: Plain text file should be formatted as follows:

/path/to/img1 class_id
/path/to/img2 class_id
/path/to/img3 class_id

Examples

# Load path/class_id image file:
dataset_file = 'my_dataset.txt'

# Build a HDF5 dataset (only required once)
from tflearn.data_utils import build_hdf5_image_dataset
build_hdf5_image_dataset(dataset_file, image_shape=(128, 128), mode='file',
                         output_path='dataset.h5', categorical_labels=True,
                         normalize=True)

# Load HDF5 dataset
import h5py
h5f = h5py.File('dataset.h5', 'r')
X = h5f['X']
Y = h5f['Y']

# Build neural network and train
import tflearn
network = ...
model = tflearn.DNN(network, ...)
model.fit(X, Y)

Arguments

  • target_path: str. Path of root folder or images plain text file.
  • image_shape: tuple (height, width). The image shape. Images that do not match that shape will be resized.
  • output_path: str. The output path for the hdf5 dataset. Default: 'dataset.h5'
  • mode: str in ['file', 'folder']. The data source mode. 'folder' accepts a root folder with each of its sub-folders representing a class containing the images to classify. 'file' accepts a single plain text file that lists every image path with its class id. Default: 'folder'.
  • categorical_labels: bool. If True, labels are converted to binary vectors.
  • normalize: bool. If True, normalize all pictures by dividing every image array by 255.
  • grayscale: bool. If True, images are converted to grayscale.
  • files_extension: list of str. A list of allowed image file extensions, for example ['.jpg', '.jpeg', '.png']. If None, all files are allowed.
  • chunks: bool. Whether to chunk the dataset or not. Use chunking only when you really need it; see the HDF5 documentation. If True, a sensible default chunk shape will be computed.
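
For 'folder' mode, a hedged sketch, assuming images sorted into one sub-folder per class under a hypothetical '/data/images/' root:

from tflearn.data_utils import build_hdf5_image_dataset
build_hdf5_image_dataset('/data/images/', image_shape=(128, 128),
                         mode='folder', output_path='dataset.h5',
                         categorical_labels=True, normalize=True)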

Image PreLoader

tflearn.data_utils.image_preloader (target_path, image_shape, mode='file', normalize=True, grayscale=False, categorical_labels=True, files_extension=None, filter_channel=False)

Create a Python array (Preloader) that loads images on the fly (from disk or URL). There are two ways to provide image samples, 'folder' or 'file'; see the specifications below.

'folder' mode: Load images from disk, given a root folder. This folder should be arranged as follows:

ROOT_FOLDER -> SUBFOLDER_0 (CLASS 0) -> CLASS0_IMG1.jpg
                                     -> CLASS0_IMG2.jpg
                                     -> ...
            -> SUBFOLDER_1 (CLASS 1) -> CLASS1_IMG1.jpg
                                     -> ...
            -> ...

Note that if sub-folders are not integers from 0 to n_classes, an id will be assigned to each sub-folder following alphabetical order.

'file' mode: A plain text file listing every image path and class id. This file should be formatted as follows:

/path/to/img1 class_id
/path/to/img2 class_id
/path/to/img3 class_id

Note that loading and converting images on the fly is time-inefficient, so you can instead use build_hdf5_image_dataset to build an HDF5 dataset that enables fast retrieval (that function takes similar arguments).

Examples

# Load path/class_id image file:
dataset_file = 'my_dataset.txt'

# Build the preloader array, resize images to 128x128
from tflearn.data_utils import image_preloader
X, Y = image_preloader(dataset_file, image_shape=(128, 128), mode='file',
                       categorical_labels=True, normalize=True)

# Build neural network and train
import tflearn
network = ...
model = tflearn.DNN(network, ...)
model.fit(X, Y)

Arguments

  • target_path: str. Path of root folder or images plain text file.
  • image_shape: tuple (height, width). The image shape. Images that do not match that shape will be resized.
  • mode: str in ['file', 'folder']. The data source mode. 'folder' accepts a root folder with each of its sub-folders representing a class containing the images to classify. 'file' accepts a single plain text file that lists every image path with its class id. Default: 'folder'.
  • categorical_labels: bool. If True, labels are converted to binary vectors.
  • normalize: bool. If True, normalize all pictures by dividing every image array by 255.
  • grayscale: bool. If True, images are converted to grayscale.
  • files_extension: list of str. A list of allowed image file extensions, for example ['.jpg', '.jpeg', '.png']. If None, all files are allowed.
  • filter_channel: bool. If True, images whose channel count is not 3 are filtered out.

Returns

(X, Y): with X the images array and Y the labels array.


shuffle

tflearn.data_utils.shuffle (*arrs)

Shuffle given arrays in unison, along their first axis.

Arguments

  • *arrs: Each array to shuffle in unison.

Returns

Tuple of shuffled arrays.
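
Examples

A small sketch with illustrative arrays; rows of X and entries of Y stay aligned after shuffling:

import numpy as np
from tflearn.data_utils import shuffle

X = np.arange(10).reshape(5, 2)
Y = np.arange(5)
X, Y = shuffle(X, Y)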


samplewise_zero_center

tflearn.data_utils.samplewise_zero_center (X)

Zero center each sample by subtracting its mean.

Arguments

  • X: array. The batch of samples to center.

Returns

A numpy array with the same shape as the input.


samplewise_std_normalization

tflearn.data_utils.samplewise_std_normalization (X)

Scale each sample by its standard deviation.

Arguments

  • X: array. The batch of samples to scale.

Returns

A numpy array with the same shape as the input.
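
Examples

A minimal sketch combining both samplewise operations on an illustrative random batch:

import numpy as np
from tflearn.data_utils import samplewise_zero_center, samplewise_std_normalization

X = np.random.rand(32, 28, 28)       # illustrative batch of samples
X = samplewise_zero_center(X)        # each sample now has (approximately) zero mean
X = samplewise_std_normalization(X)  # each sample now has (approximately) unit std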


featurewise_zero_center

tflearn.data_utils.featurewise_zero_center (X, mean=None)

Zero center every sample with the specified mean. If no mean is specified, it is evaluated over all samples.

Arguments

  • X: array. The batch of samples to center.
  • mean: float. The mean to use for zero centering. If not specified, it will be evaluated on provided data.

Returns

A numpy array with the same shape as the input, or a tuple (array, mean) if no mean value was specified.


featurewise_std_normalization

tflearn.data_utils.featurewise_std_normalization (X, std=None)

Scale each sample by the specified standard deviation. If no std is specified, it is evaluated over all samples.

Arguments

  • X: array. The batch of samples to scale.
  • std: float. The std to use for scaling data. If not specified, it will be evaluated over the provided data.

Returns

A numpy array with the same shape as the input, or a tuple (array, std) if no std value was specified.
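
Examples

A hedged sketch: compute the statistics on the training set and reuse them on the test set (shapes and data are illustrative):

import numpy as np
from tflearn.data_utils import featurewise_zero_center, featurewise_std_normalization

X_train = np.random.rand(100, 32, 32, 3)
X_test = np.random.rand(20, 32, 32, 3)

X_train, mean = featurewise_zero_center(X_train)       # mean computed on training data
X_train, std = featurewise_std_normalization(X_train)  # std computed on training data
X_test = featurewise_zero_center(X_test, mean)         # reuse training statistics
X_test = featurewise_std_normalization(X_test, std)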


load_csv

tflearn.data_utils.load_csv (filepath, target_column=-1, columns_to_ignore=None, has_header=True, categorical_labels=False, n_classes=None)

Load data from a CSV file. By default the labels are considered to be the last column, but this can be changed by setting the 'target_column' parameter.

Arguments

  • filepath: str. The csv file path.
  • target_column: int. The index of the column representing the labels. Default: -1 (the last column).
  • columns_to_ignore: list of int. A list of column indexes to ignore.
  • has_header: bool. Whether the csv file has a header or not.
  • categorical_labels: bool. If True, labels are returned as binary vectors (to be used with 'categorical_crossentropy').
  • n_classes: int. Total number of classes (required if categorical_labels is True).

Returns

A tuple (data, target).
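
Examples

A brief sketch, assuming a hypothetical 'data.csv' file whose first column holds a binary label:

from tflearn.data_utils import load_csv

data, labels = load_csv('data.csv', target_column=0,
                        categorical_labels=True, n_classes=2)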