Imports

Train and Eval Dataset

We can get the train and eval datasets by passing the problem-assigned params and a mode.

element_length_func[source]

element_length_func(yield_dict:Dict[str, Tensor])
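The signature suggests this helper returns a per-example length, typically used to bucket batches by sequence length. A minimal sketch of what such a function might compute (the key name `text_input_ids` is an assumption based on the feature names used later on this page):

```python
from typing import Dict, List


def element_length_func_sketch(yield_dict: Dict[str, List[int]]) -> int:
    # Return the sequence length of the tokenized text input; a length
    # like this can be used to group similarly-sized examples together.
    # NOTE: hypothetical sketch, not m3tl's actual implementation.
    return len(yield_dict['text_input_ids'])


example = {'text_input_ids': [101, 8554, 8310, 143, 10060, 102]}
print(element_length_func_sketch(example))  # 6
```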

train_eval_input_fn[source]

train_eval_input_fn(params:Params, mode='train')

This function writes and reads TFRecord files for training and evaluation.

Arguments: params {Params} -- Params object

Keyword Arguments: mode {str} -- ModeKeys (default: {TRAIN})

Returns: tf.data.Dataset -- TensorFlow dataset

train_dataset = train_eval_input_fn(
    params=params, mode=m3tl.TRAIN)
eval_dataset = train_eval_input_fn(
    params=params, mode=m3tl.EVAL
)

_ = next(train_dataset.as_numpy_iterator())
_ = next(eval_dataset.as_numpy_iterator())
2021-06-15 17:19:11.733 | WARNING  | m3tl.read_write_tfrecord:chain_processed_data:248 - Chaining problems with & may consume a lot of memory if data is not pyspark RDD.
2021-06-15 17:19:11.740 | DEBUG    | m3tl.read_write_tfrecord:_write_fn:134 - Writing /tmp/tmp2afsw8rx/weibo_fake_cls_weibo_fake_ner/train_00000.tfrecord
2021-06-15 17:19:11.771 | WARNING  | m3tl.read_write_tfrecord:chain_processed_data:248 - Chaining problems with & may consume a lot of memory if data is not pyspark RDD.
2021-06-15 17:19:11.777 | DEBUG    | m3tl.read_write_tfrecord:_write_fn:134 - Writing /tmp/tmp2afsw8rx/weibo_fake_cls_weibo_fake_ner/eval_00000.tfrecord
2021-06-15 17:19:11.803 | DEBUG    | m3tl.read_write_tfrecord:_write_fn:134 - Writing /tmp/tmp2afsw8rx/weibo_fake_multi_cls/train_00000.tfrecord
2021-06-15 17:19:11.827 | DEBUG    | m3tl.read_write_tfrecord:_write_fn:134 - Writing /tmp/tmp2afsw8rx/weibo_fake_multi_cls/eval_00000.tfrecord
2021-06-15 17:19:11.905 | DEBUG    | m3tl.read_write_tfrecord:_write_fn:134 - Writing /tmp/tmp2afsw8rx/weibo_masklm/train_00000.tfrecord
2021-06-15 17:19:11.955 | DEBUG    | m3tl.read_write_tfrecord:_write_fn:134 - Writing /tmp/tmp2afsw8rx/weibo_masklm/eval_00000.tfrecord
2021-06-15 17:19:12.697 | INFO     | __main__:train_eval_input_fn:37 - sampling weights: 
2021-06-15 17:19:12.698 | INFO     | __main__:train_eval_input_fn:38 - {
    "weibo_fake_cls_weibo_fake_ner": 0.3333333333333333,
    "weibo_fake_multi_cls": 0.3333333333333333,
    "weibo_masklm": 0.3333333333333333
}
2021-06-15 17:19:13.141 | INFO     | __main__:train_eval_input_fn:37 - sampling weights: 
2021-06-15 17:19:13.142 | INFO     | __main__:train_eval_input_fn:38 - {
    "weibo_fake_cls_weibo_fake_ner": 0.3333333333333333,
    "weibo_fake_multi_cls": 0.3333333333333333,
    "weibo_masklm": 0.3333333333333333
}
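The logs above show each of the three problem chunks assigned an equal sampling probability. A hypothetical sketch of how such uniform per-problem weights could be derived (the actual implementation may instead weight problems by their dataset sizes):

```python
from typing import Dict, List


def uniform_sampling_weights(problem_chunks: List[str]) -> Dict[str, float]:
    # Assign each problem chunk an equal probability of being sampled
    # when interleaving per-problem datasets into one training stream.
    # NOTE: illustrative sketch only, not m3tl's actual logic.
    weight = 1.0 / len(problem_chunks)
    return {name: weight for name in problem_chunks}


weights = uniform_sampling_weights(
    ['weibo_fake_cls_weibo_fake_ner', 'weibo_fake_multi_cls', 'weibo_masklm'])
print(weights['weibo_masklm'])  # 0.3333333333333333
```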

Predict Dataset

We can create a predict dataset by passing a list/generator of inputs and the problem-assigned params.

predict_input_fn[source]

predict_input_fn(input_file_or_list:Union[str, List[str]], params:Params, mode='infer', labels_in_input=False)

Input function that takes a file path or a list of strings and converts it to a tf.data.Dataset.

Example:

predict_fn = lambda: predict_input_fn('test.txt', params)
pred = estimator.predict(predict_fn)

Arguments: input_file_or_list {str or list} -- file path or list of strings

params {Params} -- Params object

Keyword Arguments: mode {str} -- ModeKeys (default: {PREDICT})

Returns: tf.data.Dataset -- TensorFlow dataset

Single modal inputs

from m3tl.utils import set_phase
from m3tl.special_tokens import PREDICT
set_phase(PREDICT)
single_dataset = predict_input_fn(
    ['this is a test']*5, params=params)
first_batch = next(single_dataset.as_numpy_iterator())
assert first_batch['text_input_ids'].tolist()[0] == [
    101,  8554,  8310,   143, 10060,   102]
2021-06-15 17:19:16.349 | INFO     | m3tl.utils:set_phase:478 - Setting phase to infer

Multi-modal inputs

mm_input = [{'text': 'this is a test',
             'image': np.zeros(shape=(5, 10), dtype='float32')}] * 5
mm_dataset = predict_input_fn(
    mm_input, params=params)
first_batch = next(mm_dataset.as_numpy_iterator())
assert first_batch['text_input_ids'].tolist()[0] == [
    101,  8554,  8310,   143, 10060,   102]
assert first_batch['image_input_ids'].tolist()[0] == np.zeros(
    shape=(5, 10), dtype='float32').tolist()
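As the assertions above suggest, each modality key in the input dict surfaces in the batch under a `{modality}_input_ids` feature name. A hypothetical sketch of that naming convention (tokenization of the text modality is omitted; this only illustrates how keys map to feature names, not m3tl's actual preprocessing):

```python
import numpy as np


def flatten_multimodal_input(example: dict) -> dict:
    # Map each modality key ('text', 'image', ...) to a
    # '{modality}_input_ids' feature name, mirroring the batch keys
    # seen above. NOTE: illustrative sketch of the naming scheme only.
    return {f'{key}_input_ids': value for key, value in example.items()}


flat = flatten_multimodal_input(
    {'text': 'this is a test', 'image': np.zeros((5, 10), dtype='float32')})
print(sorted(flat))  # ['image_input_ids', 'text_input_ids']
```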