Imports

Decorator utility functions

has_key_startswith[source]

has_key_startswith(d:dict, prefix:str)
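This helper is undocumented, so here is a minimal sketch of the likely behavior, inferred from the name alone (an assumption, not the library's exact implementation): return True when any key of d starts with prefix.

def has_key_startswith_sketch(d: dict, prefix: str) -> bool:
    # True if at least one key begins with the given prefix
    return any(k.startswith(prefix) for k in d)

assert has_key_startswith_sketch({'inputs_text': 'a test'}, 'inputs')
assert not has_key_startswith_sketch({'labels': 'a'}, 'inputs')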

convert_legacy_output[source]

convert_legacy_output(inp:Generator[tuple, NoneType, NoneType])

Convert legacy preproc output to dictionary

Args: inp (Generator[tuple, None, None]): legacy format output

Returns: dict: new format output

Yields: Iterator[dict]
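A hedged usage sketch: the legacy format is a generator of (features, target) tuples, and the converter yields one dict per row. The exact key names of the produced dicts are determined by the library.

def legacy_gen():
    # hypothetical legacy output: a generator of (X, y) tuples
    for _ in range(3):
        yield 'this is a toy input', 'a'

for row in convert_legacy_output(legacy_gen()):
    assert isinstance(row, dict)  # new format: one dict per example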

input_format_check[source]

input_format_check(inp:dict, mode:str)

none_generator[source]

none_generator(length:int=None)
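A minimal sketch of the plausible behavior (an assumption, not the library's exact implementation): yield None length times, or indefinitely when length is None, e.g. as a stand-in target stream for problems without labels.

from typing import Generator, Optional

def none_generator_sketch(length: Optional[int] = None) -> Generator[None, None, None]:
    if length is None:
        # unbounded stream of None when no length is given
        while True:
            yield None
    else:
        for _ in range(length):
            yield None

assert list(none_generator_sketch(3)) == [None, None, None]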

convert_data_to_features[source]

convert_data_to_features(problem:str, data_iter:Iterable[T_co], params:Params, label_encoder:Any, tokenizer:Any, mode='train')
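A hedged call sketch based only on the signature above; params, label_encoder and tokenizer are assumed to have been created elsewhere, and the shape of the return value is library-defined.

features = convert_data_to_features(
    problem='toy_cls',
    data_iter=zip(['this is a toy input'] * 10, ['a'] * 10),  # (X, y) rows
    params=params,
    label_encoder=label_encoder,  # assumed pre-built label encoder
    tokenizer=tokenizer,          # assumed pre-loaded tokenizer
    mode='train',
)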

convert_data_to_features_pyspark[source]

convert_data_to_features_pyspark(problem:str, dataframe, params:Params, label_encoder:Any, tokenizer:Any, mode='train')

check_if_le_created[source]

check_if_le_created(problem:str, params:Params)

Decorator

preprocessing_fn[source]

preprocessing_fn(func:Callable)

Usually used as a decorator.

The decorated function should have the signature: func(params: m3tl.Params, mode: str) -> Union[Generator[Tuple[X, y], None, None], Tuple[List[X], List[y]]]

Where X can be:

  • Dictionary of 'a' and 'b' texts: {'a': 'a test', 'b': 'b test'}
  • Text: 'a test'
  • Dictionary of modalities: {'text': 'a test', 'image': np.array([1,2,3])}

Where y can be:

  • Text or scalar: 'label_a'
  • List of text or scalar: ['label_a', 'label_a1'] (for seq2seq and seq_tag)

This decorator will do the following things:

  • load the tokenizer
  • call func and save its output as example_list
  • create the label_encoder and count the number of rows in example_list
  • create BERT features from example_list and write TFRecord files

Args: func (Callable): preprocessing function for the problem
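Once decorated, a function like the toy_cls examples below is not limited to the two-argument signature; the commented test code further down suggests it also accepts keyword arguments such as get_data_num and write_tfrecord:

# Mirrors the commented test code in the "A, B Token Multi-modal" section.
toy_cls(params=params, mode=m3tl.TRAIN,
        get_data_num=True, write_tfrecord=False)   # returns e.g. (10, 1)
toy_cls(params=params, mode=m3tl.TRAIN,
        get_data_num=False, write_tfrecord=True)   # writes TFRecord files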

User-Defined Preprocessing Function

The user-defined preprocessing function should return two elements, features and targets, except for the pretrain problem type.

Features and targets can be returned in one of the following formats:

  • tuple of list
  • generator of tuple

Please note that if the preprocessing function returns a generator of tuples, the corresponding problem cannot be chained using &.

Tuple of List

Single Modal

@preprocessing_fn
def toy_cls(params: Params, mode: str) -> Tuple[list, list]:
    "Simple example to demonstrate singe modal tuple of list return"
    if mode == m3tl.TRAIN:
        toy_input = ['this is a toy input' for _ in range(10)]
        toy_target = ['a' for _ in range(10)]
    else:
        toy_input = ['this is a toy input for test' for _ in range(10)]
        toy_target = ['a' for _ in range(10)]
    return toy_input, toy_target
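After defining the function, the problem can be registered on the Params object so that it is addressable by name; this mirrors the commented test code in the TODO section below:

# `params` is assumed to be an m3tl.Params instance
params.register_problem(problem_name='toy_cls', problem_type='cls',
                        processing_fn=toy_cls)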

Multi-modal

@preprocessing_fn
def toy_cls(params: Params, mode: str) -> Tuple[list, list]:
    "Simple example to demonstrate multi-modal tuple of list return"
    if mode == m3tl.TRAIN:
        toy_input = [{'text': 'this is a toy input',
                      'image': np.random.uniform(size=(16))} for _ in range(10)]
        toy_target = ['a' for _ in range(10)]
    else:
        toy_input = [{'text': 'this is a toy input for test',
                      'image': np.random.uniform(size=(16))} for _ in range(10)]
        toy_target = ['a' for _ in range(10)]

    return toy_input, toy_target

A, B Token Multi-modal

TODO: Implement this. Not working yet.

# params.register_problem(problem_name='toy_cls', problem_type='cls', processing_fn=toy_cls)
# assert (10, 1)==toy_cls(params=params, mode=m3tl.TRAIN, get_data_num=True, write_tfrecord=False)

# shutil.rmtree(os.path.join(params.tmp_file_dir, 'toy_cls'))
# toy_cls(params=params, mode=m3tl.TRAIN, get_data_num=False, write_tfrecord=True)
# assert os.path.exists(os.path.join(params.tmp_file_dir, 'toy_cls', 'train_feature_desc.json'))

Generator of Tuple

Single Modal

@preprocessing_fn
def toy_cls(params: Params, mode: str) -> Generator[tuple, None, None]:
    "Simple example to demonstrate single modal generator of tuple return"
    if mode == m3tl.TRAIN:
        toy_input = ['this is a toy input' for _ in range(10)]
        toy_target = ['a' for _ in range(10)]
    else:
        toy_input = ['this is a toy input for test' for _ in range(10)]
        toy_target = ['a' for _ in range(10)]
    for i, t in zip(toy_input, toy_target):
        yield i, t

Multi-modal

@preprocessing_fn
def toy_cls(params: Params, mode: str) -> Generator[tuple, None, None]:
    "Simple example to demonstrate multi-modal generator of tuple return"
    if mode == m3tl.TRAIN:
        toy_input = [{'text': 'this is a toy input',
                      'image': np.random.uniform(size=(16))} for _ in range(10)]
        toy_target = ['a' for _ in range(10)]
    else:
        toy_input = [{'text': 'this is a toy input for test',
                      'image': np.random.uniform(size=(16))} for _ in range(10)]
        toy_target = ['a' for _ in range(10)]
    for i, t in zip(toy_input, toy_target):
        yield i, t

PySpark DataFrame

Single Modal

Multi-modal

@preprocessing_fn
def toy_cls(params: Params, mode: str) -> RDD:
    get_or_make_label_encoder(params=params, problem='toy_cls', label_list=['a'], mode=mode)
    if mode == m3tl.TRAIN:
        d = {
            'inputs_text': ['this is a toy input' for _ in range(10)],
            'inputs_image': [np.random.uniform(size=(16)).tolist() for _ in range(10)],
            'labels': ['a' for _ in range(10)]
        }
    else:
        d = {
            'inputs_text': ['this is a toy input test' for _ in range(10)],
            'inputs_image': [np.random.uniform(size=(16)).tolist() for _ in range(10)],
            'labels': ['a' for _ in range(10)]
        }
    d = pd.DataFrame(d).to_dict('records')  # list of per-row dicts
    rdd = sc.parallelize(d)  # sc is assumed to be an existing SparkContext
    return rdd
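A hedged follow-up sketch: an RDD of records like the one above is presumably what convert_data_to_features_pyspark (signature shown earlier) expects as its dataframe argument; label_encoder and tokenizer are again assumed to exist.

features_rdd = convert_data_to_features_pyspark(
    problem='toy_cls',
    dataframe=rdd,                # RDD of per-row dicts as returned above
    params=params,
    label_encoder=label_encoder,  # assumed pre-built label encoder
    tokenizer=tokenizer,          # assumed pre-loaded tokenizer
    mode='train',
)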

