for problem_type in params.list_available_problem_types():
    print('`{problem_type}`: {desc}'.format(
        desc=params.problem_type_desc[problem_type], problem_type=problem_type))
Normally, you would want to use this library to do multi-task learning. There are two types of chaining operations that can be used to chain problems:
- `&`: If two problems have the same inputs, they can be chained using `&`. Problems chained by `&` will be trained at the same time.
- `|`: If two problems don't have the same inputs, they need to be chained using `|`. Problems chained by `|` will be sampled to train at every instance.

Note: chaining problems with `&` works better with pyspark pre-processing and providing the `inputs_record_id` key. For more information, please refer to Write More Flexible Preprocessing Function.

If your problem does not fall into the pre-defined problem types, you can implement your own and register it to params. We will cover this topic later. Let's start with a simple example of adding a classification problem and a sequence labeling problem.
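The two operators can also be combined in a single problem string. Below is a rough sketch of how such strings compose; `other_cls` is a hypothetical problem name used only for illustration:

# problems that share inputs are grouped with &; groups with different inputs are joined with |
problem = 'toy_cls&toy_seq_tag'            # trained together on the same batches
problem = 'toy_cls&toy_seq_tag|other_cls'  # batches for 'other_cls' are sampled separately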
problem_type_dict = {'toy_cls': 'cls', 'toy_seq_tag': 'seq_tag'}
Then we need to do some coding. We need to implement a preprocessing function for each problem. The preprocessing function is a callable that:
- has the same name as the problem name
- has a fixed input signature
- returns (or yields) inputs and targets
- is decorated by `m3tl.preproc_decorator.preprocessing_fn`
import m3tl
from m3tl.preproc_decorator import preprocessing_fn
from m3tl.params import Params
from m3tl.special_tokens import TRAIN
@preprocessing_fn
def toy_cls(params: Params, mode: str):
    "Simple example to demonstrate single modal tuple of list return"
    if mode == TRAIN:
        toy_input = ['this is a test' for _ in range(10)]
        toy_target = ['a' if i <= 5 else 'b' for i in range(10)]
    else:
        toy_input = ['this is a test' for _ in range(10)]
        toy_target = ['a' if i <= 5 else 'b' for i in range(10)]
    return toy_input, toy_target
@preprocessing_fn
def toy_seq_tag(params: Params, mode: str):
    "Simple example to demonstrate single modal tuple of list return"
    if mode == TRAIN:
        toy_input = ['this is a test'.split(' ') for _ in range(10)]
        toy_target = [['a', 'b', 'c', 'd'] for _ in range(10)]
    else:
        toy_input = ['this is a test'.split(' ') for _ in range(10)]
        toy_target = [['a', 'b', 'c', 'd'] for _ in range(10)]
    return toy_input, toy_target
processing_fn_dict = {'toy_cls': toy_cls, 'toy_seq_tag': toy_seq_tag}
Now we're good to go! Since these two toy problems share the same inputs, we can chain them with `&`.
from m3tl.run_bert_multitask import train_bert_multitask, eval_bert_multitask, predict_bert_multitask
problem = 'toy_cls&toy_seq_tag'
# train
model = train_bert_multitask(
    problem=problem,
    num_epochs=1,
    problem_type_dict=problem_type_dict,
    processing_fn_dict=processing_fn_dict,
    continue_training=False
)
For eval, we need to provide `model_dir` or `model` to the function. Please note that the unresolved object warning raised by TensorFlow is expected, since the optimizer's states will not be initialized in evaluation and prediction.
# eval
eval_dict = eval_bert_multitask(
    problem=problem,
    problem_type_dict=problem_type_dict,
    processing_fn_dict=processing_fn_dict,
    model_dir=model.params.ckpt_dir)
print(eval_dict)
# predict
fake_inputs = ['this is a test'.split(' ') for _ in range(10)]
pred, model = predict_bert_multitask(
    problem=problem,
    inputs=fake_inputs, model_dir=model.params.ckpt_dir,
    problem_type_dict=problem_type_dict,
    processing_fn_dict=processing_fn_dict, return_model=True)
`pred` is a dictionary with problem names as keys and probability distribution arrays as values.
for problem_name, prob_array in pred.items():
    print(f'{problem_name} - {prob_array.shape}')
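If you need discrete predictions rather than probabilities, you can take the argmax over the last axis. A minimal sketch; mapping the resulting indices back to label strings depends on the label encoder built during preprocessing:

import numpy as np
# most probable class index per example (toy_cls) or per token (toy_seq_tag)
for problem_name, prob_array in pred.items():
    print(problem_name, np.argmax(prob_array, axis=-1))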
# change model to distilbert-base-uncased
from m3tl.params import Params
params = Params()
# specify model and its loading module
params.transformer_model_name = 'distilbert-base-uncased'
params.transformer_model_loading = 'TFDistilBertModel'
# specify tokenizer and its loading module
params.transformer_tokenizer_name = 'distilbert-base-uncased'
params.transformer_tokenizer_loading = 'DistilBertTokenizer'
# specify config and its loading module
params.transformer_config_name = 'distilbert-base-uncased'
params.transformer_config_loading = 'DistilBertConfig'
Besides the "body" model, we can also set the MTL model. By default, it will be hard parameter sharing, but we have implemented various MTL models. To see what's available, use:
import json
print(json.dumps(params.list_available_mtl_setup(), indent=4))
# train model with mmoe
params.assign_mtl_model('mmoe')
model = train_bert_multitask(
    problem=problem,
    num_epochs=1,
    problem_type_dict=problem_type_dict,
    processing_fn_dict=processing_fn_dict,
    continue_training=False,
    params=params  # pass params
)
Write More Flexible Preprocessing Function
The simplest preprocessing function returns a tuple of lists, inputs and labels, as shown above. However, inputs can get pretty complicated when doing multi-modal multi-task learning. In this case, we can use a dictionary to store our data with some magic keys:
- `"inputs_"` and `"labels_"` prefixes. We still divide the preprocessing output into inputs and labels. By adding the `"inputs_"` and `"labels_"` prefixes to the dictionary keys, the module will correctly handle them in train, eval and predict.
- `"_modal_type"` and `"_modal_info"` suffixes. Adding these suffixes indicates the modal type of some inputs. If they're not provided, the module will try to infer the correct information from the data.
- `inputs_record_id`. If specified, this key will be used to join problems chained with `&`. It is required if any problems are chained with `&`.
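For illustration, a single record emitted by such a preprocessing function might look roughly like the sketch below. Only the documented prefixes, suffixes and `inputs_record_id` carry special meaning; the other field names are made up:

record = {
    'inputs_text': 'this is a test',   # "inputs_" prefix: treated as a model input
    'inputs_text_modal_type': 'text',  # "_modal_type" suffix: modality of inputs_text
    'inputs_record_id': 0,             # join key for problems chained with &
    'labels_some_problem': 'a'         # "labels_" prefix: treated as a target
}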
Example:
import pprint
from m3tl.predefined_problems.test_data import generate_fake_data
gen = generate_fake_data(output_format='gen_dict')
pprint.pprint(next(gen))
from m3tl.utils import get_or_make_label_encoder
from m3tl.special_tokens import TRAIN
import inspect
params.num_cpus = 1
@preprocessing_fn
def toy_cls(params: Params, mode: str):
    # IMPORTANT!
    get_or_make_label_encoder(
        params=params,
        problem=inspect.currentframe().f_code.co_name,  # current function name
        mode=mode,
        label_list=['a', 'b'],
        overwrite=True
    )
    return generate_fake_data(output_format='gen_dict')
params.register_problem(problem_name='toy_cls', problem_type='cls', processing_fn=toy_cls)
# then you can call the preproc function and take a look at the result
pprint.pprint(next(toy_cls(params, TRAIN)))
Pyspark preprocessing (experimental)
If your data is too huge to process locally, you can also return a pyspark RDD from your preprocessing function. A few caveats:
- Pay special attention to `m3tl.utils.get_or_make_label_encoder` within your preprocessing function when using pyspark preprocessing!!!
- `params.pyspark_output_path` must be set if pyspark is enabled.
- If problems are chained with `&` and they only share part of the inputs, returning an RDD from the preprocessing function is required.
from m3tl.utils import set_is_pyspark
import tempfile
set_is_pyspark(True)
@preprocessing_fn
def toy_cls(params: Params, mode: str):
    return generate_fake_data(output_format='rdd')
params.register_problem(problem_name='toy_cls', problem_type='cls', processing_fn=toy_cls)
# set pyspark output path
params.pyspark_output_path = tempfile.mkdtemp()
# then you can call the preproc function and take a look at the result
toy_cls_rdd = toy_cls(params, TRAIN)
pprint.pprint(toy_cls_rdd.collect()[0])
What Happened?
The inputs returned by the preprocessing function will be tokenized using the transformers tokenizer, which is configurable as shown before, and the labels will be encoded (or tokenized, if the target is text) as scalars or numpy arrays. The encoded inputs and targets will then be serialized and written as TFRecord files. Please note that the TFRecord will NOT be overwritten even if you run the code again. So if you want to change the data in the TFRecord, you need to manually remove the TFRecord directory. The default directory is ./tmp/{problem_name}.
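For example, assuming the default directory layout described above and the chained-problem folder name used later in this section, clearing the cached TFRecords could look like this:

import os
import shutil
# delete cached TFRecords so they are regenerated on the next run
shutil.rmtree(os.path.join(params.tmp_file_dir, 'toy_cls_toy_seq_tag'), ignore_errors=True)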
After the TFRecord is created, if you want to check the feature info, you can head to the corresponding directory and take a look at the json file within.
First, we make sure the TFRecord is created.
from m3tl.input_fn import train_eval_input_fn
dataset = train_eval_input_fn(params)
Let's take a look at the TFRecord directory tree.
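One quick way to inspect the tree is to walk the temporary directory used for TFRecords; a small sketch:

import os
# print every file written under the TFRecord output directory
for root, _, files in os.walk(params.tmp_file_dir):
    for name in files:
        print(os.path.join(root, name))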
We can take a look at the json file.
import json
import os
# problems chained by & share a single TFRecord folder
json_path = os.path.join(params.tmp_file_dir, 'toy_cls_toy_seq_tag', 'train_feature_desc.json')
print(json.dumps(json.load(open(json_path, 'r', encoding='utf8')), indent=4))