Imports

load_transformer_tokenizer[source]

load_transformer_tokenizer(tokenizer_name:str, load_module_name=None)

Some tokenizers cannot be loaded with AutoTokenizer.

This function serves as a utility to handle that situation.

Args:
    tokenizer_name (str): tokenizer name
    load_module_name (str, optional): tokenizer class to load explicitly (e.g. 'BertTokenizer'). Defaults to None.

load_transformer_tokenizer(
            'voidful/albert_chinese_tiny', 'BertTokenizer')
PreTrainedTokenizer(name_or_path='voidful/albert_chinese_tiny', vocab_size=21128, model_max_len=1000000000000000019884624838656, is_fast=False, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

load_transformer_config[source]

load_transformer_config(config_name_or_dict, load_module_name=None)

Some models need an explicitly specified loading module.

Args:
    config_name_or_dict (str or dict): config name, or a config dict
    load_module_name (str, optional): loading module name. Defaults to None.

Returns:
    config: the loaded transformers config

config = load_transformer_config(
    'bert-base-chinese')
config_dict = config.to_dict()
# load config with dict
config = load_transformer_config(
    config_dict, load_module_name='BertConfig')

load_transformer_model[source]

load_transformer_model(model_name_or_config, load_module_name=None)

Load a transformers model by name, or from a config object (in which case the weights are randomly initialized rather than loaded).

# this is a PyTorch-only model
model = load_transformer_model(
    'voidful/albert_chinese_tiny')

# load from config (weights are not loaded)
model = load_transformer_model(load_transformer_config(
    'bert-base-chinese'), 'TFBertModel')
404 Client Error: Not Found for url: https://huggingface.co/voidful/albert_chinese_tiny/resolve/main/tf_model.h5
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFAlbertModel: ['predictions.decoder.weight', 'predictions.dense.bias', 'predictions.bias', 'predictions.LayerNorm.bias', 'predictions.dense.weight', 'predictions.decoder.bias', 'predictions.LayerNorm.weight']
- This IS expected if you are initializing TFAlbertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFAlbertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFAlbertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFAlbertModel for predictions without further training.

get_label_encoder_save_path[source]

get_label_encoder_save_path(params, problem:str)
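
A minimal usage sketch, assuming params is the same m3tl params object used in the examples further down this page:

# usage sketch -- returns the path where the label encoder for
# 'weibo_fake_ner' (a problem name used below) is saved
le_path = get_label_encoder_save_path(params, problem='weibo_fake_ner')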

class LabelEncoder[source]

LabelEncoder() :: BaseEstimator

Base class for all estimators in scikit-learn.

Notes

All estimators should specify all the parameters that can be set at the class level in their __init__ as explicit keyword arguments (no *args or **kwargs).

create_path[source]

create_path(path)
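
A usage sketch; judging from the name, this presumably creates the given directory if it does not already exist:

# usage sketch -- the create-if-missing behavior is an assumption
create_path('tmp/label_encoders')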

need_make_label_encoder[source]

need_make_label_encoder(mode:str, le_path:str, overwrite=False)
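
A sketch of how this might gate encoder creation; the boolean-return reading is an assumption from the name:

# sketch -- assumes a boolean return: True when a new label encoder
# should be fitted (training with no saved encoder, or overwrite=True)
le_path = get_label_encoder_save_path(params, problem='weibo_fake_ner')
if need_make_label_encoder(mode=m3tl.TRAIN, le_path=le_path, overwrite=False):
    pass  # fit and save a new encoder here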

get_or_make_label_encoder[source]

get_or_make_label_encoder(params, problem:str, mode:str, label_list=None, overwrite=True)

Function to unify the ways to get or create a label encoder for various problem types.

cls: LabelEncoder
seq_tag: LabelEncoder
multi_cls: MultiLabelBinarizer
seq2seq_text: Tokenizer

Arguments:
    params -- params object
    problem {str} -- problem name
    mode {str} -- mode

Keyword Arguments:
    label_list {list} -- label list to fit the encoder (default: {None})
    overwrite {bool} -- whether to overwrite an existing encoder (default: {True})

Returns:
    LabelEncoder -- label encoder

le_train = get_or_make_label_encoder(
    params=params, problem='weibo_fake_ner', mode=m3tl.TRAIN, label_list=[['a', 'b'], ['c']]
)
# seq_tag will add [PAD]
assert len(le_train.encode_dict) == 4, le_train.encode_dict

le_predict = get_or_make_label_encoder(
    params=params, problem='weibo_fake_ner', mode=m3tl.PREDICT)
assert le_predict.encode_dict == le_train.encode_dict

# list train
le_train = get_or_make_label_encoder(
    params=params, problem='weibo_fake_cls', mode=m3tl.TRAIN, label_list=['a', 'b', 'c']
)
# cls will not add [PAD]
assert len(le_train.encode_dict) == 3

le_predict = get_or_make_label_encoder(
    params=params, problem='weibo_fake_cls', mode=m3tl.PREDICT)
assert le_predict.encode_dict == le_train.encode_dict

# text
le_train = get_or_make_label_encoder(
    params=params, problem='weibo_masklm', mode=m3tl.TRAIN)
assert isinstance(le_train, transformers.PreTrainedTokenizer)
le_predict = get_or_make_label_encoder(
    params=params, problem='weibo_masklm', mode=m3tl.PREDICT)
assert isinstance(le_predict, transformers.PreTrainedTokenizer)

cluster_alphnum[source]

cluster_alphnum(text:str)

Simple function to aggregate English words and numbers.

Arguments:
    text {str} -- input text

Returns:
    list -- list of strings, with a Chinese character or an English/number chunk as each element
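
A hypothetical illustration of the behavior described above, assuming Chinese characters are split one per element while consecutive English/number characters are kept together:

# illustrative only -- the exact grouping rules are an assumption
tokens = cluster_alphnum('今天是2021年')
# expected under that assumption: ['今', '天', '是', '2021', '年']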

filter_empty[source]

filter_empty(input_list, target_list)

Filter out empty inputs or targets.

Arguments:
    input_list {list} -- input list
    target_list {list} -- target list

Returns:
    input_list, target_list -- data after filtering
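
A hypothetical illustration, assuming a pair is dropped when either its input or its target is empty:

# illustrative only -- the exact filtering rule is an assumption
inputs, targets = filter_empty(['a', '', 'b'], ['x', 'y', ''])
# expected under that assumption: inputs == ['a'], targets == ['x']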

infer_shape_and_type_from_dict[source]

infer_shape_and_type_from_dict(inp_dict:dict, fix_dim_for_high_rank_tensor=True)

Infer TensorFlow shapes and dtypes from a dict of values. As the example shows, the leading dimension of array inputs is treated as a dynamic batch dimension (None), while scalars get an empty shape.

test_dict = {
    'test1': np.random.uniform(size=(64, 32)),
    'test2': np.array([1, 2, 3], dtype='int32'),
    'test5': 5
}
desc_dict = infer_shape_and_type_from_dict(
    test_dict)
assert desc_dict == ({'test1': [None, 32], 'test2': [None], 'test5': []}, {
                    'test1': tf.float32, 'test2': tf.int32, 'test5': tf.int32})

get_transformer_main_model[source]

get_transformer_main_model(model, key='embeddings')

Function to extract the main model from huggingface transformers models.

Args:
    model (Model): huggingface transformers model
    key (str, optional): key to identify the main model. Defaults to 'embeddings'.

Returns:
    model: the main model (e.g. TFAlbertMainLayer)

model = load_transformer_model(
    'voidful/albert_chinese_tiny')
main_model = get_transformer_main_model(model)
isinstance(main_model, transformers.TFAlbertMainLayer)
404 Client Error: Not Found for url: https://huggingface.co/voidful/albert_chinese_tiny/resolve/main/tf_model.h5
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFAlbertModel: ['predictions.decoder.weight', 'predictions.dense.bias', 'predictions.bias', 'predictions.LayerNorm.bias', 'predictions.dense.weight', 'predictions.decoder.bias', 'predictions.LayerNorm.weight']
- This IS expected if you are initializing TFAlbertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFAlbertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFAlbertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFAlbertModel for predictions without further training.
True

get_embedding_table_from_model[source]

get_embedding_table_from_model(model:TFPreTrainedModel)

Extract the token embedding table from a TF transformers model.

embedding = get_embedding_table_from_model(
    model)
assert embedding.shape == (21128, 128)

get_shape_list[source]

get_shape_list(tensor, expected_rank=None, name=None)

Returns a list of the shape of tensor, preferring static dimensions.

Args:
    tensor: a tf.Tensor object to find the shape of.
    expected_rank: (optional) int. The expected rank of tensor. If this is specified and the tensor has a different rank, an exception will be thrown.
    name: optional name of the tensor for the error message.

Returns:
    A list of dimensions of the shape of the tensor. All static dimensions will be returned as Python integers, and dynamic dimensions will be returned as tf.Tensor scalars.
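
A minimal sketch; in eager mode every dimension is static, so the result should be a plain list of Python integers:

t = tf.zeros((2, 3, 4))
assert get_shape_list(t, expected_rank=3) == [2, 3, 4]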

gather_indexes[source]

gather_indexes(sequence_tensor, positions)

Gathers the vectors at the specific positions over a minibatch.
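
A small sketch of this BERT-style gather, assuming sequence_tensor is [batch, seq_len, width] and positions is [batch, n]; the flattened output shape is assumed from the original BERT implementation of this helper:

# sketch -- output shape [batch * n, width] is an assumption
seq = tf.reshape(tf.range(24, dtype=tf.float32), (2, 4, 3))
picked = gather_indexes(seq, tf.constant([[0, 2], [1, 3]]))
# expected under that assumption: picked.shape == (4, 3)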

dispatch_features[source]

dispatch_features(features, hidden_feature, problem, mode)

create_dict_from_nested_model[source]

create_dict_from_nested_model(model:Model, loss_dict=None, ele_name='losses', added_name=None)

variable_summaries[source]

variable_summaries(var, name)

Attach a lot of summaries to a Tensor (for TensorBoard visualization).
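
A usage sketch, assuming var can be any tf.Variable; the exact set of summaries written (mean, stddev, min/max, histogram in the classic TensorBoard recipe) is an assumption here:

# usage sketch -- attaches summaries under the given name for TensorBoard
w = tf.Variable(tf.random.normal((3, 3)))
variable_summaries(w, name='dense_kernel')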

set_phase[source]

set_phase(phase:str)

get_phase[source]

get_phase()
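
A usage sketch, assuming phase strings follow the m3tl mode constants used elsewhere on this page:

set_phase(m3tl.TRAIN)
assert get_phase() == m3tl.TRAIN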

set_is_pyspark[source]

set_is_pyspark(is_pyspark:bool)

get_is_pyspark[source]

get_is_pyspark()
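
A matching usage sketch for the pyspark flag:

set_is_pyspark(True)
assert get_is_pyspark()
set_is_pyspark(False)  # presumably the default for local runs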

class TFRedundantWarningFilter[source]

TFRedundantWarningFilter(name='') :: Filter

Filter instances are used to perform arbitrary filtering of LogRecords.

Loggers and Handlers can optionally use Filter instances to filter records as desired. The base filter class only allows events which are below a certain point in the logger hierarchy. For example, a filter initialized with "A.B" will allow events logged by loggers "A.B", "A.B.C", "A.B.C.D", "A.B.D" etc. but not "A.BB", "B.A.B" etc. If initialized with the empty string, all events are passed.

compress_tf_warnings[source]

compress_tf_warnings()
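
A usage sketch; presumably this installs TFRedundantWarningFilter (above) on TensorFlow's logger to suppress repetitive warnings:

# usage sketch -- call once, early, before TensorFlow starts logging
compress_tf_warnings()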