vendor/tomotopy/README.rst in tomoto-0.2.2 vs vendor/tomotopy/README.rst in tomoto-0.2.3

- old
+ new

@@ -200,10 +200,59 @@ print("Log-likelihood of inference: ", ll)

The `infer` method can infer only one instance of `tomotopy.Document` or a `list` of instances of `tomotopy.Document`. See more at `tomotopy.LDAModel.infer`.

+Corpus and transform
+--------------------
+Every topic model in `tomotopy` has its own internal document type.
+A document suitable for each model can be created and added through that model's `add_doc` method.
+However, adding the same list of documents to several different models becomes quite inconvenient,
+because `add_doc` has to be called on the same list once for every model.
+Thus, `tomotopy` provides the `tomotopy.utils.Corpus` class, which holds a list of documents.
+A `tomotopy.utils.Corpus` can be inserted into any model by passing it as the `corpus` argument of `__init__` or of each model's `add_corpus` method.
+Inserting a `tomotopy.utils.Corpus` has the same effect as inserting every document the corpus holds.
+
+Some topic models require additional data for their documents.
+For example, `tomotopy.DMRModel` requires a `metadata` argument of type `str`,
+while `tomotopy.PLDAModel` requires a `labels` argument of type `List[str]`.
+Since `tomotopy.utils.Corpus` holds an independent set of documents rather than being tied to a specific topic model,
+the data attached to its documents may not match what a topic model requires when the corpus is added to that model.
+In this case, the miscellaneous data can be transformed to fit the target topic model using the `transform` argument.
+See the following code for more details:
+
+::
+
+    from tomotopy import DMRModel
+    from tomotopy.utils import Corpus
+
+    corpus = Corpus()
+    corpus.add_doc("a b c d e".split(), a_data=1)
+    corpus.add_doc("e f g h i".split(), a_data=2)
+    corpus.add_doc("i j k l m".split(), a_data=3)
+
+    model = DMRModel(k=10)
+    model.add_corpus(corpus)
+    # The `a_data` field of `corpus` is lost here,
+    # and the `metadata` field that `DMRModel` requires is filled with its default value, the empty str.
+
+    assert model.docs[0].metadata == ''
+    assert model.docs[1].metadata == ''
+    assert model.docs[2].metadata == ''
+
+    def transform_a_data_to_metadata(misc: dict):
+        # this function transforms `a_data` into `metadata`
+        return {'metadata': str(misc['a_data'])}
+
+    model = DMRModel(k=10)
+    model.add_corpus(corpus, transform=transform_a_data_to_metadata)
+    # Now the docs in `model` have a non-default `metadata`, generated from the `a_data` field.
+
+    assert model.docs[0].metadata == '1'
+    assert model.docs[1].metadata == '2'
+    assert model.docs[2].metadata == '3'
+
Parallel Sampling Algorithms
----------------------------
Since version 0.5.0, `tomotopy` allows you to choose a parallelism algorithm.
The algorithm provided in versions prior to 0.4.2 is `COPY_MERGE`, which is provided for all topic models.
The new algorithm `PARTITION`, available since 0.5.0, makes training generally faster and more memory-efficient, but it is not available for all topic models.
@@ -258,9 +307,15 @@ `tomotopy` is licensed under the terms of MIT License, meaning you can use it for any reasonable purpose and remain in complete ownership of all the documentation you produce.

History
-------
+* 0.12.1 (2021-06-20)
+    * An issue where `tomotopy.LDAModel.set_word_prior()` causes a crash has been fixed.
+    * Now `tomotopy.LDAModel.perplexity` and `tomotopy.LDAModel.ll_per_word` return accurate values when `TermWeight` is not `ONE`.
+    * `tomotopy.LDAModel.used_vocab_weighted_freq` was added, which returns the term-weighted frequencies of words.
+    * Now `tomotopy.LDAModel.summary()` shows not only the entropy of words but also the entropy of term-weighted words.
+
* 0.12.0 (2021-04-26)
    * Now `tomotopy.DMRModel` and `tomotopy.GDMRModel` support multiple values of metadata (see https://github.com/bab2min/tomotopy/blob/main/examples/dmr_multi_label.py )
    * The performance of `tomotopy.GDMRModel` was improved.
    * A `copy()` method has been added to all topic models for making a deep copy.
    * An issue was fixed where words excluded from training (by `min_cf`, `min_df`) had an incorrect topic id. Now all excluded words have `-1` as their topic id.
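
The `Corpus and transform` section added above demonstrates `transform` only for `tomotopy.DMRModel`. As a complementary sketch, the same pattern can be applied to `tomotopy.PLDAModel`, which expects `labels` of type `List[str]`; the corpus contents, the `latent_topics` value, and the transform function below are illustrative assumptions rather than anything prescribed by the README:

::

    from tomotopy import PLDAModel
    from tomotopy.utils import Corpus

    # Same kind of corpus as in the DMR example above:
    # each document carries a custom `a_data` field.
    corpus = Corpus()
    corpus.add_doc("a b c d e".split(), a_data=1)
    corpus.add_doc("e f g h i".split(), a_data=2)
    corpus.add_doc("i j k l m".split(), a_data=3)

    def transform_a_data_to_labels(misc: dict):
        # PLDAModel expects `labels` as a list of strings,
        # so wrap the stringified `a_data` value in a one-element list.
        return {'labels': [str(misc['a_data'])]}

    model = PLDAModel(latent_topics=2)
    model.add_corpus(corpus, transform=transform_a_data_to_labels)
    model.train(100)
    model.summary()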
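
The Parallel Sampling Algorithms paragraph above describes the `COPY_MERGE` and `PARTITION` schemes, but the excerpt does not show how a scheme is selected. Here is a minimal sketch, assuming the `parallel` keyword argument of `train` and the `tomotopy.ParallelScheme` enum from the tomotopy API; the toy documents are made up for illustration:

::

    import tomotopy as tp

    mdl = tp.LDAModel(k=4)
    # A few toy documents, just enough to run a short training loop.
    for doc in ["a b c d", "b c d e", "c d e f", "d e f g"]:
        mdl.add_doc(doc.split())

    # Explicitly request the PARTITION scheme introduced in 0.5.0;
    # tp.ParallelScheme.COPY_MERGE (or DEFAULT) is passed the same way.
    mdl.train(200, parallel=tp.ParallelScheme.PARTITION)
    print(mdl.ll_per_word)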