vendor/tomotopy/README.rst in tomoto-0.2.2 vs vendor/tomotopy/README.rst in tomoto-0.2.3
- old
+ new
@@ -200,10 +200,59 @@
print("Log-likelihood of inference: ", ll)
The `infer` method accepts either a single instance of `tomotopy.Document` or a `list` of such instances.
See more at `tomotopy.LDAModel.infer`.
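
The following is a hedged sketch of both call styles on a toy model; the words are placeholders:

::

    import tomotopy as tp

    mdl = tp.LDAModel(k=2)
    for text in ("a b c d", "d e f g", "g h i j"):
        mdl.add_doc(text.split())
    mdl.train(100)

    # a single unseen document yields one topic distribution and one log-likelihood
    doc = mdl.make_doc("b c unseen".split())
    topic_dist, ll = mdl.infer(doc)

    # a list of documents yields results for each document
    docs = [mdl.make_doc(t.split()) for t in ("a b", "e f")]
    topic_dists, lls = mdl.infer(docs)
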
+Corpus and transform
+--------------------
+Every topic model in `tomotopy` has its own internal document type.
+A document suitable for each model can be created and added through that model's `add_doc` method.
+However, adding the same list of documents to several different models quickly becomes inconvenient,
+because `add_doc` has to be called on every model for the same list of documents.
+Thus, `tomotopy` provides the `tomotopy.utils.Corpus` class, which holds a list of documents.
+A `tomotopy.utils.Corpus` can be inserted into any model by passing it as the `corpus` argument to the model's `__init__` or to its `add_corpus` method.
+Inserting a `tomotopy.utils.Corpus` has the same effect as inserting every document the corpus holds.
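+
+For example, the same corpus can be inserted into two different models. The following is a minimal sketch; the model choices and word lists are only illustrative:
+
+::
+
+    import tomotopy as tp
+
+    corpus = tp.utils.Corpus()
+    corpus.add_doc("a b c d e".split())
+    corpus.add_doc("f g h i j".split())
+
+    lda = tp.LDAModel(k=10, corpus=corpus)  # pass the corpus at construction
+    hdp = tp.HDPModel()
+    hdp.add_corpus(corpus)                  # or add it to an existing model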
+
+Some topic models require different data for their documents.
+For example, `tomotopy.DMRModel` requires a `metadata` argument of type `str`,
+while `tomotopy.PLDAModel` requires a `labels` argument of type `List[str]`.
+Since `tomotopy.utils.Corpus` holds an independent set of documents rather than being tied to a specific topic model,
+the data it carries may not match the types required by the topic model the corpus is added into.
+In this case, the miscellaneous data can be transformed to fit the target topic model using the `transform` argument.
+See more details in the following code:
+
+::
+
+ from tomotopy import DMRModel
+ from tomotopy.utils import Corpus
+
+ corpus = Corpus()
+ corpus.add_doc("a b c d e".split(), a_data=1)
+ corpus.add_doc("e f g h i".split(), a_data=2)
+ corpus.add_doc("i j k l m".split(), a_data=3)
+
+ model = DMRModel(k=10)
+ model.add_corpus(corpus)
+    # The `a_data` field in `corpus` is lost here,
+    # and the `metadata` field that `DMRModel` requires is filled with its default value, an empty str.
+
+ assert model.docs[0].metadata == ''
+ assert model.docs[1].metadata == ''
+ assert model.docs[2].metadata == ''
+
+    # this function transforms the `a_data` field into the `metadata` field that `DMRModel` requires
+    def transform_a_data_to_metadata(misc: dict):
+        return {'metadata': str(misc['a_data'])}
+
+ model = DMRModel(k=10)
+ model.add_corpus(corpus, transform=transform_a_data_to_metadata)
+    # Now the docs in `model` have a non-default `metadata`, generated from the `a_data` field.
+
+ assert model.docs[0].metadata == '1'
+ assert model.docs[1].metadata == '2'
+ assert model.docs[2].metadata == '3'
+
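+The same mechanism works for other models. The following is a hedged sketch that builds the `labels` field `tomotopy.PLDAModel` expects from the same `a_data` field, reusing the `corpus` above; the label scheme is made up for illustration:
+
+::
+
+    from tomotopy import PLDAModel
+
+    def transform_a_data_to_labels(misc: dict):
+        # `PLDAModel` expects `labels` as a list of strings
+        return {'labels': ['group_' + str(misc['a_data'])]}
+
+    model = PLDAModel(latent_topics=2)
+    model.add_corpus(corpus, transform=transform_a_data_to_labels)
+    # each document now carries a label derived from its `a_data` value
+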
Parallel Sampling Algorithms
----------------------------
Since version 0.5.0, `tomotopy` allows you to choose a parallelism algorithm.
The algorithm provided in versions prior to 0.4.2 is `COPY_MERGE`, which is provided for all topic models.
The new algorithm `PARTITION`, available since 0.5.0, makes training generally faster and more memory-efficient, but it is not available for all topic models.
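
The following is a hedged sketch of selecting a scheme explicitly; the toy documents are placeholders:

::

    import tomotopy as tp

    mdl = tp.LDAModel(k=2)
    for text in ("a b c d", "d e f g", "g h i j"):
        mdl.add_doc(text.split())
    # request PARTITION explicitly; with ParallelScheme.DEFAULT,
    # tomotopy picks a suitable algorithm for the model automatically
    mdl.train(100, parallel=tp.ParallelScheme.PARTITION)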
@@ -258,9 +307,15 @@
`tomotopy` is licensed under the terms of MIT License,
meaning you can use it for any reasonable purpose and remain in complete ownership of all the documentation you produce.
History
-------
+* 0.12.1 (2021-06-20)
+ * An issue where `tomotopy.LDAModel.set_word_prior()` caused a crash has been fixed.
+ * Now `tomotopy.LDAModel.perplexity` and `tomotopy.LDAModel.ll_per_word` return accurate values when `TermWeight` is not `ONE`.
+ * `tomotopy.LDAModel.used_vocab_weighted_freq` was added, which returns term-weighted frequencies of words.
+ * Now `tomotopy.LDAModel.summary()` shows not only the entropy of words, but also the entropy of term-weighted words.
+
* 0.12.0 (2021-04-26)
* Now `tomotopy.DMRModel` and `tomotopy.GDMRModel` support multiple values of metadata (see https://github.com/bab2min/tomotopy/blob/main/examples/dmr_multi_label.py )
* The performance of `tomotopy.GDMRModel` was improved.
* A `copy()` method has been added for all topic models to do a deep copy.
* An issue was fixed where words that are excluded from training (by `min_cf`, `min_df`) have incorrect topic id. Now all excluded words have `-1` as topic id.