- 16 Jul, 2019 2 commits
-
-
serpucga authored
-
serpucga authored
Before, it was done the same way as in UTool: increasing an entry in the metadata file by X each time X tweets are added to that CSV. However, for a script that converts static Mongo collections (which are not growing in size) to CSV, it is better to simply count the number of lines of each CSV file once the conversion process has ended. This also suppresses the risk of the metadata being corrupted due to bad parallelization.
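A minimal sketch of the counting approach described above: walk the output directory after the conversion has finished and count the rows of each CSV once. The directory name, metadata file name, and metadata layout here are assumptions for illustration, not the script's actual ones.

```python
import os
import json

def count_csv_lines(path: str) -> int:
    """Return the number of data rows in a CSV file (header excluded)."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f) - 1  # subtract the header line

def build_metadata(output_dir: str = "pymongoexport_csv") -> dict:
    """Build the per-file tweet counts in a single pass over the output tree."""
    metadata = {"files": {}}
    for root, _, files in os.walk(output_dir):
        for name in files:
            if name.endswith(".csv"):
                file_path = os.path.join(root, name)
                metadata["files"][file_path] = {"count": count_csv_lines(file_path)}
    return metadata

if __name__ == "__main__":
    # Hypothetical metadata file name; written once, after all workers are done,
    # so no concurrent updates can corrupt it.
    with open(".metadata.json", "w", encoding="utf-8") as f:
        json.dump(build_metadata(), f, indent=2)
```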
-
- 15 Jul, 2019 5 commits
-
-
serpucga authored
The previous version ran out of memory for big databases because it tried to launch all processes at once. This version has no memory issues anymore, but the problems with thread safety and process collisions remain.
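A rough sketch of how the memory problem can be avoided: instead of spawning one process per page up front, hand the pages to a fixed-size pool so only a bounded number of workers exists at any time. The `dump_page` function and the page size are placeholders, not the script's real code or configuration.

```python
import multiprocessing as mp

PAGE_SIZE = 1000  # assumed number of tweets per page

def dump_page(page_index: int) -> None:
    # Placeholder: fetch the tweets of this page from Mongo and append them
    # to the corresponding CSV files.
    pass

def convert(num_tweets: int) -> None:
    num_pages = (num_tweets + PAGE_SIZE - 1) // PAGE_SIZE
    # The pool keeps at most cpu_count() worker processes alive at once,
    # so memory use no longer grows with the number of pages.
    with mp.Pool(processes=mp.cpu_count()) as pool:
        pool.map(dump_page, range(num_pages))
```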
-
serpucga authored
-
serpucga authored
-
serpucga authored
Parallelized using the multiprocessing library. I'm not really sure the code is thread safe. I think we don't care if tweets are appended to the files in a different order, but the metadata files being corrupted would be problematic. In the first tests the metadata were fine, but I think this line is probably not thread safe (two workers could load the old value and try to update it at the same time, resulting in inconsistencies): """ metadata_file["files"][file_path]["count"] += increase """ Apart from that, the code is much faster than before.
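One possible way to make that increment safe (not taken from the repository, and later made unnecessary by counting the CSV lines at the end) is to serialize the whole read-modify-write with a lock shared by all workers. The metadata path and layout are assumptions for illustration.

```python
import json
import multiprocessing as mp

# The lock must be created in the parent process and inherited by
# (or explicitly passed to) the workers.
metadata_lock = mp.Lock()

def update_count(metadata_path: str, file_path: str, increase: int) -> None:
    # Without the lock, two workers could both read the old count and then
    # write back count + increase, silently losing one of the updates.
    with metadata_lock:
        with open(metadata_path, "r", encoding="utf-8") as f:
            metadata = json.load(f)
        metadata["files"][file_path]["count"] += increase
        with open(metadata_path, "w", encoding="utf-8") as f:
            json.dump(metadata, f)
```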
-
serpucga authored
Simpler, more elegant, and slightly faster version that iterates over the cursors instead of building a list of tweets for each page.
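A minimal illustration of the change described above, using pymongo; the database and collection names, page values, and the write helper are placeholders. Iterating the cursor streams documents one by one instead of materialising each page as a Python list first.

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
collection = client["twitter_db"]["tweets"]
offset, page_size = 0, 1000

def write_tweet_to_csv(tweet: dict) -> None:
    pass  # placeholder for the CSV-writing logic

# Before (sketch): build an intermediate list for the page.
# page = list(collection.find().skip(offset).limit(page_size))
# for tweet in page:
#     write_tweet_to_csv(tweet)

# After (sketch): consume the cursor directly, with no intermediate list.
for tweet in collection.find().skip(offset).limit(page_size):
    write_tweet_to_csv(tweet)
```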
-
- 12 Jul, 2019 3 commits
- 11 Jul, 2019 2 commits
- 10 Jul, 2019 2 commits
-
-
serpucga authored
-
serpucga authored
Repository contains just one simple script for the moment, which dumps the "tweets" collection of a Mongo database to a JSON file in a "pymongodump" directory created at the time and place of execution. Faster than mongoexport, although the format of the resulting JSON files is somewhat different (adapted to Python's syntax).
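A rough sketch of what such a dump script can look like; the connection details, database name, and file naming are assumptions. Writing each document with str() is one way to get the "Python syntax" format mentioned above (single quotes, True/False, ObjectId(...)), which differs from mongoexport's strict JSON.

```python
import os
from pymongo import MongoClient

OUTPUT_DIR = "pymongodump"  # created where the script is executed

def dump_tweets(host: str = "localhost", port: int = 27017,
                database: str = "twitter_db") -> None:
    """Dump the 'tweets' collection of the given database to a text file."""
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    collection = MongoClient(host, port)[database]["tweets"]
    out_path = os.path.join(OUTPUT_DIR, database + ".json")
    with open(out_path, "w", encoding="utf-8") as f:
        for tweet in collection.find():
            # str() renders the document as a Python literal rather than strict JSON.
            f.write(str(tweet) + "\n")

if __name__ == "__main__":
    dump_tweets()
```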
-