- 15 Jul, 2019 5 commits
-
serpucga authored
Previous version ran out of memory for big databases because it tried to launch all processes at once. This version no longer has memory issues. Thread-safety problems and process collisions persist.
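The commit does not show the fix itself, but a fixed-size worker pool is a standard way to avoid launching one process per page at once. A minimal sketch, assuming a hypothetical `dump_page` worker and page count:

```python
import multiprocessing

def dump_page(page_index):
    """Hypothetical worker: dump one page of tweets to its output file."""
    # Real work omitted; this only sketches the process layout.
    return page_index

if __name__ == "__main__":
    num_pages = 500  # assumed: total number of pages in the collection
    # Spawning one process per page at once exhausts memory on big
    # databases. A fixed-size Pool runs at most cpu_count() workers
    # concurrently and queues the remaining pages.
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        pool.map(dump_page, range(num_pages))
```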
-
serpucga authored
-
serpucga authored
-
serpucga authored
Parallelized using the multiprocessing library. I'm not really sure the code is thread safe. I think we don't care if tweets are appended to the files in a different order, but corrupted metadata files would be problematic. In the first tests the metadata were fine, but I think this line is probably not thread safe (two workers could try to update the old value at the same time, resulting in inconsistencies):
metadata_file["files"][file_path]["count"] += increase
Apart from that, the code is much faster than before.
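One way to make that increment safe is to serialize the read-modify-write with a lock. This is a sketch, not the repository's actual code: the names `metadata` and `update_count` are illustrative, and the dict is assumed to be shared across processes via `multiprocessing.Manager` (ordinary dicts are not visible between processes):

```python
import multiprocessing

def update_count(metadata, file_path, increase, lock):
    # `+=` is a read-modify-write: two workers can read the same old
    # value and one update gets lost. Holding the lock makes the whole
    # sequence atomic with respect to the other workers.
    with lock:
        metadata["files"][file_path]["count"] += increase

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    # Process-safe proxies for the nested metadata structure (assumed shape).
    metadata = manager.dict()
    metadata["files"] = manager.dict()
    metadata["files"]["part_0.json"] = manager.dict(count=0)
    lock = manager.Lock()

    procs = [
        multiprocessing.Process(
            target=update_count, args=(metadata, "part_0.json", 1, lock)
        )
        for _ in range(8)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(metadata["files"]["part_0.json"]["count"])  # reliably 8
```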
-
serpucga authored
Simpler, more elegant, and slightly faster version that iterates over the cursors instead of building a list of tweets for each page.
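A sketch of the difference, with connection details and database name assumed: iterating a pymongo cursor streams documents in server-side batches instead of materializing each page as a Python list.

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)  # assumed connection details
tweets = client["mydb"]["tweets"]         # assumed database name

# Before (per the commit): each page materialized as a list in memory.
# page = list(tweets.find().skip(offset).limit(page_size))

# After: iterate the cursor directly. pymongo fetches documents in
# server-side batches, so only one batch is held in memory at a time.
count = 0
for tweet in tweets.find():
    count += 1  # stand-in for the real per-tweet work
print(count, "tweets read")
```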
- 12 Jul, 2019 3 commits
- 11 Jul, 2019 2 commits
- 10 Jul, 2019 2 commits
-
serpucga authored
-
serpucga authored
Repository contains just one simple script for the moment, which dumps the "tweets" collection of a Mongo database to a JSON file in a "pymongodump" directory created at the moment and place of execution. Faster than mongoexport, although the format of the resulting JSONs is somewhat different (adapted to Python's syntax).
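A minimal sketch of what such a script might look like; the connection details, database name, and the use of `str()` to produce the Python-syntax output are all assumptions, not the repository's actual code:

```python
import os
from pymongo import MongoClient

client = MongoClient("localhost", 27017)  # assumed connection details
tweets = client["mydb"]["tweets"]         # assumed database name

# Create the dump directory at the moment and place of execution.
os.makedirs("pymongodump", exist_ok=True)

with open(os.path.join("pymongodump", "tweets.json"), "w") as out:
    for tweet in tweets.find():
        # Writing str(tweet) yields Python-syntax output (single quotes,
        # ObjectId(...)), which would match the note that the format
        # differs from mongoexport's JSON. This detail is an assumption.
        out.write(str(tweet) + "\n")
```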