- 19 Jul, 2019 4 commits
- serpucga authored
- serpucga authored
Now a directory "recovery" is created to contain these kinds of files. Besides, they are no longer hidden files, and they will always be unique, because they contain a timestamp in their filename (this way a new recovery file won't unexpectedly overwrite a previous recovery file for the same collection)
- serpucga authored
Added a new mode of execution, "recovery", which allows continuing the execution of a task by loading a recovery file from a previous process
- serpucga authored
- 18 Jul, 2019 4 commits
- serpucga authored
- serpucga authored
- serpucga authored
Added a new option and mode, "-t", which shows the time costs of some of the most relevant operations (writing to file, converting a page to CSV format, creating the metadata file...). Besides, the verbose mode was enhanced considerably: the noisiest messages were left out, some useful ones were introduced, and others were improved.
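Timing instrumentation like this is commonly done with a decorator. This is a sketch under assumptions: the decorator, logger name, and `page_to_csv` example are hypothetical stand-ins, with `enabled` standing in for the "-t" flag.

```python
import functools
import logging
import time

logger = logging.getLogger("mongo2csv")  # hypothetical logger name


def timed(operation: str, enabled: bool = True):
    """Log the wall-clock cost of an operation when timing mode is active."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not enabled:
                return func(*args, **kwargs)
            start = time.perf_counter()
            result = func(*args, **kwargs)
            logger.info("%s took %.3f s", operation, time.perf_counter() - start)
            return result
        return wrapper
    return decorator


@timed("converting a page to CSV")
def page_to_csv(page):
    """Toy stand-in for the real page-conversion routine."""
    return "\n".join(",".join(str(v) for v in row) for row in page)
```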
- serpucga authored
Variable names enhanced for clarity, old and unused code removed, some changes in the logs, and lots of new docstrings.
- 17 Jul, 2019 3 commits
- serpucga authored
Changed the structure of the code to make it thread safe when dumping data to the filesystem. The previous parallelism affected all the stages, which could lead to corrupt data when two processes tried to write to the same file at the same time. Now the code that retrieves data from Mongo and converts it to CSV, named "process_page" because each worker receives a page of X (default 1000) tweets to convert, is parallelized and given to a pool of workers. However, those workers only write to buffers that they pass to a thread-safe multiprocessing queue. That queue is consumed by a single process, the "filesystem_writer", which is the only one allowed to write to the filesystem (this includes both creating the necessary dirs and appending tweets to the CSV files). This worker loops forever looking for new data on the queue to write down. This is a pretty dirty version that includes functions and code that are no longer used, and pretty bad log messages used during development to hunt down bugs. Will refactor soon.
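The producer/consumer layout described above can be sketched as follows. This is not the repository's code: the real version uses the process-based `multiprocessing` API, while this portable sketch uses its thread-backed twin `multiprocessing.dummy` (same interface) and writes into an in-memory dict instead of real files so it stays self-contained.

```python
import csv
import io
import queue
import threading
from multiprocessing.dummy import Pool  # thread-backed twin of multiprocessing.Pool

SENTINEL = None  # tells the writer that no more pages are coming


def process_page(args):
    """Worker: convert one page of tweets to CSV text, in memory only."""
    path, page = args
    buf = io.StringIO()
    csv.writer(buf).writerows(page)
    return path, buf.getvalue()


def filesystem_writer(q, files):
    """Single consumer: the only code allowed to 'touch the filesystem'.

    Here `files` is a dict standing in for real CSV files on disk.
    """
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        path, data = item
        files[path] = files.get(path, "") + data  # append, like the real writer


def dump(pages):
    """Fan pages out to a pool, funnel all writes through one consumer."""
    q = queue.Queue()
    files = {}
    writer = threading.Thread(target=filesystem_writer, args=(q, files))
    writer.start()
    with Pool(4) as pool:
        for item in pool.imap_unordered(process_page, pages):
            q.put(item)  # only buffers travel to the writer
    q.put(SENTINEL)
    writer.join()
    return files
```

Because a single consumer serializes every append, two pages destined for the same CSV can never interleave mid-write, which is the corruption the commit describes.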
- serpucga authored
- serpucga authored
- 16 Jul, 2019 3 commits
- serpucga authored
- serpucga authored
- serpucga authored
Before, it was done in the same way as in UTool: by increasing an entry in the metadata file by X each time X tweets were added to that CSV. However, for a script that converts static collections (which are not growing in size) from Mongo to CSV, it is better to just count the number of lines of each CSV file once the conversion process has ended. This also removes the risk of the metadata being corrupted due to bad parallelization.
- 15 Jul, 2019 5 commits
- serpucga authored
The previous version ran out of memory for big databases, because it tried to launch all processes at once. This version has no memory issues anymore. The problems with the code not being thread safe and with process collisions remain.
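The commit does not say how the fix was done; one common way to bound memory is to feed a fixed-size pool from a lazy generator via `imap`, so only a handful of pages exist at any time. A sketch under that assumption (thread-backed `multiprocessing.dummy` stands in for the process-based pool to keep it self-contained):

```python
from multiprocessing.dummy import Pool  # stands in for multiprocessing.Pool


def pages(cursor, page_size=1000):
    """Lazily yield pages of tweets instead of materializing them all."""
    page = []
    for tweet in cursor:
        page.append(tweet)
        if len(page) == page_size:
            yield page
            page = []
    if page:
        yield page


def convert_all(cursor, worker, n_workers=4):
    """Stream pages through a fixed-size pool; imap consumes the
    generator lazily, so only a bounded number of pages is in flight."""
    with Pool(n_workers) as pool:
        for result in pool.imap(worker, pages(cursor)):
            yield result
```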
- serpucga authored
- serpucga authored
- serpucga authored
Parallelized using the multiprocessing library. I'm not really sure about the code being thread safe. I think we don't care if tweets are appended to the files in a different order, but the metadata files getting corrupted would be problematic. In the first tests the metadata were fine, but this line is probably not thread safe (two threads could try to update the old value at the same time, resulting in inconsistencies): """ metadata_file["files"][file_path]["count"] += increase """ Apart from that, the code is much faster than before.
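The suspicion is well founded: `+=` is a read-modify-write, so two workers can read the same old count and one increment gets lost. Guarding it with a lock is a sketch of one fix (between threads; between processes a `multiprocessing.Lock` or a single writer process would be needed instead). The function name is illustrative:

```python
import threading

metadata_lock = threading.Lock()


def safe_increase(metadata_file, file_path, increase):
    """Serialize the read-modify-write so no increment is lost."""
    with metadata_lock:
        metadata_file["files"][file_path]["count"] += increase
```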
- serpucga authored
Simpler, more elegant and slightly faster version that uses the cursors instead of building a list of tweets for each page
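With pymongo, iterating the cursor directly lets the driver stream documents in batches under the hood, so no intermediate per-page list is needed. A minimal sketch (the function name is hypothetical; `cursor` would be a `collection.find()` result, but any iterable works):

```python
def dump_with_cursor(cursor, handle_tweet):
    """Stream straight off the cursor instead of building page lists."""
    count = 0
    for tweet in cursor:
        handle_tweet(tweet)
        count += 1
    return count
```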
- 12 Jul, 2019 3 commits
- 11 Jul, 2019 2 commits
- 10 Jul, 2019 2 commits
- serpucga authored
- serpucga authored
For the moment the repository contains just one simple script, which dumps the "tweets" collection of a Mongo database to a JSON file in a "pymongodump" directory created at the moment and place of execution. Faster than mongoexport, although the format of the resulting JSONs is somewhat different (adapted to Python's syntax).