- 02 Sep, 2019 1 commit
- serpucga authored
- 25 Jul, 2019 4 commits
- 24 Jul, 2019 9 commits
- serpucga authored
- serpucga authored
- serpucga authored
- serpucga authored
- serpucga authored: Added dict-like indexing to the pagination system so that the logs and the recovery file show "relative" page numbers instead of the tweet IDs, which would be confusing for the user.
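A minimal sketch of such a dict-like index (the class name `PageIndex` and its exact shape are illustrative, not taken from the actual code): it maps the tweet "id" that opens each page to a sequential page number, so a log line can say "page 1" rather than printing a raw id.

```python
class PageIndex:
    """Dict-like index mapping the tweet "id" that starts each page to a
    sequential "relative" page number, for use in logs and recovery files."""

    def __init__(self, page_start_ids):
        # page_start_ids: tweet ids marking the start of each page, in order.
        self._index = {tweet_id: n for n, tweet_id in enumerate(page_start_ids)}

    def __getitem__(self, tweet_id):
        # Translate a raw tweet id into its human-friendly page number.
        return self._index[tweet_id]

    def __len__(self):
        return len(self._index)
```

With `pages = PageIndex([1100437, 1100998, 1101554])`, a message about the page starting at id 1100998 can simply report `pages[1100998]`, i.e. page 1.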
- serpucga authored: Bugfix: the recovery index was being calculated by mixing the new and old styles of page indexing.
- serpucga authored
- serpucga authored: Found that the old pagination system based on skip() and limit() scaled terribly for large collections. However, if the paging isn't done by skipping documents but by asking for the tweets with a higher or lower value of one field, the query is much, much faster. Thus, using the unique "id" field as the pagination and retrieval index can work for large collections.
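The difference can be sketched with a pure-Python stand-in for the collection (no MongoDB required). With pymongo the page query would be along the lines of `collection.find({"id": {"$gt": last_id}}).sort("id", 1).limit(page_size)`, which walks an index instead of reading and discarding all the skipped documents on every query; the helper below mimics that shape:

```python
def paginate_by_id(tweets, page_size):
    """Yield pages by remembering the last "id" seen and asking for ids
    greater than it, instead of skip()-ing over already-read documents.
    (Pure-Python sketch of the range-query idea; a real Mongo query on an
    indexed "id" field avoids scanning the skipped documents entirely.)"""
    tweets = sorted(tweets, key=lambda t: t["id"])
    last_id = None
    while True:
        if last_id is None:
            page = tweets[:page_size]
        else:
            # Equivalent of find({"id": {"$gt": last_id}}).limit(page_size)
            page = [t for t in tweets if t["id"] > last_id][:page_size]
        if not page:
            return
        last_id = page[-1]["id"]
        yield page
```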
- serpucga authored
- 22 Jul, 2019 5 commits
- serpucga authored
- serpucga authored: Enhanced documentation and removed a function that is no longer used.
- serpucga authored: Don't know what I was thinking when I wrote ".csv" when this is clearly a JSON file.
- serpucga authored: The system should now be capable of recovering from a failure during the conversion process, either by ignoring the error or by dumping the state at the moment of failure and allowing the process to be resumed later from the point where it stopped. The policies followed at this stage to avoid corrupt data or other errors are:
  1. If a specific tweet raises an error while being converted to CSV, that tweet is skipped and execution continues.
  2. If any other error occurs while processing a page of tweets, the number of that page is recorded in the recovery file, and that page will be skipped when the user resumes execution from the recovery file.
  3. On any other unexpected error, keyboard interruption or anything else, a standard recovery file is dumped, with the list of already converted pages but without an "error_page" entry, so that when the script is executed with the "-r" flag it will try to resume from where it left off without discarding any information.
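A sketch of what dumping and reading such a recovery file could look like. The "error_page" key comes from the message above; the other names (`converted_pages`, the function names) are assumptions for illustration:

```python
import json

def dump_recovery_file(path, converted_pages, error_page=None):
    """Persist the recovery state: pages already converted and, if the
    failure happened while processing a page, that page's number."""
    state = {"converted_pages": sorted(converted_pages)}
    if error_page is not None:
        state["error_page"] = error_page
    with open(path, "w") as f:
        json.dump(state, f)

def pages_to_skip(path):
    """On resume (-r), skip the pages already converted plus, if present,
    the page that caused the failure."""
    with open(path) as f:
        state = json.load(f)
    skip = set(state["converted_pages"])
    if "error_page" in state:
        skip.add(state["error_page"])
    return skip
```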
- serpucga authored: Added a way of dumping the number of the page that raised an error during CSV conversion. This way we can later implement ways of dealing with it (skipping the corrupt page, inspecting it to find where the problem resides, etc.).
- 19 Jul, 2019 4 commits
- serpucga authored
- serpucga authored: Now a "recovery" directory is created to contain these kinds of files. Besides, they are no longer hidden files and they will always be unique, because their filenames contain a timestamp (this way a recovery file won't unexpectedly overwrite a previous recovery file for the same collection).
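The filename scheme can be sketched like this; the exact timestamp format, the `.json` extension and the helper name are assumptions, the "recovery" directory name comes from the message above:

```python
import os
from datetime import datetime

def recovery_file_path(collection_name, recovery_dir="recovery"):
    """Build a unique, non-hidden recovery filename: the timestamp makes
    two runs over the same collection produce different files, so a new
    recovery file never overwrites a previous one."""
    os.makedirs(recovery_dir, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return os.path.join(recovery_dir, f"{collection_name}_{stamp}.json")
```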
- serpucga authored: Added a new mode of execution, 'recovery', which allows continuing a task by loading a recovery file from a previous run.
- serpucga authored
- 18 Jul, 2019 4 commits
- serpucga authored
- serpucga authored
- serpucga authored: Added a new option and mode, "-t", which shows the time costs of some of the most relevant operations (writing to file, converting a page to CSV format, creating the metadata file...). Besides, the verbose mode was considerably enhanced: the noisiest messages were removed, some useful ones were introduced and others were improved.
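A timing mode like "-t" can be sketched with a decorator that accumulates wall-clock costs per operation; all names here (`TIMINGS`, `timed`, `write_page`) are illustrative, not taken from the actual code:

```python
import time
from functools import wraps

# Accumulated costs per operation, to be summarized when "-t" is active.
TIMINGS = {}

def timed(operation_name):
    """Record the wall-clock cost of each call under the given name."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            TIMINGS.setdefault(operation_name, []).append(time.perf_counter() - start)
            return result
        return wrapper
    return decorator

@timed("write_page")
def write_page(lines):
    # Stand-in for a relevant operation (writing a page to file).
    return len(lines)
```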
- serpucga authored: Variable names enhanced for clarity, old and unused code removed, some changes to the logs and lots of new docstrings.
- 17 Jul, 2019 3 commits
- serpucga authored: Changed the structure of the code to make it thread safe when dumping data to the filesystem. The previous parallelism affected all the stages, and that could lead to corrupt data when two processes tried to write to the same file at the same time. Now the code that retrieves data from Mongo and converts it to CSV, named "process_page" because each worker receives a page of X (default 1000) tweets to convert, is parallelized and handed to a pool of workers. However, those workers only write to buffers that they pass to a thread-safe multiprocessing queue. That queue is consumed by a single process, the "filesystem_writer", which is the only one allowed to write to the filesystem (this includes both creating the necessary directories and appending tweets to the CSV files). This worker loops forever looking for new data on the queue in order to write it down. This is a pretty dirty version that includes functions and code that are no longer used, and pretty bad log messages used during development to hunt down bugs. Will refactor soon.
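The worker/writer split described above can be sketched as follows. To keep the example self-contained it uses `multiprocessing.dummy` (thread-backed, same `Pool` API) and a plain queue instead of real processes; the function names `process_page` and `filesystem_writer` come from the message, everything else is an assumption:

```python
import os
import tempfile
import threading
from multiprocessing.dummy import Pool  # thread-backed, same API as mp.Pool
from queue import Queue

def process_page(job):
    """Worker: converts one page of tweets to CSV text, in memory only."""
    page_number, tweets, out_queue = job
    buffer = "".join(",".join(str(field) for field in tweet) + "\n" for tweet in tweets)
    out_queue.put(buffer)  # hand the result to the single writer

def filesystem_writer(out_queue, csv_path):
    """The only worker allowed to touch the filesystem: it loops reading
    buffers from the queue and appends them, until the None sentinel."""
    with open(csv_path, "a") as f:
        while True:
            buffer = out_queue.get()
            if buffer is None:
                break
            f.write(buffer)

csv_path = os.path.join(tempfile.mkdtemp(), "tweets.csv")
queue = Queue()
writer = threading.Thread(target=filesystem_writer, args=(queue, csv_path))
writer.start()
with Pool(2) as pool:  # workers convert pages concurrently...
    pool.map(process_page, [(n, [(n, "text")], queue) for n in range(4)])
queue.put(None)  # ...and only once all pages are done, stop the writer
writer.join()
```

The key property is that no matter how many workers run in parallel, only `filesystem_writer` ever opens a file, so concurrent appends cannot interleave within a page.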
- serpucga authored
- serpucga authored
- 16 Jul, 2019 3 commits
- serpucga authored
- serpucga authored
- serpucga authored: Before, it was done the same way as in UTool: increasing an entry in the metadata file by X each time X tweets were added to that CSV. However, for a script that converts static Mongo collections (which are not growing in size) to CSV, it is better to just count the number of lines of each CSV file once the conversion process has ended. This also suppresses the risk of the metadata being corrupted due to bad parallelization.
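Counting at the end can be as simple as the helper below; whether the real CSV files carry a header row, and the helper's name, are assumptions:

```python
def count_csv_tweets(csv_path, has_header=True):
    """Count the tweets in a finished CSV file by counting its lines once,
    instead of maintaining a running counter during the conversion."""
    with open(csv_path) as f:
        lines = sum(1 for _ in f)
    return lines - 1 if has_header else lines
```

Because this runs once after all workers have finished, no concurrent updates to the metadata can race with each other.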
- 15 Jul, 2019 5 commits
- serpucga authored: The previous version ran out of memory for big databases, because it tried to launch all processes at once. This version has no memory issues anymore. The problems with the code not being thread safe and with process collisions remain.
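The fix amounts to a fixed-size pool fed lazily instead of one process per page, sketched here with `multiprocessing.dummy` so the example is self-contained (the real code uses actual processes; the names `pages` and `convert` are illustrative):

```python
from multiprocessing.dummy import Pool  # thread-backed stand-in for mp.Pool

def pages(n_tweets, page_size):
    """Lazily yield page numbers instead of materializing every job up
    front; combined with a fixed-size pool, memory use stays bounded."""
    for start in range(0, n_tweets, page_size):
        yield start // page_size

def convert(page_number):
    return page_number  # placeholder for the real page-conversion work

results = []
with Pool(4) as pool:  # at most 4 workers alive at any moment
    for result in pool.imap(convert, pages(5000, 1000)):
        results.append(result)
```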
- serpucga authored
- serpucga authored
- serpucga authored: Parallelized using the multiprocessing library. I'm not really sure about the code being thread safe. I think we don't care if tweets are appended to the files in a different order, but the metadata files being corrupted would be problematic. In the first tests the metadata were fine, but I think this line is probably not thread safe (two workers could read and try to update the old value at the same time, resulting in inconsistencies): """ metadata_file["files"][file_path]["count"] += increase """ Apart from that, the code is much faster than before.
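The quoted line is a classic read-modify-write race: two workers can both read the old count and each write back old + increase, losing one update. A thread-based sketch of guarding it with a lock (the real code uses separate processes, where the shared dict would also need something like a multiprocessing.Manager; all names here are illustrative):

```python
from multiprocessing.dummy import Pool  # thread-backed, same API as mp.Pool
from threading import Lock

metadata = {"files": {"coll/tweets.csv": {"count": 0}}}
metadata_lock = Lock()

def add_tweets(increase, file_path="coll/tweets.csv"):
    # "+=" on a dict entry is read-modify-write, not atomic; the lock
    # ensures two workers cannot both read the old value and then
    # overwrite each other's update.
    with metadata_lock:
        metadata["files"][file_path]["count"] += increase

with Pool(8) as pool:
    pool.map(add_tweets, [10] * 100)
```

With the lock, 100 concurrent increments of 10 always yield exactly 1000; without it, updates can be silently lost.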
- serpucga authored: A simpler, more elegant and slightly faster version that uses the cursors directly instead of building a list of tweets for each page.
- 12 Jul, 2019 2 commits