Thread-safe parallel version (dirty code) · 34170f35 · authored by serpucga
Changed the structure of the code to make it thread-safe when dumping data to the filesystem. The previous parallelism affected all stages, which could lead to corrupt data when two processes tried to write to the same file at the same time. Now only the code that retrieves data from Mongo and converts it to CSV is parallelized and handed to a pool of workers; it is named "process_page" because each worker receives a page of X tweets (default 1000) to convert. Those workers write only to in-memory buffers, which they pass to a thread-safe multiprocessing queue. That queue is drained by a single process, the "filesystem_writer", which is the only one allowed to write to the filesystem (this covers both creating the necessary directories and appending tweets to the CSV files). This worker loops forever, watching the queue for new data to write out. This is a pretty dirty version that still contains functions and code that are no longer used, plus crude log messages added during development to hunt down bugs. Will refactor soon.
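The pattern described above (many workers producing CSV buffers, one dedicated writer process draining a queue) can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names `process_page` and `filesystem_writer` come from the commit message, but the tweet fields, the `dump` driver, and the parameter names are assumptions, and the real code reads its pages from Mongo rather than taking them as arguments.

```python
import csv
import io
import multiprocessing as mp

SENTINEL = None  # special value telling the writer loop to stop


def process_page(args):
    # Worker: convert one "page" of tweets to CSV text in an in-memory
    # buffer, then hand the buffer to the shared queue. No filesystem
    # access happens in the workers.
    page, queue = args
    buf = io.StringIO()
    writer = csv.writer(buf)
    for tweet in page:
        writer.writerow([tweet["id"], tweet["text"]])
    queue.put(buf.getvalue())


def filesystem_writer(queue, path):
    # The single process that owns the filesystem: it drains the queue
    # and appends each buffer to the CSV file, so no two processes can
    # ever write to the same file concurrently.
    with open(path, "a", newline="") as f:
        while True:
            chunk = queue.get()
            if chunk is SENTINEL:
                break
            f.write(chunk)


def dump(pages, path, n_workers=4):
    # Driver: start the lone writer, fan the pages out to a worker
    # pool, then signal the writer to finish once all pages are done.
    manager = mp.Manager()
    queue = manager.Queue()
    writer = mp.Process(target=filesystem_writer, args=(queue, path))
    writer.start()
    with mp.Pool(n_workers) as pool:
        pool.map(process_page, [(p, queue) for p in pages])
    queue.put(SENTINEL)
    writer.join()
```

Because only `filesystem_writer` touches the file, the workers need no locks: the queue serializes all writes, which is the property the commit is after.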
Files:
- lib
- .gitignore
- header.txt
- pymongoexport_csv.py
- pymongoexport_json.py
- requirements.txt