Thread-safe parallel version (dirty code)
Changed the structure of the code to make it thread safe when dumping
data to the filesystem. The previous parallelism affected all the
stages, which could corrupt data when two processes tried to write to
the same file at the same time. Now only the code that retrieves data
from Mongo and converts it to CSV is parallelized and handed to a pool
of workers; it is named "process_page" because each worker receives a
page of X (default 1000) tweets to convert. A sketch of that side
follows.
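
A minimal sketch of the paging side, assuming pymongo against a local
MongoDB; the database/collection names and the tweet_to_csv_row helper
are hypothetical, while "process_page", the default page size of 1000,
and the buffer-to-queue hand-off come from this change:

    import csv
    import io

    from pymongo import MongoClient

    PAGE_SIZE = 1000  # default number of tweets per page


    def tweet_to_csv_row(doc):
        # Hypothetical converter; the real field mapping lives elsewhere
        return [doc.get("_id"), doc.get("text")]


    def process_page(args):
        """Convert one page of tweets to CSV, writing only to a buffer."""
        page_number, queue = args
        # Each worker opens its own connection (details are assumptions)
        client = MongoClient()
        collection = client["twitter"]["tweets"]
        docs = (collection.find()
                .skip(page_number * PAGE_SIZE)
                .limit(PAGE_SIZE))
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        for doc in docs:
            writer.writerow(tweet_to_csv_row(doc))
        # Hand the finished buffer to the single writer process;
        # no filesystem access happens in the workers
        queue.put(buffer.getvalue())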
However, those workers only write to buffers, which they push onto a
thread-safe multiprocessing queue. That queue is consumed by a single
process, the "filesystem_writer", which is the only one allowed to
write to the filesystem (this covers both creating the necessary dirs
and appending tweets to the CSV files). This worker runs in an endless
loop, waiting for new data on the queue and writing it out.
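
A sketch of the writer side under the same assumptions; the output
directory, the file name, and the None sentinel used to stop the
otherwise endless loop are illustrative, not part of the original code.
The wiring reuses process_page as sketched above, and uses a Manager
queue so the pool workers can receive it as an argument:

    import os
    from multiprocessing import Manager, Pool, Process


    def filesystem_writer(queue, out_dir):
        """Single consumer: the only process that touches the filesystem."""
        os.makedirs(out_dir, exist_ok=True)  # create the necessary dirs
        out_path = os.path.join(out_dir, "tweets.csv")  # layout is an assumption
        while True:  # endless loop: block until new data shows up on the queue
            chunk = queue.get()
            if chunk is None:  # sentinel: all producers are done, stop writing
                break
            with open(out_path, "a", newline="") as f:  # append, never overwrite
                f.write(chunk)


    if __name__ == "__main__":
        manager = Manager()
        queue = manager.Queue()
        writer = Process(target=filesystem_writer, args=(queue, "output"))
        writer.start()
        with Pool() as pool:
            # page count would come from the collection size; 10 is illustrative
            pool.map(process_page, [(n, queue) for n in range(10)])
        queue.put(None)  # tell the writer to finish
        writer.join()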
This is a pretty dirty version: it still includes functions and code
that are no longer used, as well as poor log messages added during
development to hunt down bugs.
Will refactor soon.