- 19 Jul, 2019 4 commits
- serpucga authored
- serpucga authored
Now a directory "recovery" is created to contain these kinds of files. Besides, they are no longer hidden files, and they will always be unique, because they contain a timestamp in their filename (this way a new recovery file won't unexpectedly overwrite a previous recovery file for the same collection)
- serpucga authored
Added a new mode of execution, "recovery", which allows continuing the execution of a task by loading a recovery file from a previous process
- serpucga authored
- 18 Jul, 2019 4 commits
- serpucga authored
- serpucga authored
- serpucga authored
Added a new option and mode, "-t", which shows the time costs of some of the most relevant operations (writing to file, converting a page to CSV format, creating the metadata file...). Besides, the verbose mode was enhanced considerably: the noisiest messages were left out, some useful ones were introduced, and others were improved.
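Timing instrumentation like this is commonly done with a decorator. This is a sketch under assumptions: the decorator, logger name, and `page_to_csv` example are hypothetical stand-ins, with `enabled` standing in for the "-t" flag.

```python
import functools
import logging
import time

logger = logging.getLogger("mongo2csv")  # hypothetical logger name


def timed(operation: str, enabled: bool = True):
    """Log the wall-clock cost of an operation when timing mode is active."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not enabled:
                return func(*args, **kwargs)
            start = time.perf_counter()
            result = func(*args, **kwargs)
            logger.info("%s took %.3f s", operation, time.perf_counter() - start)
            return result
        return wrapper
    return decorator


@timed("converting a page to CSV")
def page_to_csv(page):
    """Toy stand-in for the real page-conversion routine."""
    return "\n".join(",".join(str(v) for v in row) for row in page)
```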
- serpucga authored
Variable names enhanced for clarity, old and unused code removed, some changes in the logs, and lots of new docstrings.
- 17 Jul, 2019 3 commits
- serpucga authored
Changed the structure of the code to make it thread safe when dumping data to the filesystem. The previous parallelism affected all the stages, which could lead to corrupt data when two processes tried to write to the same file at the same time. Now the code that retrieves data from Mongo and converts it to CSV, named "process_page" because each worker receives a page of X (default 1000) tweets to convert, is parallelized and given to a pool of workers. However, those workers only write to buffers that they pass to a thread-safe multiprocessing queue. That queue is consumed by a single process, the "filesystem_writer", which is the only one allowed to write to the filesystem (this includes both creating the necessary dirs and appending tweets to the CSV files). This worker loops forever looking for new data on the queue to write down. This is a pretty dirty version that includes functions and code that are no longer used, and pretty bad log messages used during development to hunt down bugs. Will refactor soon.
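The producer/consumer layout described above can be sketched as follows. This is not the repository's code: the real version uses the process-based `multiprocessing` API, while this portable sketch uses its thread-backed twin `multiprocessing.dummy` (same interface) and writes into an in-memory dict instead of real files so it stays self-contained.

```python
import csv
import io
import queue
import threading
from multiprocessing.dummy import Pool  # thread-backed twin of multiprocessing.Pool

SENTINEL = None  # tells the writer that no more pages are coming


def process_page(args):
    """Worker: convert one page of tweets to CSV text, in memory only."""
    path, page = args
    buf = io.StringIO()
    csv.writer(buf).writerows(page)
    return path, buf.getvalue()


def filesystem_writer(q, files):
    """Single consumer: the only code allowed to 'touch the filesystem'.

    Here `files` is a dict standing in for real CSV files on disk.
    """
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        path, data = item
        files[path] = files.get(path, "") + data  # append, like the real writer


def dump(pages):
    """Fan pages out to a pool, funnel all writes through one consumer."""
    q = queue.Queue()
    files = {}
    writer = threading.Thread(target=filesystem_writer, args=(q, files))
    writer.start()
    with Pool(4) as pool:
        for item in pool.imap_unordered(process_page, pages):
            q.put(item)  # only buffers travel to the writer
    q.put(SENTINEL)
    writer.join()
    return files
```

Because a single consumer serializes every append, two pages destined for the same CSV can never interleave mid-write, which is the corruption the commit describes.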
- serpucga authored
- serpucga authored
- 16 Jul, 2019 3 commits
- serpucga authored
- serpucga authored
- serpucga authored
Before, it was done in the same way as in UTool: by increasing an entry in the metadata file by X each time X tweets were added to that CSV. However, for a script that converts static collections (which are not growing in size) from Mongo to CSV, it is better to just count the number of lines of each CSV file once the conversion process has ended. This also removes the risk of the metadata being corrupted due to bad parallelization.
- 15 Jul, 2019 5 commits
- serpucga authored
The previous version ran out of memory for big databases, because it tried to launch all processes at once. This version has no memory issues anymore. The problems with the code not being thread safe and with process collisions remain.
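The commit does not say how the fix was done; one common way to bound memory is to feed a fixed-size pool from a lazy generator via `imap`, so only a handful of pages exist at any time. A sketch under that assumption (thread-backed `multiprocessing.dummy` stands in for the process-based pool to keep it self-contained):

```python
from multiprocessing.dummy import Pool  # stands in for multiprocessing.Pool


def pages(cursor, page_size=1000):
    """Lazily yield pages of tweets instead of materializing them all."""
    page = []
    for tweet in cursor:
        page.append(tweet)
        if len(page) == page_size:
            yield page
            page = []
    if page:
        yield page


def convert_all(cursor, worker, n_workers=4):
    """Stream pages through a fixed-size pool; imap consumes the
    generator lazily, so only a bounded number of pages is in flight."""
    with Pool(n_workers) as pool:
        for result in pool.imap(worker, pages(cursor)):
            yield result
```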
- serpucga authored
- serpucga authored
- serpucga authored
Parallelized using the multiprocessing library. I'm not really sure about the code being thread safe. I think we don't care if tweets are appended to the files in a different order, but the metadata files getting corrupted would be problematic. In the first tests the metadata were fine, but this line is probably not thread safe (two threads could try to update the old value at the same time, resulting in inconsistencies): """ metadata_file["files"][file_path]["count"] += increase """ Apart from that, the code is much faster than before.
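The suspicion is well founded: `+=` is a read-modify-write, so two workers can read the same old count and one increment gets lost. Guarding it with a lock is a sketch of one fix (between threads; between processes a `multiprocessing.Lock` or a single writer process would be needed instead). The function name is illustrative:

```python
import threading

metadata_lock = threading.Lock()


def safe_increase(metadata_file, file_path, increase):
    """Serialize the read-modify-write so no increment is lost."""
    with metadata_lock:
        metadata_file["files"][file_path]["count"] += increase
```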
- serpucga authored
Simpler, more elegant and slightly faster version that uses the cursors instead of building a list of tweets for each page
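With pymongo, iterating the cursor directly lets the driver stream documents in batches under the hood, so no intermediate per-page list is needed. A minimal sketch (the function name is hypothetical; `cursor` would be a `collection.find()` result, but any iterable works):

```python
def dump_with_cursor(cursor, handle_tweet):
    """Stream straight off the cursor instead of building page lists."""
    count = 0
    for tweet in cursor:
        handle_tweet(tweet)
        count += 1
    return count
```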
- 12 Jul, 2019 3 commits
- 11 Jul, 2019 2 commits
- 10 Jul, 2019 2 commits
- serpucga authored
- serpucga authored
For the moment the repository contains just one simple script, which dumps the "tweets" collection of a Mongo database to a JSON file in a "pymongodump" directory created at the moment and place of execution. Faster than mongoexport, although the format of the resulting JSONs is somewhat different (adapted to Python's syntax).