1. 15 Jul, 2019 3 commits
    • Create one Mongo connection for each process · 33782e37
      serpucga authored
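      A minimal sketch of what "one connection per process" can look like,
      assuming a multiprocessing.Pool whose initializer opens a client in
      each worker (pymongo's MongoClient is not fork-safe, so it must not
      be shared across processes); dump_page and the URI are hypothetical,
      not taken from the repo:
      
      """
      import multiprocessing
      
      from pymongo import MongoClient
      
      client = None  # per-process connection, set in the initializer
      
      
      def init_worker():
          # Runs once in every child process: each worker opens its
          # own MongoClient instead of inheriting a shared one.
          global client
          client = MongoClient("mongodb://localhost:27017")
      
      
      def dump_page(page):
          # Hypothetical task using this worker's own connection.
          tweets = client["twitter"]["tweets"]
          return tweets.count_documents({})  # placeholder work
      
      
      if __name__ == "__main__":
          with multiprocessing.Pool(initializer=init_worker) as pool:
              pool.map(dump_page, range(4))
      """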
    • First parallel version of the code · ed2c9d74
      serpucga authored
      Parallelized using the multiprocessing library. I'm not really sure
      the code is thread safe. I think we don't care if tweets are appended
      to the files in a different order, but the metadata files getting
      corrupted would be problematic. In the first tests the metadata came
      out fine, but I think the following line is probably not safe (two
      processes could load the old value and try to update it at the same
      time, resulting in inconsistencies):
      
      """
      metadata_file["files"][file_path]["count"] += increase
      """
      
      Apart from that, the code is much faster than before.
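      A minimal sketch of one way to make that update safe, assuming the
      counts live in a JSON metadata file and that a multiprocessing.Lock
      is shared with (or inherited by) the workers; paths and names here
      are hypothetical:
      
      """
      import json
      import multiprocessing
      
      metadata_lock = multiprocessing.Lock()  # inherited by forked workers
      
      
      def update_count(metadata_path, file_path, increase):
          # Serialize the whole read-modify-write cycle so that two
          # processes cannot both load the old count and then overwrite
          # each other's update.
          with metadata_lock:
              with open(metadata_path) as f:
                  metadata_file = json.load(f)
              metadata_file["files"][file_path]["count"] += increase
              with open(metadata_path, "w") as f:
                  json.dump(metadata_file, f)
      """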
    • Simpler, more elegant and slightly faster version using the cursors instead of… · 34776b63
      serpucga authored
      Simpler, more elegant and slightly faster version that iterates the cursors directly instead of building a list of tweets for each page.
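      A hedged sketch of the cursor-based idea (database, collection and
      the process_tweet handler are assumptions, not the repo's names):
      
      """
      from pymongo import MongoClient
      
      client = MongoClient("mongodb://localhost:27017")
      tweets = client["twitter"]["tweets"]
      
      
      def process_tweet(tweet):
          pass  # hypothetical per-tweet handler
      
      
      # Iterating the cursor streams documents in server-side batches
      # instead of first materializing each page of tweets in a list.
      for tweet in tweets.find({}, batch_size=1000):
          process_tweet(tweet)
      """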
  2. 12 Jul, 2019 3 commits
  3. 11 Jul, 2019 2 commits
  4. 10 Jul, 2019 2 commits
    • gitignore set to ignore output dir pymongodump · 56c27157
      serpucga authored
    • Initial commit: Mongo to JSON dumper · d1923e7e
      serpucga authored
      For the moment the repository contains just one simple script, which
      dumps the "tweets" collection of a Mongo database to a JSON file in a
      "pymongodump" directory created at the place and time of execution.
      It is faster than mongoexport, although the format of the resulting
      JSON files is somewhat different (adapted to Python's syntax).
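      A minimal sketch of what such a dumper could look like (the URI,
      database name and output layout are assumptions; the real script
      may differ):
      
      """
      import os
      
      from pymongo import MongoClient
      
      # Create the output directory at the place of execution.
      os.makedirs("pymongodump", exist_ok=True)
      
      client = MongoClient("mongodb://localhost:27017")
      
      with open(os.path.join("pymongodump", "tweets.json"), "w") as out:
          for tweet in client["twitter"]["tweets"].find():
              # Writing the Python repr of each document gives a format
              # close to JSON but adapted to Python's syntax (single
              # quotes, ObjectId(...)), unlike mongoexport's output.
              out.write(repr(tweet) + "\n")
      """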