- 02 Sep, 2019 1 commit
- serpucga authored
- 25 Jul, 2019 4 commits
- 24 Jul, 2019 9 commits
- serpucga authored
- serpucga authored
- serpucga authored
- serpucga authored
- serpucga authored: Added dict-like indexing to the pagination system so that the logs and the recovery file show "relative" page numbers instead of the tweet IDs, which would be confusing for the user.
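A minimal sketch of such a dict-like index (the class name `PageIndex` and its exact shape are illustrative, not taken from the actual code): it maps the tweet "id" that opens each page to a sequential page number, so a log line can say "page 1" rather than printing a raw id.

```python
class PageIndex:
    """Dict-like index mapping the tweet "id" that starts each page to a
    sequential "relative" page number, for use in logs and recovery files."""

    def __init__(self, page_start_ids):
        # page_start_ids: tweet ids marking the start of each page, in order.
        self._index = {tweet_id: n for n, tweet_id in enumerate(page_start_ids)}

    def __getitem__(self, tweet_id):
        # Translate a raw tweet id into its human-friendly page number.
        return self._index[tweet_id]

    def __len__(self):
        return len(self._index)
```

With `pages = PageIndex([1100437, 1100998, 1101554])`, a message about the page starting at id 1100998 can simply report `pages[1100998]`, i.e. page 1.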
- serpucga authored: Bugfix: the recovery index was being calculated by mixing the new and old styles of page indexing.
- serpucga authored
- serpucga authored: Found that the old pagination system based on skip() and limit() scaled terribly for large collections. However, if the paging isn't done by skipping documents but by asking for the tweets with a higher or lower value of one field, the query is much, much faster. Thus, using the unique "id" field as the pagination and retrieval index can work for large collections.
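The difference can be sketched with a pure-Python stand-in for the collection (no MongoDB required). With pymongo the page query would be along the lines of `collection.find({"id": {"$gt": last_id}}).sort("id", 1).limit(page_size)`, which walks an index instead of reading and discarding all the skipped documents on every query; the helper below mimics that shape:

```python
def paginate_by_id(tweets, page_size):
    """Yield pages by remembering the last "id" seen and asking for ids
    greater than it, instead of skip()-ing over already-read documents.
    (Pure-Python sketch of the range-query idea; a real Mongo query on an
    indexed "id" field avoids scanning the skipped documents entirely.)"""
    tweets = sorted(tweets, key=lambda t: t["id"])
    last_id = None
    while True:
        if last_id is None:
            page = tweets[:page_size]
        else:
            # Equivalent of find({"id": {"$gt": last_id}}).limit(page_size)
            page = [t for t in tweets if t["id"] > last_id][:page_size]
        if not page:
            return
        last_id = page[-1]["id"]
        yield page
```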
- serpucga authored
- 22 Jul, 2019 5 commits
- serpucga authored
- serpucga authored: Enhanced documentation and removed a function that is no longer used.
- serpucga authored: Don't know what I was thinking when I wrote ".csv" when this is clearly a JSON file.
- serpucga authored: The system should now be capable of recovering from a failure during the conversion process, either by ignoring the error or by dumping the state at the moment of failure and allowing the process to be resumed later from the point where it stopped. The policies followed at this stage to avoid corrupt data or other errors are:
  1. If a specific tweet raises an error while being converted to CSV, that tweet is skipped and execution continues.
  2. If any other error occurs while processing a page of tweets, the number of that page is recorded in the recovery file, and that page will be skipped when the user resumes execution from the recovery file.
  3. On any other unexpected error, keyboard interruption or anything else, a standard recovery file is dumped, with the list of already converted pages but without an "error_page" entry, so that when the script is executed with the "-r" flag it will try to resume from where it left off without discarding any information.
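A sketch of what dumping and reading such a recovery file could look like. The "error_page" key comes from the message above; the other names (`converted_pages`, the function names) are assumptions for illustration:

```python
import json

def dump_recovery_file(path, converted_pages, error_page=None):
    """Persist the recovery state: pages already converted and, if the
    failure happened while processing a page, that page's number."""
    state = {"converted_pages": sorted(converted_pages)}
    if error_page is not None:
        state["error_page"] = error_page
    with open(path, "w") as f:
        json.dump(state, f)

def pages_to_skip(path):
    """On resume (-r), skip the pages already converted plus, if present,
    the page that caused the failure."""
    with open(path) as f:
        state = json.load(f)
    skip = set(state["converted_pages"])
    if "error_page" in state:
        skip.add(state["error_page"])
    return skip
```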
- serpucga authored: Added a way of dumping the number of the page that raised an error during CSV conversion. This way we can later implement ways of dealing with it (skipping the corrupt page, inspecting it to find where the problem resides, etc.).
- 19 Jul, 2019 4 commits
- serpucga authored
- serpucga authored: Now a "recovery" directory is created to contain these kinds of files. Besides, they are no longer hidden files and they will always be unique, because their filenames contain a timestamp (this way a recovery file won't unexpectedly overwrite a previous recovery file for the same collection).
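The filename scheme can be sketched like this; the exact timestamp format, the `.json` extension and the helper name are assumptions, the "recovery" directory name comes from the message above:

```python
import os
from datetime import datetime

def recovery_file_path(collection_name, recovery_dir="recovery"):
    """Build a unique, non-hidden recovery filename: the timestamp makes
    two runs over the same collection produce different files, so a new
    recovery file never overwrites a previous one."""
    os.makedirs(recovery_dir, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return os.path.join(recovery_dir, f"{collection_name}_{stamp}.json")
```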
- serpucga authored: Added a new mode of execution, 'recovery', which allows continuing a task by loading a recovery file from a previous run.
- serpucga authored
- 18 Jul, 2019 4 commits
- serpucga authored
- serpucga authored
- serpucga authored: Added a new option and mode, "-t", which shows the time costs of some of the most relevant operations (writing to file, converting a page to CSV format, creating the metadata file...). Besides, the verbose mode was considerably enhanced: the noisiest messages were removed, some useful ones were introduced and others were improved.
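A timing mode like "-t" can be sketched with a decorator that accumulates wall-clock costs per operation; all names here (`TIMINGS`, `timed`, `write_page`) are illustrative, not taken from the actual code:

```python
import time
from functools import wraps

# Accumulated costs per operation, to be summarized when "-t" is active.
TIMINGS = {}

def timed(operation_name):
    """Record the wall-clock cost of each call under the given name."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            TIMINGS.setdefault(operation_name, []).append(time.perf_counter() - start)
            return result
        return wrapper
    return decorator

@timed("write_page")
def write_page(lines):
    # Stand-in for a relevant operation (writing a page to file).
    return len(lines)
```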
- serpucga authored: Variable names enhanced for clarity, old and unused code removed, some changes to the logs and lots of new docstrings.
- 17 Jul, 2019 3 commits
- serpucga authored: Changed the structure of the code to make it thread safe when dumping data to the filesystem. The previous parallelism affected all the stages, and that could lead to corrupt data when two processes tried to write to the same file at the same time. Now the code that retrieves data from Mongo and converts it to CSV, named "process_page" because each worker receives a page of X (default 1000) tweets to convert, is parallelized and handed to a pool of workers. However, those workers only write to buffers that they pass to a thread-safe multiprocessing queue. That queue is consumed by a single process, the "filesystem_writer", which is the only one allowed to write to the filesystem (this includes both creating the necessary directories and appending tweets to the CSV files). This worker loops forever looking for new data on the queue in order to write it down. This is a pretty dirty version that includes functions and code that are no longer used, and pretty bad log messages used during development to hunt down bugs. Will refactor soon.
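The worker/writer split described above can be sketched as follows. To keep the example self-contained it uses `multiprocessing.dummy` (thread-backed, same `Pool` API) and a plain queue instead of real processes; the function names `process_page` and `filesystem_writer` come from the message, everything else is an assumption:

```python
import os
import tempfile
import threading
from multiprocessing.dummy import Pool  # thread-backed, same API as mp.Pool
from queue import Queue

def process_page(job):
    """Worker: converts one page of tweets to CSV text, in memory only."""
    page_number, tweets, out_queue = job
    buffer = "".join(",".join(str(field) for field in tweet) + "\n" for tweet in tweets)
    out_queue.put(buffer)  # hand the result to the single writer

def filesystem_writer(out_queue, csv_path):
    """The only worker allowed to touch the filesystem: it loops reading
    buffers from the queue and appends them, until the None sentinel."""
    with open(csv_path, "a") as f:
        while True:
            buffer = out_queue.get()
            if buffer is None:
                break
            f.write(buffer)

csv_path = os.path.join(tempfile.mkdtemp(), "tweets.csv")
queue = Queue()
writer = threading.Thread(target=filesystem_writer, args=(queue, csv_path))
writer.start()
with Pool(2) as pool:  # workers convert pages concurrently...
    pool.map(process_page, [(n, [(n, "text")], queue) for n in range(4)])
queue.put(None)  # ...and only once all pages are done, stop the writer
writer.join()
```

The key property is that no matter how many workers run in parallel, only `filesystem_writer` ever opens a file, so concurrent appends cannot interleave within a page.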
- serpucga authored
- serpucga authored
- 16 Jul, 2019 3 commits
- serpucga authored
- serpucga authored
- serpucga authored: Before, it was done the same way as in UTool: increasing an entry in the metadata file by X each time X tweets were added to that CSV. However, for a script that converts static Mongo collections (which are not growing in size) to CSV, it is better to just count the number of lines of each CSV file once the conversion process has ended. This also suppresses the risk of the metadata being corrupted due to bad parallelization.
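Counting at the end can be as simple as the helper below; whether the real CSV files carry a header row, and the helper's name, are assumptions:

```python
def count_csv_tweets(csv_path, has_header=True):
    """Count the tweets in a finished CSV file by counting its lines once,
    instead of maintaining a running counter during the conversion."""
    with open(csv_path) as f:
        lines = sum(1 for _ in f)
    return lines - 1 if has_header else lines
```

Because this runs once after all workers have finished, no concurrent updates to the metadata can race with each other.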
- 15 Jul, 2019 5 commits
- serpucga authored: The previous version ran out of memory for big databases, because it tried to launch all processes at once. This version has no memory issues anymore. The problems with the code not being thread safe and with process collisions remain.
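The fix amounts to a fixed-size pool fed lazily instead of one process per page, sketched here with `multiprocessing.dummy` so the example is self-contained (the real code uses actual processes; the names `pages` and `convert` are illustrative):

```python
from multiprocessing.dummy import Pool  # thread-backed stand-in for mp.Pool

def pages(n_tweets, page_size):
    """Lazily yield page numbers instead of materializing every job up
    front; combined with a fixed-size pool, memory use stays bounded."""
    for start in range(0, n_tweets, page_size):
        yield start // page_size

def convert(page_number):
    return page_number  # placeholder for the real page-conversion work

results = []
with Pool(4) as pool:  # at most 4 workers alive at any moment
    for result in pool.imap(convert, pages(5000, 1000)):
        results.append(result)
```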
- serpucga authored
- serpucga authored
- serpucga authored: Parallelized using the multiprocessing library. I'm not really sure about the code being thread safe. I think we don't care if tweets are appended to the files in a different order, but the metadata files being corrupted would be problematic. In the first tests the metadata were fine, but I think this line is probably not thread safe (two workers could read and try to update the old value at the same time, resulting in inconsistencies): """ metadata_file["files"][file_path]["count"] += increase """ Apart from that, the code is much faster than before.
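The quoted line is a classic read-modify-write race: two workers can both read the old count and each write back old + increase, losing one update. A thread-based sketch of guarding it with a lock (the real code uses separate processes, where the shared dict would also need something like a multiprocessing.Manager; all names here are illustrative):

```python
from multiprocessing.dummy import Pool  # thread-backed, same API as mp.Pool
from threading import Lock

metadata = {"files": {"coll/tweets.csv": {"count": 0}}}
metadata_lock = Lock()

def add_tweets(increase, file_path="coll/tweets.csv"):
    # "+=" on a dict entry is read-modify-write, not atomic; the lock
    # ensures two workers cannot both read the old value and then
    # overwrite each other's update.
    with metadata_lock:
        metadata["files"][file_path]["count"] += increase

with Pool(8) as pool:
    pool.map(add_tweets, [10] * 100)
```

With the lock, 100 concurrent increments of 10 always yield exactly 1000; without it, updates can be silently lost.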
- serpucga authored: A simpler, more elegant and slightly faster version that uses the cursors directly instead of building a list of tweets for each page.
- 12 Jul, 2019 2 commits