1. 02 Sep, 2019 1 commit
  2. 25 Jul, 2019 4 commits
  3. 24 Jul, 2019 9 commits
  4. 22 Jul, 2019 5 commits
    • Reformatting · ab69fb73
      serpucga authored
      Enhanced documentation and removed a function that is no longer used
    • Changed recovery file format extension · b3721791
      serpucga authored
      Don't know what I was thinking when I wrote ".csv" when this is clearly
      a JSON file.
    • Error handling added · 662b9e66
      serpucga authored
      The system should now be capable of overcoming a failure during the
      conversion process, either by ignoring the error or by dumping the
      state at the moment of failure, allowing the process to be resumed
      later from the point where it stopped.
      
      The policies followed at this stage for avoiding corrupt data or other
      errors are the following:
      1. If a specific tweet raises an error while being converted to CSV,
      the tweet is skipped and the whole execution continues.
      2. If there is any other error when processing a page of tweets, the
      number of that page is recorded in the recovery file, and that page will
      be skipped when the user tries to resume the execution from the recovery
      file.
      3. If any other unexpected error or a keyboard interruption occurs, a
      standard recovery file will be dumped, with the list of already
      converted pages but without "error_page", so that when the script is
      executed with the "-r" flag, the program will try to resume the
      execution from the point where it left off without discarding any
      info.
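      
      A minimal sketch of how these three policies could fit together. The
      recovery file layout ("converted_pages" plus an optional "error_page")
      follows the description above; "write_tweet_row" is a hypothetical
      per-tweet helper, not a function from this repo:
      
      """
      import json
      import sys
      
      def dump_recovery(path, converted_pages, error_page=None):
          # Write the JSON recovery file consumed by the "-r" flag
          state = {"converted_pages": converted_pages}
          if error_page is not None:
              state["error_page"] = error_page          # this page is skipped on resume
          with open(path, "w") as f:
              json.dump(state, f)
      
      def convert_all(pages, recovery_path, output_dir):
          converted = []
          try:
              for number, tweets in pages:
                  try:
                      for tweet in tweets:
                          try:
                              write_tweet_row(tweet, output_dir)  # hypothetical helper
                          except Exception:
                              continue                  # policy 1: skip the bad tweet
                  except Exception:
                      # policy 2: record the page that failed, then stop
                      dump_recovery(recovery_path, converted, error_page=number)
                      sys.exit(1)
                  converted.append(number)
          except (Exception, KeyboardInterrupt):
              # policy 3: plain recovery file, nothing is discarded
              dump_recovery(recovery_path, converted)
              raise
      """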
    • Include corrupt number page in recovery file · c05a677c
      serpucga authored
      Added a way of dumping the number of the page that raised an error
      while converting to CSV. This way we can later implement ways of
      dealing with it (skipping the corrupt page, inspecting it to find out
      where the problem lies, etc.).
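      
      A rough sketch of how a resume run might consume that page number (the
      exact keys and layout of the recovery JSON are assumed from the commit
      messages, not taken from the code):
      
      """
      import json
      
      def load_recovery(path):
          # Read the recovery JSON and work out which pages to skip on resume
          with open(path) as f:
              state = json.load(f)
          skip = set(state.get("converted_pages", []))
          if "error_page" in state:
              skip.add(state["error_page"])     # the page that crashed the previous run
          return skip
      """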
  5. 19 Jul, 2019 4 commits
  6. 18 Jul, 2019 4 commits
  7. 17 Jul, 2019 3 commits
    • Thread-safe parallel version (dirty code) · 34170f35
      serpucga authored
      Changed the structure of the code to make it thread safe when dumping
      data to the filesystem. The previous parallelism affected all the
      stages, which could lead to corrupt data when two processes tried to
      write to the same file at the same time. Now the code that retrieves
      data from Mongo and converts it to CSV, named "process_page" because
      each worker receives a page of X (default 1000) tweets to convert, is
      parallelized and handed to a pool of workers.
      
      However, those workers only write to buffers that they pass to a
      thread-safe multiprocessing queue. That queue is processed by a single
      process, the "filesystem_writer", which is the only one allowed to
      write to the filesystem (this includes both creating the necessary
      dirs and appending tweets to the CSV files). This worker runs in an
      endless loop, looking for new data on the queue to write out.
      
      This is a pretty dirty version that includes functions and code that
      are no longer used, as well as pretty bad log messages used during
      development to hunt down bugs.
      
      Will refactor soon.
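      
      A condensed sketch of the producer/consumer layout described above. It
      uses a Manager queue so that the Pool workers can receive it;
      "tweet_to_csv_row" and "pages" are placeholders for the real
      conversion code and input:
      
      """
      import multiprocessing as mp
      import os
      
      def process_page(page, queue):
          # Producer: convert one page of tweets and push the buffers to the queue
          buffers = {}                                  # {csv_path: [csv_rows]}
          for tweet in page:
              path, row = tweet_to_csv_row(tweet)       # placeholder converter
              buffers.setdefault(path, []).append(row)
          queue.put(buffers)
      
      def filesystem_writer(queue):
          # Consumer: the only process that ever touches the filesystem
          while True:
              buffers = queue.get()
              if buffers is None:                       # sentinel: all pages are done
                  break
              for path, rows in buffers.items():
                  os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
                  with open(path, "a") as f:
                      f.write("\n".join(rows) + "\n")
      
      if __name__ == "__main__":
          pages = []                                    # placeholder: built from Mongo in the real script
          manager = mp.Manager()
          queue = manager.Queue()                       # proxy queue, safe to pass to Pool workers
          writer = mp.Process(target=filesystem_writer, args=(queue,))
          writer.start()
          with mp.Pool() as pool:
              pool.starmap(process_page, [(page, queue) for page in pages])
          queue.put(None)                               # tell the writer to stop
          writer.join()
      """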
    • Forgot to add main module logger · 50dda137
      serpucga authored
    • Added proper logging · 7ff9ad26
      serpucga authored
  8. 16 Jul, 2019 3 commits
    • Trying to make it thread safe, step1 · 866da8f7
      serpucga authored
    • Reformatting · 844fabe9
      serpucga authored
    • Changed way of generating metadata file · 2e50b803
      serpucga authored
      Before, it was done in the same way as in UTool: by increasing an
      entry in the metadata file by X each time that X tweets were added to
      that CSV. However, for a script that converts static Mongo collections
      that are not growing in size to CSV, it is simply better to count the
      number of lines of each CSV file once the conversion process has
      ended.
      
      This also removes the risk of the metadata being corrupted due to bad
      parallelization.
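      
      A small sketch of that post-run counting, assuming the metadata ends
      up in a JSON file with the {"files": {path: {"count": N}}} shape
      mentioned elsewhere in this log:
      
      """
      import glob
      import json
      import os
      
      def generate_metadata(output_dir, metadata_path):
          # Count the rows of every CSV once the conversion has finished
          metadata = {"files": {}}
          pattern = os.path.join(output_dir, "**", "*.csv")
          for path in glob.glob(pattern, recursive=True):
              with open(path) as f:
                  count = sum(1 for _ in f) - 1         # assumes one header line per file
              metadata["files"][path] = {"count": count}
          with open(metadata_path, "w") as f:
              json.dump(metadata, f)
      """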
  9. 15 Jul, 2019 5 commits
    • Limited number of processes launched by using Pool · 1f695bf1
      serpucga authored
      The previous version ran out of memory for big databases because it
      tried to launch all processes at once. This version no longer has
      memory issues.
      
      Problems with thread safety and process collisions remain.
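      
      In essence the change amounts to handing the pages to a bounded pool
      instead of spawning one process per page (sketch only; the real worker
      signature differs):
      
      """
      import multiprocessing as mp
      
      def convert_pages(pages):
          # A bounded pool keeps memory flat: pending pages wait in the
          # pool's queue instead of all being launched at once
          with mp.Pool(processes=mp.cpu_count()) as pool:
              pool.map(process_page, pages)     # "process_page" as named elsewhere in this log
      """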
    • Ignored tests.py · 84849517
      serpucga authored
    • Create one Mongo connection for each process · 33782e37
      serpucga authored
    • First parallel version of the code · ed2c9d74
      serpucga authored
      Parallelized using the multiprocessing library. I'm not really sure
      about the code being thread safe. I think we don't care if tweets are
      appended to the files in a different order, but the metadata files
      being corrupted would be problematic. In the first tests the metadata
      were fine, but I think this line is probably not thread safe (two
      workers could load and try to update the old value at the same time,
      resulting in inconsistencies):
      
      """
      metadata_file["files"][file_path]["count"] += increase
      """
      
      Apart from that, code is much faster than before.
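      
      One possible way to make that increment safe (a sketch, not the
      approach taken here; a later commit drops the incremental counter
      entirely) is to share the metadata through a Manager and guard the
      read-modify-write with a lock:
      
      """
      import multiprocessing as mp
      
      def update_count(metadata, lock, file_path, increase):
          # Guard the read-modify-write so two workers cannot both load the old count
          with lock:
              entry = metadata[file_path]       # the proxy dict returns a plain copy
              entry["count"] += increase
              metadata[file_path] = entry       # reassign so the Manager stores the new value
      
      if __name__ == "__main__":
          manager = mp.Manager()
          metadata = manager.dict({"tweets.csv": {"count": 0}})
          lock = manager.Lock()
          with mp.Pool(4) as pool:
              pool.starmap(update_count, [(metadata, lock, "tweets.csv", 1)] * 100)
          print(metadata["tweets.csv"]["count"])    # 100; without the lock some updates could be lost
      """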
    • Simpler, more elegant and slightly faster version using the cursors instead of… · 34776b63
      serpucga authored
      Simpler, more elegant and slightly faster version using the cursors instead of building a list of tweets for each page
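      
      Roughly the difference, as a sketch (the database and collection names
      and the per-row helper are placeholders):
      
      """
      from pymongo import MongoClient
      
      def page_cursor(collection, page_number, page_size=1000):
          # Iterate one page lazily through the cursor instead of building a list first
          return collection.find().skip(page_number * page_size).limit(page_size)
      
      # collection = MongoClient()["twitter"]["tweets"]   # placeholder names
      # for tweet in page_cursor(collection, 0):
      #     write_tweet_row(tweet)                        # hypothetical per-tweet writer
      """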
  10. 12 Jul, 2019 2 commits