Questions and notes from workshop day 4

Icebreaker questions

Icebreaker Q1: Computer programs are expected to produce the same output for the same inputs. Is that always true for research software? Can you give some examples?

Icebreaker Q2: What do you want to be when you grow up?

Poll: Do you start this week or did you already attend last week?

Questions on last week's material

  1. I can't get the link "Very detailed 2-page git cheatsheet" to work on this page: https://coderefinery.github.io/git-intro/reference/ I just get "404 File not found". Thanks for your help!

    • Fixed now :relaxed: Thank you for notifying!
      • Thanks so much!
  2. Could you clarify how branches relate to having your local repository, your forked repo and the central repo? For example, all of them have a "master" branch, so if I talk about the master branch, which one am I referring to? And where should merges happen? All these things weren't properly discussed last week.

    • You have multiple repositories:
      • On your computer
      • the main repository online
      • maybe also a fork online
    • Each repository has a bunch of branches. At minimum they all have a main or a master. (They can point to different code, but I like to keep them the same on different repositories.)
      • Multiple people might be working on the main online repository. So there might be changes you don't have.
        • When you run git pull, you get those changes and merge them. This merging happens on your computer.
        • Then you may run git push to get your own changes to your online repository. This merge happens online (so it only works if the computer can do it. No conflicts allowed. You must pull first.)
      • If you make your own changes (say on branch "example"), you can git push them to the online repository. This merge happens online. (so no conflicts allowed, you must pull first.)
      • If you have a fork that you push to, then you will need to create a "pull request" (as we saw in the exercises). This asks the owner of the main repository to pull and merge those changes.
        • The owner can then click "merge pull request", which triggers a git pull (technically git fetch and git merge, actually). This merge happens on the main repository. Again, no conflicts allowed.

Reproducible Research

https://coderefinery.github.io/reproducible-research/

Questions

  1. Could you please elaborate on this "reproducibility crisis"? How did it start?

    • Yes. And we will discuss this also later during the lesson.
    • I approve the new spelling of "crysis" - cause we cry because of it :crying_cat_face: haha
      • aha, I thought 'cry' was for very cold :-) but that is 'cryo'
        • haha yes, that's true! It's from Greek
        • but can it run crysis?(+1)
          • no, I don't think it's correct, unfortunately :P
  2. Man I wish people were less selfish, I totally believe that science should be more collaborative and "for the good of all". Should we maybe redesign the funding and publishing systems? :D (+1)

    • I have the same wish, no answer though.
      • Stop the criteria of publishing x amount of articles to promote?
      • I totally agree. We should care more about the way the research is conducted (i.e., quality) rather than publications per se...
    • There are many initiatives for "multi-lab" (many research groups independently collecting data to answer the same question) and "multi-verse" (one big dataset analysed in parallel by many research groups, each using their own methods) projects. Here is a multi-lab example from psychology https://psysciacc.org/ and here is a multi-verse publication from neuroscience (E. was one of the 70 teams: https://www.nature.com/articles/s41586-020-2314-9). In particle physics they understood many years ago that the multi-lab approach is the only way to advance the field.
  3. At least in bioinformatics it seems to me that people always believe they cannot use the programs/codes created by others, their own version would be better (at least 2-3 guys/girls I met), is that a field-specific attitude or is it in reality that many times the analyses for a specific biological question are just too "specific" (for the lack of a better word :D)?

    • It occurs in other fields too, for many reasons I think. One reason can be for the sake of learning: a good way to learn a method or algorithm is to implement it yourself starting from scratch. When the programs become larger, then clearly not everyone can start from the beginning as it would take too long.
    • Isn't it a problem for reproducibility? I mean, a tool used as it is is easier to cite and to refer to, while creating your own version is more complicated. Moreover, you then also have to maintain this version, which takes a huge effort.
    • Agreed, this could lead to problems with reproducibility. Another aspect is that codes/tools that are developed through often hard and long work might not get used that much for modeling/simulations/calculations for projects that get worked on all the way towards communicating results in a paper. Many codes are underused.
  4. Slightly off topic; do you have any environments for Python you recommend working in? I currently use Jupyter Lab on Mac. (+1)

    • Personally I use VS Code, and I also like Atom. Both are general IDEs with good support for Python. (But I also think the different environments have mostly the same support, so it's a matter of learning one.)
      • Thanks. The ones I've mainly been recommended are Spyder and IDE (but I don't think IDE is available on Mac)
        • IDE is just short for integrated development environment, think of a software application that helps you edit/develop your code.
    • I use the CodeRefinery Conda environment that you can find in the installation instructions. It includes Jupyter, etc.
  5. This professor still knows more than most. :) Many others would just assume everything is available and not doubt there may be issues.(+2)

    • Heh, I can never read that comic the same again!

Discussion

What are your experiences re-running or adjusting a script or a figure you created a few months ago?

Have you continued working from a previous student's script/code/plot/notebook? What were the biggest challenges?

Questions (continued)

  1. In each sub-folder there should be a different .git repository? Or one .git repository for the entire project?

    • One git repository per project is the best approach. If a project is big, then subprojects can have their own repository. If you think of academic publications, one project corresponds to one publication.
  2. Are there standard files that should be in .gitignore?

    • Depends on what content goes into the git repo. For a git repo with source code, one typically lists in the .gitignore the object files and executables that are generated by the compiler. For e.g. Fortran one would like to have *.mod, *.o in the .gitignore.
  3. Do you know of any version control for HARDWARE developments? (e.g, CAD or PCB designs)

    • If the development is done using text, git works.
    • Git is also fine with small non-text files, but it creates copies whenever they are changed. Essentially any version control will do that, though. You need a copy for each version.
      • Thanks!
  4. Are there other resources one can use to share data that is too big to be shared on GitHub?

    • Depends on what kind of data you have since there are a lot of laws now in Europe about GDPR and/or sensitive data, but you could use Dropbox in case your data doesn't need to conform to that.
    • Addition by Elisa (data steward at the VU): we don't recommend using Dropbox. Dropbox is a commercial company that can decide to cease operations, resulting in you losing your data. If you use Google Drive the risk is slightly lower if you use your institutional account, but still, Google could also decide to cease operations. Zenodo (mentioned below) is indeed a better option. That is, if data aren't sensitive, GDPR- or otherwise.
    • If it's a temporary sharing, and the data are not sensitive, you can use some of the popular tools that you most likely have access to at your organisation (Google Drive, Microsoft OneDrive, etc). If the sharing is instead more "permanent" (sharing with the world and preserving the dataset for a long time), you want to version the data release and use a data repository like Zenodo.org. Search if in your field there are dedicated repositories for data sharing https://www.re3data.org/
  5. Should we put the files that we do not want to track in the ./gitignore folder?

    • The .gitignore is a text file. There you can specify the types of files you do not want to be tracked by git (e.g. *.zip).
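
To make the idea of ignore patterns a bit more concrete, here is a minimal sketch in Python that checks file names against the patterns mentioned above (*.mod, *.o, *.zip) using the standard-library fnmatch module. This is only an approximation: git's real matching rules are richer (directory patterns, negation, nested .gitignore files), and the file names are hypothetical examples.

```python
from fnmatch import fnmatch

# Patterns taken from the answers above; git's own rules are richer than fnmatch.
IGNORE_PATTERNS = ["*.mod", "*.o", "*.zip"]


def is_ignored(filename, patterns=IGNORE_PATTERNS):
    """Return True if the file name matches any of the simple glob patterns."""
    return any(fnmatch(filename, pattern) for pattern in patterns)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for name in ["solver.o", "grid.mod", "results.zip", "main.f90", "README.md"]:
        print(name, "ignored" if is_ignored(name) else "tracked")
```

In a real repository you would simply write the three patterns, one per line, into the .gitignore text file and let git do the matching.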

Discussion

Are you using version control for academic papers?

How do you handle collaborative issues e.g. conflicting changes?

What tools are you using when organizing your projects?

Questions (continued)

  1. How to use Overleaf with git? I don't do this, although I use Overleaf a lot. It would be nice to know about it.

    • Git is used internally within Overleaf. My personal experience of using the git API of Overleaf is mixed. Things might get confused if some authors are editing directly on Overleaf, whereas others are adding material over the git API.
      • The versioning may be a premium feature, I am not sure what is available for free.
    • handy links:
      • https://www.overleaf.com/learn/how-to/Using_Git_and_GitHub#.V7NMWLNnthE
      • https://www.overleaf.com/articles/git-and-overleaf-integration/qmdncpnqwfxx
        • thanks!
  2. This is actually a question about last week's exercise. For the forking exercise, it said to wait for someone to accept my merge before starting the second part. However, my merge request is yet to be approved.

    • That's our own fault for not being on top of things. Pull requests don't make people magically interactive, you do need to talk and make sure someone will accept!
      • I do apologise. I am joining this session online, hence I did not know who to contact. Could I ask you to approve #32 in forking-workflow-exercise?
        • No need to apologize, we were too slow. It's accepted now.
  3. I'm very happy to see someone else use cooking analogies! I always do this when running Python undergrad courses (coding language = robot chef, data = ingredients, code/script = instructions)

    • We used the cooking multi-chef / multi-pot metaphors to teach parallel computing last summer :) (see "1.2 What is parallel computing? An analogy with cooking.")
      • ah fun! more to come. I (RB) have the ambition to write a blog post about "cooking analogies for everything computing related": parallelization, scheduling, resource use, SIMD, ...
  4. If you use code snippets posted online by others, say in stack overflow etc, do you need to cite them? Always feel slightly guilty just using them, but then again my coding journey just began and nothing was published yet (not even close).

    • We'll actually talk about this in the second lesson of today! Make sure to raise this question again if it's not answered.
  5. Would like to hear a bit about where things "live" when using Conda. Like what happens under the hood, how do channels work?

    • Good question but might take a while to answer. Let's see: "channels" are basically a way to organize packages into groups, each maintained by a single group of people. There is a default anaconda channel, but the system also allows others to package and distribute what they need. You can select a particular channel when you install packages with conda.
  6. Probably you're going to treat this, but what is the difference between Conda and Anaconda?

    • conda is a package manager. You use it for installing packages and creating environments. Anaconda is the name of a distribution: conda plus a large set of preinstalled packages, which makes it huge. Miniconda is also a distribution, with only the bare minimum of packages needed for having a Python environment.
      • thanks!
  7. In R I am using the renv package (https://rstudio.github.io/renv/articles/renv.html) to deal with package versions. Is this a good approach?

    • yes! :-)
      • Thx :)
  8. If you want to publish your work when coding in Python, PyPI is a common choice. How does one accomplish this with conda? Is it possible, or are conda-forge and other channels the only option?

    • Conda-forge is the most common. There are other sources that specialize in different fields or different types of packages.
    • The instructions for conda-forge https://conda-forge.org/docs/maintainer/adding_pkgs.html
    • So, is there a motivation for ever using conda and publish to conda-forge, rather than virtualenv and publish to PyPI?
      • Both conda-forge and PyPI are viable options for distributing the code. Publishing to both might increase the uptake of the tool.
  9. Is Spyder another package that can be used?

    • It's an editor that can do a lot of the similar things, yes.
  10. Is there a good way for me not to have to remember which environment I made for which project? Something like conda activate .?

    • I name the environment and the project folder the same. Not perfect but helps 👍
    • You can create the environment inside the current project folder by giving conda create --prefix a path in that folder.
  11. What is .yml and why are environments written down in this format?

    • It's a format for textual data. It's meant to be machine readable and easy enough for humans to read and write.
    • And it is good to keep the name and format since this is now recognized by a number of tools (e.g. Binder, later ...)
  12. R. said "I prefer having isolated environments instead of having it installed on my computer." I don't get the exact difference. Aren't the things installed even if you use an environment, and the isolated environment just helps "list" exactly what you need and where they are?

    • If the environment is isolated, it is easier to remove it without affecting everything else on your computer. So they are still installed but often installed into a folder of your choice instead of installed "somewhere" where it's harder to remove.
  13. It has been commented that it is recommended to have one environment per project, but when doing this it seemed to me that it takes a lot of space on my hard disk (a few GB). Is this a known effect, or did I misperceive it?

    • Conda should reuse packages already installed in a different environment. It can create copies, though, when two environments use a different version. You can save space by using the same (latest?) version in different projects.
      • good to know, thank you!
    • The conda cache can take up a lot of space. conda clean --all will remove unused packages and other data.
      • Cool, thank you!
  14. I am not sure I understood what is an environment. Would you mind explaining again?

    • In this context, a way to install outside libraries without needing to install them on your whole computer. It lets projects not interfere with each other, you don't get your computer messy, and can do it without admin access to the computer.
  15. Does anyone have experience with Julia? I have been considering using it as it is supposedly excellent for analysis of large quantities of data.

    • Julia tracks dependencies in the package lock file (sorry I forgot exact name) but tracking dependencies and communicating them is relatively easy in Julia and it is well designed.
  16. When is the best time to generate this yml file? At the end of the project? Or regularly at each major release?

    • I would do it as soon as possible. And incrementally as you add things, so you keep track of what is needed.
    • If environment gets messed up, you can re-create it. Re-creating = ensures it's reproducible later!
      • Then, what is the difference of generating this yml file and generating requirements.txt using pip freeze? (I just realized that pip freeze only works for python venv)
        • If you're working with conda environments, the default is to store dependencies in a file called environment.yml. If you work in a Python virtual environment, the default is requirements.txt.
          • silly question, what is the Conda environment / Python virtual environment?
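            • conda and pip are both package managers that can install Python packages. conda can create environments itself, while with pip you typically create a virtual environment first (for example with venv or virtualenv). Either way, the environment isolates your code from the rest of the system and helps you keep track of dependencies.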
              • Thanks!
  17. What is the best approach to data sharing/version control if you are the only one in the team using it and writing scripts?..

    • I would use git and commit everything/push everything directly
    • But day 6 will show "automated testing", which would show some benefit of pull requests even for one person.
    • I find it useful to imagine: "what if my computer breaks today completely?" - what steps will I need to take to get this to run again. then I write down these steps and put this all somewhere safe outside of my computer. that already will go a long way towards reproducibility.

Exercises until xx:10, then break until xx:20

How was the exercise?

  1. Do we need to do the installation of dependencies for the exercise?

    • For Dependencies-1 you don't need any coding or installation of dependencies.
    • For Dependencies-2 you need the CodeRefinery Conda environment (or your own env)
      • but even only "reading" Dependencies-2 is hopefully useful. The conda env export will create a file similar to the one we see in Dependencies-1
  2. What is a channel in relation to Conda, envs, etc.?

    • these are different distribution channels: places where packages are shared.
      • when building, e.g. a Docker container, how do I know about channels: which one to use, where to get info about available ones?
        • When preparing a Docker container you could add channels e.g. conda-forge, similarly to how you work with conda and channels in a terminal.
  3. I am not sure how to start the exercise. What exactly do I need to do?

    • No coding needed for part 1. The goal for exercise 1 is to read A-E and discuss/consider how this affects reuse in the future.
    • For part 2: if you are in the activated coderefinery conda environment, I would try to export that environment into a file
      • "A: You find a couple of library imports across the code but that’s it." how would I find the libraries?
        • in Python it would be that you find somewhere for instance "import scipy" or "import somelibrary"
      • What is the coderefinery conda environment? Where can I find it?
      • https://coderefinery.github.io/installation/conda-environment/
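
For scenario A above (only library imports scattered across the code), one way to collect them is to let Python parse the source. The sketch below uses the standard-library ast module; the example source string is hypothetical, and note the limits: an import name is not always the name of the package you install, and no version information is recovered this way.

```python
import ast


def find_imports(source):
    """Return the sorted top-level module names imported by Python source code."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are project-internal.
            modules.add(node.module.split(".")[0])
    return sorted(modules)


# Hypothetical script contents, for illustration only.
EXAMPLE_SOURCE = """
import numpy as np
from scipy import optimize
import matplotlib.pyplot as plt
"""

if __name__ == "__main__":
    print(find_imports(EXAMPLE_SOURCE))
```

This is exactly why scenario A is hard to reuse later: you learn which libraries were needed, but not which versions.
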
  4. When in a project is it appropriate to create an environment.yml? At the start? At version 1? And when should it be updated?

    • I would create it at the start and add packages to it as needed. (Also periodically delete the environment and reinstall.)
      • Why would it be good practice to delete it periodically?
    • I like to write the dependencies in there and install from the file. This way it is already documented and I never install anything that I did not document.
      • (like the sound of this approach)
  5. How and when does one decide to create a new environment? It becomes a bit tricky to me that people work on different projects and sometimes even hard to have a well-defined "project". What would be a good advice for that?

    • I would create a new environment when I start working on a new code repository. I have 1 or more environments per repository.
    • Also any data repository can have its own environment. I probably would create one for a paper as well, but it could be shared between data and paper.
  6. In the "Dependencies-1" example, should we not specify the python version (e.g. 3.5 or 3.7)?

    • good point! it could definitely be relevant and can/should be added to environment.yml (however, requirements.txt has no mechanism for that as far as I know)
  7. what is the "CodeRefinery conda environment"?

    • It is an environment that contains the programs used within this workshop.
    • https://coderefinery.github.io/installation/conda-environment/
      • sorry, we should add a link to it from lesson
        • I get:
        G:\>conda activate
        'conda' is not recognized as an internal or external command, operable program or batch file.
        
        • What should I do?
          • Is there an "anaconda terminal" or similar in the start menu?
            • Yes, should I start from there? It is difficult to understand from which terminal a person needs to run from. There is the cmd, the anaconda one, the git bash etc...
            • How do I "activate" this coderefinery conda environment?
              • conda activate coderefinery
                • How do I know which environment I was in? Maybe it was already activated?
                  • Conda usually adds the environment name in parentheses to the terminal prompt. For example (coderefinery) $
                    • Ah yes, I see now. Thanks!
  8. Would the conda environment be operating system agnostic?

    • In theory yes, if the packages work across different OSs.
    • An environment.yml can be made with "build numbers", the =hcf16a7b_0_cpython you see in a question below. These identify specific builds, and aren't portable across different OSs. So conda env export --no-builds excludes it and makes it a bit more portable.
    • But you would need to re-build it to install packages built for each OS.
      • Thanks! Following on from that: is, for example, Python 3.11 built the same way and does it work the same on different OSs (under the hood)? So when I port and re-build Python 3.11 from Mac to Windows, will there be problems?
  9. When I try the first command of exercise 2, I receive this message:

    out-file : Access to the path 'C:\environment.yml' is denied.
    At line:1 char:1
    + conda env export > environment.yml
    + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        + CategoryInfo          : OpenError: (:) [Out-File], UnauthorizedAccessException
        + FullyQualifiedErrorId :  FileOpenFailure,Microsoft.PowerShell.Commands.OutFileCommand
    

    I already have the coderefinery conda environment.

    • I guess you need to move to a directory where you are allowed to make a new file. C:\ isn't allowed for your user.
      • Okay, it makes sense why I am struggling, but what do you mean by moving to a directory? I am in the coderefinery directory, but unfortunately I cannot avoid the C:\>
  10. I checked the files from last week. At the end of the directory name there is info like (older code), (master), (just-before). I guess it is git that writes this, but it was very unexpected. What kind of help is that and how do I use it?

    • It's a "git-aware prompt", where the shell prompt has some info on git's status and current branch. This happened on your computer by default?
    • Yes. I did not see it last week. It just came up now.
  11. What does the text after the second =-sign mean in python=3.11.0=hcf16a7b_0_cpython?

    • Exact hash (h) and build. More specific exact identifier than version. Note this changes on win/mac/linux, so makes it more reproducible on one computer but not portable to others
      • Thank you!
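
As a small illustration of what removing the build string means, here is a Python sketch that trims a spec of the form name=version=build down to name=version. The helper name strip_build is hypothetical; in practice conda env export --no-builds (mentioned above) does this for you.

```python
def strip_build(spec):
    """Drop the build string from a conda spec of the form name=version=build."""
    parts = spec.split("=")
    # Keep at most the package name and the version; anything after is the build.
    return "=".join(parts[:2])


if __name__ == "__main__":
    print(strip_build("python=3.11.0=hcf16a7b_0_cpython"))
```

The shorter form pins the version but leaves conda free to pick a build that exists for the operating system at hand.
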
  12. I am receiving a conda: command not found error. I tried adding the conda path to PATH but nothing changed. I even initialized conda.exe in the shell and restarted.

    • Which operating system are you on? And what terminal?
    • If you use export PATH=$PATH:..., that's only for the current terminal. If you restart the terminal, you need to do it again.
      • If you are on bash, you can add the export command to the ".bashrc" file.
      • I am using WSL (Windows Subsystem for Linux).
    • Then it's probably bash. How did you install conda? If you install it using the normal windows installer, WSL will not know about it.
    • Is there a link I should follow?
    • Check https://kontext.tech/article/1064/install-miniconda-and-anaconda-on-wsl-2-or-linux.
    • Or run apt-get update; apt-get install wget
    • and wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-Linux-x86_64.sh
    • Running sh ./Miniconda3-py39_4.12.0-Linux-x86_64.sh should then install Miniconda.
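
When debugging "conda: command not found", it can help to check whether any conda executable is reachable through PATH at all. A minimal sketch using only the Python standard library (shutil.which performs the same kind of lookup the shell does for executables); note that some setups define conda as a shell function rather than relying on PATH alone, so treat the result as a hint rather than a verdict.

```python
import os
import shutil


def describe_conda_on_path():
    """Report whether a conda executable can be found via the PATH variable."""
    location = shutil.which("conda")
    if location is None:
        return "conda not found on PATH"
    return f"conda found at {location}"


if __name__ == "__main__":
    print(describe_conda_on_path())
    # Show each PATH entry on its own line to make a missing directory easy to spot.
    for entry in os.environ.get("PATH", "").split(os.pathsep):
        print("  ", entry)
```

Run it with the Python from the same terminal where conda fails, since each terminal (cmd, Git Bash, WSL) has its own PATH.
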
  13. How does Conda know where to get info to perform conda activate <env_name>? Is info about these environments stored somewhere?

    • yes, on my computer it keeps a list of all environments in .conda on my home folder
      • I'm on Ubuntu, found ~/.conda dir as well after I typed conda env export > environment.yml. But where does coderefinery env come from, which is activated magically by conda activate coderefinery?
        • The name of an environment can be specified in the environment.yml file, for instance name: coderefinery.
  14. So I can see I have the coderefinery environment under my Miniconda/envs folder and I see it when I do conda info -e. I have put my current environment into a myownenvironment.yml file. But what has to happen to myownenvironment.yml so that it behaves like the coderefinery environment I can see in my envs folder and is listed as an available environment I can load?

    • Oh, wait I might have found it. It's the next part of the exercise (facepalm...)
  15. It is difficult to understand which terminal a person needs to run from. There is the cmd, the anaconda one, the git bash etc... Should I start with the anaconda terminal?

    • I recommend to always start from Git Bash on Windows in this workshop (then also the Git part will work)
    • $ conda activate gives "bash: conda: command not found". As you can see, with Git Bash it is not working.
  16. The "Setting path to Conda from your terminal shell" step is not working. I run the command "echo ". '${PWD}'/conda.sh" >> ~/.bashrc" and when I try conda --version it does not find anything. I remember there was a problem also during the install sessions. Please let me know what to do. I am on Windows.

    • Sorry, this is hard to debug without directly seeing your terminal...
    • Did you restart the terminal after adding to .bashrc?
      • yes I restarted. I was in the installing session and they told
    • Can you run echo $PATH and paste the results?
      • do I need to print here?
        • Yes, please
      $ echo $PATH
      /z//bin:/mingw64/bin:/usr/local/bin:/usr/bin:/bin:/mingw64/bin:/usr/bin:/z/bin:/c/Program Files/Python311/Scripts:/c/Program Files/Python311:/c/WINDOWS/system32:/c/WINDOWS:/c/WINDOWS/System32/Wbem:/c/WINDOWS/System32/WindowsPowerShell/v1.0:/c/WINDOWS/System32/OpenSSH:/c/Program Files/dotnet:/cmd:/c/Program Files/MATLAB/R2022b/runtime/win64:/c/Program Files/MATLAB/R2022b/bin:/c/Users/user/AppData/Local/Microsoft/WindowsApps:/usr/bin/vendor_perl:/usr/bin/core_perl
      
    • Seems that conda is just not there... Can you paste the last few rows from the .bashrc file (tail ~/.bashrc)?
    tail ~/.bashrc
    . '/z/'/conda.sh
    . '/c/Hyapp/Anaconda3-2022.05/etc/profile.d'/conda.sh
    . '/c/LocalData/bortolus/coderefinery'/conda.sh
    . '/c/Hyapp/Anaconda3-2022.05/etc/profile.d'/conda.sh
    . '/c/Hyapp/Anaconda3-2022.05/etc/profile.d'/conda.sh
    . '/c/Hyapp/Anaconda3-2022.05/etc/profile.d'/conda.sh
    . '/c/Hyapp/Anaconda3-2022.05/etc/profile.d'/conda.sh
    . '/c/Hyapp/Anaconda3-2022.05/etc'/conda.sh
    
    • There is something wrong with the echo command you ran. Those lines are not correct, and they might even break the .bashrc file.
      • yes I remember there was a problem also during the installation session... I copied the lines from the prerequisite page
    • The issue is it was run from the wrong folder. The instructions at https://coderefinery.github.io/installation/conda/ ask you to find the miniconda3 installation folder, and then tell you how to find a folder with a file called conda.sh. You should open a GitBash from there and run the command. You should also edit the .bashrc file and delete the extra lines.
      • that's what I did. though not in miniconda, but in anaconda.
    • The path that was added to the .bashrc file the first time was /z/..., which is not correct. Maybe the "/c/Hyapp/Anaconda3-2022.05/etc/profile.d..." is correct.
      • where do I find the .bashrc file?
        • start GitBash from the start menu and run pwd. That should print the path to your home folder, where the .bashrc file is.
        • it's in /z/ but I can't change in C:
          • Sorry, I don't understand the line above.
    • so I have anaconda in /c/ but when I open gitbash and run pwd I am in /z/ disk
      • When you open gitbash, you will start in your home folder. That is the one you want, for now. It's where the .bashrc file is. You can open it in a text editor (if you know the name of the .exe file for the editor, type editor.exe .bashrc). Or you can open the folder using explorer . and go from there.
    • Unfortunately I need to go to a meeting, but I will ask if someone can follow up.
    • But, if you can open explorer (see above) in your gitbash home, edit the .bashrc file. In the end there are a bunch of lines starting with ". '". Delete them and add one line: . '/c/u/Anaconda3-2022.05/etc/profile.d'/conda.sh
    • :'( ok thank you for the help
    • still not working
      • reading the above ... also for the rest of today it is not a big problem that this does not work (the rest of the exercises will be discussion) but we should try to get this running for tomorrow
        • this was working during the installation session. I'm sure there is an easy way to fix it
  17. When uploading my .yml file to a public repository, would it be better to not include the hash and build for portability to other operating systems? (+1)

    • Including the hash makes it less portable, since that exact version may exist only for a given operating system. I prefer less exact versions, because that may allow it to work on other systems.
    • But even with the hashes it is useful since they are relatively easy to remove for the person who comes 3 years later. But for a project in progress it is indeed easier not to have too precise/restrictive versions.
  18. If I run the command pip freeze in the Miniconda prompt, I see text that looks like certifi @ file:///C:/b/abs_85o_6fm0se/croot/certifi_1671487778835/work/certifi. What is the meaning of the @ file and the other parts, and why does it not specify the versions?
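
    • A likely explanation: the form name @ file:///... is pip's "direct reference" notation. It records the location a package was installed from (here a local build directory used when the conda package was built) instead of a name==version pin. The versions are still stored in the package metadata, and pip list --format=freeze should print them in the familiar name==version form. A minimal sketch, assuming Python 3.8 or newer, that reads the same metadata with the standard-library importlib.metadata (the function name installed_versions is hypothetical):

```python
from importlib import metadata


def installed_versions():
    """Return sorted "name==version" strings for packages visible to this interpreter."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries with broken or missing metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    for pin in installed_versions():
        print(pin)
```

In a conda environment, conda env export remains the more complete record, since it also covers non-Python packages.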

  19. In the Anaconda and Miniconda prompts I am at C:\>, but I do not know how to move to a directory as someone said before. Can you clarify? I use Anaconda on Windows.

    • Most likely your "home" folder is under the folder Users. So start with cd Users, and then ls (or dir) to see the subfolders there. There should be one which is your username.
  20. "(coderefinery) C:\>conda env export > environment.yml Access is denied." What does this mean? What should I do?

    • I think you are in a directory (C:\) where an environment.yml file can't be made, since you don't have permission to write there.
    • how can I change the directory?
      • With the command cd.
  21. Is there some way you can see the previous version of the environments exercise? I liked it better before. :/ I remember there was one awesome exercise where you could try out a bunch of different commands with conda, where you put stuff into a yml file, then created a new environment from that, then changed it slightly and so forth.

    • With the power of git, we can! Easy to see the raw source (for example) here: https://github.com/coderefinery/reproducible-research/blob/2fc3f7cb0a2e03c79ca7420358eecaf9517b0b5f/content/dependencies.md
  22. How do I get the current environment? According to this: https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#determining-your-current-environment it should show an asterisk, but if I run conda env list in Git Bash I don't get an asterisk.

    • Weird... I wonder if there could be none activated now...?
      • After I activate the coderefinery environment it is clearly activated (checked by exporting environment.yml), but the asterisk isn't shown.
      • Try conda info: it tells you various details about the currently activated environment (or whether nothing is activated).
      • Thanks, that works!
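Another way to check, sketched here in Python, is to look at the environment variables that conda activate exports. This assumes the shell sets CONDA_DEFAULT_ENV and CONDA_PREFIX on activation; the function name active_conda_env is made up for illustration.

```python
import os


def active_conda_env(environ=None):
    """Return (name, prefix) of the active conda environment, or None.

    Reads CONDA_DEFAULT_ENV and CONDA_PREFIX, which `conda activate`
    is assumed to export into the shell environment.
    """
    env = os.environ if environ is None else environ
    name = env.get("CONDA_DEFAULT_ENV")
    if not name:
        return None
    return name, env.get("CONDA_PREFIX")


print(active_conda_env() or "no conda environment is active")
```

Run it with the same terminal you use for the exercises; if it reports that nothing is active, the conda activate step did not take effect in that shell.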
  23. I have made the script.sh and put it in the repository. I run it using bash script.sh, but there is no output. Shouldn't I see a plot? Should I put it in the data folder?

    • It saves straight to the data folder and is quiet otherwise.
    • it does not print any progress but generates the data files and images
      • Where can I find these images? They do not appear in the data folder.
        • Try plot/ folder next to the data/ folder
          • Ah yes, thanks!
  24. Is Snakemake similar to C makefiles?

    • Yes! Well, in concept. That's where the name comes from at least. (Makefiles run shell commands so can be used for any language)
    • it is the same idea but 40 years later so more nice features
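The idea shared by Make and Snakemake can be sketched in a few lines of Python: a step is re-run only if its output is missing or older than one of its inputs. This is a simplified illustration of the concept, not how either tool is implemented; needs_rebuild and the file names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def needs_rebuild(output, inputs):
    """Return True if output is missing or older than any input file."""
    output = Path(output)
    if not output.exists():
        return True
    out_mtime = output.stat().st_mtime
    return any(Path(i).stat().st_mtime > out_mtime for i in inputs)


# Demonstration in a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    book = Path(tmp, "book.txt")
    counts = Path(tmp, "book.dat")

    book.write_text("some words\n")
    first = needs_rebuild(counts, [book])   # output does not exist yet

    counts.write_text("2\n")
    os.utime(book, (1_000, 1_000))          # pretend the input is old
    os.utime(counts, (2_000, 2_000))        # and the output is newer
    second = needs_rebuild(counts, [book])

    os.utime(book, (3_000, 3_000))          # input edited after the output
    third = needs_rebuild(counts, [book])

print(first, second, third)
```

Workflow tools add the "nice features" on top of this check: wildcards, per-rule environments, cluster execution, and so on.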
  25. I'm getting an error with conda: "CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'. To initialize your shell, run $ conda init <SHELL_NAME>"

    • Is this on your own computer where you installed conda itself?
    • Yes. Normally I use the zsh shell, where I have my own environments that work fine. I've been using bash for this course, so maybe that's the issue?
    • Yes, probably. If you run conda init in zsh it might work after you restart the shell. Or go back to zsh; if you know it, it will work (you can still run bash script.sh from zsh).
      • Okay. conda init bash outputs "No action taken", so I will work through zsh instead. Thanks for the help!
  26. When I try running bash script.sh I get "No such file or directory" error. I have cloned the repository and cd to word-count folder

    • I think you need to make the file script.sh from the webpage. Copy and paste into script.sh using an editor.
    • yes, I forgot to add that script to the repo
      • Sorry, I am lost. From what webpage?
    • https://coderefinery.github.io/reproducible-research/workflow-management/#exercise - Under Workflow-1, it shows a script starting with #!/usr/bin/env bash. This should be copied to script.sh
      • But I do not have the 'script.sh' file, should I create it first?
  27. I did the first exercise easily, but the snakemake exercise doesn't work - or probably i am doing something wrong. I tried to run the lines of code written in step 3. while using git bash inside the "word-count" repository on my pc. It says: "bash: snakemake: command not found".

    • It seems to me that the conda environment is not active, thus it cannot find the snakemake command.
      • I think you're right. I didn't install the conda part but kept the updated version of Anaconda I already had, as it was stated in your installation guides that it should be fine as well, but that you would only recommend conda instead.
        • Indeed snakemake comes with Anaconda.
          • How do I activate it with my own Anaconda environment then? :) (The first exercise worked without trouble - that used python as well..)
            • First run source [PATH TO CONDA]/bin/activate.
              • nvm. condabin is the name, trying again.
                • Maybe close this terminal / git bash, open a new one, and then source [PATH_TO_CONDA]/bin/activate. Then try snakemake --version just to check that it finds it.
                  • I solved it myself by looking back at some of the installation guides - it was working as it should, but i needed an extra command (almost the one you mentioned): "source activate [PATH to Anaconda3]/envs/coderefinery". Thanks for getting me on the right track for a solution for this.
                    • Glad that you made it work.
  28. I saved the script bash file, but when trying to run i get "script.sh: line 5: statistics/abyss.data: No such file or directory", followed by other "No such file or directory errors" regarding paths ending with plot.py

    • What directory did you save it to? - I'm guessing to wherever I am :D coderefinery???
    • I think you need to do it from the repository you cloned. If you ls or dir there should be a data and statistics directory there.
    • Sorry, I forgot how to go to the cloned repo. Do I use the cd command for that?
      • correct.
    • Thanks!
  29. To create the script.sh file we need to do it from the terminal on Binder, correct? If so, I couldn't use nano, vim or vi there. So how to?

    • Yes. Or some other editor.
      • That must be on Binder? Or can I do it on my folder that I cloned?
      • Which text editor exists on the terminal inside binder that I can use? I know nano, vim and vi. None of them work, so how can I create the script.sh file?
        • You can install an editor of choice within the container.
  30. I created and ran the script.sh. Now i do git status and i only see the script.sh that is untracked. Why are the plots that i created with the script not there as untracked? They have not been changed?

    • If you did not delete the plots before starting (the repo already included them; maybe that was not mentioned in the script exercise), then the script produces exactly the same figures and git considers them unchanged.
  31. Am I correct in that Snakemake is completely language independent, and works as long as you can run every step from the command line?

    • Correct. (though it has some extra Python integration)
      • What is this additional Python integration? :)
  32. In my current project, I use a mix of Fortran and Python codes, which for example does not run on notebooks and requires a specific environment to be executed. Is it possible to save an environment in git and let the user install it in a one-line procedure? This would be useful even for myself when running from other machines.

    • The environment.yml approach we discussed does this, in theory. Make an environment.yml file that lists the requirements. A user can then run conda env create -f environment.yml. There are similar ideas for other tools.
      • Thanks! I do not have much knowledge of that, but could I do something similar to other terminal instances like C, R, Fortran versions?
  33. After going through the whole installation process, conda --version in Git Bash gives me "command not found". Not sure what to do now.

    • I guess conda isn't activated...
    • Are you working with the coderefinery environment or anaconda without the CR env?
      • From anaconda terminal I activated the code refinery environment, then I tried to "conda env export > environment.yml" and it tells me that it cannot be done. so i started trying with git bash and i have tried to make it work in bash and it still tells me command not found!
  34. The Snakefile and how it works look cool, but I am wondering how to write one. It looks complicated (I mean, what are these? Shell commands?)

    • they have very nice tutorials. but also it's not the only tool that works this way (see lower on that page) and maybe others look more intuitive. our goal is that you now know that tools like this exist where you can encode steps.
      • Yeah great, thanks
        • and it does not have to be shell commands. it can run basically "anything": shell, python, R, anything else
  35. I'm a bit lost. I cloned the repository and then ran snakemake --delete-all-output -j 1, but I get this error: Error: no Snakefile found, tried Snakefile, snakefile, workflow/Snakefile, workflow/snakefile. (coderefinery). It may be that I just need to follow along today and do these exercises later, but if there is some help so that I could catch up now, that would be great.

    • after the cloning did you cd into the cloned directory? is Snakefile in the same directory where you run the command?
      • ah right, thanks! I didn't
  36. It is still solving the environment for coderefinery env, haven't been able to do anything more than that.

    • It takes time to install the first time, but that is no problem for the rest of today. We will need it tomorrow. The rest of today will be discussions.
  37. Can I open the generated images in the terminal?

    • There are some tools for that but most of them are OS specific.
  38. Why did the Snakefile create empty txt files in the plot folder, named after the book titles?

  39. How do we create a good "README" for a snakemake file?

    • Hm. I guess you would comment on what commands to run and what the setup is: can run with snakemake. Put input files in this place with this format. Run this command, and output appears [here].
  40. IG_question! I do git status --> snakemake --delete-all-output --> git status --> snakemake -j 1 --> git status. Why do the first and the last calls of git status report no changes in the repo? The files were at least updated (actually, removed and created again, and git noticed this). Does git not track the time evolution of a file in the repo if there are no changes in the file size / content?

    • If there are no changes to the content, it is considered the same.
      • Let me clarify: if only one symbol is different but the size of the file is the same, is it considered a change of content?
      • Oh, yes. The content is the important part. I think it uses timestamps to see what to check, but then always checks content.
    • The .gitignore file shows *.log ignored.
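The point that git compares content rather than timestamps can be illustrated with the way git identifies file contents: each file ("blob") is named by a hash of a short header plus its bytes. A minimal Python sketch, assuming the default SHA-1 object format; git_blob_hash is an illustrative name and the file contents are made up.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Hash file content the way `git hash-object` does for a blob (SHA-1)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


original = b"word count: 42\n"
regenerated = b"word count: 42\n"   # file deleted and recreated, same bytes
one_symbol = b"word count: 43\n"    # same size, one character differs

# Identical bytes give the same hash, so git sees no change.
print(git_blob_hash(original) == git_blob_hash(regenerated))
# A single different character gives a different hash, even at the same size.
print(git_blob_hash(original) == git_blob_hash(one_symbol))
```

So a file that is removed and regenerated byte-for-byte is the same object to git, while a one-character change is a modification regardless of file size.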
  41. Question 56 is still unsolved.

    • we're having trouble figuring it out, if we can't by the end of the day might be good to ask someone local who can look at the screen and see the status.
    • we are puzzled by these problems. we are wondering whether it's maybe the wrong terminal open?
    • What operating system are you on?
      • (I use Windows, and writing in the git bash terminal) I solved it myself by looking back at some of the installation guides - it was working as it should, but i needed an extra command (almost the one you mentioned): "source activate [PATH to Anaconda3]/envs/coderefinery". Thanks for pointing me in the right direction with your suggestions.
        • Solved now.
  42. How to edit the files inside Binder? nano, vim or vi doesn't work. Can you give the names of the tools that we can use to edit the *.py files inside Binder?

    • From the Jupyter file browser you can open and save.
      • Thanks. But I was expecting to do it through the terminal.
    • The main question is "what editors does binder install by default" and maybe the answer is "none"... in which case not much to do.
  43. from anaconda terminal I activated the coderefinery environment, then I tried to "conda env export > environment.yml" and it tells me that it cannot be done. so i started trying with git bash and i have tried to make it work in bash and it still tells me command not found!

    • you tried activating the code refinery environment in git bash? conda activate coderefinery
      • yes, command not found again; still the same
  44. If I try to visualize the DAG I get an error:

$ snakemake -j 1 --dag | dot -Tpng > dag.png
bash: dot: command not found
Building DAG of jobs...
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='cp1252'>
OSError: [Errno 22] Invalid argument
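The line bash: dot: command not found means the shell could not find the dot program (dot is part of Graphviz) on the PATH, so the pipe had nothing to write to. The same kind of message appears above for conda and snakemake. A small Python sketch for checking which tools the current terminal can see; the list of tool names is just an example, and find_tools is a made-up helper.

```python
import shutil


def find_tools(names):
    """Map each command name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


for tool, path in find_tools(["conda", "snakemake", "dot"]).items():
    print(f"{tool}: {path or 'not found on PATH'}")
```

A tool reported as not found is either not installed or installed in an environment that is not active in this terminal.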
  1. The best ever joke about paper peer review was recorded in this video. Hopefully, there are no hard feelings regarding the original movie.

Recording computational environments

https://coderefinery.github.io/reproducible-research/environments/

  1. It sounds like the instructor was talking about what is on his screen without sharing? nvm video was stuck on my end somehow

    • Something a bit strange is going on here... but seems to work overall.
  2. I was wondering: if the workflow uses some tool like Binder and/or depends on some integration with GitHub, it can become inaccessible in the future, right? Is there some way to use containers (e.g. Docker) that can be relied on in the long run?

    • Possibly. There are other steps to help: Binder can use things other than GitHub. And the Binder web service uses repo2docker, which can also be run locally. Binder isn't actually that fancy: it uses standard reproducibility things like environment.yml that you want to have anyway. So your work stays reproducible even if Binder disappeared.
      • Great! So one can say that github, gitlab is not enough anymore, we should use some more advanced tool, right?
  3. What's the difference between a container and an environment? Is the container the next level up?

    • Yeah, pretty much. "container" includes a lot more of the operating system libraries, which means that there is a lot less to go wrong (but also it's a lot larger and harder to make)

Sharing code and data

https://coderefinery.github.io/reproducible-research/sharing/

  1. Are we supposed to follow along with the exercise, or just watch?

    • Just watch for now, demo.
  2. Is the full git history shown on Zenodo when you link the GitHub repository to Zenodo?

    • No, it only saves the exact specific versions you archive. But you can archive multiple versions, for example each release.
  3. Does Zenodo only work with GitHub? Or are other repository platforms also supported?

    • I think the "automatic integration" may only be Github, but it's not that much harder to upload data from any other source.
  4. Can I use git to conduct control of basically everything between my local machine and a remote one, e.g. a supercomputer where I not only run simulations, but also store data?

    • Yes! And sync changes on both ends. The data should probably be managed / synced separate from your main git repo.
      • Was it mentioned last week? I'm asking, since I've been using my own bash script for synchronisation based on rsync with many rules like --include, --exclude
        • We mentioned how you can sync between a local repository and a remote repository on GitHub. We also mentioned that you can define a remote that points somewhere else, but we did not actually show how to do it. That would be git remote add origin URL (or another name instead of origin if origin is already defined). Instead of the URL you could also use a path to some other local repo to sync with.
          • I remember about GitHub, GitLab, but I missed the remote storage/synch point. Alright, it's nice that it could be used. For some reason, I forgot about it.
  5. These "Digital Object Identifiers" feel a little bit like the hashes that we've seen previously. The advantage of the DOI is that it also links to where the object is, correct?

    • Yes, somehow fully managed and searchable through a central organization.
    • Though interestingly you can search for a hash in Github which makes it somehow findable.
  6. What is the difference between Zenodo and GitHub in terms of data sharing? A reference to GitHub could also be shared.

    • Zenodo promises to be more permanent than Github, and the DOIs can be cited in journal articles. (Github could too but people know it's not permanent, so they may ask for Zenodo instead).

Social coding

https://coderefinery.github.io/social-coding/

Discussion

Question 1: Why would I want to share my scripts/code/data?

Choose many. Vote by adding an o character:

Question 2: The most concerning thing for me, if I share my software now

Question 3: Why is software often treated differently from papers?

Free-form answers:

Question 4: When you find a repository with code/library you would like to reuse, what are the things you look at to decide whether you use it?

Free-form answers:

Licensing

https://coderefinery.github.io/social-coding/licensing/

Discussion

Question 5: Which of these are derivative works?

Choose many. Vote by adding an o character:

Questions (continued)

  1. What about a script that I made that is pieces of code that i found randomly online. Can I license it?

    • Let's raise this on stream! ("practical recommendations")
    • Can you show it as an example? the stackoverflow citation
  2. Which entity sets the legal rules for licensing? We must be able to assume this is more or less globally adhered to.

    • Basically, governments, who decide which uses they are willing to use their power to stop. These become rules for what you can't do and rules for how to give permission to do it. (And since laws move slowly, there are plenty of court fights over this.)
  3. Speaking of license and copyright and AI image generators. Please realise that AI image generators violate the copyrights of artists. Many artists have not agreed with the AIs training upon their work to the point that AI users can emulate the work of the original artist quite accurately. That is a big problem.

    • That is always a possibility for anything you upload online.
    • that is not an argument to leave artists completely unprotected in this regard - compare with the music industry.
    • Music is copyrighted, yet people still torrent it. (Or use adblockers.)
    • That's the point. Music is protected with strict rules against AIs training on it, drawings are not. Visual art is free game.
    • Couldn't a music AI just scrape audio from e.g. YouTube? There's copyrighted stuff there.
    • Possibly- that's why there are serious concerns with AI scraping everything without regulation.
    • I suppose this is an ideological thing as well: How much does one agree with the existence of copyright in the first place? But that's a bit off topic.
    • Yeah, we should have more tools for artists to decide whether to allow the use of their work in training AIs. Personally, I would allow the use of my music, but maybe not all of it.
    • Agreed - I as a visual artist do not allow it yet I have no control over it and am forced to remove my art from the internet in that case.
    • There are ethical concerns about AI training materials, but it's not clear/decided yet if it violates copyright. Court cases are coming and decisions may be different in different regions. "Style" is not copyrightable (and shouldn't be - copyrighting "style" would be a disaster for artists)
    • To me there is a big difference between another artist being inspired (who has learned how to draw) versus an ai scraping and people who cannot draw running away with images that seem to be created by said artist whose work has been scraped. That is disrespectful to the artist to say the least.
      • Yes, but that's an ethical stance - ethics and legality are related but not the same, and it's not always possible or desirable to legislate ethics. If "style" becomes copyrightable, entire genres of art are in danger. So I would hesitate to confidently declare that AI training violates copyright while these issues are still legally undecided - calling it "unethical," sure. (Although personally, I think the AI images generally only look like actual artists' work if viewed in tiny thumbnails where it's mostly a color impression - the quality as art is usually... not good)
  4. Are there licenses that claim that linking to libraries constitutes a derivative work of the linked code? Is it generally safe to assume that linking to libraries is safe for your code's license?

    • Good question and may be beyond us answering. The GPL license says that linking can be a derivative work, if what you make is very closely connected to the thing you are linking to (has the API guided what you make?). You are probably too small to worry too much, but worth thinking about if you use GPL libraries.
  5. What does it mean, if I write Copyright 2023? How important is it to include a year. Under which conditions should I update the year/year range?

    • What you declare (unless you say it's licensed) doesn't matter, still copyrighted. Having an accurate notice would help if you ever wanted to bring action against someone using your stuff... thus why large companies care so much about this.
  6. So if I use the algorithm from a paper and write code that implements it, can I have problems with copyright? (+1)

    • No (which means yes, you can re-write it). At least that's the line. You can find many re-implementations because of this. (if algorithm was patented, which is rare, then it would matter).
  7. What is copyleft?

    • Term invented for "opposite of copyright": if you use my code you have to license whatever else you make under the same license. To keep derivatives free.
    • legal jujitsu - using the copyright system against itself (in a way) - invented by people who disagreed with copyright... copyleft licenses are 'viral' in that they spread their properties through 'infection' (preventing derivative works from being more strongly licensed).
  8. what is the difference between the proprietary licenses and open source licenses?

    • proprietary = fully owned, no permissions. We use it to mean things you can't reuse, modify, share.
  9. How do you deal with projects that say they are released under "the MIT license" but don't include an actual license text?

    • I would usually take that as a clear sign that they intended that. You could defend your use in a court if someone protested. But if you were Google maybe you'd want to be a little bit more cautious since you have big pockets...
  10. When you say "you can do", do you mean what a person who uses code under that specific licence can do, even though they are not the author?

    • Correct. Copyright owner can always do anything since copyright doesn't prohibit them from doing anything.
  11. If I create code under the MIT licence and, hypothetically speaking, someone uses it to build software whose results turn out to be wrong for some reason (basically my initial code might have been a work in progress or something like that), can they use the fact that the code does not work properly against me?

    • As someone once pointed out, most good licenses explicitly say that there is no warranty: the author can never be held accountable for anything that may go wrong. See the last paragraph: https://en.wikipedia.org/wiki/MIT_License. Yet another reason to use one.
      • ok clear.
      • edited to make more clear.

Break until xx:08

Questions (continued)

  1. How about using generative AI in stuff?

    • Some universities already have guidelines (e.g. Helsinki University)
    • Be critical of whatever is synthesised with these tools. For example, the generated code might be protected by a strict license, but GPT or GitHub Copilot won't cite the sources (or will hallucinate fake sources when asked). There is an open court case against GitHub Copilot: https://githubcopilotlitigation.com/. Similarly, synthetic images might be reproductions of copyrighted pictures. Furthermore, personal data about living individuals can be produced with these technologies, which can introduce other issues if the synthetic personal data is going to be published.
  2. What would a fully attributed code deposit look like? If the programmer referred to 10 tutorials, the documentation, and 25 answers on Stack Overflow...it doesn't seem feasible to keep track of exactly which snippets came from where. Attribution could be longer than the script. Am I missing something?

    • If I understand this correctly, then I would probably attribute the project/repository as a whole.
      • So attribute StackOverflow rather than the individual users?
        • Oh, then I misunderstood. Good question, I guess I would attribute each part at the point it was inserted
          • This feels very impractical in practice - if I have a bug and look at 10 different StackOverflow comments and eventually figure out how to fix it, it's hard to tell where I "got" the code in the end. Hmm.
          • If something is so small - I'm not copying directly but reading, understanding, and then putting in my own words, I wouldn't cite. But this is indeed a very important question!
            • Yes, it's harder (for me, anyway) to tell where to draw that line with code, I think. And code inherently has fewer ways to be put together that still work than natural language does, so snippets will inevitably be repeated independently, given enough programmer monkeys with keyboards.
  3. So do you think that the repository on GitHub should be private while you are working on it, and made public only when it is ready for the public?

    • I would always make public from the start, since it reduces the "well when do I do it". History will be public anyway, and this helps "keep you clean".

Software citation

https://coderefinery.github.io/social-coding/software-citation/

  1. IG_question: how to assign a proper number to a version of my software, e.g. 1.2.3 or 1.2? What defines these digits?
    • It's up to you, but "semantic versioning" is one common guide: https://semver.org/ which says what the three numbers (major.minor.patch) should mean. But "they are just numbers".
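As a rough illustration of the semantic versioning convention (MAJOR.MINOR.PATCH: bump MAJOR for incompatible changes, MINOR for backwards-compatible new features, PATCH for backwards-compatible bug fixes), here is a small Python sketch. It handles only plain X.Y.Z strings (no pre-release or build suffixes), and the function names are made up for this example.

```python
def parse_version(text):
    """Turn 'X.Y.Z' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def bump(text, part):
    """Return the next version after a 'major', 'minor' or 'patch' change."""
    major, minor, patch = parse_version(text)
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


print(bump("1.2.3", "patch"))   # bug fix
print(bump("1.2.3", "minor"))   # new feature, lower number resets
print(bump("1.2.3", "major"))   # breaking change, lower numbers reset

# Comparing as integer tuples orders 1.10.0 after 1.9.5,
# which plain string comparison would get wrong.
print(parse_version("1.10.0") > parse_version("1.9.5"))
```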

Feedback, day 4

Today was:

One good thing about today:

One thing to improve for next time:

Any other feedback?

On behalf of the CodeRefinery team: We are really happy to read all this feedback! 🥳


Funding

CodeRefinery is a project within the Nordic e-Infrastructure Collaboration (NeIC). NeIC is an organisational unit under NordForsk.

Contact

support@coderefinery.org