Questions and notes from workshop day 4

Day 4 - 17/09/2024

Icebreakers

Let's test the notes with some icebreakers! :icecream:

Have you heard the sentence "hmm... works on my computer"? What does this mean in practice? How do you solve this problem?

What are your experiences re-running or adjusting a script or a figure you created a few months ago?

Have you continued working from a previous student's script/code/plot/notebook? What were the biggest challenges?

Questions for the session

  1. https://www.kth.se/form/build-systems-course-and-hackathon-part-i --> what are the prereqs for this course?

    • Ideally you have access to an HPC cluster. The participating clusters are Dardel (KTH), CSC Puhti and Mahti (all Finland), Triton (Aalto), and many more. Which affiliation do you have?
    • Uni of Helsinki
    • I meant are there technical prereqs?
  2. "Two" is rather quiet.

    • Is it better now?

Introduction

Summary:

Reproducible research

Intro https://coderefinery.github.io/reproducible-research/ Motivation https://coderefinery.github.io/reproducible-research/motivation/

  1. How many people here have failed to reproduce a (computational) experiment?

    • Several times with Molecular Dynamics simulations.
    • Classic neuroscience paper about the same version of software giving different results on different operating systems: https://pubmed.ncbi.nlm.nih.gov/22675527/
    • It sounds like this happens all the time even pre-publication. I wonder how much of "I have to work this weekend" is "my results changed and I have to figure it out"
  2. Will we review snakemake?

    • Snakemake will be demonstrated, yes.

Organizing your projects

https://coderefinery.github.io/reproducible-research/organizing-projects/

  1. Regarding "everything in one folder": my raw data is gigantic, so I first work on it on a server to crunch it down, and then I put the output of that onto my local computer. So it already becomes two folders in two different places. Do you have a recommendation for that?

    • I think this is fine, as long as the links/relationships are well-defined. My suggestion (I'm not quite sure how to say this) is that each unique "thing" (1: big raw data, 2: working code and data) has its own unique folder name and is carefully tracked on its own. The working code then needs one variable set: the path to the big raw data.
  2. Expanding on the question above. In the manuscript phase, when the data are in a different place (say, CSC Puhti), and people are working on the manuscript in a shared workspace, how can you keep the links/relationships sane?

    • Working on the text itself <- yes
    • Multiple people editing the same files is really hard: either people have to edit at different times, or use version control (like we did in week 1!), or use a constantly synced service like Overleaf/Google Docs/etc.
  3. Is there a good way to catalog how the "external" data is obtained other than a README note? Often that information can be lost - or not clear to a new person.

    • If the data has a persistent ID (for example a Zenodo DOI), that can be included. There could be setup scripts to automatically download it. DOIs and data repositories are recommended precisely because they provide a way to strictly define where and what the data is.
    • We shouldn't forget one needs to know where to put the data locally, in order to work with the code
  4. In which folder would you then put "intermediate results"? such as data exploration plots (masses of those) that you just produce to screen the data but never really work on later?

    • I use a different filesystem for those (scratch), but in general it is up to you. Sometimes it is "nice" to keep intermediate results if it took many days of computing.
    • I might make a scratch/ or processed/ dir - in same place as I'm working if it's small enough, or elsewhere and link it if it's very big.
  5. Just a comment on the project structure: Something similar is available on https://cookiecutter-data-science.drivendata.org :+1: :+1:

    • Cool! :+1:
  6. Will this course go over how to build a GUI from a script?

    • Unfortunately no, all the covered materials can be found here: https://coderefinery.github.io/2024-09-10-workshop/
    • Do you have a recommendation for any workshops that do this?
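
For questions 1 and 3 above, one way to keep the link between the working code and the large external data explicit is to resolve the data location in a single place. The following is a minimal Python sketch, not something shown in the workshop; the environment variable name `RAW_DATA_DIR`, the default folder `data/raw`, and the function name `resolve_raw_data_dir` are all hypothetical.

```python
import os
from pathlib import Path

# Hypothetical names chosen for illustration only.
ENV_VAR = "RAW_DATA_DIR"
DEFAULT_RAW_DIR = Path("data") / "raw"


def resolve_raw_data_dir(environ=None):
    """Return the folder holding the big raw data.

    The path comes from one environment variable, so the working code
    itself never hard-codes where the raw data lives.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(ENV_VAR)
    return Path(value).expanduser() if value else DEFAULT_RAW_DIR


if __name__ == "__main__":
    raw_dir = resolve_raw_data_dir()
    if not raw_dir.is_dir():
        # Tell a new person where the data is expected to be.
        print(f"Raw data not found in {raw_dir}; set {ENV_VAR} or see the README.")
    else:
        print(f"Using raw data in {raw_dir}")
```

With this layout, the same code can run on the server (where the variable points at the big raw data) and on a laptop (where it points at the crunched-down copy) without edits.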

:::info

How do you collaborate on writing academic papers?

Are you using version control for academic papers?

How do you handle collaborative issues e.g. conflicting changes?

:::

  1. For git for papers: is it that code always has to be consistent in order to compile and run, so asynchronous work and pull requests are OK, whereas a paper text doesn't, so synchronous editing is OK (as long as someone checks it at the end)?

    • Yes, with synchronous editing tools (e.g. Google Docs, Overleaf) you might not want to take many snapshots, but when you reach a new version maybe you want to git commit it. (Disclaimer: I do not like to use git for papers.)
  2. I find it easy to send it to my supervisor, and he can "check" it and I get back the checked version. :+1:

    • Yes, most of the time this is "good enough" and "good enough" is the best practice :)

Recording computational steps

https://coderefinery.github.io/reproducible-research/workflow-management/

  1. Is there an alternative to snakemake for R code?

    • I am not sure, but I like to create R scripts which work as command line scripts, so with that in mind you can run R scripts within snakemake.
      • I did a quick search and it seems that this is what they did here https://github.com/fritzbayer/snakemake-with-R
    • Here is a reproducible workflow for Snakemake and R: https://github.com/lachlandeer/snakemake-econ-r
  2. How is this different from a loop? How does Snakemake help reproducibility when files are added later?

    • A loop in a script is always executed, regardless of whether the results are up to date or not. With Snakemake, only the steps necessary to update the results are run. If you add more data files, like more books in the current example, Snakemake will compute the statistics for the added files and update the result files.
      • I'm sorry, but I still don't get it. Does it make the workflow easier by only executing the command on the newly added files, or does it somehow help make the code reproducible by running it on a state that existed at some point in the past, say September 17 last year, and reproducing that?
        • Snakemake creates a dependency graph. It knows which original files a result file depends on. When a result file is out of date, that is, older than the original file, Snakemake will remake the result by executing the steps needed to update it. If all result files are newer than the original files, no steps are taken. You cannot get back to a state as of a given date. In that case you will need to use version control: check out the version from a certain date and run Snakemake to reach that state.
  3. How does Snakemake make the parallelization happen? Does it require the user to set some parameters?

    • The -j parameter sets the number of workers (= how many CPUs you can use). On HPC systems there is integration between Snakemake and Slurm (Slurm is the tool that manages the queue on HPC systems). <- That's amazing, thanks!
  4. When should I start using Snakemake? At the beginning of the project, during the development of the project, or at the end?

    • Not at the end, I would say. If you know that the project is going to repeat the same pipeline for many different types of inputs, then Snakemake or other tools for repeating the same steps with different inputs are a good fit.
  5. Is there an alternative to snakemake for matlab?

    • Matlab functions/scripts can be run in a non-graphical way, so, similarly to the R question above, you can turn your Matlab steps into command line scripts and then include those in Snakemake steps. I am unsure if there is already an example of this way of doing things. Also: there might be some Matlab toolbox that would not require you to go to the command line.
  6. On my machine at least, the Twitch window shows an interesting phenomenon. The image keeps getting out of focus to the point of the text becoming too blurred to read, which then corrects itself, and then it repeats.

    • It is the bandwidth optimisation (tries to save your internet data). In the settings instead of "auto" set the resolution to a fixed value. Sorry we forgot to mention this at the beginning.
  7. Is it possible to get a page width view because the text gets quite small in the full page view?

    • do you mean popout view? (options wheel - popout player)
    • I don't see what you mean ... The twitch stream is in the browser window and the text page takes all the length but only 40% of the space centered horizontally. This way the text size gets very small.
  8. Can you give another example of when you want to use Snakemake -- do you always need to break your script into several smaller ones, e.g. tables, plotting, etc., or can you provide another use case?

    • I think it is up to you. E.g. when working with medical imaging data, you have one dataset per subject, and for each subject I might want to do preprocessing, quality control plots, etc. It can all be in a single script that has calls to other scripts that snakemake does not "see".
  9. What about simpler tools such as a scikit-learn pipeline? Is that useful in some situations?

    • I am not familiar with that, but sure why not? There are many tools and hopefully you can find which ones are useful for your case / popular in your field. Take this day as an opportunity for exploring these. :)
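
The up-to-date rule described in the answer to question 2 can be illustrated with a short Python sketch. It is not Snakemake's implementation (Snakemake builds a full dependency graph and may consider more than timestamps); the function name `needs_update` and the file names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def needs_update(output, inputs):
    """Return True if `output` is missing or older than any of its inputs."""
    output = Path(output)
    if not output.exists():
        return True
    out_mtime = output.stat().st_mtime
    return any(Path(p).stat().st_mtime > out_mtime for p in inputs)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        book = Path(tmp) / "book.txt"   # hypothetical input file
        stats = Path(tmp) / "book.dat"  # hypothetical result file
        book.write_text("some text")
        print(needs_update(stats, [book]))  # result is missing

        stats.write_text("word counts")
        os.utime(book, (1_000, 1_000))   # input is older than the result
        os.utime(stats, (2_000, 2_000))
        print(needs_update(stats, [book]))

        os.utime(book, (3_000, 3_000))   # input was edited after the result
        print(needs_update(stats, [book]))
```

Run as a script, this prints `True`, `False`, `True`: the step is run when the result is missing or stale, and skipped otherwise. A plain loop has no such check and redoes every step each time.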

Recording dependencies

https://coderefinery.github.io/reproducible-research/dependencies/

  10. I don't have much experience with conda or renv, but I have tried them. My first thought was: oh, this will fill up my hard drive very quickly because I will have to install and re-install all these packages again and again for every single project. It also took ages to install all the packages in the renv. I stopped at that point. What is your experience with that?

    • I have experienced that installing packages can take time, but that was on a cluster. On my local computer both conda and pip have been fast.
    • I think it is the tradeoff between keeping projects separated with different software dependencies vs saving disk space and just use a "base" installation for all projects. Ideally we only work on one project at a time, but yeah we know it's never like that. :)

:::info Dependencies-1: Time-capsule of dependencies https://coderefinery.github.io/reproducible-research/dependencies/#demo

A:

B:

C:

D:

E:

:::

  1. How about the dependencies (and their versions) of the packages listed on the environments above? Should they be listed as well?

    • What are you referring to exactly?
    • The environment file that just got shown has a bunch of other packages, not just python, seaborn and the two others. Should we list the full set of dependencies? (I think this just got answered.) :+1:
  2. I didn't understand the issue with the master branch of git

    • In the example we assume we are looking at/using a repository which is 3 years old. It is very likely that the master branch has moved ahead in these 3 years. Hence, the environment depends on a moving git branch and is no longer static or equal to what it was 3 years ago.
  3. Could you go through an example with R? (not using conda)

    • I don't think we have time for that but I will highlight this to the instructors.
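
Regarding question 1 above: the exact versions of everything installed in an environment, including indirect dependencies, can be recorded from within Python. The sketch below uses the standard library module `importlib.metadata`; the function name `frozen_requirements` and the file names in the comment are hypothetical, and in practice tools such as `pip freeze` or `conda env export` do this job.

```python
from importlib.metadata import distributions


def frozen_requirements():
    """Return sorted 'name==version' lines for every installed distribution."""
    lines = set()
    for dist in distributions():
        meta = dist.metadata
        if meta is None:
            continue  # skip installs without readable metadata
        name = meta.get("Name")
        version = meta.get("Version")
        if name and version:
            lines.add(f"{name}=={version}")
    return sorted(lines, key=str.lower)


if __name__ == "__main__":
    # Redirect to a file to keep a snapshot next to the code, e.g.
    #   python freeze.py > requirements-frozen.txt
    print("\n".join(frozen_requirements()))
```

Such a snapshot complements, rather than replaces, the short hand-written environment file: the short file states what the project needs, the snapshot states what was actually installed at the time.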

Recording environments

https://coderefinery.github.io/reproducible-research/environments/

  1. Does using containers mean higher HDD and/or RAM requirements on the hardware side?

    • This really depends on the size of the container you are running. A quick search on Docker shows that the minimum requirement is 512 MB, but Go or Java applications would require much more (4x, at least).
  2. While Docker/containers can partially address the requirement of a snapshot of a machine, Docker, containers, or jails will not capture a machine, because they do not capture the OS kernel. Docker only captures the userspace libraries when you specify the OS version (ubuntu:20.04, etc.). However, in the end you will be using the underlying OS kernel, which might be different from the one used when building the container image. If you need a snapshot of the machine, then the closest you can get is by creating a VM image.

  3. What if we do not have sudo? We need it for running apt-get, right?

    • Docker can be configured to not need superuser. It is tricky, and might not work in all cases. [https://docs.docker.com/engine/security/rootless/]. Note the root user in the container is different from the root on the host machine.
      • That clashes with what the instructors just said?! Now I am confused.
      • Doesn't that just make a group of Docker users, while the Docker daemon still has root rights, i.e. you can still run stuff as root?
  4. If you do not have access to the dockerfile that produced a container, how easy is it to probe a container and figure out either how it was created / what it does / potential for malicious activity?

    • Good question. I don't think there is an easy way to probe how the container was created. If you believe there is a potential for malicious activity, you should not use an undocumented container.
    • Docker does not keep a log of the activities / commands executed in the container, so you cannot accurately know how the image was created.
    • There are tools to explore a Docker container to see what's inside, but it can become difficult, and it is not always possible to extract the Dockerfile.
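
The point made in item 2 above, that a container brings its own userspace but shares the host kernel, can be observed with a short Python sketch. Running it on the host and again inside a container on the same machine would typically report the same kernel release but a different distribution. The function name `describe_environment` is hypothetical, and `/etc/os-release` is assumed to exist only on most modern Linux distributions, so the sketch falls back to `None` elsewhere.

```python
import platform
from pathlib import Path


def describe_environment(os_release_path="/etc/os-release"):
    """Report the kernel release and the userspace distribution name.

    Inside a container the kernel release comes from the host, while the
    distribution name comes from the container image.
    """
    distro = None
    path = Path(os_release_path)
    if path.is_file():
        for line in path.read_text().splitlines():
            if line.startswith("PRETTY_NAME="):
                distro = line.split("=", 1)[1].strip().strip('"')
                break
    return {"kernel": platform.release(), "userspace": distro}


if __name__ == "__main__":
    print(describe_environment())
```

Storing this output alongside results is a cheap way to note which kernel a container actually ran on, since the image itself does not record it.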

Where to go from here?

https://coderefinery.github.io/reproducible-research/where-to-go/

  1. Is it common to have some document describing the overall project features, like milestones and/or a data dictionary? Something like a central document?!

    • We will touch upon this when we discuss documentation (tomorrow). The short answer is that this is something that can go into a README file in the git repository, on a static web page, or on Read the Docs. It really depends on how much effort you put into the documentation.
  2. The Future-you point is really important - better to know that in advance rather than discover it from scratch, as it were. Future you is ignorant, and past-you was a jerk who didn't bother explaining anything :+1:

  3. For future you: just want to share this meme image that really hits home! Meme

Social Coding and open software

https://coderefinery.github.io/social-coding/

Social Coding

https://coderefinery.github.io/social-coding/social-coding/

  1. Is the stream still here: https://player.twitch.tv/?channel=coderefinery&enableExtensions=true&muted=false&parent=twitch.tv&player=popout&quality=auto&volume=0.5 ? no one there yet?
    • It should be here: https://www.twitch.tv/coderefinery
      • thanks!

:::info

Question 1: Why would I want to share my scripts/code/data?

Choose many. Vote by adding an o character:

Question 2: The most concerning thing for me, if I share my software now

Choose one. Vote by adding an o character:

Question 3: Why is software often treated differently from papers?

Free-form answers:

Question 4: When you find a repository with code/library you would like to reuse, what are the things you look at to decide whether you use it?

Free-form answers:

:::

Software licensing

https://coderefinery.github.io/social-coding/software-licensing/

  1. What is the difference between conda and docker? (I ask specifically b/c of environments)

    • Conda is a way to install software packages (Python/R/C etc.) and to make a self-contained directory that you can activate to use that software. It still basically runs on your computer.
    • Docker is a tool that makes "containers", which can have software installed, but it's like it contains the whole operating system. So it's more strict, a bit more complex to make, and more portable.
  2. Why would I use docker v. kubernetes? Can you explain a bit more what bootstrap is?

    • Docker is one program to make and run containers. There are others like it. Whatever you use, it is good for running single things. It can be practical for running one-off software things.
    • Kubernetes is a whole system that manages running many containers and services. It's mainly used for things like running web servers and other persistent services (but could be used for much more).
  3. When you license your code, does it belong to your employer or you? how does this work with patents v. licenses? :+1:

    • It depends on your job contract and your employer's policy.
    • --> follow up: so if the data belongs to the employer, does the code also belong to the employer? What happens when you change jobs and you want to showcase your work then?
      • If it is just for showcasing your work you are probably fine, but if it is for using your code you are dependent upon your agreement with your employer. Even if the data belongs to the employer, this does not necessarily carry over to the code. You could still be able to take the code along with you, but it depends on the agreements governing your work.
    • --> what type of license would you need for this use-case?
      • The MIT License gives you a lot of freedom to do whatever you want with code you write.
  4. Where should you put instructions for using the code -- in the README.md file?

    • Yes, usually setup/install and how to use are part of the README of your project. It is a very important part too.
  5. Following up on ChatGPT, how do you cite it if you used it as part of your code for ideas?

    • You would become the owner of the code. However, whether to mention that you used GenAI to create the code depends on the use case. Personally I don't mention it unless I'm publishing in a journal.
    • --> how do you state this, on GitHub after you publish? in the README or elsewhere? Is there an EU Commission or somewhere else that we can follow for guidance on this?
      • I noticed some repositories which mention this in their README. However this is still a new concept and there is no mature/standard way yet.
    • --> is GitHub Copilot considered 'better' than OpenAI for code?
      • Not sure if it is better or not... Personally I use OpenAI and Claude.

:::info

Question 5: Which of these are derivative works?

Choose many. Vote by adding an o character:

:::

  1. I like the CC licenses because the content in the licenses is easy enough for the average user to roughly understand. However, I haven't seen any mention of warranty or liability in the CC license, unlike in the GPL3 licenses. Can I be held liable for something if I have a bug in my CC-licensed code?

    • No, you are not responsible for any possible bugs/mistakes.
    • If you are talking about software, CC licenses aren't recommended for software: https://creativecommons.org/faq/#can-i-apply-a-creative-commons-license-to-software . There are only a few major software licenses you need to know that are roughly the equivalent of each of the CC levels.
      • Thanks!
  2. I asked ChatGPT for an algorithm and it found something, which I used. However, I am sure that the algorithm exists somewhere else. Do you have a hint on how to find the original publisher? ChatGPT just gives references to some general algorithm books.

    • You can try to search part of the algorithm, or the algorithm name.
    • For clarification: An algorithm itself can't be copyrighted. The code of that algorithm could be.
    • Thank you both!
  3. What about AFL (Academic Free License)? Allowed to use, but no changes are allowed. Reduces the workflow for the original author...

    • "others can't modify" sounds good but is a trap: would someone want to reuse and improve it, if it is "dead" and your improvements can't be re-shared? Most people wouldn't recommend this :+1:
    • "can't modify" isn't considered "free software" and most projects will stay far away from it
    • Note that an author never has any responsibility for doing anything. If you are the only one who can modify it, wouldn't that make more effort for you, since you have to fix it? If others can fix and re-share, then you don't have to feel bad about not doing anything else.
  4. How does naming the tool work, both pre- and post- publication? Is there licensing for the name and/or copyright?

    • "trademark" is the thing that covers protection of a name or identity. Copyright isn't relevant for a name. --> how do you establish a trademark, pre- and post- publication? See question 8 below
  5. Does the tool belong to the University, or to you as the developer? Or is it shared across both?

    • As an instructor said, "it depends": usually your employer would own it, but at least in Finland it can be more complex depending on the funding source. Always check the rules at your own place.
  6. Should you wait to post anything to a public GitHub until the tool is published?

    • What do you mean with tool getting published?
    • It really depends on your interests: "I release when my publication is ready" or "when it's done" are valid, but also "I'm being radically open to get more interest".
    • But things are rarely "done" so release can get delayed more and more... and if you don't plan for release from the start there might be private and public stuff mixed together and it gets harder and harder.
  7. Yeah, licenses considered "open" can't be closed retroactively (that applies to old versions; new versions can be changed, if the authors still hold the full rights). It's one of the standard tests for free software licenses.

    • The author of the code can release the code under several licenses. So, I can release version 1 as GPL, and version 2 as (C)
    • Yes. This kind of GPL for the open version, and closed license sold otherwise, that is often done. But you can just release under the GPL and if anyone wants more than the GPL offers, they contact you and negotiate another license.
  8. How do you establish a trademark, pre- and post- publication?

    • Check with lawyers unfortunately. It's a thing you need to register. (Most academic software doesn't do this. Unless it's truly big or commercialized, it's not really relevant)
    • --> where do you go to register?
    • A quick search finds this: https://www.euipo.europa.eu/en/trade-marks - I haven't used this, I don't know if it's relevant, maybe it's per-country?
  9. Can I add the EUPL license on GitHub? I cannot find the option there.

    • I haven't checked but I believe you, they have relatively few automatically pickable. You can always copy the text and put it in the LICENSE file yourself

:::info Break until xx:02 :::

  1. Here are the three "free software" tests used in the Debian Linux distribution. I think they have some good lessons.

    • "desert island test": can you use and modify it and share with others on your island if you can't communicate with anyone? "send improvements to me" or "send me an email if you share it" don't work. This would also mean you can't cite it, for example.
    • "dissident test": if you are hiding/don't want to communicate and share your actions, can you still use it and share with your friends? This is useful even in academia because do you want to have to publish just to meet some criteria?
    • "tentacles of evil test": if a big corporation bought it from the author, can they retroactively stop usage, make you liable for something, etc.? This is very useful for us anyway, we wouldn't want to use something with an unclear future.
  2. What if I want to publish some code that is "just" to do some data exploration, analysis and plotting (as an addendum to a paper, for example)? Is that by definition also "software"? How would you license it?

    • We can debate what the definition of "software" is, but in practice everything we are talking about equally applies to this case.
    • I'd throw it somewhere, give it a basic readme and license, and not worry too much.
      • what is a basic license?
      • "basic" in that sentence doesn't mean too much, but what I would do: MIT if it's "small, anyone can use for any purpose" or GPL if "this is somewhat valuable to me and gives me an advantage, I don't want others making closed versions that I can't access." These are basically my go-tos that don't require any thought but cover the main two levels. (roughly like CC-BY and CC-BY-SA, for the comment about CC licenses above)
        • thank you! very helpful

Software citation

https://coderefinery.github.io/social-coding/software-citation/

Sharing data

https://coderefinery.github.io/social-coding/sharing-data/

  1. I have a question regarding github/-lab etc. accounts. When I get employed in a new job and my new employer pays for a professional type of account - how do I add or connect this to my existing account? And how do I disconnect again after the contract ends?

    • Some companies don't accept the use of personal repositories. Of course it depends on the company's policy and on what project you are working on. In the end, companies that accept a personal account can add you to an organization during your employment and then remove you after your contract ends.
      • So maybe I have to get a new account when I start with a certain employer in order to get the "professional" flavour they pay for? How do I link this (and possibly multiple such accounts) with my private account that hopefully stays?
        • My experience was that they pay for the account and manage the account too. For example, all the accounts are created using the company's email address to make sure that after your contract ends, you don't have access to the repositories anymore.
        • In the end, private and confidential work is understandable. If you work on something confidential/private with your company account, you can explain it to the next employer, even if you don't have the code publicly available.
        • Counter question: Why do you want to link the two? If it's about showing the work you have done, you can indicate the other account, but otherwise?
          • I was mainly thinking about how I could benefit from the fact that the company pays for a professional account that has more options. I realized I don't see / cannot use many of the options in my private GitHub account that we worked with in the workshop last week.
  2. If you publish a methods paper where you develop a code, you make the code available and citable, and you then write the "software-paper", wouldn't it divide your citations?

    • Possibly. But it also depends on how central your methodology/code is to your research paper. Your research paper might e.g. not get any traction from the computational side, since they are not interested in your specific topic. But if the algorithm/code is more general than your field, you could get citations from places where you would otherwise not have been seen.

Feedback

:::info News for day 4 and prep for day 5:

Today was:

One good thing about today:

One thing to improve for next time:

Any other feedback?

For those in groups, how many people are you watching with:

:::


Funding

CodeRefinery is a project within the Nordic e-Infrastructure Collaboration (NeIC). NeIC is an organisational unit under NordForsk.

Contact

support@coderefinery.org