Big Data Science

Description
The Big Data Science channel gathers interesting facts about Data Science.
For cooperation: [email protected]
https://t.me/bds_job - channel about Data Science jobs and careers
https://t.me/bdscience_ru - Big Data Science [RU]

2 months ago

Top DS events around the world in August
Aug 2-4 - MLMI 2024 - Osaka, Japan - https://mlmi.net/
Aug 3-9 - International Joint Conference on Artificial Intelligence (IJCAI) - Jeju, South Korea - https://ijcai24.org/
Aug 5-6 - ICASAM 2024 - Vancouver, Canada - https://waset.org/applied-statistics-analysis-and-modeling-conference-in-august-2024-in-vancouver
Aug 7-8 - CDAO Chicago - Chicago, United States - https://da-metro-chicago.coriniumintelligence.com/
Aug 12-14 - AI4 2024 - Las Vegas, United States - https://ai4.io/vegas/
Aug 16-17 - Machine Learning for Healthcare 2024 - Toronto, Canada - https://www.mlforhc.org/
Aug 19-20 - Artificial Intelligence and Machine Learning - Toronto, Canada - https://www.scitechseries.com/artificial-intelligence-machine
Aug 19-22 - The Bioprocessing Summit - Boston, United States - https://www.bioprocessingsummit.com/
Aug 25-29 - ACM KDD 2024 - Barcelona, Spain - https://kdd2024.kdd.org/
Aug 27 - Azure AI Summer Jam -
Aug 27-29 - ITCN Asia 25th - Karachi, Pakistan - https://itcnasia.com/karachi/
Aug 31 - DATA SATURDAY #52 - Oslo, Norway - https://datasaturdays.com/Event/20240831-datasaturday0052

2 months, 1 week ago

*Datasets used to build various ML bases*

iPhone dataset - a set of datasets on which Shape of Motion was used to build more than 40 thousand dynamic Gaussians, more than 100 thousand static Gaussians, and 20 SE(3) bases.

Training on a single A100 GPU with the Adam optimizer at a resolution of 960x720 took just over 2 hours, with rendering at 40 frames per second.

In tests during training, Shape of Motion showed good quality and consistency of scene reconstruction.
However, the method still has to be optimized separately for each scene and cannot handle significant changes in camera angle. It also depends critically on precise camera parameters and on user input to create a mask of the moving object.

GitHub

GitHub - vye16/shape-of-motion

Contribute to vye16/shape-of-motion development by creating an account on GitHub.

2 months, 1 week ago

*Benchmark for comprehensive assessment of LLM logical reasoning*

ZebraLogic is a benchmark based on logic puzzles: a set of 1000 program-generated tasks of varying difficulty, with grid sizes from 2x2 to 6x6.

Each puzzle consists of N houses (numbered from left to right) and M features for each house. The task is to determine the unique assignment of feature values to the houses based on the provided clues.
Language models are given one example of a puzzle solution with a detailed explanation of the reasoning process and the answer in JSON format. Models must then solve a new puzzle, providing both the reasoning process and the final solution in the given format.

Evaluation Metrics:
1. Puzzle-level accuracy (percentage of completely correctly solved puzzles).
2. Cell-level accuracy (percentage of correctly completed cells in the solution matrix).
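
The two metrics above can be sketched in a few lines of Python. This is only an illustration, not the ZeroEval implementation: the `Solution` layout (house name mapped to feature values), the function names, and the toy 2x2 puzzle are all assumptions, and cell-level accuracy is computed here over all cells of all puzzles together.

```python
from typing import Dict, List, Tuple

# Assumed layout: {"House 1": {"Name": "Alice", "Pet": "cat"}, ...}
Solution = Dict[str, Dict[str, str]]


def count_cells(predicted: Solution, truth: Solution) -> Tuple[int, int]:
    """Return (correct cells, total cells) for one puzzle."""
    correct = 0
    total = 0
    for house, features in truth.items():
        for feature, value in features.items():
            total += 1
            # A missing house or feature in the prediction counts as wrong.
            if predicted.get(house, {}).get(feature) == value:
                correct += 1
    return correct, total


def evaluate(pairs: List[Tuple[Solution, Solution]]) -> Dict[str, float]:
    """Compute puzzle-level and cell-level accuracy over (predicted, truth) pairs."""
    solved = 0
    correct_cells = 0
    total_cells = 0
    for predicted, truth in pairs:
        correct, total = count_cells(predicted, truth)
        correct_cells += correct
        total_cells += total
        if correct == total:  # solved only if every cell matches
            solved += 1
    return {
        "puzzle_accuracy": solved / len(pairs),
        "cell_accuracy": correct_cells / total_cells,
    }


# Hypothetical 2x2 puzzle: two houses, two features each.
truth = {
    "House 1": {"Name": "Alice", "Pet": "cat"},
    "House 2": {"Name": "Bob", "Pet": "dog"},
}
# Names are right, pets are swapped: 2 of 4 cells correct.
swapped_pets = {
    "House 1": {"Name": "Alice", "Pet": "dog"},
    "House 2": {"Name": "Bob", "Pet": "cat"},
}

metrics = evaluate([(truth, truth), (swapped_pets, truth)])
print(metrics)
```

One puzzle is fully solved and the other is half right, so the puzzle-level score is much stricter than the cell-level one.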

Project Page
Dataset

Running ZebraLogic locally as part of the ZeroEval framework:

```
# Install via conda

conda create -n zeroeval python=3.10
conda activate zeroeval

# pip install vllm -U # pip install -e vllm

pip install vllm==0.5.1
pip install -r requirements.txt
# export HF_HOME=/path/to/your/custom/cache_dir/

# Run Meta-Llama-3-8B-Instruct locally, with greedy decoding on zebra-grid
bash zero_eval_local.sh -d zebra-grid -m meta-llama/Meta-Llama-3-8B-Instruct -p Meta-Llama-3-8B-Instruct -s 4
```

GitHub

GitHub - yuchenlin/ZeroEval: A simple unified framework for evaluating LLMs

A simple unified framework for evaluating LLMs. Contribute to yuchenlin/ZeroEval development by creating an account on GitHub.

4 months, 3 weeks ago

*Selection of vector databases*
Vector databases are a special type of database designed to organize data by similarity. To do this, raw data such as images, text, video, or audio is transformed into mathematical representations known as multidimensional vectors (embeddings). Each vector can have from tens to thousands of dimensions, depending on the complexity of the source data. Current vector databases include:
Chroma - an open source vector database designed to give developers and organizations of all sizes the resources they need to build Large Language Model (LLM) based applications. It offers a scalable and efficient way to store, search, and retrieve multidimensional vectors. One of the reasons for Chroma's popularity is its flexibility.
Pinecone - a cloud-based managed vector database. Its broad support for high-dimensional vectors makes Pinecone suitable for a variety of use cases, including similarity search, recommender systems, personalization, and semantic search. It also supports single-stage filtering. Its ability to analyze data in real time makes it a good fit for threat detection and for monitoring cybersecurity attacks.
Weaviate - a notable feature of this database is that it can store both vectors and objects. This makes it suitable for applications that combine several search methods, such as vector search and keyword search.
Milvus - uses modern indexing algorithms to speed up search, so similar vectors can be found quickly even in large amounts of data.
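
What "organizing data by similarity" means can be shown with a brute-force search in NumPy. This is a minimal sketch, not the API of any database listed above: the function name `cosine_top_k` and the toy 3-dimensional vectors are assumptions, and real vector databases typically replace the full scan with approximate nearest-neighbor indexes.

```python
from typing import List, Tuple

import numpy as np


def cosine_top_k(query: np.ndarray, vectors: np.ndarray, k: int = 2) -> List[Tuple[int, float]]:
    """Return (row index, cosine similarity) for the k stored vectors closest to the query."""
    # After normalization the dot product equals the cosine similarity.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q
    # Sort by descending similarity and keep the first k rows.
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]


# Toy 3-dimensional "embeddings"; real ones have tens to thousands of dimensions.
stored = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0])

results = cosine_top_k(query, stored, k=2)
for index, score in results:
    print(index, round(score, 4))
```

The query points almost along the first axis, so rows 0 and 1 come back as the nearest neighbors, in that order.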

Trychroma

the AI-native open-source embedding database

4 months, 3 weeks ago

*⚖️Apache Superset: advantages and disadvantages*
Apache Superset is an open source data visualization tool that provides a rich set of capabilities for analyzing data and creating interactive dashboards.
Benefits of Apache Superset:
1. Open Source: Apache Superset is developed and maintained by the community, which provides a high degree of flexibility and extensibility to suit different needs.
2. Powerful Data Visualization: Superset offers a wide selection of graphs, charts and visuals, allowing users to create colorful and informative dashboards for data analysis.
3. Interactive capabilities: Users can easily interact with dashboards, apply filters, change parameters and drill down/expand data to gain a deeper understanding of the information.
4. Integration with various data sources: Superset supports multiple data sources including databases, data warehouses, Apache Druid and many more, making it a versatile tool for working with data from various sources.
5. Scalability and Performance: Thanks to its architecture and the use of technologies such as Apache Druid, Superset is able to efficiently process large amounts of data and provide high performance when working with dashboards.
Disadvantages of Apache Superset:
1. Difficulty in Setup: Although Superset provides extensive capabilities, its setup and configuration can be complex, especially for beginners, requiring a certain level of technical expertise.
2. Insufficient Documentation: Some users have noted that Superset's documentation is not always detailed or up-to-date enough, which can make it difficult to learn and use the tool.
Overall, Apache Superset is a powerful open source data visualization tool with advantages such as flexibility, scalability, and rich visualization capabilities. Before adopting it, however, weigh the disadvantages: setup can be complex and the documentation has gaps.

superset.apache.org

Welcome | Superset

Community website for Apache Superset™, a data visualization and data exploration platform

4 months, 4 weeks ago

*Extract data with Quivr*
Quivr is an open-source service that extracts information from local files (PDF, CSV, Excel, Word, audio, video, etc.).
Quivr can work offline, so you can access your data anytime, anywhere.
Quivr is also compatible with Ubuntu 22 or later.
The open source code can be obtained from this link

Quivr

Quivr - Open source chat-powered second brains

Your GenAI Second Brain 🧠 A personal productivity assistant (RAG) ⚡️🤖 Chat with your docs (PDF, CSV, ...) & apps using Langchain, GPT 3.5 / 4 turbo, Private, Anthropic, VertexAI, Ollama, LLMs, that you can share with users ! Local & Private alternative to…

6 months, 3 weeks ago

*⚔️Sensei will tell you*
Sensei is a relatively new Python tool for generating synthetic data using services such as OpenAI, MistralAI or AnthropicAI.
To get started, install the dependencies:
pip install openai mistralai numpy
The developers also wrote detailed setup instructions.

GitHub

GitHub - migtissera/Sensei: Generate Synthetic Data Using OpenAI, MistralAI or AnthropicAI

Generate Synthetic Data Using OpenAI, MistralAI or AnthropicAI - migtissera/Sensei

6 months, 3 weeks ago

*Generic set of annotated images*
The ImageNet dataset includes 14,197,122 annotated images structured according to the WordNet hierarchy.
Since 2010, this dataset has been used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and serves as a standard benchmark for image classification and object detection tasks.
This large public dataset contains images that have been manually annotated for training purposes.

Kaggle

ImageNet Object Localization Challenge

Identify the objects in images

7 months ago

*Delta Lake: advantages and disadvantages*
Delta Lake is an open-source storage layer that sits on top of data lake storage. It adds data integrity guarantees and extra capabilities for storing and processing large volumes of data.
Delta Lake benefits:
1. Transactional Consistency: Delta Lake provides ACID transactions, ensuring transactional data consistency. This ensures reliable operations and data integrity management.
2. Partitioning: Delta Lake supports data partitioning, which improves query performance and data management. Partitioning allows you to effectively filter data based on certain criteria.
3. Improved Performance: Delta Lake optimizes queries and operations on data, which can improve performance compared to working with plain data lake files.
4. Streaming Data Processing: Delta Lake supports streaming data processing, allowing you to instantly update and analyze data in real time.
Disadvantages of Delta Lake:
1. Difficulty in Setup: Some users may find it difficult to set up and use Delta Lake due to its advanced functionality.
2. Compatibility: Compatibility issues may arise when integrating Delta Lake with other tools and storage systems.
Overall, Delta Lake provides powerful tools for data management and processing, but its use should be considered based on the specific project requirements and team experience.
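
The partitioning benefit listed above comes down to skipping files that cannot match a filter. Below is a minimal pure-Python sketch of that idea, not Delta Lake's API: the helper names and the `key=value` directory paths are hypothetical, chosen only to show how a filter on partition columns narrows the set of files to read.

```python
from typing import Dict, List


def partition_values(path: str) -> Dict[str, str]:
    """Parse key=value directory segments from a partitioned file path."""
    values: Dict[str, str] = {}
    for segment in path.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            key, value = segment.split("=", 1)
            values[key] = value
    return values


def prune(paths: List[str], **wanted: str) -> List[str]:
    """Keep only the files whose partition values match every requested filter."""
    return [
        p for p in paths
        if all(partition_values(p).get(k) == v for k, v in wanted.items())
    ]


# Hypothetical table partitioned by country and date.
files = [
    "events/country=NO/date=2024-08-30/part-0000.parquet",
    "events/country=NO/date=2024-08-31/part-0000.parquet",
    "events/country=ES/date=2024-08-31/part-0000.parquet",
]

selected = prune(files, country="NO", date="2024-08-31")
print(selected)
```

Only one of the three files has to be read for this filter; the other two are excluded from their paths alone, without opening them.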

delta.io

Home

8 months, 3 weeks ago

*Data programming is no longer a problem*
Snorkel is a framework for data programming. Its approach is to use various heuristics and prior knowledge to label datasets automatically. The project started at Stanford as a tool to help label datasets for information extraction tasks, and the developers are now building a platform for external customers.
Snorkel's arsenal includes three key tools:
- labeling functions for creating a dataset;
- transformation functions for dataset augmentation;
- slicing functions that highlight subsets of the dataset that are critical for model performance.
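
The first tool in the list, labeling functions, can be illustrated without installing anything. This sketch uses plain Python rather than Snorkel's own API, and a simple majority vote stands in for Snorkel's label model. The heuristics, label constants, and sample texts are all made up for the example.

```python
from collections import Counter
from typing import Callable, List

ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_link(text: str) -> int:
    """Heuristic: messages with links are likely spam."""
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text: str) -> int:
    """Heuristic: prize offers are likely spam."""
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_message(text: str) -> int:
    """Heuristic: very short messages are likely ordinary chat."""
    return HAM if len(text.split()) <= 4 else ABSTAIN


LABELING_FUNCTIONS: List[Callable[[str], int]] = [
    lf_contains_link,
    lf_mentions_prize,
    lf_short_message,
]


def majority_label(text: str, lfs: List[Callable[[str], int]]) -> int:
    """Combine the non-abstaining votes; abstain when there are none or on a tie."""
    votes = [vote for vote in (lf(text) for lf in lfs) if vote != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]


texts = [
    "Claim your prize at http://example.com today",
    "See you at lunch",
    "The quarterly report is attached for your review",
]
labels = [majority_label(t, LABELING_FUNCTIONS) for t in texts]
print(labels)
```

Each heuristic may abstain, so a text that none of them recognizes stays unlabeled instead of getting a guessed label.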
